Write a RegEx to Match Open Tags Except XHTML Self-Contained Tags

Matching open HTML tags while excluding XHTML self-closing tags is a common task when processing HTML with regular expressions. In this post, we'll walk through how to construct a regex that matches open tags (like <div>) but ignores self-contained tags (like <img /> or <br />).

Understanding the Problem

HTML open tags typically have the following structure:

<tagname [attributes]>

Self-closing tags in XHTML include a / before the closing >, like this:

<tagname [attributes] />

The goal is to match only open tags that don't have the self-closing slash.

The Regex Solution

Here’s the regex pattern you can use:

<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(?<!/)>

Explanation of the Regex

<([a-zA-Z][a-zA-Z0-9]*): Matches the opening < and captures the tag name. The tag name must start with a letter, followed by optional alphanumeric characters.
\b: Requires a word boundary right after the tag name, so the name must be followed by a non-word character such as a space, / or >.
[^>]*?: Matches, non-greedily, any characters other than > between the tag name and the closing bracket (attributes and whitespace).
(?<!/): Negative lookbehind to ensure there isn’t a / before the closing >, filtering out self-closing tags.
>: Matches the closing angle bracket.
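
To make the breakdown above easier to check, here is a minimal sketch of the same pattern written with Python's re.VERBOSE flag, so each piece carries its own comment. The names OPEN_TAG, samples and results are only illustrative and do not appear elsewhere in this post.

```python
import re

# The same pattern as above, spread out with re.VERBOSE so each part is commented.
OPEN_TAG = re.compile(
    r"""
    <                        # opening angle bracket
    ([a-zA-Z][a-zA-Z0-9]*)   # group 1: tag name (a letter, then letters or digits)
    \b                       # the tag name must end at a word boundary
    [^>]*?                   # attributes and whitespace, lazily, never crossing a closing bracket
    (?<!/)                   # the character just before the closing bracket must not be a slash
    >                        # closing angle bracket
    """,
    re.VERBOSE,
)

# Illustrative inputs: two open tags, two self-closing tags, one closing tag.
samples = ['<div class="example">', "<p>", "<br />", "<img src='a.png'/>", "</div>"]

# fullmatch() checks whether the whole string is a single open tag.
results = {s: bool(OPEN_TAG.fullmatch(s)) for s in samples}
for s, ok in results.items():
    print(f"{s!r}: {ok}")
```

Only the first two samples should come back as True; the self-closing tags are rejected by the lookbehind, and the closing tag fails because a letter must follow the <.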

Example Usage

Consider the following HTML snippet:

<div class="example">
<img src="image.png" />
<p>Hello, World!</p>
<br />
</div>

Applying the regex will match:

<div class="example">
<p>

Code Example

Here’s a simple Python script to extract matching tags:

import re

# Input HTML
html = """
<div class="example">
<img src="image.png" />
<p>Hello, World!</p>
<br />
</div>
"""

# Regular expression
# Note the single backslash: in a raw string, \\b would match a literal backslash followed by "b"
pattern = r'<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(?<!/)>'

# Find all matches
matches = re.findall(pattern, html)

# Output matched tags
print(matches)

Output:

['div', 'p']
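
Because the pattern contains exactly one capturing group, re.findall returns only the tag names. If you also want the full text of each tag, one option is re.finditer, sketched below with the same sample HTML. The variable name found is just for illustration.

```python
import re

html = """
<div class="example">
<img src="image.png" />
<p>Hello, World!</p>
<br />
</div>
"""

pattern = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(?<!/)>')

# group(1) is the captured tag name, group(0) is the whole matched tag
found = [(m.group(1), m.group(0)) for m in pattern.finditer(html)]
for name, tag in found:
    print(f"{name}: {tag}")
```

Each match object also exposes m.start() and m.end() if you need the position of the tag in the input.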

When to Use a Parser Instead

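While regex works well for simple and predictable HTML, it has no notion of document structure: this pattern will, for example, match a tag that sits inside an HTML comment, and it stops at the first > even when that character appears inside a quoted attribute value. Parsing more complex or malformed HTML is best done with dedicated libraries like BeautifulSoup (Python) or the DOMParser in JavaScript.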

Pro Tip: Use regex for quick tasks, but always consider an HTML parser for large-scale or complex projects.
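
As a rough illustration of the parser route, the sketch below uses html.parser from Python's standard library rather than BeautifulSoup, simply to avoid an extra dependency. The OpenTagCollector class and the commented-out <span> in the sample are assumptions added for this example. It relies on HTMLParser calling handle_startendtag for tags written in the XHTML self-closing style.

```python
import re
from html.parser import HTMLParser


class OpenTagCollector(HTMLParser):
    """Collects the names of start tags that are not written as self-closing."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Called for ordinary start tags such as <div> or <p>
        self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style tags such as <br />; deliberately ignored
        pass


html = """
<!-- <span>commented out</span> -->
<div class="example">
<img src="image.png" />
<p>Hello, World!</p>
<br />
</div>
"""

collector = OpenTagCollector()
collector.feed(html)
collector.close()
parser_tags = collector.open_tags

# The regex from this post, applied to the same input for comparison
regex_tags = re.findall(r'<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(?<!/)>', html)

print("parser:", parser_tags)
print("regex: ", regex_tags)
```

On this input the parser skips the <span> inside the comment, while the regex reports it as an open tag, which is the kind of difference that makes a parser the safer choice for anything beyond small, predictable snippets.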

Conclusion

With the regex provided, you can match open HTML tags while excluding XHTML self-contained tags, as long as the markup is simple and well-formed. Always test your regex against different HTML snippets and remember the limitations of regex for parsing HTML. For more complex needs, a parser is often the better choice.

Happy coding!
