Variable Order Regex Syntax

Question

Is there a way to indicate that two or more regex phrases can occur in any order? For instance, XML attributes can be written in any order. Say that I have the following XML:

Solution 1:

No, I believe the best way to do it with a single RE is exactly as you describe. Unfortunately, it gets very messy once your XML can have 5 different attributes: that is 5! = 120 possible orderings, each of which needs its own RE (or its own alternative inside one big RE).

On the other hand, I wouldn't do this with an RE at all, since REs aren't meant to be programming languages. What's wrong with the old-fashioned approach of using an XML processing library?

If you're required to use an RE, this answer probably won't help much, but I believe in using the right tools for the job.
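
As a minimal sketch of the library approach, assuming Python's standard xml.etree.ElementTree: once the tag is parsed, its attributes sit in a dictionary and their order in the source no longer matters. The sample markup, the <links> wrapper element and the helper name has_link_attrs are made up for illustration and are not from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same link written with its attributes in two orders,
# plus one link that lacks the title attribute.
SAMPLE = """<links>
  <a href="home.php" class="link" title="Home">Home</a>
  <a href="home.php" title="Home" class="link">Home</a>
  <a href="other.php" class="link">Other</a>
</links>"""


def has_link_attrs(elem):
    """Return True if the element carries class="link" and title="Home"."""
    # elem.attrib is a plain dict, so the order in the source is irrelevant.
    return elem.attrib.get("class") == "link" and elem.attrib.get("title") == "Home"


root = ET.fromstring(SAMPLE)
matches = [a for a in root.iter("a") if has_link_attrs(a)]
print(len(matches))  # prints 2
```

With 5 attributes the check is still a handful of dictionary lookups rather than one RE per ordering.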

Solution 2:

Have you considered XPath, where attribute order doesn't matter?

//a[@class and @title]

This will select both <a> nodes as valid matches. The only caveat is that the input must be XHTML (well-formed XML).
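
A hedged sketch of running this kind of query from Python. The standard library's ElementTree supports only a subset of XPath and does not accept "and" inside a predicate, so the sketch chains two predicates, [@class][@title], which selects the same nodes. The sample document is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed input; the third <a> lacks a title attribute.
DOC = """<body>
  <a href="home.php" class="link" title="Home">Home</a>
  <a href="home.php" title="Home" class="link">Home</a>
  <a href="other.php" class="link">Other</a>
</body>"""

root = ET.fromstring(DOC)

# Chained predicates stand in for //a[@class and @title].
selected = root.findall(".//a[@class][@title]")
hrefs = [a.get("href") for a in selected]
print(hrefs)  # ['home.php', 'home.php']
```

A full XPath engine should accept the original expression unchanged. The chained form is only a workaround for ElementTree's subset.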

Solution 3:

You can create a lookahead for each of the attributes and plug them into a regex for the whole tag. For example, the regex for the tag could be

<a\b[^<>]*>

If you're using this on XML you'll probably need something more elaborate. By itself, this base regex will match a tag with zero or more attributes. Then you add a lookahead for each of the attributes you want to match:

(?=[^<>]*\s+class="link")
(?=[^<>]*\s+title="Home")

The [^<>]* lets it scan ahead for the attribute, but won't let it look beyond the closing angle bracket. Matching the leading whitespace here in the lookahead serves two purposes: it's more flexible than matching it in the base regex, and it ensures that we're matching a whole attribute name. Combining them we get:

<a\b(?=[^<>]*\s+class="link")(?=[^<>]*\s+title="Home")[^<>]+>[^<>]+</a>

Of course, I've made some simplifying assumptions for the sake of clarity. I didn't allow for whitespace around the equals signs, for single-quotes or no quotes around the attribute values, or for angle brackets in the attribute values (which I hear is legal, but I've never seen it done). Plugging those leaks (if you need to) will make the regex uglier, but won't require changes to the basic structure.
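
To see the combined regex at work, here is a small sketch using Python's re module, which supports the same lookahead syntax. The three sample tags are assumptions chosen to cover both attribute orders and one tag with a missing attribute.

```python
import re

# The combined pattern from above: one lookahead per required attribute.
TAG = re.compile(
    r'<a\b'
    r'(?=[^<>]*\s+class="link")'
    r'(?=[^<>]*\s+title="Home")'
    r'[^<>]+>[^<>]+</a>'
)

# Hypothetical inputs: both attribute orders, then a tag with no title.
samples = [
    '<a href="home.php" class="link" title="Home">Home</a>',
    '<a href="home.php" title="Home" class="link">Home</a>',
    '<a href="home.php" class="link">Home</a>',
]

results = [bool(TAG.search(s)) for s in samples]
print(results)  # [True, True, False]
```

Each lookahead starts from the same position right after <a, which is why the two attributes may appear in either order.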

Solution 4:

You could use named groups to pull the attributes out of the tag. Run the regex and then loop over the groups, running whatever tests you need.

Something like this (untested, using .NET regex syntax with \w for word characters and \s for whitespace):

<a(\s+(?<key>[\w:-]+)\s*=\s*['"](?<value>[^'"]*)['"])+\s*/?>
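
A rough Python equivalent of the same idea, offered as a sketch rather than a translation. Python's re module spells named groups (?P<name>...) and keeps only the last capture of a repeated group, so the loop over attributes is done with finditer instead. The helper name parse_attrs and the sample tag are assumptions.

```python
import re

# Matches the opening <a ...> tag and captures its raw attribute text.
OPEN_TAG = re.compile(r'<a\b(?P<attrs>[^<>]*)>')
# One key="value" or key='value' pair; the backreference closes with the same quote.
ATTR = re.compile(r'''(?P<key>[\w:-]+)\s*=\s*(?P<quote>["'])(?P<value>.*?)(?P=quote)''')


def parse_attrs(tag_text):
    """Return the attributes of the first <a> tag as a dict, or None if no tag."""
    tag = OPEN_TAG.search(tag_text)
    if tag is None:
        return None
    return {m.group("key"): m.group("value") for m in ATTR.finditer(tag.group("attrs"))}


attrs = parse_attrs('<a title="Home" href="home.php" class=\'link\'>Home</a>')
# With the attributes in a dict, the tests no longer depend on their order.
is_home_link = attrs is not None and attrs.get("class") == "link" and attrs.get("title") == "Home"
print(attrs, is_home_link)
```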

Solution 5:

The easiest way would be to write a regex that picks up the <a .... > part, and then write two more regexes to pull out the class and the title. Although you could probably do it with a single regex, it would be very complicated and probably a lot more error-prone.

With a single regex you would need something like

<a[^>]*((class="([^"]*)")|(title="([^"]*)"))?((title="([^"]*)")|(class="([^"]*)"))?[^>]*>

That is just a first guess; I haven't checked whether it's even valid. It's much easier to divide and conquer the problem.
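
The divide-and-conquer version might look like the sketch below, written in Python as an assumption since no language is named here. The three patterns and the helper name class_and_title are illustrative, and only double-quoted attribute values are handled.

```python
import re

OPEN_TAG = re.compile(r'<a\b[^<>]*>')          # step 1: grab the opening tag
CLASS_ATTR = re.compile(r'\sclass="([^"]*)"')   # step 2: pull out class
TITLE_ATTR = re.compile(r'\stitle="([^"]*)"')   # step 3: pull out title


def class_and_title(text):
    """Return (class, title) for the first <a> tag; a missing piece is None."""
    tag = OPEN_TAG.search(text)
    if tag is None:
        return None, None
    # Each attribute is searched for independently, so order does not matter.
    cls = CLASS_ATTR.search(tag.group(0))
    title = TITLE_ATTR.search(tag.group(0))
    return (cls.group(1) if cls else None, title.group(1) if title else None)


print(class_and_title('<a href="home.php" title="Home" class="link">Home</a>'))
```

Each regex stays short enough to check by eye, which is the main argument for splitting the work.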
