Capture Content Inside Html Tags With Regex
Solution 1:
It sounds like you need to enable the "dot all" (s) flag. This will make . match all characters including line breaks. For example:
preg_match('/<div\s*class="intro-content">(.*)<\/div>/s', $html, $matches);
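To see what the s flag changes, here is a minimal sketch that runs the same pattern with and without it. The sample $html string and the variable names are made up for illustration.

```php
<?php
// Hypothetical sample input: the div content spans several lines.
$html = "<div class=\"intro-content\">\nHello\nworld\n</div>";

// Without the s flag, . stops at the first line break, so the pattern fails.
$without = preg_match('/<div\s*class="intro-content">(.*)<\/div>/', $html, $m1);

// With the s flag, . also matches line breaks, so the whole body is captured.
$with = preg_match('/<div\s*class="intro-content">(.*)<\/div>/s', $html, $m2);

var_dump($without); // int(0)
var_dump($with);    // int(1)
var_dump($m2[1]);   // the text between the tags, line breaks included
```

The captured content ends up in index 1 of the matches array, because index 0 always holds the full match.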
Solution 2:
You should not use regexes to parse HTML like this. div tags can be nested, and since a regex has no context, there is no way to parse that reliably. Use an HTML parser instead. For example:
$doc = new DOMDocument();
$doc->loadHTML($html);
// DOMDocument has no getElementsByClassName(); select the divs by tag name
foreach ($doc->getElementsByTagName("div") as $div) {
    var_dump($div);
}
See: DomDocument
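If only the div with the intro-content class is wanted, one possible approach is to query the parsed document with DOMXPath. This is a sketch under assumptions: the sample markup is invented, and the XPath expression only matches when the class attribute is exactly "intro-content".

```php
<?php
// Hypothetical sample input with a nested div inside the target div.
$html = '<div class="intro-content"><p>Hello</p><div>nested</div></div>'
      . '<div class="other">x</div>';

// Collect parser warnings instead of printing them; real-world HTML is often invalid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select only divs whose class attribute is exactly "intro-content".
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@class="intro-content"]');

// Rebuild the inner HTML of the first match by serializing its children.
$inner = '';
foreach ($nodes->item(0)->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $inner;
```

Unlike the regex, this keeps the nested div inside the captured content, because the parser knows which closing tag belongs to which opening tag.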
Edit:
And then I saw your note:
I am forced to use regex because this application stores regexes in a database and only functions this way. I absolutely cannot change the functionality
Well. At least make sure that you match non-greedy. That way it'll match correctly as long as there are no nested tags:
preg_match('/<div\s*class="intro-content">(.*?)<\/div>/s', $html, $matches);
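The difference between greedy and non-greedy matching, and the nested-tag limitation, can be sketched like this. The sample strings are invented for illustration.

```php
<?php
// Two sibling divs: greedy matching runs on to the last </div>.
$siblings = '<div class="intro-content">first</div><div class="other">second</div>';

preg_match('/<div\s*class="intro-content">(.*)<\/div>/s', $siblings, $greedy);
preg_match('/<div\s*class="intro-content">(.*?)<\/div>/s', $siblings, $lazy);

var_dump($greedy[1]); // string(36) "first</div><div class="other">second"
var_dump($lazy[1]);   // string(5) "first"

// A nested div: non-greedy matching stops at the inner </div>, which is too early.
$nestedHtml = '<div class="intro-content">a<div>b</div>c</div>';
preg_match('/<div\s*class="intro-content">(.*?)<\/div>/s', $nestedHtml, $nested);

var_dump($nested[1]); // string(7) "a<div>b"
```

So the non-greedy version is only safe when the content is known not to contain another div.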
Solution 3:
"This obviously doesn't work because the . character will not match space characters."
It should do: . matches spaces and tabs, just not line breaks unless the s flag is set. If it doesn't, we can add the whitespace characters in explicitly. Inside a character class . is a literal dot, so use \S (any non-whitespace character) there instead:
<div\s*class="intro-content">([ \t\r\n\S]*)</div>
You then need to make it lazy, so it captures everything up to the first </div> and not the last. We do this by adding a question mark:
<div\s*class="intro-content">([ \t\r\n\S]*?)</div>
There. Give that a shot. You might be able to replace the space characters ( \t\r\n) between [ and ] with a single \s too, which gives [\s\S].
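As a rough check of the character-class idea: a class that combines \s with \S matches any character, line breaks included, so the pattern works even when no flags can be set on the stored regex. The sample input and the ~ delimiter are assumptions for illustration. The inline (?s) modifier shown at the end is a standard PCRE feature, but whether the application accepts it is an assumption.

```php
<?php
// Hypothetical multi-line input followed by unrelated markup.
$html = "<div class=\"intro-content\">\n  line one\n  line two\n</div><p>after</p>";

// No s flag: [\s\S] matches any character, so line breaks are covered.
// Using ~ as the delimiter avoids escaping the slash in </div>.
preg_match('~<div\s*class="intro-content">([\s\S]*?)</div>~', $html, $classMatch);

// Alternative: switch on dot-all inside the pattern itself with (?s).
preg_match('~(?s)<div\s*class="intro-content">(.*?)</div>~', $html, $inlineMatch);

var_dump($classMatch[1] === $inlineMatch[1]); // bool(true)
echo $classMatch[1];
```

Both variants capture the same text here, so either can be used when the regex has to carry its own behaviour.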