Hi Jesse,
Thanks for bringing up those points, but I wouldn't worry about performance or memory consumption issues related to the Multiline
flag when matching patterns in an HTML document. Pattern matching is slow by nature, and in this case it might not be executed in a
batch process where performance would really be a concern. Also, almost any expression will probably perform well when executed
against a standard-sized HTML document.
I think my solution with the addition of the Multiline option should be fine. Only if the user experiences performance issues due to
the expression would I recommend that a more complex expression be used. A more complex expression is much harder to write
and debug, but it may perform better. The user must therefore make a trade-off decision, but I wouldn't recommend sacrificing ease
of writing and debugging (and therefore understanding) to address performance concerns that aren't real. Once it's known that
the expression doesn't perform well, the trade-off can be made.
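To find out whether performance is a real problem, it's enough to time the expression against a representative document. Here's a
minimal sketch. The pattern <img.*?> and the Singleline option (the .NET option that lets . match newlines) are placeholders for
illustration only, as are the class and method names, so substitute whatever expression and options you actually use:

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class RegexTiming
{
    // Runs the expression once over the input and returns the match count,
    // reporting the elapsed time through the out parameter.
    public static int CountMatches(string pattern, RegexOptions options, string input, out TimeSpan elapsed)
    {
        Regex re = new Regex(pattern, options);
        Stopwatch watch = Stopwatch.StartNew();
        int count = re.Matches(input).Count; // reading Count forces every match to be evaluated
        watch.Stop();
        elapsed = watch.Elapsed;
        return count;
    }

    public static void Main()
    {
        // Hypothetical sample input; use a real page of typical size for a meaningful number.
        string html = "<html><body>\n<img\n src=\"test.jpg\"\n/>\n</body></html>";
        TimeSpan elapsed;
        int count = CountMatches("<img.*?>", RegexOptions.Singleline | RegexOptions.IgnoreCase, html, out elapsed);
        Console.WriteLine("{0} match(es) in {1} ms", count, elapsed.TotalMilliseconds);
    }
}
```

A single run over a small document mostly measures start-up cost, so repeat the call a few times over realistic input before
drawing any conclusions.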
Anyway, I followed your post and your points seemed to make perfect sense, but your expression didn't work when I tested it on the
following document:
string html = @"<html>
<head></head>
<body>
<img src=""test.jpg""></img>
</body>
</html>
";
0 matches.
And it didn't work on this document either:
string html = @"<html>
<head></head>
<body>
<img src=""test.jpg""></img>
<img src=""test.jpg"" />
<img src=""test.jpg""></img>
</body>
</html>
";
1 match, but it's invalid:
{<img src="test.jpg"></img>
<img src="test.jpg" />}
Here's the code I used to test your expression:
System.Text.RegularExpressions.Regex re = new System.Text.RegularExpressions.Regex(
    @"<img(""[^""]*""|'[^']*'|[^>])*(/>|>((?!</img).)*</img\s+>)");
foreach (System.Text.RegularExpressions.Match match in re.Matches(html))
{
    match.GetType(); // no-op statement, only there to hold a debugger breakpoint
}
I didn't even attempt to do any debugging of my own :)
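If anyone wants to reproduce this, here's the same test as a self-contained console sketch. The class and member names are only
for illustration. The second pattern, which swaps \s+ for \s* in the closing tag, is a guess on my part and not something from your
post: \s+ requires at least one whitespace character before the final >, which may be why a plain </img> never matches.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgRegexTest
{
    // The expression exactly as posted.
    public const string Posted =
        @"<img(""[^""]*""|'[^']*'|[^>])*(/>|>((?!</img).)*</img\s+>)";

    // Assumption: the same expression with optional whitespace (\s*) before the closing >.
    public const string Relaxed =
        @"<img(""[^""]*""|'[^']*'|[^>])*(/>|>((?!</img).)*</img\s*>)";

    public const string OneTag = @"<html>
<head></head>
<body>
<img src=""test.jpg""></img>
</body>
</html>
";

    public const string ThreeTags = @"<html>
<head></head>
<body>
<img src=""test.jpg""></img>
<img src=""test.jpg"" />
<img src=""test.jpg""></img>
</body>
</html>
";

    public static int Count(string pattern, string html)
    {
        return new Regex(pattern).Matches(html).Count;
    }

    // Prints the match count followed by each match wrapped in braces.
    public static void Dump(string label, string pattern, string html)
    {
        MatchCollection matches = new Regex(pattern).Matches(html);
        Console.WriteLine("{0}: {1} match(es)", label, matches.Count);
        foreach (Match match in matches)
        {
            Console.WriteLine("{" + match.Value + "}");
        }
    }

    public static void Main()
    {
        Dump("posted, one tag", Posted, OneTag);
        Dump("posted, three tags", Posted, ThreeTags);
        Dump("relaxed, one tag", Relaxed, OneTag);
        Dump("relaxed, three tags", Relaxed, ThreeTags);
    }
}
```

Note that "[^"]* in the attribute part can run past a > and over a line break until it finds the next double quote, which looks
like how a single match can end up spanning two tags.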
--
Dave Sexton
"Jesse Houwing" <jesse.houwing@nospam.sogeti.nl> wrote in message news:ewLFbsk5GHA.3452@TK2MSFTNGP05.phx.gbl...
> GS wrote:
>> what is a good general regex expression for html <img ....> tag?
>> I tried
>> "<img [/:.a-z =0-9\""_;&]*\->", RegexOptions.IgnoreCase)
>> but it is not quite working
>>
>> thank you for your time
>
> It looks like you've already had a working answer, but I still want to comment on a few issues.
>
> By default, the . does not match newlines, so image tags like these
>
> <img
> src = "http://.../img.gif"
> />
>
> won't be matched if your expression is <img.*>. Adding or removing the ? to make it <img.*?> doesn't change things. There is an
> option to allow . to match newlines, but that option is potentially very resource intensive (if your input is 2MB, it will
> match 2MB and start backtracking from there).
>
> A safer expression would be the following: <img[^>]*>, which matches everything between <img and > that is not a > itself. This
> will work in most cases. There's one problem though: > is allowed within quotes if you follow the standards. This can also be
> caught in regex:
>
> <img("[^"]*"|'[^']*'|[^>])*>
>
> If you want to catch the corresponding </img> tag as well, things get harder, though this is still possible to a certain degree.
>
> First we match everything up to the end of the tag
> <img("[^"]*"|'[^']*'|[^>])*
>
> and then we match either /> or >......</img>
>
> (/>|>.*?</img>)
>
> As you can see, I added the lazy modifier again, but this will suffer from the same issues as before. Is there a better solution,
> you might ask? Of course there is :).
>
> By using a negative look-ahead we can match everything that is not the start of </img as follows:
>
> ((?!</img).)*
>
> Combine this with what we already had and you get this:
>
> <img("[^"]*"|'[^']*'|[^>])*(/>|>((?!</img).)*</img>)
>
> Only one issue left to tackle. The </img> tag does not necessarily have the closing > directly after the tagname. Whitespace is
> allowed in the closing tag. This can easily be added:
>
> <img("[^"]*"|'[^']*'|[^>])*(/>|>((?!</img).)*</img\s+>)
>
> Kind regards,
>
> Jesse Houwing