I've been doing some HTML parsing & cleaning lately, which often involves a lot of regular expressions. It turns out that .* doesn't match across newlines, though, so if you want to grab an HTML page's title value, the following regular expression
<title>(.*?)</title>
works for this HTML
<title>this is my page title</title>
but not for this HTML
<title>this is my
page title</title>
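To see the difference for yourself, here's a minimal C# sketch that runs the pattern against both snippets. The class and method names are made up for illustration, and the second string assumes the title is split by a single \n.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotNewlineDemo
{
    // The pattern from above: a lazy ".*?" between the title tags.
    public const string Pattern = "<title>(.*?)</title>";

    // True when the pattern finds a title in the given HTML.
    public static bool Matches(string html)
    {
        return Regex.IsMatch(html, Pattern);
    }

    public static void Main()
    {
        string oneLine = "<title>this is my page title</title>";
        string twoLines = "<title>this is my\npage title</title>";

        Console.WriteLine(Matches(oneLine));   // True
        Console.WriteLine(Matches(twoLines));  // False: "." stops at \n
    }
}
```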
I usually use regexlib.com for patterns, tips, & testing, & it didn't say anything about the period . not matching newlines. It supposedly matches any character, but by default in .NET it doesn't match a line feed (\n). So...after banging my head against a wall for a while, I finally found a good tip at OsterMiller.org that suggested this pattern
<title>((.|[\r\n])*?)</title>
and voila, it worked! So I was able to write my GetTagValue function like so
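// Requires: using System; using System.Text.RegularExpressions;
public static string GetTagValue(string html, string tag)
{
    // Opening tag (with optional attributes), lazily captured content, closing tag.
    string pattern = @"<\s*" + tag + @"[^>]*>((.|[\r\n])*?)</\s*" + tag + @"[^>]*>";
    Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);

    // Regex.Match never returns null, so checking Success is enough.
    if (!m.Success)
        return String.Empty;

    // Group 1 is everything between the opening and closing tags.
    return m.Groups[1].Value;
}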
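Here's a rough usage sketch, restating the function inside a made-up HtmlHelper class so that it compiles on its own. Two things in it are my assumptions rather than part of the tip: the tag name goes through Regex.Escape and is followed by a \b word boundary (otherwise asking for "b" would also match <body>), and the second method, GetTagValueSingleline, is a hypothetical alternative that uses RegexOptions.Singleline so a plain . matches \n too.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlHelper
{
    // Same approach as above: "." or an explicit CR/LF between the tags.
    public static string GetTagValue(string html, string tag)
    {
        // Assumption: escape the tag name and require a word boundary after it,
        // so that "b" does not also match "<body>".
        string name = Regex.Escape(tag) + @"\b";
        string pattern = @"<\s*" + name + @"[^>]*>((.|[\r\n])*?)</\s*" + name + @"[^>]*>";
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : String.Empty;
    }

    // Alternative: plain ".*?" with Singleline, which makes "." match \n too.
    public static string GetTagValueSingleline(string html, string tag)
    {
        string name = Regex.Escape(tag) + @"\b";
        string pattern = @"<\s*" + name + @"[^>]*>(.*?)</\s*" + name + @"[^>]*>";
        Match m = Regex.Match(html, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : String.Empty;
    }

    public static void Main()
    {
        string html = "<html><head><TITLE lang=\"en\">this is my\r\npage title</TITLE></head>"
                    + "<body>hi</body></html>";

        // Both calls return the title text, line break included.
        Console.WriteLine(GetTagValue(html, "title") == GetTagValueSingleline(html, "title")); // True
        Console.WriteLine(GetTagValue(html, "body"));      // hi
        Console.WriteLine(GetTagValue(html, "b").Length);  // 0, since "<body>" is not "<b>"
    }
}
```

Either version is still just a regex over markup, so it's fine for quick scraping of simple tags like the title, but nested tags with the same name will trip it up.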
© Copyright 2008, Ben Strackany