DevelopmentNow Blog
 Friday, October 20, 2006
 
 

I've been doing some HTML parsing & cleaning lately, which involves a lot of regular expressions. It turns out that .* doesn't match across newlines by default, so if you want to grab an HTML page's title value, the following regular expression:

<title>(.*?)</title>

works for this HTML

<title>this is my page title</title>

but not for this HTML

<title>this is my
page title</title>
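
To see the difference for yourself, here is a minimal C# sketch, assuming the .NET Regex class (the same engine used by the function later in this post). The class and method names are made up for illustration and are not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotNewlineDemo
{
    // The pattern from above: a lazy ".*?" between the title tags.
    public const string TitlePattern = @"<title>(.*?)</title>";

    // Returns true when the pattern finds a title in the given HTML.
    public static bool MatchesTitle(string html)
    {
        return Regex.IsMatch(html, TitlePattern);
    }

    // Returns true when a lone "." matches the given single character.
    // \A and \z anchor to the absolute start and end of the input.
    public static bool DotMatches(char c)
    {
        return Regex.IsMatch(c.ToString(), @"\A.\z");
    }
}
```

With default options, MatchesTitle returns true for the one-line title and false for the two-line one, and DotMatches('\n') returns false.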

I usually use regexlib.com for patterns, tips, & testing, & it didn't say anything about the period . not matching newlines. It supposedly matches any character, but in .NET it matches everything except the line feed \n by default (a lone CR still matches, but a CRLF line ending contains the LF that stops it). So... after banging my head against a wall for a while I finally found a good tip at OsterMiller.org that suggested this pattern

<title>((.|[\r\n])*?)</title>

and voila, it worked! So I was able to write my GetTagValue function like so

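// Requires: using System.Text.RegularExpressions;
public static string GetTagValue(string html, string tag)
{
    // Escape the tag name so any regex metacharacters in it are matched literally;
    // the \b keeps a tag like "b" from also matching "<body>".
    string name = Regex.Escape(tag) + @"\b";
    string pattern = @"<\s*" + name + @"[^>]*>((.|[\r\n])*?)</\s*" + name + @"[^>]*>";
    Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);

    // Regex.Match never returns null; a failed match has Success == false.
    if (!m.Success)
        return String.Empty;

    // Group 1 is the outer capture: everything between the opening and closing tags.
    return m.Groups[1].Value;
}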
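
For what it's worth, .NET also has RegexOptions.Singleline, which makes the period match every character including \n, so the (.|[\r\n]) alternation isn't strictly needed. Below is a sketch of that variant, not a drop-in replacement I've battle-tested. The class and method names are hypothetical, and it adds Regex.Escape and a \b after the tag name as my own assumptions about what a caller would want.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagValueSketch
{
    // Hypothetical variant of GetTagValue that relies on Singleline
    // instead of the (.|[\r\n]) alternation.
    public static string GetTagValueSingleline(string html, string tag)
    {
        // Escape the tag name so it is matched literally.
        string name = Regex.Escape(tag);

        // \b keeps "b" from matching "<body>"; (.*?) is the lazy capture.
        string pattern = @"<\s*" + name + @"\b[^>]*>(.*?)</\s*" + name + @"\s*>";

        // Singleline: "." also matches \n. IgnoreCase: <TITLE> matches too.
        Match m = Regex.Match(html, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        return m.Success ? m.Groups[1].Value : String.Empty;
    }
}
```

Calling GetTagValueSingleline("<title>this is my\npage title</title>", "title") returns the title with the line break still inside it, so you may want to normalize whitespace afterwards.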
