dotnet general : Regular expression for nested HTML tags


Sudheer
2/27/2008 10:31:27 PM
I am looking for a regular expression to find certain content
present in an HTML page.

The HTML page looks something like this:

<div class="info">
<h5>Genre:</h5>
<a href="http://www.imdb.com/Sections/Genres/Action/">Action</a> / <a
href="http://www.imdb.com/Sections/Genres/Adventure/">Adventure</a> /
<a href="http://www.imdb.com/Sections/Genres/Crime/">Crime</a> / <a
href="http://www.imdb.com/Sections/Genres/Thriller/">Thriller</a> <a
class="tn15more inline" href="http://www.imdb.com/title/tt0337978/
keywords" onclick="(new Image()).src='/rg/title-tease/keywords/images/
b.gif?link=/title/tt0337978/keywords';">more</a>
</div>

<div class="info">
<h5>Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

<div class="info">
<h5>Plot Outline:</h5>
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a class="tn15more
inline" href="http://www.imdb.com/title/tt0337978/plotsummary"
onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
link=/title/tt0337978/plotsummary';">more</a>
</div>



Now I need a regular expression that scans the entire HTML and
helps me extract:
1. the tagline of the movie
2. the plot outline, and so on.


Each of these is guaranteed to be inside a div with class="info".

Jesse Houwing
2/28/2008 2:08:21 PM
Hello Sudheer,

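That would be pretty easy to do. Use RegexOptions.Singleline so that
. also matches the line breaks inside the div. The patterns are written
as C# verbatim strings, where "" stands for a literal quote:

@"<div class=""info"">\s*<h5>Tagline:</h5>(?<Tagline>((?!</div).)+)"
@"<div class=""info"">\s*<h5>Plot Outline:</h5>(?<Plot>((?!</div).)+)"

Or more generic:
@"<div class=""info"">\s*<h5>(?<Key>[^:]+):</h5>(?<Value>((?!</div).)+)"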
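For illustration, here is a minimal sketch of how the generic pattern might be used from C#. The class name InfoExtractor and the method name Extract are made up for this example and are not part of the original reply. The pattern is written as a verbatim string so that \s reaches the regex engine unchanged, and RegexOptions.Singleline is passed because the value starts on the line after </h5>.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named for this sketch only.
public static class InfoExtractor
{
    // Generic key/value pattern from the reply above.
    // Singleline lets '.' match newlines, so the value can span several lines.
    private static readonly Regex InfoBlock = new Regex(
        @"<div class=""info"">\s*<h5>(?<Key>[^:]+):</h5>(?<Value>((?!</div).)+)",
        RegexOptions.Singleline);

    // Returns one entry per info block, e.g. "Tagline" -> "Yippee Ki Yay Mo - John 6:27".
    public static Dictionary<string, string> Extract(string html)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in InfoBlock.Matches(html))
        {
            string key = m.Groups["Key"].Value.Trim();
            string value = m.Groups["Value"].Value.Trim();
            result[key] = value; // a later block with the same heading overwrites an earlier one
        }
        return result;
    }

    public static void Main()
    {
        string html = @"<div class=""info"">
<h5>Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>";

        Dictionary<string, string> info = Extract(html);
        Console.WriteLine(info["Tagline"]); // prints: Yippee Ki Yay Mo - John 6:27
    }
}
```

Two things to keep in mind with this approach: the captured value is raw markup, so the Plot Outline entry would still contain the trailing "more" link, and the match stops at the first </div, so a div nested inside an info block would cut the value short.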

Another option, which would be a little more robust, would be to use the HTML
Agility Pack (can be found on www.codeplex.com).
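As a rough sketch of that second option, the same key/value extraction could look like this with the HTML Agility Pack. It assumes the HtmlAgilityPack assembly is referenced in the project. InfoParser and Extract are hypothetical names chosen for this example, and the XPath expression assumes the class attribute is exactly "info", as in the sample markup.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party library, must be referenced separately

// Hypothetical helper, named for this sketch only.
public static class InfoParser
{
    public static Dictionary<string, string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new Dictionary<string, string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection divs = doc.DocumentNode.SelectNodes("//div[@class='info']");
        if (divs == null)
        {
            return result;
        }

        foreach (HtmlNode div in divs)
        {
            // The <h5> child holds the label, e.g. "Tagline:".
            HtmlNode heading = div.SelectSingleNode("h5");
            if (heading == null)
            {
                continue;
            }

            string key = heading.InnerText.Trim().TrimEnd(':');

            // Drop the heading so that the remaining text of the div is the value.
            heading.Remove();
            string value = HtmlEntity.DeEntitize(div.InnerText).Trim();

            result[key] = value;
        }
        return result;
    }
}
```

Because InnerText concatenates the text of all child nodes, the Plot Outline value here would end with the link text "more" rather than with the anchor markup. Stripping that link is left out to keep the sketch short.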

--
Jesse Houwing
jesse.houwing at sogeti.nl