Hello Sudheer,
> I am looking for a regular expression for finding certain content
> present in an HTML page
>
> The html page looks something like this:
>
> <div class="info">
> <h5>Genre:</h5>
> <a href="http://www.imdb.com/Sections/Genres/Action/">Action</a> / <a
> href="http://www.imdb.com/Sections/Genres/Adventure/">Adventure</a> /
> <a href="http://www.imdb.com/Sections/Genres/Crime/">Crime</a> / <a
> href="http://www.imdb.com/Sections/Genres/Thriller/">Thriller</a> <a
> class="tn15more inline" href="http://www.imdb.com/title/tt0337978/keywords"
> onclick="(new Image()).src='/rg/title-tease/keywords/images/b.gif?link=/title/tt0337978/keywords';">more</a>
> </div>
> <div class="info">
> <h5>Tagline:</h5>
> Yippee Ki Yay Mo - John 6:27
> </div>
> <div class="info">
> <h5>Plot Outline:</h5>
> John McClane takes on an Internet-based terrorist organization who is
> systematically shutting down the United States. <a class="tn15more
> inline" href="http://www.imdb.com/title/tt0337978/plotsummary"
> onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?link=/title/tt0337978/plotsummary';">more</a>
> </div>
> now i need a regular expression that looks out the entire HTML and
> helps me extract
> 1. the tagline of the movie
> 2. the plot outline etc etc.
> it is assured that they will be present in a div with id= "info"
>
> any help in this regard would be appreciated!
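That would be pretty easy to do. In C#, write the patterns as verbatim
strings (so \s needs no extra escaping) and pass RegexOptions.Singleline:
in your sample the text sits on its own lines, and '.' does not match a
newline by default.
@"<div class=""info"">\s*<h5>Tagline:</h5>(?<Tagline>((?!</div).)+)"
@"<div class=""info"">\s*<h5>Plot Outline:</h5>(?<Plot>((?!</div).)+)"
Or more generic:
@"<div class=""info"">\s*<h5>(?<Key>[^:]+):</h5>(?<Value>((?!</div).)+)"
The captured value is everything up to the closing </div>, so it still
includes surrounding whitespace and any markup such as the "more" link.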
Another option, which would be a little more robust, would be to use the
HTML Agility Pack (available at www.codeplex.com).
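For the regex route, here is a minimal C# sketch of how the generic pattern might be applied. The names InfoBlocks and Extract are placeholders I made up for illustration, and the code assumes every value sits between the </h5> and the closing </div>, as in your sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the class and method names are illustrative only.
public static class InfoBlocks
{
    // The generic Key/Value pattern, written as a verbatim string.
    // Singleline lets '.' match the newlines inside each <div>.
    private static readonly Regex Pattern = new Regex(
        @"<div class=""info"">\s*<h5>(?<Key>[^:]+):</h5>(?<Value>((?!</div).)+)",
        RegexOptions.Singleline);

    // Returns one entry per info block, keyed by the <h5> heading text.
    public static Dictionary<string, string> Extract(string html)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in Pattern.Matches(html))
        {
            // Only whitespace is trimmed; the value keeps any inner markup
            // (for example the "more" link in the plot outline).
            string key = m.Groups["Key"].Value.Trim();
            string value = m.Groups["Value"].Value.Trim();
            result[key] = value; // a repeated heading overwrites the earlier one
        }
        return result;
    }
}
```

With the page source in a string named html, InfoBlocks.Extract(html)["Tagline"] would give you the tagline text, and the "Plot Outline" entry would give the outline, still including the trailing "more" link markup, which you would have to strip separately.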
--
Jesse Houwing
jesse.houwing at sogeti.nl