dotnet xml : XSLT and HTML


Mike P
1/26/2005 4:34:31 AM
I am quite new to XML and XSLT, and I know you can apply XSLT to XML to
display data in an XML file according to the XSLT file, but is it
possible to apply an XSLT file to page/s of HTML, so that you aren't
just limited to your XML data, but can transform a whole web page?


Any assistance would be really appreciated.


Cheers,

Mike



*** Sent via Developersdex http://www.developersdex.com ***
Bruce Wood
1/26/2005 10:49:14 AM
An important question may be whether you are doing the XSLT transform
in .NET code, or using a stand-alone XSLT program such as msxsl.exe.

If it's the latter, then Brian is correct: your input must be properly
formed, which means that old-style HTML that doesn't conform to XHTML
won't work, because the tags aren't matched. For example, most people
writing HTML don't bother putting in the closing </p> tag at the ends
of paragraphs. They just put <p> all over the place and expect that the
browser will realize that <p> in the middle of another <p> really means
</p><p>: close the previous paragraph and start a new one.

As well, lots of people writing HTML don't sweat about whether to use
<p> or <P>: they're both the same thing in HTML, but not in XML.

So, if you're using a stand-alone engine then you have to feed it
well-formed HTML, which the industry calls XHTML.
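
To make the point about unmatched and mixed-case tags concrete, here is a minimal sketch. It uses Python's standard xml.etree.ElementTree parser as a stand-in for any XML parser, not msxsl.exe or the .NET engine, and the helper name is_well_formed is made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, else False."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Old-style HTML: <p> is never closed, so the tags do not match up.
tag_soup = "<body><p>First paragraph<p>Second paragraph</body>"

# Mixed case: <P> and </p> are different names to an XML parser.
mixed_case = "<body><P>Hello</p></body>"

# XHTML style: every element is closed and names are consistent.
xhtml = "<body><p>First paragraph</p><p>Second paragraph</p></body>"

for label, markup in [("tag soup", tag_soup),
                      ("mixed case", mixed_case),
                      ("xhtml", xhtml)]:
    print(label, "->", is_well_formed(markup))
```

Only the last string parses. The first two are rejected before any transform could run on them.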

If you're reading the HTML into .NET and then transforming it within C#
or VB, there may be a way, as Martin pointed out, to massage the
resulting data structure into something the XSLT transform engine will
accept.
Brian Staff
1/26/2005 11:29:33 AM
It's really easy if the web pages conform to XHTML: each page can then be
treated as an XML string. If you have the ability to modify the pages, just
make sure the HTML conforms to XHTML.

e.g.

<hr> should become <hr />
<br> should become <br />
<img .....> should become <img ..... />
<input ...> should become <input .... />
<meta .....> should become <meta .... />
<link ....> should become <link .... />

and any valueless (minimized) attributes like nowrap should become nowrap='nowrap'

That's about it really...

Brian
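
The conversions listed in this post can be automated to a degree. The following is a rough sketch using Python's standard html.parser module, not a .NET tool. The class name XhtmlRewriter and the function to_xhtml are hypothetical, and the set of void elements contains only the six tags named above. It closes those tags, expands minimized attributes, and lowercases names, but it does not repair unclosed elements such as a missing </p>.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements HTML allows without a closing tag (only the ones listed above).
VOID_ELEMENTS = {"hr", "br", "img", "input", "meta", "link"}

class XhtmlRewriter(HTMLParser):
    """Re-emit HTML with self-closed void elements and quoted attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def _open_tag(self, tag, attrs, close):
        rendered = "".join(
            # A minimized attribute such as nowrap arrives with value None.
            f" {name}={quoteattr(name if value is None else value)}"
            for name, value in attrs
        )
        self.parts.append(f"<{tag}{rendered}{' /' if close else ''}>")

    def handle_starttag(self, tag, attrs):
        self._open_tag(tag, attrs, close=tag in VOID_ELEMENTS)

    def handle_startendtag(self, tag, attrs):
        self._open_tag(tag, attrs, close=True)

    def handle_endtag(self, tag):
        # Void elements were already closed when they were opened.
        if tag not in VOID_ELEMENTS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Re-escape text so characters like & stay legal in XML.
        self.parts.append(escape(data))

def to_xhtml(html: str) -> str:
    rewriter = XhtmlRewriter()
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.parts)

print(to_xhtml('<TD nowrap>one<BR>two<hr><img src="a.png"></TD>'))
# <td nowrap="nowrap">one<br />two<hr /><img src="a.png" /></td>
```

The parser lowercases tag and attribute names on its own, which also takes care of the <P> versus <p> problem.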
Martin Honnen
1/26/2005 2:01:07 PM


XSLT (at least according to its specification) transforms an input
tree into a result tree, which can then be serialized as text, as XML,
as HTML, or even with a custom serialization. The input tree is usually
constructed from XML, but it is possible to construct one from other
input. With .NET, I think you should be able to use the SgmlReader class
to parse HTML and produce proper input for the .NET transformer:
<http://www.gotdotnet.com/Community/UserSamples/Details.aspx?SampleGuid=B90FDDCE-E60D-43F8-A5C4-C3BD760564BC>

--

Martin Honnen
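
The idea that a tree is separate from its serialization can be illustrated outside .NET. This is a small sketch using Python's standard xml.etree.ElementTree, as an analogy only: it is not SgmlReader or the .NET transformer. The tree is built in memory without parsing any XML text, then written out three ways.

```python
import xml.etree.ElementTree as ET

# Build a small tree directly, without parsing any XML text.
body = ET.Element("body")
para = ET.SubElement(body, "p")
para.text = "Line one"
br = ET.SubElement(para, "br")
br.tail = "Line two & more"

# The same tree, serialized with three different output methods.
as_xml = ET.tostring(body, encoding="unicode", method="xml")
as_html = ET.tostring(body, encoding="unicode", method="html")
as_text = ET.tostring(body, encoding="unicode", method="text")

print(as_xml)   # the empty element is written as <br />
print(as_html)  # the empty element is written as <br>
print(as_text)  # markup is dropped, only the text remains
```

Where the tree came from does not matter to the serializer, which is why a parser that turns loose HTML into a tree is enough to feed a transform.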
Mike P
1/27/2005 1:34:18 AM
So XSLT will understand all HTML tags? I'm assuming this will be the
case with .html and .aspx files etc, and then with .xml files it will
treat them as unknown tags?
Mike P
1/27/2005 4:04:22 AM
Does anybody have an example of transforming HTML with XSLT?


Cheers,

Mike
Bruce Wood
1/27/2005 10:12:15 AM
XSLT doesn't "understand" any tags. It's just a pattern-matching
engine: "When you see this tag, generate this stuff." Its only
requirement is that the "XML" coming in must be well-formed:

o Every element must either be self-closing, as in <br />, or must have
a corresponding closing tag.
o Element tags must be nested to form a hierarchy.
o All attribute values must be quoted, with either single or double quotes.
o Attribute values must not contain a literal < or &, or the quote
character that delimits the value; write these as &lt;, &amp;, &quot;
or &apos; instead.

If the incoming XML meets these few criteria, XSLT will accept it and
allow you to do pattern matching on it. XHTML is simply HTML that
follows the above rules. You can read more about XHTML at:

http://www.w3.org/MarkUp/2004/xhtml-faq

or just Google for XHTML.

Again, XSLT doesn't "understand" any tags, any more than a text editor
"understands" English above the level of knowing how to recognize what
constitutes a word. XSLT knows what an element looks like, what an
attribute looks like, and what is just text. The rest is up to you. :)
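
The "when you see this tag, generate this stuff" idea can be sketched in a few lines. This is not XSLT and not the .NET engine, just a toy analogy in Python using the standard xml.etree.ElementTree module. The RULES table and the transform function are hypothetical names, and the <gadget> element is deliberately made up to show that the element names carry no built-in meaning.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: element name seen -> element name to generate.
RULES = {"b": "strong", "i": "em", "gadget": "span"}

def transform(node: ET.Element) -> ET.Element:
    """Copy the tree, renaming any element whose name matches a rule."""
    out = ET.Element(RULES.get(node.tag, node.tag), dict(node.attrib))
    out.text, out.tail = node.text, node.tail
    for child in node:
        out.append(transform(child))
    return out

source = ET.fromstring(
    '<p class="x">Some <b>bold</b> and <gadget>made-up</gadget> text</p>'
)
result = ET.tostring(transform(source), encoding="unicode")
print(result)
# <p class="x">Some <strong>bold</strong> and <span>made-up</span> text</p>
```

The made-up <gadget> element is handled exactly like <b>: all the code needs is well-formed input and a rule that matches the name.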