Parsing XHTML documents with .NET 4.0 and XmlPreloadedResolver

When I looked at "What's new in System.Xml in .NET 4.0/Visual Studio 2010" with the beta 1 release, I presented an example showing how parsing an XHTML document that references one of the W3C XHTML 1.0 DTDs can be sped up by setting the new XmlReaderSettings.DtdProcessing property to DtdProcessing.Ignore. The drawback I mentioned is that any entity reference in the document (other than the five predefined XML entities) then throws an exception, because the DTD declaring the entity is never read.
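To make that drawback concrete, here is a minimal sketch of what happens with DtdProcessing.Ignore when the document contains an entity such as &euro;. The class name DtdIgnoreSketch and the helper TryParse are made up for this illustration; only XmlReaderSettings, DtdProcessing and XmlReader come from the framework.

```csharp
using System;
using System.IO;
using System.Xml;

public static class DtdIgnoreSketch
{
    // Hypothetical helper: parses the markup with DtdProcessing.Ignore and
    // returns the XmlException message if parsing fails, or null on success.
    public static string TryParse(string xml)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        // Skip the DOCTYPE declaration entirely; no DTD is loaded.
        settings.DtdProcessing = DtdProcessing.Ignore;

        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
            {
                while (reader.Read())
                {
                    // Just walk the document; an undeclared entity fails here.
                }
            }
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;
        }
    }

    public static void Main()
    {
        string xhtml = @"<!DOCTYPE html
PUBLIC ""-//W3C//DTD XHTML 1.0 Strict//EN""
""http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"">
<html><body><p>Price is: 100 &euro;</p></body></html>";

        // &euro; is declared only in the DTD, which was ignored.
        Console.WriteLine(TryParse(xhtml) ?? "parsed without error");
    }
}
```

The parse is fast because nothing is fetched, but the reader has no declaration for euro and so reports an error instead of returning the text.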

What I overlooked at the time of the beta 1 release, but have now found in the recent beta 2 release, is the new class XmlPreloadedResolver in the System.Xml.Resolvers namespace. It lets you avoid any network access to the W3C's server for the XHTML DTDs and still parse XHTML documents containing entity references, because it uses copies of those DTDs stored in an assembly deployed with the .NET Framework.

If I use that class in an adaptation of the older example, the code looks as follows:

    // Requires: using System; using System.Diagnostics; using System.IO;
    //           using System.Xml; using System.Xml.Resolvers;
    Stopwatch watch = new Stopwatch();
    XmlReaderSettings settings = new XmlReaderSettings();
    // The DTD must be parsed so that entity references such as &euro; resolve.
    settings.DtdProcessing = DtdProcessing.Parse;

    string xhtml = @"<!DOCTYPE html
    PUBLIC ""-//W3C//DTD XHTML 1.0 Strict//EN""
    ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"">
    <html xml:lang=""en"">
    <head>
    <title>Example</title>
    </head>
    <body>
    <p>Price is: 100 &euro;</p>
    </body>
    </html>";

    // First parse: no XmlResolver is set, so in .NET 4.0 the default resolver
    // fetches the DTD from the W3C server.
    watch.Start();
    using (XmlReader reader = XmlReader.Create(new StringReader(xhtml), settings))
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Text)
            {
                Console.WriteLine(reader.Value);
            }
        }
    }
    watch.Stop();
    Console.WriteLine("First parse: elapsed time: {0}", watch.Elapsed);

    watch.Reset();

    // Second parse: the XHTML 1.0 DTDs come from the copies shipped with the framework.
    settings.XmlResolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10);

    watch.Start();
    using (XmlReader reader = XmlReader.Create(new StringReader(xhtml), settings))
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Text)
            {
                Console.WriteLine(reader.Value);
            }
        }
    }
    watch.Stop();
    Console.WriteLine("Second parse: elapsed time: {0}", watch.Elapsed);

Running that code here with Visual Studio 2010 Beta 2 in a virtual machine produces timings (shown here without the text node output) that clearly show the speed gained by parsing with the XmlPreloadedResolver:

First parse: elapsed time: 00:00:04.6378648
Second parse: elapsed time: 00:00:00.0441933
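
To check which resources the resolver can serve without going to the network, its PreloadedUris property can be enumerated. The following is only a sketch: the class name PreloadedUrisSketch and the helper KnownXhtml10Uris are invented for this illustration, and the exact set of entries listed may depend on the framework version.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Resolvers;

public static class PreloadedUrisSketch
{
    // Hypothetical helper: collects the identifiers that a resolver
    // preloaded with the XHTML 1.0 DTDs knows about.
    public static List<string> KnownXhtml10Uris()
    {
        XmlPreloadedResolver resolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10);
        List<string> uris = new List<string>();
        foreach (Uri uri in resolver.PreloadedUris)
        {
            // Entries can be relative (public identifiers) or absolute
            // (system identifiers), so use OriginalString for both.
            uris.Add(uri.OriginalString);
        }
        return uris;
    }

    public static void Main()
    {
        foreach (string uri in KnownXhtml10Uris())
        {
            Console.WriteLine(uri);
        }
    }
}
```

A document whose DOCTYPE uses one of the listed identifiers should parse without network access; for anything else a resolver created this way has no fallback, so it is worth checking the list before relying on it.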

Published Sun, Nov 8 2009 18:56 by Martin Honnen