Parse HTML Pages to Extract Data
Use the SgmlReader class to parse HTML documents and even generate well-formed HTML.
by Dan Wahlin
Posted December 18, 2002
Parsing HTML documents to extract data isn't the easiest task to accomplish using .NET Framework classes. Although classes such as StreamReader let you read files line by line, the API exposed by the XmlReader won't work "out of the box" because HTML is frequently not well-formed XML: unclosed tags and unquoted attribute values are common. You can use regular expressions, but unless you're comfortable writing them you might find them somewhat difficult at first.
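The well-formedness problem is easy to reproduce with the stock XML classes. The following is a minimal sketch rather than production code: the class name WellFormedCheck, the method IsWellFormed, and the sample markup are all invented for illustration. It feeds two small fragments to an XmlTextReader and reports whether each could be read to the end.

```csharp
using System;
using System.IO;
using System.Xml;

class WellFormedCheck
{
    // Hypothetical helper: returns true when the markup can be read
    // from start to finish by an XmlTextReader without an XmlException.
    public static bool IsWellFormed(string markup)
    {
        XmlTextReader xr = new XmlTextReader(new StringReader(markup));
        try
        {
            while (xr.Read()) { }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
        finally
        {
            xr.Close();
        }
    }

    static void Main()
    {
        // <br> is never closed: acceptable HTML, but not well-formed XML.
        Console.WriteLine(IsWellFormed("<p>one<br>two</p>"));   // False
        // The same fragment with the element closed reads cleanly.
        Console.WriteLine(IsWellFormed("<p>one<br/>two</p>"));  // True
    }
}
```

The first fragment fails because the reader meets the closing p tag while the br element is still open, which is exactly the kind of markup real pages contain.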
Microsoft's XML guru Chris Lovett recently released a new SGML parser named SgmlReader on the http://www.gotdotnet.com Web site that can parse HTML documents and even convert them into a well-formed structure. SgmlReader derives from XmlReader, which means you can parse HTML in the same way you parse XML documents using classes such as the XmlTextReader. I'll provide an introduction to how you can use the SgmlReader class to parse HTML documents and even generate well-formed HTML so that you can use XPath statements to access data.
Create an SgmlReader Instance to Parse HTML
To begin using SgmlReader, download it from gotdotnet.com and place the assembly into your application's bin folder. Once the assembly is ready to use, write code to retrieve the HTML you want to parse. For this example, the HttpWebRequest and HttpWebResponse objects are used to access a remote HTML document:
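// Requires: using System.IO; using System.Net;
// uri holds the address of the HTML page to retrieve.
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(uri);
HttpWebResponse res = (HttpWebResponse)req.GetResponse();
StreamReader sReader = new StreamReader(res.GetResponseStream());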
After grabbing the remote HTML document, you can create an instance of the SgmlReader class. Let the reader know you're working with HTML by setting its DocType property to a value of "HTML":
SgmlReader reader = new SgmlReader();
reader.DocType = "HTML";
Load the HTML into the SgmlReader instance through its InputStream property, which expects a TextReader. Here the response is read into a string and wrapped in a StringReader (a TextReader subclass) before being assigned to InputStream:
reader.InputStream = new StringReader(sReader.ReadToEnd());
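Because SgmlReader derives from XmlReader, the code that consumes the parsed document can be written against the base class. The following is a minimal sketch, not code from the SgmlReader distribution: LinkCollector and CollectLinks are hypothetical names, the sample markup is invented, and a stock XmlTextReader stands in for the SgmlReader so the listing runs without the extra assembly. Under the assumption that the reader configured above behaves like any other XmlReader, passing it to CollectLinks should work the same way.

```csharp
using System;
using System.Collections;
using System.IO;
using System.Xml;

class LinkCollector
{
    // Hypothetical helper: walks any XmlReader and gathers the href
    // value of every <a> element it encounters.
    public static ArrayList CollectLinks(XmlReader reader)
    {
        ArrayList links = new ArrayList();
        while (reader.Read())
        {
            // Compare names case-insensitively, since HTML tag names
            // may appear as "a" or "A".
            if (reader.NodeType == XmlNodeType.Element &&
                string.Compare(reader.Name, "a", true) == 0)
            {
                string href = reader.GetAttribute("href");
                if (href != null) links.Add(href);
            }
        }
        return links;
    }

    static void Main()
    {
        // Stand-in input: well-formed markup read by XmlTextReader.
        string markup =
            "<html><body><a href=\"/one\">1</a>" +
            "<p><a href=\"/two\">2</a></p></body></html>";
        XmlReader reader = new XmlTextReader(new StringReader(markup));
        foreach (string href in CollectLinks(reader))
            Console.WriteLine(href);
        reader.Close();
    }
}
```

Writing the helper against XmlReader rather than a concrete class keeps the extraction logic independent of where the markup comes from.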