Easily Parse HTML Documents in C#

So, you are building a C# application and need to parse a web page’s HTML. You could use regular expressions, but it seems more efficient to use a DOM-based approach. What if you could even take advantage of the power of XPath?

.NET contains an HtmlDocument class, along with HtmlElement, in System.Windows.Forms, which might look promising. It does provide basic DOM methods like GetElementById and GetElementsByTagName. However, if you try to create an HtmlDocument object, you will soon notice that it has no public constructor. It is actually a wrapper around an unmanaged class, and the only way to get an instance is through the WebBrowser control. Quite slow and annoying… So, what are the other solutions?

XmlDocument and XmlNode are an interesting solution if you have well-formed XML or XHTML. If you are retrieving content from the web, you will need another library that checks the markup and corrects it if needed. You may want to try something like Tidy or SGMLReader. Then you can create an XmlDocument and use its methods to parse and manipulate the nodes.
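For markup that is already well-formed, the XmlDocument route could look like the sketch below. The XHTML fragment and the XhtmlLinkLister class are made up for illustration; only System.Xml from the base class library is used.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XhtmlLinkLister
{
    // Returns the href of every <a> element in a well-formed XHTML fragment.
    // XmlDocument.LoadXml throws an XmlException if the markup is not well-formed.
    public static List<string> ListHrefs(string xhtml)
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(xhtml);

        List<string> hrefs = new List<string>();
        foreach (XmlNode link in document.SelectNodes("//a[@href]"))
        {
            hrefs.Add(link.Attributes["href"].Value);
        }
        return hrefs;
    }

    public static void Main()
    {
        // Hypothetical sample input without an XML namespace declaration.
        string xhtml = "<div><p><a href=\"http://example.com/\">Example</a></p>" +
                       "<p><a href=\"/local\">Local</a></p></div>";

        foreach (string href in ListHrefs(xhtml))
            Console.WriteLine(href);
    }
}
```

Note that real XHTML pages usually declare the http://www.w3.org/1999/xhtml namespace, in which case the XPath query needs an XmlNamespaceManager and a prefix. And a single unclosed tag makes LoadXml throw, which is exactly why a clean-up library is needed for pages found in the wild.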

HtmlAgilityPack

Another solution, and the one I now use every time I need to parse HTML, is the free and open source HtmlAgilityPack library. It provides HtmlDocument and HtmlNode classes, which are quite similar to .NET’s XmlDocument and XmlNode classes. You can load the HTML from a file, a URL or a string. There is no need to check the markup validity first, as HtmlAgilityPack takes care of closing unclosed tags and fixing other markup errors. Once the document is loaded, you can start having fun parsing through the nodes!
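The three ways of loading a document might be sketched as follows. The LoadingExamples class, its method names and the sample markup are assumptions made for illustration; the library calls used are HtmlDocument.LoadHtml, HtmlDocument.Load and HtmlWeb.Load.

```csharp
using System;
using HtmlAgilityPack;

public static class LoadingExamples
{
    // Parses markup that is already held in a string.
    public static HtmlDocument FromString(string html)
    {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);
        return document;
    }

    // Parses a file on disk (the path is supplied by the caller).
    public static HtmlDocument FromFile(string path)
    {
        HtmlDocument document = new HtmlDocument();
        document.Load(path);
        return document;
    }

    // Downloads and parses a page over HTTP.
    public static HtmlDocument FromUrl(string url)
    {
        HtmlWeb web = new HtmlWeb();
        return web.Load(url);
    }

    public static void Main()
    {
        // Deliberately sloppy markup: the <p> and <b> tags are never closed.
        HtmlDocument document = FromString("<div id=\"box\"><p>First<p>Second <b>bold</div>");

        // The tree can still be walked despite the missing closing tags.
        foreach (HtmlNode paragraph in document.DocumentNode.Descendants("p"))
            Console.WriteLine(paragraph.InnerText);

        // ParseErrors lists the problems the parser noticed while loading.
        foreach (HtmlParseError error in document.ParseErrors)
            Console.WriteLine(error.Code + " at line " + error.Line);
    }
}
```

How the parser nests the repaired elements depends on the library version, so it is worth printing the resulting tree for your own input rather than assuming a particular shape.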

Basic Parsing

The HtmlDocument object provides a GetElementbyId method (note the lowercase “b”) that lets you target a specific node by its id. You can use properties such as ChildNodes, FirstChild, NextSibling and ParentNode to navigate through the nodes. You can also use the Ancestors and Descendants methods to get a list of all the ancestors or descendants of a node, respectively. Optionally, a node name can be given to retrieve only the nodes with that name. Use the Attributes property to access a node’s attributes.
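To make these navigation members concrete, here is a small sketch that works on an in-memory string. The markup, the NavigationExample class and the PathToRoot helper are hypothetical; the sample is written without whitespace between tags because whitespace would otherwise show up as text nodes among the children.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class NavigationExample
{
    // Joins the names of all ancestors of a node, nearest first.
    public static string PathToRoot(HtmlNode node)
    {
        return string.Join(" > ", node.Ancestors().Select(n => n.Name));
    }

    public static void Main()
    {
        string html = "<div id=\"menu\"><ul><li><a href=\"/home\">Home</a></li>" +
                      "<li><a href=\"/about\">About</a></li></ul></div>";

        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNode menu = document.GetElementbyId("menu");
        HtmlNode list = menu.FirstChild;              // the <ul>
        HtmlNode firstItem = list.FirstChild;         // the first <li>
        HtmlNode secondItem = firstItem.NextSibling;  // the second <li>

        Console.WriteLine(list.Name);
        Console.WriteLine(list.ChildNodes.Count);
        Console.WriteLine(secondItem.ParentNode.Name);

        // Ancestors walks up toward the document root; Descendants walks down.
        Console.WriteLine(PathToRoot(firstItem));
        Console.WriteLine(menu.Descendants("a").Count());

        // GetAttributeValue returns the fallback when the attribute is absent.
        HtmlNode link = secondItem.FirstChild;
        Console.WriteLine(link.GetAttributeValue("href", "(none)"));
        Console.WriteLine(link.GetAttributeValue("title", "(none)"));
    }
}
```

GetAttributeValue is often more convenient than indexing into Attributes, because it avoids a separate check for whether the attribute exists.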

Here is a simple example that retrieves a web page and lists all the external links within a given node specified by its Id:

// Namespaces used by this snippet
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// The HtmlWeb class is a utility class to get the HTML over HTTP
HtmlWeb htmlWeb = new HtmlWeb();

// Creates an HtmlDocument object from a URL
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("http://www.somewebsite.com");

// Targets a specific node
HtmlNode someNode = document.GetElementbyId("mynode");

// If there is no node with that Id, someNode will be null
if (someNode != null)
{
  // Extracts all links within that node
  IEnumerable<HtmlNode> allLinks = someNode.Descendants("a");

  // Outputs the href for external links
  foreach (HtmlNode link in allLinks)
  {
    // Checks whether the link contains an HREF attribute
    if (link.Attributes.Contains("href"))
    {
      // Simple check: if the href begins with "http://", prints it out
      if (link.Attributes["href"].Value.StartsWith("http://"))
        Console.WriteLine(link.Attributes["href"].Value);
    }
  }
}
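The "http://" prefix test above is intentionally simple: it misses https links and counts absolute links back to the same site as external. A slightly stricter check, sketched here with System.Uri only, might compare hosts instead. The LinkClassifier class, the IsExternal helper and the sample values are assumptions made for illustration.

```csharp
using System;

public static class LinkClassifier
{
    // Returns true when href is an absolute http or https URL whose host
    // differs from pageHost. Relative links, mailto: links and similar
    // are treated as not external.
    public static bool IsExternal(string href, string pageHost)
    {
        Uri uri;
        if (!Uri.TryCreate(href, UriKind.Absolute, out uri))
            return false;

        bool isWeb = uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps;
        return isWeb && !string.Equals(uri.Host, pageHost, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        string pageHost = "www.somewebsite.com";
        string[] samples =
        {
            "https://example.org/page",
            "http://www.somewebsite.com/about",
            "/contact",
            "mailto:someone@example.org"
        };

        foreach (string href in samples)
            Console.WriteLine(href + " -> " + IsExternal(href, pageHost));
    }
}
```

Inside the loop from the example, the StartsWith test could then be swapped for a call to such a helper. Protocol-relative links (those starting with "//") are not recognised by this sketch and would need to be resolved against the page URL first.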

Using XPath

As I mentioned above, HtmlAgilityPack supports XPath. If you don’t know XPath, I really suggest you take some time to learn it. It is quite simple, yet powerful. The HtmlNode class provides two methods to retrieve nodes matching an XPath expression: SelectSingleNode and SelectNodes. The first returns only one node (the first one matching) and the latter returns all matching nodes.

Here is almost the same example as above, but using XPath instead. Load the HtmlDocument object the same way and then:

// Targets a specific node
HtmlNode someNode = document.DocumentNode.SelectSingleNode("//*[@id='mynode']");

// If there is no node with that Id, someNode will be null
if (someNode != null)
{
    // Extracts all links within that node
    // Note the leading dot (.) to make it look relative to the current node instead of the whole document
    HtmlNodeCollection allLinks = someNode.SelectNodes(".//a");

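The rest is the same, with one caveat: SelectNodes returns null rather than an empty collection when nothing matches, so check allLinks for null before looping over it.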

But that code is not any shorter or simpler than the previous one! It might even seem more complicated with that XPath syntax. That’s right, but here comes the power of XPath: both expressions can be combined into a single one that does everything at once. Here is the new code, to be placed after loading the HtmlDocument object as above:

// Extracts all links under a specific node that have an href that begins with "http://"
HtmlNodeCollection allLinks = document.DocumentNode.SelectNodes("//*[@id='mynode']//a[starts-with(@href,'http://')]");

// Outputs the href for external links
// SelectNodes returns null when nothing matches, so guard the loop
if (allLinks != null)
{
    foreach (HtmlNode link in allLinks)
        Console.WriteLine(link.Attributes["href"].Value);
}

Simple enough? Only the XPath part might be a bit hard to understand if you are new to it, but you will get used to it and eventually read it easily. This example is quite simple, but there is a lot more you can do using XPath to parse through nodes.
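As a taste of what else XPath can express, here are a few more queries run against a made-up fragment. The XPathSamples class and its SelectOrEmpty helper are hypothetical; the helper exists only because SelectNodes returns null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class XPathSamples
{
    // Substitutes an empty sequence for the null that SelectNodes returns
    // when nothing matches, so callers can loop or count safely.
    public static IEnumerable<HtmlNode> SelectOrEmpty(HtmlNode root, string xpath)
    {
        HtmlNodeCollection nodes = root.SelectNodes(xpath);
        if (nodes == null)
            return Enumerable.Empty<HtmlNode>();
        return nodes;
    }

    public static void Main()
    {
        string html =
            "<ul id=\"nav\"><li><a href=\"http://example.org/\">One</a></li>" +
            "<li><a href=\"/two\" class=\"internal\">Two</a></li><li>Three</li></ul>" +
            "<img src=\"a.png\"><img src=\"b.png\" alt=\"B\">";

        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);
        HtmlNode root = document.DocumentNode;

        string[] queries =
        {
            "//ul[@id='nav']/li[position() <= 2]",  // first two list items
            "//li[not(a)]",                          // list items without a link
            "//a[contains(@href, 'example')]",       // href contains a substring
            "//img[not(@alt)]",                      // images lacking alt text
            "//table"                                // matches nothing here
        };

        foreach (string query in queries)
            Console.WriteLine(query + " : " + SelectOrEmpty(root, query).Count());
    }
}
```

Each query is plain XPath 1.0, so any XPath reference applies; only the null-on-no-match behaviour is specific to HtmlAgilityPack.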

I hope this short introduction to HtmlAgilityPack helps you get started with this really nice library and helps you with your projects!

Posted in C# by Olivier at March 30th, 2010.
  • Wxiaod1984

    good

  • Jacob Coens

    Thanks for the article, I’ve found HtmlAgilityPack to be very useful!

  • http://blogs.microsoft.co.il/blogs/shimmy Shimm

    Hi!
    I’m using HtmlAgilityPack to parse my HTML docs. However, I am looking for a way to render the HTML doc I’m parsing and highlight a specific element’s content via XPATH or whatever other available way.
    How could that be achieved?

  • http://www.harigeek.com/ Harish

     Great .. i was searching for the same.. HtmlAgilityPack helped me a lot … thanks for sharing this article …:)

  • Shadi_shark

    are there method getElementsByTagName like this   document.getElementsByTagName(“div”) ???thanks

  • Anon

    Thanks! This is pretty much what I tried to build myself and wasted a lot of time doing so. Thanks both for the developer and the one who wrote this article about it, really saved my day!

  • Anon

    Thanks for the help. I wasted my time on regex but you gave a very simple solution!!!

  • Sagar

    Hi…….. Can anyone suggest me that which method is best.

  • Truongthanhnam

    I have a small project and want you help me to solve with paymentpls call me on skype nam.truongthanh

  • http://twitter.com/craastad Chris Raastad

    Great post! Very concise and useful! Thanks!

  • Xmen_eos

    thanks, it helps a lot ..

  • Samina

    Asslam u Alikum wr.wb
    I’m new to HTML Agility Pack, i did’t understand that what to pass at the place of “mynode” in getElementById() function. Kindly guide me.

  • googi

    can i use this for C++ .NET?

  • vishal

    It is really helpful article..many many thanks to you