Parsing and Extracting Web Data

In the last article, I laid out the architecture to dealing with this type of data source.  This time, we’re going to get in to the basics of parsing data.

A Few New Technologies

Before we dive into details, let’s cover the technologies this solution rests on.  First, there’s C#. As a Microsoft data professional, you live and breathe T-SQL.  I’m sure you are hearing more and more about data science.  One of the three aspects of data science is programming.  If you’ve been using T-SQL for a while you most likely have an understanding of many basic programming concepts.  From that understanding, push yourself to learn a new programming language.  It’s not impossible.  I chose to learn C# in order to build out script tasks in SSIS.  Later, I took that knowledge and learned Biml.  With C# and Biml, I was able to automate a lot of the tasks in building data warehouses.

In my C# code, I will use the HTMLAgilityPack (HAP) assembly to navigate around the HTML files after staging them to my local folder. This navigation is enabled by using XPath and Linq.

XPath is a query language that lets you select specific nodes (or tags) from an XML document.  Fortunately for us, HTML is XML. In fact, most of the time if you see an acronym ending with ML, the ML stands for Markup Language.  XPath allows you to construct a query to select specific tags within the HTML document. Let’s create a very simple HTML document.


         <td>John Smith</td> 
         <td>123 Maple Street</td> 
        <td>Tom Jones</td> 
        <td>753 Evergreen Terrace</td> 

Let’s say we wanted the table.  We could use the XPath /html/body/table to retrieve it. We can also use XPath to refer to a collection.  Let’s say we wanted all the rows. We would use the XPath /html/body/table/tr. We would get a collection of three rows.  Notice the XPath looks a lot like a Linux or windows folder path.  That’s the idea of XPath!

I would like to point out a couple of extra points.  First, XPath is case sensitive.  So if I had tried to use /html/body/table/TR, I would find no nodes.

Second, you can use “short hand” in your XPath queries.  //body/table/tr would get you to the same place /html/body/table/tr did.

Third, you can refer to specific instances of an XPath too. Say you only wanted to see the first row, the one with the headers.  Simply use the XPath /html/body/table/tr[1]. A small warning about the square bracket predicates.  Counters like my previous example start at 1, instead of 0.  I try to refer to them as counters rather than indexes, since indexes start at 0.

Lastly, XPath can be created that references nodes by attributes.This helps you write smaller XPath queries to get your data. You could add //*div[@class=’bgblue’] to select all divs with a blue background. You could also select a node by it’s id: //*[@id=’content’] would select an element on the page with the id “content”.

The last technology I’ll use in this solution is Linq.  Linq is a .Net component that allows us to query our objects inside a program. This query language is very similar to SQL, but not exactly the same.  It will take you a little time to get used to using it. I will say that running late will be very beneficial to you and moving from just a SQL server professional today to data science professional.

Let’s take our previous example where we wanted all the rows.  We used the XPath //body/table/tr.  When we are parsing data, we often skip the header row.  Wouldn’t it be nice to skip that header row here too?  With Linq, we can use the Skip(1) command, and skip right over that row in processing.  There are many other functions that will be familiar to you as a Microsoft data professional.  One of the better references I found was Why Linq beats SQL from the guy who created LinqPad.  While I disagree with his conclusion, the guide is pretty good at laying out a SQL statement, and then the equivalent Linq right after it.

I would like to mention that I do use LinqPad to quickly check my C# code before putting it into a SSIS script task, or into a full Visual Studio project.

If you’re not familiar with these technologies, spend some time researching them and getting to know them.  These technologies are turning up in more and more of the solutions I’m delivering.  Trying to avoid them would have costed a lot more time in the long run!  Next time, we’re going to dive into the file staging loop. I’ll show you two different cases you’ll find most often when you start staging those pages.  I’ll also show you how to set up parallelism to pull those files down more quickly!

In the mean time, if you have any questions, send them in! I’m here to help.


By Shannon Lowder

Shannon Lowder is the Database Engineer you've been looking for! Look no further for expertise in: Business Analysis to gather the business requirements for the database; Database Architecting to design the logical design of the database; Database Development to actually build the objects needed by the business logic; finally, Database Administration to keep the database running in top form, and making sure there is a disaster recovery plan.

1 comment

Leave a comment

Your email address will not be published. Required fields are marked *