shannonlowder.com

Parsing and Extracting Web Data

Posted on October 4, 2017December 20, 2022 by slowder

In the last article, I laid out the architecture for dealing with this type of data source. This time, we're going to get into the basics of parsing data.

A Few New Technologies

Before we dive into the details, let's cover the technologies this solution rests on. First, there's C#. As a Microsoft data professional, you live and breathe T-SQL. I'm sure you're hearing more and more about data science, and one of the three aspects of data science is programming. If you've been using T-SQL for a while, you most likely understand many basic programming concepts already. From that understanding, push yourself to learn a new programming language. It's not impossible. I chose to learn C# in order to build script tasks in SSIS. Later, I took that knowledge and learned Biml. With C# and Biml, I was able to automate a lot of the tasks in building data warehouses.

In my C# code, I use the Html Agility Pack (HAP) assembly to navigate around the HTML files after staging them to my local folder. That navigation is driven by XPath and LINQ.
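As a minimal sketch (assuming the HtmlAgilityPack NuGet package is installed; the folder and file name here are hypothetical), loading one of those staged files looks like this:

```csharp
using HtmlAgilityPack;

// Load a staged HTML file from the local folder.
// The path is a placeholder for wherever you staged your pages.
var doc = new HtmlDocument();
doc.Load(@"C:\Staging\staged_page.html");

// From here, doc.DocumentNode is the root you run XPath queries against.
```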

XPath is a query language that lets you select specific nodes (or tags) from an XML document. Fortunately for us, HTML is close enough to XML that the same approach works. In fact, most of the time when you see an acronym ending in ML, the ML stands for Markup Language. XPath lets you construct a query to select specific tags within the HTML document. Let's create a very simple HTML document.


<html> 
   <body> 
     <table> 
       <tr> 
         <th>Name</th> 
         <th>Address</th> 
         <th>Phone</th> 
       </tr> 
       <tr> 
         <td>John Smith</td> 
         <td>123 Maple Street</td> 
         <td>704.123.4567</td> 
      </tr> 
      <tr> 
        <td>Tom Jones</td> 
        <td>753 Evergreen Terrace</td> 
        <td>704.987.6543</td> 
      </tr> 
    </table>
  </body>
</html>

Let's say we wanted the table. We could use the XPath /html/body/table to retrieve it. We can also use XPath to refer to a collection. Say we wanted all the rows: the XPath /html/body/table/tr returns a collection of three rows. Notice that XPath looks a lot like a Linux or Windows folder path. That's the idea behind XPath!

There are a few extra points worth calling out. First, XPath is case sensitive, so if I had tried /html/body/table/TR, I would find no nodes.

Second, you can use shorthand in your XPath queries. The double slash matches nodes anywhere in the document, so //body/table/tr would get you to the same place /html/body/table/tr did.

Third, you can refer to specific instances in an XPath too. Say you only wanted the first row, the one with the headers: simply use the XPath /html/body/table/tr[1]. A small warning about the square-bracket predicates: counters like the one in my previous example start at 1 instead of 0. I try to call them counters rather than indexes, since indexes start at 0.

Lastly, you can write XPath that references nodes by their attributes. This helps you write smaller XPath queries to get your data. You could use //div[@class='bgblue'] to select all divs with a blue background. You could also select a node by its id: //*[@id='content'] would select the element on the page with the id "content".
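To tie those points together, here's a hedged sketch of running the queries above against the sample table through HAP (variable names are mine, not from a real project):

```csharp
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(
    "<html><body><table>" +
    "<tr><th>Name</th><th>Address</th><th>Phone</th></tr>" +
    "<tr><td>John Smith</td><td>123 Maple Street</td><td>704.123.4567</td></tr>" +
    "<tr><td>Tom Jones</td><td>753 Evergreen Terrace</td><td>704.987.6543</td></tr>" +
    "</table></body></html>");

// Select the whole table.
HtmlNode table = doc.DocumentNode.SelectSingleNode("/html/body/table");

// Select all the rows. SelectNodes returns null when nothing matches,
// which is also what a case mismatch like .../TR gets you.
HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("/html/body/table/tr");

// The first row (the headers). Remember: the predicate counts from 1.
HtmlNode headerRow = doc.DocumentNode.SelectSingleNode("/html/body/table/tr[1]");
```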

The last technology I'll use in this solution is LINQ. LINQ (Language Integrated Query) is a .NET feature that allows us to query objects inside a program. The query language is very similar to SQL, but not exactly the same, so it will take a little time to get used to. I will say that learning it will be very beneficial to you in moving from a SQL Server professional to a data science professional.

Let's take our previous example where we wanted all the rows. We used the XPath //body/table/tr. When we're parsing data, we often skip the header row. Wouldn't it be nice to skip that header row here too? With LINQ, we can use the Skip(1) method and skip right over that row in processing. There are many other functions that will feel familiar to you as a Microsoft data professional. One of the better references I found was "Why LINQ beats SQL" from the creator of LINQPad. While I disagree with his conclusion, the guide is good at laying out a SQL statement and then the equivalent LINQ right after it.
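As a small sketch using the sample table from earlier (the markup is inlined so the snippet stands alone), Skip(1) drops the header row before we process the data rows:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(
    "<html><body><table>" +
    "<tr><th>Name</th><th>Address</th><th>Phone</th></tr>" +
    "<tr><td>John Smith</td><td>123 Maple Street</td><td>704.123.4567</td></tr>" +
    "<tr><td>Tom Jones</td><td>753 Evergreen Terrace</td><td>704.987.6543</td></tr>" +
    "</table></body></html>");

// Grab every row, then use LINQ's Skip(1) to pass over the header row.
var dataRows = doc.DocumentNode
    .SelectNodes("//body/table/tr")
    .Skip(1);

foreach (var row in dataRows)
{
    // InnerText flattens each row's cell text for a quick look.
    Console.WriteLine(row.InnerText.Trim());
}
```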

I'll also mention that I use LINQPad to quickly check my C# code before putting it into an SSIS script task, or into a full Visual Studio project.

If you're not familiar with these technologies, spend some time researching them and getting to know them. They're turning up in more and more of the solutions I'm delivering, and trying to avoid them would have cost a lot more time in the long run! Next time, we're going to dive into the file staging loop. I'll show you the two cases you'll find most often when you start staging those pages. I'll also show you how to set up parallelism to pull those files down more quickly!

In the meantime, if you have any questions, send them in! I'm here to help.

