Azure Data Lake, Step by Step

Posted on October 18, 2017 (updated November 14, 2022) by slowder

Over the next few blog posts in this series, I’m going to share the story of how a data lake project comes together. As I tell this story, I’ll keep pointing back to traditional ETL work and to automation techniques. Not all of these will include Biml. I hope to help you see this as simply the next step in our jobs as data platform people.

Greenfield Project

This data lake project is a greenfield project. There was no prior data warehouse; in fact, there was no prior business before this project. The project was conceived as a new way of tackling a social problem. Many non-profit and government agencies are doing their best to address this issue with the tools they have. The project began with a simple question: “Is there any cooperation between these agencies?” The answer was, “Not really.”

One of the organizing members suggested sharing research and operational data in such a way that all agencies could learn from successes and failures collectively rather than individually. The challenge was that all of these agencies collect their data in very different ways. They absolutely had a variety problem, and possibly a veracity problem as well.

After the green-fielding was complete, I connected with this group to help them bring this vision to life. We talked through several technologies in the Microsoft stack, and how each might or might not help them reach their end goal of combining and collating data to help solve a social problem. We talked through Azure Active Directory to control access to sensitive data. We talked through Azure Data Lake Storage and Analytics. We talked about HDInsight, Machine Learning, R Services, and Power BI. We even discussed non-Microsoft tools like Tableau and Qlik.

At the end of this conversation, I started building a project plan to help them move from that green field to a functional solution. This plan began with laying out the zones of their data lake.

Azure Data Lake Zone Layout

As much as I’m going to show how similar data lake work is to traditional data warehousing, I do have to point out some differences. In data warehousing, once we start staging, transforming, and loading data, we’re talking about tables in databases. In data lakes, we don’t have tables; we have folders of files. Folders and files with a specific purpose are grouped together into zones. The most common zones are equivalent to the familiar sets of tables in data warehousing.

Raw Zone

Our first zone is the raw zone. This zone serves as the landing point for source files. Like the extract (or stage) schema in a data warehouse, we want these files to match the source system as closely as possible. In the data lake, we actually go one step further: not only should the schema of our raw files match the source system, the files themselves should be immutable.

Immutable means that once files are written to the raw folder, we shouldn’t be able to modify or delete them. That way, we can always reconstruct different states from these files without having to retrieve them from the source system again.

In this system, we set up a folder under the root of our data lake and named it Raw. Under that folder, we create one folder for each tenant agency that will be sharing data with us. The reason for these child folders is that we need to secure the incoming data so that no tenant can read raw files until they have been sterilized of any Personally Identifiable Information (PII). We also want to prevent Tenant B from directly seeing Tenant A’s raw data; tenants should only see the standardized data.

To make files easier to identify, we add a folder under each tenant’s Raw folder for each file type they contribute. So if Tenant A had three types of files, one for household data, one for school system data, and one for medical visits, there would be three folders under their tenant folder:

\Raw
    \TenantA
        \Household
        \SchoolSystem
        \MedicalVisit  

Currently, we are discussing whether or not to include additional child folders under each file type folder to provide date partitioning. Under each file type folder, you would find a year folder in “yyyy” format. Each year folder would include monthly folders in “mm” format, and each month folder would have a folder for each day that data is provided.
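
As an illustration of that layout (assuming daily drops; the date is just an example), a household file landed on October 18, 2017 would sit under a path like this:

\Raw
    \TenantA
        \Household
            \2017
                \10
                    \18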

Staged Zone

To allow tenants to see only sterilized versions of the files they contribute to the data lake, a second zone was needed. This zone is the first place a tenant can see its own files in a sterilized format. That way, no one can get PII through the data lake. The plan is to build a set of functions in U-SQL or Azure Data Factory to process new files from the Raw zone to the Staged zone.

The Staged folder structure would match that of its source Raw folder, so you can easily see where each staged file came from. Looking at the folder structure in the Raw example, you would see the following structure in the Staged folder:

\Staged
    \TenantA
        \Household
        \SchoolSystem
        \MedicalVisit  
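
As a rough illustration of the Raw-to-Staged processing mentioned above, here is a minimal U-SQL sketch. The file paths, column names, and the approach of simply not carrying the PII columns forward are all assumptions for the example, not the project’s actual code.

// Hypothetical sketch: read a raw medical visit file, keep only non-PII
// columns, and write the sterilized result to the Staged zone.
// Paths and column names are assumptions for illustration.
@raw =
    EXTRACT PatientName string,   // PII - not carried forward
            HomeAddress string,   // PII - not carried forward
            VisitDate   DateTime,
            VisitReason string,
            ZipCode     string
    FROM "/Raw/TenantA/MedicalVisit/visits.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// Keep only the columns that are safe to share across tenants.
@staged =
    SELECT VisitDate,
           VisitReason,
           ZipCode
    FROM @raw;

OUTPUT @staged
TO "/Staged/TenantA/MedicalVisit/visits.csv"
USING Outputters.Csv(outputHeader: true);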

Development Zone

To give the data engineers a place to test out new data features, we created a Development folder beside the Raw and Staged folders. Security for this folder would be set up so that only developers and specific test users could access the data here. Once new development data was approved for production use, that data could be created in the Production zone.

In the development zone, we move away from having a folder for each tenant. From this point forward, users will be working with the data to spot patterns and insights across data sets. Take the example from before: Tenant A provides school system data. Come to find out, six other tenants provide this data too. Developers decide to integrate these separate sources into one unified data set, and they would store it in a structure like this:

\Development
    \SchoolSystem 
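
To make that concrete, here is a minimal U-SQL sketch of the kind of integration job that could build the unified school system data set. Only two tenants are shown, and the paths and column names are assumptions for illustration.

// Hypothetical sketch: union each tenant's staged SchoolSystem data
// into a single development data set. Paths and columns are assumptions.
@tenantA =
    EXTRACT StudentId string, District string, EnrollDate DateTime
    FROM "/Staged/TenantA/SchoolSystem/schools.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

@tenantB =
    EXTRACT StudentId string, District string, EnrollDate DateTime
    FROM "/Staged/TenantB/SchoolSystem/schools.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// Tag each row with its source tenant, then combine the sources.
@unified =
    SELECT "TenantA" AS SourceTenant, StudentId, District, EnrollDate FROM @tenantA
    UNION ALL
    SELECT "TenantB" AS SourceTenant, StudentId, District, EnrollDate FROM @tenantB;

OUTPUT @unified
TO "/Development/SchoolSystem/schoolsystem.csv"
USING Outputters.Csv(outputHeader: true);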

Production Zone

When development data gets approved, the process that built it is repeated, except the output is written to the Production folders. This structure would exactly match Development. If the data volume grew to require multiple files, these folders could also get date-partitioned child folders in the yyyy\mm\dd format.

Zone Architecture

Don’t let this article convince you to set up only four zones in your data lake. There can be valid reasons to set up additional zones. I’ve seen other architectures set up zones for temporary data, so files that only needed to stick around for a few hours had a special home. I’ve seen zones set up to serve specific client tools: for example, one set of files for consumption by Tableau, and another for sending back to a client’s premises. Your zones should match the logical divisions in your business data.

Also, keep in mind that date partitioning may not work in all cases. If your business cycle is like a nationwide pizza delivery company’s, you may do more than 50% of your business on a single day of the year. For that one day, you would find an enormous amount of data compared to the rest of the files. In that case, the time to process that one file would create a bottleneck when processing a year’s worth of data. If you are wise, you’ll profile your source data and pick a distribution key that makes sense for it.
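
If the team later loads this data into U-SQL managed tables (an assumption, not part of the current plan), that distribution key choice shows up explicitly in the table definition. A minimal sketch, assuming hypothetical column names and StudentId as the key chosen after profiling:

// Hypothetical sketch: a U-SQL managed table whose distribution key
// (StudentId) was picked after profiling the source data.
CREATE DATABASE IF NOT EXISTS Dev;
USE DATABASE Dev;

CREATE TABLE IF NOT EXISTS dbo.SchoolSystem
(
    SourceTenant string,
    StudentId    string,
    District     string,
    EnrollDate   DateTime,
    INDEX idx_SchoolSystem CLUSTERED (StudentId ASC)
        DISTRIBUTED BY HASH (StudentId)
);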

You can always change your distribution key later if you simply don’t know what your data will look like yet; that’s the case we’re in with this greenfield project. Just be prepared for the statistics to show that you need to rearrange the data within your data lake to gain performance and lower cost of ownership.

Next Time…

In my next few Azure Data Lake articles, I’ll start sharing some of the questions and answers from this project. I’m finding that the documentation for Azure Data Lake is very good, but sometimes you have to read three or four articles and then synthesize an answer. I’d like to save you some time and share some of that research with you. After those articles, I’ll show you some U-SQL built by hand, and then show you how to automate it! Stay tuned!
