Data Engineering for Databricks

Posted on December 13, 2022 (updated December 20, 2022) by slowder

Since Databricks is a PaaS offering for Spark, and Spark is optimized to process file-based data in parallel, you might find it odd that you have to get your sources into a file format before you see Databricks shine. The good news is that Databricks has partnered with several data ingestion solutions to ease loading source data into your data lake.

Data Ingestion Partners as of 13 December 2022

It is interesting to note that Azure Data Factory was on this list until earlier in 2022. You can still use ADF to move data from your sources into your data lake; the difference is that Databricks no longer cross-promotes it. In most of my Azure Databricks deployments, we still use ADF. The one notable exception is the handful of companies picking up Fivetran to automate their data ingestion.

Fivetran is ETLaaS

Fivetran has built its business around building and maintaining ETL pipelines for its customers. They’ve even developed a simple user interface to control the process. Customers sign up for the service, click the source they want to ingest, fill in a few details, and are up and running. Fivetran absorbs the cost of maintaining the ETL code; you, the customer, pay for what you use as you use it!

You don’t have to struggle with full vs. incremental load logic.

That means you don’t have to fix anything when the Salesforce API changes; Fivetran handles it.

You don’t have to build auditing and logging; that’s built into the service!

Fivetran is the closest thing to ETLaaS I’ve found. You can get a free trial and give it a try; it’s pretty impressive.

ADF to Ingest Source Data

If you’ve already invested in ADF and built an ingestion framework, stay with it. Updating your current framework to work with Databricks is a breeze: define a linked service for your Azure Data Lake Storage account, then use that linked service as your destination. If you need tutorials on getting started with ADF, I highly recommend any training materials Andy Leonard has published.

Some Data Lake Terminology

The source data you’ve landed in your data lake is referred to as your “raw” or “bronze” zone. The only transformation you should perform on this raw data is a change of storage medium. For example, the source could be SQL Server, and you read that data out into CSV, JSON, or Parquet format. You don’t want to introduce changes to the source data itself, such as reformatting dates or merging it with another data set (for example, adding a lookup value).
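
To make that concrete, here is a minimal PySpark sketch of landing a SQL Server table in the bronze zone. The server, credentials, table, and storage paths are placeholders, not a prescribed layout; the point is that only the storage medium changes, not the data.

```python
# A minimal sketch: land a SQL Server table in the bronze (raw) zone.
# Server, credentials, table, and lake paths below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

LAKE = "abfss://lake@mystorage.dfs.core.windows.net"

orders_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=Sales")
    .option("dbtable", "dbo.Orders")
    .option("user", "etl_user")        # use a secret scope in practice, not a literal
    .option("password", "********")
    .load()
)

# The only "transformation" is the storage medium: table rows become Parquet files.
orders_df.write.mode("overwrite").parquet(f"{LAKE}/bronze/sales/orders")
```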

When you’re ready to transform your source data, you write a copy of the transformed data to your “silver” zone. In silver you can have aggregations, lookups, merges, and so on; this is where your conformed data lives. In database terms, this zone is most similar to an operational data store. You could model this layer using a Data Vault approach if you desire.

Silver data is normally never consumed directly by end users or BI tools.

When you’re ready to publish data to end users, you will land a separate copy in your “gold” zone. This three-zone paradigm is the most common approach in Databricks solutions.
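
Continuing the hypothetical orders data from the bronze sketch above, here is what the silver and gold hops could look like. The lookup table, column names, and paths are assumptions for illustration only.

```python
# Sketch of the silver and gold hops for the hypothetical orders data.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
LAKE = "abfss://lake@mystorage.dfs.core.windows.net"

orders = spark.read.parquet(f"{LAKE}/bronze/sales/orders")
customers = spark.read.format("delta").load(f"{LAKE}/silver/crm/customers")

# Silver: conform types and add a lookup value -- exactly the kind of work
# deferred until after the data has landed in bronze.
conformed = (
    orders.withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
    .join(customers.select("customer_id", "region"), "customer_id", "left")
)
conformed.write.format("delta").mode("overwrite").save(f"{LAKE}/silver/sales/orders")

# Gold: an aggregate shaped for end users and BI tools.
(
    conformed.groupBy("order_date", "region")
    .agg(F.sum("order_total").alias("revenue"))
    .write.format("delta").mode("overwrite")
    .save(f"{LAKE}/gold/sales/daily_revenue")
)
```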

Orchestrating Bronze to Silver to Gold

Once you have landed your source data in bronze, you no longer want to pull that data back out to transform it; you’re now in an ELT mindset. Fortunately, Databricks supports many approaches to transforming the data in your data lake. You can use Databricks SQL, Python, or Scala to perform these transformations. Use the one that works best for you and your team.

I want to introduce you to a few tools built into Databricks that can make this process easier than building it all from scratch. Auto Loader keeps track of which files it has already ingested, so it manages what data should be included in your incremental loads. You no longer have to maintain watermarks and timestamps yourself to handle incremental loads.
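
Here is a small sketch of Auto Loader doing that bookkeeping. It assumes it runs in a Databricks notebook where `spark` is already defined, and the paths are placeholders.

```python
# `spark` here is the session Databricks provides in a notebook.
LAKE = "abfss://lake@mystorage.dfs.core.windows.net"
checkpoint = f"{LAKE}/_checkpoints/sales_orders"

incoming = (
    spark.readStream.format("cloudFiles")               # Auto Loader
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", checkpoint)    # where the inferred schema is tracked
    .load(f"{LAKE}/bronze/sales/orders")
)

(
    incoming.writeStream.format("delta")
    .option("checkpointLocation", checkpoint)           # records which files were already processed
    .trigger(availableNow=True)                         # pick up only what's new, then stop
    .outputMode("append")
    .start(f"{LAKE}/silver/sales/orders")
)
```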

Delta Live Tables (DLT) lets you write tiny code snippets and still get fully featured data pipelines. DLT gives you automatic data quality checks, schema evolution, and monitoring with very little code.
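
As an illustration, here is a tiny DLT pipeline in Python. The table names, columns, and the expectation are made up, but the pattern of decorated functions is the gist of it.

```python
# Runs as a Delta Live Tables pipeline; names and columns are illustrative.
import dlt
from pyspark.sql import functions as F

LAKE = "abfss://lake@mystorage.dfs.core.windows.net"

@dlt.table(comment="Raw orders, ingested incrementally with Auto Loader.")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load(f"{LAKE}/bronze/sales/orders")
    )

@dlt.table(comment="Conformed orders for the silver zone.")
@dlt.expect_or_drop("valid_order_total", "order_total >= 0")  # built-in data quality check
def orders_silver():
    return (
        dlt.read_stream("orders_bronze")
        .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
    )
```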

Combine these two with a metadata-driven approach, and you can roll from bronze to gold with very little effort!
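
One hedged sketch of what “metadata-driven” could mean in practice: a small control list (it could just as easily be a Delta table) describes each source, and one generic loop applies the same bronze-to-silver pattern to every entry instead of hand-writing a notebook per table. All names and paths here are hypothetical.

```python
# `spark` is the notebook-provided session; sources could live in a Delta control table.
LAKE = "abfss://lake@mystorage.dfs.core.windows.net"

sources = [
    {"name": "orders",    "format": "parquet", "path": "sales/orders"},
    {"name": "customers", "format": "parquet", "path": "crm/customers"},
]

for src in sources:
    df = spark.read.format(src["format"]).load(f"{LAKE}/bronze/{src['path']}")
    # shared, metadata-described cleanup rules would be applied here
    df.write.format("delta").mode("overwrite").save(f"{LAKE}/silver/{src['path']}")
```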

When it comes time to orchestrate your steps, you can define dependencies programmatically or use visual workflows.
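
For the programmatic route, the Databricks Jobs API (2.1) lets you express those dependencies as task definitions. In this sketch the workspace URL, token, notebook paths, and cluster settings are placeholders; the same dependency graph can be drawn in the visual Workflows editor instead.

```python
# Create a two-task job where the second task waits on the first.
import requests

job = {
    "name": "bronze-to-gold",
    "job_clusters": [{
        "job_cluster_key": "etl_cluster",
        "new_cluster": {
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
    }],
    "tasks": [
        {
            "task_key": "bronze_to_silver",
            "notebook_task": {"notebook_path": "/Pipelines/bronze_to_silver"},
            "job_cluster_key": "etl_cluster",
        },
        {
            "task_key": "silver_to_gold",
            "depends_on": [{"task_key": "bronze_to_silver"}],  # runs only after the first task succeeds
            "notebook_task": {"notebook_path": "/Pipelines/silver_to_gold"},
            "job_cluster_key": "etl_cluster",
        },
    ],
}

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=job,
)
resp.raise_for_status()
print(resp.json())  # {'job_id': ...}
```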

Conclusion

Data engineering in Databricks is a vast topic. Once I get through the blog entries for this overview, I plan to come back and dig deeper into these details. I’m also planning to share what a metadata-driven approach could look like as you move from traditional databases to this broader world.

Next time, I’ll continue the introductory series by showing what Databricks looks like for the BI Developer. Until then, if you have any questions, please send them in!

2 thoughts on “Data Engineering for Databricks”

  1. Alex Ott says:
    January 3, 2023 at 13:32

    Hi Shannon

    Is it you who commented on my blog post about DLT unit testing? If yes, can you drop me an email? Or, maybe better, start a discussion on GitHub. For some reason I can’t reply to your comment on Blogger 🙁

    1. slowder says:
      January 6, 2023 at 09:36

      I’ll reach out by email shortly.

