Setting up Delta Sharing in Databricks is straightforward once you understand the diagram provided in the Azure Databricks documentation. Delta sharing is implemented as a part of Databrick’s Unity Catalog. Unity catalog is the official data governance solution for Databricks. You can consider it an extension to the metastore catalog or Databricks version of a…
Month: January 2023
SQL Server to Databricks Profiler
Recently I had a client that expressed interest in migrating their data warehouse from Azure SQL DB to Databricks. They weren’t looking to move due to any performance issues in Azure SQL DB. They were running on the Hyperscale offering. They were looking to share a common data architecture between their data warehouse and data…
Delta Sharing – Data Recipients
Recently I was asked to look into Delta Sharing to learn what it’s all about and how it could be used. fter digging in a little bit, It appears that it’s a way to share data in parquet or Delta format. ou can build these shares on top of any modern cloud storage system like…
Metadata-Driven Python
You will manually build your ingestion code when you first learn to ingest data into any new engine. This makes sense; you’re just getting started. You want to learn how the engine will read the data and then write it back again. You want to learn how to log what’s happening during ingestion. You want…
Monolithic vs. Unit-of-Work
When we start developing in a new language or on a new platform, it’s easy to fall into the trap of monolithic design. The free flow from idea to code leads to a single blob of functional code. This leads to quick prototype code that meets the functional requirements. The problem with this approach creeps…
Testing Ingest
Last time, we built a simple transform function in Python, but how do we know if it works? We need to build some tests to find out. I admit data engineering has be late to the practice of test development, but it’s not too hard to adopt. Let’s work through a simple data test, a…
Developing Datbricks Ingestion locally
Spark engines like Databricks are optimized for dealing with many small-ish files that have already been loaded into your Hadoop-compatible file system. If you want to process data from external sources, you’ll want to extract that data into files and store those in your Azure Data Lake Storage (ADLS) account attached to your Databricks Workspace….
Prepare VSC Local Databricks Development
Last time, we walked through how to perform analysis on Databricks using Visual Studio Code (VSC). This time, we will set up a local solution in VSC that will let us build out our data engineering solutions locally. That way, we don’t have to pay for development and testing time. We’d only pay for Databricks,…
Get Started with Databricks in VSCode
You’ve just received a new dataset, and you have to analyze it to prepare for building out the d ta ingestion pipeline. But first, we’ll need to create a cluster to run our analysis. Let’s run through a simple data analysis exercise using Databricks and Visual Studio Code (VSC). Create a cluster from the Web…