High-level Databricks Storage

Posted on November 28, 2022 by slowder

Understanding how Databricks works with storage can go a long way toward improving your use of Databricks.

Storage

In SQL Server, we store our data in two highly structured file types. The first is the data file, with an extension of MDF or NDF. You get one MDF per database, but you can have multiple NDF files if you want. These data files are broken down into extents: 64 KB chunks of data composed of eight 8 KB pages. Each page contains a header, a footer, and as many rows of data as can fit in that 8 KB page.

SQL Server Storage

The second file type is the log file. This file contains a record of each data change operation performed in your database. How long that record sticks around is configurable by the database administrator. Log files are broken down internally into virtual log files of varying lengths. While you can have more than one log file per database, doing so provides no performance benefit and often complicates management.

Storage and Performance

When using SQL Server, you may find that your performance is limited by storage performance. For example, say you have a database that holds every sales transaction for the past 10 years. Currently, that database has a single data file, and that data file is stored on a spinning disk (HDD).

You could first try moving that data file to a solid-state drive (SSD). That improves performance considerably, but queries still take a long time to complete. You dig into the queries further and discover that most queries deal with the past two years of data. Users rarely deal with older data.

You decide to add a second filegroup called archive. Then you partition your sales table by month and year. Data older than two years moves to the spinning disk in the archive filegroup, and newer data remains in the primary filegroup on the SSD. Now, when users filter on dates that align with the table's partition key, their queries run significantly faster.

Following this example further, we could add more filegroups and data files, along with additional drives. Each change could improve our IO performance, up to a point. Eventually, you'll find you've moved your bottleneck from IO to memory or CPU.

Modern Storage Solutions for SQL Server

Today many of our SQL Servers are virtual, and so are the drives connected to them: they're disks hosted on separate servers near our SQL Servers. The good news is that most of these solutions abstract away the physical disks hosting your data. While you may see a single logical data drive in your SQL Server instance, that drive is backed by 5, 10, or more physical disks under the hood, which gives you access to more throughput than a single physical drive could provide.

It also means you can provision a specific amount of throughput for your logical disk, and that throughput is usually more consistent than a single drive would ever be. That's the good news. The bad news is that network latency now shows up as IO latency in your database.

Databricks Storage and Performance

Databricks takes advantage of this modern storage model. Rather than connecting to a single disk, you connect to a distributed, HDFS-compatible file system backed by cloud object storage: Azure Data Lake Storage Gen2, AWS S3, or Google Cloud Storage. Your data gets spread out among dozens, if not hundreds, of disks.
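As a minimal sketch of what that looks like in practice (the storage account, container, and path below are placeholders, and credentials are assumed to already be configured on the cluster), a Databricks notebook reads directly from object storage like this:

```python
# Minimal sketch: reading a CSV file straight from cloud object storage in a
# Databricks notebook. The account, container, and path are placeholders;
# the `spark` session is predefined in Databricks notebooks.
df = spark.read.csv(
    "abfss://raw@examplestorageaccount.dfs.core.windows.net/sales/2022/",
    header=True,
    inferSchema=True,
)

# The read is distributed across the cluster and the underlying disks,
# rather than hitting one physical drive.
print(df.count())
```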

In addition to having so many disks hosting your data, you can also use semi-structured files. You can choose from CSV, JSON, Avro, or Parquet. The benefit of semi-structured files is that you don't have to model your tables up front. The downside is that you pay for that flexibility with additional CPU cost when the data is read.
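To illustrate schema-on-read (the path below is again a placeholder), Spark can discover the schema by scanning the files themselves, which is exactly where that extra CPU cost shows up:

```python
# Schema-on-read sketch: no table is modeled up front. Spark infers the
# schema by scanning the JSON files at read time; that scan is the extra
# CPU cost of skipping up-front modeling. The path is a placeholder.
orders = spark.read.json(
    "abfss://raw@examplestorageaccount.dfs.core.windows.net/orders/"
)
orders.printSchema()  # schema discovered at read time, not defined in advance
```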

Databricks has added one additional format: Delta. It stores data as Parquet files plus a transaction log, and that log is what adds so many powerful features to data lakes.

First and foremost, Delta brings ACID transactions to your data lake!

Let that sink in for a minute. That used to be the primary argument against moving from a database to a data lake. It’s gone now.
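As a brief sketch (reusing the hypothetical `df` from the earlier read example; the path and column names are placeholders), every write to a Delta table either commits to the transaction log in full or not at all, and you can run transactional DML directly against the lake:

```python
# Minimal sketch: Delta writes are ACID transactions. Either the whole batch
# commits to the Delta transaction log or none of it does, so readers never
# see a half-written result. Path and column names are placeholders.
df.write.format("delta").mode("overwrite").save("/mnt/lake/sales_delta")

# Transactional DML directly against the data lake.
spark.sql("""
    DELETE FROM delta.`/mnt/lake/sales_delta`
    WHERE sale_amount < 0
""")
```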

Delta also gives you schema evolution (just like JSON, Avro, and Parquet), but it adds the ability to perform schema enforcement. You can now enforce that schema and prevent bad data from being inserted in the first place!
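Here's a hedged sketch of both behaviors, assuming two hypothetical DataFrames: `bad_df`, whose columns don't match the table, and `new_df`, which intentionally carries a new column:

```python
# Schema enforcement: an append whose columns don't match the Delta table's
# schema is rejected rather than silently corrupting the table.
try:
    bad_df.write.format("delta").mode("append").save("/mnt/lake/sales_delta")
except Exception as err:
    print("Rejected by schema enforcement:", err)

# Schema evolution: explicitly opt in, and the new column is added to the
# table's schema as part of the same transactional commit.
(new_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/lake/sales_delta"))
```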

Delta adds audit history, time travel, DML operations, and more. All of this is why you’ll start hearing the term Data Lakehouse more often in the future. This is how you deal with the four Vs today.
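For example (same placeholder path as above), the transaction log doubles as an audit trail, and you can query the table as it existed at an earlier version or timestamp:

```python
# Audit history: every commit to the table is recorded in the Delta log.
spark.sql("DESCRIBE HISTORY delta.`/mnt/lake/sales_delta`").show(truncate=False)

# Time travel: read the table as it looked at an earlier version
# (timestampAsOf works the same way with a timestamp string).
old_snapshot = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/mnt/lake/sales_delta")
)
```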

Please, take the time to read more about what Delta can do.

Conclusion

The way Databricks works with storage is similar to how SQL Server works with storage today. In both cases it's remote storage you connect to over the network, the physical disks are abstracted away, and you're guaranteed a certain amount of storage at a certain number of IOPS. In the case of SQL Server, you could still host that storage locally and pay for it through a combination of CapEx and OpEx. In the case of Databricks, it's all OpEx: pay for what you use, as you use it. Databricks storage gives you more flexibility, but you pay for that flexibility with higher CPU costs.

In my next entry, I’ll show how SQL Server and Databricks handle queries differently. Until then, if you have any questions, let me know!
