I’ve been a Microsoft data professional for over 20 years. I’ve spent most of that time in the SQL Server stack: the core query engine, SSIS, SSRS, and a little SSAS. But times changed, and the business problems grew more complex. As they did, I looked at other technologies to help solve them. My first pass at data lakes and massively parallel processing for analytics was Azure Data Lake Analytics (ADLA). It was easy to learn because it used U-SQL, a language that combines SQL and C#, and I was already comfortable with both.
Unfortunately, ADLA didn’t see much adoption, while other platforms did. Databricks is the one I’ve been asked to help with most often, so it’s the one I chose to learn. For me, the most challenging part of picking it up was the heavy dependency on Java and Python. After struggling with it for a while, I think I can help other SQL professionals make the move from SQL and C# to Python.
Over the next few blog entries, I’ll show how the features in Databricks relate to familiar concepts in SQL Server. After that, I’ll share how to bring Python into your data engineering toolkit!
Writing SQL Queries
When you start with SQL Server, you’re usually just querying existing tables and views. Databricks can work the same way: if you start with the Databricks Community Edition, you can write queries against sample data without setting up any data of your own.
The query-writing experience is notebook-driven, so it’s closer to Azure Data Studio or Visual Studio Code than to SQL Server Management Studio. It also pays to keep a browser tab open to the Databricks SQL reference on Microsoft Learn (“SQL reference – Azure Databricks – Databricks SQL | Microsoft Learn”) until you get used to the differences between T-SQL and Databricks SQL.
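To give a flavor of the T-SQL versus Databricks SQL differences mentioned above, here is a minimal sketch that lists a few common ones side by side. I’ve written it as a small Python script since that’s where this series is headed. The table name trips and its columns are hypothetical, made up purely for illustration, and the list is nowhere near exhaustive.

```python
# Illustrative only: the table `trips` and its columns are hypothetical,
# not a real sample dataset. Each entry pairs a T-SQL habit with the
# Databricks SQL spelling of the same query.
DIALECT_PAIRS = [
    (
        "Row limiting: TOP becomes LIMIT at the end of the query",
        "SELECT TOP 10 * FROM trips ORDER BY fare_amount DESC",
        "SELECT * FROM trips ORDER BY fare_amount DESC LIMIT 10",
    ),
    (
        "Quoted identifiers: square brackets become backticks",
        "SELECT [trip notes] FROM trips",
        "SELECT `trip notes` FROM trips",
    ),
    (
        "Null replacement: two-argument ISNULL becomes coalesce",
        "SELECT ISNULL(fare_amount, 0) AS fare FROM trips",
        "SELECT coalesce(fare_amount, 0) AS fare FROM trips",
    ),
]


def format_pairs(pairs):
    """Return a plain-text, side-by-side listing of the dialect pairs."""
    lines = []
    for topic, tsql, databricks_sql in pairs:
        lines.append(topic)
        lines.append(f"  T-SQL:          {tsql}")
        lines.append(f"  Databricks SQL: {databricks_sql}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_pairs(DIALECT_PAIRS))
```

In a notebook, the Databricks SQL versions are what you would type into a SQL cell. The ISNULL case is worth remembering: in Databricks SQL, isnull takes a single argument and returns true or false, so coalesce is the safer habit for replacing nulls.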
At this point in exploring Databricks, you don’t need to dig into how the data is stored, how the queries are run, or how the way you write a query affects its performance. For data analysts, this query experience plus the dashboarding experience is all they need to be effective in Databricks.
Next Time
In my next entry, I’ll dive into some of the internal details of how Databricks takes your queries, breaks them down, and executes them against the data. I’ll also share some free official Databricks training links to help explain. Understanding the internals is essential before trying to pick up Python on Databricks, because learning both at the same time is rough. If you have any questions in the meantime, hit me up.