Notebooks Explore Data

Posted on October 22, 2022 (updated November 14, 2022) by slowder

On a recent engagement, I was asked to provide best practices, and I realized many of them hadn't been collected here, so it's time I fixed that. The client was early in their journey of adopting Databricks as their data engine, and much of their development was free-form; they were learning Databricks as they went. As a result, they were writing code directly in the Databricks notebook experience. That leads to a lot of monolithic design, which becomes difficult to maintain. But before I discuss breaking down those monoliths, I want to explain a better way to develop data engineering code.

Notebook pros

Notebooks are great tools for exploring data. Given a dataset, an analyst can ingest, transform, and present findings with little effort. The process can be documented with markdown cells so other analysts can understand what's happening in the notebook, and the notebook can easily be shared with other users in the same workspace. But that notebook isn't production-ready code.
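To make that concrete, here's the kind of exploratory cell I mean (a minimal sketch; the file path and column names are placeholders, and `spark` and `display` are the built-ins Databricks provides in a notebook). It's quick to write and easy to share, but it isn't something you'd ship as-is:

```python
# Ingest, transform, and eyeball the result in a few lines.
df = spark.read.option("header", "true").csv("/mnt/raw/sales.csv")

summary = (
    df.groupBy("region")
      .count()
      .orderBy("count", ascending=False)
)

display(summary)  # Databricks renders this as an interactive table or chart
```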

Notebook cons

The same ease that made notebooks successful also leads to problems. Exploring data and building software are different thought processes. A notebook user considers the data and the question they're trying to answer in the present tense. A software developer has to consider that, plus the what-ifs: future states and the problems that may arise when the code runs again. Because of this, software developers can't rely on notebooks for software development.

IDE pros

Instead, they can use an IDE like Visual Studio Code (VS Code). Setting up your local environment takes a few steps. First, install a couple of extensions: Python and Databricks VSCode. Then set up a virtual environment that closely matches your Databricks environment.
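Once the virtual environment is active, a quick sanity check like the sketch below (assuming pyspark is installed locally; the runtime versions in the comment are only an example) confirms your local versions line up with the cluster's Databricks Runtime:

```python
# Minimal sketch: confirm the local virtual environment roughly matches the
# cluster's Databricks Runtime. Compare the output against your runtime's
# release notes (e.g., DBR 11.3 LTS ships Python 3.9 and Spark 3.3).
import sys

import pyspark

print(f"Python : {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
print(f"PySpark: {pyspark.__version__}")
```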

With a full-blown IDE, you gain a ton of useful features. My personal favorite is code completion. Nothing has helped me move from .NET development to Python development as much as code completion. The default completion baked into the Python extension is nice, but GitHub Copilot has helped me push past many of my Python frustrations.

In addition to getting help constructing code, keeping it clear of "code smells" is difficult without a linter. A code smell is syntactically correct code that follows a pattern indicating a deeper problem. With Pylint, I can check my code after every file save. Live linting would be nicer, but for now, this works.
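As an illustration, the snippet below is perfectly valid Python, but Pylint flags the first function with its dangerous-default-value warning (W0102), exactly the kind of smell that's easy to miss in a notebook:

```python
# Smell: a mutable default argument. The list is created once, at function
# definition time, so every call without an explicit `rows` shares the same list.
def append_row(row, rows=[]):      # Pylint: dangerous-default-value (W0102)
    rows.append(row)
    return rows


# The idiomatic fix: default to None and create a fresh list inside the function.
def append_row_fixed(row, rows=None):
    if rows is None:
        rows = []
    rows.append(row)
    return rows
```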

For the cost-conscious, local debugging means you no longer have to spin up a Databricks cluster whenever you want to test your code. With a local environment running the same versions of Python, PySpark, and any libraries you reference on your cluster, you can debug that code from your local machine. PySpark does require a Java runtime, so install that first. With those in place, you can run a single-node Spark driver and worker on your development machine. Once you've verified your code works as intended, you can deploy it to your clusters and only start paying for compute then.
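A minimal local session looks something like this sketch (the data and app name are placeholders); you can set breakpoints and step through it in the IDE's debugger before it ever touches a cluster:

```python
# Minimal sketch: a single-node Spark session for local debugging.
# Assumes a local Java runtime and pyspark installed in the virtual environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")        # driver and workers all run in this one process
    .appName("local-debug")
    .getOrCreate()
)

df = spark.createDataFrame(
    [("2022-10-22", 42), ("2022-10-23", 7)],
    ["event_date", "clicks"],
)

# Set a breakpoint here and inspect the data or the plan before deploying.
df.groupBy("event_date").sum("clicks").show()

spark.stop()
```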

The Git experience in Databricks has improved, but it's not fully featured yet. With local development, you get all of Git's capabilities. You can juggle multiple branches and run A/B comparisons without crazy configuration steps, and cherry-picking is much easier from a local editor than from the web UI.

Not only can you run your code in local debug mode, but you can also start writing unit tests for it. With the PyTest and Coverage libraries installed in your development environment, you can write tests that verify your code before you commit it to the remote repository. With test cases in place, you'll know immediately when a change breaks something. And once tests are part of your development method, you can reuse them in your deployment pipelines.
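Here's a minimal sketch of what that looks like for a PySpark transformation (the add_total function and its columns are hypothetical, not from any real project):

```python
# test_transforms.py -- run with `pytest`; use `coverage run -m pytest` for coverage.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def add_total(df):
    """Function under test: derive a 'total' column from price and quantity."""
    return df.withColumn("total", F.col("price") * F.col("quantity"))


@pytest.fixture(scope="session")
def spark():
    # One local session shared across the whole test run.
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_add_total(spark):
    df = spark.createDataFrame([(2.0, 3), (5.0, 1)], ["price", "quantity"])

    result = add_total(df).orderBy("price").collect()

    assert [row["total"] for row in result] == [6.0, 5.0]
```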

CI/CD pipelines can then take your repository and run those tests on check-in. If you add integration tests, you can see how newly checked-in code affects the rest of the codebase and act on it automatically: if the tests don't pass, the code can't be merged into main or another designated branch. These rules can get as sophisticated as you need.

IDE cons

The only real con is training and experience. While almost every analyst and developer feels comfortable in a notebook, the move to an IDE can seem intimidating. The only way through that is practice: you have to use the IDE to get used to it, and it takes time to master.

Conclusions

Moving from an explorer's mindset to a software developer's mindset isn't fast or easy, but the payoff is more reliable code. The move isn't for all of your analysts; there will still be a need for explorers. Once they uncover an insight and the business decides it needs to be made production ready, that code is handed to developers who are comfortable with the core software development life-cycle, and they make it ready to deploy.

Notebooks explore data, and IDEs develop software.
