On a recent engagement, I was asked to provide best practices, and I realized many of them hadn’t been collected here, so it’s time I fixed that. The client was early in their journey of adopting Databricks as their data engine, and a lot of the development they were doing was free-form. They were learning Databricks as they went. As a result, they were developing code directly in the Databricks Notebook experience. That leads to a lot of monolithic design, which becomes difficult to maintain. But before I discuss breaking down those monoliths, I want to explain a better way to develop data engineering code.
Notebook pros
Notebooks are great tools for exploring data. Given a set of data, an analyst can ingest, transform, and present findings with little effort. That process can be documented with Markdown cells so that other analysts understand what’s happening in the notebook, and the notebook can be easily shared with other users in the same workspace. But that notebook isn’t production-ready code.
Notebook cons
The ease that made notebooks successful also leads to problems. Exploring data and building software require different thought processes. The notebook user considers the data and the question they are trying to answer in the present tense. A software developer has to consider that, along with the what-ifs, future states, and possible problems that may arise when the code is run again. Because of this, notebooks are a poor fit for software development.
IDE pros
Software developers can instead use an IDE like Visual Studio Code (VSC). Setting up your local environment takes a few steps. First, install a couple of VSC extensions: Python and Databricks VSCode. Then set up a virtual environment that closely matches your Databricks environment.
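One quick way to confirm that match is a small version check, sketched below. It assumes pyspark is already installed in the active virtual environment; you would compare the output against your cluster’s Databricks Runtime release notes.

```python
# A minimal sketch: confirm the local virtual environment roughly matches
# the cluster. Assumes pyspark is installed in the active virtual environment.
import sys

import pyspark

print(f"Python:  {sys.version.split()[0]}")
print(f"PySpark: {pyspark.__version__}")
# Compare these against the versions listed in your cluster's
# Databricks Runtime release notes.
```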
With a full-blown IDE, you gain a ton of useful features. My personal favorite is code completion. Nothing has helped me move from .NET development to Python development as much as code completion. The default code completion baked into the Python extension is nice, but Copilot has helped me push past many of my Python frustrations.
In addition to getting help constructing code, you need help keeping it clear of “code smells,” which is difficult without a linter. Code smells are patterns that are syntactically correct but can indicate a deeper problem. With Pylint, I can check my code after every file save. Live linting would be nicer, but for now, this works.
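As an illustration (the function here is hypothetical), this is the kind of smell Pylint flags, in this case its dangerous-default-value warning about mutable default arguments:

```python
# Hypothetical example of a code smell Pylint catches:
# a mutable default argument (dangerous-default-value, W0102).
def append_reading(reading, readings=[]):  # the same list is reused across calls
    readings.append(reading)
    return readings


# The fix Pylint nudges you toward:
def append_reading_safely(reading, readings=None):
    if readings is None:
        readings = []
    readings.append(reading)
    return readings
```

The first version is valid Python, but every call that omits the second argument appends to the same shared list, which is exactly the kind of subtle bug a linter surfaces early.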
For cost-conscious people, local debugging means you no longer have to spin up a Databricks cluster whenever you want to test your code. With a local environment running the same versions of Python, PySpark, and any libraries you’re referencing on your cluster, you can debug that code from your local machine. PySpark requires a Java runtime, so that must be installed first. With those in place, you can run a single-node Spark controller and worker on your development machine. Once you’ve verified your code works as intended, you can deploy it to your clusters and only start paying for compute then.
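As a rough sketch of what that looks like (the file path and column names are made up), you can point a SparkSession at `local[*]` and debug the same transformation you would otherwise run on a cluster:

```python
# A minimal sketch of local PySpark debugging. Assumes Java and pyspark are
# installed locally; the file path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .master("local[*]")        # single-node controller and workers on this machine
    .appName("local-debug")
    .getOrCreate()
)

df = spark.read.csv("data/sample_orders.csv", header=True, inferSchema=True)
daily_totals = df.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))
daily_totals.show()
```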
The git experience in Databricks has improved, but it’s not fully featured yet. With local development, you get all of git’s abilities. You can easily juggle multiple branches and perform A/B testing without going through convoluted configuration steps. You can also handle cherry-picking much more easily in a local editor than through the web UI.
Not only can you run your code in local debug mode, but you can also start writing unit tests for it. If you install the pytest and coverage libraries in your development environment, you can write tests that verify your code before you commit it to the remote repository. With test cases written, you will know immediately when a code change has broken something. Once you adopt tests into your development methods, you can use them in your deployment pipelines.
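Here is a minimal sketch of what such a test might look like; the transformation under test and its column names are hypothetical:

```python
# A minimal pytest sketch for a PySpark transformation.
# The function under test (add_total_column) is hypothetical.
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def add_total_column(df):
    """Hypothetical transformation: total = price * quantity."""
    return df.withColumn("total", F.col("price") * F.col("quantity"))


@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_add_total_column(spark):
    source = spark.createDataFrame([(10.0, 2), (5.5, 4)], ["price", "quantity"])
    result = add_total_column(source).collect()
    assert [row["total"] for row in result] == [20.0, 22.0]
```

Running `coverage run -m pytest` then gives you both the pass/fail results and a coverage report you can feed into your pipeline.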
CI/CD pipelines can now take your repository and run those tests on check-in. If you design integration tests, you can see how newly checked-in code affects the rest of your codebase and automatically decide how to handle it. If tests don’t pass, the code can’t be merged into main or another named branch. These rules can get as sophisticated as you need.
IDE cons
The only real con is training and experience. While almost every analyst and developer will feel comfortable in a notebook environment, the move to an IDE can seem intimidating. The only way through that is practice: you have to use the IDE to get used to it, and it will take time before you’ve mastered it.
Conclusions
Moving from an explorer’s mindset to a software developer’s mindset is not fast or easy, but the payoff is more reliable code. This move is not for all of your analysts; there will still be a need for those explorers. Once they uncover an insight and the business determines it needs to be made production ready, the code is turned over to the developers who are more comfortable with the core software development life-cycle processes. They then make that code ready to deploy.
Notebooks explore data, and IDEs develop software.