Prepare VSC Local Databricks Development

Posted on January 5, 2023 by slowder

Last time, we walked through how to perform analysis on Databricks using Visual Studio Code (VSC). This time, we will set up VSC so we can build out our data engineering solutions locally. That way, we don’t pay for development and testing time; we only pay Databricks compute and storage costs when we deploy our solutions!

Install and Configure Python

Before getting started, you will need to install Python on your local development machine. Depending on your cluster’s Databricks Runtime version, you’ll need one of three versions of Python. If you’re running Runtime 11.0 or newer, install Python 3.9.5. If you’re running 9.1 LTS through 10.4 LTS, install 3.8.10. For anything older than that, you’ll want 3.7.5.

Once you have a version installed, make sure the install folder is included in your PATH variable. That way, you can run python without typing out the full path each time.
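A quick way to confirm the PATH change took effect is to open a new terminal and check the version (on macOS you may need python3 instead):

python --version

If that prints the version you just installed, you’re set.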

Also, if you’re using Windows 11 on your development machine, you will need to disable the application execution aliases for Python. Microsoft assumes that when you type python, you want to go to the Microsoft Store. To disable this “feature,” open Settings -> “App execution aliases” and turn off the two Python entries.

You’ll also want to install pipenv, which manages virtual Python environments. It lets you set up multiple independent Python solutions on your machine without them breaking each other. Installation is easy: open a command prompt in your local development folder and run the following command.

pip install pipenv

Finally, you’ll want to add the Python extension to VSC. This adds many features we’ll use in developing our Python-based data engineering workloads. Open your Extensions panel, search for Python, and install it.

While you’re in the Extensions panel, also install Python Test Explorer for Visual Studio Code. We’ll need it when we start building tests for our data engineering solution.

Set Up Your VSC Workspace

In the last entry, we added our Databricks Workspace to our VSC workspace. This time we will add a new local folder to our workspace. This can be a repository folder or a standalone folder, but I encourage you to get used to working in repositories: you get version control plus one of the easiest deployment paths to Databricks, both of which you’d miss with a standalone folder.

To add another folder to our VSC workspace, click File -> Add Folder to Workspace. Create a new folder, or choose an existing folder to add to your workspace. Both folders should now appear in your Explorer view.

Configure Your Virtual Python Environment

I like configuring my virtual environments so they’re stored in the project folder; that way, they’re easier to manage. To do that, I add a .env file to my project folder. When working in a repository folder, I also check this file into source control. My .env file follows this template:

PIPENV_VENV_IN_PROJECT=True
PIPENV_DEFAULT_PYTHON_VERSION=<full three dot version number>
PIPENV_CUSTOM_VENV_NAME=<project name>

So, for this demo, I set the Python version to 3.9: I’m on a Mac, and the closest I can get to 3.9.5 is 3.9.
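Here’s a sketch of what that .env looks like (the project name below is a placeholder for illustration, not from the original post):

PIPENV_VENV_IN_PROJECT=True
PIPENV_DEFAULT_PYTHON_VERSION=3.9
PIPENV_CUSTOM_VENV_NAME=databricks-local-demo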

With this file in place, open a terminal in the project folder and run the following command to create the virtual environment.

pipenv install

You’ll see pipenv’s environment-creation output scroll by.

When that completes, you should see a new folder named .venv plus two files, Pipfile and Pipfile.lock, in your local development folder.

Add Packages to Your Virtual Environment

Depending on what your data engineering solution needs to do, you’ll want to add packages to your virtual environment. There are two ways to do this: run a pipenv install command for each package, or create a requirements.txt file and install everything at once.

There’s one package I use in every Databricks solution: pyspark. Installing it is key to developing my solutions offline. You’ll want to install the same version of Spark your cluster is running; currently, Databricks Runtime 11.3 LTS runs Spark 3.3.0.

pipenv install pyspark==3.3.0

If you open the Pipfile in your solution, you’ll notice that the package is now listed in your [packages] section.
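For example, after the install above, that section should contain an entry roughly like this (using pipenv’s default pin format):

[packages]
pyspark = "==3.3.0"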

If you require many packages for your solution, you can use a requirements.txt file to install them all at once. In my case, I want to install pyspark, delta-spark, and requests. So I add the following to my requirements.txt file:
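Something along these lines works; the delta-spark pin is my assumption (the 2.2.x releases target Spark 3.3):

pyspark==3.3.0
delta-spark==2.2.0
requests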

Then run the following command from my local development folder.

pipenv install -r requirements.txt

You’ll then find your Pipfile has been updated once more.

Later in this demo, we’ll explore linting, testing, code coverage, and deployment. Let’s go ahead and set up some dev-packages to help us with this development work. Create another file in your local development folder named dev-packages.txt and add the following packages.

coverage
packaging
pylint
pytest
setuptools
wheel

Then run the following command to install those packages into the [dev-packages] section of our Pipfile.

pipenv install --dev -r dev-packages.txt

Your Pipfile should now look something like this.
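Here’s a sketch of the full file at this point; exact pins depend on your requirements files, and the delta-spark version carries over my earlier assumption:

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
pyspark = "==3.3.0"
delta-spark = "==2.2.0"
requests = "*"

[dev-packages]
coverage = "*"
packaging = "*"
pylint = "*"
pytest = "*"
setuptools = "*"
wheel = "*"

[requires]
python_version = "3.9"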

Conclusion

You’re now ready to start developing your data engineering solutions locally. If you get any errors while following this guide, double-check that you have Python in your PATH variable; most of the time, missing that is what breaks everything else. And if you need a hand getting set up, let me know!
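Before moving on, here’s a quick sanity check: a minimal pyspark script (my sketch, not part of the original walkthrough) that should run entirely locally, without touching a cluster. Note pyspark also needs a JDK installed. Save it as smoke_test.py and run it with pipenv run python smoke_test.py.

from pyspark.sql import SparkSession

# Build a purely local Spark session; no Databricks cluster (or cost) involved.
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()

# Create a tiny DataFrame and show it to confirm pyspark works end to end.
df = spark.createDataFrame([(1, "ok")], ["id", "status"])
df.show()

spark.stop()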

In the next entry, we’ll build a simple ingestion example. This will include unit tests to illustrate how we can build more reliable data engineering solutions. All that before paying a single cent for compute!
