Get Started with Databricks in VSCode

Posted on January 3, 2023 by slowder

You’ve just received a new dataset, and you need to analyze it before building out the data ingestion pipeline. But first, we’ll need to create a cluster to run our analysis. Let’s run through a simple data analysis exercise using Databricks and Visual Studio Code (VSC).

Create a cluster from the Web UI

From the Web UI, click the compute icon on the left panel, then click “Create compute” under the All-purpose compute tab. If you click the plus symbol to add a cluster from VSC, you’ll get a prompt asking if you want to open a web browser to your Databricks workspace. The link the extension tries to open is a malformed URL; I submitted an issue to get that fixed.

Since this is a free 14-day trial, we’ll spin up a multi-node cluster. The wizard will name the new cluster “<Your Name>’s Cluster” by default; feel free to change that. Since this is an unshared demo, we’ll put it in single-user access mode. We’ll leave the runtime set to 11.3, since that’s the current “Long Term Support” version of the Databricks Runtime.

For worker type, leave it at Standard_DS3_v2. That’s plenty of horsepower for our test. I would set the minimum number of workers to 1 and the maximum to 4. If you get a warning to the right of Max workers, you’ll have to go into your Azure quotas and request an increase.

Azure Quotas

Quotas are a way of controlling your costs in Azure by limiting how many of a given resource type you can provision. In my case, the default quotas are still in place for compute resources in East US2. To resolve that, I open Quotas in the Azure portal, click Compute, then limit my regions to East US2.

You’ll notice there are still 135 quotas that could be set in East US2. Since we’re trying to provision Standard_DS3_v2 workers, enter “Standard DS” into the search box. That drops the number of quotas to six.

You’ll notice that my quota for Standard DSv2 Family vCPUs is currently 0 of 15. 15 is what the original warning said I had available. All I have to do is hit the pencil icon at the far right of this line and request 20 as my new limit. It takes a couple of minutes for the quota to increase.
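If you’d like to check those limits without hunting through the portal, here’s a minimal sketch using the azure-mgmt-compute package; the subscription ID is a placeholder, and you’d need azure-identity installed and an active Azure login for DefaultAzureCredential to work.

# Minimal sketch: list vCPU quota usage for East US2 with azure-mgmt-compute.
# Assumes `pip install azure-identity azure-mgmt-compute` and an active login.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = "<your-subscription-id>"   # placeholder
client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

for usage in client.usage.list("eastus2"):
    if "DSv2" in usage.name.value:           # e.g., standardDSv2Family
        print(f"{usage.name.localized_value}: {usage.current_value} of {usage.limit}")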

Once you see the success message, we can return to the Create Cluster page in Databricks. If you refresh the page, you’ll notice the warning goes away. The only other setting I changed on this page is “Terminate after 120 minutes of inactivity,” which I lowered to 30 minutes. There’s no need to leave a cluster running for two hours without any work; if you were paying for your compute, those unused hours would cost you!
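If you’d rather script this instead of clicking through the wizard, the same settings map onto the Databricks Clusters REST API. Here’s a rough sketch, not the exact call the extension makes; the workspace URL and personal access token are placeholders, and the spark_version string for 11.3 LTS may differ slightly in your workspace.

# Rough sketch: create a comparable cluster through the Clusters API (2.0).
# The host and token below are placeholders you'd replace with your own.
import requests

host = "https://<your-workspace>.azuredatabricks.net"   # placeholder
token = "<personal-access-token>"                        # placeholder

payload = {
    "cluster_name": "Shannon's Cluster",
    "spark_version": "11.3.x-scala2.12",    # 11.3 LTS; verify in your workspace
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 1, "max_workers": 4},
    "autotermination_minutes": 30,
}

response = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
print(response.json())  # returns the new cluster_id on success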

Now you’re ready to create your cluster. After clicking Create Cluster, you may have to wait five minutes or more for the cluster to finish building. While it’s being set up, you can already see it over in VSC.

Databricks also starts your cluster as soon as it’s set up.

Work with Databricks Workspace in VSC

By default, Visual Studio Code lists Explorer as the first icon on your left pane. You use it to open a single folder (or repository) and explore the files inside. The problem is that many projects require you to open several folders simultaneously to work effectively. For example, when building ETL projects, you would open your source code folder and your source data folder so you can write code to ingest the files.

When you work with Databricks projects, you want to open both your local copy of the repo and your Databricks workspace. The Databricks extension lets you do that easily: open the workspace section in the Databricks extension and click the “Add Databricks Workspace to VSCode workspace” button.

When you return to the Extension pane, you will see your Databricks workspace and any folders you have opened. In my case, I have a training repository that holds demo code I’ll share throughout this blog series.

Let’s create our first notebook in our local Databricks workspace. Right-click your user folder and choose “New File.” Give it any name with the extension .ipynb. Right-click your new file, choose “Open with,” and then “Configure default editor for ‘*.ipynb’.” Then select “Jupyter Notebook.”

From now on, you can double-click .ipynb files and they will open in the Jupyter Notebook editor.

Our first notebook will look at the sample data available in every Databricks workspace. Databricks ships file system utilities in dbutils.fs; we’ll use dbutils.fs.ls to list the contents of the /databricks-datasets folder.

dbutils.fs.ls("/databricks-datasets")

Hit Shift + Enter to run the cell. You’ll be prompted to select an engine to run it against; choose the cluster you created in the previous step.

Within a few seconds, you should see results.
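As a quick follow-up, you can list any of those sample folders the same way, or pull the first few kilobytes of a file with dbutils.fs.head. This is just a sketch that assumes the datasets README still lives at its usual path.

# Sketch: print the start of the datasets README with dbutils.fs.head.
# Assumes /databricks-datasets/README.md still exists at this path.
print(dbutils.fs.head("/databricks-datasets/README.md"))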

Using your new setup, you could work through Microsoft’s NYC Taxi workshop!

Conclusion

You’ve now set up a cluster and used that cluster to look at demo data. You could continue this way for all of your analysis work. Next time, we’ll take this a step further and set up a local spark engine to develop our data engineering workloads offline. That way, we’re not paying for development and debugging time. We’re only paying for our production workloads.

As usual, if you have any questions, please let me know!
