Getting started with Databricks!

Neeharika Kusampudi
6 min read · Dec 9, 2020


If you are new to Databricks like me and don't know where to start, then this article is for you. I have taken up a very simple data science project to demonstrate how to get comfortable with Databricks, and I will take you through it step by step.

I will be working on the Iris dataset obtained from Kaggle. At this point, I am assuming that you have already registered and have the account ready. We will be using the Databricks Community Edition; click here to be redirected to the Databricks Community Edition registration page.

After you have created the account, it should look somewhat like this.

Home page of Databricks Community Edition

For Databricks to perform any computation, you need to create a cluster; think of it as the CPU of Databricks.

Click on the Clusters icon, located on the black bar at the bottom left of your home page under the Data icon, then click on 'Create Cluster'.

Give the cluster a name and pick a Databricks Runtime Version (you can pick any version and any name, there are no restrictions), then click on 'Create Cluster' at the top.

Enter your cluster details here

Once the cluster is created, we will proceed to create a notebook. When you click on Home, it should look like the image below. To create a notebook, right-click on the space underneath your email ID.

How to create a new notebook

One disadvantage: if the cluster is not used for a while, it goes into an inactive state, and there is no way to restart an inactive cluster in the Community Edition.

Moving on, give the notebook a name as per the instructions; the default language will already be set to Python, which is what we will use for this particular project.

As I mentioned earlier, I will be working on a simple project, so why not work on the 'hello world' of data science: pattern recognition on the Iris dataset. First things first! Upload the dataset onto Databricks. It's quite simple, actually: all you need to do is go to the Data tab and select 'Create Table', as shown below.

This gives you the option to upload your data from local storage, cloud storage, or DBFS. For now, we will upload the file from local storage. Note that I have made a few changes to the header in Excel.

Click on the browse option, as shown above, and select the file from your local storage. As you can see, there are two options below: "Create Table with UI" and "Create Table in Notebook". My understanding is that "Create Table with UI" saves the file in DBFS, while "Create Table in Notebook" creates the table using code in a notebook. Here's the link for a better understanding: https://docs.databricks.com/data/tables.html

When you select "Create Table with UI", the next thing it asks you to do is select a cluster. Use the cluster you created at the beginning of this article, and don't worry if it has gone inactive by the time you reach this step: just clone the existing cluster, or you can always create a new one.

The last step in uploading the file is to specify the table attributes. Databricks gives us the opportunity to select the data type of each attribute before it is uploaded to DBFS. Make the necessary changes; for this project, everything except the class column will be of the double data type.

If you are a curious cat and want to see whether your file has been uploaded, or if you are wondering where it went, this is the place to go:

DBFS → FileStore → Tables → look for the file you uploaded

The important thing to note here is the path shown at the bottom. Keep it in mind; we will need it to read the CSV file.
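If you'd rather check from a notebook instead of the UI, here is a minimal sketch using the built-in dbutils utility (the folder is where the upload UI places files; the exact file name is an assumption, so look for yours in the output):

```python
# List the files uploaded to the default tables folder in DBFS.
# Look for your CSV (e.g., an assumed "Iris.csv") in the listing.
display(dbutils.fs.ls("dbfs:/FileStore/tables/"))
```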

Like I said, I will be working on the Iris dataset. The code and the dataset are available everywhere on the internet, but let me make your life easy! You will find the code here, though you need to sign up (it's free, free, free); they have a lot of great content related to data science and ML. You will find the dataset here.

Before we go any further, there are two important things to know. Numero uno! If you are trying to install a package, for example pip install nltk, the way you need to do that in Databricks is %sh pip install nltk, or it will throw an error.
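That is, the install command goes in its own notebook cell:

```
%sh pip install nltk
```

The %sh magic runs the cell contents as a shell command on the driver node, which is why the plain pip command fails inside a regular Python cell.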

Numero dos! I have used a different field name for “species” — I called it “class”.

Note that I won't be explaining in detail what each piece of the code does.

I didn't have to install any packages for this, possibly because I had already installed the ones required. Next, I imported all the libraries in one place and read the CSV file from DBFS; observe that the path here is the same path we discussed before.
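As a rough sketch, reading the file looks like this (the path and file name below are assumptions; use the exact path you noted on the upload screen):

```python
# Read the uploaded CSV from DBFS into a Spark DataFrame.
# The path is an assumption -- substitute the one from your upload.
df = spark.read.csv("/FileStore/tables/Iris.csv", header=True, inferSchema=True)
display(df)
```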

Here, I have converted it to a pandas DataFrame so that the code found on the internet can be used directly; Spark DataFrames use a different set of commands.
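The conversion itself is a single call, assuming df is the Spark DataFrame from the sketch above:

```python
# Convert the Spark DataFrame to pandas so standard
# pandas / scikit-learn code works unchanged.
pdf = df.toPandas()
pdf.head()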

Then I converted the class to a number, assigning a unique integer to each category using a label encoder, and implemented a logistic regression model to classify the species. For that, we need train and test datasets.
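A minimal sketch of that pipeline with scikit-learn, assuming pdf is the pandas DataFrame from above and the label column is named "class" as mentioned earlier:

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Encode the string labels ("Iris-setosa", ...) as integers 0, 1, 2.
le = LabelEncoder()
y = le.fit_transform(pdf["class"])
X = pdf.drop("class", axis=1)

# Hold out a test set so accuracy is measured on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a logistic regression classifier on the training split.
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
```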

The following are the results and accuracy for the logistic regression algorithm used for classifying the species.
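For reference, here is a sketch of how those numbers are computed, continuing from the model fitted above:

```python
from sklearn.metrics import accuracy_score, classification_report

# Score the model on the held-out test set.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=le.classes_))
```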

And that’s it, folks! This is the summary of my one-month struggle to understand Databricks. Let me know if you would do anything differently, and let me know what you think of my article.

Have a question? Reach out to me on LinkedIn. And if you liked this article, give it a few claps; I will sincerely appreciate it.
