How Azure Databricks Improves Machine Learning Processes

Introduction
1. Understanding Azure Databricks
2. Learning Roadmap for Azure Databricks in ML Workflows
3. ML Workflow Using Azure Databricks
- 3.1 Steps in a Typical ML Workflow
- 3.2 Diagram: End-to-End ML Workflow in Azure Databricks
4. Hands-on Example: Predicting House Prices with Azure Databricks
5. Advanced Concepts in Azure Databricks ML Workflows

Introduction

Azure Databricks is a cloud-based analytics platform optimized for big data and machine learning (ML) workflows. It integrates seamlessly with Microsoft Azure services, allowing data engineers, data scientists, and analysts to build scalable ML models efficiently. This guide will take you through Azure Databricks from its basics to advanced ML applications.

1. Understanding Azure Databricks

1.1 What is Azure Databricks?

Azure Databricks is a managed Apache Spark-based analytics platform optimized for cloud computing. It provides an interactive workspace for data preparation, machine learning, and analytics, leveraging distributed computing.

1.2 Key Features of Azure Databricks

Feature	Description
Unified Data Analytics	Combines ETL, ML, and streaming workloads in a single platform.
Optimized Apache Spark	Offers high-performance Spark clusters with auto-scaling.
Deep Integration with Azure	Connects seamlessly with Azure Data Lake, Azure SQL, and other services.
Collaborative Notebooks	Supports Python, Scala, R, and SQL in interactive notebooks.
Job Scheduling	Allows automation of workflows via Databricks Jobs.
Security & Compliance	Offers enterprise-grade security, RBAC, and data encryption.

1.3 Why Use Azure Databricks for ML?

Simplifies data engineering and ML lifecycle.
Handles massive datasets efficiently with Spark’s distributed computing.
Supports MLOps, automating ML model deployment and monitoring.
Provides built-in ML libraries and integrations with MLflow for tracking experiments.

2. Learning Roadmap for Azure Databricks in ML Workflows

Stage	Key Topics	Real-time Use Case
Basic	Introduction to Azure Databricks, Apache Spark, Databricks Notebooks	Uploading and processing CSV files using Databricks
Intermediate	Data exploration, feature engineering, data cleaning, MLlib	Customer segmentation using clustering (K-means)
Advanced	Deep learning, Hyperparameter tuning, MLflow integration	Fraud detection using a deep neural network
Deployment & MLOps	Model deployment, monitoring, A/B testing, CI/CD pipelines	Deploying a recommendation engine on a production website

3. ML Workflow Using Azure Databricks

3.1 Steps in a Typical ML Workflow

Data Ingestion: Import structured/unstructured data from Azure Data Lake, Blob Storage, or SQL databases.
Data Preprocessing: Clean, transform, and prepare data using PySpark and Pandas.
Feature Engineering: Extract and select features using Spark MLlib.
Model Training: Use Databricks MLlib or frameworks like TensorFlow, Scikit-learn.
Hyperparameter Tuning: Optimize model performance with a search tool such as Hyperopt, and track each trial with MLflow.
Model Deployment: Deploy models to Azure ML, REST APIs, or integrate with Power BI.
Model Monitoring & Retraining: Track model drift and automate retraining with MLOps.

3.2 Diagram: End-to-End ML Workflow in Azure Databricks

+-------------------------------------------+
| Data Ingestion (Azure Data Lake)          |
+-------------------------------------------+
                      ↓
+-------------------------------------------+
| Data Preprocessing (PySpark, MLlib)       |
+-------------------------------------------+
                      ↓
+-------------------------------------------+
| Feature Engineering (MLlib, Pandas)       |
+-------------------------------------------+
                      ↓
+-------------------------------------------+
| Model Training (Scikit-learn, TensorFlow) |
+-------------------------------------------+
                      ↓
+-------------------------------------------+
| Hyperparameter Tuning (MLflow)            |
+-------------------------------------------+
                      ↓
+-------------------------------------------+
| Model Deployment (Azure ML, API)          |
+-------------------------------------------+
                      ↓
+-------------------------------------------+
| Model Monitoring (MLOps, Retraining)      |
+-------------------------------------------+

4. Hands-on Example: Predicting House Prices with Azure Databricks

4.1 Dataset

Use the California Housing Dataset, available in sklearn.datasets.

4.2 Steps in Databricks Notebook

Step 1: Load Data

# Import necessary libraries
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load California Housing dataset
data = fetch_california_housing()

# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add target column (House Prices)
df['PRICE'] = data.target * 100000  # Convert price to dollars (California dataset target is in $100,000)

# Display first 5 rows
print(df.head())

For Beginners:

The line:

df = pd.DataFrame(data.data, columns=data.feature_names)

Imagine you have a big table (like an Excel sheet) filled with data about housing: things like median income, house age, and average number of rooms for each neighborhood. The numbers are stored in data.data, the column names in data.feature_names, and the prices separately in data.target.

Now, you want to convert this table into a format that is easy to see, analyze, and use in Python. That’s where the Pandas DataFrame comes in.

data.data → This is the actual table (the numbers and values), without the price.
data.feature_names → These are the column names (like "MedInc", "HouseAge", "AveRooms").
pd.DataFrame(...) → Converts everything into a proper table format.
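
To make the labeling step concrete without any libraries, here is a minimal plain-Python sketch of the same idea. The helper name label_rows and the shortened three-column rows are illustrative assumptions, not part of pandas or the dataset API.

```python
# Plain-Python sketch of what pd.DataFrame(data.data, columns=data.feature_names) does:
# attach column names to unlabeled rows. The values are illustrative.
feature_names = ["MedInc", "HouseAge", "AveRooms"]
raw_rows = [
    [8.3252, 41.0, 6.9841],
    [8.3014, 21.0, 6.2381],
]

def label_rows(rows, names):
    """Pair every value in a row with its column name."""
    return [dict(zip(names, row)) for row in rows]

labeled = label_rows(raw_rows, feature_names)
print(labeled[0]["HouseAge"])  # 41.0
```

A real DataFrame adds much more (indexing, types, vectorized operations), but the core benefit is the same: every number now has a name.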

Example:

Let’s say, as a simplified made-up illustration, the original dataset has these values:

Rooms	Size (sq ft)	Price
3	1200	$200K
4	1500	$250K
2	900	$150K

After running df = pd.DataFrame(data.data, columns=data.feature_names), you get:

      Rooms  Size (sq ft)  Price
0      3         1200    200000
1      4         1500    250000
2      2          900    150000

Now, df is a structured table that can be used for analysis, visualization, or training an ML model.

Difference Between the Original Dataset and a DataFrame

Think of it like this:

Original Dataset (data.data):
- It is like raw data in a spreadsheet with only numbers and no labels.
- It looks like a plain table but doesn’t tell you what each column represents.
DataFrame (df = pd.DataFrame(data.data, columns=data.feature_names)):
- It adds proper column names so you know what each number means.
- It converts raw data into a structured, labeled table, making it easier to understand and analyze.

Example to Visualize It:

Original Dataset (`data.data`)

Imagine a table with just numbers:

8.3252	41.0	6.9841	1.0238	322.0	2.5556	37.88	-122.23
8.3014	21.0	6.2381	0.9719	2401.0	2.1098	37.86	-122.22

You cannot tell what these numbers represent.

DataFrame (`df`)

After using pd.DataFrame, it looks like this:

MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude
8.3252	41.0	6.9841	1.0238	322.0	2.5556	37.88	-122.23
8.3014	21.0	6.2381	0.9719	2401.0	2.1098	37.86	-122.22

Now you clearly know what each column represents:

MedInc → Median income of households in the area
HouseAge → Median age of the houses in the area
AveRooms → Average number of rooms per household
Latitude, Longitude → Location of the area

👉 In short: The original dataset is just numbers. The DataFrame organizes it properly with column names, making it easier to read and use.

Step 2: Data Preprocessing

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark session
spark = SparkSession.builder.appName("HousePrices").getOrCreate()

# Convert Pandas DataFrame to Spark DataFrame
df_spark = spark.createDataFrame(df)

# Filter out invalid prices (not needed for this dataset but good practice)
df_spark = df_spark.filter(col("PRICE") > 0)  

# Show schema
df_spark.printSchema()

# Show sample data
df_spark.show(5)

Step-by-Step Explanation of the Code

1. Creating a Spark Session

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("HousePrices").getOrCreate()

Explanation:

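A Spark session is the entry point for working with Spark, a bit like opening a workbook in Excel before you can work with its sheets.
builder.appName("HousePrices") gives the session a name (useful for debugging).
.getOrCreate() starts a new session or returns the existing one. In a Databricks notebook a session named spark already exists, so this call simply returns it.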

Think of it like: Opening a Google Spreadsheet to start working on house price data.

2. Converting a Pandas DataFrame to a Spark DataFrame

df_spark = spark.createDataFrame(df)

Explanation:

df is a Pandas DataFrame (a small, local dataset).
createDataFrame(df) converts it into a Spark DataFrame, which is optimized for handling huge datasets using distributed computing.

Think of it like: Converting a small Excel file into a Google Sheets document that can be shared and processed by multiple people at once.

3. Filtering Out Negative Prices

from pyspark.sql.functions import col
df_spark = df_spark.filter(col("PRICE") > 0)

Explanation:

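Real-world data can contain invalid prices (zero or negative) caused by data errors. The California Housing data has none, so here the filter is only a safety check.
col("PRICE") > 0 selects only rows where the PRICE is greater than zero.
filter(...) keeps the rows that pass this check and drops the rest.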

Think of it like: Deleting rows in Excel where the price of a house is negative because those values don’t make sense.

Summary of What’s Happening

Start a Spark session (like opening an Excel sheet).
Convert the data into Spark format (to handle big datasets efficiently).
Remove incorrect/negative price values (cleaning the data before analysis).

Step 3: Train a Regression Model

from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=data.feature_names, outputCol="features")
df_spark = assembler.transform(df_spark)

train, test = df_spark.randomSplit([0.8, 0.2], seed=42)

lr = LinearRegression(featuresCol="features", labelCol="PRICE")
model = lr.fit(train)

This code is used to train a machine learning model to predict house prices using Apache Spark in Azure Databricks.

Prepare the Data for Machine Learning

from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=data.feature_names, outputCol="features")
df_spark = assembler.transform(df_spark)

What is happening here?

Imagine you have a table with many columns like MedInc (median income), HouseAge, AveRooms, and PRICE.
However, Spark ML models don’t read separate columns; they expect all input features combined into a single vector column.
VectorAssembler takes the feature columns (everything in data.feature_names, so not PRICE) and packs them into a new column called "features".
The model then learns from that one column.

Real-world analogy:
Think of this like putting all ingredients (flour, sugar, milk) into one bowl before making a cake. The model needs all information in one place before "cooking" (training).
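
The packing step can be sketched in plain Python. This is only an illustration of the idea, not how Spark implements VectorAssembler; the function name assemble and the two sample rows are assumptions made for the example.

```python
# Plain-Python sketch of the idea behind VectorAssembler (not the Spark implementation):
# copy the chosen columns of each row into one list stored under "features".
def assemble(rows, input_cols, output_col="features"):
    assembled = []
    for row in rows:
        new_row = dict(row)  # keep the original columns untouched
        new_row[output_col] = [float(row[c]) for c in input_cols]
        assembled.append(new_row)
    return assembled

rows = [
    {"MedInc": 8.3252, "HouseAge": 41.0, "PRICE": 452600.0},
    {"MedInc": 8.3014, "HouseAge": 21.0, "PRICE": 358500.0},
]
out = assemble(rows, ["MedInc", "HouseAge"])
print(out[0]["features"])  # [8.3252, 41.0]
```

Note that PRICE stays out of the feature list: it is the label the model is trying to predict, so it must not be one of the inputs.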

Split Data into Training and Testing Sets

train, test = df_spark.randomSplit([0.8, 0.2], seed=42)

What is happening here?

We divide the dataset into two parts:
- Training Data (about 80%) → Used to train the model.
- Testing Data (about 20%) → Used to check how well the model performs on unseen data.
- seed=42 makes the random split reproducible, so we get the same split every time we run the code on the same data.
Evaluating on rows the model has never seen shows whether it has genuinely learned patterns or merely memorized the training data.
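
The effect of a fixed seed can be shown with a small plain-Python sketch. It is a simplified stand-in: the helper random_split below shuffles and cuts a list, whereas Spark's randomSplit assigns rows randomly, so its split sizes are only approximately 80/20.

```python
import random

def random_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows with a fixed seed, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

print(len(train_a), len(test_a))  # 8 2
print(train_a == train_b)         # True: same seed, same split
```

Changing the seed gives a different but equally valid split; keeping it fixed makes results comparable between runs.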

Real-world analogy:
Think of this like preparing for an exam:

You study (train) using 80% of the syllabus.
You test yourself (evaluate) with the remaining 20% to see if you really understood.

Train the Machine Learning Model

from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol="features", labelCol="PRICE")
model = lr.fit(train)

lr.fit(train) finds the coefficients (one per feature, plus an intercept) that best fit PRICE on the training data. The resulting model predicts house prices from features such as median income, house age, average rooms, and location.

Step 4: Model Evaluation

# Evaluate model on test data
evaluations = model.evaluate(test)

# Print metrics
print("RMSE:", evaluations.rootMeanSquaredError)
print("R2 Score:", evaluations.r2)

What Do RMSE & R² Mean in Simple Terms?

1. RMSE (Root Mean Squared Error) – Measures Prediction Error

Think of RMSE as the typical size of your prediction errors, in the same units as the target (dollars here). Errors are squared before averaging, so large mistakes count more than small ones.
Lower RMSE = Better Predictions
Example:
- If RMSE = 5000, your house price predictions are typically off by about $5000.
- If RMSE = 10000, your predictions are typically off by about $10000, which means a worse model.

2. R² Score (R-Squared) – Measures How Well the Model Explains the Data

R² tells us how much of the changes in house prices our model can explain.
Closer to 1 = Better Model
Example:
- R² = 0.90 → Your model explains 90% of house price variations (very good).
- R² = 0.50 → Your model explains only 50%, meaning it's not very reliable.
- R² = 0.00 → Your model is no better than always predicting the average price.
- R² < 0 → Your model is worse than just taking the average price.
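
Both metrics are short formulas, and computing them by hand helps to see what Spark reports. The sketch below uses made-up prices; the function names rmse and r2_score are illustrative and are not the Spark API.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

def r2_score(actual, predicted):
    """1 minus (model's squared error / squared error of always predicting the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [200000.0, 250000.0, 150000.0]
predicted = [210000.0, 240000.0, 160000.0]
baseline = [sum(actual) / len(actual)] * len(actual)  # always predict the mean

print(rmse(actual, predicted))                # 10000.0
print(round(r2_score(actual, predicted), 2))  # 0.94
print(r2_score(actual, baseline))             # 0.0
```

The last line shows why R² = 0 corresponds to "always predict the average": that baseline's error is exactly the denominator of the formula.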

Real-Life Analogy:

Imagine you're throwing darts at a target (actual house prices):

Low RMSE & High R²: Most of your darts are close to the bullseye.
High RMSE & Low R²: Your darts are all over the place, missing the target.

Example Interpretation

Model	RMSE	R² Score
Model A	5000	0.85
Model B	10000	0.65

Model A is better because it has a lower RMSE and higher R², meaning it predicts more accurately.
Model B has a higher RMSE, meaning its predictions are farther from actual values, and lower R², meaning it explains less of the variance in the data.

In Short:

RMSE tells us how much error is in predictions (lower is better).
R² tells us how well the model understands the data (closer to 1 is better).

Step 5: Model Deployment with MLflow

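import mlflow
import mlflow.spark

# On Databricks, experiment names are workspace paths, so you may need e.g. "/Users/<your-user>/House Price Prediction"
mlflow.set_experiment("House Price Prediction")

with mlflow.start_run():
    mlflow.log_param("algorithm", "Linear Regression")
    mlflow.log_metric("RMSE", evaluations.rootMeanSquaredError)
    # The model is a Spark ML model, so it is logged with the Spark flavor (not mlflow.sklearn)
    mlflow.spark.log_model(model, "model")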

Breaking Down the Code Step by Step

import mlflow

Loads MLflow, the tool we use to track and manage the model.

1️⃣ Set Up an MLflow Experiment

mlflow.set_experiment("House Price Prediction")

What this does?

Creates (or selects) an experiment named "House Price Prediction".
The experiment groups every run together with its parameters, metrics, and saved models.

🔹 Think of it as a folder 📁 where all details of the training process are stored.

2️⃣ Start a New MLflow Run

with mlflow.start_run():

What this does?

Begins tracking the training process.
Everything inside this block gets logged.

🔹 Think of this as pressing 'Record' on the experiment.

3️⃣ Log Model Parameters

mlflow.log_param("algorithm", "Linear Regression")

What this does?

Logs important settings of the model.
Here, we store "algorithm": "Linear Regression".

🔹 Why?

If we try different models (e.g., Random Forest, XGBoost), we can compare them later.

4️⃣ Log Performance Metrics

mlflow.log_metric("RMSE", evaluations.rootMeanSquaredError)

What this does?

Stores the RMSE (Root Mean Squared Error), a measure of prediction error.
RMSE tells us how far predictions are from actual values.

🔹 Example:

Model A: RMSE = 4.5  
Model B: RMSE = 3.2  ✅ (Better)

We can track multiple models and compare their performance over time.

5️⃣ Save the Trained Model

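mlflow.spark.log_model(model, "model")

What this does?

Saves the trained model as a run artifact under the name "model". The Spark flavor (mlflow.spark) is used because the model was trained with pyspark.ml, not scikit-learn.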
Stores it inside MLflow’s tracking system.

🔹 Why?

We can reload & use this model later for predictions without retraining!

What Happens After Running This Code?

The model and its details are stored in MLflow.
You can view the experiment in the MLflow UI (Databricks or local).
The model is ready to be deployed for real-world predictions.

How to Deploy & Use the Model?

After saving, we can reload the model and use it anywhere!

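import mlflow
import mlflow.spark

# Load the model logged by the most recent run (Spark flavor, matching a pyspark.ml model)
run_id = mlflow.last_active_run().info.run_id
loaded_model = mlflow.spark.load_model(f"runs:/{run_id}/model")

# Make predictions: the input needs the same "features" vector column used in training
predictions = loaded_model.transform(test)
predictions.select("PRICE", "prediction").show(5)

The model is now loaded and can score data inside the notebook. Serving it to other applications as an API is covered in section 5.2.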

5. Advanced Concepts in Azure Databricks ML Workflows

5.1 Hyperparameter Tuning with Hyperopt

What is Hyperparameter Tuning?

Machine learning models have settings (hyperparameters) that affect performance.
Finding the best settings manually is time-consuming.
Hyperopt is an automatic search tool that finds the best settings for you.

Types of Hyperparameters in ML

Learning Rate (lr) – Controls how fast the model learns.

Too high? Learns too fast and misses important patterns.
Too low? Learns too slowly and may never finish.

Number of Trees (for Random Forest/XGBoost) – More trees = better accuracy but slower speed.

Regularization (regParam) – Prevents the model from memorizing the data (overfitting).

Batch Size (for Deep Learning) – Controls how much data is processed at once.

How Are Hyperparameters Different from Model Parameters?

Feature	Hyperparameters	Model Parameters
Definition	Settings we manually choose	Values learned from data
Examples	Learning rate, number of trees, batch size	Coefficients in linear regression, weights in neural networks
Who Sets It?	You (or an optimizer like Hyperopt)	The machine learning model

Hyperparameter Tuning 🛠️

Since manually testing hyperparameters is hard, we use tuning techniques like:

Grid Search – Tries every combination on a predefined grid of values (thorough but slow).
Random Search – Picks random values and checks performance.
Bayesian Optimization (like Hyperopt) – Learns from past attempts to find the best values faster.
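
The difference between the first two strategies can be sketched in a few lines of plain Python. The function validation_error is a toy stand-in for "train a model and measure its error", chosen (as an assumption for this example) so that the best value is 0.05.

```python
import random

def validation_error(reg_param):
    """Toy stand-in for 'train a model and return its validation error'; lowest at 0.05."""
    return (reg_param - 0.05) ** 2

# Grid search: evaluate every value on a fixed grid
grid = [0.01, 0.03, 0.05, 0.07, 0.09]
best_grid = min(grid, key=validation_error)

# Random search: evaluate randomly sampled values from the same range
rng = random.Random(0)
samples = [rng.uniform(0.01, 0.1) for _ in range(20)]
best_random = min(samples, key=validation_error)

print(best_grid)    # 0.05
print(best_random)  # the sampled value closest to 0.05
```

Grid search only finds the optimum if it happens to lie on the grid; random search can land between grid points. Bayesian methods go one step further and use earlier results to decide where to sample next.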

Step-by-Step Guide to Hyperparameter Tuning with Hyperopt

Step 1: Install Hyperopt (If Not Installed)

If Hyperopt is not installed in your Databricks or Python environment, install it using:

%pip install hyperopt

Step 2: Import Required Libraries

First, we import Hyperopt to perform tuning.

from hyperopt import fmin, tpe, hp
from pyspark.ml.regression import LinearRegression

fmin → Think of this as a treasure hunter. It searches for the combination of hyperparameters that gives the lowest value of the objective (here, the lowest error).

tpe (Tree-structured Parzen Estimator) → This is a smart search strategy. Instead of testing random values, it learns which values are better and focuses on those, making the search faster and more efficient.

hp (Hyperparameter Space) → This defines the range of values for each hyperparameter. For example, you can tell it:

"Try values between 0.01 and 0.1"
or "Choose between batch sizes of 32, 64, or 128."

Step 3: Define an Objective Function

The objective function tells Hyperopt what to minimize.
In this case, we want to minimize RMSE (Root Mean Squared Error) because lower RMSE = better model.

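# Hold out part of the training data for validation, so the test set stays untouched
tune_train, tune_val = train.randomSplit([0.8, 0.2], seed=42)

def objective(params):
    # Create Linear Regression model with a changing hyperparameter (regParam)
    lr = LinearRegression(featuresCol="features", labelCol="PRICE", regParam=params['regParam'])
    model = lr.fit(tune_train)

    # Return validation RMSE as the metric to minimize
    # (training RMSE from model.summary would almost always favor the smallest regParam)
    return model.evaluate(tune_val).rootMeanSquaredError

🔹 What is happening here?

We define a function that takes params (hyperparameters).
regParam (Regularization Parameter) helps prevent overfitting.
The model is trained with this parameter on the training portion, with labelCol="PRICE" so Spark knows which column to predict.
We return RMSE on held-out validation data, because the benefit of regularization only shows up on data the model has not seen.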

Step 4: Define the Search Space

We tell Hyperopt the range of values to test for the hyperparameter.

space = {'regParam': hp.uniform('regParam', 0.01, 0.1)}

Understanding `hp.uniform('regParam', 0.01, 0.1)`

hp.uniform('regParam', 0.01, 0.1) → Samples regParam values uniformly between 0.01 and 0.1.
Hyperopt tries multiple values within this range to find the best one.

🔹 What does this mean?

Small regParam → Model may overfit (too closely follows training data).
Large regParam → Model may underfit (too simple to capture patterns).

There are two major problems in model learning:
🔹 Overfitting → The model memorizes the training data but fails on new data.
🔹 Underfitting → The model is too simple and fails to learn important patterns.

Effect of `regParam` on the Model

`regParam` Value	Effect on Model
Too Small	Regularization has little effect, so the model stays very flexible → May overfit the training data (high accuracy in training, poor generalization).
Too Large	Coefficients are shrunk too much, so the model becomes too simple → Underfits (fails to learn enough patterns from the data).
Optimal Value	Balances bias and variance, generalizing well to new data.
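
How a regularization parameter shrinks a model can be seen in a one-feature case with a closed-form answer. The sketch below minimizes the sum of squared errors plus reg_param * w² for a model y ≈ w·x with no intercept; this is a simplified textbook form, and Spark scales its loss differently, so the numbers are not comparable to regParam values in pyspark.

```python
def ridge_weight(xs, ys, reg_param):
    """Closed-form weight for y ≈ w * x with an L2 penalty reg_param * w**2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + reg_param)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

for reg_param in [0.0, 1.0, 10.0, 100.0]:
    print(reg_param, round(ridge_weight(xs, ys, reg_param), 3))
# 0.0 2.0
# 1.0 1.867
# 10.0 1.167
# 100.0 0.246
```

With no penalty the weight is the true value 2.0; as the penalty grows, the weight is pulled toward zero, which is the "too simple" end of the table above.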

Step 5: Run Hyperparameter Optimization

Now, we run Hyperopt to find the best value for regParam.

best_params = fmin(fn=objective,  # Function to minimize
                   space=space,   # Search space
                   algo=tpe.suggest,  # Optimization algorithm
                   max_evals=20)  # Number of attempts

print("Best Parameters:", best_params)

🔹 What’s Happening Here?

fn=objective (Function to minimize)

This is the goal of the search → We are trying to reduce the error (or loss) of our model.
The objective function runs the model with different settings and returns a loss value.
The lower the loss, the better.

space=space (Search space)

This tells the search what values to try.
Example: Here we are tuning regParam, so we say:
👉 "Try values between 0.01 and 0.1."

algo=tpe.suggest (Search method)

TPE (Tree-structured Parzen Estimator) is a smart algorithm that learns from past attempts.
Instead of randomly guessing, it picks smarter values based on previous results.

max_evals=20 (Number of attempts)

The algorithm will try 20 different settings before deciding the best one.

print("Best Parameters:", best_params)

After trying 20 different options, it prints the best combination.

Final Step: Train Model with Best Parameters

Once we get the best value for regParam, we use it in our final model:

best_model = LinearRegression(featuresCol="features", labelCol="PRICE", regParam=best_params['regParam']).fit(train)

🔹 Now, we have a model with the best hyperparameters!

Example

Imagine you are finding the best recipe for a cake 🎂.

fn=objective → You bake a cake and taste it (objective function checks the taste).
space=space → You define what ingredients (e.g., sugar, butter, flour) to test.
algo=tpe.suggest → Instead of randomly adding ingredients, you learn from past cakes and adjust smartly.
max_evals=20 → You bake 20 cakes, testing different recipes before deciding the best one.
print(best_params) → Finally, it tells you the best recipe!

Summary (Super Simple Version)

1. Hyperopt automatically finds the best settings for our model.
2. We define what to optimize (minimize RMSE).
3. We tell Hyperopt the range of values to test.
4. It tries different values and picks the best one.
5. We train the final model using the best settings.

5.2 Deploying ML Models as APIs

In today’s world, businesses need real-time machine learning predictions. Instead of running a model manually every time, we can deploy it as an API so that other applications (like websites, mobile apps, or dashboards) can call it and get predictions instantly.

Databricks makes this process easy by using MLflow, an open-source tool that helps train, track, and deploy ML models as APIs.

What is an API? (Simple Explanation)

API (Application Programming Interface) → It’s like a waiter in a restaurant who takes your order and brings back food from the kitchen.

You send a request (like ordering a meal 🍔).
The kitchen (ML model) processes it.
The waiter (API) returns the result (your food 🍽️).

Example:
A weather app 📱 uses an API to get real-time temperature data from a server and display it on your screen. Similarly, we can use an API to send data to an ML model and get predictions in return.
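
The request/response cycle can be tried end to end with only the Python standard library. This is a toy sketch, not how MLflow serves models: the predict function is a hypothetical stand-in for a trained model, and the /predict path and the weights are assumptions made for the example.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Hypothetical stand-in for a trained model: a fixed linear formula."""
    return 50000.0 + 40000.0 * features["MedInc"]

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body, run the "model", and send the result back as JSON
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(features)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "client" side: send one house and read the prediction
request = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"MedInc": 8.0}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # skip any system proxy
with opener.open(request) as response:
    result = json.loads(response.read())

server.shutdown()
server.server_close()
print(result)  # {'prediction': 370000.0}
```

The handler plays the waiter, predict plays the kitchen, and the client code at the bottom is the customer placing an order.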

Steps to Deploy a Machine Learning Model as an API in Databricks

We will follow these 4 simple steps to deploy a model as an API.

1️⃣ Train and Save the Model

First, we need to train an ML model (e.g., a house price predictor) and save it using MLflow.

import mlflow
import mlflow.sklearn
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["PRICE"] = data.target  # target is in units of $100,000

# Train model
X = df.drop("PRICE", axis=1)
y = df["PRICE"]
model = LinearRegression()
model.fit(X, y)

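# Log the model and register it, so it can later be referenced as models:/house_price_model/...
# (in Unity Catalog workspaces, registered names take the form catalog.schema.model)
mlflow.sklearn.log_model(model, "house_price_model", registered_model_name="house_price_model")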

print("Model saved successfully!")

What’s Happening?

We load the California Housing dataset.
We train a Linear Regression model to predict house prices.
We use MLflow to save the trained model so it can be served later.

2️⃣ Deploy the Model as an API

Now, we deploy this model as a REST API so it can be accessed by external applications.

# URI of the latest registered version of the model
model_uri = "models:/house_price_model/latest"

# Serve the model as an API (this command keeps running until you stop it)
!mlflow models serve -m $model_uri -p 5001 --env-manager local

Define the Model URI

model_uri = "models:/house_price_model/latest"

What it does?

Points to the registered model named "house_price_model" in the MLflow Model Registry; nothing is loaded yet.
The "latest" suffix selects the most recently registered version.
The model must already be registered in MLflow for this URI to resolve.

Serve the Model as an API

!mlflow models serve -m $model_uri -p 5001 --env-manager local

What this command does?

Runs the model as a local web service (API) so that other programs can send data to it and get predictions.
-m $model_uri → Uses the house price model stored in MLflow.
-p 5001 → Runs the API on port 5001.
--env-manager local → Uses the current Python environment instead of building a new Conda or virtualenv environment, which makes startup faster. (Older MLflow 1.x releases used --no-conda for this.)

3️⃣ Call the API for Predictions

Once the model is deployed, we can send new data to it and get predictions.

import requests
import json

# Define the API endpoint
url = "http://localhost:5001/invocations"

# Sample input data
input_data = {
    "dataframe_split": {
        "columns": ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup", "Latitude", "Longitude"],
        "data": [[8.3252, 41.0, 6.9841, 1.0238, 322.0, 2.5556, 37.88, -122.23]]
    }
}

# Send request
response = requests.post(url, json=input_data, headers={"Content-Type": "application/json"})

# Print response
print("Prediction:", response.json())

🔹 What’s Happening?

We send new house details (e.g., income, age, population) to the API.
The API calls the ML model and returns the predicted house price. Because this model was trained on the unscaled target, the value is in units of $100,000.
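
When the input starts out as a list of records rather than a ready-made payload, a small helper can build the dataframe_split layout used in the request above. The helper name to_dataframe_split is an assumption for this sketch; it is not part of MLflow.

```python
import json

def to_dataframe_split(records):
    """Convert a list of {column: value} dicts into the dataframe_split request layout."""
    columns = list(records[0].keys())
    data = [[record[c] for c in columns] for record in records]
    return {"dataframe_split": {"columns": columns, "data": data}}

houses = [
    {"MedInc": 8.3252, "HouseAge": 41.0, "AveRooms": 6.9841, "AveBedrms": 1.0238,
     "Population": 322.0, "AveOccup": 2.5556, "Latitude": 37.88, "Longitude": -122.23},
]
payload = to_dataframe_split(houses)
print(json.dumps(payload, indent=2))
```

The resulting payload can be passed as json=payload to requests.post. Keeping column names and values in separate lists avoids repeating the names for every row when many houses are scored at once.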

Real-Life Example:
Just like Google Translate API translates text when you send a request, this API predicts house prices when you send housing data.

4️⃣ Deploy on Cloud for Public Access (Optional)

If you want others to use the API, deploy it on AWS, Azure, or Google Cloud.

Example using Azure Databricks:

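Register the model in MLflow

 Register the model in the MLflow Model Registry so it has a stable name, for example models:/house_price_model/latest.

Create a serving endpoint for it

 On Azure Databricks this can be done with Model Serving from the registered model's page; on Azure Machine Learning, with an online endpoint. The exact CLI commands and options depend on the service and tool version, so follow the current documentation rather than copying a fixed command.

Get the endpoint URL and use it from your applications

 The service gives you an HTTPS scoring URL, which is called the same way as the local API above.

Now, any application that has the URL and valid credentials can send data and get predictions!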

Why Deploy ML Models as APIs?

Feature	Why It's Useful?
Real-Time Predictions	Apps can get instant predictions (e.g., price recommendations 💰).
Easy Integration	APIs can be used in websites, apps, and dashboards.
Scalable	Serving endpoints can be scaled out to handle growing request volumes.
Cloud-Ready	Works on AWS, Azure, and Google Cloud.
Secure	We can control who accesses the API using authentication.

Conclusion

MLflow makes it super easy to deploy ML models as APIs.
Once deployed, apps and websites can call the API to get predictions in real time.
We can host the API on cloud services like AWS, Azure, or Google Cloud for public access.

5.3 Integrating Deep Learning (TensorFlow, PyTorch)

Deep learning is transforming the way machines see, understand, and predict! In Databricks, you can train deep learning models with frameworks like TensorFlow and PyTorch using the ML runtime. Let’s break it down step by step.

What is Databricks ML Runtime?

Databricks ML Runtime is a pre-configured environment that includes:

Deep Learning frameworks (TensorFlow, PyTorch, Keras)
GPU acceleration (for faster training)
Pre-installed libraries (No need to install everything manually!)

It saves time and lets you focus on model building instead of environment setup.

Step 1: Set Up Databricks for Deep Learning

Before training, ensure your Databricks cluster supports deep learning.

1️⃣ Create a Databricks cluster

Choose Databricks Runtime ML (e.g., ML 13.3 GPU for GPU support)
Select GPU instance type (if available)

2️⃣ Install required libraries (if not pre-installed)

%pip install torch torchvision tensorflow keras

Now your environment is ready for deep learning!

Step 2: Load Data

For this example, we’ll use the MNIST dataset (handwritten digits) in PyTorch.

import torch
import torchvision
import torchvision.transforms as transforms

# Transform: Convert images to tensors
transform = transforms.Compose([transforms.ToTensor()])

# Load MNIST dataset
train_data = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_data = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# Create data loaders
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=64, shuffle=False)

print("Data Loaded Successfully!")

🔹 What this does?

Downloads the MNIST dataset (28x28 grayscale images of numbers)
Converts them into tensors (so the model can process them)
Organizes data into batches for efficient training

Step 3: Build a Simple Deep Learning Model

Let’s create a simple Neural Network in PyTorch to classify handwritten digits.

How Does a Neural Network Work?

Think of it like a human brain:

It has layers of neurons (like brain cells).
Each neuron takes input, processes it, and passes it to the next layer.
It learns by adjusting connections (weights and biases) to improve predictions.

Imagine teaching a child to recognize animals:

You show pictures of different animals 🦁🐶🐱.
The child observes patterns (size, color, shape).
At first, they make mistakes, but learn from feedback.
Over time, they become better at recognizing animals.

import torch.nn as nn
import torch.optim as optim

# Define Neural Network
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 128)  # Input Layer
        self.fc2 = nn.Linear(128, 64)       # Hidden Layer
        self.fc3 = nn.Linear(64, 10)        # Output Layer (10 classes)

    def forward(self, x):
        x = x.view(-1, 28 * 28)  # Flatten image
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Initialize Model
model = SimpleNN()
print(model)

🔹 What this model does?

Takes a 28x28 image and flattens it into a single line of numbers
Passes it through 3 layers (input, hidden, output)
Uses ReLU activation for learning patterns (ReLU (Rectified Linear Unit) is a function that helps a Neural Network learn patterns efficiently by keeping only positive values and ignoring negative values.)

What is the Input? (28x28 Images)

The model takes a black-and-white image of a digit, which is 28x28 pixels in size.
Instead of looking at the image as a grid, we convert it into a single line of numbers (like unrolling a rug 🧵).
The new shape of the data = 1 row with 784 values (since 28×28 = 784).

Example:
Think of it as a spreadsheet with 784 columns, where each column represents a pixel’s brightness. The raw image stores values from 0 (black) to 255 (white); the ToTensor transform used above rescales them to the range 0 to 1.
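
Flattening is easy to see with a nested list standing in for an image. The sketch below is plain Python with a made-up image containing a single bright pixel; PyTorch does the same reshaping on tensors.

```python
# A 28x28 "image" as a nested list: all zeros except one bright pixel
image = [[0.0] * 28 for _ in range(28)]
image[2][5] = 1.0  # row 2, column 5

# Flatten row by row, which is what x.view(-1, 28 * 28) does for each image
flat = [pixel for row in image for pixel in row]

print(len(flat))        # 784
print(flat.index(1.0))  # 61, i.e. 2 * 28 + 5
```

The pixel at row 2, column 5 ends up at position 61 because two full rows of 28 pixels come before it.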

First Layer (Input Layer)

The first layer takes the 784 numbers and processes them using 128 neurons.
Each neuron learns to detect small features like edges, curves, and lines.

Analogy:
Imagine you're trying to recognize a face 👩‍🎨.

The first thing you notice is basic shapes (eyes, nose, mouth).
Similarly, this layer detects basic patterns in the image.
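To get a feel for how much this small network has to learn, we can count its adjustable numbers by hand. This is a rough sketch that assumes the layer sizes from SimpleNN (784 → 128 → 64 → 10), where every neuron has one weight per input plus one bias.

```python
# (inputs, neurons) for fc1, fc2, fc3 in SimpleNN
layer_sizes = [(784, 128), (128, 64), (64, 10)]

def count_parameters(n_inputs, n_neurons):
    """Each neuron has one weight per input, plus one bias."""
    return n_inputs * n_neurons + n_neurons

per_layer = [count_parameters(n_in, n_out) for n_in, n_out in layer_sizes]
total_parameters = sum(per_layer)

print(per_layer)         # [100480, 8256, 650]
print(total_parameters)  # 109386
```

Most of the parameters sit in the first layer, because it is connected to all 784 pixels.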

Second Layer (Hidden Layer)

The first layer passes its knowledge to the second layer.
This layer has 64 neurons and helps combine features into meaningful shapes (like loops or strokes).

Analogy:
Now, you start seeing more details—not just eyes and noses, but the full shape of a face! 😃

Output Layer (Final Prediction)

The last layer has 10 neurons (because there are 10 possible digits: 0-9).
Each neuron gives a score for one digit: the higher the score, the more likely the image shows that digit.
The highest score wins, and the model predicts that number.

Analogy:
Imagine you see a blurry picture of a cat or dog 🐶🐱.

Your brain analyzes the features (ears, eyes, nose) and guesses what it is.
Similarly, the model picks the most likely digit.
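Here is a small plain-Python sketch of "the highest score wins". The ten scores are made-up values, and the `softmax` helper is only there to show how raw scores can be turned into probabilities that add up to 1 (SimpleNN itself outputs raw scores).

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the digits 0-9
scores = [0.1, 0.3, 2.5, 0.2, 0.0, 0.4, 0.1, 4.0, 0.3, 0.2]

probabilities = softmax(scores)
predicted_digit = scores.index(max(scores))

print(predicted_digit)  # 7
print(f"{probabilities[predicted_digit]:.2f}")
```

The digit with the highest raw score is also the digit with the highest probability, so the prediction is the same either way.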

Activation Function (ReLU)

The model uses ReLU (Rectified Linear Unit) after the first and second layers, which helps it learn patterns better.
ReLU keeps positive values unchanged and turns negative values into zero (like ignoring unnecessary noise).

Analogy:
Think of it like highlighting important words in a book 📖—you keep only the useful information and ignore the rest!
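ReLU is simple enough to write in one line. A minimal plain-Python sketch, using made-up input values:

```python
def relu(x):
    """Return x if it is positive, otherwise 0."""
    return max(0.0, x)

values = [-2.0, -0.5, 0.0, 1.5, 3.0]
activated = [relu(v) for v in values]

print(activated)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```

`torch.relu` applies the same rule to every number in a tensor at once.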

Training the Model (Improving Accuracy)

The model makes mistakes at first (like a student learning math 🧮).
It uses a training algorithm to adjust itself over time.
The more it practices, the better it gets at recognizing digits correctly! ✅

Final Summary (Super Simple Version)

Takes a 28×28 image and flattens it into 784 numbers.
First layer detects basic patterns (edges, curves).
Second layer refines these patterns (full digit shapes).
Final layer makes a prediction (which digit it is).
ReLU activation helps the model learn effectively.
The model gets better over time by training on lots of examples.

Why is This Important?

This type of model is used in:

✅ Handwriting recognition (bank cheques, forms).
✅ CAPTCHA solving.
✅ License plate recognition.
✅ AI-powered OCR (Optical Character Recognition).

Step 4: Train the Model

Now, we train the model using a loss function and an optimizer.

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()  # Measures how wrong the predictions are
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adjusts weights to reduce error

# Training loop
for epoch in range(5):  # Train for 5 cycles
    for images, labels in train_loader:
        optimizer.zero_grad()  # Clear previous gradients
        output = model(images)  # Forward pass
        loss = criterion(output, labels)  # Calculate loss
        loss.backward()  # Backpropagation (computes gradients)
        optimizer.step()  # Update model

    print(f"Epoch {epoch+1} - Loss: {loss.item():.4f}")

print("Training Complete!")

✅ What happens here?

The model makes predictions on the images.
We calculate the error (loss).
The optimizer adjusts the model weights to improve accuracy.
We repeat this process for 5 cycles (epochs).
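The same predict → measure error → adjust → repeat cycle can be sketched without PyTorch on a tiny made-up problem: learning that y = 2 × x. This sketch uses plain gradient descent rather than Adam, and the data and learning rate are assumptions chosen for illustration.

```python
# Toy data: the "right answer" is w = 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0               # the single weight the model has to learn
learning_rate = 0.05  # hypothetical value
losses = []

for epoch in range(5):
    # Forward pass: make predictions
    predictions = [w * x for x in xs]
    # Loss: mean squared error
    loss = sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(xs)
    # Gradient of the loss with respect to w
    gradient = sum(2 * (p - y) * x for p, y, x in zip(predictions, ys, xs)) / len(xs)
    # Update step: move w against the gradient
    w = w - learning_rate * gradient
    losses.append(loss)
    print(f"Epoch {epoch + 1} - Loss: {loss:.4f} - w: {w:.4f}")
```

After five cycles, `w` ends up very close to 2 and the loss shrinks every epoch, which is the same pattern we hope to see when training the digit classifier.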

This is the training process of a Neural Network in PyTorch. Let's go step by step and understand what each part does.

Define Loss Function and Optimizer

1️⃣ Loss Function (nn.CrossEntropyLoss())

criterion = nn.CrossEntropyLoss()

What does it do?

Measures how wrong the model's predictions are.
Compares the model's predicted scores for each class with the actual labels.
Used for classification tasks (e.g., digit recognition, cat vs. dog).

🔹 Example:

If the model says an image is "Dog" (90%) but the actual label is "Cat", the loss will be high.
If the model is correct (Dog = 100%), the loss will be low.
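The numbers behind "high loss" and "low loss" can be worked out by hand. The sketch below is a simplified version of cross-entropy that starts from probabilities; `nn.CrossEntropyLoss` itself takes raw scores and converts them internally. The probabilities are made-up values, and 99% is used instead of exactly 100%.

```python
import math

def cross_entropy(probabilities, true_class):
    """Loss is the negative log of the probability given to the correct class."""
    return -math.log(probabilities[true_class])

classes = ["Cat", "Dog"]
cat = classes.index("Cat")
dog = classes.index("Dog")

# Model says "Dog" with 90%, but the actual label is Cat
confident_wrong = cross_entropy([0.10, 0.90], cat)
# Model says "Dog" with 99%, and the actual label is Dog
confident_right = cross_entropy([0.01, 0.99], dog)

print(round(confident_wrong, 4))  # 2.3026
print(round(confident_right, 4))  # 0.0101
```

The more probability the model puts on the wrong answer, the faster the loss grows.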

2️⃣ Optimizer (optim.Adam())

optimizer = optim.Adam(model.parameters(), lr=0.001)

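What does it do?

Improves the model by adjusting weights after each training step.
Adam adapts the step size for each weight individually, which often helps training converge faster than a fixed step size.
lr=0.001 is the learning rate: it sets how large each adjustment is. A small value like this means "learn slowly but steadily".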

🔹 Example:

If the model keeps predicting wrong, the optimizer tweaks its internal settings to improve future guesses.
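To see why the learning rate matters, here is a small sketch that tries three different values on a toy problem (finding the lowest point of w², which is at w = 0). It uses plain gradient descent, not Adam, and all values are assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, steps=10, start=1.0):
    """Minimise f(w) = w**2, whose gradient is 2 * w, with plain gradient descent."""
    w = start
    for _ in range(steps):
        gradient = 2 * w
        w = w - learning_rate * gradient
    return w

small = gradient_descent(0.001)     # creeps slowly toward 0
medium = gradient_descent(0.1)      # gets close to 0 quickly
too_large = gradient_descent(1.1)   # overshoots and moves away from 0

print(f"lr=0.001 -> w = {small:.4f}")
print(f"lr=0.1   -> w = {medium:.4f}")
print(f"lr=1.1   -> w = {too_large:.4f}")
```

A learning rate that is too small wastes time, and one that is too large never settles. Adam adds per-weight adjustments on top of this basic idea.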

Training Loop

This is where the model learns from data.

for epoch in range(5):  # Train for 5 cycles (epochs)
    for images, labels in train_loader:  # Loop through training data

What happens here?

Epoch: A full pass through the dataset.
train_loader: Holds images and their correct labels.
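The inner loop does not see the whole dataset at once; it gets one batch at a time. A quick calculation shows how many weight updates that adds up to. The MNIST training set has 60,000 images; the batch size of 64 is an assumption here, so use whatever value your train_loader was created with.

```python
import math

dataset_size = 60000  # number of images in the MNIST training set
batch_size = 64       # hypothetical; depends on how train_loader was created
epochs = 5

# The last batch may be smaller, so round up
batches_per_epoch = math.ceil(dataset_size / batch_size)
total_updates = batches_per_epoch * epochs

print(batches_per_epoch)  # 938
print(total_updates)      # 4690
```

So with these settings, "5 epochs" means the optimizer adjusts the weights several thousand times.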

Training Process

1️⃣ Clear Previous Gradients

optimizer.zero_grad()

Why?

Before updating, clear past calculations so they don't interfere.
Think of it as erasing an old to-do list before writing a new one.
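PyTorch adds new gradients on top of the ones already stored, which is why the reset is needed. The sketch below imitates that behaviour with a toy `Parameter` class written for this example (it is not a PyTorch class), using made-up gradient values.

```python
class Parameter:
    """Toy stand-in for a model weight that stores an accumulated gradient."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0

    def accumulate(self, new_grad):
        # Mimics PyTorch: new gradients are ADDED to the stored gradient
        self.grad += new_grad

    def zero_grad(self):
        self.grad = 0.0

weight = Parameter(1.0)
batch_gradients = [0.5, 0.25]  # hypothetical gradients from two batches

# Without clearing: the second batch is mixed with the first
for g in batch_gradients:
    weight.accumulate(g)
without_reset = weight.grad

# With clearing: each batch starts from a clean slate
with_reset = []
for g in batch_gradients:
    weight.zero_grad()
    weight.accumulate(g)
    with_reset.append(weight.grad)

print(without_reset)  # 0.75
print(with_reset)     # [0.5, 0.25]
```

Without the reset, the second update would be based on 0.75 instead of 0.25, so the model would react to old feedback as well as new.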

2️⃣ Forward Pass (Make Predictions)

output = model(images)

What happens?

The model processes the images and predicts labels.

🔹 Example:

You show the model a picture of a cat 🐱.
It predicts "Dog: 20%, Cat: 80%".

3️⃣ Compute Loss

loss = criterion(output, labels)

What happens?

Compares model predictions to actual labels.
Big loss = bad prediction ❌
Small loss = good prediction ✅

🔹 Example:

Model predicts "Dog: 70%, Cat: 30%", but the label is Cat → High Loss.
Model predicts "Dog: 10%, Cat: 90%", and the label is Cat → Low Loss.

4️⃣ Backpropagation (Learning Step)

loss.backward()

What happens?

Calculates how much each weight contributed to the error (these values are called gradients). The weights themselves are not changed yet.
Think of it like a teacher giving feedback on every mistake.
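For a single neuron, the "feedback" can be computed by hand with the chain rule. This is a simplified sketch with a squared-error loss and made-up numbers, not what PyTorch does internally for the full network. The second half double-checks the result by nudging each value slightly and watching how the loss changes.

```python
def loss_fn(w, b, x, y):
    """Squared error of a one-weight model: prediction = w * x + b."""
    prediction = w * x + b
    return (prediction - y) ** 2

# Hypothetical input, target, weight, and bias
x, y = 2.0, 1.0
w, b = 0.5, 0.1

# Chain rule: d(loss)/dw = 2 * error * x, d(loss)/db = 2 * error
error = (w * x + b) - y
grad_w = 2 * error * x
grad_b = 2 * error

# Numerical check: nudge w and b a tiny bit and measure the change in loss
eps = 1e-6
numeric_grad_w = (loss_fn(w + eps, b, x, y) - loss_fn(w - eps, b, x, y)) / (2 * eps)
numeric_grad_b = (loss_fn(w, b + eps, x, y) - loss_fn(w, b - eps, x, y)) / (2 * eps)

print(f"grad_w: {grad_w:.4f} (numerical check: {numeric_grad_w:.4f})")
print(f"grad_b: {grad_b:.4f} (numerical check: {numeric_grad_b:.4f})")
```

`loss.backward()` does the same kind of calculation automatically for every weight and bias in the model.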

5️⃣ Update the Model

optimizer.step()

What happens?

The optimizer adjusts the model’s weights to reduce future mistakes.
The model learns from past errors and improves predictions.

🔹 Example:

If it thought a cat was a dog, next time it adjusts to be more accurate.

Print Training Progress

print(f"Epoch {epoch+1} - Loss: {loss.item():.4f}")

Why?

Shows the loss of the last batch in each epoch, which gives a rough idea of how the model is improving.
Lower loss = better learning.

🔹 Example Output:

Epoch 1 - Loss: 2.3456
Epoch 2 - Loss: 1.8973
Epoch 3 - Loss: 0.9654
Epoch 4 - Loss: 0.5432
Epoch 5 - Loss: 0.2345
Training Complete!

The loss decreases, meaning the model is improving! 🎉
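One detail worth knowing: the print statement in the training loop reports loss.item() from the last batch of each epoch only, so the number can jump around. Averaging over all batches in the epoch gives a steadier picture. A minimal sketch with made-up batch losses:

```python
# Hypothetical losses for the four batches of one epoch
batch_losses = [0.9, 0.7, 0.8, 0.2]

last_batch_loss = batch_losses[-1]
average_loss = sum(batch_losses) / len(batch_losses)

print(f"Last batch loss: {last_batch_loss:.4f}")  # Last batch loss: 0.2000
print(f"Average loss:    {average_loss:.4f}")     # Average loss:    0.6500
```

In the real loop, you could collect loss.item() for every batch in a list and print the average at the end of each epoch.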

Summary: What Happens in Training?

1️⃣ The model sees an image
2️⃣ It makes a guess
3️⃣ The loss function checks if it’s right or wrong
4️⃣ Backpropagation computes the gradients, and the optimizer adjusts the model
5️⃣ After many cycles, the model becomes smarter!

Now, your neural network is trained!

Step 5: Evaluate the Model

After training, we test how well the model performs on new unseen data.

correct = 0
total = 0

# No need to compute gradients during evaluation
with torch.no_grad():
    for images, labels in test_loader:
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)  # Get predicted class
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Accuracy: {100 * correct / total:.2f}%")

What does this do?

Runs the model on test data (images it hasn’t seen before)
Compares predictions to actual labels
Prints accuracy percentage
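The accuracy calculation can be followed by hand on a tiny example. This plain-Python sketch uses made-up scores for four images and three classes; the `argmax` helper plays the role of `torch.max(outputs, 1)`.

```python
def argmax(scores):
    """Index of the highest score, i.e. the predicted class."""
    return scores.index(max(scores))

# Hypothetical model scores: 4 test images, 3 classes each
outputs = [
    [0.1, 2.0, 0.3],
    [1.5, 0.2, 0.1],
    [0.3, 0.4, 2.2],
    [0.9, 0.8, 0.1],
]
labels = [1, 0, 2, 1]  # the correct class for each image

predicted = [argmax(row) for row in outputs]
correct = sum(p == t for p, t in zip(predicted, labels))
accuracy = 100 * correct / len(labels)

print(predicted)                      # [1, 0, 2, 0]
print(f"Accuracy: {accuracy:.2f}%")   # Accuracy: 75.00%
```

The last image is the only mistake: the model scored class 0 slightly higher than the correct class 1.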

Step 6: Deploy Model in Databricks (Optional)

To use this model in real-world applications, we log it with MLflow so that it can be loaded or deployed later.

import mlflow.pytorch

# Save Model in MLflow
mlflow.pytorch.log_model(model, "digit_classifier")
print("Model saved in MLflow!")

🔹 Now, you can load and use this trained model anytime for real-time predictions!

Summary: What We Did

1️⃣ Loaded the dataset (MNIST).
2️⃣ Built a simple neural network using PyTorch.
3️⃣ Trained the model to recognize handwritten digits.
4️⃣ Evaluated accuracy on test data.
5️⃣ Saved the model for deployment.

Why Use Databricks for Deep Learning?

✅ GPU acceleration – Speeds up training
✅ Auto-scaling clusters – Handles large datasets easily
✅ Pre-installed ML libraries – Saves time
✅ MLflow integration – Tracks & deploys models easily
