Spark SQL: A Guide to Effective Data Processing
Different Object Creations in Spark SQL
Table of contents
- 1. Objects You Can Create in the Latest Version of Spark SQL
- 1.1 Databases (Catalogs & Namespaces)
- 1.2 Tables (Managed, External, Temporary, Delta, Iceberg, Hudi)
- Alternative: Creating a Global Temporary View
- When Should You Use Delta Lake Instead of Parquet?
- When Should You Use Iceberg Instead of Parquet or Delta?
- What is Apache Hudi?
- When Should You Use Hudi Instead of Parquet, Delta, or Iceberg?
- 1.3 Views (Logical Tables)
- 1.4 Functions (User-Defined Functions - UDFs)
- Step 1: Implementing SplitString UDTF in Java (or Scala)
- Step 2: Registering the UDTF in Spark
- Step 3: Using the UDTF in SQL
- Alternative Approach Without UDTFs
- Summary: Table UDFs vs. Pandas UDFs in SQL
- 1.5 Indexes
- B) Indexing in Delta Lake (OPTIMIZE + ZORDER)
- 2️⃣ Data Skipping (Automatic Indexing in Delta)
- 3️⃣ Bloom Filter Index (Fast Exact Match Lookup)
- 4️⃣ Indexing Using Partitioning
- Summary: Indexing Methods in Delta Lake
- C) Apache Iceberg Indexing (Hidden Partitioning & Sort Order)
- 2️⃣ Sorting-Based Indexing (Sort Order)
- 3️⃣ Data Skipping Indexing (Manifest & Metadata Tracking)
- 4️⃣ Bloom Filter Indexing (Fast Lookup Indexing)
- 5️⃣ Secondary Indexing (Apache Iceberg + Apache Lucene)
- Summary: Different Indexing Methods in Apache Iceberg
- D) Apache Hudi Indexing (Bloom Index, Global Index)
- 1️⃣ Bloom Filter Indexing (Default Index in Hudi)
- 2️⃣ Simple Index (Hudi Default)
- 3️⃣ Bucket Indexing (For Large-Scale Data)
- 4️⃣ Global Index (For Multi-Partition Lookups)
- 5️⃣ HBase Indexing (For Large-Scale Upserts)
- Summary: Apache Hudi Indexing Types
- Final Summary: Different Indexing Methods in Spark SQL
- 1.6 Partitions (For Large Tables)
- 3️⃣ Dynamic vs. Static Partitioning
- 4️⃣ Optimizing Partitioned Queries
- Summary: Creating & Using Partitions in Spark SQL
- 1.7 Buckets (Cluster Data for Performance)
- Optimizing Queries on Bucketed Tables
- Summary: Types of Bucketing in Spark SQL
- 1.8 Indexing in Delta Lake
- 1.9 Indexing in Apache Iceberg
- Summary: Delta vs. Iceberg Indexing
- 1.10 Temporal Tables & Streaming Tables in Spark SQL
- A) Delta Lake Time Travel (Version-Based Temporal Tables)
- B) Slowly Changing Dimensions (SCD) Type 2 Temporal Table
- A) Creating a Streaming Table From Kafka
- B) Creating a Streaming Table From Delta Lake
- C) Creating a Streaming Table in Iceberg
- 1.11 Materialized Views
- 1.12 Schema & Metadata Objects
- Summary: Latest Objects You Can Create in Spark SQL
- 🔹 1️⃣ Databases (Catalogs & Namespaces)
- 🔹 2️⃣ Tables (Managed, External, Temporary, Delta, Iceberg, Hudi)
- 🔹 3️⃣ Views (Logical Tables)
- 🔹 4️⃣ Functions (User-Defined Functions - UDFs)
- 🔹 5️⃣ Indexes (Query Optimization)
- 🔹 6️⃣ Partitions (Large Table Optimization)
- 🔹 7️⃣ Buckets (Cluster Data for Performance)
- 🔹 8️⃣ Materialized Views (Precomputed Query Caching)
Apache Spark SQL (latest versions, including Spark 3.x and 4.x) supports various objects for efficient data processing, querying, and managing structured & semi-structured data.
1. Objects You Can Create in the Latest Version of Spark SQL
Below are the key objects you can create using the latest version of Spark SQL:
1.1 Databases (Catalogs & Namespaces)
Purpose: Organize structured data like schemas in a database.
How to Create a Database?
CREATE DATABASE IF NOT EXISTS sales_db;
Use SHOW DATABASES; to list all databases.
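After creating it, you can switch to the new database and list its tables:
USE sales_db;
SHOW TABLES;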
1.2 Tables (Managed, External, Temporary, Delta, Iceberg, Hudi)
Purpose: Store structured data in Spark SQL.
A) Managed Table (Spark owns the storage)
CREATE TABLE sales_data (
id INT,
amount DOUBLE,
date STRING
) USING PARQUET;
Storage is managed by Spark (saved in warehouse location).
Stores the table in PARQUET format for efficient storage & retrieval.
Parquet is a columnar storage format, meaning:
✔️ Faster queries than CSV or JSON.
✔️ Compresses data efficiently.
✔️ Optimized for big data processing in Spark.
In Databricks, the table is saved in DBFS (Databricks File System) at:
/user/hive/warehouse/sales_data
In Spark Standalone, it will be stored in the default warehouse location.
To check where it is stored, run:
DESCRIBE FORMATTED sales_data;
B) External Table (Data stored outside Spark)
CREATE EXTERNAL TABLE sales_external (
id INT,
amount DOUBLE,
date STRING
) USING PARQUET LOCATION 's3://my-bucket/sales/';
An External Table in Spark SQL is a table where the data is stored outside Spark’s managed storage, such as AWS S3, Azure Data Lake, Google Cloud Storage, or HDFS.
Unlike Managed Tables, where Spark controls the storage, External Tables only define the schema and location of existing data without moving or deleting the actual files.
Useful when data is in cloud storage (S3, ADLS, GCS, HDFS).
C) Temporary Tables (Session-Specific)
CREATE TEMPORARY VIEW temp_sales AS
SELECT * FROM sales_data;
A Temporary View is a session-scoped virtual table that:
Is created dynamically from a query.
Exists only for the current session.
Disappears once the session ends (not stored permanently).
Example Use Case:
- You don’t want to create a physical table, but you need to reuse a query multiple times.
Alternative: Creating a Global Temporary View
If you need the view to be accessible across multiple sessions, use Global Temporary Views:
CREATE GLOBAL TEMPORARY VIEW global_sales AS
SELECT * FROM sales_data;
Global temporary views are tied to the Spark application and remain available to all of its sessions until the application ends.
To query a global view, use:
SELECT * FROM global_temp.global_sales;
D) Delta Tables (Latest Feature)
Delta Lake provides ACID transactions, versioning, and schema evolution.
CREATE TABLE delta_sales
USING DELTA
AS SELECT * FROM parquet.`/mnt/datalake/sales/`;
Delta Lake is an enhanced storage format built on top of Parquet, designed to handle big data, real-time analytics, and machine learning workloads.
Key Features of Delta Lake:
ACID Transactions – Ensures data integrity (unlike Parquet).
Schema Evolution – Allows adding/removing columns dynamically.
Time Travel – Lets you query previous versions of data.
Deletes & Updates – Supports UPDATE, DELETE, and MERGE commands (unlike Parquet).
Example: Time Travel (Query Old Data)
SELECT * FROM delta_sales VERSION AS OF 2;
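Because Delta supports row-level DML, updates, deletes, and merges can be written directly in SQL. A minimal sketch, assuming an updates table (or view) with the same id, amount, and date columns as delta_sales:
UPDATE delta_sales SET amount = amount * 1.1 WHERE date = '2024-01-15';
DELETE FROM delta_sales WHERE amount < 0;
MERGE INTO delta_sales AS t
USING updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN INSERT (id, amount, date) VALUES (s.id, s.amount, s.date);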
When Should You Use Delta Lake Instead of Parquet?
Feature | Parquet | Delta Lake |
ACID Transactions | ❌ No | ✅ Yes |
Schema Evolution | ❌ Manual | ✅ Automatic |
Time Travel | ❌ No | ✅ Yes |
Supports Updates/Deletes | ❌ No | ✅ Yes |
Performance (Indexing & Caching) | 🚀 Fast | ⚡ Faster |
Delta Lake is ideal for production-ready big data & machine learning pipelines.
E) Iceberg Table (Latest Feature)
Apache Iceberg supports large-scale table management with schema evolution.
CREATE TABLE iceberg_sales
USING ICEBERG
AS SELECT * FROM parquet.`/mnt/datalake/sales/`;
This process converts Parquet data into Iceberg format, enabling advanced features like time travel, schema evolution, hidden partitioning, and incremental data processing.
What is Apache Iceberg?
Apache Iceberg is an open-source table format designed for large-scale analytics on data lakes (AWS S3, Azure Data Lake, Google Cloud Storage, HDFS).
Key Features of Apache Iceberg:
Time Travel – Query old versions of data effortlessly.
Schema Evolution – Add, remove, or rename columns without breaking queries.
Hidden Partitioning – No need to manually create partitions.
Incremental Processing – Query only the changed/new data efficiently.
ACID Transactions – Ensures data consistency like a database.
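As a quick sketch of Iceberg time travel on a table like the one created above (assuming Spark 3.3+ with an Iceberg catalog configured; the snapshot ID below is a placeholder taken from the table's snapshot history):
-- Query the table as it looked at a point in time
SELECT * FROM iceberg_sales TIMESTAMP AS OF '2024-02-01 10:00:00';
-- Query a specific snapshot by its ID
SELECT * FROM iceberg_sales VERSION AS OF 4689139601104584344;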
When Should You Use Iceberg Instead of Parquet or Delta?
Feature | Parquet | Delta Lake | Iceberg |
ACID Transactions | ❌ No | ✅ Yes | ✅ Yes |
Schema Evolution | ❌ Manual | ✅ Yes | ✅ Yes |
Time Travel | ❌ No | ✅ Yes | ✅ Yes |
Automatic Partitioning | ❌ No | ❌ No | ✅ Yes |
Incremental Queries | ❌ No | ❌ No | ✅ Yes |
Cloud-Native & Open-Source | ✅ Yes | ❌ Mostly Databricks | ✅ Yes |
Apache Iceberg is best for production data lakes where performance, time travel, and incremental processing are required.
F) Hudi Table (For Streaming & Incremental Processing)
Apache Hudi is best for real-time updates & transactional tables.
CREATE TABLE hudi_sales
USING HUDI
AS SELECT * FROM parquet.`/mnt/datalake/sales/`;
Supports upserts & incremental queries. This process converts the Parquet dataset into Apache Hudi format, enabling real-time data ingestion, updates, and incremental queries.
What is Apache Hudi?
Apache Hudi (Hadoop Upserts, Deletes, and Incrementals) is a transactional storage format designed for real-time big data processing on data lakes like AWS S3, Azure Data Lake, Google Cloud Storage, or HDFS.
Key Features of Apache Hudi:
Upserts & Deletes – Supports row-level updates and deletes efficiently.
Incremental Queries – Retrieve only changed records instead of full scans.
Time Travel & Versioning – Query historical snapshots of data.
Streaming & Real-Time Processing – Optimized for streaming ingestion.
ACID Transactions – Ensures data consistency in distributed environments.
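A small sketch of Hudi time travel in Spark SQL, assuming a recent Hudi release (0.10+) with the Hudi Spark bundle configured:
-- Query the table as of a past commit time
SELECT * FROM hudi_sales TIMESTAMP AS OF '2024-02-01 10:00:00';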
When Should You Use Hudi Instead of Parquet, Delta, or Iceberg?
Feature | Parquet | Delta Lake | Iceberg | Hudi |
ACID Transactions | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
Schema Evolution | ❌ Manual | ✅ Yes | ✅ Yes | ✅ Yes |
Time Travel | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes |
Automatic Partitioning | ❌ No | ❌ No | ✅ Yes | ✅ Yes |
Incremental Queries | ❌ No | ❌ No | ✅ Yes | ✅ Yes |
Upserts & Deletes | ❌ No | ✅ Yes | ❌ No | ✅ Yes |
Optimized for Streaming | ❌ No | ❌ No | ❌ No | ✅ Yes |
Apache Hudi is best for real-time streaming data and transactional updates.
1.3 Views (Logical Tables)
Purpose: Save complex queries as reusable objects.
A) Permanent View
CREATE VIEW sales_view AS
SELECT id, amount, date FROM sales_data WHERE amount > 1000;
🔹 Acts like a saved query.
B) Temporary View (Session-Specific)
CREATE TEMP VIEW temp_view AS
SELECT * FROM sales_data WHERE date > '2024-01-01';
🔹 Exists only during session runtime.
1.4 Functions (User-Defined Functions - UDFs)
Purpose: Create custom functions for advanced data processing. UDFs extend Spark SQL's built-in functions, making complex operations easier!
A) Scalar UDF (Returns Single Value)
Think of Scalar UDFs like a function in Excel that applies a formula to each individual cell separately.
CREATE FUNCTION to_upper AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper';
Converts text to uppercase (the class must be a Hive-style UDF implementation available on Spark's classpath).
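Once registered, the function is called like any built-in SQL function. A usage sketch, assuming a hypothetical customers table with a name column:
SELECT to_upper(name) AS name_upper FROM customers;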
B) Aggregate UDF (Returns Aggregated Values)
Think of Aggregate UDFs like SUM or AVG in Excel, where instead of modifying individual cells, they calculate a single output for a group of values.
CREATE FUNCTION custom_sum AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum';
🔹 Performs custom aggregation operations.
C) Table UDFs
Creating a Table UDF in SQL
The reference to com.example.spark.udtf.SplitString in Spark SQL suggests a User-Defined Table Function (UDTF) that splits a comma-separated string into multiple rows.
Since Spark does not natively support UDTFs in SQL (like Hive does), you must implement it in Java or Scala and register it in Spark before using it.
CREATE FUNCTION split_string_udtf AS 'com.example.spark.udtf.SplitString';
This function takes a string and returns multiple rows.
Step 1: Implementing the SplitString UDTF in Java (or Scala)
This class implements the UDTF logic in Java.
package com.example.spark.udtf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.*;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;

public class SplitString extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentLengthException("SplitString() takes exactly one argument.");
        }
        if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(0, "Only string type arguments are allowed.");
        }
        // Declare a single output column named "word" of type STRING
        ArrayList<String> fieldNames = new ArrayList<>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<>();
        fieldNames.add("word");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        if (args[0] == null) return;
        String input = args[0].toString();
        String[] words = input.split(",");
        // Emit one output row per comma-separated token
        for (String word : words) {
            forward(new Object[]{word.trim()});
        }
    }

    @Override
    public void close() throws HiveException { }
}
What This Code Does:
Takes a comma-separated string as input.
Splits the string into individual words.
Emits each word as a separate row.
Step 2: Registering the UDTF in Spark
After compiling and adding the JAR file to Spark, register the UDTF:
CREATE FUNCTION split_string_udtf AS 'com.example.spark.udtf.SplitString';
Step 3: Using the UDTF in SQL
Now, you can call the function in Spark SQL:
SELECT split_string_udtf('apple,banana,orange');
Expected Output:
word |
apple |
banana |
orange |
Alternative Approach Without UDTFs
If UDTFs are not available in your Spark setup, you can use the built-in EXPLODE() function:
SELECT EXPLODE(SPLIT('apple,banana,orange', ',')) AS fruit;
Expected Output (same result as the UDTF):
fruit |
apple |
banana |
orange |
D) Pandas UDFs
Creating Pandas UDFs in Spark SQL
Pandas UDFs (also called Vectorized UDFs) apply operations on entire columns at once, making them much faster than row-based UDFs.
Example: Creating a Pandas UDF in SQL
Use Case: Multiply a column's values by 2, using vectorized execution for better performance.
Register Pandas UDF in Python
Since Pandas UDFs require Python, first define the function in a PySpark notebook.
from pyspark.sql.functions import pandas_udf
import pandas as pd
@pandas_udf("double")  # Define return type
def multiply_by_2(s: pd.Series) -> pd.Series:
    return s * 2
# Register the function in Spark SQL
spark.udf.register("multiply_by_2", multiply_by_2)
This function is now available in Spark SQL.
Using the Pandas UDF in SQL
SELECT multiply_by_2(amount) AS doubled_amount FROM sales_data;
Expected Output:
amount | doubled_amount |
100 | 200 |
250 | 500 |
75 | 150 |
Pandas UDFs execute much faster than row-based UDFs!
Summary: Table UDFs vs. Pandas UDFs in SQL
Feature | Table UDF (UDTF) | Pandas UDF |
Returns | Multiple rows per input | One transformed column |
Example | "apple,banana" → apple , banana | Multiply column values by 2 |
Performance | Medium | Fastest (Vectorized Processing) |
Use Case | Splitting strings, flattening arrays | Math transformations, scaling data |
1.5 Indexes
Purpose: Improve performance for frequent queries.
Traditional Indexing in Spark SQL (Basic Index)
This is similar to SQL indexing, but Spark does not support direct CREATE INDEX commands like traditional RDBMSs (MySQL, PostgreSQL).
Instead, indexing is done using sorting and bucketing.
A) Bucketing (Indexing Alternative)
Bucketing groups similar data into fixed-size partitions for faster joins and aggregations.
Works like indexing but divides data into pre-defined "buckets".
🔹 Creating a Bucketed Table in Spark
CREATE TABLE sales_bucketed (
id INT,
amount DOUBLE
) USING PARQUET
CLUSTERED BY (id) INTO 4 BUCKETS;
This clusters data based on id into 4 buckets, improving join and lookup performance.
CREATE TABLE sales_bucketed (...): creates a new table named sales_bucketed.
id INT, amount DOUBLE: defines two columns, id (unique identifier, integer) and amount (sales amount, double).
USING PARQUET: stores the table in Parquet format, which is a compressed and optimized file format for big data.
CLUSTERED BY (id) INTO 4 BUCKETS: organizes the data into 4 buckets based on the id column.
All rows with the same id are assigned to the same bucket. Spark pre-assigns the data to buckets, making joins and searches much faster.
B) Indexing in Delta Lake (OPTIMIZE + ZORDER)
Delta Lake improves query performance by using indexing mechanisms like Z-Ordering, Data Skipping, and Bloom Filters.
1️⃣ Z-Ordering Index (Multi-Dimensional Indexing)
Best for large datasets with multiple filters (e.g., filtering by date and amount).
Helps Spark read fewer files instead of scanning all data.
🔹 Creating a Z-Ordering Index
OPTIMIZE delta_sales ZORDER BY (date, amount);
This improves query performance when filtering by date and amount.
Example Query That Runs Faster After Z-Ordering:
SELECT * FROM delta_sales WHERE date > '2024-01-01' AND amount > 5000;
Only relevant files are scanned, making the query faster.
2️⃣ Data Skipping (Automatic Indexing in Delta)
Delta Lake tracks metadata like min and max values in each file, so Spark skips irrelevant files during queries.
How It Works?
Each Delta file contains statistics (e.g., min/max values for columns).
When running a query, Delta Lake only reads the files that contain relevant data.
Example Query That Benefits From Data Skipping
SELECT * FROM delta_sales WHERE amount > 1000;
Delta Lake automatically avoids scanning files where amount is always below 1000.
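Data skipping statistics are collected automatically, but the number of leading table columns indexed can be tuned. A sketch, assuming the standard delta.dataSkippingNumIndexedCols table property:
-- Collect min/max statistics for only the first 5 columns
ALTER TABLE delta_sales SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '5');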
3️⃣ Bloom Filter Index (Fast Exact Match Lookup)
Best for finding specific values in large datasets.
Works like a search index → quickly tells if a value exists in a file.
Used for fast lookups in high-cardinality columns (e.g., unique customer IDs).
Enabling Bloom Indexing in Delta
ALTER TABLE delta_sales
SET TBLPROPERTIES ('delta.bloomFilter.columns' = 'id');
This enables a Bloom filter index on the id column.
Query That Runs Faster After Bloom Indexing
SELECT * FROM delta_sales WHERE id = 1001;
Instead of scanning all files, Spark directly finds the relevant files.
4️⃣ Indexing Using Partitioning
Partitioning is a type of indexing where data is split into folders based on a column.
- Best for columns with low cardinality (e.g., year, country).
Creating a Partitioned Delta Table
CREATE TABLE delta_sales (
id INT,
amount DOUBLE,
date STRING
) USING DELTA
PARTITIONED BY (date);
Example Query That Benefits From Partitioning
SELECT * FROM delta_sales WHERE date = '2024-02-01';
Spark reads only the date=2024-02-01 partition, making it much faster.
Summary: Indexing Methods in Delta Lake
Index Type | Best For | How to Enable? |
Z-Ordering | Multi-column filtering | OPTIMIZE ZORDER BY (date, amount); |
Data Skipping | Auto-optimized query performance | Enabled by default in Delta |
Bloom Filter Index | Fast lookups on unique values | ALTER TABLE SET TBLPROPERTIES ('delta.bloomFilter.columns' = 'id'); |
Partitioning | Filtering based on low-cardinality columns | PARTITIONED BY (date); |
C) Apache Iceberg Indexing (Hidden Partitioning & Sort Order)
Iceberg provides automatic partitioning and indexing, reducing query scan time.
1️⃣ Hidden Partition Indexing (Automatic Partitioning in Iceberg)
Traditional partitioning requires managing folders manually, but Iceberg automatically tracks partitions without needing to create directories.
Why Use Hidden Partitioning?
Iceberg tracks partition values automatically in table metadata, derived from transforms declared in the partition spec rather than from manually managed folder layouts.
Queries don't need to reference partition columns explicitly; filtering on the source column is enough.
Creating an Iceberg Table with Automatic Partition Indexing
CREATE TABLE iceberg_sales (
id INT,
amount DOUBLE,
date DATE
) USING ICEBERG
PARTITIONED BY (days(date));
Iceberg derives the partition from the date column behind the scenes: the days(date) transform lives in the table's partition spec and is tracked in metadata, so queries simply filter on date.
Query That Benefits From Hidden Partitioning
SELECT * FROM iceberg_sales WHERE date > '2024-01-01';
Spark reads only relevant partitions instead of scanning the entire dataset.
2️⃣ Sorting-Based Indexing (Sort Order)
Iceberg allows sorting at the file level, improving query efficiency for ordered queries.
Why Use Sort Order Indexing?
Helps with range-based queries (BETWEEN, ORDER BY).
Stores data physically in sorted order, reducing scan time.
Enabling Sort Order Indexing
ALTER TABLE iceberg_sales WRITE ORDERED BY (date, amount);
Example Query That Runs Faster With Sorting
SELECT * FROM iceberg_sales WHERE amount BETWEEN 500 AND 1000 ORDER BY date;
Sorted storage reduces the number of files scanned.
3️⃣ Data Skipping Indexing (Manifest & Metadata Tracking)
Iceberg automatically tracks column statistics (min, max, count, etc.), so Spark skips unnecessary files when executing queries.
How Does It Work?
Each data file contains metadata about min/max values for indexed columns.
When querying, Iceberg only scans files that contain relevant values.
Example Query That Benefits From Data Skipping
SELECT * FROM iceberg_sales WHERE amount > 5000;
Instead of scanning all files, Iceberg reads only the files where amount might be greater than 5000.
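To see the per-file statistics Iceberg uses for skipping, you can query its metadata tables (a sketch, assuming the table is registered in an Iceberg catalog):
-- Per-file metadata, including record counts and column lower/upper bounds
SELECT file_path, record_count, lower_bounds, upper_bounds FROM iceberg_sales.files;
-- Snapshot history used for time travel and incremental reads
SELECT * FROM iceberg_sales.snapshots;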
4️⃣ Bloom Filter Indexing (Fast Lookup Indexing)
Bloom Filters help with fast, exact-match lookups on high-cardinality columns (like id).
Why Use Bloom Filters?
Helps fast lookup queries (WHERE id = 1001).
Avoids scanning unnecessary data files.
🔹 Enabling Bloom Filters in Iceberg
ALTER TABLE iceberg_sales SET TBLPROPERTIES (
'write.parquet.bloom-filter-enabled.column.id' = 'true'
);
Example Query That Runs Faster With Bloom Indexing
SELECT * FROM iceberg_sales WHERE id = 1001;
Spark directly finds the relevant file instead of scanning all records.
5️⃣ Secondary Indexing (Apache Iceberg + Apache Lucene)
Apache Iceberg can integrate with Apache Lucene for advanced secondary indexing (experimental feature).
Why Use Secondary Indexing?
- Helps speed up text searches and non-primary key lookups.
Example Query That Runs Faster With Secondary Indexing
SELECT * FROM iceberg_sales WHERE customer_name LIKE 'John%';
If Lucene indexing is enabled, text-based lookups become much faster.
Summary: Different Indexing Methods in Apache Iceberg
Index Type | Best For | How to Enable? |
Hidden Partitioning | Fast partition-based filtering | Enabled by default in Iceberg |
Sorting-Based Indexing | Range queries, ORDER BY | WRITE ORDERED BY (column) |
Data Skipping (Manifest Indexing) | Auto-optimized queries | Enabled by default in Iceberg |
Bloom Filter Indexing | Fast lookups on unique values | ALTER TABLE SET TBLPROPERTIES ('write.parquet.bloom-filter-enabled.column.id'='true'); |
Secondary Indexing (Lucene) | Full-text search, complex lookups | Experimental (Lucene-based indexing) |
D) Apache Hudi Indexing (Bloom Index, Global Index)
Apache Hudi (Hadoop Upserts, Deletes, and Incrementals) provides multiple types of indexing mechanisms to speed up queries and optimize data processing.
Hudi is designed for real-time, incremental, and streaming data workloads, so its indexing methods focus on fast updates, upserts, and lookups.
1️⃣ Bloom Filter Indexing (Default Index in Hudi)
Best for searching records by primary key (id) without scanning all files.
Hudi maintains a Bloom Filter index at the file level, helping quickly check whether a record exists in a file before scanning it.
How to Enable Bloom Filter Indexing in Hudi
CREATE TABLE hudi_sales (
id INT,
amount DOUBLE,
date STRING
) USING HUDI
TBLPROPERTIES ('hoodie.index.type' = 'BLOOM');
Creates a Bloom Filter index on the id column.
Example Query That Runs Faster With Bloom Indexing
SELECT * FROM hudi_sales WHERE id = 1001;
Instead of scanning all files, Hudi quickly finds the relevant file using Bloom Filters.
2️⃣ Simple Index (Hudi Default)
Best for fast updates and deletes on transactional datasets.
Hudi maintains a mapping of record keys to file locations (faster than scanning all files).
Works well for small to medium datasets.
How to Enable Simple Indexing
ALTER TABLE hudi_sales SET TBLPROPERTIES ('hoodie.index.type' = 'SIMPLE');
Example Query That Runs Faster
MERGE INTO hudi_sales AS target
USING updates AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET target.amount = source.amount;
Performs fast upserts using the simple index.
3️⃣ Bucket Indexing (For Large-Scale Data)
Best for large-scale data where primary keys are evenly distributed.
✔️ Divides data into fixed-size buckets, reducing lookup time.
How to Enable Bucket Indexing
ALTER TABLE hudi_sales SET TBLPROPERTIES (
'hoodie.index.type' = 'BUCKET',
'hoodie.bucket.index.numBuckets' = '4'
);
Example Query That Runs Faster
SELECT * FROM hudi_sales WHERE id = 5000;
Hudi searches only in the relevant bucket instead of scanning all files.
4️⃣ Global Index (For Multi-Partition Lookups)
Best for searching across multiple partitions in large datasets.
Unlike other indexes, Global Index does not restrict searches to a single partition.
How to Enable Global Indexing
ALTER TABLE hudi_sales SET TBLPROPERTIES ('hoodie.index.type' = 'GLOBAL_BLOOM');
Example Query That Runs Faster
SELECT * FROM hudi_sales WHERE id = 7500;
Finds id = 7500 across multiple partitions without scanning all files.
5️⃣ HBase Indexing (For Large-Scale Upserts)
Best for OLTP-like workloads where upserts are frequent.
✔️ Maintains a mapping of record keys in Apache HBase for ultra-fast lookups.
How to Enable HBase Indexing
ALTER TABLE hudi_sales SET TBLPROPERTIES ('hoodie.index.type' = 'HBASE');
Example Query That Runs Faster
SELECT * FROM hudi_sales WHERE id = 9999;
Uses HBase for fast upserts and searches instead of scanning all files.
Summary: Apache Hudi Indexing Types
Index Type | Best For | How to Enable? |
Bloom Filter Indexing (Default) | Fast lookups on primary keys | 'hoodie.index.type' = 'BLOOM' |
Simple Indexing | Fast updates and deletes | 'hoodie.index.type' = 'SIMPLE' |
Bucket Indexing | Large-scale data with evenly distributed keys | 'hoodie.index.type' = 'BUCKET' + numBuckets=4 |
Global Indexing | Searching across multiple partitions | 'hoodie.index.type' = 'GLOBAL_BLOOM' |
HBase Indexing | Ultra-fast upserts with external storage | 'hoodie.index.type' = 'HBASE' |
E) Materialized Views (Query Indexing Alternative)
Materialized Views precompute results for faster retrieval (CREATE MATERIALIZED VIEW is available on platforms such as Databricks SQL; open-source Spark SQL does not provide it natively).
Creating a Materialized View
CREATE MATERIALIZED VIEW mv_sales
AS SELECT id, SUM(amount) AS total_sales FROM sales_data GROUP BY id;
Querying this view is faster than running the aggregation repeatedly.
Final Summary: Different Indexing Methods in Spark SQL
Index Type | Best For | How to Use in Spark SQL? |
Bucketing (Partitioning Alternative) | Faster Joins & Aggregations | CLUSTERED BY (id) INTO 4 BUCKETS; |
Z-Ordering (Delta Lake) | Multi-Column Filtering | OPTIMIZE ZORDER BY (date, amount); |
Hidden Partitioning (Iceberg) | Auto-Partitioning & Indexing | No explicit indexing required |
Bloom Index (Hudi) | Fast Record Lookups | TBLPROPERTIES ('hoodie.index.type' = 'BLOOM'); |
Materialized Views | Precomputed Query Results | CREATE MATERIALIZED VIEW ... |
Choose the right indexing strategy based on your dataset and query patterns.
1.6 Partitions (For Large Tables)
Partitioning in Spark SQL improves query performance by logically dividing large datasets into smaller, manageable pieces. Instead of scanning the entire dataset, Spark only reads the relevant partitions, making queries much faster.
A) Single-Column Partitioning
CREATE TABLE sales_partitioned (
id INT,
amount DOUBLE,
date STRING
) USING PARQUET
PARTITIONED BY (date);
This creates a table partitioned by the date column.
Each unique date value creates a separate partition folder.
Query That Runs Faster With Partitioning
SELECT * FROM sales_partitioned WHERE date = '2024-02-01';
Spark reads only the date=2024-02-01 partition instead of scanning all data.
B) Multi-Column Partitioning
For large datasets, multi-column partitioning helps with complex queries.
CREATE TABLE sales_multi_partitioned (
id INT,
amount DOUBLE,
country STRING,
year INT
) USING PARQUET
PARTITIONED BY (year, country);
This creates partitions based on year and country.
Data is stored in directories like:
/data/sales/year=2023/country=USA/part-0001.parquet
/data/sales/year=2023/country=India/part-0002.parquet
Query That Runs Faster With Multi-Column Partitioning
SELECT * FROM sales_multi_partitioned WHERE year = 2023 AND country = 'USA';
Spark scans only /year=2023/country=USA/ instead of the entire table.
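To confirm which partitions exist (and match the folder layout above), list them with SHOW PARTITIONS:
SHOW PARTITIONS sales_multi_partitioned;
-- Restrict the listing to one year
SHOW PARTITIONS sales_multi_partitioned PARTITION (year = 2023);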
3️⃣ Dynamic vs. Static Partitioning
Spark supports both Static and Dynamic Partitioning when inserting data.
A) Static Partitioning
You specify partition values manually when inserting data.
INSERT INTO sales_partitioned PARTITION(date='2024-02-01')
SELECT id, amount FROM sales_data WHERE date = '2024-02-01';
Data is inserted only into the date=2024-02-01 partition.
B) Dynamic Partitioning
Spark automatically detects partition values while inserting data.
Useful for bulk data loading.
SET spark.sql.sources.partitionOverwriteMode = DYNAMIC;
INSERT INTO sales_partitioned
SELECT id, amount, date FROM sales_data;
Spark automatically creates partitions based on date values.
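The partitionOverwriteMode setting above matters mainly for INSERT OVERWRITE: in dynamic mode, only the partitions present in the incoming data are replaced. A sketch of that pattern:
SET spark.sql.sources.partitionOverwriteMode = dynamic;
-- Replaces only the partitions that appear in the SELECT result, leaving other partitions untouched
INSERT OVERWRITE TABLE sales_partitioned
SELECT id, amount, date FROM sales_data WHERE date >= '2024-02-01';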
4️⃣ Optimizing Partitioned Queries
Partitioning is effective only when queries use filters on partitioned columns.
A) Enabling Partition Pruning
Partition pruning ensures Spark only scans relevant partitions.
SELECT * FROM sales_partitioned WHERE date >= '2024-01-01';
Spark reads only partitions where date >= '2024-01-01' instead of doing a full table scan.
B) Optimizing Query Performance With Partition Indexing
For Delta tables, use ZORDER BY for better performance.
OPTIMIZE sales_partitioned ZORDER BY (amount);
Z-Ordering clusters similar data together inside partitions for faster queries.
Summary: Creating & Using Partitions in Spark SQL
Feature | What It Does? | Example Query |
Single-Column Partitioning | Creates partitions by one column | PARTITIONED BY (date); |
Multi-Column Partitioning | Creates partitions based on multiple columns | PARTITIONED BY (year, country); |
Static Partitioning | Manually insert data into a specific partition | PARTITION (date='2024-02-01') |
Dynamic Partitioning | Automatically detects partitions when inserting data | INSERT INTO sales_partitioned SELECT ... |
Partition Pruning | Skips unnecessary partitions for faster queries | WHERE date = '2024-02-01' |
Z-Ordering (Delta Optimization) | Optimizes query performance inside partitions | OPTIMIZE sales_partitioned ZORDER BY (amount); |
1.7 Buckets (Cluster Data for Performance)
Purpose: Improve query speed by distributing data into equal-sized buckets.
Types of Bucketing in Spark SQL
1️⃣ Hash-Based Bucketing (Default)
Uses a hash function to distribute data across buckets.
Ensures even distribution of data across all buckets.
Best for: Faster joins, lookups, and aggregations on high-cardinality columns.
Example: Creating a Hash-Based Bucketed Table
CREATE TABLE sales_bucketed (
id INT,
amount DOUBLE
) USING PARQUET
CLUSTERED BY (id) INTO 4 BUCKETS;
Spark creates 4 buckets based on the hash of id.
Query That Benefits From Hash Bucketing
SELECT * FROM sales_bucketed WHERE id = 1001;
Spark directly looks in the relevant bucket instead of scanning all data.
2️⃣ Range-Based Bucketing
Divides data into buckets based on a range of values (instead of hashing).
Ensures data locality, improving query performance.
Best for: Queries involving range filters (BETWEEN, ORDER BY).
Example: Creating a Range-Based Bucketed Table
CREATE TABLE salary_bucketed (
employee_id INT,
salary DOUBLE
) USING PARQUET
CLUSTERED BY (salary) INTO 5 BUCKETS;
Spark distributes salaries into 5 buckets based on defined salary ranges.
Query That Benefits From Range Bucketing
SELECT * FROM salary_bucketed WHERE salary BETWEEN 5000 AND 10000;
Spark only scans the relevant buckets instead of all rows.
3️⃣ Multi-Column Bucketing (Composite Buckets)
Uses multiple columns to distribute data into buckets.
Best for: Multi-key joins, filtering, and grouped queries.
Example: Creating a Multi-Column Bucketed Table
CREATE TABLE customer_bucketed (
customer_id INT,
region STRING,
amount DOUBLE
) USING PARQUET
CLUSTERED BY (customer_id, region) INTO 6 BUCKETS;
Data is bucketed first by customer_id, then by region.
Query That Benefits From Multi-Column Bucketing
SELECT * FROM customer_bucketed WHERE customer_id = 101 AND region = 'US';
Spark quickly finds the bucket with both conditions, reducing scan time.
4️⃣ Bucketed Tables with Sorted Data (Hybrid Bucketing)
Combines bucketing and sorting for further optimization.
Best for: Faster aggregations & range queries (GROUP BY, ORDER BY).
Example: Creating a Bucketed Table With Sorted Data
CREATE TABLE sales_sorted_bucketed (
id INT,
amount DOUBLE,
date STRING
) USING PARQUET
CLUSTERED BY (id) SORTED BY (date) INTO 4 BUCKETS;
Buckets data by id and sorts rows within each bucket by date.
Query That Benefits From Sorted Bucketing
SELECT * FROM sales_sorted_bucketed WHERE id = 500 ORDER BY date;
Spark avoids scanning unnecessary buckets and retrieves sorted data quickly.
Optimizing Queries on Bucketed Tables
Enable bucket pruning to avoid unnecessary scans
SET spark.sql.sources.bucketing.enabled = true;
Join two bucketed tables efficiently
SELECT *
FROM sales_bucketed s
JOIN customers_bucketed c
ON s.id = c.customer_id;
This avoids shuffle operations, making joins faster.
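To confirm the shuffle is actually avoided, inspect the physical plan; as a sketch (assuming both tables are bucketed on the join key with the same number of buckets), no Exchange node should appear around the join:
EXPLAIN FORMATTED
SELECT *
FROM sales_bucketed s
JOIN customers_bucketed c
ON s.id = c.customer_id;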
Summary: Types of Bucketing in Spark SQL
Bucket Type | Best For | Example |
Hash-Based Bucketing | Joins & exact match lookups | CLUSTERED BY (id) INTO 4 BUCKETS; |
Range-Based Bucketing | Range queries (BETWEEN , ORDER BY ) | CLUSTERED BY (salary) INTO 5 BUCKETS; |
Multi-Column Bucketing | Optimized multi-key lookups | CLUSTERED BY (customer_id, region) INTO 6 BUCKETS; |
Sorted Bucketing | Faster aggregations & range queries | CLUSTERED BY (id) SORTED BY (date) INTO 4 BUCKETS; |
In Delta & Iceberg, indexing helps Spark find data faster without scanning the entire table.
1.8 Indexing in Delta Lake
Delta Lake supports multiple types of indexing, including Z-Ordering, Bloom Filters, and Secondary Indexing.
A) Z-Ordering (Multi-Column Sorting for Faster Queries)
🔹 Best for range queries & multi-column filtering
🔹 Example Use Case: Finding transactions by date and amount quickly.
Creating a Z-Ordered Index in Delta
OPTIMIZE delta_sales ZORDER BY (date, amount);
✔️ Reorders data inside Delta files to make queries like this faster:
SELECT * FROM delta_sales WHERE date > '2024-01-01' AND amount > 5000;
Instead of scanning all files, Spark reads only the relevant ones.
B) Bloom Filter Indexing (Fast Exact Match Lookups)
🔹 Best for finding unique values in large datasets (e.g., customer_id, email).
🔹 Example Use Case: Checking if a customer exists in a billion-record table.
Enabling Bloom Filter Indexing in Delta
ALTER TABLE delta_sales
SET TBLPROPERTIES ('delta.bloomFilter.columns' = 'customer_id');
✔️ Optimized for queries like:
SELECT * FROM delta_sales WHERE customer_id = 'CUST1234';
Delta skips unnecessary file scans & finds the exact match faster.
C) Secondary Indexing in Delta Lake
🔹 Best for creating indexes on frequently searched columns.
🔹 Example Use Case: Speeding up product_category searches in e-commerce data.
Enabling Secondary Indexing in Delta
1️⃣ Install Hyperspace for Spark (An Open-Source Secondary Indexing Tool)
from hyperspace import Hyperspace, IndexConfig
hs = Hyperspace(spark)
2️⃣ Create a Secondary Index on product_category
# df is the DataFrame backing delta_sales; index product_category and include id for covering lookups
hs.createIndex(df, IndexConfig("product_category_index", ["product_category"], ["id"]))
✔️ Query That Runs Faster With Secondary Index
SELECT * FROM delta_sales WHERE product_category = 'Electronics';
Instead of scanning the entire Delta table, Spark reads the prebuilt index.
1.9 Indexing in Apache Iceberg
Iceberg automatically indexes partitions & supports secondary indexing for complex queries.
A) Hidden Partition Indexing (Auto-Partitioning)
🔹 Best for partitioned tables (e.g., date or region).
🔹 Example Use Case: Speeding up queries on year partitions.
Creating an Iceberg Table With Hidden Partitioning
CREATE TABLE iceberg_sales (
id INT,
amount DOUBLE,
year INT
) USING ICEBERG;
✔️ Query That Runs Faster With Auto-Partition Indexing
SELECT * FROM iceberg_sales WHERE year = 2024;
If the table's partition spec includes year, only the year=2024 partition is scanned instead of all data.
B) Bloom Filter Indexing in Iceberg
🔹 Best for finding records in large datasets (e.g., searching for a specific email in user logs).
🔹 Example Use Case: Quickly checking if a transaction exists by transaction_id.
Enabling Bloom Filters in Iceberg
ALTER TABLE iceberg_sales SET TBLPROPERTIES (
'write.parquet.bloom-filter-enabled.column.transaction_id' = 'true'
);
Query That Runs Faster
SELECT * FROM iceberg_sales WHERE transaction_id = 'TXN789123';
Quickly finds transactions instead of scanning all records.
C) Lucene-Based Secondary Indexing (For Text & Complex Queries)
🔹 Best for full-text search (e.g., finding product_description containing "wireless").
🔹 Example Use Case: Searching large Iceberg tables for keyword matches.
Using Lucene Indexing in Iceberg
CREATE INDEX product_desc_index ON iceberg_sales (product_description)
USING LUCENE;
Optimized for queries like:
SELECT * FROM iceberg_sales WHERE product_description LIKE '%wireless%';
Instead of scanning billions of rows, Lucene finds matches instantly.
Summary: Delta vs. Iceberg Indexing
Index Type | Delta Lake | Apache Iceberg | Best For |
Z-Ordering | ✅ Supported (OPTIMIZE ZORDER BY ) | ❌ Not Available | Multi-column filtering |
Hidden Partition Indexing | ❌ Not Automatic (Must PARTITION BY ) | ✅ Built-In (No Explicit Partitioning) | Partitioned Queries |
Bloom Filter Indexing | ✅ delta.bloomFilter.columns | ✅ write.parquet.bloom-filter-enabled.column.<col> | Fast exact match lookups |
Secondary Indexing | ✅ Hyperspace (External Tool) | ✅ Lucene (Experimental) | Non-primary key lookups |
Text Search Indexing | ❌ Not Built-In (External Tools Needed) | ✅ Lucene (Full-Text Search) | Searching large text fields |
Choose Delta Lake if you need ACID transactions & Spark-native optimizations.
Choose Iceberg if you need automatic partitioning & text-based indexing.
Real-World Use Cases
Use Case | Best Indexing Type | Table Format |
Speeding up customer_id lookups | Bloom Filter Indexing | Delta & Iceberg |
Faster filtering on date & amount | Z-Ordering | Delta |
Searching for exact product_id | Secondary Index | Delta & Iceberg |
Text search in product_description | Lucene Index | Iceberg |
1.10 Temporal Tables & Streaming Tables in Spark SQL
Temporal Tables and Streaming Tables are used in real-time analytics to track changes, analyze historical data, and process continuous data streams.
1️⃣ Temporal Tables in Spark SQL
Temporal Tables store historical versions of data, allowing time travel (querying past data) and slowly changing dimensions (SCDs).
A) Delta Lake Time Travel (Version-Based Temporal Tables)
✔ Best for: Auditing, historical analysis, undo operations.
✔ Automatically stores past versions of data.
✔ Uses VERSION AS OF or TIMESTAMP AS OF.
Creating a Delta Table With Time Travel
CREATE TABLE delta_sales (
id INT,
amount DOUBLE,
date STRING
) USING DELTA;
Querying a Previous Version
SELECT * FROM delta_sales VERSION AS OF 5;
Querying Data at a Specific Timestamp
SELECT * FROM delta_sales TIMESTAMP AS OF '2024-02-01 10:00:00';
This helps analyze how data looked at a past point in time.
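Time travel also enables undo: a table can be rolled back to an earlier state. A sketch, assuming a Delta Lake release that supports the RESTORE command:
-- Roll the table back to version 5
RESTORE TABLE delta_sales TO VERSION AS OF 5;
-- Or back to a point in time
RESTORE TABLE delta_sales TO TIMESTAMP AS OF '2024-02-01 10:00:00';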
B) Slowly Changing Dimensions (SCD) Type 2 Temporal Table
Best for: Tracking changes in customer, product, or employee records over time.
Each change creates a new row with start_date & end_date.
Creating an SCD Type 2 Table
CREATE TABLE customer_history (
customer_id INT,
name STRING,
status STRING,
start_date TIMESTAMP,
end_date TIMESTAMP
) USING DELTA;
Inserting a New Record
INSERT INTO customer_history
VALUES (1, 'John Doe', 'Active', '2024-02-01', NULL);
Updating a Record (Closing Old Version & Adding New)
UPDATE customer_history
SET end_date = '2024-03-01'
WHERE customer_id = 1 AND end_date IS NULL;
INSERT INTO customer_history
VALUES (1, 'John Doe', 'Inactive', '2024-03-01', NULL);
Now, the table maintains a full history of changes.
2️⃣ Streaming Tables in Spark SQL (Real-Time Data Processing)
Streaming Tables process continuous data streams from sources like Kafka, Event Hubs, or Cloud Storage.
A) Creating a Streaming Table From Kafka
✔ Best for: Processing event streams, logs, user activity tracking.
✔ Reads data continuously without stopping.
Creating a Streaming Table (Kafka Source)
CREATE STREAMING TABLE kafka_sales_stream (
id INT,
amount DOUBLE,
event_time TIMESTAMP
) USING KAFKA
OPTIONS (
"kafka.bootstrap.servers" = "localhost:9092",
"subscribe" = "sales_topic"
);
Querying the Streaming Table in Real-Time
SELECT * FROM kafka_sales_stream;
This continuously pulls new records as they arrive in Kafka.
B) Creating a Streaming Table From Delta Lake
Best for: Processing real-time changes in a Delta Table.
Useful for CDC (Change Data Capture) & Live Dashboards.
Creating a Streaming Table From a Delta Source
CREATE STREAMING TABLE live_sales_stream
USING DELTA
LOCATION '/mnt/delta/sales';
Querying the Streaming Delta Table
SELECT * FROM live_sales_stream;
Updates as soon as new data is written to /mnt/delta/sales.
C) Creating a Streaming Table in Iceberg
Best for: Incremental processing in Iceberg tables.
Creating a Streaming Table From Iceberg
CREATE STREAMING TABLE iceberg_sales_stream
USING ICEBERG
LOCATION 's3://my-bucket/iceberg_sales';
Querying Iceberg Streaming Data
SELECT * FROM iceberg_sales_stream;
Supports automatic snapshot-based incremental queries.
Summary: Temporal vs. Streaming Tables in Spark SQL
Feature | Delta Lake Temporal Tables | Streaming Tables (Delta, Kafka, Iceberg) |
Best For? | Historical data tracking, SCD, Time Travel | Real-time processing, event-driven data |
Time Travel Query? | ✅ VERSION AS OF , TIMESTAMP AS OF | ❌ Not Supported |
Slowly Changing Dimensions? | ✅ Yes (start_date , end_date ) | ❌ No |
Real-Time Updates? | ❌ No | ✅ Yes |
Example Source? | Historical customer records | Kafka, Delta, Iceberg, Streaming APIs |
Use Temporal Tables for historical analysis & tracking changes.
Use Streaming Tables for real-time data ingestion & event processing.
1.11 Materialized Views
Purpose: Cache precomputed results for fast queries.
A) Creating a Materialized View
CREATE MATERIALIZED VIEW mv_sales
AS SELECT id, SUM(amount) AS total_sales FROM sales_data GROUP BY id;
🔹 Improves performance for frequent aggregations.
1.12 Schema & Metadata Objects
Purpose: Manage structured data definitions.
A) Show Schema
DESCRIBE TABLE sales_data;
B) Alter Schema
ALTER TABLE sales_data ADD COLUMNS (category STRING);
Summary: Latest Objects You Can Create in Spark SQL
This table summarizes all Spark SQL object types, their creation methods, and their purpose based on the latest features.
🔹 1️⃣ Databases (Catalogs & Namespaces)
Object Type | Example | Purpose |
Database | CREATE DATABASE sales_db; | Organize structured data like schemas in a database. |
🔹 2️⃣ Tables (Managed, External, Temporary, Delta, Iceberg, Hudi)
Table Type | Example | Purpose |
Managed Table | CREATE TABLE sales_data (id INT, amount DOUBLE) USING PARQUET; | Spark manages storage and metadata. |
External Table | CREATE EXTERNAL TABLE sales_external (id INT, amount DOUBLE) USING PARQUET LOCATION 's3://my-bucket/sales/'; | Stores schema in Spark but data in external storage (S3, ADLS, HDFS). |
Temporary Table | CREATE TEMP VIEW temp_sales AS SELECT * FROM sales_data; | Exists only in session memory, disappears after session ends. |
Global Temporary View | CREATE GLOBAL TEMP VIEW global_sales AS SELECT * FROM sales_data; | Accessible across sessions until all are closed. |
Delta Table | CREATE TABLE delta_sales USING DELTA AS SELECT * FROM parquet.`/mnt/datalake/sales/`; | Provides ACID transactions, schema evolution, time travel. |
Iceberg Table | CREATE TABLE iceberg_sales USING ICEBERG AS SELECT * FROM parquet.`/mnt/datalake/sales/`; | Supports hidden partitioning, time travel, and schema evolution. |
Hudi Table | CREATE TABLE hudi_sales USING HUDI AS SELECT * FROM parquet.`/mnt/datalake/sales/`; | Supports upserts, incremental queries, and real-time ingestion. |
🔹 3️⃣ Views (Logical Tables)
View Type | Example | Purpose |
Permanent View | CREATE VIEW sales_view AS SELECT id, amount FROM sales_data WHERE amount > 1000; | Saves a query result for reuse across sessions. |
Temporary View | CREATE TEMP VIEW temp_view AS SELECT * FROM sales_data WHERE date > '2024-01-01'; | Exists only within the current session. |
🔹 4️⃣ Functions (User-Defined Functions - UDFs)
UDF Type | Example | Purpose |
Scalar UDF | CREATE FUNCTION to_upper AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; | Applies transformation to single row values (e.g., converts text to uppercase). |
Aggregate UDF | CREATE FUNCTION custom_sum AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum'; | Aggregates multiple rows into a single value (e.g., SUM, AVG). |
Table UDF (UDTF) | CREATE FUNCTION split_string_udtf AS 'com.example.spark.udtf.SplitString'; | Converts one row → multiple rows (splitting strings into rows). |
Pandas UDF (Vectorized UDF) | @pandas_udf("double") | Applies batch-based transformations using Pandas for faster execution. |
🔹 5️⃣ Indexes (Query Optimization)
Index Type | Example | Purpose |
Bucketing (Indexing Alternative) | CREATE TABLE sales_bucketed CLUSTERED BY (id) INTO 4 BUCKETS; | Groups data into fixed partitions for faster joins & lookups. |
Z-Ordering (Delta Lake) | OPTIMIZE delta_sales ZORDER BY (date, amount); | Optimizes multi-column filtering by minimizing file scanning. |
Hidden Partitioning (Iceberg) | CREATE TABLE iceberg_sales USING ICEBERG; | Auto-indexing without explicit partitions. |
Bloom Filter Indexing (Hudi) | ALTER TABLE hudi_sales SET TBLPROPERTIES ('hoodie.index.type' = 'BLOOM'); | Fast lookups on high-cardinality columns like id . |
Materialized Views (Query Indexing Alternative) | CREATE MATERIALIZED VIEW mv_sales AS SELECT id, SUM(amount) FROM sales_data GROUP BY id; | Precomputes results for faster querying. |
🔹 6️⃣ Partitions (Large Table Optimization)
Partition Type | Example | Purpose |
Single-Column Partitioning | CREATE TABLE sales_partitioned PARTITIONED BY (date); | Splits data into folders based on a single column. |
Multi-Column Partitioning | CREATE TABLE sales_multi_partitioned PARTITIONED BY (year, country); | Partitions by multiple columns for complex queries. |
Static Partitioning | INSERT INTO sales_partitioned PARTITION(date='2024-02-01') SELECT id, amount FROM sales_data; | Manually assigns partition values when inserting data. |
Dynamic Partitioning | INSERT INTO sales_partitioned SELECT id, amount, date FROM sales_data; | Automatically detects partitions while inserting data. |
🔹 7️⃣ Buckets (Cluster Data for Performance)
Bucket Type | Example | Purpose |
Hash-Based Bucketing | CREATE TABLE sales_bucketed CLUSTERED BY (id) INTO 4 BUCKETS; | Speeds up joins & lookups by hashing id . |
Range-Based Bucketing | CREATE TABLE salary_bucketed CLUSTERED BY (salary) INTO 5 BUCKETS; | Optimizes range queries (BETWEEN , ORDER BY ). |
Multi-Column Bucketing | CREATE TABLE customer_bucketed CLUSTERED BY (customer_id, region) INTO 6 BUCKETS; | Best for multi-key joins & grouped queries. |
Sorted Bucketing | CREATE TABLE sales_sorted_bucketed CLUSTERED BY (id) SORTED BY (date) INTO 4 BUCKETS; | Optimizes range queries & aggregations. |
🔹 8️⃣ Materialized Views (Precomputed Query Caching)
Materialized View Type | Example | Purpose |
Standard Materialized View | CREATE MATERIALIZED VIEW mv_sales AS SELECT id, SUM(amount) FROM sales_data GROUP BY id; | Stores precomputed query results for faster performance. |
Incremental Refresh Materialized View | CREATE MATERIALIZED VIEW mv_incremental_sales USING DELTA LIVE TABLE AS SELECT id, SUM(amount) FROM live_delta_sales; | Automatically updates as new data arrives. |
Partitioned Materialized View | CREATE MATERIALIZED VIEW mv_partitioned_sales PARTITIONED BY (year, region) AS SELECT year, region, SUM(amount) FROM sales_data GROUP BY year, region; | Optimized for filtering by partitioned columns. |
Auto-Refresh Materialized View (Databricks) | ALTER MATERIALIZED VIEW mv_sales SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true'); | Keeps the view's underlying Delta files compacted automatically (refresh scheduling is configured separately). |