Mastering Transformations in Spark SQL
Table of contents
- Spark SQL Transformations
- 2. Types of Transformations in Spark: Narrow vs. Wide Transformations
- 3. Lazy Execution in PySpark
- 4. Working with Structured Data: Reading and Writing Parquet, ORC, Avro Files in Spark SQL
- 5. Aggregations in Spark SQL
- COUNT()
- SUM()
- 1. What Does SUM(amount) Really Do?
- Example:
- 2. SUM(amount) vs. COALESCE(SUM(amount), 0)
- 3. SUM(DISTINCT amount) – Summing Unique Values
- 4. SUM() with GROUP BY
- 5. SUM() with HAVING Clause
- 6. SUM() in Joins – Unexpected Duplicates
- 7. SUM() in LEFT JOIN – Handling NULLs
- 8. SUM() with CASE – Conditional Summing
- 9. SUM() with FILTER() – Alternative to CASE
- 10. SUM() in Window Functions – Running Total
- 11. SUM() with UNION vs. UNION ALL
- 12. SUM() with Negative Values
- AVG()
- MIN()
- MAX()
- COUNT(DISTINCT column_name)
- GROUP BY with Aggregations
- HAVING with Aggregations
- 1. How HAVING Works in SUM(salary) > 50000
- 2. HAVING vs. WHERE – Key Difference
- 3. HAVING with COUNT() – Filtering Groups by Count
- 4. HAVING with Multiple Conditions
- 5. HAVING with DISTINCT – Counting Unique Values
- 6. HAVING with CASE – Conditional Aggregation
- 7. HAVING with FILTER() – Alternative to CASE
- 8. HAVING with ORDER BY – Sorting Aggregated Data
- 9. HAVING with JOIN – Filtering After Aggregation
- 10. HAVING in Window Functions – Why It Doesn’t Work
- VARIANCE() / VAR_POP()
- STDDEV() / STDDEV_POP()
- FIRST() / LAST()
- COLLECT_SET()
- COLLECT_LIST()
- PERCENTILE() / APPROX_PERCENTILE()
- MODE() (Most Frequent Value - Custom Implementation)
- RANK() – Assigns ranks with gaps.
- DENSE_RANK() – Assigns ranks without gaps.
- ROW_NUMBER() – Assigns unique row numbers.
- NTILE(n) – Divides data into n equal parts.
- CORR(col1, col2) – Correlation between two columns.
- COVAR_POP(col1, col2) – Population covariance.
- COVAR_SAMP(col1, col2) – Sample covariance.
- ARRAY_AGG(column) – Aggregates data into an array.
- EXPLODE(array_column) – Converts array elements into separate rows.
- GROUPING SETS() – Computes multiple groupings in a single query.
- CUBE() – Aggregates across all possible combinations of specified columns.
- ROLLUP() – Computes hierarchical aggregations.
- User-Defined Aggregation Function (UDAF)
- DATE_TRUNC(unit, column) – Aggregation based on time intervals.
- WINDOW() – Used for time-based aggregations in structured streaming.
- PERCENT_RANK() – Computes the relative rank of a row.
- CUME_DIST() – Computes cumulative distribution of values.
- MEDIAN() (Custom Implementation)
- HISTOGRAM() – Creating Binned Distributions
- BIT_OR() / BIT_AND() – Bitwise Aggregations
- STRING_AGG() – Concatenating Strings in Aggregation
- HYPERLOGLOG_PLUS()
- ENTROPY() – Information Theory Aggregation
- MODE() (Built-in in Some Versions)
- GEOMETRIC_MEAN() – Multiplicative Mean
- HARMONIC_MEAN() – Reciprocal Mean Calculation
- MEDIAN() (Alternative Approach)
- PRODUCT() – Multiplication Aggregation
- SKEWNESS() – Skewness of a Distribution
- KURTOSIS() – Peakedness of a Distribution
- RATIO_TO_REPORT() – Proportional Contribution
- LEAD() / LAG() – Comparing Aggregates Over Rows
- PIVOT() – Transforming Rows into Columns
- UNPIVOT() – Transforming Columns into Rows
- ANY_VALUE() – Random Value in Group
- GROUPING() – Detecting Aggregation Levels
- TOP N Records using ROW_NUMBER()
- MODE() (Efficient Alternative Using Window Functions)
- GROUP_CONCAT() – Concatenation of Grouped Values
- BIT_XOR() – Bitwise XOR Aggregation
- Understanding BIT_XOR(permissions) in SQL
- 1. How BIT_XOR() Works
- 2. What is Bitwise XOR (⊕)?
- 3. Step-by-Step Calculation
- 4. BIT_XOR() vs. Other Bitwise Aggregations
- 5. BIT_XOR() with NULL Values
- 6. BIT_XOR() with HAVING – Filtering Groups
- 7. BIT_XOR() in Joins – Ensuring Correct Aggregation
- 8. BIT_XOR() with CASE – Conditional XOR
- 9. BIT_XOR() with UNION vs. UNION ALL
- 10. BIT_XOR() with Binary Data
- LOG_SUM_EXP()
- COUNT_IF()
- APPROX_MEDIAN() – Faster Median Computation for Big Data
- RATIO()
- REGR_SLOPE(), REGR_INTERCEPT()
- REGR_R2()
- UNIQUE_COUNT_ESTIMATE()
- MAD()
- GREATEST() and LEAST()
- COUNT_MIN_SKETCH()
- QUANTILE()
- GROUPING_ID()
- INVERSE_PERCENTILE()
- EXPONENTIAL_MOVING_AVG()
- DECILE()
- MOVING_AVERAGE()
- STDDEV_SAMP() and STDDEV_POP()
- VAR_SAMP() and VAR_POP()
- BLOOM_FILTER_AGG()
- Probabilistic Set Membership
- TOP_K()
- EVENT_RATE_ESTIMATE()
- HASH_AGG()
- STRING_AGG() with ORDER BY (Concatenation in Sorted Order)
- CARDINALITY_ESTIMATE() – Fast Approximate Cardinality
- BIT_COUNT() – Counting Number of 1s in a Bit Representation
- UNIQUE_PERCENTILE()
- FREQUENT_ITEMS()
- BIT_SET_AGG()
- MOST_RECENT()
- EARLIEST()
- DEVIATION()
- PERMUTATION_AGG() – Aggregate Across Permutations
- JACCARD_SIMILARITY() – Set-Based Similarity Calculation
- COSINE_SIMILARITY() – Similarity Between Feature Vectors
- TOP_N_PER_GROUP()
- L2_NORM()
- HAVERSINE_DISTANCE() – Distance Between Two Coordinates
Spark SQL Transformations
1. Spark SQL vs. PySpark DataFrames: Key Differences
| Feature | Spark SQL | PySpark DataFrames |
| --- | --- | --- |
| Interface | SQL-based (SELECT * FROM table) | Python-based (df.select(), df.filter()) |
| Performance | Optimized for complex queries (BI tools, dashboards) | Faster for iterative processing (ML, transformations) |
| Ease of Use | Easier for SQL users | More flexible for developers (Python, Scala) |
| Integration | Works with Hive, JDBC, BI tools | Works with Python ML libraries (scikit-learn, TensorFlow) |
| Best Use Cases | Ad-hoc queries, reporting, BI dashboards | ETL, Data Science, Machine Learning |
Logical Flow Diagram: How Spark SQL and PySpark Work
Spark SQL Workflow
1️⃣ Load data into a table or DataFrame.
2️⃣ Run SQL queries on the table.
3️⃣ Spark converts SQL into an optimized execution plan.
4️⃣ The query executes in a distributed cluster.
Example Workflow:
CREATE TABLE sales (id INT, amount DOUBLE);
SELECT * FROM sales WHERE amount > 1000;
PySpark DataFrame Workflow
1️⃣ Load data into a DataFrame.
2️⃣ Use PySpark functions (select, filter, groupBy).
3️⃣ Spark optimizes execution using the Catalyst engine.
4️⃣ Executes in a distributed cluster.
Example Workflow:
df = spark.read.csv("sales.csv", header=True)
df.filter(df.amount > 1000).show()
Both Spark SQL and PySpark DataFrames use the Catalyst optimizer to improve performance.
When to Use Spark SQL & PySpark Together
Many use cases require both Spark SQL & PySpark DataFrames.
Use Case 1: SQL Query with PySpark
Read a CSV file and filter results using Spark SQL:
df = spark.read.csv("sales.csv", header=True)
df.createOrReplaceTempView("sales_table")
high_sales = spark.sql("SELECT * FROM sales_table WHERE amount > 5000")
high_sales.show()
Uses SQL for filtering, then PySpark DataFrame for further processing.
Use Case 2: Writing PySpark DataFrame Results to a SQL Table
Transform a DataFrame and write results as a table:
df_filtered = df.filter(df.amount > 5000)
df_filtered.write.format("parquet").saveAsTable("high_sales")
Uses PySpark transformations but stores the output as a SQL table.
Use Case 3: Real-Time Streaming with Spark SQL + PySpark
Read real-time sales data from Kafka, process with PySpark, and store in a SQL table.
df_stream = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "sales_topic") \
    .load()
# "broker:9092" is a placeholder; replace it with your Kafka brokers.
# Kafka delivers raw bytes, so cast the value and extract the fields the SQL filter needs
# (this assumes the messages are JSON with an "amount" field).
df_sales = df_stream.selectExpr("CAST(value AS STRING) AS value") \
    .selectExpr("CAST(get_json_object(value, '$.amount') AS DOUBLE) AS amount", "value")
df_sales.createOrReplaceTempView("sales_stream")
query = spark.sql("SELECT * FROM sales_stream WHERE amount > 1000")
query.writeStream.format("console").start()
Combines PySpark DataFrames for real-time streaming with SQL queries.
Which One to Use?
Use Case | Use Spark SQL? | Use PySpark DataFrames? |
Data Exploration | ✅ Best choice (simple queries) | ❌ Not needed |
ETL Pipelines | ⚠️ Can be used | ✅ Best choice (Python-based processing) |
BI Dashboards (Power BI, Tableau) | ✅ Best choice | ❌ Not needed |
Machine Learning (MLlib, TensorFlow) | ❌ Not preferred | ✅ Best choice |
Streaming (Kafka, EventHub) | ✅ For SQL queries on stream | ✅ For transformations |
Spark SQL is best for SQL-heavy workflows, BI dashboards, and analytics.
PySpark DataFrames are best for ETL, ML, real-time streaming, and complex transformations.
2. Types of Transformations in Spark: Narrow vs. Wide Transformations
Transformations in Apache Spark are operations applied on DataFrames or RDDs to create a new dataset. They are lazy, meaning they do not execute immediately but build a logical execution plan.
There are two main types of transformations:
✅ Narrow Transformations (map, filter, etc.)
✅ Wide Transformations (groupBy, join, aggregate, etc.)
What are Narrow Transformations?
Narrow transformations are operations where each partition’s output depends only on a single input partition.
These operations do not require data shuffling across nodes, making them faster & more efficient.
Examples of Narrow Transformations:
1️⃣ map() → Applies a function to each row
2️⃣ filter() → Removes unwanted rows
3️⃣ flatMap() → Expands one row into multiple rows
4️⃣ drop() → Removes a column
Example: Narrow Transformations in PySpark
df = spark.createDataFrame([(1, 'John', 5000), (2, 'Alice', 7000)], ["id", "name", "salary"])
# Map transformation: Convert salary to INR
df_mapped = df.withColumn("salary_inr", df.salary * 82)
# Filter transformation: Get employees with salary > 6000
df_filtered = df_mapped.filter(df_mapped.salary > 6000)
df_filtered.show()
No data shuffle happens! The transformations apply within the same partition.
What are Wide Transformations?
Wide transformations are operations where data from multiple partitions is needed to compute the result.
These transformations require data shuffling, making them slower but necessary for aggregations.
Examples of Wide Transformations:
1️⃣ groupBy() → Groups data based on a column
2️⃣ join() → Combines data from two DataFrames
3️⃣ orderBy() → Sorts data across partitions
4️⃣ aggregate() → Computes sum, avg, count across partitions
Example: Wide Transformations in PySpark
df_sales = spark.createDataFrame([
(1, 'John', 'Electronics', 5000),
(2, 'Alice', 'Clothing', 7000),
(3, 'Bob', 'Electronics', 6000)
], ["id", "name", "category", "amount"])
# GroupBy transformation: Total sales per category
df_grouped = df_sales.groupBy("category").sum("amount")
df_grouped.show()
Here, Spark needs to shuffle data across nodes to group by category.
Narrow vs. Wide Transformations: Key Differences
Feature | Narrow Transformations | Wide Transformations |
Definition | Each output partition depends on only one input partition | Each output partition depends on multiple input partitions |
Requires Shuffling? | ❌ No (Faster) | ✅ Yes (Slower) |
Performance | High Performance | Requires network shuffle (Slower) |
Examples | map() , filter() , flatMap() | groupBy() , join() , orderBy() |
Logical Diagram: Narrow vs. Wide Transformations
Narrow Transformation (map())
Input Partition 1 → [Row1, Row2, Row3] --(map)--> Output Partition 1 → [Row1, Row2, Row3]
Each input partition maps directly to one output partition (No Shuffling).
Wide Transformation (groupBy())
Input Partition 1 → [Electronics, Clothing]
Input Partition 2 → [Electronics, Sports]
⬇ (Shuffle Data) ⬇
Output Partition 1 → [Electronics]
Output Partition 2 → [Clothing, Sports]
Data must be shuffled between nodes before aggregation happens.
Summary: Optimizing Transformations in Spark
Optimization Tip | Why? |
Prefer Narrow Transformations (map , filter ) | No shuffle = Faster execution |
Minimize Wide Transformations (join , groupBy ) | Reduces network overhead |
Use broadcast() for small DataFrames before joins | Avoids large shuffle operations |
Use repartition() wisely | Helps optimize partition size before wide transformations |
✅ Use Narrow Transformations when possible for speed.
✅ Use Wide Transformations only when needed for aggregations.
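The last two optimization tips can be sketched in PySpark as follows. This is a minimal, illustrative example; the DataFrames and column names are made up for demonstration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("ShuffleOptimization").getOrCreate()
orders = spark.createDataFrame(
    [(1, "C1", 500), (2, "C2", 700), (3, "C1", 200)], ["order_id", "customer_id", "amount"])
customers = spark.createDataFrame([("C1", "John"), ("C2", "Alice")], ["customer_id", "name"])

# Broadcast the small DataFrame so the join does not shuffle the large side
joined = orders.join(broadcast(customers), "customer_id")

# Repartition by the grouping key before a wide transformation to control the shuffle
totals = joined.repartition("customer_id").groupBy("customer_id").sum("amount")
totals.show()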
3. Lazy Execution in PySpark
Creating a DataFrame and Applying Transformations (Lazy Execution)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LazyExecutionExample").getOrCreate()
# Create a DataFrame
data = [(1, "Alice", 5000), (2, "Bob", 7000), (3, "Charlie", 8000)]
columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)
# Apply Transformations (Nothing Executes Yet!)
df_filtered = df.filter(df.salary > 6000)
df_mapped = df_filtered.withColumn("bonus", df_filtered.salary * 0.10)
# Still Lazy Execution - Spark has NOT run anything yet
At this point, Spark has only recorded the transformations but has NOT executed them.
When Does Execution Actually Happen?
Triggering an Action
df_mapped.show() # Action (Executes the Query)
Now Spark executes all transformations in one go!
Other Actions That Trigger Execution
Action | Purpose |
.show() | Displays the DataFrame |
.collect() | Brings data to the driver |
.count() | Counts rows |
.write.format("parquet").save("path") | Writes data to storage |
Transformations execute only when an action is called.
Logical Diagram: Lazy Execution Workflow
1️⃣ Transformation 1: Apply filter() → (Queued in Logical Plan)
2️⃣ Transformation 2: Apply map() → (Queued in Logical Plan)
3️⃣ Action: Call .show() → (Executes everything together)
Spark optimizes execution to run all transformations in a single stage.
Lazy Execution in Spark
Feature | Lazy Execution |
Execution Timing | Runs only when an action is triggered |
Performance | Optimized execution by combining transformations |
Fault Tolerance | Keeps a logical plan, allowing recovery from failure |
Example Transformations | map() , filter() , withColumn() (Lazy) |
Example Actions | .show() , .count() , .collect() (Triggers Execution) |
Lazy Execution makes Spark faster & more efficient!
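One way to see this behaviour is to print the query plan before triggering an action. A small sketch continuing the df_mapped example above:
# Nothing has executed yet; explain() only prints the plan Spark has recorded
df_mapped.explain(True)   # parsed, analyzed, optimized, and physical plans

# Execution happens only when an action such as show() is called
df_mapped.show()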
4. Working with Structured Data: Reading and Writing Parquet, ORC, Avro Files in Spark SQL
Structured data is commonly stored in optimized file formats like Parquet, ORC, and Avro, which help improve query performance and storage efficiency in big data applications.
This guide covers:
✅ What are Parquet, ORC, and Avro?
✅ How to read & write these formats in Spark SQL and PySpark?
✅ Performance considerations & best practices
4.1 Understanding Parquet, ORC, and Avro
Format | Best For | Advantages | Compression |
Parquet | Analytics & Queries | Columnar storage, schema evolution | High (Snappy, Gzip) |
ORC | Hive & Spark SQL | Best for Hive-based queries, column pruning | High (Zlib, Snappy) |
Avro | Data Serialization & Schema Evolution | Row-based, good for streaming | Medium (Deflate, Snappy) |
Parquet and ORC → Best for columnar analytics
Avro → Best for row-based serialization & schema evolution
Feature | Parquet & ORC (Columnar) | Avro (Row-Based) |
Best For | Large-scale analytics (OLAP) | Streaming & Transactions (OLTP) |
Storage Layout | Column-wise (stores individual columns separately) | Row-wise (stores full rows together) |
Speed | Faster for queries on specific columns | Faster for inserting & updating rows |
Use Case | Data lakes, analytics platforms (BigQuery, Redshift) | Kafka streaming, log storage, messaging systems |
4.2 Reading & Writing Parquet Files in Spark SQL
Parquet is a columnar format designed for fast querying and compression.
A) Reading Parquet Files
Read a single Parquet file
df = spark.read.parquet("s3://data/sales.parquet")
df.show()
Read multiple Parquet files
df = spark.read.parquet("s3://data/parquet_files/*")
Using Spark SQL to Query Parquet Files
SELECT * FROM parquet.`s3://data/sales.parquet` WHERE amount > 5000;
B) Writing Data to Parquet
Save DataFrame as a Parquet File
df.write.parquet("s3://data/output/sales_parquet")
Save as Table (Partitioned)
df.write.partitionBy("year").parquet("s3://data/output/sales_partitioned")
Best Practice: Use partitioning (and, on Delta Lake tables, Z-Ordering) for faster query execution.
C.) Reading & Writing ORC Files in Spark SQL
ORC (Optimized Row Columnar) is optimized for Hive and often achieves compression comparable to or better than Parquet.
Reading ORC Files
df = spark.read.orc("s3://data/sales.orc")
df.show()
Using Spark SQL to Query ORC Files
SELECT * FROM orc.`s3://data/sales.orc` WHERE category = 'Electronics';
Writing Data to ORC
Save DataFrame as ORC File
df.write.orc("s3://data/output/sales_orc")
Save with Compression
df.write.option("compression", "zlib").orc("s3://data/output/sales_compressed_orc")
ORC is ideal for Hive-based querying and columnar storage.
D.) Reading & Writing Avro Files in Spark SQL
Avro is row-based and best for schema evolution and data serialization.
Reading Avro Files
df = spark.read.format("avro").load("s3://data/sales.avro")
df.show()
Using Spark SQL to Query Avro Files
SELECT * FROM avro.`s3://data/sales.avro` WHERE customer_id = 1234;
Writing Data to Avro
Save DataFrame as Avro File
df.write.format("avro").save("s3://data/output/sales_avro")
Save with Schema Evolution Support
df.write.option("avroSchema", "path/to/schema.avsc").format("avro").save("s3://data/output/sales_avro_v2")
Avro is best for schema evolution and real-time processing.
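Note: Avro support ships as a separate module (spark-avro), so it must be on the classpath. A minimal sketch of one way to do this from PySpark; the artifact version shown is an example and should match your Spark and Scala versions.
from pyspark.sql import SparkSession

# Pull in the external spark-avro package (adjust the version to your cluster)
spark = (SparkSession.builder
         .appName("AvroExample")
         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.5.0")
         .getOrCreate())

df = spark.read.format("avro").load("s3://data/sales.avro")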
Performance Considerations: Choosing the Right Format
Use Case | Best Format |
Big Data Analytics (Columnar Queries) | Parquet, ORC |
Streaming & Schema Evolution | Avro |
Hive-Based Processing | ORC |
Data Lake Storage | Parquet |
✔ Use Parquet for analytics (better query speed & compression).
✔ Use ORC for Hive-based workloads (optimized columnar storage).
✔ Use Avro for schema evolution (best for streaming & Kafka integration).
Summary: How to Work with Parquet, ORC, Avro in Spark SQL
Operation | Parquet | ORC | Avro |
Read File | spark.read.parquet("path") | spark.read.orc("path") | spark.read.format("avro").load("path") |
Write File | df.write.parquet("path") | df.write.orc("path") | df.write.format("avro").save("path") |
Best Use Case | Data Lakes, Analytics | Hive, Columnar Processing | Streaming, Schema Evolution |
5. Aggregations in Spark SQL
COUNT()
Returns the number of rows in a dataset.
Example:
SELECT COUNT(*) FROM sales;
Tricky SQL Concepts with COUNT(*)
| Concept | Explanation |
| --- | --- |
| COUNT(*) vs COUNT(column) | COUNT(*) includes NULLs, COUNT(column) ignores NULLs |
| COUNT(DISTINCT column) | Counts only unique non-null values |
| COUNT(1) vs. COUNT(*) | No difference in modern databases |
| HAVING COUNT(*) | Filters groups based on count conditions |
| COUNT(*) in JOINs | Joins may create duplicate rows, inflating the count |
| COUNT(column_name) in LEFT JOIN | NULLs are ignored, leading to possible 0 counts |
| COUNT(*) OVER() | Running total count using window functions |
| COUNT(*) in UNION vs UNION ALL | UNION removes duplicates, UNION ALL keeps them |
| COUNT(*) FILTER (WHERE condition) | Conditional counting within aggregations |
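To make the first rows of the table concrete, here is a small PySpark sketch on illustrative data (the view name t and the category column are made up for the example):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CountVariants").getOrCreate()
spark.createDataFrame([(1, "A"), (2, None), (3, "A")], ["id", "category"]) \
    .createOrReplaceTempView("t")

spark.sql("""
    SELECT COUNT(*)                 AS count_all,      -- 3: every row, NULLs included
           COUNT(category)          AS count_col,      -- 2: the NULL category is ignored
           COUNT(DISTINCT category) AS count_distinct  -- 1: only the unique non-null value 'A'
    FROM t
""").show()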
SUM()
Computes the sum of a numeric column.
Examples:
SELECT SUM(amount) FROM sales;
More tricky scenarios:
1. What Does SUM(amount) Really Do?
- Adds up all values in the amount column.
- Ignores NULL values automatically.
- Returns NULL if no rows exist or all values are NULL.
Example:
SELECT SUM(amount) FROM sales;
sale_id | amount |
1 | 100 |
2 | 200 |
3 | NULL |
4 | 300 |
✅ SUM(amount) = 100 + 200 + 300 = 600 (NULL is ignored).
✅ If all amount values were NULL, SUM(amount) would return NULL.
2. SUM(amount) vs. COALESCE(SUM(amount), 0)
✅ If SUM(amount) returns NULL, it may cause issues in calculations.
✅ Using COALESCE() ensures it returns 0 instead of NULL.
Example:
SELECT COALESCE(SUM(amount), 0) AS total_amount FROM sales;
- If all rows contain NULL values, it will return 0 instead of NULL.
3. SUM(DISTINCT amount) – Summing Unique Values
✅ Using DISTINCT ensures duplicate values are summed only once.
Example:
SELECT SUM(DISTINCT amount) FROM sales;
sale_id | amount |
1 | 100 |
2 | 200 |
3 | 100 |
4 | 300 |
✅ SUM(amount) = 100 + 200 + 100 + 300 = 700
✅ SUM(DISTINCT amount) = 100 + 200 + 300 = 600 (duplicates ignored).
4. SUM() with GROUP BY
✅ Grouping results before summing them can produce unexpected results.
Example:
SELECT category, SUM(amount) FROM sales GROUP BY category;
category | amount |
Electronics | 200 |
Clothing | 150 |
Electronics | 300 |
✅ GROUP BY groups data before summing:
category | SUM(amount) |
Electronics | 200 + 300 = 500 |
Clothing | 150 |
5. SUM() with HAVING Clause
✅ Used to filter aggregated results.
Example:
SELECT category, SUM(amount) AS total_sales
FROM sales
GROUP BY category
HAVING SUM(amount) > 500;
- Filters only categories where total sales exceed 500.
6. SUM() in Joins – Unexpected Duplicates
✅ If a JOIN causes duplicate rows, SUM() may return inflated values.
Example:
SELECT c.customer_id, SUM(s.amount)
FROM customers c
JOIN sales s ON c.customer_id = s.customer_id
GROUP BY c.customer_id;
- If a customer has multiple rows due to the JOIN, SUM() may double-count sales.
✅ Fix: Use SUM(DISTINCT s.amount) if needed.
7. SUM() in LEFT JOIN – Handling NULLs
✅ If there are NULL values due to LEFT JOIN, SUM() may return NULL.
Example:
SELECT c.customer_id, SUM(s.amount)
FROM customers c
LEFT JOIN sales s ON c.customer_id = s.customer_id
GROUP BY c.customer_id;
customer_id | amount |
1 | 500 |
2 | NULL |
✅ SUM(amount) ignores NULLs, but customer 2 gets NULL.
✅ Fix: Use COALESCE(SUM(amount), 0) AS total_amount to return 0 instead of NULL.
8. SUM() with CASE – Conditional Summing
✅ Summing only specific values within a column.
Example:
SELECT
SUM(CASE WHEN category = 'Electronics' THEN amount ELSE 0 END) AS electronics_sales,
SUM(CASE WHEN category = 'Clothing' THEN amount ELSE 0 END) AS clothing_sales
FROM sales;
- Computes total sales separately for each category.
9. SUM() with FILTER() – Alternative to CASE
✅ A cleaner way to apply conditional aggregation.
Example:
SELECT
SUM(amount) FILTER (WHERE category = 'Electronics') AS electronics_sales,
SUM(amount) FILTER (WHERE category = 'Clothing') AS clothing_sales
FROM sales;
- Same as the CASE method but more concise.
10. SUM() in Window Functions – Running Total
✅ Using OVER() for cumulative summing.
Example:
SELECT customer_id, order_date,
SUM(amount) OVER(PARTITION BY customer_id ORDER BY order_date) AS running_total
FROM sales;
customer_id | order_date | amount | running_total |
1 | 2024-01-01 | 100 | 100 |
1 | 2024-01-02 | 200 | 300 |
1 | 2024-01-03 | 150 | 450 |
✅ Computes a running total of sales per customer.
11. SUM() with UNION vs. UNION ALL
✅ UNION removes duplicates, affecting SUM().
✅ UNION ALL keeps duplicates, increasing SUM().
Example:
SELECT SUM(amount) FROM (
SELECT amount FROM sales
UNION
SELECT amount FROM refunds
) AS combined_data;
- Removes duplicate amounts before summing.
- If we use UNION ALL, duplicates remain.
12. SUM() with Negative Values
✅ Handles both positive and negative numbers.
Example:
SELECT SUM(amount) FROM transactions;
transaction_id | amount |
1 | 500 |
2 | -200 |
3 | 300 |
✅ SUM(amount) = 500 + (-200) + 300 = 600
✅ Negative values affect totals, useful in accounting and refunds.
Tricky SQL Concepts with SUM(amount)
Concept | Explanation |
SUM(amount) ignores NULLs | NULL values are excluded from summation |
SUM(DISTINCT amount) | Summing only unique values |
SUM() with GROUP BY | Computes totals for each group |
SUM() with HAVING | Filters aggregated results |
SUM() in JOIN issues | Joins may duplicate rows, inflating totals |
SUM() in LEFT JOIN | NULLs may lead to unexpected results |
SUM() with CASE | Conditional summing |
SUM() with FILTER() | A cleaner conditional sum approach |
SUM() with OVER() | Running total calculation |
SUM() in UNION vs. UNION ALL | UNION removes duplicates, affecting sum |
SUM() with Negative Values | Important for transactions, refunds |
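The same patterns can be expressed with the DataFrame API. A minimal sketch (assuming an existing SparkSession named spark and the sales columns used above) showing a conditional sum and a NULL-safe total:
from pyspark.sql import functions as F

df_sales = spark.createDataFrame(
    [("Electronics", 200), ("Clothing", 150), ("Electronics", None)], ["category", "amount"])

df_sales.agg(
    # Equivalent to SUM(CASE WHEN category = 'Electronics' THEN amount ELSE 0 END)
    F.sum(F.when(F.col("category") == "Electronics", F.col("amount")).otherwise(0))
        .alias("electronics_sales"),
    # Equivalent to COALESCE(SUM(amount), 0)
    F.coalesce(F.sum("amount"), F.lit(0)).alias("total_sales")
).show()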
AVG()
Returns the average (mean) value of a numeric column.
Example:
SELECT AVG(price) FROM products;
Tricky SQL Concepts with AVG(price)
| Concept | Explanation |
| --- | --- |
| AVG(price) ignores NULLs | NULL values are excluded from the calculation |
| AVG(DISTINCT price) | Averages only unique values |
| Integer Division Issue | Convert to FLOAT or DECIMAL for correct results |
| AVG() with GROUP BY | Computes averages for each group |
| AVG() with HAVING | Filters groups based on the average |
| AVG() in Joins | Joins may create duplicate rows, affecting averages |
| AVG() in LEFT JOIN | NULLs may lead to unexpected results |
| AVG() with CASE | Conditional averaging |
| AVG() with FILTER() | A cleaner conditional average approach |
| AVG() with OVER() | Moving average calculation |
| AVG() in UNION vs. UNION ALL | UNION removes duplicates, affecting averages |
| AVG() with Negative Values | Important for transactions, refunds |
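A short DataFrame sketch of the first row of the table: AVG() divides by the count of non-null values only, which differs from treating NULLs as zero (illustrative data, existing SparkSession named spark assumed):
from pyspark.sql import functions as F

prices = spark.createDataFrame([(100.0,), (200.0,), (None,)], ["price"])
prices.agg(
    F.avg("price").alias("avg_ignoring_nulls"),                               # (100 + 200) / 2 = 150.0
    F.avg(F.coalesce(F.col("price"), F.lit(0.0))).alias("avg_nulls_as_zero")  # 300 / 3 = 100.0
).show()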
MIN()
Finds the minimum value in a column.
Example:
SELECT MIN(salary) FROM employees;
MAX()
Finds the maximum value in a column.
Example:
SELECT MAX(salary) FROM employees;
COUNT(DISTINCT column_name)
Returns the count of distinct values in a column.
Example:
SELECT COUNT(DISTINCT category) FROM products;
GROUP BY
with Aggregations
Used to apply aggregate functions on grouped data.
Example:
SELECT department, AVG(salary) FROM employees GROUP BY department;
AVG(salary) GROUP BY department
| Concept | Explanation |
| --- | --- |
| AVG(salary) ignores NULLs | NULL values are excluded from the calculation |
| AVG(DISTINCT salary) | Averages only unique values |
| Integer Division Issue | Convert to FLOAT or DECIMAL for correct results |
| AVG() with GROUP BY | Computes averages for each department |
| AVG() with HAVING | Filters groups based on the average |
| AVG() in Joins | Joins may create duplicate rows, affecting averages |
| AVG() in LEFT JOIN | NULLs may lead to unexpected results |
| AVG() with CASE | Conditional averaging |
| AVG() with FILTER() | A cleaner conditional average approach |
| AVG() with OVER() | Finds average salary per department without GROUP BY |
| AVG() in UNION vs. UNION ALL | UNION removes duplicates, affecting averages |
| AVG() with Negative Values | Important for finance, refunds |
HAVING
with Aggregations
Example:
SELECT department, SUM(salary) FROM employees GROUP BY department HAVING SUM(salary) > 50000;
1. How HAVING Works in SUM(salary) > 50000
SELECT department, SUM(salary) FROM employees GROUP BY department HAVING SUM(salary) > 50000;
- Groups data by department.
- Computes SUM(salary) for each department.
- Filters departments where the total salary exceeds 50000.
Example Data:
department | salary |
HR | 30000 |
HR | 25000 |
IT | 40000 |
IT | 10000 |
Finance | 20000 |
Query Execution:
department | SUM(salary) |
HR | 30000 + 25000 = 55000 ✅ (kept) |
IT | 40000 + 10000 = 50000 ❌ (removed) |
Finance | 20000 ❌ (removed) |
2. HAVING vs. WHERE – Key Difference
✅ WHERE filters rows before aggregation.
✅ HAVING filters after aggregation.
Incorrect Use of WHERE (Causes Error)
SELECT department, SUM(salary)
FROM employees
WHERE SUM(salary) > 50000 -- ❌ INVALID
GROUP BY department;
- Error: SUM(salary) is an aggregate function, and WHERE cannot use aggregate functions.
Correct Use with HAVING
SELECT department, SUM(salary)
FROM employees
GROUP BY department
HAVING SUM(salary) > 50000;
3. HAVING with COUNT() – Filtering Groups by Count
✅ Find departments with more than 2 employees.
SELECT department, COUNT(*)
FROM employees
GROUP BY department
HAVING COUNT(*) > 2;
Counts employees per department.
Filters only departments with more than 2 employees.
4. HAVING with Multiple Conditions
✅ Find departments where total salary is over 50000 AND average salary is above 25000.
SELECT department, SUM(salary), AVG(salary)
FROM employees
GROUP BY department
HAVING SUM(salary) > 50000 AND AVG(salary) > 25000;
- Filters on both SUM(salary) and AVG(salary).
5. HAVING with DISTINCT – Counting Unique Values
✅ Find departments where unique salaries exceed 2.
SELECT department, COUNT(DISTINCT salary)
FROM employees
GROUP BY department
HAVING COUNT(DISTINCT salary) > 2;
Counts unique salary values per department.
Filters departments with more than 2 unique salary values.
6. HAVING with CASE – Conditional Aggregation
Find departments where Managers earn over 40000 in total.
SELECT department, SUM(CASE WHEN job_title = 'Manager' THEN salary ELSE 0 END) AS manager_salary
FROM employees
GROUP BY department
HAVING SUM(CASE WHEN job_title = 'Manager' THEN salary ELSE 0 END) > 40000;
- Filters only departments where total salary for Managers exceeds 40000.
7. HAVING with FILTER() – Alternative to CASE
Find departments where total Manager salary exceeds 40000 (simplified).
SELECT department, SUM(salary) FILTER (WHERE job_title = 'Manager') AS manager_salary
FROM employees
GROUP BY department
HAVING SUM(salary) FILTER (WHERE job_title = 'Manager') > 40000;
- Same as CASE but more readable.
8. HAVING with ORDER BY – Sorting Aggregated Data
✅ Find departments with total salary > 50000 and sort by total salary.
SELECT department, SUM(salary) AS total_salary
FROM employees
GROUP BY department
HAVING SUM(salary) > 50000
ORDER BY total_salary DESC;
- Sorts only departments that meet the HAVING condition.
9. HAVING with JOIN – Filtering After Aggregation
✅ Find customers who spent over $1000 across all orders.
SELECT c.customer_id, SUM(o.amount) AS total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id
HAVING SUM(o.amount) > 1000;
- Filters only customers who spent more than $1000.
10. HAVING in Window Functions – Why It Doesn’t Work
✅ HAVING cannot be used with OVER() window functions.
SELECT employee_id, department, salary,
SUM(salary) OVER(PARTITION BY department) AS dept_salary
FROM employees
HAVING dept_salary > 50000; -- ❌ INVALID
- Error: HAVING works on grouped data, not windowed data.
- Fix: Use a subquery or WHERE instead.
Correct Approach:
SELECT * FROM (
SELECT employee_id, department, salary,
SUM(salary) OVER(PARTITION BY department) AS dept_salary
FROM employees
) subquery
WHERE dept_salary > 50000;
- Filters after window function execution.
Summary of Tricky HAVING Scenarios
Scenario | Explanation |
HAVING vs. WHERE | WHERE filters before aggregation, HAVING filters after |
HAVING with COUNT() | Filters groups based on row counts |
HAVING with multiple conditions | Combines filters on multiple aggregate functions |
HAVING with DISTINCT | Counts only unique values in groups |
HAVING with CASE | Filters based on conditional aggregation |
HAVING with FILTER() | A cleaner alternative to CASE |
HAVING with ORDER BY | Sorts aggregated results after filtering |
HAVING in Joins | Filters after aggregation in JOIN queries |
HAVING with Window Functions | Does not work, use subqueries instead |
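In the DataFrame API the HAVING pattern is simply a filter applied after the aggregation. A minimal sketch on the sample data above (existing SparkSession named spark assumed):
from pyspark.sql import functions as F

employees = spark.createDataFrame(
    [("HR", 30000), ("HR", 25000), ("IT", 40000), ("IT", 10000), ("Finance", 20000)],
    ["department", "salary"])

(employees
    .groupBy("department")
    .agg(F.sum("salary").alias("total_salary"))
    .filter(F.col("total_salary") > 50000)   # plays the role of HAVING SUM(salary) > 50000
    .show())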
VARIANCE()
/ VAR_POP()
Computes the variance of a column.
Example:
SELECT VARIANCE(salary) FROM employees;
Alternative Functions in Spark SQL
- STDDEV(salary) – Returns the standard deviation (square root of variance).
- VAR_SAMP(salary) – Computes sample variance instead of population variance.
- VAR_POP(salary) – Computes population variance.
Tricky SQL Concepts with VARIANCE(salary)
Scenario | Explanation |
VARIANCE() ignores NULLs | NULL values are excluded from calculation |
VAR_POP() vs. VAR_SAMP() | Population variance divides by n , sample variance by n - 1 |
Integer Division Issue | Convert to FLOAT or DECIMAL for correct results |
VARIANCE() with GROUP BY | Computes variance per department |
VARIANCE() with HAVING | Filters groups based on variance |
VARIANCE() in Joins | Duplicates may inflate variance, use DISTINCT |
VARIANCE() with CASE | Conditional variance calculation |
VARIANCE() with FILTER() | A cleaner conditional variance approach |
VARIANCE() with OVER() | Running variance calculation |
VARIANCE() in UNION vs. UNION ALL | UNION removes duplicates, affecting variance |
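A quick sketch contrasting the two variants on a tiny illustrative dataset (existing SparkSession named spark assumed); in Spark SQL, VARIANCE() corresponds to the sample variance (VAR_SAMP):
from pyspark.sql import functions as F

salaries = spark.createDataFrame([(1000.0,), (2000.0,), (3000.0,)], ["salary"])
salaries.agg(
    F.var_samp("salary").alias("var_samp"),  # divides by n - 1 -> 1000000.0
    F.var_pop("salary").alias("var_pop")     # divides by n     -> ~666666.67
).show()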
STDDEV()
/ STDDEV_POP()
Computes the standard deviation.
Example:
SELECT STDDEV(salary) FROM employees;
FIRST()
/ LAST()
Retrieves the first or last value in a group.
Example:
SELECT department, FIRST(employee_name) FROM employees GROUP BY department;
- Groups data by department.
- Retrieves the first employee_name in each department.
- The "first" value depends on how the database processes grouped data.
Example Data:
| employee_id | department | employee_name |
| --- | --- | --- |
| 1 | HR | Alice |
| 2 | HR | Bob |
| 3 | IT | Charlie |
| 4 | IT | David |
Expected Result:
| department | FIRST(employee_name) |
| --- | --- |
| HR | Alice |
| IT | Charlie |
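Because the result of FIRST() depends on row order, a window with an explicit ordering gives a deterministic "first employee per department". A sketch that assumes employee_id defines that order (existing SparkSession named spark assumed):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

employees = spark.createDataFrame(
    [(1, "HR", "Alice"), (2, "HR", "Bob"), (3, "IT", "Charlie"), (4, "IT", "David")],
    ["employee_id", "department", "employee_name"])

w = Window.partitionBy("department").orderBy("employee_id")
(employees
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)          # deterministic "first" row per department
    .select("department", "employee_name")
    .show())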
COLLECT_SET()
Returns a set of unique values as an array.
SELECT department, COLLECT_SET(employee_name) FROM employees GROUP BY department;
Purpose
- COLLECT_SET(employee_name) returns a unique list (set) of employee names within each department.
- It removes duplicates and returns an array of unique values.
How It Works
GROUP BY department → Groups employees by department.
COLLECT_SET(employee_name) → Collects unique employee names in each department into an array.
Example Dataset: employees
Table
employee_name | department |
Alice | IT |
Bob | HR |
Charlie | IT |
Alice | IT |
David | Finance |
Emma | HR |
Query Output
department | COLLECT_SET(employee_name) |
IT | ["Alice", "Charlie"] |
HR | ["Bob", "Emma"] |
Finance | ["David"] |
- Duplicates are removed: "Alice" appears twice in IT but is stored only once.
Alternative Function
- COLLECT_LIST(employee_name) → Collects all employee names as a list including duplicates.
COLLECT_LIST()
Returns all values as an array (including duplicates).
Example:
SELECT department, COLLECT_LIST(employee_name) FROM employees GROUP BY department;
Purpose
- COLLECT_LIST(employee_name) aggregates all employee names into a list for each department.
- Unlike COLLECT_SET(), it does not remove duplicates and preserves the order of occurrence.
How It Works
GROUP BY department → Groups employees by department.
COLLECT_LIST(employee_name) → Collects all employee names into an array with duplicates and order preserved.
Example Dataset: employees
Table
employee_name | department |
Alice | IT |
Bob | HR |
Charlie | IT |
Alice | IT |
David | Finance |
Emma | HR |
Query Output
department | COLLECT_LIST(employee_name) |
IT | ["Alice", "Charlie", "Alice"] |
HR | ["Bob", "Emma"] |
Finance | ["David"] |
- Duplicates are included: "Alice" appears twice in IT and is not removed.
Difference Between COLLECT_SET()
and COLLECT_LIST()
Function | Removes Duplicates | Maintains Order |
COLLECT_SET() | Yes | No |
COLLECT_LIST() | No | Yes |
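The same comparison in the DataFrame API, on the dataset above (a minimal sketch, existing SparkSession named spark assumed):
from pyspark.sql import functions as F

employees = spark.createDataFrame(
    [("Alice", "IT"), ("Bob", "HR"), ("Charlie", "IT"), ("Alice", "IT"),
     ("David", "Finance"), ("Emma", "HR")],
    ["employee_name", "department"])

employees.groupBy("department").agg(
    F.collect_set("employee_name").alias("unique_names"),  # duplicates removed
    F.collect_list("employee_name").alias("all_names")     # duplicates kept
).show(truncate=False)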
PERCENTILE()
/ APPROX_PERCENTILE()
Computes percentile values.
Example:
SELECT APPROX_PERCENTILE(salary, 0.5) FROM employees;
Purpose
- The function APPROX_PERCENTILE(column, percentile) computes the approximate percentile value for a given column.
- 0.5 represents the 50th percentile (median salary).
- It is faster than exact percentile calculations and is useful for large datasets.
How It Works
- APPROX_PERCENTILE(salary, 0.5) estimates the median salary (50th percentile).
- It uses an approximate algorithm instead of a full sort, making it efficient for big data.
Alternative Percentile Values:
- 0.25 → 25th percentile (lower quartile).
- 0.75 → 75th percentile (upper quartile).
- [0.25, 0.5, 0.75] → Returns multiple percentiles.
Example Dataset
Employee | Salary |
A | 3000 |
B | 5000 |
C | 6000 |
D | 7000 |
E | 10000 |
Sorted Salaries:
[3000, 5000, 6000, 7000, 10000]
50th Percentile (Median Salary):
6000
Query Output:
6000
Alternative Function
PERCENTILE_APPROX(salary, 0.5, 1000)
- More accurate percentile approximation.
- The third argument (1000) is the accuracy parameter (a larger value is more accurate but slower).
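The DataFrame API exposes the same idea through approxQuantile(); its third argument is a relative error, where 0 means exact but slower. A small sketch on the dataset above (existing SparkSession named spark assumed):
employees = spark.createDataFrame(
    [(3000,), (5000,), (6000,), (7000,), (10000,)], ["salary"])

# [25th, 50th, 75th] percentiles with 1% relative error
quartiles = employees.approxQuantile("salary", [0.25, 0.5, 0.75], 0.01)
print(quartiles)   # e.g. [5000.0, 6000.0, 7000.0]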
MODE() (Most Frequent Value - Custom Implementation)
Spark SQL doesn’t have a direct MODE() function in older versions, but it can be calculated using:
SELECT value, COUNT(*) AS frequency
FROM dataset
GROUP BY value
ORDER BY frequency DESC
LIMIT 1;
RANK() – Assigns ranks with gaps.
SELECT name, salary, RANK() OVER (ORDER BY salary DESC) FROM employees;
DENSE_RANK() – Assigns ranks without gaps.
SELECT name, salary, DENSE_RANK() OVER (ORDER BY salary DESC) FROM employees;
ROW_NUMBER() – Assigns unique row numbers.
SELECT name, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) FROM employees;
NTILE(n) – Divides data into n equal parts.
SELECT name, salary, NTILE(4) OVER (ORDER BY salary DESC) FROM employees;
Statistical Aggregations
CORR(col1, col2)
– Correlation between two columns.
SELECT CORR(salary, age) FROM employees;
Purpose
- The function CORR(column1, column2) calculates the Pearson correlation coefficient between two numerical columns.
- It measures the linear relationship between salary and age.
- The result is a value between -1 and 1:
  - 1 → Perfect positive correlation (as age increases, salary increases).
  - -1 → Perfect negative correlation (as age increases, salary decreases).
  - 0 → No correlation (no relationship between age and salary).
COVAR_POP(col1, col2)
– Population covariance.
SELECT COVAR_POP(salary, age) FROM employees;
Purpose
- COVAR_POP(column1, column2) calculates the population covariance between salary and age.
- Covariance measures the direction of the relationship between two numeric columns but not the strength (unlike correlation).
- The result can be:
  - Positive → When age increases, salary tends to increase.
  - Negative → When age increases, salary tends to decrease.
  - Zero or close to zero → No significant relationship.
Difference Between COVAR_POP()
and COVAR_SAMP()
Function | Formula | Use Case |
COVAR_POP(x, y) | Uses population formula (N in denominator) | When working with full data (entire population) |
COVAR_SAMP(x, y) | Uses sample formula (N-1 in denominator) | When working with a sample (not full population) |
COVAR_SAMP(col1, col2)
– Sample covariance.
SELECT COVAR_SAMP(salary, age) FROM employees;
Purpose
- COVAR_SAMP(column1, column2) calculates the sample covariance between salary and age.
- It measures how salary and age vary together in a sample (not the entire population).
- Unlike COVAR_POP(), this function divides by (N-1) instead of N, making it more suitable for sample data.
Difference Between COVAR_SAMP()
and COVAR_POP()
Function | Formula | Use Case |
COVAR_POP(x, y) | Uses N in the denominator | When working with full population data |
COVAR_SAMP(x, y) | Uses N-1 in the denominator | When working with sample data |
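All three statistics are also available as DataFrame aggregate functions; a small sketch on illustrative data (existing SparkSession named spark assumed):
from pyspark.sql import functions as F

employees = spark.createDataFrame(
    [(30, 3000.0), (35, 4000.0), (40, 5000.0), (45, 7000.0)], ["age", "salary"])

employees.agg(
    F.corr("salary", "age").alias("corr"),             # Pearson correlation, between -1 and 1
    F.covar_pop("salary", "age").alias("covar_pop"),   # population covariance (divides by N)
    F.covar_samp("salary", "age").alias("covar_samp")  # sample covariance (divides by N - 1)
).show()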
Array-Based Aggregations
ARRAY_AGG(column)
– Aggregates data into an array.
SELECT department, ARRAY_AGG(employee_name) FROM employees GROUP BY department;
Purpose
- The ARRAY_AGG(column) function aggregates values into an array for each group.
- It preserves duplicates and maintains the order of appearance in each department.
- The result is a single array per department, containing all employee names.
EXPLODE(array_column) – Converts array elements into separate rows.
SELECT EXPLODE(array_column) FROM table;
How ARRAY_AGG() Works
GROUP BY department → Groups employees by department.
ARRAY_AGG(employee_name) → Collects all employee names in each department into an array.
Example Dataset: employees
Table
employee_name | department |
Alice | IT |
Bob | HR |
Charlie | IT |
Alice | IT |
David | Finance |
Emma | HR |
Query Output
department | ARRAY_AGG(employee_name) |
IT | ["Alice", "Charlie", "Alice"] |
HR | ["Bob", "Emma"] |
Finance | ["David"] |
Duplicates are included: "Alice" appears twice in IT and is not removed.
Order is preserved.
Difference Between ARRAY_AGG(), COLLECT_LIST(), and COLLECT_SET()
Function | Removes Duplicates? | Maintains Order? |
ARRAY_AGG() | No | Yes |
COLLECT_LIST() | No | Yes |
COLLECT_SET() | Yes | No |
- ARRAY_AGG() behaves similarly to COLLECT_LIST() but is more SQL-standard.
Alternative Approaches
- COLLECT_LIST(employee_name) → Similar to ARRAY_AGG(), collects values into a list.
- COLLECT_SET(employee_name) → Removes duplicates while collecting values.
Advanced Aggregations
Grouping Sets and Rollup
GROUPING SETS()
– Computes multiple groupings in a single query.
SELECT department, job_role, SUM(salary)
FROM employees
GROUP BY GROUPING SETS ((department), (job_role), (department, job_role));
Purpose
- GROUPING SETS allows multiple grouping combinations in a single query.
- It is more flexible than GROUP BY, producing subtotal and grand total rows.
- Each tuple in GROUPING SETS defines a specific aggregation level.
How It Works
GROUPING SETS ((department), (job_role), (department, job_role))
- (department) → Groups only by department (ignores job role).
- (job_role) → Groups only by job role (ignores department).
- (department, job_role) → Groups by both department and job role (normal GROUP BY).
This produces multiple aggregation levels.
Example Dataset: employees
Table
employee_name | department | job_role | salary |
Alice | IT | Developer | 5000 |
Bob | HR | Manager | 7000 |
Charlie | IT | Developer | 6000 |
David | IT | Analyst | 8000 |
Emma | HR | Analyst | 6000 |
Query Output
department | job_role | SUM(salary) |
IT | NULL | 19000 |
HR | NULL | 13000 |
NULL | Developer | 11000 |
NULL | Manager | 7000 |
NULL | Analyst | 14000 |
IT | Developer | 11000 |
IT | Analyst | 8000 |
HR | Manager | 7000 |
HR | Analyst | 6000 |
Understanding NULL Values in Output
- When job_role is NULL, it means the row represents a department-level total.
- When department is NULL, it means the row represents a job role-level total.
Tricky Edge Cases
Using GROUPING_ID() to Identify Aggregate Levels
SELECT department, job_role, SUM(salary), GROUPING_ID(department, job_role) FROM employees GROUP BY GROUPING SETS ((department), (job_role), (department, job_role));
- This adds a column (GROUPING_ID) that identifies the level of aggregation.
Adding a Grand Total (Full Aggregation)
SELECT department, job_role, SUM(salary) FROM employees GROUP BY GROUPING SETS ((department), (job_role), (department, job_role), ());
- Adding () at the end includes a grand total row.
Alternative Approaches
Approach | Use Case |
GROUPING SETS() | Selective groupings (flexible subtotals) |
ROLLUP() | Hierarchical aggregation (e.g., department → job role → total) |
CUBE() | All possible group combinations |
CUBE()
– Aggregates across all possible combinations of specified columns.
SELECT department, job_role, SUM(salary)
FROM employees
GROUP BY CUBE(department, job_role);
Purpose
- The CUBE() function computes all possible combinations of groupings.
- It generates totals (aggregates) at multiple levels, including:
  - Individual department totals
  - Individual job_role totals
  - Combinations of both
  - Grand total (sum of all salaries)
How It Works
CUBE(department, job_role) creates a power set of grouping columns, meaning:
- (department, job_role) → Normal GROUP BY on both columns.
- (department, NULL) → Groups only by department (ignores job_role).
- (NULL, job_role) → Groups only by job_role (ignores department).
- (NULL, NULL) → Grand total (sum of all salaries).
Example Dataset: employees
Table
employee_name | department | job_role | salary |
Alice | IT | Developer | 5000 |
Bob | HR | Manager | 7000 |
Charlie | IT | Developer | 6000 |
David | IT | Analyst | 8000 |
Emma | HR | Analyst | 6000 |
Query Output (CUBE(department, job_role)
)
department | job_role | SUM(salary) |
IT | Developer | 11000 |
IT | Analyst | 8000 |
HR | Manager | 7000 |
HR | Analyst | 6000 |
IT | NULL | 19000 |
HR | NULL | 13000 |
NULL | Developer | 11000 |
NULL | Manager | 7000 |
NULL | Analyst | 14000 |
NULL | NULL | 32000 |
Difference Between CUBE()
, ROLLUP()
, and GROUPING SETS()
Function | Generates Grand Total? | Generates Subtotals? | Best Use Case |
GROUPING SETS() | No (unless manually included) | Custom groupings | Selective subtotals |
ROLLUP(department, job_role) | Yes | Hierarchical subtotals (department → job_role → total) | Hierarchical aggregation |
CUBE(department, job_role) | Yes | All possible group combinations | All possible aggregations |
Tricky Edge Cases
Using GROUPING_ID() to Identify Aggregation Level
SELECT department, job_role, SUM(salary), GROUPING_ID(department, job_role) FROM employees GROUP BY CUBE(department, job_role);
- This helps differentiate full totals, partial totals, and raw values.
Filtering Out Grand Totals
SELECT department, job_role, SUM(salary) FROM employees GROUP BY CUBE(department, job_role) HAVING department IS NOT NULL OR job_role IS NOT NULL;
- Removes the (NULL, NULL) grand total row.
ROLLUP()
– Computes hierarchical aggregations.
SELECT department, job_role, SUM(salary)
FROM employees
GROUP BY ROLLUP(department, job_role);
Purpose
- The ROLLUP() function is used to generate hierarchical aggregations.
- It computes subtotals at different levels, following a top-down hierarchy.
- Unlike CUBE(), ROLLUP() does not generate all possible combinations; it follows a hierarchy from left to right.
How It Works
ROLLUP(department, job_role) generates:
- (department, job_role) → Normal GROUP BY (Department + Job Role).
- (department, NULL) → Groups by Department only (Subtotal per department).
- (NULL, NULL) → Grand total (Sum of all salaries).
Hierarchy Breakdown
┌──────────────┐
│ Grand Total │
├──────────────┤
│ Department 1 │
│ ├─ Job A │
│ ├─ Job B │
├──────────────┤
│ Department 2 │
│ ├─ Job C │
│ ├─ Job D │
└──────────────┘
- No (NULL, job_role) rows (unlike CUBE(), it doesn't group by job roles alone).
Example Dataset
employee_name | department | job_role | salary |
Alice | IT | Developer | 5000 |
Bob | HR | Manager | 7000 |
Charlie | IT | Developer | 6000 |
David | IT | Analyst | 8000 |
Emma | HR | Analyst | 6000 |
Query Output (ROLLUP(department, job_role)
)
department | job_role | SUM(salary) |
IT | Developer | 11000 |
IT | Analyst | 8000 |
HR | Manager | 7000 |
HR | Analyst | 6000 |
IT | NULL | 19000 |
HR | NULL | 13000 |
NULL | NULL | 32000 |
Difference Between ROLLUP()
, CUBE()
, and GROUPING SETS()
Function | Generates Grand Total? | Generates Subtotals? | Generates All Combinations? | Best Use Case |
GROUPING SETS() | ❌ No (unless manually included) | ✅ Custom groupings | ❌ No | Selective subtotals |
ROLLUP(department, job_role) | ✅ Yes | ✅ Hierarchical subtotals (department → job_role → total) | ❌ No | Hierarchical aggregation |
CUBE(department, job_role) | ✅ Yes | ✅ All possible group combinations | ✅ Yes | All possible aggregations |
Tricky Edge Cases
Using GROUPING_ID() to Identify Aggregation Level
SELECT department, job_role, SUM(salary), GROUPING_ID(department, job_role) FROM employees GROUP BY ROLLUP(department, job_role);
- Helps differentiate subtotals and grand total rows.
Filtering Out Grand Total
SELECT department, job_role, SUM(salary) FROM employees GROUP BY ROLLUP(department, job_role) HAVING department IS NOT NULL;
- Removes the (NULL, NULL) grand total row.
Approximate Aggregations (For Big Data)
APPROX_COUNT_DISTINCT(column) – Faster unique count estimation.
SELECT APPROX_COUNT_DISTINCT(employee_id) FROM employees;
APPROX_PERCENTILE(column, percentile) – Faster percentile computation.
SELECT APPROX_PERCENTILE(salary, 0.9) FROM employees;
Custom Aggregations Using User-Defined Functions (UDAF)
User-Defined Aggregation Function (UDAF)
- You can define a custom aggregation using PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
import pandas as pd
spark = SparkSession.builder.appName("UDAF Example").getOrCreate()
df = spark.createDataFrame([("IT", 5000.0), ("IT", 6000.0), ("HR", 7000.0)], ["department", "salary"])
# Grouped-aggregate pandas UDF (requires pyarrow): receives all salaries of a group as a pandas Series
@pandas_udf(DoubleType())
def custom_avg(salary: pd.Series) -> float:
    return float(salary.sum() / salary.count())
df.groupBy("department").agg(custom_avg("salary").alias("custom_avg")).show()
Time-Series Aggregations
DATE_TRUNC(unit, column)
– Aggregation based on time intervals.
SELECT DATE_TRUNC('month', order_date), SUM(sales) FROM sales GROUP BY 1;
WINDOW()
– Used for time-based aggregations in structured streaming.
SELECT window(time_column, '1 hour'), COUNT(*) FROM events GROUP BY 1;
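The same time bucketing is available in the DataFrame API through the window() function. The sketch below runs on a static DataFrame with made-up events, but the same expression is used on streaming DataFrames (existing SparkSession named spark assumed):
from pyspark.sql import functions as F

events = spark.createDataFrame(
    [("2024-01-01 10:05:00", "click"), ("2024-01-01 10:40:00", "click"), ("2024-01-01 11:10:00", "view")],
    ["time_column", "event_type"])
events = events.withColumn("time_column", F.to_timestamp("time_column"))

# Count events per 1-hour tumbling window
events.groupBy(F.window("time_column", "1 hour")).count().show(truncate=False)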
Quantile-Based Aggregations
PERCENT_RANK()
–
Computes the relative rank of a row.
SELECT name, salary, PERCENT_RANK() OVER (ORDER BY salary DESC) FROM employees;
Explanation of PERCENT_RANK() OVER (ORDER BY salary DESC)
in Spark SQL
The query:
SELECT name, salary, PERCENT_RANK() OVER (ORDER BY salary DESC)
FROM employees;
Purpose
- PERCENT_RANK() computes the relative rank of each row as a percentage between 0 and 1.
- It compares a row’s position within the dataset; rows with equal values receive the same PERCENT_RANK value.
Ordering:
- ORDER BY salary DESC → Ranks employees based on salary in descending order (the highest salary gets rank 0).
- If multiple employees have the same salary, they receive the same rank, but PERCENT_RANK() keeps counting positions.
The PERCENT_RANK() function calculates the relative rank of a row as a percentage between 0 and 1 using the formula:
PERCENT_RANK = (Rank - 1) / (Total Rows - 1)
Where:
- Rank = The rank of the current row (using the RANK() function).
- Total Rows = The total number of rows in the result set.
How It Works
Ranks employees by salary in descending order.
Computes the relative rank as a percentage from 0 to 1.
Formula ensures the highest-ranked employee gets 0, and the lowest-ranked gets 1.
Example Dataset
Employee | Salary |
Alice | 9000 |
Bob | 8000 |
Charlie | 7000 |
David | 7000 |
Emma | 6000 |
Query Output (PERCENT_RANK() OVER (ORDER BY salary DESC)
)
Employee | Salary | PERCENT_RANK |
Alice | 9000 | 0.00 |
Bob | 8000 | 0.25 |
Charlie | 7000 | 0.50 |
David | 7000 | 0.50 |
Emma | 6000 | 1.00 |
Understanding the Values
- Alice (9000) → Rank = 1 → PERCENT_RANK = (1-1)/(5-1) = 0.00
- Bob (8000) → Rank = 2 → PERCENT_RANK = (2-1)/(5-1) = 0.25
- Charlie & David (7000) → Tied Rank = 3 → PERCENT_RANK = (3-1)/(5-1) = 0.50
- Emma (6000) → Rank = 5 → PERCENT_RANK = (5-1)/(5-1) = 1.00
Difference Between PERCENT_RANK()
, RANK()
, and DENSE_RANK()
Function | Handles Duplicates? | Skips Ranks? | Output Range |
PERCENT_RANK() | ✅ Yes | ✅ Yes | 0 to 1 (Percentage) |
RANK() | ✅ Yes | ✅ Yes | 1, 2, 2, 4, ... (Skips ranks) |
DENSE_RANK() | ✅ Yes | ❌ No | 1, 2, 2, 3, ... (No gaps) |
Alternative Approaches
Ranking Employees with RANK()
SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS rank FROM employees;
- Outputs actual ranking instead of a percentage.
Using PERCENT_RANK() with Partitioning
SELECT department, name, salary, PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary DESC) FROM employees;
- Ranks employees within each department separately.
CUME_DIST()
–
Computes cumulative distribution of values.
SELECT name, salary, CUME_DIST() OVER (ORDER BY salary DESC) FROM employees;
Data & Calculation
name | salary |
Alice | 90,000 |
Bob | 80,000 |
Charlie | 80,000 |
David | 70,000 |
Emma | 60,000 |
Step-by-Step Calculation of CUME_DIST()
name | salary | Rank | Rows ≤ current row | Total Rows | CUME_DIST() |
Alice | 90,000 | 1 | 1 | 5 | 1/5 = 0.20 |
Bob | 80,000 | 2 | 3 | 5 | 3/5 = 0.60 |
Charlie | 80,000 | 2 | 3 | 5 | 3/5 = 0.60 |
David | 70,000 | 4 | 4 | 5 | 4/5 = 0.80 |
Emma | 60,000 | 5 | 5 | 5 | 5/5 = 1.00 |
Interpretation:
- Alice’s salary is the highest (1 out of 5 people, 0.20 cumulative distribution).
- Bob and Charlie share the same salary, so they have the same CUME_DIST() value (3/5 = 0.60).
- The lowest salary has a CUME_DIST() of 1.0, since with descending ordering 100% of rows fall at or before that position.
MEDIAN()
(Custom Implementation)
Spark doesn’t have a built-in median, but it can be calculated as:
SELECT percentile_approx(salary, 0.5) FROM employees;
HISTOGRAM()
– Creating Binned Distributions
Computes a histogram (frequency distribution) of numerical data.
Example:
SELECT histogram_numeric(salary, 10) FROM employees;
- This bins salary into 10 buckets and returns frequency counts.
Purpose
- HISTOGRAM_NUMERIC(column, num_bins) approximates the histogram distribution of a numeric column.
- The second parameter (10 in this case) defines the number of bins (ranges).
- It returns an array of (bucket, frequency) pairs, showing how salaries are distributed across the bins.
How It Works
- Divides the salary values into approximately num_bins intervals.
- Counts the number of values falling into each bin.
- Returns an array of tuples (bin_value, frequency), where:
  - bin_value → Midpoint or representative value of the bin.
  - frequency → Count of salaries falling into that bin.
Example Dataset
Employee | Salary |
A | 3000 |
B | 5000 |
C | 6000 |
D | 7000 |
E | 10000 |
Query Output (HISTOGRAM_NUMERIC(salary, 3)
)
SELECT HISTOGRAM_NUMERIC(salary, 3) FROM employees;
Bin (Midpoint) | Frequency (Count) |
3000.0 | 1 |
6000.0 | 3 |
10000.0 | 1 |
Explanation:
- The 3 bins divide the range (3000 - 10000) into 3 groups.
- The first bin (3000.0) → Contains 1 salary (3000).
- The second bin (6000.0) → Contains 5000, 6000, 7000 (3 salaries).
- The third bin (10000.0) → Contains 1 salary (10000).
Using More Bins (HISTOGRAM_NUMERIC(salary, 5)
)
SELECT HISTOGRAM_NUMERIC(salary, 5) FROM employees;
Bin (Midpoint) | Frequency (Count) |
3000.0 | 1 |
5000.0 | 1 |
6000.0 | 1 |
7000.0 | 1 |
10000.0 | 1 |
- Now each salary falls into its own bin since num_bins = 5 (matching the number of salaries).
Key Observations
HISTOGRAM_NUMERIC()
helps visualize data distribution in a structured way.
More bins increase granularity, fewer bins group data together.
Useful for histogram visualizations, anomaly detection, and trends.
Alternative Approaches
Using PERCENTILE_APPROX() for Quartile-Based Distribution
SELECT PERCENTILE_APPROX(salary, ARRAY(0.25, 0.5, 0.75)) FROM employees;
- Returns quartile distributions instead of a histogram.
Grouping Data Manually
SELECT CASE WHEN salary < 5000 THEN 'Low' WHEN salary BETWEEN 5000 AND 8000 THEN 'Medium' ELSE 'High' END AS salary_range, COUNT(*) FROM employees GROUP BY salary_range;
- Creates custom salary groups instead of bins.
BIT_OR()
/ BIT_AND()
– Bitwise Aggregations
Performs bitwise OR / AND operations across grouped values.
Example:
SELECT department, BIT_OR(permissions) FROM users GROUP BY department;
STRING_AGG() – Concatenating Strings in Aggregation
Concatenates multiple string values into a single string (not available as a built-in in all Spark versions; see GROUP_CONCAT() below for a CONCAT_WS() + COLLECT_LIST() alternative).
Example:
SELECT department, STRING_AGG(employee_name, ', ') FROM employees GROUP BY department;
- Outputs comma-separated lists of employee names per department.
HYPERLOGLOG_PLUS()
Advanced Cardinality Estimation
Similar to APPROX_COUNT_DISTINCT(), but optimized for large-scale unique counting (in Spark, APPROX_COUNT_DISTINCT() itself is backed by the HyperLogLog++ algorithm).
SELECT HYPERLOGLOG_PLUS(user_id) FROM website_visits;
ENTROPY()
– Information Theory Aggregation
Measures data randomness or distribution balance.
Example:
SELECT ENTROPY(sales_category) FROM transactions;
MODE()
(Built-in in Some Versions)
Finds the most frequently occurring value.
Example (alternative implementation for older versions):
SELECT value FROM dataset GROUP BY value ORDER BY COUNT(*) DESC LIMIT 1;
GEOMETRIC_MEAN()
– Multiplicative Mean
Computes the geometric mean (useful in financial applications).
Example:
SELECT EXP(AVG(LOG(salary))) FROM employees;
HARMONIC_MEAN()
– Reciprocal Mean Calculation
Computes the harmonic mean for rates or ratios.
Example:
SELECT COUNT(*) / SUM(1.0 / salary) FROM employees;
MEDIAN() (Alternative Approach)
Spark SQL does not have a built-in MEDIAN() function in older versions, but it can be approximated using:
SELECT percentile_approx(salary, 0.5) FROM employees;
- Efficient for large datasets (the result is approximate).
PRODUCT()
– Multiplication Aggregation
Returns the product of column values (alternative implementation in Spark SQL).
Example:
SELECT EXP(SUM(LOG(value))) FROM numbers;
SKEWNESS()
– Skewness of a Distribution
Measures the asymmetry of the distribution of values.
Example:
SELECT SKEWNESS(salary) FROM employees;
KURTOSIS()
– Peakedness of a Distribution
Measures whether the dataset has light/heavy tails.
Example:
SELECT KURTOSIS(salary) FROM employees;
RATIO_TO_REPORT()
– Proportional Contribution
Calculates the percentage of each value over the total sum.
Example:
SELECT department, salary, salary / SUM(salary) OVER (PARTITION BY department) AS ratio FROM employees;
LEAD() / LAG() – Comparing Aggregates Over Rows
Accesses previous (LAG) or next (LEAD) row values.
Example:
SELECT name, salary, LAG(salary, 1) OVER (ORDER BY salary DESC) FROM employees;
PIVOT()
– Transforming Rows into Columns
Used for pivoting categorical data.
Example:
SELECT * FROM ( SELECT department, job_role, salary FROM employees ) PIVOT ( SUM(salary) FOR job_role IN ('Manager', 'Developer', 'Analyst') );
UNPIVOT() – Transforming Columns into Rows
Reverse of PIVOT().
Example:
SELECT department, job_role, salary FROM employees UNPIVOT (salary FOR job_role IN (Manager, Developer, Analyst));
ANY_VALUE()
– Random Value in Group
Returns an arbitrary value from the group without enforcing order.
Example:
SELECT department, ANY_VALUE(employee_name) FROM employees GROUP BY department;
GROUPING() – Detecting Aggregation Levels
Identifies whether a column is aggregated in a GROUPING SETS(), ROLLUP(), or CUBE().
Example:
SELECT department, job_role, SUM(salary), GROUPING(department), GROUPING(job_role) FROM employees GROUP BY ROLLUP(department, job_role);
TOP N Records
using ROW_NUMBER()
Fetch Top N records per group.
Example:
SELECT * FROM ( SELECT name, department, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rank FROM employees ) WHERE rank <= 3;
MODE() (Efficient Alternative Using Window Functions)
Instead of the earlier workaround for MODE(), we can use DENSE_RANK().
Example:
SELECT value FROM ( SELECT value, COUNT(*) AS frequency, DENSE_RANK() OVER (ORDER BY COUNT(*) DESC) AS rnk FROM dataset GROUP BY value ) WHERE rnk = 1;
- This method works efficiently for datasets where multiple values may have the highest frequency.
GROUP_CONCAT() – Concatenation of Grouped Values
Alternative to STRING_AGG(), but not natively available in Spark SQL (can be simulated).
Example using COLLECT_LIST():
SELECT department, CONCAT_WS(', ', COLLECT_LIST(employee_name)) FROM employees GROUP BY department;
BIT_XOR() – Bitwise XOR Aggregation
Computes the bitwise XOR across grouped values.
Example:
SELECT department, BIT_XOR(permissions) FROM users GROUP BY department;
Understanding BIT_XOR(permissions) in SQL
The BIT_XOR() function performs a bitwise XOR (Exclusive OR) operation on all values in a group. It is commonly used in cryptographic applications, permission calculations, checksum operations, and error detection mechanisms.
1. How BIT_XOR() Works
SELECT department, BIT_XOR(permissions) FROM users GROUP BY department;
- Groups data by department.
- Performs a bitwise XOR operation across all permissions values.
- Returns a single XOR'd value per department.
2. What is Bitwise XOR (⊕)?
XOR (Exclusive OR) follows these rules:
1 ⊕ 1 = 0
0 ⊕ 0 = 0
1 ⊕ 0 = 1
0 ⊕ 1 = 1
Example Table:
department | permissions (Binary) | permissions (Decimal) |
HR | 1101 | 13 |
HR | 1011 | 11 |
IT | 1111 | 15 |
IT | 0010 | 2 |
Common Use Case:
- Read Permission = 1 (0001 in binary)
- Write Permission = 2 (0010 in binary)
- Execute Permission = 4 (0100 in binary)
- Admin Permission = 8 (1000 in binary)
3. Step-by-Step Calculation
For HR Department:
1101 (13)
⊕ 1011 (11)
------------
0110 (6) --> BIT_XOR(permissions) = 6
For IT Department:
1111 (15)
⊕ 0010 (2)
------------
1101 (13) --> BIT_XOR(permissions) = 13
Final Result:
department | BIT_XOR(permissions) |
HR | 6 |
IT | 13 |
Interpretation:
- HR department's XOR'd permissions result in 6.
- IT department's XOR'd permissions result in 13.
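A self-contained sketch that reproduces the walkthrough above with an inline VALUES table (hypothetical data, so no real users table is required):
SELECT department, BIT_XOR(permissions) AS xor_permissions
FROM VALUES ('HR', 13), ('HR', 11), ('IT', 15), ('IT', 2) AS users(department, permissions)
GROUP BY department;
-- Expected result: HR -> 6, IT -> 13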
4. BIT_XOR() vs. Other Bitwise Aggregations
Function | Description |
BIT_XOR() | Bitwise XOR of all values |
BIT_AND() | Bitwise AND of all values |
BIT_OR() | Bitwise OR of all values |
5. BIT_XOR() with NULL Values
- If permissions contains NULL, the NULL is ignored.
- If all values in a group are NULL, BIT_XOR() returns NULL.
Example:
SELECT department, BIT_XOR(permissions) FROM users GROUP BY department;
department | permissions |
HR | 13 |
HR | NULL |
IT | 15 |
- The HR department ignores the NULL and computes BIT_XOR(13) = 13.
- If all values were NULL, the result would be NULL.
6. BIT_XOR() with HAVING – Filtering Groups
Find departments where XOR'd permissions exceed 5.
SELECT department, BIT_XOR(permissions) AS xor_permissions
FROM users
GROUP BY department
HAVING BIT_XOR(permissions) > 5;
- Keeps only departments where the computed XOR result is greater than 5.
7. BIT_XOR() in Joins – Ensuring Correct Aggregation
✅ Joins can introduce duplicate values, affecting BIT_XOR().
Incorrect Query (Duplicates Affect XOR)
SELECT d.department_name, BIT_XOR(u.permissions)
FROM departments d
JOIN users u ON d.department_id = u.department_id
GROUP BY d.department_name;
If a user appears multiple times due to the join, their permission values are XOR'd multiple times. Since x ⊕ x = 0, an even number of duplicates cancels that user's permissions out of the result entirely.
Fix: Use DISTINCT to avoid duplicates:
SELECT d.department_name, BIT_XOR(DISTINCT u.permissions) FROM departments d JOIN users u ON d.department_id = u.department_id GROUP BY d.department_name;
8. BIT_XOR() with CASE – Conditional XOR
✅ Find XOR'd permissions only for Managers.
SELECT department,
BIT_XOR(CASE WHEN role = 'Manager' THEN permissions ELSE 0 END) AS manager_xor
FROM users
GROUP BY department;
- Computes XOR only for employees with the role "Manager" (the ELSE 0 branch is a no-op, since XOR with 0 leaves a value unchanged).
9. BIT_XOR() with UNION vs. UNION ALL
✅ UNION removes duplicates, which may affect BIT_XOR() results.
✅ UNION ALL keeps duplicates, potentially changing XOR calculations.
Example:
SELECT department, BIT_XOR(permissions) FROM (
SELECT department, permissions FROM users
UNION
SELECT department, permissions FROM admins
) AS combined_data
GROUP BY department;
- Merges users and admins before computing XOR.
- If using UNION ALL, duplicate values remain and may alter results.
10. BIT_XOR() with Binary Data
✅ XOR operations are commonly used in cryptography, security, and access control.
✅ A real-world use case is storing permission bitmasks (e.g., Read = 1, Write = 2, Execute = 4).
Example:
SELECT department, BIT_XOR(permissions) AS xor_permissions
FROM users
GROUP BY department;
department | XOR Permissions (Binary) | XOR Permissions (Decimal) |
HR | 0110 | 6 |
IT | 1101 | 13 |
✅ This allows us to quickly calculate access control bitmasks for departments.
11. Summary of Tricky SQL Concepts with BIT_XOR()
Scenario | Explanation |
BIT_XOR() computes bitwise XOR | XORs all values in a group |
BIT_XOR() ignores NULL | NULL values are skipped, but all NULLs return NULL |
BIT_XOR() vs. BIT_AND() vs. BIT_OR() | Different bitwise aggregations |
BIT_XOR() with HAVING | Filters groups based on XOR results |
BIT_XOR() in Joins | Duplicate values may alter XOR results |
BIT_XOR() with CASE | Conditional XOR within groups |
BIT_XOR() in UNION vs. UNION ALL | UNION removes duplicates, affecting XOR |
LOG_SUM_EXP() – Logarithmic Summation for Stability
Useful in machine learning applications (avoiding underflow).
Example:
SELECT LOG(SUM(EXP(value))) FROM dataset;
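The naive form above can overflow EXP() when values are large. A numerically stable sketch, assuming a dataset table with a numeric value column, subtracts the maximum before exponentiating and adds it back afterwards:
SELECT m + LOG(SUM(EXP(value - m))) AS log_sum_exp
FROM dataset
CROSS JOIN (SELECT MAX(value) AS m FROM dataset) mx  -- shift by the max so EXP() stays in a safe range
GROUP BY m;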
COUNT_IF() – Conditional Counting
A cleaner way to count values meeting a condition.
Example (alternative using SUM()):
SELECT SUM(CASE WHEN salary > 50000 THEN 1 ELSE 0 END) FROM employees;
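Spark SQL 3.0+ also ships COUNT_IF() as a built-in aggregate, so the same count can be written directly:
SELECT COUNT_IF(salary > 50000) FROM employees;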
APPROX_MEDIAN() – Faster Median Computation for Big Data
A more explicit alternative to PERCENTILE_APPROX(salary, 0.5): passing an accuracy parameter controls the speed/precision trade-off.
Example:
SELECT percentile_approx(salary, 0.5, 10000) FROM employees;
RATIO() – Ratio Calculation Within Groups
Useful in reporting the percentage of a group within a total.
Example:
SELECT department, COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS percentage FROM employees GROUP BY department;
REGR_SLOPE(), REGR_INTERCEPT() – Regression Aggregations
Used for linear regression modeling directly in SQL.
Example:
SELECT REGR_SLOPE(salary, experience), REGR_INTERCEPT(salary, experience) FROM employees;
- Finds the best-fit line between salary and experience.
REGR_R2() – Coefficient of Determination (R² Score)
Measures how well data fits a regression model.
Example:
SELECT REGR_R2(salary, experience) FROM employees;
UNIQUE_COUNT_ESTIMATE() – Faster Approximate Unique Counting
A variant of APPROX_COUNT_DISTINCT().
Example:
SELECT UNIQUE_COUNT_ESTIMATE(user_id) FROM website_visits;
MAD() – Median Absolute Deviation for Outlier Detection
Measures the spread of data around the median.
Example:
SELECT PERCENTILE_APPROX(ABS(e.salary - m.median_salary), 0.5) AS mad FROM employees e CROSS JOIN (SELECT PERCENTILE_APPROX(salary, 0.5) AS median_salary FROM employees) m;
GREATEST() and LEAST() – Row-wise Maximum/Minimum
Finds the highest or lowest value among multiple columns.
Example:
SELECT GREATEST(salary_2022, salary_2023, salary_2024) FROM employees;
SELECT LEAST(salary_2022, salary_2023, salary_2024) FROM employees;
COUNT_MIN_SKETCH() – Frequency Approximation for Streaming Data
Useful for real-time aggregation.
Example:
SELECT COUNT_MIN_SKETCH(salary, 0.01, 0.95, 42) FROM employees;
- The arguments are the relative error (eps), the confidence, and a random seed; the result is a binary sketch usable for approximate frequency and top-K tracking.
QUANTILE() – Exact Percentiles for Small Datasets
Unlike PERCENTILE_APPROX(), this is for exact quantiles on smaller datasets.
Example:
SELECT QUANTILE(salary, 0.75) FROM employees;
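QUANTILE() is not a standard Spark SQL built-in; the exact aggregate Spark provides is PERCENTILE(), which fully sorts the group and is therefore best reserved for small or pre-filtered datasets:
SELECT PERCENTILE(salary, 0.75) FROM employees;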
GROUPING_ID() – Identifying Aggregation Levels in GROUPING SETS()
Helps distinguish which columns are aggregated vs. not aggregated.
Example:
SELECT department, job_role, SUM(salary), GROUPING_ID(department, job_role) FROM employees GROUP BY GROUPING SETS ((department), (job_role), (department, job_role));
INVERSE_PERCENTILE() – Reverse Lookup of Percentiles
Given a percentile, finds the corresponding value.
Example:
SELECT INVERSE_PERCENTILE(salary, 0.9) FROM employees;
EXPONENTIAL_MOVING_AVG() – Exponential Smoothing for Time Series
Used for time-series forecasting.
Example:
SELECT EXPONENTIAL_MOVING_AVG(sales, 0.3) OVER (ORDER BY order_date) FROM sales_data;
DECILE() – Dividing Data into 10 Equal Parts
Similar to quartiles and percentiles.
Example:
SELECT NTILE(10) OVER (ORDER BY salary) FROM employees;
MOVING_AVERAGE() – Rolling Window Aggregation
Calculates a rolling average for time-series data.
Example:
SELECT order_date, AVG(sales) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg FROM sales;
STDDEV_SAMP() and STDDEV_POP() – Sample & Population Standard Deviation
We previously covered STDDEV(), but Spark SQL provides two explicit variants:
- STDDEV_SAMP() – Sample standard deviation.
- STDDEV_POP() – Population standard deviation.
Example:
SELECT STDDEV_SAMP(salary), STDDEV_POP(salary) FROM employees;
VAR_SAMP() and VAR_POP() – Sample & Population Variance
Similar to standard deviation, but returns variance instead.
Example:
SELECT VAR_SAMP(salary), VAR_POP(salary) FROM employees;
BLOOM_FILTER_AGG() – Probabilistic Set Membership
Efficiently tests whether an element is in a dataset (without full scan).
Example:
SELECT BLOOM_FILTER_AGG(user_id, 100000, 0.01) FROM transactions;
TOP_K() – Approximate Most Frequent Elements
Helps in identifying top K elements efficiently.
Example:
SELECT TOP_K(product_id, 5) FROM sales_data;
EVENT_RATE_ESTIMATE() – Approximate Count of Rare Events
Used in fraud detection & anomaly detection.
Example:
SELECT EVENT_RATE_ESTIMATE(transaction_id) FROM fraud_logs;
HASH_AGG() – Hash-based Aggregation for Large Datasets
Performs a hash-based approach for aggregating data faster.
Example:
SELECT HASH_AGG(salary) FROM employees;
STRING_AGG() with ORDER BY – Concatenation in Sorted Order
A sorted version of STRING_AGG().
Example:
SELECT department, STRING_AGG(employee_name, ', ' ORDER BY employee_name) FROM employees GROUP BY department;
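On Spark versions without STRING_AGG(), the same sorted concatenation can be sketched with COLLECT_LIST() and SORT_ARRAY():
SELECT department,
       CONCAT_WS(', ', SORT_ARRAY(COLLECT_LIST(employee_name))) AS employees_sorted  -- alphabetical, comma-separated
FROM employees
GROUP BY department;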
CARDINALITY_ESTIMATE() – Fast Approximate Cardinality
Similar to APPROX_COUNT_DISTINCT(), but optimized for streaming data.
Example:
SELECT CARDINALITY_ESTIMATE(user_id) FROM website_logs;
BIT_COUNT() – Counting Number of 1s in a Bit Representation
Useful in low-level analytics, such as binary data processing.
Example:
SELECT BIT_COUNT(binary_column) FROM bitwise_table;
UNIQUE_PERCENTILE() – Approximate Unique Distribution
Helps in estimating unique distribution across percentile ranges.
Example:
SELECT UNIQUE_PERCENTILE(salary, 0.8) FROM employees;
FREQUENT_ITEMS() – Finding Frequent Elements in Large Datasets
Similar to TOP_K(), but optimized for streaming & big data.
Example:
SELECT FREQUENT_ITEMS(transaction_category, 10) FROM sales_data;
ROLLING_PERCENTILE() – Moving Window Percentile
Time-series percentile calculations.
Example:
SELECT order_date, ROLLING_PERCENTILE(sales, 0.9) OVER (ORDER BY order_date ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) FROM sales;
BIT_SET_AGG() – Bitwise Aggregation Across Rows
Aggregates binary bitmask values.
Example:
SELECT BIT_SET_AGG(permissions) FROM user_roles;
MOST_RECENT() – Last Non-Null Value
Similar to LAST(), but ensures no NULL values.
Example:
SELECT department, MOST_RECENT(salary) FROM employees GROUP BY department;
EARLIEST() – First Non-Null Value
Similar to FIRST(), but excludes NULL values.
Example:
SELECT department, EARLIEST(salary) FROM employees GROUP BY department;
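Neither MOST_RECENT() nor EARLIEST() is a standard Spark SQL built-in; the same behavior is available through the optional ignoreNulls argument of LAST() and FIRST() (note that without an explicit ordering the row chosen within each group is non-deterministic):
SELECT department,
       FIRST(salary, true) AS first_non_null_salary,  -- first non-NULL value encountered
       LAST(salary, true)  AS last_non_null_salary    -- last non-NULL value encountered
FROM employees
GROUP BY department;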
DEVIATION() – Absolute Deviation from the Mean
Computes absolute deviation from mean (useful for outlier detection).
Example:
SELECT ABS(salary - AVG(salary) OVER ()) FROM employees;
PERMUTATION_AGG() – Aggregate Across Permutations
Used in graph algorithms & combinatorial aggregations.
Example:
SELECT PERMUTATION_AGG(node_id) FROM graph_data;
JACCARD_SIMILARITY() – Set-Based Similarity Calculation
Computes similarity between sets using Jaccard Index.
Example:
SELECT JACCARD_SIMILARITY(set1, set2) FROM dataset;
COSINE_SIMILARITY() – Similarity Between Feature Vectors
Machine learning similarity function.
Example:
SELECT COSINE_SIMILARITY(feature_vector_1, feature_vector_2) FROM embeddings;
TOP_N_PER_GROUP() – Top N Elements Within Each Group
Similar to TOP_K(), but per group.
Example:
SELECT * FROM ( SELECT name, department, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rank FROM employees ) WHERE rank <= 3;
L2_NORM() – Euclidean Distance for Feature Engineering
Computes L2 norm (square root of squared sum).
Example:
SELECT SQRT(SUM(POW(feature_value, 2))) FROM feature_dataset;
HAVERSINE_DISTANCE() – Distance Between Two Coordinates
Used in geospatial calculations.
Example:
SELECT HAVERSINE_DISTANCE(lat1, lon1, lat2, lon2) FROM location_data;
Summary Table
Category | Function |
Basic Aggregations | COUNT() , SUM() , AVG() , MIN() , MAX() |
Intermediate Aggregations | VARIANCE() , STDDEV() , RANK() , DENSE_RANK() , ROW_NUMBER() , NTILE() |
Advanced Aggregations | CUBE() , ROLLUP() , GROUPING SETS() , APPROX_COUNT_DISTINCT() , APPROX_PERCENTILE() |
Statistical Functions | CORR() , COVAR_POP() , COVAR_SAMP() |
Array-Based Aggregations | ARRAY_AGG() , COLLECT_SET() , COLLECT_LIST() |
Time-Series Aggregations | DATE_TRUNC() , WINDOW() |
Quantile-Based Aggregations | PERCENT_RANK() , CUME_DIST() , PERCENTILE_APPROX() |