Mastering Transformations in Spark SQL
Table of contents
- Spark SQL Transformations
- 2. Types of Transformations in Spark: Narrow vs. Wide Transformations
- 3. Lazy Execution in PySpark
- 4. Working with Structured Data: Reading and Writing Parquet, ORC, Avro Files in Spark SQL
- 5. Aggregations in Spark SQL
- COUNT()
- SUM()
- 1. What Does SUM(amount) Really Do?
- Example:
- 2. SUM(amount) vs. COALESCE(SUM(amount), 0)
- 3. SUM(DISTINCT amount) – Summing Unique Values
- 4. SUM() with GROUP BY
- 5. SUM() with HAVING Clause
- 6. SUM() in Joins – Unexpected Duplicates
- 7. SUM() in LEFT JOIN – Handling NULLs
- 8. SUM() with CASE – Conditional Summing
- 9. SUM() with FILTER() – Alternative to CASE
- 10. SUM() in Window Functions – Running Total
- 11. SUM() with UNION vs. UNION ALL
- 12. SUM() with Negative Values
- AVG()
- MIN()
- MAX()
- COUNT(DISTINCT column_name)
- GROUP BY with Aggregations
- HAVING with Aggregations
- 1. How HAVING Works in SUM(salary) > 50000
- 2. HAVING vs. WHERE – Key Difference
- 3. HAVING with COUNT() – Filtering Groups by Count
- 4. HAVING with Multiple Conditions
- 5. HAVING with DISTINCT – Counting Unique Values
- 6. HAVING with CASE – Conditional Aggregation
- 7. HAVING with FILTER() – Alternative to CASE
- 8. HAVING with ORDER BY – Sorting Aggregated Data
- 9. HAVING with JOIN – Filtering After Aggregation
- 10. HAVING in Window Functions – Why It Doesn’t Work
- VARIANCE() / VAR_POP()
- STDDEV() / STDDEV_POP()
- FIRST() / LAST()
- COLLECT_SET()
- COLLECT_LIST()
- PERCENTILE() / APPROX_PERCENTILE()
- MODE() (Most Frequent Value - Custom Implementation)
- RANK() – Assigns ranks with gaps.
- DENSE_RANK() – Assigns ranks without gaps.
- ROW_NUMBER() – Assigns unique row numbers.
- NTILE(n) – Divides data into n equal parts.
- CORR(col1, col2) – Correlation between two columns.
- COVAR_POP(col1, col2) – Population covariance.
- COVAR_SAMP(col1, col2) – Sample covariance.
- ARRAY_AGG(column) – Aggregates data into an array.
- EXPLODE(array_column) – Converts array elements into separate rows.
- GROUPING SETS() – Computes multiple groupings in a single query.
- CUBE() – Aggregates across all possible combinations of specified columns.
- ROLLUP() – Computes hierarchical aggregations.
- User-Defined Aggregation Function (UDAF)
- DATE_TRUNC(unit, column) – Aggregation based on time intervals.
- WINDOW() – Used for time-based aggregations in structured streaming.
- PERCENT_RANK() – Computes the relative rank of a row.
- CUME_DIST() – Computes cumulative distribution of values.
- MEDIAN() (Custom Implementation)
- HISTOGRAM() – Creating Binned Distributions
- BIT_OR() / BIT_AND() – Bitwise Aggregations
- STRING_AGG() – Concatenating Strings in Aggregation
- HYPERLOGLOG_PLUS()
- ENTROPY() – Information Theory Aggregation
- MODE() (Built-in in Some Versions)
- GEOMETRIC_MEAN() – Multiplicative Mean
- HARMONIC_MEAN() – Reciprocal Mean Calculation
- MEDIAN() (Alternative Approach)
- PRODUCT() – Multiplication Aggregation
- SKEWNESS() – Skewness of a Distribution
- KURTOSIS() – Peakedness of a Distribution
- RATIO_TO_REPORT() – Proportional Contribution
- LEAD() / LAG() – Comparing Aggregates Over Rows
- PIVOT() – Transforming Rows into Columns
- UNPIVOT() – Transforming Columns into Rows
- ANY_VALUE() – Random Value in Group
- GROUPING() – Detecting Aggregation Levels
- TOP N Records using ROW_NUMBER()
- MODE() (Efficient Alternative Using Window Functions)
- GROUP_CONCAT() – Concatenation of Grouped Values
- BIT_XOR() – Bitwise XOR Aggregation
- Understanding BIT_XOR(permissions) in SQL
- 1. How BIT_XOR() Works
- 2. What is Bitwise XOR (⊕)?
- 3. Step-by-Step Calculation
- 4. BIT_XOR() vs. Other Bitwise Aggregations
- 5. BIT_XOR() with NULL Values
- 6. BIT_XOR() with HAVING – Filtering Groups
- 7. BIT_XOR() in Joins – Ensuring Correct Aggregation
- 8. BIT_XOR() with CASE – Conditional XOR
- 9. BIT_XOR() with UNION vs. UNION ALL
- 10. BIT_XOR() with Binary Data
- LOG_SUM_EXP()
- COUNT_IF()
- APPROX_MEDIAN() – Faster Median Computation for Big Data
- RATIO()
- REGR_SLOPE(), REGR_INTERCEPT()
- REGR_R2()
- UNIQUE_COUNT_ESTIMATE()
- MAD()
- GREATEST() and LEAST()
- COUNT_MIN_SKETCH()
- QUANTILE()
- GROUPING_ID()
- INVERSE_PERCENTILE()
- EXPONENTIAL_MOVING_AVG()
- DECILE()
- MOVING_AVERAGE()
- STDDEV_SAMP() and STDDEV_POP()
- VAR_SAMP() and VAR_POP()
- BLOOM_FILTER_AGG()
- Probabilistic Set Membership
- TOP_K()
- EVENT_RATE_ESTIMATE()
- HASH_AGG()
- STRING_AGG() with ORDER BY (Concatenation in Sorted Order)
- CARDINALITY_ESTIMATE() – Fast Approximate Cardinality
- BIT_COUNT() – Counting Number of 1s in a Bit Representation
- UNIQUE_PERCENTILE()
- FREQUENT_ITEMS()
- BIT_SET_AGG()
- MOST_RECENT()
- EARLIEST()
- DEVIATION()
- PERMUTATION_AGG() – Aggregate Across Permutations
- JACCARD_SIMILARITY() – Set-Based Similarity Calculation
- COSINE_SIMILARITY() – Similarity Between Feature Vectors
- TOP_N_PER_GROUP()
- L2_NORM()
- HAVERSINE_DISTANCE() – Distance Between Two Coordinates
Spark SQL Transformations
1. Spark SQL vs. PySpark DataFrames: Key Differences
| Feature | Spark SQL | PySpark DataFrames |
| --- | --- | --- |
| Interface | SQL-based (SELECT * FROM table) | Python-based (df.select(), df.filter()) |
| Performance | Optimized for complex queries (BI tools, dashboards) | Faster for iterative processing (ML, transformations) |
| Ease of Use | Easier for SQL users | More flexible for developers (Python, Scala) |
| Integration | Works with Hive, JDBC, BI tools | Works with Python ML libraries (scikit-learn, TensorFlow) |
| Best Use Cases | Ad-hoc queries, reporting, BI dashboards | ETL, Data Science, Machine Learning |
Logical Flow Diagram: How Spark SQL and PySpark Work
Spark SQL Workflow
1️⃣ Load data into a table or DataFrame.
2️⃣ Run SQL queries on the table.
3️⃣ Spark converts SQL into an optimized execution plan.
4️⃣ The query executes in a distributed cluster.
Example Workflow:
CREATE TABLE sales (id INT, amount DOUBLE);
SELECT * FROM sales WHERE amount > 1000;
PySpark DataFrame Workflow
1️⃣ Load data into a DataFrame.
2️⃣ Use PySpark functions (select, filter, groupBy).
3️⃣ Spark optimizes execution using the Catalyst engine.
4️⃣ Executes in a distributed cluster.
Example Workflow:
df = spark.read.csv("sales.csv", header=True)
df.filter(df.amount > 1000).show()
Both Spark SQL and PySpark DataFrames use the Catalyst optimizer to improve performance.
When to Use Spark SQL & PySpark Together
Many use cases require both Spark SQL & PySpark DataFrames.
Use Case 1: SQL Query with PySpark
Read a CSV file and filter results using Spark SQL:
df = spark.read.csv("sales.csv", header=True)
df.createOrReplaceTempView("sales_table")
high_sales = spark.sql("SELECT * FROM sales_table WHERE amount > 5000")
high_sales.show()
Uses SQL for filtering, then PySpark DataFrame for further processing.
Use Case 2: Writing PySpark DataFrame Results to a SQL Table
Transform a DataFrame and write results as a table:
df_filtered = df.filter(df.amount > 5000)
df_filtered.write.format("parquet").saveAsTable("high_sales")
Uses PySpark transformations but stores the output as a SQL table.
Use Case 3: Real-Time Streaming with Spark SQL + PySpark
Read real-time sales data from Kafka, process with PySpark, and store in a SQL table.
df_stream = spark.readStream.format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "sales_topic") \
    .load()
# "broker:9092" is a placeholder; replace it with your Kafka brokers.
# Kafka delivers raw bytes, so cast the value and extract the fields the SQL filter needs
# (this assumes the messages are JSON with an "amount" field).
df_sales = df_stream.selectExpr("CAST(value AS STRING) AS value") \
    .selectExpr("CAST(get_json_object(value, '$.amount') AS DOUBLE) AS amount", "value")
df_sales.createOrReplaceTempView("sales_stream")
query = spark.sql("SELECT * FROM sales_stream WHERE amount > 1000")
query.writeStream.format("console").start()
Combines PySpark DataFrames for real-time streaming with SQL queries.
Which One to Use?
Use Case | Use Spark SQL? | Use PySpark DataFrames? |
Data Exploration | ✅ Best choice (simple queries) | ❌ Not needed |
ETL Pipelines | ⚠️ Can be used | ✅ Best choice (Python-based processing) |
BI Dashboards (Power BI, Tableau) | ✅ Best choice | ❌ Not needed |
Machine Learning (MLlib, TensorFlow) | ❌ Not preferred | ✅ Best choice |
Streaming (Kafka, EventHub) | ✅ For SQL queries on stream | ✅ For transformations |
Spark SQL is best for SQL-heavy workflows, BI dashboards, and analytics.
PySpark DataFrames are best for ETL, ML, real-time streaming, and complex transformations.
2. Types of Transformations in Spark: Narrow vs. Wide Transformations
Transformations in Apache Spark are operations applied on DataFrames or RDDs to create a new dataset. They are lazy, meaning they do not execute immediately but build a logical execution plan.
There are two main types of transformations:
✅ Narrow Transformations (map, filter, etc.)
✅ Wide Transformations (groupBy, join, aggregate, etc.)
What are Narrow Transformations?
Narrow transformations are operations where each partition’s output depends only on a single input partition.
These operations do not require data shuffling across nodes, making them faster & more efficient.
Examples of Narrow Transformations:
1️⃣ map() → Applies a function to each row
2️⃣ filter() → Removes unwanted rows
3️⃣ flatMap() → Expands one row into multiple rows
4️⃣ drop() → Removes a column
Example: Narrow Transformations in PySpark
df = spark.createDataFrame([(1, 'John', 5000), (2, 'Alice', 7000)], ["id", "name", "salary"])
# Map transformation: Convert salary to INR
df_mapped = df.withColumn("salary_inr", df.salary * 82)
# Filter transformation: Get employees with salary > 6000
df_filtered = df_mapped.filter(df_mapped.salary > 6000)
df_filtered.show()
No data shuffle happens! The transformations apply within the same partition.
What are Wide Transformations?
Wide transformations are operations where data from multiple partitions is needed to compute the result.
These transformations require data shuffling, making them slower but necessary for aggregations.
Examples of Wide Transformations:
1️⃣ groupBy() → Groups data based on a column
2️⃣ join() → Combines data from two DataFrames
3️⃣ orderBy() → Sorts data across partitions
4️⃣ aggregate() → Computes sum, avg, count across partitions
Example: Wide Transformations in PySpark
df_sales = spark.createDataFrame([
(1, 'John', 'Electronics', 5000),
(2, 'Alice', 'Clothing', 7000),
(3, 'Bob', 'Electronics', 6000)
], ["id", "name", "category", "amount"])
# GroupBy transformation: Total sales per category
df_grouped = df_sales.groupBy("category").sum("amount")
df_grouped.show()
Here, Spark needs to shuffle data across nodes to group by category.
Narrow vs. Wide Transformations: Key Differences
Feature | Narrow Transformations | Wide Transformations |
Definition | Each output partition depends on only one input partition | Each output partition depends on multiple input partitions |
Requires Shuffling? | ❌ No (Faster) | ✅ Yes (Slower) |
Performance | High Performance | Requires network shuffle (Slower) |
Examples | map() , filter() , flatMap() | groupBy() , join() , orderBy() |
Logical Diagram: Narrow vs. Wide Transformations
Narrow Transformation (map())
Input Partition 1 → [Row1, Row2, Row3] --(map)--> Output Partition 1 → [Row1, Row2, Row3]
Each input partition maps directly to one output partition (No Shuffling).
Wide Transformation (groupBy())
Input Partition 1 → [Electronics, Clothing]
Input Partition 2 → [Electronics, Sports]
⬇ (Shuffle Data) ⬇
Output Partition 1 → [Electronics]
Output Partition 2 → [Clothing, Sports]
Data must be shuffled between nodes before aggregation happens.
Summary: Optimizing Transformations in Spark
Optimization Tip | Why? |
Prefer Narrow Transformations (map , filter ) | No shuffle = Faster execution |
Minimize Wide Transformations (join , groupBy ) | Reduces network overhead |
Use broadcast() for small DataFrames before joins | Avoids large shuffle operations |
Use repartition() wisely | Helps optimize partition size before wide transformations |
✅ Use Narrow Transformations when possible for speed.
✅ Use Wide Transformations only when needed for aggregations.
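The last two optimization tips can be sketched in PySpark as follows. This is a minimal, illustrative example; the DataFrames and column names are made up for demonstration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("ShuffleOptimization").getOrCreate()
orders = spark.createDataFrame(
    [(1, "C1", 500), (2, "C2", 700), (3, "C1", 200)], ["order_id", "customer_id", "amount"])
customers = spark.createDataFrame([("C1", "John"), ("C2", "Alice")], ["customer_id", "name"])

# Broadcast the small DataFrame so the join does not shuffle the large side
joined = orders.join(broadcast(customers), "customer_id")

# Repartition by the grouping key before a wide transformation to control the shuffle
totals = joined.repartition("customer_id").groupBy("customer_id").sum("amount")
totals.show()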
3. Lazy Execution in PySpark
Creating a DataFrame and Applying Transformations (Lazy Execution)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LazyExecutionExample").getOrCreate()
# Create a DataFrame
data = [(1, "Alice", 5000), (2, "Bob", 7000), (3, "Charlie", 8000)]
columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)
# Apply Transformations (Nothing Executes Yet!)
df_filtered = df.filter(df.salary > 6000)
df_mapped = df_filtered.withColumn("bonus", df_filtered.salary * 0.10)
# Still Lazy Execution - Spark has NOT run anything yet
At this point, Spark has only recorded the transformations but has NOT executed them.
When Does Execution Actually Happen?
Triggering an Action
df_mapped.show() # Action (Executes the Query)
Now Spark executes all transformations in one go!
Other Actions That Trigger Execution
Action | Purpose |
.show() | Displays the DataFrame |
.collect() | Brings data to the driver |
.count() | Counts rows |
.write.format("parquet").save("path") | Writes data to storage |
Transformations execute only when an action is called.
Logical Diagram: Lazy Execution Workflow
1️⃣ Transformation 1: Apply filter() → (Queued in Logical Plan)
2️⃣ Transformation 2: Apply map() → (Queued in Logical Plan)
3️⃣ Action: Call .show() → (Executes everything together)
Spark optimizes execution to run all transformations in a single stage.
Lazy Execution in Spark
Feature | Lazy Execution |
Execution Timing | Runs only when an action is triggered |
Performance | Optimized execution by combining transformations |
Fault Tolerance | Keeps a logical plan, allowing recovery from failure |
Example Transformations | map() , filter() , withColumn() (Lazy) |
Example Actions | .show() , .count() , .collect() (Triggers Execution) |
Lazy Execution makes Spark faster & more efficient!
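One way to see this behaviour is to print the query plan before triggering an action. A small sketch continuing the df_mapped example above:
# Nothing has executed yet; explain() only prints the plan Spark has recorded
df_mapped.explain(True)   # parsed, analyzed, optimized, and physical plans

# Execution happens only when an action such as show() is called
df_mapped.show()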
4. Working with Structured Data: Reading and Writing Parquet, ORC, Avro Files in Spark SQL
Structured data is commonly stored in optimized file formats like Parquet, ORC, and Avro, which help improve query performance and storage efficiency in big data applications.
This guide covers:
✅ What are Parquet, ORC, and Avro?
✅ How to read & write these formats in Spark SQL and PySpark?
✅ Performance considerations & best practices
4.1 Understanding Parquet, ORC, and Avro
Format | Best For | Advantages | Compression |
Parquet | Analytics & Queries | Columnar storage, schema evolution | High (Snappy, Gzip) |
ORC | Hive & Spark SQL | Best for Hive-based queries, column pruning | High (Zlib, Snappy) |
Avro | Data Serialization & Schema Evolution | Row-based, good for streaming | Medium (Deflate, Snappy) |
Parquet and ORC → Best for columnar analytics
Avro → Best for row-based serialization & schema evolution
Feature | Parquet & ORC (Columnar) | Avro (Row-Based) |
Best For | Large-scale analytics (OLAP) | Streaming & Transactions (OLTP) |
Storage Layout | Column-wise (stores individual columns separately) | Row-wise (stores full rows together) |
Speed | Faster for queries on specific columns | Faster for inserting & updating rows |
Use Case | Data lakes, analytics platforms (BigQuery, Redshift) | Kafka streaming, log storage, messaging systems |
4.2 Reading & Writing Parquet Files in Spark SQL
Parquet is a columnar format designed for fast querying and compression.
A) Reading Parquet Files
Read a single Parquet file
df = spark.read.parquet("s3://data/sales.parquet")
df.show()
Read multiple Parquet files
df = spark.read.parquet("s3://data/parquet_files/*")
Using Spark SQL to Query Parquet Files
SELECT * FROM parquet.`s3://data/sales.parquet` WHERE amount > 5000;
B) Writing Data to Parquet
Save DataFrame as a Parquet File
df.write.parquet("s3://data/output/sales_parquet")
Save as Table (Partitioned)
df.write.partitionBy("year").parquet("s3://data/output/sales_partitioned")
Best Practice: Use partitioning (and, on Delta Lake tables, Z-Ordering) for faster query execution.
C.) Reading & Writing ORC Files in Spark SQL
ORC (Optimized Row Columnar) is optimized for Hive and often achieves compression comparable to or better than Parquet.
Reading ORC Files
df = spark.read.orc("s3://data/sales.orc")
df.show()
Using Spark SQL to Query ORC Files
SELECT * FROM orc.`s3://data/sales.orc` WHERE category = 'Electronics';
Writing Data to ORC
Save DataFrame as ORC File
df.write.orc("s3://data/output/sales_orc")
Save with Compression
df.write.option("compression", "zlib").orc("s3://data/output/sales_compressed_orc")
ORC is ideal for Hive-based querying and columnar storage.
D.) Reading & Writing Avro Files in Spark SQL
Avro is row-based and best for schema evolution and data serialization.
Reading Avro Files
df = spark.read.format("avro").load("s3://data/sales.avro")
df.show()
Using Spark SQL to Query Avro Files
SELECT * FROM avro.`s3://data/sales.avro` WHERE customer_id = 1234;
Writing Data to Avro
Save DataFrame as Avro File
df.write.format("avro").save("s3://data/output/sales_avro")
Save with Schema Evolution Support
df.write.option("avroSchema", "path/to/schema.avsc").format("avro").save("s3://data/output/sales_avro_v2")
Avro is best for schema evolution and real-time processing.
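Note: Avro support ships as a separate module (spark-avro), so it must be on the classpath. A minimal sketch of one way to do this from PySpark; the artifact version shown is an example and should match your Spark and Scala versions.
from pyspark.sql import SparkSession

# Pull in the external spark-avro package (adjust the version to your cluster)
spark = (SparkSession.builder
         .appName("AvroExample")
         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.5.0")
         .getOrCreate())

df = spark.read.format("avro").load("s3://data/sales.avro")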
Performance Considerations: Choosing the Right Format
Use Case | Best Format |
Big Data Analytics (Columnar Queries) | Parquet, ORC |
Streaming & Schema Evolution | Avro |
Hive-Based Processing | ORC |
Data Lake Storage | Parquet |
✔ Use Parquet for analytics (better query speed & compression).
✔ Use ORC for Hive-based workloads (optimized columnar storage).
✔ Use Avro for schema evolution (best for streaming & Kafka integration).
Summary: How to Work with Parquet, ORC, Avro in Spark SQL
Operation | Parquet | ORC | Avro |
Read File | spark.read.parquet("path") | spark.read.orc("path") | spark.read.format("avro").load("path") |
Write File | df.write.parquet("path") | df.write.orc("path") | df.write.format("avro").save("path") |
Best Use Case | Data Lakes, Analytics | Hive, Columnar Processing | Streaming, Schema Evolution |
5. Aggregations in Spark SQL
COUNT()
Returns the number of rows in a dataset.
Example:
SELECT COUNT(*) FROM sales;
Tricky SQL Concepts with COUNT(*)
| Concept | Explanation |
| --- | --- |
| COUNT(*) vs COUNT(column) | COUNT(*) includes NULLs, COUNT(column) ignores NULLs |
| COUNT(DISTINCT column) | Counts only unique non-null values |
| COUNT(1) vs. COUNT(*) | No difference in modern databases |
| HAVING COUNT(*) | Filters groups based on count conditions |
| COUNT(*) in JOINs | Joins may create duplicate rows, inflating the count |
| COUNT(column_name) in LEFT JOIN | NULLs are ignored, leading to possible 0 counts |
| COUNT(*) OVER() | Running total count using window functions |
| COUNT(*) in UNION vs UNION ALL | UNION removes duplicates, UNION ALL keeps them |
| COUNT(*) FILTER (WHERE condition) | Conditional counting within aggregations |
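To make the first rows of the table concrete, here is a small PySpark sketch on illustrative data (the view name t and the category column are made up for the example):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CountVariants").getOrCreate()
spark.createDataFrame([(1, "A"), (2, None), (3, "A")], ["id", "category"]) \
    .createOrReplaceTempView("t")

spark.sql("""
    SELECT COUNT(*)                 AS count_all,      -- 3: every row, NULLs included
           COUNT(category)          AS count_col,      -- 2: the NULL category is ignored
           COUNT(DISTINCT category) AS count_distinct  -- 1: only the unique non-null value 'A'
    FROM t
""").show()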
SUM()
Computes the sum of a numeric column.
Examples:
SELECT SUM(amount) FROM sales;
More tricky scenarios:
1. What Does SUM(amount) Really Do?
- Adds up all values in the amount column.
- Ignores NULL values automatically.
- Returns NULL if no rows exist or all values are NULL.
Example:
SELECT SUM(amount) FROM sales;
sale_id | amount |
1 | 100 |
2 | 200 |
3 | NULL |
4 | 300 |
✅ SUM(amount) = 100 + 200 + 300 = 600 (NULL is ignored).
✅ If all amount values were NULL, SUM(amount) would return NULL.
2. SUM(amount) vs. COALESCE(SUM(amount), 0)
✅ If SUM(amount) returns NULL, it may cause issues in calculations.
✅ Using COALESCE() ensures it returns 0 instead of NULL.
Example:
SELECT COALESCE(SUM(amount), 0) AS total_amount FROM sales;
- If all rows contain NULL values, it will return 0 instead of NULL.
3. SUM(DISTINCT amount) – Summing Unique Values
✅ Using DISTINCT ensures duplicate values are summed only once.
Example:
SELECT SUM(DISTINCT amount) FROM sales;
sale_id | amount |
1 | 100 |
2 | 200 |
3 | 100 |
4 | 300 |
✅ SUM(amount) = 100 + 200 + 100 + 300 = 700
✅ SUM(DISTINCT amount) = 100 + 200 + 300 = 600 (duplicates ignored).
4. SUM() with GROUP BY
✅ Grouping results before summing them can produce unexpected results.
Example:
SELECT category, SUM(amount) FROM sales GROUP BY category;
category | amount |
Electronics | 200 |
Clothing | 150 |
Electronics | 300 |
✅ GROUP BY groups data before summing:
category | SUM(amount) |
Electronics | 200 + 300 = 500 |
Clothing | 150 |
5. SUM() with HAVING Clause
✅ Used to filter aggregated results.
Example:
SELECT category, SUM(amount) AS total_sales
FROM sales
GROUP BY category
HAVING SUM(amount) > 500;
- Filters only categories where total sales exceed 500.
6. SUM() in Joins – Unexpected Duplicates
✅ If a JOIN causes duplicate rows, SUM() may return inflated values.
Example:
SELECT c.customer_id, SUM(s.amount)
FROM customers c
JOIN sales s ON c.customer_id = s.customer_id
GROUP BY c.customer_id;
- If a customer has multiple rows due to the JOIN, SUM() may double-count sales.
✅ Fix: Use SUM(DISTINCT s.amount) if needed.
7. SUM() in LEFT JOIN – Handling NULLs
✅ If there are NULL values due to LEFT JOIN, SUM() may return NULL.
Example:
SELECT c.customer_id, SUM(s.amount)
FROM customers c
LEFT JOIN sales s ON c.customer_id = s.customer_id
GROUP BY c.customer_id;
customer_id | amount |
1 | 500 |
2 | NULL |
✅ SUM(amount) ignores NULLs, but customer 2 gets NULL.
✅ Fix: Use COALESCE(SUM(amount), 0) AS total_amount to return 0 instead of NULL.
8. SUM() with CASE – Conditional Summing
✅ Summing only specific values within a column.
Example:
SELECT
SUM(CASE WHEN category = 'Electronics' THEN amount ELSE 0 END) AS electronics_sales,
SUM(CASE WHEN category = 'Clothing' THEN amount ELSE 0 END) AS clothing_sales
FROM sales;
- Computes total sales separately for each category.
9. SUM() with FILTER() – Alternative to CASE
✅ A cleaner way to apply conditional aggregation.
Example:
SELECT
SUM(amount) FILTER (WHERE category = 'Electronics') AS electronics_sales,
SUM(amount) FILTER (WHERE category = 'Clothing') AS clothing_sales
FROM sales;
- Same as the CASE method but more concise.
10. SUM() in Window Functions – Running Total
✅ Using OVER() for cumulative summing.
Example:
SELECT customer_id, order_date,
SUM(amount) OVER(PARTITION BY customer_id ORDER BY order_date) AS running_total
FROM sales;
customer_id | order_date | amount | running_total |
1 | 2024-01-01 | 100 | 100 |
1 | 2024-01-02 | 200 | 300 |
1 | 2024-01-03 | 150 | 450 |
✅ Computes a running total of sales per customer.
11. SUM() with UNION vs. UNION ALL
✅ UNION removes duplicates, affecting SUM().
✅ UNION ALL keeps duplicates, increasing SUM().
Example:
SELECT SUM(amount) FROM (
SELECT amount FROM sales
UNION
SELECT amount FROM refunds
) AS combined_data;
- Removes duplicate amounts before summing.
- If we use UNION ALL, duplicates remain.
12. SUM() with Negative Values
✅ Handles both positive and negative numbers.
Example:
SELECT SUM(amount) FROM transactions;
transaction_id | amount |
1 | 500 |
2 | -200 |
3 | 300 |
✅ SUM(amount) = 500 + (-200) + 300 = 600
✅ Negative values affect totals, useful in accounting and refunds.
Tricky SQL Concepts with SUM(amount)
Concept | Explanation |
SUM(amount) ignores NULLs | NULL values are excluded from summation |
SUM(DISTINCT amount) | Summing only unique values |
SUM() with GROUP BY | Computes totals for each group |
SUM() with HAVING | Filters aggregated results |
SUM() in JOIN issues | Joins may duplicate rows, inflating totals |
SUM() in LEFT JOIN | NULLs may lead to unexpected results |
SUM() with CASE | Conditional summing |
SUM() with FILTER() | A cleaner conditional sum approach |
SUM() with OVER() | Running total calculation |
SUM() in UNION vs. UNION ALL | UNION removes duplicates, affecting sum |
SUM() with Negative Values | Important for transactions, refunds |
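The same patterns can be expressed with the DataFrame API. A minimal sketch (assuming an existing SparkSession named spark and the sales columns used above) showing a conditional sum and a NULL-safe total:
from pyspark.sql import functions as F

df_sales = spark.createDataFrame(
    [("Electronics", 200), ("Clothing", 150), ("Electronics", None)], ["category", "amount"])

df_sales.agg(
    # Equivalent to SUM(CASE WHEN category = 'Electronics' THEN amount ELSE 0 END)
    F.sum(F.when(F.col("category") == "Electronics", F.col("amount")).otherwise(0))
        .alias("electronics_sales"),
    # Equivalent to COALESCE(SUM(amount), 0)
    F.coalesce(F.sum("amount"), F.lit(0)).alias("total_sales")
).show()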
AVG()
Returns the average (mean) value of a numeric column.
Example:
SELECT AVG(price) FROM products;
Tricky SQL Concepts with AVG(price)
| Concept | Explanation |
| --- | --- |
| AVG(price) ignores NULLs | NULL values are excluded from the calculation |
| AVG(DISTINCT price) | Averages only unique values |
| Integer Division Issue | Convert to FLOAT or DECIMAL for correct results |
| AVG() with GROUP BY | Computes averages for each group |
| AVG() with HAVING | Filters groups based on the average |
| AVG() in Joins | Joins may create duplicate rows, affecting averages |
| AVG() in LEFT JOIN | NULLs may lead to unexpected results |
| AVG() with CASE | Conditional averaging |
| AVG() with FILTER() | A cleaner conditional average approach |
| AVG() with OVER() | Moving average calculation |
| AVG() in UNION vs. UNION ALL | UNION removes duplicates, affecting averages |
| AVG() with Negative Values | Important for transactions, refunds |
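A short DataFrame sketch of the first row of the table: AVG() divides by the count of non-null values only, which differs from treating NULLs as zero (illustrative data, existing SparkSession named spark assumed):
from pyspark.sql import functions as F

prices = spark.createDataFrame([(100.0,), (200.0,), (None,)], ["price"])
prices.agg(
    F.avg("price").alias("avg_ignoring_nulls"),                               # (100 + 200) / 2 = 150.0
    F.avg(F.coalesce(F.col("price"), F.lit(0.0))).alias("avg_nulls_as_zero")  # 300 / 3 = 100.0
).show()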
MIN()
Finds the minimum value in a column.
Example:
SELECT MIN(salary) FROM employees;
MAX()
Finds the maximum value in a column.
Example:
SELECT MAX(salary) FROM employees;
COUNT(DISTINCT column_name)
Returns the count of distinct values in a column.
Example:
SELECT COUNT(DISTINCT category) FROM products;
GROUP BY
with Aggregations
Used to apply aggregate functions on grouped data.
Example:
SELECT department, AVG(salary) FROM employees GROUP BY department;
AVG(salary) GROUP BY department
| Concept | Explanation |
| --- | --- |
| AVG(salary) ignores NULLs | NULL values are excluded from the calculation |
| AVG(DISTINCT salary) | Averages only unique values |
| Integer Division Issue | Convert to FLOAT or DECIMAL for correct results |
| AVG() with GROUP BY | Computes averages for each department |
| AVG() with HAVING | Filters groups based on the average |
| AVG() in Joins | Joins may create duplicate rows, affecting averages |
| AVG() in LEFT JOIN | NULLs may lead to unexpected results |
| AVG() with CASE | Conditional averaging |
| AVG() with FILTER() | A cleaner conditional average approach |
| AVG() with OVER() | Finds average salary per department without GROUP BY |
| AVG() in UNION vs. UNION ALL | UNION removes duplicates, affecting averages |
| AVG() with Negative Values | Important for finance, refunds |
HAVING
with Aggregations
Example:
SELECT department, SUM(salary) FROM employees GROUP BY department HAVING SUM(salary) > 50000;
1. How HAVING Works in SUM(salary) > 50000
SELECT department, SUM(salary) FROM employees GROUP BY department HAVING SUM(salary) > 50000;
- Groups data by department.
- Computes SUM(salary) for each department.
- Filters departments where the total salary exceeds 50000.
Example Data:
department | salary |
HR | 30000 |
HR | 25000 |
IT | 40000 |
IT | 10000 |
Finance | 20000 |
Query Execution:
department | SUM(salary) |
HR | 30000 + 25000 = 55000 ✅ (kept) |
IT | 40000 + 10000 = 50000 ❌ (removed) |
Finance | 20000 ❌ (removed) |
2. HAVING vs. WHERE – Key Difference
✅ WHERE filters rows before aggregation.
✅ HAVING filters after aggregation.
Incorrect Use of WHERE (Causes Error)
SELECT department, SUM(salary)
FROM employees
WHERE SUM(salary) > 50000 -- ❌ INVALID
GROUP BY department;
- Error: SUM(salary) is an aggregate function, and WHERE cannot use aggregate functions.
Correct Use with HAVING
SELECT department, SUM(salary)
FROM employees
GROUP BY department
HAVING SUM(salary) > 50000;
3. HAVING with COUNT() – Filtering Groups by Count
✅ Find departments with more than 2 employees.
SELECT department, COUNT(*)
FROM employees
GROUP BY department
HAVING COUNT(*) > 2;
Counts employees per department.
Filters only departments with more than 2 employees.
4. HAVING with Multiple Conditions
✅ Find departments where total salary is over 50000 AND average salary is above 25000.
SELECT department, SUM(salary), AVG(salary)
FROM employees
GROUP BY department
HAVING SUM(salary) > 50000 AND AVG(salary) > 25000;
- Filters on both SUM(salary) and AVG(salary).
5. HAVING with DISTINCT – Counting Unique Values
✅ Find departments where unique salaries exceed 2.
SELECT department, COUNT(DISTINCT salary)
FROM employees
GROUP BY department
HAVING COUNT(DISTINCT salary) > 2;
Counts unique salary values per department.
Filters departments with more than 2 unique salary values.
6. HAVING with CASE – Conditional Aggregation
Find departments where Managers earn over 40000 in total.
SELECT department, SUM(CASE WHEN job_title = 'Manager' THEN salary ELSE 0 END) AS manager_salary
FROM employees
GROUP BY department
HAVING SUM(CASE WHEN job_title = 'Manager' THEN salary ELSE 0 END) > 40000;
- Filters only departments where total salary for Managers exceeds 40000.
7. HAVING with FILTER() – Alternative to CASE
Find departments where total Manager salary exceeds 40000 (simplified).
SELECT department, SUM(salary) FILTER (WHERE job_title = 'Manager') AS manager_salary
FROM employees
GROUP BY department
HAVING SUM(salary) FILTER (WHERE job_title = 'Manager') > 40000;
- Same as CASE but more readable.
8. HAVING with ORDER BY – Sorting Aggregated Data
✅ Find departments with total salary > 50000 and sort by total salary.
SELECT department, SUM(salary) AS total_salary
FROM employees
GROUP BY department
HAVING SUM(salary) > 50000
ORDER BY total_salary DESC;
- Sorts only departments that meet the HAVING condition.
9. HAVING with JOIN – Filtering After Aggregation
✅ Find customers who spent over $1000 across all orders.
SELECT c.customer_id, SUM(o.amount) AS total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id
HAVING SUM(o.amount) > 1000;
- Filters only customers who spent more than $1000.
10. HAVING in Window Functions – Why It Doesn’t Work
✅ HAVING cannot be used with OVER() window functions.
SELECT employee_id, department, salary,
SUM(salary) OVER(PARTITION BY department) AS dept_salary
FROM employees
HAVING dept_salary > 50000; -- ❌ INVALID
- Error: HAVING works on grouped data, not windowed data.
- Fix: Use a subquery or WHERE instead.
Correct Approach:
SELECT * FROM (
SELECT employee_id, department, salary,
SUM(salary) OVER(PARTITION BY department) AS dept_salary
FROM employees
) subquery
WHERE dept_salary > 50000;
- Filters after window function execution.
Summary of Tricky HAVING Scenarios
Scenario | Explanation |
HAVING vs. WHERE | WHERE filters before aggregation, HAVING filters after |
HAVING with COUNT() | Filters groups based on row counts |
HAVING with multiple conditions | Combines filters on multiple aggregate functions |
HAVING with DISTINCT | Counts only unique values in groups |
HAVING with CASE | Filters based on conditional aggregation |
HAVING with FILTER() | A cleaner alternative to CASE |
HAVING with ORDER BY | Sorts aggregated results after filtering |
HAVING in Joins | Filters after aggregation in JOIN queries |
HAVING with Window Functions | Does not work, use subqueries instead |
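In the DataFrame API the HAVING pattern is simply a filter applied after the aggregation. A minimal sketch on the sample data above (existing SparkSession named spark assumed):
from pyspark.sql import functions as F

employees = spark.createDataFrame(
    [("HR", 30000), ("HR", 25000), ("IT", 40000), ("IT", 10000), ("Finance", 20000)],
    ["department", "salary"])

(employees
    .groupBy("department")
    .agg(F.sum("salary").alias("total_salary"))
    .filter(F.col("total_salary") > 50000)   # plays the role of HAVING SUM(salary) > 50000
    .show())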
VARIANCE()
/ VAR_POP()
Computes the variance of a column.
Example:
SELECT VARIANCE(salary) FROM employees;
Alternative Functions in Spark SQL
- STDDEV(salary) – Returns the standard deviation (square root of variance).
- VAR_SAMP(salary) – Computes sample variance instead of population variance.
- VAR_POP(salary) – Computes population variance.
Tricky SQL Concepts with VARIANCE(salary)
Scenario | Explanation |
VARIANCE() ignores NULLs | NULL values are excluded from calculation |
VAR_POP() vs. VAR_SAMP() | Population variance divides by n , sample variance by n - 1 |
Integer Division Issue | Convert to FLOAT or DECIMAL for correct results |
VARIANCE() with GROUP BY | Computes variance per department |
VARIANCE() with HAVING | Filters groups based on variance |
VARIANCE() in Joins | Duplicates may inflate variance, use DISTINCT |
VARIANCE() with CASE | Conditional variance calculation |
VARIANCE() with FILTER() | A cleaner conditional variance approach |
VARIANCE() with OVER() | Running variance calculation |
VARIANCE() in UNION vs. UNION ALL | UNION removes duplicates, affecting variance |
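A quick sketch contrasting the two variants on a tiny illustrative dataset (existing SparkSession named spark assumed); in Spark SQL, VARIANCE() corresponds to the sample variance (VAR_SAMP):
from pyspark.sql import functions as F

salaries = spark.createDataFrame([(1000.0,), (2000.0,), (3000.0,)], ["salary"])
salaries.agg(
    F.var_samp("salary").alias("var_samp"),  # divides by n - 1 -> 1000000.0
    F.var_pop("salary").alias("var_pop")     # divides by n     -> ~666666.67
).show()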
STDDEV()
/ STDDEV_POP()
Computes the standard deviation.
Example:
SELECT STDDEV(salary) FROM employees;
FIRST()
/ LAST()
Retrieves the first or last value in a group.
Example:
SELECT department, FIRST(employee_name) FROM employees GROUP BY department;
- Groups data by department.
- Retrieves the first employee_name in each department.
- The "first" value depends on how the database processes grouped data.
Example Data:
| employee_id | department | employee_name |
| --- | --- | --- |
| 1 | HR | Alice |
| 2 | HR | Bob |
| 3 | IT | Charlie |
| 4 | IT | David |
Expected Result:
| department | FIRST(employee_name) |
| --- | --- |
| HR | Alice |
| IT | Charlie |
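Because the result of FIRST() depends on row order, a window with an explicit ordering gives a deterministic "first employee per department". A sketch that assumes employee_id defines that order (existing SparkSession named spark assumed):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

employees = spark.createDataFrame(
    [(1, "HR", "Alice"), (2, "HR", "Bob"), (3, "IT", "Charlie"), (4, "IT", "David")],
    ["employee_id", "department", "employee_name"])

w = Window.partitionBy("department").orderBy("employee_id")
(employees
    .withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)          # deterministic "first" row per department
    .select("department", "employee_name")
    .show())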
COLLECT_SET()
Returns a set of unique values as an array.
SELECT department, COLLECT_SET(employee_name) FROM employees GROUP BY department;
Purpose
- COLLECT_SET(employee_name) returns a unique list (set) of employee names within each department.
- It removes duplicates and returns an array of unique values.
How It Works
GROUP BY department → Groups employees by department.
COLLECT_SET(employee_name) → Collects unique employee names in each department into an array.
Example Dataset: employees
Table
employee_name | department |
Alice | IT |
Bob | HR |
Charlie | IT |
Alice | IT |
David | Finance |
Emma | HR |
Query Output
department | COLLECT_SET(employee_name) |
IT | ["Alice", "Charlie"] |
HR | ["Bob", "Emma"] |
Finance | ["David"] |
- Duplicates are removed: "Alice" appears twice in IT but is stored only once.
Alternative Function
- COLLECT_LIST(employee_name) → Collects all employee names as a list including duplicates.
COLLECT_LIST()
Returns all values as an array (including duplicates).
Example:
SELECT department, COLLECT_LIST(employee_name) FROM employees GROUP BY department;
Purpose
- COLLECT_LIST(employee_name) aggregates all employee names into a list for each department.
- Unlike COLLECT_SET(), it does not remove duplicates and preserves the order of occurrence.
How It Works
GROUP BY department → Groups employees by department.
COLLECT_LIST(employee_name) → Collects all employee names into an array with duplicates and order preserved.
Example Dataset: employees
Table
employee_name | department |
Alice | IT |
Bob | HR |
Charlie | IT |
Alice | IT |
David | Finance |
Emma | HR |
Query Output
department | COLLECT_LIST(employee_name) |
IT | ["Alice", "Charlie", "Alice"] |
HR | ["Bob", "Emma"] |
Finance | ["David"] |
- Duplicates are included: "Alice" appears twice in IT and is not removed.
Difference Between COLLECT_SET()
and COLLECT_LIST()
Function | Removes Duplicates | Maintains Order |
COLLECT_SET() | Yes | No |
COLLECT_LIST() | No | Yes |
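The same comparison in the DataFrame API, on the dataset above (a minimal sketch, existing SparkSession named spark assumed):
from pyspark.sql import functions as F

employees = spark.createDataFrame(
    [("Alice", "IT"), ("Bob", "HR"), ("Charlie", "IT"), ("Alice", "IT"),
     ("David", "Finance"), ("Emma", "HR")],
    ["employee_name", "department"])

employees.groupBy("department").agg(
    F.collect_set("employee_name").alias("unique_names"),  # duplicates removed
    F.collect_list("employee_name").alias("all_names")     # duplicates kept
).show(truncate=False)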
PERCENTILE()
/ APPROX_PERCENTILE()
Computes percentile values.
Example:
SELECT APPROX_PERCENTILE(salary, 0.5) FROM employees;
Purpose
- The function APPROX_PERCENTILE(column, percentile) computes the approximate percentile value for a given column.
- 0.5 represents the 50th percentile (median salary).
- It is faster than exact percentile calculations and is useful for large datasets.
How It Works
- APPROX_PERCENTILE(salary, 0.5) estimates the median salary (50th percentile).
- It uses an approximate algorithm instead of a full sort, making it efficient for big data.
Alternative Percentile Values:
- 0.25 → 25th percentile (lower quartile).
- 0.75 → 75th percentile (upper quartile).
- [0.25, 0.5, 0.75] → Returns multiple percentiles.
Example Dataset
Employee | Salary |
A | 3000 |
B | 5000 |
C | 6000 |
D | 7000 |
E | 10000 |
Sorted Salaries:
[3000, 5000, 6000, 7000, 10000]
50th Percentile (Median Salary):
6000
Query Output:
6000
Alternative Function
PERCENTILE_APPROX(salary, 0.5, 1000)
- More accurate percentile approximation.
- The third argument (1000) is the accuracy parameter (a larger value is more accurate but slower).
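The DataFrame API exposes the same idea through approxQuantile(); its third argument is a relative error, where 0 means exact but slower. A small sketch on the dataset above (existing SparkSession named spark assumed):
employees = spark.createDataFrame(
    [(3000,), (5000,), (6000,), (7000,), (10000,)], ["salary"])

# [25th, 50th, 75th] percentiles with 1% relative error
quartiles = employees.approxQuantile("salary", [0.25, 0.5, 0.75], 0.01)
print(quartiles)   # e.g. [5000.0, 6000.0, 7000.0]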
MODE() (Most Frequent Value - Custom Implementation)
Spark SQL doesn’t have a direct MODE() function in older versions, but it can be calculated using:
SELECT value, COUNT(*) AS frequency
FROM dataset
GROUP BY value
ORDER BY frequency DESC
LIMIT 1;
RANK() – Assigns ranks with gaps.
SELECT name, salary, RANK() OVER (ORDER BY salary DESC) FROM employees;
DENSE_RANK() – Assigns ranks without gaps.
SELECT name, salary, DENSE_RANK() OVER (ORDER BY salary DESC) FROM employees;
ROW_NUMBER() – Assigns unique row numbers.
SELECT name, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) FROM employees;
NTILE(n) – Divides data into n equal parts.
SELECT name, salary, NTILE(4) OVER (ORDER BY salary DESC) FROM employees;
Statistical Aggregations
CORR(col1, col2)
– Correlation between two columns.
SELECT CORR(salary, age) FROM employees;
Purpose
- The function CORR(column1, column2) calculates the Pearson correlation coefficient between two numerical columns.
- It measures the linear relationship between salary and age.
- The result is a value between -1 and 1:
  - 1 → Perfect positive correlation (as age increases, salary increases).
  - -1 → Perfect negative correlation (as age increases, salary decreases).
  - 0 → No correlation (no relationship between age and salary).
COVAR_POP(col1, col2)
– Population covariance.
SELECT COVAR_POP(salary, age) FROM employees;
Purpose
- COVAR_POP(column1, column2) calculates the population covariance between salary and age.
- Covariance measures the direction of the relationship between two numeric columns but not the strength (unlike correlation).
- The result can be:
  - Positive → When age increases, salary tends to increase.
  - Negative → When age increases, salary tends to decrease.
  - Zero or close to zero → No significant relationship.
Difference Between COVAR_POP()
and COVAR_SAMP()
Function | Formula | Use Case |
COVAR_POP(x, y) | Uses population formula (N in denominator) | When working with full data (entire population) |
COVAR_SAMP(x, y) | Uses sample formula (N-1 in denominator) | When working with a sample (not full population) |
COVAR_SAMP(col1, col2)
– Sample covariance.
SELECT COVAR_SAMP(salary, age) FROM employees;
Purpose
- COVAR_SAMP(column1, column2) calculates the sample covariance between salary and age.
- It measures how salary and age vary together in a sample (not the entire population).
- Unlike COVAR_POP(), this function divides by (N-1) instead of N, making it more suitable for sample data.
Difference Between COVAR_SAMP()
and COVAR_POP()
Function | Formula | Use Case |
COVAR_POP(x, y) | Uses N in the denominator | When working with full population data |
COVAR_SAMP(x, y) | Uses N-1 in the denominator | When working with sample data |
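All three statistics are also available as DataFrame aggregate functions; a small sketch on illustrative data (existing SparkSession named spark assumed):
from pyspark.sql import functions as F

employees = spark.createDataFrame(
    [(30, 3000.0), (35, 4000.0), (40, 5000.0), (45, 7000.0)], ["age", "salary"])

employees.agg(
    F.corr("salary", "age").alias("corr"),             # Pearson correlation, between -1 and 1
    F.covar_pop("salary", "age").alias("covar_pop"),   # population covariance (divides by N)
    F.covar_samp("salary", "age").alias("covar_samp")  # sample covariance (divides by N - 1)
).show()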
Array-Based Aggregations
ARRAY_AGG(column)
– Aggregates data into an array.
SELECT department, ARRAY_AGG(employee_name) FROM employees GROUP BY department;
Purpose
- The ARRAY_AGG(column) function aggregates values into an array for each group.
- It preserves duplicates and maintains the order of appearance in each department.
- The result is a single array per department, containing all employee names.
EXPLODE(array_column) – Converts array elements into separate rows.
SELECT EXPLODE(array_column) FROM table;
How ARRAY_AGG() Works
GROUP BY department → Groups employees by department.
ARRAY_AGG(employee_name) → Collects all employee names in each department into an array.
Example Dataset: employees
Table
employee_name | department |
Alice | IT |
Bob | HR |
Charlie | IT |
Alice | IT |
David | Finance |
Emma | HR |
Query Output
department | ARRAY_AGG(employee_name) |
IT | ["Alice", "Charlie", "Alice"] |
HR | ["Bob", "Emma"] |
Finance | ["David"] |
Duplicates are included: "Alice" appears twice in IT and is not removed.
Order is preserved.
Difference Between ARRAY_AGG(), COLLECT_LIST(), and COLLECT_SET()
Function | Removes Duplicates? | Maintains Order? |
ARRAY_AGG() | No | Yes |
COLLECT_LIST() | No | Yes |
COLLECT_SET() | Yes | No |
- ARRAY_AGG() behaves similarly to COLLECT_LIST() but is more SQL-standard.
Alternative Approaches
- COLLECT_LIST(employee_name) → Similar to ARRAY_AGG(), collects values into a list.
- COLLECT_SET(employee_name) → Removes duplicates while collecting values.
Advanced Aggregations
Grouping Sets and Rollup
GROUPING SETS()
– Computes multiple groupings in a single query.
SELECT department, job_role, SUM(salary)
FROM employees
GROUP BY GROUPING SETS ((department), (job_role), (department, job_role));
Purpose
- GROUPING SETS allows multiple grouping combinations in a single query.
- It is more flexible than GROUP BY, producing subtotal and grand total rows.
- Each tuple in GROUPING SETS defines a specific aggregation level.
How It Works
GROUPING SETS ((department), (job_role), (department, job_role))
- (department) → Groups only by department (ignores job role).
- (job_role) → Groups only by job role (ignores department).
- (department, job_role) → Groups by both department and job role (normal GROUP BY).
This produces multiple aggregation levels.
Example Dataset: employees
Table
employee_name | department | job_role | salary |
Alice | IT | Developer | 5000 |
Bob | HR | Manager | 7000 |
Charlie | IT | Developer | 6000 |
David | IT | Analyst | 8000 |
Emma | HR | Analyst | 6000 |
Query Output
department | job_role | SUM(salary) |
IT | NULL | 19000 |
HR | NULL | 13000 |
NULL | Developer | 11000 |
NULL | Manager | 7000 |
NULL | Analyst | 14000 |
IT | Developer | 11000 |
IT | Analyst | 8000 |
HR | Manager | 7000 |
HR | Analyst | 6000 |
Understanding NULL Values in Output
- When job_role is NULL, it means the row represents a department-level total.
- When department is NULL, it means the row represents a job role-level total.
Tricky Edge Cases
Using GROUPING_ID() to Identify Aggregate Levels
SELECT department, job_role, SUM(salary), GROUPING_ID(department, job_role) FROM employees GROUP BY GROUPING SETS ((department), (job_role), (department, job_role));
- This adds a column (GROUPING_ID) that identifies the level of aggregation.
Adding a Grand Total (Full Aggregation)
SELECT department, job_role, SUM(salary) FROM employees GROUP BY GROUPING SETS ((department), (job_role), (department, job_role), ());
- Adding () at the end includes a grand total row.
Alternative Approaches
Approach | Use Case |
GROUPING SETS() | Selective groupings (flexible subtotals) |
ROLLUP() | Hierarchical aggregation (e.g., department → job role → total) |
CUBE() | All possible group combinations |
CUBE()
– Aggregates across all possible combinations of specified columns.
SELECT department, job_role, SUM(salary)
FROM employees
GROUP BY CUBE(department, job_role);
Purpose
- The CUBE() function computes all possible combinations of groupings.
- It generates totals (aggregates) at multiple levels, including:
  - Individual department totals
  - Individual job_role totals
  - Combinations of both
  - Grand total (sum of all salaries)
How It Works
CUBE(department, job_role) creates a power set of grouping columns, meaning:
- (department, job_role) → Normal GROUP BY on both columns.
- (department, NULL) → Groups only by department (ignores job_role).
- (NULL, job_role) → Groups only by job_role (ignores department).
- (NULL, NULL) → Grand total (sum of all salaries).
Example Dataset: employees
Table
employee_name | department | job_role | salary |
Alice | IT | Developer | 5000 |
Bob | HR | Manager | 7000 |
Charlie | IT | Developer | 6000 |
David | IT | Analyst | 8000 |
Emma | HR | Analyst | 6000 |
Query Output (CUBE(department, job_role)
)
department | job_role | SUM(salary) |
IT | Developer | 11000 |
IT | Analyst | 8000 |
HR | Manager | 7000 |
HR | Analyst | 6000 |
IT | NULL | 19000 |
HR | NULL | 13000 |
NULL | Developer | 11000 |
NULL | Manager | 7000 |
NULL | Analyst | 14000 |
NULL | NULL | 32000 |
Difference Between CUBE()
, ROLLUP()
, and GROUPING SETS()
Function | Generates Grand Total? | Generates Subtotals? | Best Use Case |
GROUPING SETS() | No (unless manually included) | Custom groupings | Selective subtotals |
ROLLUP(department, job_role) | Yes | Hierarchical subtotals (department → job_role → total) | Hierarchical aggregation |
CUBE(department, job_role) | Yes | All possible group combinations | All possible aggregations |
Tricky Edge Cases
Using GROUPING_ID() to Identify Aggregation Level
SELECT department, job_role, SUM(salary), GROUPING_ID(department, job_role) FROM employees GROUP BY CUBE(department, job_role);
- This helps differentiate full totals, partial totals, and raw values.
Filtering Out Grand Totals
SELECT department, job_role, SUM(salary) FROM employees GROUP BY CUBE(department, job_role) HAVING department IS NOT NULL OR job_role IS NOT NULL;
- Removes the (NULL, NULL) grand total row.
ROLLUP()
– Computes hierarchical aggregations.
SELECT department, job_role, SUM(salary)
FROM employees
GROUP BY ROLLUP(department, job_role);
Purpose
- The ROLLUP() function is used to generate hierarchical aggregations.
- It computes subtotals at different levels, following a top-down hierarchy.
- Unlike CUBE(), ROLLUP() does not generate all possible combinations; it follows a hierarchy from left to right.
How It Works
ROLLUP(department, job_role) generates:
- (department, job_role) → Normal GROUP BY (Department + Job Role).
- (department, NULL) → Groups by Department only (Subtotal per department).
- (NULL, NULL) → Grand total (Sum of all salaries).
Hierarchy Breakdown
┌──────────────┐
│ Grand Total │
├──────────────┤
│ Department 1 │
│ ├─ Job A │
│ ├─ Job B │
├──────────────┤
│ Department 2 │
│ ├─ Job C │
│ ├─ Job D │
└──────────────┘
- No (NULL, job_role) rows (unlike CUBE(), it doesn't group by job roles alone).
Example Dataset
employee_name | department | job_role | salary |
Alice | IT | Developer | 5000 |
Bob | HR | Manager | 7000 |
Charlie | IT | Developer | 6000 |
David | IT | Analyst | 8000 |
Emma | HR | Analyst | 6000 |
Query Output (ROLLUP(department, job_role)
)
department | job_role | SUM(salary) |
IT | Developer | 11000 |
IT | Analyst | 8000 |
HR | Manager | 7000 |
HR | Analyst | 6000 |
IT | NULL | 19000 |
HR | NULL | 13000 |
NULL | NULL | 32000 |
Difference Between ROLLUP()
, CUBE()
, and GROUPING SETS()
Function | Generates Grand Total? | Generates Subtotals? | Generates All Combinations? | Best Use Case |
GROUPING SETS() | ❌ No (unless manually included) | ✅ Custom groupings | ❌ No | Selective subtotals |
ROLLUP(department, job_role) | ✅ Yes | ✅ Hierarchical subtotals (department → job_role → total) | ❌ No | Hierarchical aggregation |
CUBE(department, job_role) | ✅ Yes | ✅ All possible group combinations | ✅ Yes | All possible aggregations |
Tricky Edge Cases
Using GROUPING_ID() to Identify Aggregation Level
SELECT department, job_role, SUM(salary), GROUPING_ID(department, job_role) FROM employees GROUP BY ROLLUP(department, job_role);
- Helps differentiate subtotals and grand total rows.
Filtering Out Grand Total
SELECT department, job_role, SUM(salary) FROM employees GROUP BY ROLLUP(department, job_role) HAVING department IS NOT NULL;
- Removes the (NULL, NULL) grand total row.
Approximate Aggregations (For Big Data)
APPROX_COUNT_DISTINCT(column) – Faster unique count estimation.
SELECT APPROX_COUNT_DISTINCT(employee_id) FROM employees;
APPROX_PERCENTILE(column, percentile) – Faster percentile computation.
SELECT APPROX_PERCENTILE(salary, 0.9) FROM employees;
Custom Aggregations Using User-Defined Functions (UDAF)
User-Defined Aggregation Function (UDAF)
- You can define a custom aggregation using PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
import pandas as pd
spark = SparkSession.builder.appName("UDAF Example").getOrCreate()
df = spark.createDataFrame([("IT", 5000.0), ("IT", 6000.0), ("HR", 7000.0)], ["department", "salary"])
# Grouped-aggregate pandas UDF (requires pyarrow): receives all salaries of a group as a pandas Series
@pandas_udf(DoubleType())
def custom_avg(salary: pd.Series) -> float:
    return float(salary.sum() / salary.count())
df.groupBy("department").agg(custom_avg("salary").alias("custom_avg")).show()
Time-Series Aggregations
DATE_TRUNC(unit, column)
– Aggregation based on time intervals.
SELECT DATE_TRUNC('month', order_date), SUM(sales) FROM sales GROUP BY 1;
WINDOW()
– Used for time-based aggregations in structured streaming.
SELECT window(time_column, '1 hour'), COUNT(*) FROM events GROUP BY 1;
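The same time bucketing is available in the DataFrame API through the window() function. The sketch below runs on a static DataFrame with made-up events, but the same expression is used on streaming DataFrames (existing SparkSession named spark assumed):
from pyspark.sql import functions as F

events = spark.createDataFrame(
    [("2024-01-01 10:05:00", "click"), ("2024-01-01 10:40:00", "click"), ("2024-01-01 11:10:00", "view")],
    ["time_column", "event_type"])
events = events.withColumn("time_column", F.to_timestamp("time_column"))

# Count events per 1-hour tumbling window
events.groupBy(F.window("time_column", "1 hour")).count().show(truncate=False)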
Quantile-Based Aggregations
PERCENT_RANK()
–
Computes the relative rank of a row.
SELECT name, salary, PERCENT_RANK() OVER (ORDER BY salary DESC) FROM employees;
Explanation of PERCENT_RANK() OVER (ORDER BY salary DESC)
in Spark SQL
The query:
SELECT name, salary, PERCENT_RANK() OVER (ORDER BY salary DESC)
FROM employees;
Purpose
- PERCENT_RANK() computes the relative rank of each row as a percentage between 0 and 1.
- It compares a row’s position within the dataset; rows with equal values receive the same PERCENT_RANK value.
Ordering:
- ORDER BY salary DESC → Ranks employees based on salary in descending order (the highest salary gets rank 0).
- If multiple employees have the same salary, they receive the same rank, but PERCENT_RANK() keeps counting positions.
The PERCENT_RANK() function calculates the relative rank of a row as a percentage between 0 and 1 using the formula:
PERCENT_RANK = (Rank - 1) / (Total Rows - 1)
Where:
- Rank = The rank of the current row (using the RANK() function).
- Total Rows = The total number of rows in the result set.
How It Works
Ranks employees by salary in descending order.
Computes the relative rank as a percentage from 0 to 1.
Formula ensures the highest-ranked employee gets 0, and the lowest-ranked gets 1.
Example Dataset
Employee | Salary |
Alice | 9000 |
Bob | 8000 |
Charlie | 7000 |
David | 7000 |
Emma | 6000 |
Query Output (PERCENT_RANK() OVER (ORDER BY salary DESC)
)
Employee | Salary | PERCENT_RANK |
Alice | 9000 | 0.00 |
Bob | 8000 | 0.25 |
Charlie | 7000 | 0.50 |
David | 7000 | 0.50 |
Emma | 6000 | 1.00 |
Understanding the Values
- Alice (9000) → Rank = 1 → PERCENT_RANK = (1-1)/(5-1) = 0.00
- Bob (8000) → Rank = 2 → PERCENT_RANK = (2-1)/(5-1) = 0.25
- Charlie & David (7000) → Tied Rank = 3 → PERCENT_RANK = (3-1)/(5-1) = 0.50
- Emma (6000) → Rank = 5 → PERCENT_RANK = (5-1)/(5-1) = 1.00
Difference Between PERCENT_RANK()
, RANK()
, and DENSE_RANK()
Function | Handles Duplicates? | Skips Ranks? | Output Range |
PERCENT_RANK() | ✅ Yes | ✅ Yes | 0 to 1 (Percentage) |
RANK() | ✅ Yes | ✅ Yes | 1, 2, 2, 4, ... (Skips ranks) |
DENSE_RANK() | ✅ Yes | ❌ No | 1, 2, 2, 3, ... (No gaps) |
Alternative Approaches
Ranking Employees with RANK()
SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS rank FROM employees;
- Outputs actual ranking instead of a percentage.
Using PERCENT_RANK() with Partitioning
SELECT department, name, salary, PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary DESC) FROM employees;
- Ranks employees within each department separately.
CUME_DIST()
–
Computes cumulative distribution of values.
SELECT name, salary, CUME_DIST() OVER (ORDER BY salary DESC) FROM employees;
Data & Calculation
name | salary |
Alice | 90,000 |
Bob | 80,000 |
Charlie | 80,000 |
David | 70,000 |
Emma | 60,000 |
Step-by-Step Calculation of CUME_DIST()
name | salary | Rank | Rows ≤ current row | Total Rows | CUME_DIST() |
Alice | 90,000 | 1 | 1 | 5 | 1/5 = 0.20 |
Bob | 80,000 | 2 | 3 | 5 | 3/5 = 0.60 |
Charlie | 80,000 | 2 | 3 | 5 | 3/5 = 0.60 |
David | 70,000 | 4 | 4 | 5 | 4/5 = 0.80 |
Emma | 60,000 | 5 | 5 | 5 | 5/5 = 1.00 |
Interpretation:
- Alice’s salary is the highest (1 out of 5 people, 0.20 cumulative distribution).
- Bob and Charlie share the same salary, so they have the same CUME_DIST() value (3/5 = 0.60).
- The lowest salary has a CUME_DIST() of 1.0, since with descending ordering 100% of rows fall at or before that position.
MEDIAN()
(Custom Implementation)
Spark doesn’t have a built-in median, but it can be calculated as:
SELECT percentile_approx(salary, 0.5) FROM employees;
HISTOGRAM()
– Creating Binned Distributions
Computes a histogram (frequency distribution) of numerical data.
Example:
SELECT histogram_numeric(salary, 10) FROM employees;
- This bins salary into 10 buckets and returns frequency counts.
Purpose
- HISTOGRAM_NUMERIC(column, num_bins) approximates the histogram distribution of a numeric column.
- The second parameter (10 in this case) defines the number of bins (ranges).
- It returns an array of (bucket, frequency) pairs, showing how salaries are distributed across the bins.
How It Works
- Divides the salary values into approximately num_bins intervals.
- Counts the number of values falling into each bin.
- Returns an array of tuples (bin_value, frequency), where:
  - bin_value → Midpoint or representative value of the bin.
  - frequency → Count of salaries falling into that bin.
Example Dataset
Employee | Salary |
A | 3000 |
B | 5000 |
C | 6000 |
D | 7000 |
E | 10000 |
Query Output (HISTOGRAM_NUMERIC(salary, 3)
)
SELECT HISTOGRAM_NUMERIC(salary, 3) FROM employees;
Bin (Midpoint) | Frequency (Count) |
3000.0 | 1 |
6000.0 | 3 |
10000.0 | 1 |
Explanation:
- The 3 bins divide the range (3000 - 10000) into 3 groups.
- The first bin (3000.0) → Contains 1 salary (3000).
- The second bin (6000.0) → Contains 5000, 6000, 7000 (3 salaries).
- The third bin (10000.0) → Contains 1 salary (10000).
Using More Bins (HISTOGRAM_NUMERIC(salary, 5)
)
SELECT HISTOGRAM_NUMERIC(salary, 5) FROM employees;
Bin (Midpoint) | Frequency (Count) |
3000.0 | 1 |
5000.0 | 1 |
6000.0 | 1 |
7000.0 | 1 |
10000.0 | 1 |
- Now each salary falls into its own bin since num_bins = 5 (matching the number of salaries).
Key Observations
HISTOGRAM_NUMERIC()
helps visualize data distribution in a structured way.
More bins increase granularity, fewer bins group data together.
Useful for histogram visualizations, anomaly detection, and trends.
Alternative Approaches
Using PERCENTILE_APPROX() for Quartile-Based Distribution
SELECT PERCENTILE_APPROX(salary, ARRAY(0.25, 0.5, 0.75)) FROM employees;
- Returns quartile distributions instead of a histogram.
Grouping Data Manually
SELECT CASE WHEN salary < 5000 THEN 'Low' WHEN salary BETWEEN 5000 AND 8000 THEN 'Medium' ELSE 'High' END AS salary_range, COUNT(*) FROM employees GROUP BY salary_range;
- Creates custom salary groups instead of bins.
BIT_OR()
/ BIT_AND()
– Bitwise Aggregations
Performs bitwise OR / AND operations across grouped values.
Example:
SELECT department, BIT_OR(permissions) FROM users GROUP BY department;
STRING_AGG() – Concatenating Strings in Aggregation
Concatenates multiple string values into a single string (not available as a built-in in all Spark versions; see GROUP_CONCAT() below for a CONCAT_WS() + COLLECT_LIST() alternative).
Example:
SELECT department, STRING_AGG(employee_name, ', ') FROM employees GROUP BY department;
- Outputs comma-separated lists of employee names per department.
HYPERLOGLOG_PLUS()
Advanced Cardinality Estimation
Similar to APPROX_COUNT_DISTINCT(), but optimized for large-scale unique counting (in Spark, APPROX_COUNT_DISTINCT() itself is backed by the HyperLogLog++ algorithm).
SELECT HYPERLOGLOG_PLUS(user_id) FROM website_visits;
ENTROPY()
– Information Theory Aggregation
Measures data randomness or distribution balance.
Example:
SELECT ENTROPY(sales_category) FROM transactions;
MODE()
(Built-in in Some Versions)
Finds the most frequently occurring value.
Example (alternative implementation for older versions):
SELECT value FROM dataset GROUP BY value ORDER BY COUNT(*) DESC LIMIT 1;
GEOMETRIC_MEAN()
– Multiplicative Mean
Computes the geometric mean (useful in financial applications).
Example:
SELECT EXP(AVG(LOG(salary))) FROM employees;
HARMONIC_MEAN()
– Reciprocal Mean Calculation
Computes the harmonic mean for rates or ratios.
Example:
SELECT COUNT(*) / SUM(1.0 / salary) FROM employees;
MEDIAN() (Alternative Approach)
Spark SQL does not have a built-in MEDIAN() function in older versions, but it can be approximated using:
SELECT percentile_approx(salary, 0.5) FROM employees;
- Efficient for large datasets (the result is approximate).
PRODUCT()
– Multiplication Aggregation
Returns the product of column values (alternative implementation in Spark SQL).
Example:
SELECT EXP(SUM(LOG(value))) FROM numbers;
SKEWNESS()
– Skewness of a Distribution
Measures the asymmetry of the distribution of values.
Example:
SELECT SKEWNESS(salary) FROM employees;
KURTOSIS()
– Peakedness of a Distribution
Measures whether the dataset has light/heavy tails.
Example:
SELECT KURTOSIS(salary) FROM employees;
RATIO_TO_REPORT()
– Proportional Contribution
Calculates the percentage of each value over the total sum.
Example:
SELECT department, salary, salary / SUM(salary) OVER (PARTITION BY department) AS ratio FROM employees;
LEAD() / LAG() – Comparing Aggregates Over Rows
Accesses previous (LAG) or next (LEAD) row values.
Example:
SELECT name, salary, LAG(salary, 1) OVER (ORDER BY salary DESC) FROM employees;
PIVOT()
– Transforming Rows into Columns
Used for pivoting categorical data.
Example:
SELECT * FROM ( SELECT department, job_role, salary FROM employees ) PIVOT ( SUM(salary) FOR job_role IN ('Manager', 'Developer', 'Analyst') );
UNPIVOT() – Transforming Columns into Rows
Reverse of PIVOT().
Example:
SELECT department, job_role, salary FROM employees UNPIVOT (salary FOR job_role IN (Manager, Developer, Analyst));
ANY_VALUE()
– Random Value in Group
Returns an arbitrary value from the group without enforcing order.
Example:
SELECT department, ANY_VALUE(employee_name) FROM employees GROUP BY department;
GROUPING() – Detecting Aggregation Levels
Identifies whether a column is aggregated in a GROUPING SETS(), ROLLUP(), or CUBE().
Example:
SELECT department, job_role, SUM(salary), GROUPING(department), GROUPING(job_role) FROM employees GROUP BY ROLLUP(department, job_role);
TOP N Records
using ROW_NUMBER()
Fetch Top N records per group.
Example:
SELECT * FROM ( SELECT name, department, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rank FROM employees ) WHERE rank <= 3;
MODE() (Efficient Alternative Using Window Functions)
Instead of the earlier workaround for MODE(), we can use DENSE_RANK().
Example:
SELECT value FROM ( SELECT value, COUNT(*) AS frequency, DENSE_RANK() OVER (ORDER BY COUNT(*) DESC) AS rnk FROM dataset GROUP BY value ) WHERE rnk = 1;
- This method works efficiently for datasets where multiple values may have the highest frequency.
GROUP_CONCAT() – Concatenation of Grouped Values
Alternative to STRING_AGG(), but not natively available in Spark SQL (can be simulated).
Example using COLLECT_LIST():
SELECT department, CONCAT_WS(', ', COLLECT_LIST(employee_name)) FROM employees GROUP BY department;
BIT_XOR() – Bitwise XOR Aggregation
Computes the bitwise XOR across grouped values.
Example:
SELECT department, BIT_XOR(permissions) FROM users GROUP BY department;
Understanding BIT_XOR(permissions) in SQL
The BIT_XOR() function performs a bitwise XOR (Exclusive OR) operation on all values in a group. It is commonly used in cryptographic applications, permission calculations, checksum operations, and error detection mechanisms.
1. How BIT_XOR() Works
SELECT department, BIT_XOR(permissions) FROM users GROUP BY department;
- Groups data by department.
- Performs a bitwise XOR operation across all permissions values.
- Returns a single XOR'd value per department.
2. What is Bitwise XOR (⊕)?
XOR (Exclusive OR) follows these rules:
1 ⊕ 1 = 0
0 ⊕ 0 = 0
1 ⊕ 0 = 1
0 ⊕ 1 = 1
Example Table:
department | permissions (Binary) | permissions (Decimal) |
HR | 1101 | 13 |
HR | 1011 | 11 |
IT | 1111 | 15 |
IT | 0010 | 2 |
Common Use Case:
- Read Permission = 1 (0001 in binary)
- Write Permission = 2 (0010 in binary)
- Execute Permission = 4 (0100 in binary)
- Admin Permission = 8 (1000 in binary)
3. Step-by-Step Calculation
For HR Department:
1101 (13)
⊕ 1011 (11)
------------
0110 (6) --> BIT_XOR(permissions) = 6
For IT Department:
1111 (15)
⊕ 0010 (2)
------------
1101 (13) --> BIT_XOR(permissions) = 13
Final Result:
department | BIT_XOR(permissions) |
HR | 6 |
IT | 13 |
Interpretation:
- HR department's XOR'd permissions result in 6.
- IT department's XOR'd permissions result in 13.
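A self-contained sketch that reproduces the walkthrough above with an inline VALUES table (hypothetical data, so no real users table is required):
SELECT department, BIT_XOR(permissions) AS xor_permissions
FROM VALUES ('HR', 13), ('HR', 11), ('IT', 15), ('IT', 2) AS users(department, permissions)
GROUP BY department;
-- Expected result: HR -> 6, IT -> 13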
4. BIT_XOR() vs. Other Bitwise Aggregations
Function | Description |
BIT_XOR() | Bitwise XOR of all values |
BIT_AND() | Bitwise AND of all values |
BIT_OR() | Bitwise OR of all values |
5. BIT_XOR() with NULL Values
- If permissions contains NULL, the NULL is ignored.
- If all values in a group are NULL, BIT_XOR() returns NULL.
Example:
SELECT department, BIT_XOR(permissions) FROM users GROUP BY department;
department | permissions |
HR | 13 |
HR | NULL |
IT | 15 |
- The HR department ignores the NULL and computes BIT_XOR(13) = 13.
- If all values were NULL, the result would be NULL.
6. BIT_XOR() with HAVING – Filtering Groups
Find departments where XOR'd permissions exceed 5.
SELECT department, BIT_XOR(permissions) AS xor_permissions
FROM users
GROUP BY department
HAVING BIT_XOR(permissions) > 5;
- Keeps only departments where the computed XOR result is greater than 5.
7. BIT_XOR() in Joins – Ensuring Correct Aggregation
✅ Joins can introduce duplicate values, affecting BIT_XOR().
Incorrect Query (Duplicates Affect XOR)
SELECT d.department_name, BIT_XOR(u.permissions)
FROM departments d
JOIN users u ON d.department_id = u.department_id
GROUP BY d.department_name;
If a user appears multiple times due to the join, their permission values are XOR'd multiple times. Since x ⊕ x = 0, an even number of duplicates cancels that user's permissions out of the result entirely.
Fix: Use DISTINCT to avoid duplicates:
SELECT d.department_name, BIT_XOR(DISTINCT u.permissions) FROM departments d JOIN users u ON d.department_id = u.department_id GROUP BY d.department_name;
8. BIT_XOR() with CASE – Conditional XOR
✅ Find XOR'd permissions only for Managers.
SELECT department,
BIT_XOR(CASE WHEN role = 'Manager' THEN permissions ELSE 0 END) AS manager_xor
FROM users
GROUP BY department;
- Computes XOR only for employees with the role "Manager" (the ELSE 0 branch is a no-op, since XOR with 0 leaves a value unchanged).
9. BIT_XOR() with UNION vs. UNION ALL
✅ UNION removes duplicates, which may affect BIT_XOR() results.
✅ UNION ALL keeps duplicates, potentially changing XOR calculations.
Example:
SELECT department, BIT_XOR(permissions) FROM (
SELECT department, permissions FROM users
UNION
SELECT department, permissions FROM admins
) AS combined_data
GROUP BY department;
- Merges users and admins before computing XOR.
- If using UNION ALL, duplicate values remain and may alter results.
10. BIT_XOR() with Binary Data
✅ XOR operations are commonly used in cryptography, security, and access control.
✅ A real-world use case is storing permission bitmasks (e.g., Read = 1, Write = 2, Execute = 4).
Example:
SELECT department, BIT_XOR(permissions) AS xor_permissions
FROM users
GROUP BY department;
department | XOR Permissions (Binary) | XOR Permissions (Decimal) |
HR | 0110 | 6 |
IT | 1101 | 13 |
✅ This allows us to quickly calculate access control bitmasks for departments.
11. Summary of Tricky SQL Concepts with BIT_XOR()
Scenario | Explanation |
BIT_XOR() computes bitwise XOR | XORs all values in a group |
BIT_XOR() ignores NULL | NULL values are skipped, but all NULLs return NULL |
BIT_XOR() vs. BIT_AND() vs. BIT_OR() | Different bitwise aggregations |
BIT_XOR() with HAVING | Filters groups based on XOR results |
BIT_XOR() in Joins | Duplicate values may alter XOR results |
BIT_XOR() with CASE | Conditional XOR within groups |
BIT_XOR() in UNION vs. UNION ALL | UNION removes duplicates, affecting XOR |
LOG_SUM_EXP() – Logarithmic Summation for Stability
Useful in machine learning applications (avoiding underflow).
Example:
SELECT LOG(SUM(EXP(value))) FROM dataset;
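The naive form above can overflow EXP() when values are large. A numerically stable sketch, assuming a dataset table with a numeric value column, subtracts the maximum before exponentiating and adds it back afterwards:
SELECT m + LOG(SUM(EXP(value - m))) AS log_sum_exp
FROM dataset
CROSS JOIN (SELECT MAX(value) AS m FROM dataset) mx  -- shift by the max so EXP() stays in a safe range
GROUP BY m;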
COUNT_IF() – Conditional Counting
A cleaner way to count values meeting a condition.
Example (alternative using SUM()):
SELECT SUM(CASE WHEN salary > 50000 THEN 1 ELSE 0 END) FROM employees;
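Spark SQL 3.0+ also ships COUNT_IF() as a built-in aggregate, so the same count can be written directly:
SELECT COUNT_IF(salary > 50000) FROM employees;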
APPROX_MEDIAN() – Faster Median Computation for Big Data
A more explicit alternative to PERCENTILE_APPROX(salary, 0.5): passing an accuracy parameter controls the speed/precision trade-off.
Example:
SELECT percentile_approx(salary, 0.5, 10000) FROM employees;
RATIO() – Ratio Calculation Within Groups
Useful in reporting the percentage of a group within a total.
Example:
SELECT department, COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS percentage FROM employees GROUP BY department;
REGR_SLOPE(), REGR_INTERCEPT() – Regression Aggregations
Used for linear regression modeling directly in SQL.
Example:
SELECT REGR_SLOPE(salary, experience), REGR_INTERCEPT(salary, experience) FROM employees;
- Finds the best-fit line between salary and experience.
REGR_R2() – Coefficient of Determination (R² Score)
Measures how well data fits a regression model.
Example:
SELECT REGR_R2(salary, experience) FROM employees;
UNIQUE_COUNT_ESTIMATE() – Faster Approximate Unique Counting
A variant of APPROX_COUNT_DISTINCT().
Example:
SELECT UNIQUE_COUNT_ESTIMATE(user_id) FROM website_visits;
MAD() – Median Absolute Deviation for Outlier Detection
Measures the spread of data around the median.
Example:
SELECT PERCENTILE_APPROX(ABS(e.salary - m.median_salary), 0.5) AS mad FROM employees e CROSS JOIN (SELECT PERCENTILE_APPROX(salary, 0.5) AS median_salary FROM employees) m;
GREATEST() and LEAST() – Row-wise Maximum/Minimum
Finds the highest or lowest value among multiple columns.
Example:
SELECT GREATEST(salary_2022, salary_2023, salary_2024) FROM employees;
SELECT LEAST(salary_2022, salary_2023, salary_2024) FROM employees;
COUNT_MIN_SKETCH() – Frequency Approximation for Streaming Data
Useful for real-time aggregation.
Example:
SELECT COUNT_MIN_SKETCH(salary, 0.01, 0.95, 42) FROM employees;
- The arguments are the relative error (eps), the confidence, and a random seed; the result is a binary sketch usable for approximate frequency and top-K tracking.
QUANTILE() – Exact Percentiles for Small Datasets
Unlike PERCENTILE_APPROX(), this is for exact quantiles on smaller datasets.
Example:
SELECT QUANTILE(salary, 0.75) FROM employees;
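QUANTILE() is not a standard Spark SQL built-in; the exact aggregate Spark provides is PERCENTILE(), which fully sorts the group and is therefore best reserved for small or pre-filtered datasets:
SELECT PERCENTILE(salary, 0.75) FROM employees;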
GROUPING_ID() – Identifying Aggregation Levels in GROUPING SETS()
Helps distinguish which columns are aggregated vs. not aggregated.
Example:
SELECT department, job_role, SUM(salary), GROUPING_ID(department, job_role) FROM employees GROUP BY GROUPING SETS ((department), (job_role), (department, job_role));
INVERSE_PERCENTILE() – Reverse Lookup of Percentiles
Given a percentile, finds the corresponding value.
Example:
SELECT INVERSE_PERCENTILE(salary, 0.9) FROM employees;
EXPONENTIAL_MOVING_AVG() – Exponential Smoothing for Time Series
Used for time-series forecasting.
Example:
SELECT EXPONENTIAL_MOVING_AVG(sales, 0.3) OVER (ORDER BY order_date) FROM sales_data;
DECILE() – Dividing Data into 10 Equal Parts
Similar to quartiles and percentiles.
Example:
SELECT NTILE(10) OVER (ORDER BY salary) FROM employees;
MOVING_AVERAGE() – Rolling Window Aggregation
Calculates a rolling average for time-series data.
Example:
SELECT order_date, AVG(sales) OVER (ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg FROM sales;
STDDEV_SAMP() and STDDEV_POP() – Sample & Population Standard Deviation
We previously covered STDDEV(), but Spark SQL provides two explicit variants:
- STDDEV_SAMP() – Sample standard deviation.
- STDDEV_POP() – Population standard deviation.
Example:
SELECT STDDEV_SAMP(salary), STDDEV_POP(salary) FROM employees;
VAR_SAMP() and VAR_POP() – Sample & Population Variance
Similar to standard deviation, but returns variance instead.
Example:
SELECT VAR_SAMP(salary), VAR_POP(salary) FROM employees;
BLOOM_FILTER_AGG() – Probabilistic Set Membership
Efficiently tests whether an element is in a dataset (without full scan).
Example:
SELECT BLOOM_FILTER_AGG(user_id, 100000, 0.01) FROM transactions;
TOP_K() – Approximate Most Frequent Elements
Helps in identifying top K elements efficiently.
Example:
SELECT TOP_K(product_id, 5) FROM sales_data;
EVENT_RATE_ESTIMATE() – Approximate Count of Rare Events
Used in fraud detection & anomaly detection.
Example:
SELECT EVENT_RATE_ESTIMATE(transaction_id) FROM fraud_logs;
HASH_AGG() – Hash-based Aggregation for Large Datasets
Performs a hash-based approach for aggregating data faster.
Example:
SELECT HASH_AGG(salary) FROM employees;
STRING_AGG() with ORDER BY – Concatenation in Sorted Order
A sorted version of STRING_AGG().
Example:
SELECT department, STRING_AGG(employee_name, ', ' ORDER BY employee_name) FROM employees GROUP BY department;
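On Spark versions without STRING_AGG(), the same sorted concatenation can be sketched with COLLECT_LIST() and SORT_ARRAY():
SELECT department,
       CONCAT_WS(', ', SORT_ARRAY(COLLECT_LIST(employee_name))) AS employees_sorted  -- alphabetical, comma-separated
FROM employees
GROUP BY department;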
CARDINALITY_ESTIMATE() – Fast Approximate Cardinality
Similar to APPROX_COUNT_DISTINCT(), but optimized for streaming data.
Example:
SELECT CARDINALITY_ESTIMATE(user_id) FROM website_logs;
BIT_COUNT() – Counting Number of 1s in a Bit Representation
Useful in low-level analytics, such as binary data processing.
Example:
SELECT BIT_COUNT(binary_column) FROM bitwise_table;
UNIQUE_PERCENTILE() – Approximate Unique Distribution
Helps in estimating unique distribution across percentile ranges.
Example:
SELECT UNIQUE_PERCENTILE(salary, 0.8) FROM employees;
FREQUENT_ITEMS() – Finding Frequent Elements in Large Datasets
Similar to TOP_K(), but optimized for streaming & big data.
Example:
SELECT FREQUENT_ITEMS(transaction_category, 10) FROM sales_data;
ROLLING_PERCENTILE() – Moving Window Percentile
Time-series percentile calculations.
Example:
SELECT order_date, ROLLING_PERCENTILE(sales, 0.9) OVER (ORDER BY order_date ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) FROM sales;
BIT_SET_AGG() – Bitwise Aggregation Across Rows
Aggregates binary bitmask values.
Example:
SELECT BIT_SET_AGG(permissions) FROM user_roles;
MOST_RECENT() – Last Non-Null Value
Similar to LAST(), but ensures no NULL values.
Example:
SELECT department, MOST_RECENT(salary) FROM employees GROUP BY department;
EARLIEST() – First Non-Null Value
Similar to FIRST(), but excludes NULL values.
Example:
SELECT department, EARLIEST(salary) FROM employees GROUP BY department;
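Neither MOST_RECENT() nor EARLIEST() is a standard Spark SQL built-in; the same behavior is available through the optional ignoreNulls argument of LAST() and FIRST() (note that without an explicit ordering the row chosen within each group is non-deterministic):
SELECT department,
       FIRST(salary, true) AS first_non_null_salary,  -- first non-NULL value encountered
       LAST(salary, true)  AS last_non_null_salary    -- last non-NULL value encountered
FROM employees
GROUP BY department;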
DEVIATION() – Absolute Deviation from the Mean
Computes absolute deviation from mean (useful for outlier detection).
Example:
SELECT ABS(salary - AVG(salary) OVER ()) FROM employees;
PERMUTATION_AGG() – Aggregate Across Permutations
Used in graph algorithms & combinatorial aggregations.
Example:
SELECT PERMUTATION_AGG(node_id) FROM graph_data;
JACCARD_SIMILARITY() – Set-Based Similarity Calculation
Computes similarity between sets using Jaccard Index.
Example:
SELECT JACCARD_SIMILARITY(set1, set2) FROM dataset;
COSINE_SIMILARITY() – Similarity Between Feature Vectors
Machine learning similarity function.
Example:
SELECT COSINE_SIMILARITY(feature_vector_1, feature_vector_2) FROM embeddings;
TOP_N_PER_GROUP() – Top N Elements Within Each Group
Similar to TOP_K(), but per group.
Example:
SELECT * FROM ( SELECT name, department, salary, ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rank FROM employees ) WHERE rank <= 3;
L2_NORM() – Euclidean Distance for Feature Engineering
Computes L2 norm (square root of squared sum).
Example:
SELECT SQRT(SUM(POW(feature_value, 2))) FROM feature_dataset;
HAVERSINE_DISTANCE() – Distance Between Two Coordinates
Used in geospatial calculations.
Example:
SELECT HAVERSINE_DISTANCE(lat1, lon1, lat2, lon2) FROM location_data;
Summary Table
Category | Function |
Basic Aggregations | COUNT() , SUM() , AVG() , MIN() , MAX() |
Intermediate Aggregations | VARIANCE() , STDDEV() , RANK() , DENSE_RANK() , ROW_NUMBER() , NTILE() |
Advanced Aggregations | CUBE() , ROLLUP() , GROUPING SETS() , APPROX_COUNT_DISTINCT() , APPROX_PERCENTILE() |
Statistical Functions | CORR() , COVAR_POP() , COVAR_SAMP() |
Array-Based Aggregations | ARRAY_AGG() , COLLECT_SET() , COLLECT_LIST() |
Time-Series Aggregations | DATE_TRUNC() , WINDOW() |
Quantile-Based Aggregations | PERCENT_RANK() , CUME_DIST() , PERCENTILE_APPROX() |