Part 1 - PySpark: RDDs, Transformations, and the Spark Execution Model
Published on Wednesday, Feb 26, 2025
This is the first blog in my PySpark learning journey, where I explore the foundational concepts of Spark, RDDs, and key PySpark functions. From understanding transformations to working with higher-order functions and partitioning, this blog serves as an introduction to the core building blocks of Spark and PySpark.
Exploring Apache Spark and PySpark
Apache Spark is a powerful, plug-and-play compute engine designed for distributed processing. Whether you’re dealing with massive datasets or building scalable machine learning models, Spark offers the tools and flexibility required for high-performance data processing.
What Does Spark Need to Operate?
To function effectively, Spark relies on two critical components:
- Storage: This can range from distributed file systems like HDFS to cloud-based solutions such as:
  - Azure Data Lake Storage (ADLS Gen2)
  - Amazon S3
  - Google Cloud Storage
- Resource Manager: This oversees resource allocation and scheduling; options include:
  - YARN
  - Mesos
  - Kubernetes
Formal Definition of Apache Spark
Apache Spark is a multi-language engine designed for executing data engineering, data science, and machine learning tasks on a single node or a distributed cluster. Its seamless integration with Python, referred to as PySpark, makes it a popular choice among developers and data professionals.
Understanding RDDs: The Building Blocks of Spark
At the heart of Apache Spark lies the Resilient Distributed Dataset (RDD), the fundamental data structure that enables Spark’s distributed data processing capabilities.
Key Features of RDDs
- Immutability: Once an RDD is created, it cannot be altered. This ensures data consistency and allows for efficient fault tolerance.
- Lineage: RDDs maintain a record of all transformations applied to their parent RDDs. This “lineage” enables Spark to recover lost data by re-executing the transformations as per the Directed Acyclic Graph (DAG) execution plan.
The DAG Execution Model
Spark employs a Directed Acyclic Graph (DAG) to represent the sequence of operations (transformations and actions) performed on RDDs. This model optimizes task execution and ensures resilience to failures. If an RDD is lost, Spark uses its lineage and the DAG to rebuild the data effortlessly.
Higher-Order Functions in PySpark
Higher-order functions are an essential concept in PySpark and functional programming. These are functions that can take another function as a parameter or return a function as output.
Examples of Higher-Order Functions in PySpark
- map: Applies a function to each element in the dataset, producing the same number of output rows as input rows.
- reduce: Aggregates the dataset into a single value.
- reduceByKey: Performs an aggregation for each distinct key in the dataset.
Key Differences: reduce vs reduceByKey
- reduce: Outputs a single aggregated value from the entire dataset.
- reduceByKey: Outputs an aggregated result for each distinct key. For example, if there are 100 distinct keys in an input of 1000 rows, the output will contain 100 rows. (A short sketch of all three functions follows below.)
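To make the difference concrete, here is a minimal sketch. It assumes a SparkSession named spark is already available; the sample data and variable names are purely illustrative.
```python
# Minimal sketch of map, reduce, and reduceByKey (assumes an existing SparkSession `spark`).
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

squared = rdd.map(lambda x: x * x)               # transformation: 5 rows in, 5 rows out
total = rdd.reduce(lambda a, b: a + b)           # action: collapses everything to one value

pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
per_key = pairs.reduceByKey(lambda a, b: a + b)  # one output row per distinct key

print(squared.collect())   # [1, 4, 9, 16, 25]
print(total)               # 15
print(per_key.collect())   # e.g. [('a', 4), ('b', 2)] -- ordering may vary
```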
Commonly Used PySpark Functions
- map:
  - Behavior: Number of output rows equals the number of input rows.
  - Usage: Transform each row in the dataset.
- reduce:
  - Behavior: Aggregates all rows into a single output value.
  - Usage: Compute totals, averages, or any aggregate result over the entire dataset.
- reduceByKey:
  - Behavior: Number of output rows equals the number of distinct keys.
  - Example: If the input dataset contains 1000 rows with 100 unique keys, the output will have 100 rows.
- filter:
  - Behavior: Filters rows based on a condition; the number of output rows is less than or equal to the number of input rows.
  - Usage: Narrow down the dataset based on specific criteria.
- sortBy / sortByKey:
  - sortBy: Sorts rows based on values, in ascending (default) or descending order.
  - sortByKey: Sorts rows based on keys.
  - Behavior: Number of output rows equals the number of input rows.
- distinct:
  - Behavior: Outputs only unique rows, reducing the number of output rows.
  - Usage: Remove duplicates and get distinct values (see the example after this list).
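As a quick illustration of the last two entries, here is a hedged sketch (again assuming a SparkSession named spark; the sample values are made up):
```python
# distinct removes duplicate rows; sortBy/sortByKey order an RDD by value or key.
nums = spark.sparkContext.parallelize([3, 1, 2, 3, 1])
print(nums.distinct().collect())           # e.g. [1, 2, 3] -- order not guaranteed
print(nums.sortBy(lambda x: x).collect())  # [1, 1, 2, 3, 3]

scores = spark.sparkContext.parallelize([("b", 2), ("a", 1), ("c", 3)])
print(scores.sortByKey().collect())        # [('a', 1), ('b', 2), ('c', 3)]
```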
Chaining Functions in PySpark
Function chaining is a common and powerful approach in PySpark that allows for cleaner, more concise code. By chaining transformations and actions together, we can streamline our data processing workflows, reducing the need for intermediate steps.
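The sketch below chains several of the functions covered above into a small word-count pipeline. It assumes a SparkSession named spark; the word list is just sample data.
```python
# filter -> map -> reduceByKey -> sortBy chained without intermediate variables.
words = spark.sparkContext.parallelize(["spark", "pyspark", "rdd", "spark", "dag"])

result = (
    words.filter(lambda w: len(w) > 3)               # keep longer words
         .map(lambda w: (w, 1))                      # pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)            # aggregate counts per word
         .sortBy(lambda kv: kv[1], ascending=False)  # most frequent first
)
print(result.collect())   # [('spark', 2), ('pyspark', 1)]
```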
Understanding RDD Partitions
In PySpark, partitions define the logical division of data across nodes in a cluster. Partitions play a crucial role in parallel processing and can significantly impact performance.
Checking the Number of RDD Partitions
The number of partitions in an RDD can be checked using the getNumPartitions() method:
words = ("Nitish", "kumar", "nkumar37", "ipl", "football", "cricket")
words_rdd = spark.sparkContext.parallelize(words)
print(words_rdd.getNumPartitions())  # e.g. 2
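The number printed above depends on the cluster's default parallelism, so it may differ in your environment. If needed, parallelize also accepts an explicit number of partitions as its second argument (the variable name below is just for illustration):
```python
# Explicitly request four partitions instead of relying on the default.
words_rdd_4 = spark.sparkContext.parallelize(words, 4)
print(words_rdd_4.getNumPartitions())  # 4
```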
Default Partitioning Properties
- defaultMinPartitions: Determines the minimum number of partitions Spark will create for an RDD.
  - Accessed via spark.sparkContext.defaultMinPartitions.
- defaultParallelism: Indicates the default number of tasks that can run in parallel.
  - Accessed via spark.sparkContext.defaultParallelism.
Both can be inspected directly, as shown below.
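```python
# Print the defaults (assumes an existing SparkSession `spark`);
# the exact values depend on your cluster configuration.
print(spark.sparkContext.defaultMinPartitions)  # e.g. 2
print(spark.sparkContext.defaultParallelism)    # e.g. the number of cores available
```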
Choosing Between countByValue and reduceByKey
What is countByValue?
countByValue is an action in PySpark that combines the functionality of map and reduceByKey into a single step. It aggregates data and returns the count of each unique value.
When to Use countByValue vs reduceByKey
- countByValue:
  - Captures the output on the local (gateway) node.
  - Best suited when no further parallel processing is required on the results.
- reduceByKey:
  - A transformation, meaning the results remain distributed across the cluster.
  - Ideal when further parallel processing is necessary.
Decision Guide:
- Use map + reduceByKey if you plan to perform additional parallel processing.
- Use countByValue for final results where no further transformations are required.
Both approaches are compared in the sketch below.
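This sketch assumes a SparkSession named spark and some made-up sample words:
```python
words = spark.sparkContext.parallelize(["big", "data", "big", "spark", "data", "big"])

# countByValue is an action: the counts come back to the driver as a dict-like object.
print(words.countByValue())   # e.g. {'big': 3, 'data': 2, 'spark': 1}

# map + reduceByKey keeps the result as a distributed RDD for further processing.
counts_rdd = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts_rdd.collect())   # e.g. [('big', 3), ('data', 2), ('spark', 1)]
```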
Categories of Transformations
Transformations in PySpark are the foundation of data manipulation. They define how data is processed and distributed across the cluster.
1. Narrow Transformations
- Definition: Transformations where no data shuffling occurs across the cluster.
- Characteristics:
- Transformed data remains on the same machine.
- Highly efficient as no network transfer is involved.
- Examples (a short sketch follows this list):
  - map
  - filter
  - flatMap
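A small sketch of narrow transformations, assuming a SparkSession named spark and some sample lines of text:
```python
# Each output partition depends on exactly one input partition, so no shuffle occurs.
lines = spark.sparkContext.parallelize(["hello spark", "hello pyspark"])

words = lines.flatMap(lambda line: line.split(" "))  # 1 row -> many rows
upper = words.map(lambda w: w.upper())               # 1 row -> 1 row
no_hello = upper.filter(lambda w: w != "HELLO")      # 1 row -> 0 or 1 rows
print(no_hello.collect())   # ['SPARK', 'PYSPARK']
```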
2. Wide Transformations
- Definition: Transformations where data is shuffled across nodes for a global aggregation.
- Characteristics:
- Involves network transfer of data.
- Computationally expensive due to shuffling.
- Examples:
  - reduceByKey
  - groupByKey
Best Practices for Wide Transformations:
- Minimize the use of wide transformations as they involve costly shuffles.
- Perform wide transformations towards the end of the data pipeline, when the dataset has already been narrowed down (see the sketch below).
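Here is a hedged sketch of that best practice, assuming a SparkSession named spark; the store names and amounts are made up:
```python
# Filtering (narrow) first shrinks the data before reduceByKey triggers a shuffle (wide).
sales = spark.sparkContext.parallelize([
    ("store_a", 100), ("store_b", 20), ("store_a", 50), ("store_b", 5),
])

totals = (
    sales.filter(lambda kv: kv[1] >= 10)    # narrow: prunes small sales, no shuffle
         .reduceByKey(lambda a, b: a + b)   # wide: shuffles data by key
)
print(totals.collect())   # e.g. [('store_a', 150), ('store_b', 20)]
```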
Tasks, Jobs, and Stages in PySpark
The execution model in PySpark is designed for distributed computation and is broken down into three key components: tasks, jobs, and stages.
1. Tasks
- Definition: The smallest unit of execution in Spark.
- Characteristics:
- The number of tasks equals the number of partitions.
- Each partition is processed by an associated task.
2. Jobs
- Definition: Represents a complete execution of an action.
- Characteristics:
- The number of jobs equals the number of actions executed in the code.
- Each action (e.g., count, collect, save) triggers a new job, as shown below.
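A minimal sketch, assuming a SparkSession named spark: two actions on the same RDD produce two separate jobs.
```python
rdd = spark.sparkContext.parallelize(range(10)).map(lambda x: x * 2)

rdd.count()     # action 1 -> job 1
rdd.collect()   # action 2 -> job 2
# Both jobs show up as separate entries in the Spark UI (typically at http://localhost:4040).
```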
3. Stages
- Definition: Represents a set of transformations executed as a single unit.
- Characteristics:
- The number of stages equals the number of wide operations plus one.
- A new stage is created whenever a wide transformation (e.g., reduceByKey) is encountered, as the sketch below illustrates.
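One rough way to see the stage boundary is an RDD's lineage output, again assuming a SparkSession named spark:
```python
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
counted = pairs.reduceByKey(lambda a, b: a + b)   # one wide transformation
# The indentation in toDebugString changes at each shuffle dependency,
# so this pipeline shows two stages: one wide operation + 1.
print(counted.toDebugString().decode("utf-8"))
```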
In the next blog, I will dive into concepts such as Spark joins, broadcast joins, the differences between repartition and coalesce, and higher-level APIs in Apache Spark, including DataFrames.