Get Started on Apache PySpark (Part 2)

Cedric Yang
3 min read · Mar 9, 2021

Actions on RDD

As we mentioned in part 1, there are two types of operations on RDDs: actions and transformations. In this section, we will discuss actions.

Before getting started, we need to know about SparkContext, which is the main entry point for Spark functionality. It represents the connection to a Spark cluster (the gateway) and can be used to create RDDs. Only one SparkContext may be active per JVM, and you need to stop the active SparkContext before creating a new one.
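For example, a minimal setup (the master URL and app name here are just placeholders):

```python
from pyspark import SparkContext

# "local[*]" runs Spark locally, using all available cores
sc = SparkContext(master="local[*]", appName="pyspark-part2")

# Only one SparkContext may be active per JVM;
# stop it before creating a new one:
# sc.stop()
```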

Now let’s create an RDD and perform some actions on it.
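Here is a small illustrative sketch (the sample data is arbitrary):

```python
# Build an RDD from a local Python list
nums = sc.parallelize([1, 2, 3, 4, 5])

# Actions trigger computation and return results to the driver
nums.collect()   # [1, 2, 3, 4, 5]
nums.count()     # 5
nums.first()     # 1
nums.take(3)     # [1, 2, 3]
```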

Another important action on RDD is reduce. It combines the elements of the RDD using the operation you supply.
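For instance, reusing the nums RDD from above:

```python
from operator import add

nums.reduce(add)                 # 1 + 2 + 3 + 4 + 5 = 15
nums.reduce(lambda x, y: x * y)  # 120
```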

The fold action is similar to reduce, but it takes one extra parameter: a zero value that is used as the initial value within each partition and again when merging the partition results.
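A sketch of the difference — note how a non-identity zero value is applied once per partition plus once for the final merge:

```python
from operator import add

nums.fold(0, add)   # 15, same as nums.reduce(add)

# With 2 partitions and zero value 10:
# (10 + 1) + (10 + 2 + 3) + 10 = 36
sc.parallelize([1, 2, 3], 2).fold(10, add)   # 36
```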

Transformation on RDD

Transformations are operations that we can apply to an RDD. A transformation takes an RDD as input and produces a new RDD as output. And remember, because transformations are lazy, calling one does not actually compute a new RDD; Spark only records the operation (and validates its arguments) until you apply an action. Until then, all the pending transformations are stored in a directed acyclic graph (DAG).
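You can see the laziness directly:

```python
squares = nums.map(lambda x: x * x)  # nothing runs yet; only the DAG is updated
squares.collect()                    # the action triggers the actual computation
# [1, 4, 9, 16, 25]
```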

We have two types of transformations: narrow and wide. In a narrow transformation, each output partition depends on a single partition of the parent RDD, so there is no shuffling of data across the nodes (every node does its own work and the results are concatenated). Some examples of narrow transformations: map(), flatMap(), filter(), sample(), union(). In a wide transformation, computing an output partition may require elements from many partitions of the parent RDD, which forces a shuffle of data across the nodes. Some examples of wide transformations: intersection(), distinct(), groupByKey(), reduceByKey().

Difference between narrow and wide transformation (http://java.dzone.com/articles/big-data-processing-spark)

Some narrow transformations
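For instance (a minimal sketch reusing nums from above):

```python
nums.map(lambda x: x + 1).collect()           # [2, 3, 4, 5, 6]
nums.filter(lambda x: x % 2 == 0).collect()   # [2, 4]
sc.parallelize(["a b", "c d"]).flatMap(lambda s: s.split()).collect()
# ['a', 'b', 'c', 'd']
nums.union(sc.parallelize([6, 7])).collect()  # [1, 2, 3, 4, 5, 6, 7]
```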

Some wide transformations
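For instance (the order of collected results may vary, since the data is shuffled):

```python
dup = sc.parallelize([1, 2, 2, 3, 3, 3])
dup.distinct().collect()                               # [1, 2, 3]
dup.intersection(sc.parallelize([2, 3, 4])).collect()  # [2, 3]
```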

We have seen wide transformations on simple RDDs without key-value pairs. Now let’s look at RDDs of key-value pairs. You can view a key-value pair RDD like a dictionary, where the attributes of a given key are stored in the RDD as its value.
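A pair RDD is simply an RDD of (key, value) tuples (the sample data here is arbitrary):

```python
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)])
```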

The two most important wide operations on key-value pairs are reduceByKey() and groupByKey(). Both of them group the values that share the same key; however, groupByKey() takes more computational power.

In groupByKey(), the key-value pairs of all partitions are shuffled together first. After that, the values with the same key are grouped together.

In reduceByKey(), values are combined within each partition first, and only then are the partial results merged across partitions. Since this pre-aggregation greatly reduces the size of the partitions that need to be shuffled, the time spent merging all the partitions drops accordingly.
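Both operations on the pairs RDD from above (output order may vary):

```python
from operator import add

pairs.groupByKey().mapValues(list).collect()
# [('a', [1, 3, 5]), ('b', [2, 4])]

pairs.reduceByKey(add).collect()
# [('a', 9), ('b', 6)]
```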

Difference between reduceByKey() and groupByKey() operations. (https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html)

From the code above, we can tell that reduceByKey(func) gives the same result as:

groupByKey().mapValues(lambda values: reduce(func, values))

the difference being that reduceByKey() performs the reduction before shuffling, so far less data crosses the network.

Below are some other wide transformations on key-value pairs.
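For example, sortByKey() and join() — the second RDD here is just for illustration:

```python
pairs.sortByKey().collect()
# [('a', 1), ('a', 3), ('a', 5), ('b', 2), ('b', 4)]
# (order within equal keys may vary)

other = sc.parallelize([("a", "x"), ("c", "y")])
pairs.join(other).collect()   # inner join on the keys
# [('a', (1, 'x')), ('a', (3, 'x')), ('a', (5, 'x'))]
```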
