Get Started on Apache PySpark (Part 3)

Cedric Yang
2 min read · Mar 10, 2021

RDD, Data Frame and Dataset

In this part, I will talk about the dataset, the data frame (not the Pandas data frame!) and the RDD, all of which are Spark APIs. The data frame was introduced after the RDD, and the dataset is the newest format introduced by Spark; it enjoys the benefits of both the RDD and the data frame. Their differences are shown below.

A comparison between RDD, data frame and dataset

All three APIs share the core features of Spark: they are fault tolerant, distributed and immutable. Data frames and datasets additionally support a schema, which holds information (e.g. the data type) about each column, and they run much faster than RDDs in non-JVM languages. Furthermore, data frames and datasets benefit from execution optimisation, and both support SQL queries.

Datasets also inherit some advantages of RDDs that data frames lack. For example, datasets are type safe: the compiler can verify variable types and prevent type errors. Moreover, unlike with data frames, analysis errors on RDDs and datasets are caught at compile time instead of at run time. Although datasets require less memory, they currently do not support Python and R, which RDDs and data frames do.

Next, let’s see how to create a data frame. The data we will be using is the books dataset (https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv).

Machine learning on Apache Spark

In this section, we will talk about the machine learning libraries used in PySpark and run a simple machine learning algorithm on a Spark data frame. PySpark has two machine learning libraries: ml for data frames and mllib for RDDs. We will use the graduate admissions dataset (https://www.kaggle.com/mohansacharya/graduate-admissions) to predict the chance of admission with the ml library. For classification, the data preprocessing and the way of implementing the algorithm are similar to this regression case.
