Get Started with PySpark (Part 1)
What Is Apache Spark and Why Use It?
To answer that, you first need to know how data normally flows from storage to the CPU.
First, the data sits on the hard disk and is loaded into RAM, which is an extremely slow step. From RAM it moves to the cache and then to the registers. Finally, the computation itself is carried out in the ALU.
Since reading from the hard disk is slow, why not read the data directly from RAM? The catch is that RAM is much smaller than the hard disk: when the data file is larger than the available RAM, the program runs out of memory and may crash. So how about parallelisation, where the data is distributed across multiple machines?
Before Spark, the Hadoop ecosystem was the usual way to run such distributed systems. It consists of a storage unit (HDFS), a parallel compute engine (MapReduce) and a resource manager (YARN). A Hadoop cluster has a master/slave architecture in which the resource manager runs on the master node and the computations run in parallel on the slave nodes.
Illustration of Hadoop cluster (https://www.kdnuggets.com/2017/05/hdfs-hbase-need-know.html)
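To make the MapReduce idea a bit more concrete, here is a minimal sketch in plain Python (not Hadoop code). It assumes a tiny in-memory list of text lines as the dataset, and the helper names `split_into_partitions`, `map_phase` and `reduce_phase` are made up for this illustration. Threads stand in for the slave nodes: each one counts words in its own partition, and the partial counts are merged at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal the records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def map_phase(partition):
    """Count the words of one partition, independently of the others."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def reduce_phase(partial_counts):
    """Merge the per-partition counts into a single result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


lines = ["spark hadoop spark", "hdfs yarn", "spark yarn", "hadoop hdfs hdfs"]
partitions = split_into_partitions(lines, num_partitions=2)

# Threads play the role of worker nodes; a real cluster uses separate machines.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(map_phase, partitions))

word_counts = reduce_phase(partial_counts)
print(word_counts["spark"], word_counts["hdfs"])
```

The point of the split is that no single worker ever needs to hold the whole dataset, only its own partition.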
A common misunderstanding is that Apache Spark was built to replace Hadoop. In fact, it was developed to replace MapReduce, so comparing Hadoop with Spark is not a valid comparison. For storage, we can still use HDFS or Amazon S3, and for the resource manager we have several choices, such as YARN or Spark's own standalone cluster manager.
So how does Apache Spark solve the loading speed issue? Spark can perform operations and computations directly on data held in RAM and then send the results back to the hard disk. With Hadoop MapReduce, by contrast, the data is stored on the hard disk and has to be read back from it whenever it is needed for processing. Moreover, what is read from the hard disk is the complete dataset, which is normally more than we need. This is why MapReduce can be up to 100x slower than Spark on workloads that fit in memory.
Comparison between Spark and MapReduce (https://platzi.com/clases/2045-spark-2020/31905-introduccion-a-apache-spark)
Besides efficient in-memory computation, lazy execution is another special feature of Spark. It means that when we write a query to read the data, nothing is executed until an operation that needs a result (an action) is performed on the data. Another feature is parallel processing: like a Hadoop cluster, Spark can distribute the workload across multiple nodes. But unlike MapReduce, which only does batch processing, Spark also supports near-real-time processing of data.
A summary comparison of MapReduce and Spark
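Lazy execution, as described above, is easy to imitate with Python generators. The sketch below is only an analogy, not how Spark is implemented: the function names and the `executed` log are assumptions made for this example. Building the pipeline does no work at all, and rows are only read once something consumes the result.

```python
executed = []  # records which rows were actually read


def read_rows(rows):
    """Lazily yield rows; the body does not run until someone iterates."""
    for row in rows:
        executed.append(f"read {row}")
        yield row


def keep_even(rows):
    """Lazily keep only the even values."""
    for row in rows:
        if row % 2 == 0:
            yield row


def double(rows):
    """Lazily multiply every value by two."""
    for row in rows:
        yield row * 2


# Building the pipeline only describes the work; no row is read yet.
pipeline = double(keep_even(read_rows([1, 2, 3, 4])))
steps_before_action = len(executed)

# Consuming the pipeline plays the role of an action and triggers the work.
result = list(pipeline)
steps_after_action = len(executed)

print(steps_before_action, steps_after_action, result)
```

Here `steps_before_action` stays at zero because nothing has asked for a result yet, which is the same reason a Spark query does not run the moment you write it.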
Pandas vs. PySpark
Until now, you may have used a Pandas data frame to store your data and perform operations on it. But this only works when the data is smaller than your RAM. For larger data, you need a tool like PySpark to process it. Moreover, a Pandas data frame generally uses only one core of your CPU for computation, whereas a PySpark data frame can be processed in parallel on multiple cores.
Unlike a Pandas data frame, a PySpark data frame is immutable, which helps keep the data safe: an operation returns a new data frame instead of changing the existing one.
A PySpark data frame also has some drawbacks:
- The PySpark API supports fewer types of operations
- Complex operations are harder to express than with a Pandas data frame
- Accessing the data in a PySpark data frame is slower because it is stored on multiple nodes (but processing is faster!)
- Setting up a cluster is a tedious process
Now it’s time to introduce what sits underneath the PySpark data frame! It is called the resilient distributed dataset (RDD), and it is the fundamental unit of Spark/PySpark. ‘Resilient’: it is fault tolerant and can rebuild the data on failure (the data is copied from the hard disk to RAM, so the original copy is still on the hard disk, and the RDD also keeps track of data lineage information to rebuild lost data, so there is no need to worry about such incidents!). ‘Distributed’: the data is distributed among multiple nodes in a cluster. ‘Dataset’: a collection of partitioned data with values.
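The ‘resilient’ and ‘distributed’ parts can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Spark's actual recovery mechanism: the helpers `partition` and `apply_lineage` are hypothetical, a list stands in for the copy on disk, and the lineage is simply an ordered list of functions. When one in-memory partition is lost, only that partition is rebuilt by replaying the lineage on its source chunk.

```python
def partition(data, num_partitions):
    """Split data into contiguous chunks, one per (simulated) node."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def apply_lineage(chunk, lineage):
    """Replay every recorded step, in order, on one chunk of source data."""
    for step in lineage:
        chunk = step(chunk)
    return chunk


source = list(range(1, 9))  # stands in for the original copy on disk
source_partitions = partition(source, num_partitions=4)

# Lineage: the ordered steps that produced the in-memory data.
lineage = [
    lambda chunk: [x * 10 for x in chunk],
    lambda chunk: [x for x in chunk if x > 20],
]

in_memory = [apply_lineage(p, lineage) for p in source_partitions]

# Simulate a node failure: the in-memory copy of partition 2 is lost.
in_memory[2] = None

# Rebuild only the lost partition from its source chunk and the lineage.
in_memory[2] = apply_lineage(source_partitions[2], lineage)

print(in_memory)
```

Because the lineage describes how each partition was derived, the other three partitions never have to be touched during the rebuild.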
There are two kinds of operations on an RDD: transformations and actions. A transformation creates a new RDD (e.g. filter, groupBy, map), while an action (e.g. collect, count) instructs Spark to perform the computation and send the results back to the driver (master node).
Illustration of how an RDD is partitioned (https://medium.com/@lavishj77/spark-fundamentals-part-2-a2d1a78eff73)
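The difference between transformations and actions can be shown with a toy class. `MiniRDD` below is a hypothetical, single-machine stand-in written for this post, not the Spark API, although its method names mirror the real RDD methods `map`, `filter`, `collect` and `count`. Transformations only record a step and return a new object; actions run the recorded steps on every partition and hand the result back to the caller, which plays the role of the driver.

```python
class MiniRDD:
    """A toy stand-in for an RDD (an illustration, not the Spark API)."""

    def __init__(self, partitions, steps=()):
        self._partitions = partitions  # list of lists, one per "node"
        self._steps = tuple(steps)     # pending transformations

    # Transformations: return a new MiniRDD and compute nothing.
    def map(self, fn):
        return MiniRDD(self._partitions, self._steps + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._partitions, self._steps + (("filter", fn),))

    def _compute(self, part):
        """Run all pending steps on a single partition."""
        for kind, fn in self._steps:
            if kind == "map":
                part = [fn(x) for x in part]
            else:
                part = [x for x in part if fn(x)]
        return part

    # Actions: run the pending steps and return plain Python values.
    def collect(self):
        return [x for part in self._partitions for x in self._compute(part)]

    def count(self):
        return sum(len(self._compute(part)) for part in self._partitions)


numbers = MiniRDD([[1, 2, 3], [4, 5, 6]])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(evens_squared.collect())
print(numbers.count())
```

Note that `numbers` is unchanged after the transformations, which is the same immutability that the PySpark data frame offers.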