Someone at the office asked me for good material to learn Spark in depth. In the process, I put together most of the resources included in this post. Most are based on PySpark.
The Prerequisites: Hadoop and MapReduce
- Why?: Spark is the successor to MapReduce and essentially uses a combination of multiple map-reduce operations internally. It's thus important to learn about MapReduce to get a good grasp of Spark's internals. Hadoop and HDFS are still very much in use and remain popular with Spark users.
- Understanding Hadoop and HDFS (article, video)
- Understanding MapReduce
- Exercises: It's best to write up a few programs, from simple to complex, to get a good idea of how MapReduce works. These can work outside of a cluster environment, but it'd be great to practice in a multi-node cluster environment, because that is where you face a number of issues.
Environment: Python 2.7 with MRJob (install via pip)
Datasets: UCI ML Repository datasets (I used Iris while learning)
- K Nearest Neighbours classifier
- Random Forest classifier (slightly advanced - feel free to use scikit-learn for the Decision Trees)
- Related Advanced Topics
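MRJob programs follow a mapper/reducer structure; the flow the framework automates can be sketched in plain Python (no framework needed) to make the map, shuffle, and reduce phases concrete. The word-count example below is illustrative only:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # "Map" phase: emit a (word, 1) pair for every word in the line
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    # Group values by key -- the framework does this between map and reduce
    pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reducer(key, values):
    # "Reduce" phase: sum the counts for each word
    yield key, sum(values)

lines = ["spark builds on mapreduce", "mapreduce maps then reduces"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(kv for key, vals in shuffle(mapped) for kv in reducer(key, vals))
print(counts["mapreduce"])  # 2
```

The k-NN and Random Forest exercises follow the same skeleton, with distance computations or tree votes emitted in the map phase and aggregated in the reduce phase.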
The Basics of Apache Spark
- Overview: Spark is essentially a beast of a system hidden under a lot of abstraction. Before approaching the internals, it's a good idea to experiment with and understand the following: RDDs, DataFrames, Transformations, Actions, Spark I/O, UDFs, Partitioning, Caching
- Books and Literature
- Environment: a Miniconda virtualenv with PySpark and Jupyter notebooks. It's a good idea to learn in an interactive notebook, as there will be a lot of trial and error.
- Read the original RDD research paper
- Week 1 of this Coursera course (link)
- Basics of RDDs, DataFrames and Datasets (article)
- Transformations vs. Actions (blog post)
- IPython notebooks on Spark basics (link)
- Caching (article and videos from Week 1 of the Coursera course above)
- Checkpointing (article 1, article 2): article 2 is slightly advanced
- PySpark ML Pipelines (official docs): the ML package, not MLlib.
Exercises: The best exercise would be to pick one or two problems on Kaggle, such as the Titanic dataset, load the data into a DataFrame, and run an ML pipeline on it. Be sure to apply cross-validation to evaluate your results. Pipelines in Spark are slightly different from scikit-learn's - be sure to use them. All of this can be done with Spark standalone in local mode, which is enough to learn all of the above basics.
- Extras: The thing with Spark is that it's evolving in leaps and bounds, so most hardcover books are generally outdated. There are two additional courses on Spark I've heard excellent things about:
Advanced Spark: Tuning and Internals
This section covers more advanced concepts of Spark: mainly tuning guides and Spark internals. I strongly recommend working with a cluster at this stage - either in-house or provided by Amazon/Google cloud services. I'm currently in this stage myself, so I will update the guide as I learn more.
- Why learn this? Tuning Spark has only gotten easier with improvements such as dynamic allocation and Project Tungsten. It is still important to understand, as peculiar use cases always show up, and clusters don't always behave optimally under dynamic allocation.
- Spark Official Tuning Guide(documentation): Concise with great content. Must Read.
- Cloudera Tuning Guide (article): The concepts in this article may be slightly dated, mainly because a lot is now handled automatically by dynamic allocation. It is still an excellent post for understanding the individual components of the cluster and how they can be tuned for optimal results.
- Spark Internals in depth (GitHub docs): Head over to the contents tab. This material is advanced and should be approached only after finishing everything else.
- Big Data Cluster Computing in Production (book): As the name suggests, it's a book meant to help you understand Spark tuning and configuration for production-level applications. I'm currently reading it, and it's pretty good - at the start, at least.
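As an illustration of the kinds of knobs these guides discuss, a spark-submit invocation might look like the sketch below. The values and `my_job.py` are placeholders for illustration, not recommendations - the right settings depend entirely on your cluster and workload:

```shell
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=4 \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  my_job.py
```

The official tuning guide explains the serializer and memory settings; the Cloudera article covers how executor memory and cores interact with the rest of the cluster.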
Feel free to leave your comments or doubts below, or you can email me at email@example.com.