Apache Spark: A Learning Path



A colleague at the office asked me for good material to learn Spark in depth. In the process, I put together most of the material included in this post. Most resources are based on PySpark.


  • The Prerequisites: Hadoop and MapReduce
  • The Basics of Apache Spark
  • Advanced Spark: Tuning and Internals

    The Prerequisites: Hadoop and MapReduce

    1. Why?: Spark is the successor to MapReduce and essentially chains multiple map-reduce operations internally, so it's important to learn MapReduce to get a good grasp of Spark's internals. Hadoop and HDFS are also still very much alive and popular with Spark users.
    2. Understanding Hadoop and HDFS(article,video)
    3. Understanding MapReduce
      1. What is Map and Reduce in MapReduce?(SO link)
      2. MapReduce Explained in detail(article,video)
    4. Exercises: It's best to write a few programs, from simple to complex, to get a good idea of how MapReduce works. These can run outside of a cluster environment, but it'd be great to practice in a multi-node cluster environment, because that is where you face a number of issues.
      Environment: Python 2.7 with MRJob (install via pip)
      Datasets: I used UCI ML Repository datasets (such as Iris) when learning
      1. Word-Count
      2. K Nearest Neighbours classifier
      3. Random Forest classifier(Slightly advanced - feel free to use scikit-learn for Decision Trees)
    5. Related Advanced Topics
      1. The Small File Problem(article)
      2. YARN in depth(article)
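Before reaching for MRJob, it can help to see the map → shuffle → reduce model in isolation. The sketch below is a toy single-process simulation of the Word-Count exercise in plain Python - it is not MRJob's API, just an illustration of the three phases a framework runs for you:

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(key, values):
    # Reducer: sum the counts for each word
    return (key, sum(values))

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped))

counts = word_count(["the quick brown fox", "the lazy dog"])
print(counts["the"])  # 2
```

In a real MapReduce job, the mapper and reducer run on different machines and the shuffle moves data across the network - which is exactly why the multi-node exercises above surface issues that local runs never do.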

    The Basics of Apache Spark

    1. Overview: Spark is essentially a beast of a system hidden under a lot of abstraction. Before approaching the internals, it is a good idea to experiment with and understand the following: RDDs, DataFrames, Transformations, Actions, Spark I/O, UDFs, Partitioning, Caching
    2. Books and Literature
      1. Learning Spark: I don’t think it covers Spark 2.0 components well.
      2. Mastering Apache Spark 2: I use this book often for reference.
      3. Blogs/Documentation/Stack Overflow: These are generally most updated with the latest.
      4. Original RDD Research paper: Must Read!
    3. Environment: Miniconda virtualenv with PySpark and Jupyter notebooks. It's a good idea to learn in an interactive notebook, as there will be a lot of trial and error.
    4. Learning:
      1. Read the original RDD research paper
      2. Week 1 of this Coursera course(link)
      3. Basics of RDDs, Dataframes and Datasets(article)
      4. Transformations vs. Actions(blog post)
      5. IPython notebooks on Spark basics(link)
      6. Partitions(article1,article2)
      7. Caching(article and videos from Week 1 of above Coursera course)
      8. Checkpointing(article1,article2): Article 2 is slightly advanced
      9. PySpark ML Pipelines(official docs): ML, not MLlib.
    5. Exercises: The best exercise would be to pick one or two problems on Kaggle, such as the Titanic dataset, experiment with loading the data into a DataFrame, and run an ML pipeline on it. Be sure to apply cross-validation to evaluate your results. Pipelines in Spark are slightly different from scikit-learn's - be sure to use them. This can all be done with Spark standalone in local mode, which is enough to learn all of the above basics.

    6. Extras: The thing with Spark is that it's evolving in leaps and bounds, so most hardcover books quickly go out of date. There are two additional courses on Spark I've heard excellent things about:
      1. Big Data Analysis with Apache Spark from UC Berkeley(course link): Among the top 50 online courses according to Class Central. Uses PySpark.
      2. Big Data Analysis with Scala and Spark from EPFL(Scala powerhouse)(course link): Needs a good understanding of Scala. The concepts are covered very well.

    Advanced Spark: Tuning and Internals

    This section will cover more advanced concepts of Spark: mainly tuning guides and Spark internals. It is strongly recommended to work with a cluster at this stage, either in-house or provided by Amazon/Google cloud services. I'm currently at this stage myself, so I will update the guide as I learn more.

    1. Why learn this? Tuning Spark has only gotten easier with improvements such as dynamic allocation and Project Tungsten. It is still important to understand, as peculiar use cases always show up, and clusters don't always behave optimally under dynamic allocation.
    2. Spark Official Tuning Guide(documentation): Concise with great content. Must Read.
    3. Cloudera Tuning Guide(article): The concepts from this article might be slightly dated, mainly because a lot is now handled automatically by dynamic allocation. It is still an excellent post for understanding the individual components of the cluster and how they can be tuned for optimal results.
    4. Spark Internals in depth(github docs): Head over to the contents tab. This material is advanced and should be approached only after finishing everything else.
    5. Big Data Cluster Computing in Production(book): As the name suggests, it's a book meant to help you understand Spark tuning and configuration for production-level applications. I'm currently reading it, and it's pretty good, at the start at least.
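To make the tuning knobs concrete, here is a hypothetical spark-submit invocation of the kind the guides above discuss (the job name and numbers are illustrative, not recommendations). Executor memory and cores set per-executor resources; dynamic allocation on YARN additionally requires the external shuffle service; and the shuffle partition count is the first thing to revisit for large joins and aggregations:

```shell
# Hypothetical submission showing the main tuning knobs.
# All values are illustrative - derive yours from your cluster's capacity.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4g \
  --executor-cores 4 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.shuffle.partitions=200 \
  my_job.py
```

With dynamic allocation enabled, --num-executors becomes a starting point rather than a fixed size, which is exactly the behavior the Cloudera guide predates.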

    Feel free to leave your comments or doubts below, or you can email me at saatvikshah1994@gmail.com.
