Spark
Broadcast Joins
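A minimal broadcast-join sketch in PySpark; the orders/countries DataFrames are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

# Hypothetical data: a large fact-like table and a small dimension-like one.
orders = spark.createDataFrame(
    [(1, "US", 100.0), (2, "DE", 55.0), (3, "US", 20.0)],
    ["order_id", "country_code", "amount"],
)
countries = spark.createDataFrame(
    [("US", "United States"), ("DE", "Germany")],
    ["country_code", "country_name"],
)

# broadcast() hints Spark to ship the small table to every executor,
# turning a shuffle join into a broadcast hash join.
joined = orders.join(broadcast(countries), on="country_code", how="left")
joined.explain()  # the plan should show BroadcastHashJoin
joined.show()
```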
Local Setup
- How to Install PySpark on Windows
- Tutorial: Running PySpark inside Docker containers
- How to create a Docker Container with Pyspark ready to work with Elyra
Working with JSON
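A small sketch of parsing JSON with an explicit schema, which avoids the extra pass that schema inference needs; the column names and the commented file path are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("json-demo").getOrCreate()

# An explicit schema avoids the extra pass Spark needs to infer one.
schema = StructType([
    StructField("user", StringType()),
    StructField("age", IntegerType()),
])

# Parsing a column of raw JSON strings.
df = spark.createDataFrame([('{"user": "ana", "age": 34}',)], ["raw"])
parsed = df.withColumn("data", from_json(col("raw"), schema)).select("data.*")
parsed.show()

# Reading JSON files works the same way (the path is hypothetical):
# spark.read.schema(schema).json("/data/events/*.json")
```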
Performance Optimization
- Performance Optimization in Apache Spark
- Part 3: Cost Efficient Executor Configuration for Apache Spark
- Spark Performance Tuning - Decide Number of executors
- Garbage Collection Tuning Concepts in Spark
- Spark Performance Tuning: Skewness Part 1
- Spark Performance Tuning: Skewness Part 2
- Spark Tuning, Optimization, and Performance Techniques
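As a rough illustration of the kind of settings the articles above discuss, here is a hedged sketch of GC- and skew-related configuration; the values are placeholders, not recommendations:

```python
from pyspark.sql import SparkSession

# Illustrative tuning knobs only; the right values depend on workload and cluster.
spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    # G1GC is a common starting point for large executor heaps.
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    # Adaptive Query Execution (Spark 3.x) can split skewed shuffle partitions.
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # Fewer, larger shuffle partitions for small jobs; more for large ones.
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)
```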
Basic Concepts
- Apache Spark: RDDs, DataFrames, Datasets
- What are workers, executors, cores in Spark Standalone cluster?
- Understanding the working of Spark Driver and Executor
- Course on Spark Core
- Course on Spark Structured Streaming 3.0
Filter, Projection and Pushdown
- Apache Spark and Predicate Pushdown
- Projection and Filter Pushdown with Apache Spark DataFrames and Datasets
- Important Considerations when filtering in Spark with filter and where
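A short sketch of how projection and predicate pushdown appear in practice; the Parquet path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

# Hypothetical Parquet dataset.
df = spark.read.parquet("/tmp/events.parquet")

# Selecting only the needed columns (projection) and filtering early lets
# Spark push both down to the Parquet reader, so less data is scanned.
result = (
    df.select("event_type", "ts")            # projection pushdown
      .filter(col("event_type") == "click")  # predicate (filter) pushdown
)

# The physical plan shows PushedFilters / ReadSchema on the scan node.
result.explain(True)
```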
Data Skew
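One common mitigation for a skewed join key is salting; a sketch under assumed table and column names (the Parquet paths and customer_id are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, explode, floor, lit, rand

spark = SparkSession.builder.appName("salting-demo").getOrCreate()

SALT_BUCKETS = 8  # tune to the degree of skew

# Hypothetical skewed fact table and small dimension table.
facts = spark.read.parquet("/tmp/facts.parquet")      # skewed on "customer_id"
dims = spark.read.parquet("/tmp/customers.parquet")

# Add a random salt to the skewed side...
facts_salted = facts.withColumn("salt", floor(rand() * SALT_BUCKETS))

# ...and replicate each dimension row once per salt value.
dims_salted = dims.withColumn(
    "salt", explode(array(*[lit(i) for i in range(SALT_BUCKETS)]))
)

# Joining on (key, salt) spreads a hot key over SALT_BUCKETS partitions.
joined = facts_salted.join(dims_salted, on=["customer_id", "salt"]).drop("salt")
```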
Read CSV
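A minimal CSV-reading sketch with an explicit schema; the path and columns are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DateType, DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("read-csv-demo").getOrCreate()

# An explicit schema is usually faster and safer than inferSchema=True,
# which needs an extra pass over the file.
schema = StructType([
    StructField("id", StringType()),
    StructField("amount", DoubleType()),
    StructField("order_date", DateType()),
])

df = (
    spark.read
    .option("header", "true")
    .option("dateFormat", "yyyy-MM-dd")
    .schema(schema)
    .csv("/tmp/orders.csv")
)
df.printSchema()
```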
Spark Metastore
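A small sketch of persisting a table through the metastore, assuming Hive support is available; the database and table names are made up:

```python
from pyspark.sql import SparkSession

# With Hive support, tables written via saveAsTable are registered in the
# metastore and remain visible across sessions.
spark = (
    SparkSession.builder
    .appName("metastore-demo")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")

df = spark.range(10).withColumnRenamed("id", "value")
df.write.mode("overwrite").saveAsTable("demo_db.numbers")

spark.sql("SHOW TABLES IN demo_db").show()
spark.table("demo_db.numbers").show()
```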
Pandas on Spark
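A quick pandas-on-Spark sketch; pyspark.pandas ships with Spark 3.2+:

```python
import pyspark.pandas as ps

# pandas-on-Spark exposes a pandas-like API backed by Spark DataFrames.
psdf = ps.DataFrame({"city": ["NY", "NY", "SF"], "sales": [10, 20, 5]})

# Familiar pandas operations run distributed under the hood.
print(psdf.groupby("city")["sales"].sum())

# Conversion to and from native Spark DataFrames:
sdf = psdf.to_spark()
psdf2 = sdf.pandas_api()
```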
XGBoost on PySpark
- dmlc - eXtreme Gradient Boosting
- PySpark ML and XGBoost full integration tested on the Kaggle Titanic dataset
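A hedged sketch using the distributed estimator that ships with xgboost 1.7+ (xgboost.spark.SparkXGBClassifier); the toy data and parameter values are made up:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier  # requires xgboost >= 1.7

spark = SparkSession.builder.appName("xgb-pyspark-demo").getOrCreate()

# Hypothetical training data with a binary label.
df = spark.createDataFrame(
    [(0.0, 1.0, 2.0, 0), (1.0, 0.5, 3.0, 1), (0.2, 1.5, 0.1, 0)],
    ["f1", "f2", "f3", "label"],
)

# Assemble feature columns into a single vector column, as Spark ML expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df)

clf = SparkXGBClassifier(features_col="features", label_col="label", num_workers=2)
model = clf.fit(train)
model.transform(train).select("label", "prediction").show()
```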
XGBoost on Spark
Data Modeling
- Data Modeling in Apache Spark - Part 1 : Date Dimension
- Data Modeling in Apache Spark - Part 2 : Working With Multiple Dates
- Continuous Data Processing with Star Schema Data Warehouse using Apache Spark
- Processing a Slowly Changing Dimension Type 2 Using PySpark in AWS
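A sketch of generating a date dimension with sequence() and explode() (Spark 2.4+); the date range and derived columns are arbitrary:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format, dayofweek, month, year

spark = SparkSession.builder.appName("date-dimension-demo").getOrCreate()

# One row per calendar day, then the usual date-dimension attributes.
dates = spark.sql("""
    SELECT explode(sequence(to_date('2024-01-01'), to_date('2024-12-31'),
                            interval 1 day)) AS date
""")

date_dim = (
    dates
    .withColumn("date_key", date_format(col("date"), "yyyyMMdd").cast("int"))
    .withColumn("year", year("date"))
    .withColumn("month", month("date"))
    .withColumn("day_of_week", dayofweek("date"))
)
date_dim.show(5)
```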
Jupyter + PySpark Setup on AWS EMR
Accumulators
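A minimal accumulator sketch counting bad records; note that accumulator values are only reliable after an action has run:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
sc = spark.sparkContext

# Accumulators are write-only on executors; only the driver reads .value.
bad_records = sc.accumulator(0)

def parse(line):
    try:
        return int(line)
    except ValueError:
        bad_records.add(1)  # counted on the executor, summed on the driver
        return None

rdd = sc.parallelize(["1", "2", "oops", "4"])
parsed = rdd.map(parse).filter(lambda x: x is not None)

# The value is only meaningful after an action has executed.
print(parsed.collect())   # [1, 2, 4]
print(bad_records.value)  # 1
```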
Spark Issues
- Different Types of Issues in Spark (Rishabh Pandey on LinkedIn)
- Out of memory exceptions
- Missing data
- Data skewness
- Spark jobs that repeatedly fail
- FileAlreadyExistsException in Spark jobs
- Serialization issues
- inferSchema issues
- Creating many small files
- "Too Large Frame" errors
- Errors when the total size of results exceeds the driver's max result size (spark.driver.maxResultSize)
- Spark shell command failures
- Join/shuffle problems
- Spark jobs that fail because of compilation failures
- Reading encoded files
- Executor misconfiguration
- Broadcasting large data
- Results that exceed driver memory
- Too-small and too-large partitions
- Optimizing long-running jobs
- coalesce() creating uneven partitions
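A hedged sketch of mitigations for a few of the issues above (driver result size, oversized broadcasts, small or uneven output files); the paths and values are placeholders:

```python
from pyspark.sql import SparkSession

# Placeholder values; each setting addresses one of the issue types above.
spark = (
    SparkSession.builder
    .appName("issue-mitigation-sketch")
    # Cap how much a collect()/take() may return to the driver.
    .config("spark.driver.maxResultSize", "2g")
    # Bound the size below which Spark auto-broadcasts a join side.
    .config("spark.sql.autoBroadcastJoinThreshold", "52428800")  # ~50 MB
    .getOrCreate()
)

df = spark.read.parquet("/tmp/events.parquet")  # hypothetical input

# Control the number and balance of output files with repartition() rather
# than coalesce() when evenly sized partitions matter.
df.repartition(64).write.mode("overwrite").parquet("/tmp/events_out.parquet")

# Prefer aggregating on the cluster over collecting everything to the driver.
df.groupBy("event_type").count().show()
```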
Some challenges occur at the job level; these are shared right across the data team. They include:
- How many executors should each job use?
- How much memory should I allocate for each job?
- How do I find and eliminate data skew?
- How do I make my pipelines work better?
- How do I know if a specific job is optimized?
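A back-of-the-envelope executor-sizing calculation for a hypothetical cluster; the node counts and the rules of thumb are assumptions, not recommendations:

```python
# Hypothetical cluster: 10 worker nodes, each with 16 cores and 64 GB of RAM.
nodes, cores_per_node, mem_per_node_gb = 10, 16, 64

# Reserve roughly 1 core and 1 GB per node for the OS and node daemons.
usable_cores = cores_per_node - 1        # 15
usable_mem_gb = mem_per_node_gb - 1      # 63

# A common rule of thumb is ~5 cores per executor for good I/O throughput.
cores_per_executor = 5
executors_per_node = usable_cores // cores_per_executor   # 3

# Leave one executor slot for the application master / driver on YARN.
total_executors = nodes * executors_per_node - 1           # 29

# Split node memory across its executors, keeping ~10% for memory overhead.
mem_per_executor_gb = int(usable_mem_gb / executors_per_node * 0.9)  # 18

print(f"--num-executors {total_executors} "
      f"--executor-cores {cores_per_executor} "
      f"--executor-memory {mem_per_executor_gb}g")
```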
Other challenges come up at the cluster level, or even at the stack level, as you decide what jobs to run on what clusters. These problems tend to be the remit of operations people and data engineers. They include:
- How do I size my nodes, and match them to the right servers/instance types?
- How do I see what’s going on across the Spark stack and apps?
- Is my data partitioned correctly for my SQL queries?
- When do I take advantage of auto-scaling?
- How do I get insights into jobs that have problems?