Spark
Broadcast Joins
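A minimal broadcast-join sketch in PySpark; the orders/countries DataFrames are made up for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

# Hypothetical data: a large fact-like table and a small dimension-like one.
orders = spark.createDataFrame(
    [(1, "US", 100.0), (2, "DE", 55.0), (3, "US", 20.0)],
    ["order_id", "country_code", "amount"],
)
countries = spark.createDataFrame(
    [("US", "United States"), ("DE", "Germany")],
    ["country_code", "country_name"],
)

# broadcast() hints Spark to ship the small table to every executor,
# turning a shuffle join into a broadcast hash join.
joined = orders.join(broadcast(countries), on="country_code", how="left")
joined.explain()  # the plan should show BroadcastHashJoin
joined.show()
```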
Local Setup
- How to Install PySpark on Windows
- Tutorial: Running PySpark inside Docker containers
- How to create a Docker Container with Pyspark ready to work with Elyra
Working with JSON
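A small sketch of parsing JSON with an explicit schema, which avoids the extra pass that schema inference needs; the column names and the commented file path are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("json-demo").getOrCreate()

# An explicit schema avoids the extra pass Spark needs to infer one.
schema = StructType([
    StructField("user", StringType()),
    StructField("age", IntegerType()),
])

# Parsing a column of raw JSON strings.
df = spark.createDataFrame([('{"user": "ana", "age": 34}',)], ["raw"])
parsed = df.withColumn("data", from_json(col("raw"), schema)).select("data.*")
parsed.show()

# Reading JSON files works the same way (the path is hypothetical):
# spark.read.schema(schema).json("/data/events/*.json")
```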
Performance Optimization
- Performance Optimization in Apache Spark
- Part 3: Cost Efficient Executor Configuration for Apache Spark
- Spark Performance Tuning - Decide Number of executors
- Garbage Collection Tuning Concepts in Spark
- Spark Performance Tuning: Skewness Part 1
- Spark Performance Tuning: Skewness Part 2
- Spark Tuning, Optimization, and Performance Techniques
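As a rough illustration of the kind of settings the articles above discuss, here is a hedged sketch of GC- and skew-related configuration; the values are placeholders, not recommendations:

```python
from pyspark.sql import SparkSession

# Illustrative tuning knobs only; the right values depend on workload and cluster.
spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    # G1GC is a common starting point for large executor heaps.
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    # Adaptive Query Execution (Spark 3.x) can split skewed shuffle partitions.
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # Fewer, larger shuffle partitions for small jobs; more for large ones.
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)
```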
Basic Concepts
- Apache Spark: RDDs, DataFrames, Datasets
- What are workers, executors, cores in Spark Standalone cluster?
- Understanding the working of Spark Driver and Executor
- Course on Spark Core
- Course on Spark Structured Streaming 3.0
Filter, Projection and Pushdown
- Apache Spark and Predicate Pushdown
- Projection and Filter Pushdown with Apache Spark DataFrames and Datasets
- Important Considerations when filtering in Spark with filter and where
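A short sketch of how projection and predicate pushdown appear in practice; the Parquet path and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

# Hypothetical Parquet dataset.
df = spark.read.parquet("/tmp/events.parquet")

# Selecting only the needed columns (projection) and filtering early lets
# Spark push both down to the Parquet reader, so less data is scanned.
result = (
    df.select("event_type", "ts")            # projection pushdown
      .filter(col("event_type") == "click")  # predicate (filter) pushdown
)

# The physical plan shows PushedFilters / ReadSchema on the scan node.
result.explain(True)
```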
Data Skew
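One common mitigation for a skewed join key is salting; a sketch under assumed table and column names (the Parquet paths and customer_id are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, explode, floor, lit, rand

spark = SparkSession.builder.appName("salting-demo").getOrCreate()

SALT_BUCKETS = 8  # tune to the degree of skew

# Hypothetical skewed fact table and small dimension table.
facts = spark.read.parquet("/tmp/facts.parquet")      # skewed on "customer_id"
dims = spark.read.parquet("/tmp/customers.parquet")

# Add a random salt to the skewed side...
facts_salted = facts.withColumn("salt", floor(rand() * SALT_BUCKETS))

# ...and replicate each dimension row once per salt value.
dims_salted = dims.withColumn(
    "salt", explode(array(*[lit(i) for i in range(SALT_BUCKETS)]))
)

# Joining on (key, salt) spreads a hot key over SALT_BUCKETS partitions.
joined = facts_salted.join(dims_salted, on=["customer_id", "salt"]).drop("salt")
```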
Read CSV
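A minimal CSV-reading sketch with an explicit schema; the path and columns are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import DateType, DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("read-csv-demo").getOrCreate()

# An explicit schema is usually faster and safer than inferSchema=True,
# which needs an extra pass over the file.
schema = StructType([
    StructField("id", StringType()),
    StructField("amount", DoubleType()),
    StructField("order_date", DateType()),
])

df = (
    spark.read
    .option("header", "true")
    .option("dateFormat", "yyyy-MM-dd")
    .schema(schema)
    .csv("/tmp/orders.csv")
)
df.printSchema()
```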
Spark Metastore
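A small sketch of persisting a table through the metastore, assuming Hive support is available; the database and table names are made up:

```python
from pyspark.sql import SparkSession

# With Hive support, tables written via saveAsTable are registered in the
# metastore and remain visible across sessions.
spark = (
    SparkSession.builder
    .appName("metastore-demo")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")

df = spark.range(10).withColumnRenamed("id", "value")
df.write.mode("overwrite").saveAsTable("demo_db.numbers")

spark.sql("SHOW TABLES IN demo_db").show()
spark.table("demo_db.numbers").show()
```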
Pandas on Spark
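A quick pandas-on-Spark sketch; pyspark.pandas ships with Spark 3.2+:

```python
import pyspark.pandas as ps

# pandas-on-Spark exposes a pandas-like API backed by Spark DataFrames.
psdf = ps.DataFrame({"city": ["NY", "NY", "SF"], "sales": [10, 20, 5]})

# Familiar pandas operations run distributed under the hood.
print(psdf.groupby("city")["sales"].sum())

# Conversion to and from native Spark DataFrames:
sdf = psdf.to_spark()
psdf2 = sdf.pandas_api()
```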
XGBoost on PySpark
- dmlc - eXtreme Gradient Boosting
- PySpark ML and XGBoost full integration tested on the Kaggle Titanic dataset
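A hedged sketch using the distributed estimator that ships with xgboost 1.7+ (xgboost.spark.SparkXGBClassifier); the toy data and parameter values are made up:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBClassifier  # requires xgboost >= 1.7

spark = SparkSession.builder.appName("xgb-pyspark-demo").getOrCreate()

# Hypothetical training data with a binary label.
df = spark.createDataFrame(
    [(0.0, 1.0, 2.0, 0), (1.0, 0.5, 3.0, 1), (0.2, 1.5, 0.1, 0)],
    ["f1", "f2", "f3", "label"],
)

# Assemble feature columns into a single vector column, as Spark ML expects.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train = assembler.transform(df)

clf = SparkXGBClassifier(features_col="features", label_col="label", num_workers=2)
model = clf.fit(train)
model.transform(train).select("label", "prediction").show()
```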
XGBoost on Spark
Data Modeling
- Data Modeling in Apache Spark - Part 1 : Date Dimension
- Data Modeling in Apache Spark - Part 2 : Working With Multiple Dates
- Continuous Data Processing with Star Schema Data Warehouse using Apache Spark
- Processing a Slowly Changing Dimension Type 2 Using PySpark in AWS
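A sketch of generating a date dimension with sequence() and explode() (Spark 2.4+); the date range and derived columns are arbitrary:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format, dayofweek, month, year

spark = SparkSession.builder.appName("date-dimension-demo").getOrCreate()

# One row per calendar day, then the usual date-dimension attributes.
dates = spark.sql("""
    SELECT explode(sequence(to_date('2024-01-01'), to_date('2024-12-31'),
                            interval 1 day)) AS date
""")

date_dim = (
    dates
    .withColumn("date_key", date_format(col("date"), "yyyyMMdd").cast("int"))
    .withColumn("year", year("date"))
    .withColumn("month", month("date"))
    .withColumn("day_of_week", dayofweek("date"))
)
date_dim.show(5)
```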
Jupyter + PySpark Setup on AWS EMR
Accumulators
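A minimal accumulator sketch counting bad records; note that accumulator values are only reliable after an action has run:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
sc = spark.sparkContext

# Accumulators are write-only on executors; only the driver reads .value.
bad_records = sc.accumulator(0)

def parse(line):
    try:
        return int(line)
    except ValueError:
        bad_records.add(1)  # counted on the executor, summed on the driver
        return None

rdd = sc.parallelize(["1", "2", "oops", "4"])
parsed = rdd.map(parse).filter(lambda x: x is not None)

# The value is only meaningful after an action has executed.
print(parsed.collect())   # [1, 2, 4]
print(bad_records.value)  # 1
```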
Spark Issues
- Different Types of Issues in Spark (Rishabh Pandey on LinkedIn)
- Out of memory exceptions
- Missing data
- Data skewness
- Spark jobs that repeatedly fail
- FileAlreadyExistsException in Spark jobs
- Serialization issues
- inferSchema issues
- Creating many small files
- "Too Large Frame" errors
- Errors when the total size of results exceeds the driver's max result size (spark.driver.maxResultSize)
- Spark shell command failures
- Join/shuffle problems
- Spark jobs that fail because of compilation failures
- Reading encoded files
- Executor misconfiguration
- Broadcasting large data
- Results that exceed driver memory
- Too-small and too-large partitions
- Optimizing long-running jobs
- coalesce() creating uneven partitions
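A hedged sketch of mitigations for a few of the issues above (driver result size, oversized broadcasts, small or uneven output files); the paths and values are placeholders:

```python
from pyspark.sql import SparkSession

# Placeholder values; each setting addresses one of the issue types above.
spark = (
    SparkSession.builder
    .appName("issue-mitigation-sketch")
    # Cap how much a collect()/take() may return to the driver.
    .config("spark.driver.maxResultSize", "2g")
    # Bound the size below which Spark auto-broadcasts a join side.
    .config("spark.sql.autoBroadcastJoinThreshold", "52428800")  # ~50 MB
    .getOrCreate()
)

df = spark.read.parquet("/tmp/events.parquet")  # hypothetical input

# Control the number and balance of output files with repartition() rather
# than coalesce() when evenly sized partitions matter.
df.repartition(64).write.mode("overwrite").parquet("/tmp/events_out.parquet")

# Prefer aggregating on the cluster over collecting everything to the driver.
df.groupBy("event_type").count().show()
```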
Some challenges occur at the job level; these are shared right across the data team. They include:
- How many executors should each job use?
- How much memory should I allocate for each job?
- How do I find and eliminate data skew?
- How do I make my pipelines work better?
- How do I know if a specific job is optimized?
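A back-of-the-envelope executor-sizing calculation for a hypothetical cluster; the node counts and the rules of thumb are assumptions, not recommendations:

```python
# Hypothetical cluster: 10 worker nodes, each with 16 cores and 64 GB of RAM.
nodes, cores_per_node, mem_per_node_gb = 10, 16, 64

# Reserve roughly 1 core and 1 GB per node for the OS and node daemons.
usable_cores = cores_per_node - 1        # 15
usable_mem_gb = mem_per_node_gb - 1      # 63

# A common rule of thumb is ~5 cores per executor for good I/O throughput.
cores_per_executor = 5
executors_per_node = usable_cores // cores_per_executor   # 3

# Leave one executor slot for the application master / driver on YARN.
total_executors = nodes * executors_per_node - 1           # 29

# Split node memory across its executors, keeping ~10% for memory overhead.
mem_per_executor_gb = int(usable_mem_gb / executors_per_node * 0.9)  # 18

print(f"--num-executors {total_executors} "
      f"--executor-cores {cores_per_executor} "
      f"--executor-memory {mem_per_executor_gb}g")
```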
Other challenges come up at the cluster level, or even at the stack level, as you decide what jobs to run on what clusters. These problems tend to be the remit of operations people and data engineers. They include:
- How do I size my nodes, and match them to the right servers/instance types?
- How do I see what’s going on across the Spark stack and apps?
- Is my data partitioned correctly for my SQL queries?
- When do I take advantage of auto-scaling?
- How do I get insights into jobs that have problems?