Spark Submit — Job Launcher Script
Launch PySpark jobs on YARN with spark-submit, including cluster configuration, dependency packaging, and a post-submit status check.
#!/usr/bin/env bash
set -euo pipefail
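# Config
APP_NAME="daily-etl"
APP_FILE="etl_pipeline.py"
# Processing date: first argument, defaulting to yesterday (GNU date syntax; on macOS/BSD use `date -v-1d`)
DATE="${1:-$(date -d yesterday +%Y-%m-%d)}"
# Timestamped log helper
log() { echo "[$(date '+%H:%M:%S')] $*"; }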
log "Submitting Spark job for ${DATE}"
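# Notes on the flags below:
# - With dynamic allocation enabled, --num-executors sets the initial executor count;
#   the pool then scales between minExecutors and maxExecutors.
# - Dynamic allocation on YARN also requires the external shuffle service
#   (spark.shuffle.service.enabled=true) or, on Spark 3.0+, shuffle tracking.
# - In cluster mode spark-submit blocks until the application ends by default;
#   spark.yarn.submit.waitAppCompletion=false makes it return once the job is submitted.
spark-submit \
  --name "${APP_NAME}-${DATE}" \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 8g \
  --executor-cores 4 \
  --num-executors 10 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.coalescePartitions.enabled=true \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=5 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.parquet.compression.codec=zstd \
  --py-files deps.zip \
  --files app.conf \
  "${APP_FILE}" \
  --date "${DATE}" \
  --input s3://data-lake/raw/ \
  --output s3://data-lake/processed/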
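# Check job status (ACCEPTED covers applications still waiting for resources).
# awk exits 0 when nothing matches, and `|| true` keeps `set -euo pipefail`
# from aborting the script if the yarn CLI itself fails.
APP_ID=$(yarn application -list -appStates ACCEPTED,RUNNING 2>/dev/null \
  | awk -v name="${APP_NAME}-${DATE}" 'index($0, name) { print $1; exit }' || true)
if [[ -n "$APP_ID" ]]; then
  log "Job active: ${APP_ID}"
  log "UI: http://spark-history:18080/history/${APP_ID}"
else
  log "No active application found (check YARN for status)"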
fi
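The status check above looks at YARN only once. If a later step depends on the job's outcome, the script could poll until YARN reports a final state. The following is a minimal sketch, not part of the original script: it assumes `yarn application -status <appId>` prints a line of the form `Final-State : SUCCEEDED` (verify this against your Hadoop version), and the function names `parse_final_state` and `wait_for_app`, the 30-second interval, and the 120-poll limit are all illustrative choices.

```shell
# Read `yarn application -status` output on stdin and print the Final-State value.
# Assumes lines shaped like "<tab>Final-State : SUCCEEDED".
parse_final_state() {
  awk -F' : ' '/Final-State/ { gsub(/[[:space:]]/, "", $2); print $2; exit }'
}

# Poll YARN until the application reaches a terminal final state.
# Returns 0 on SUCCEEDED, 1 on FAILED or KILLED, 2 if max_polls is exhausted.
wait_for_app() {
  local app_id=$1
  local interval=${2:-30}     # seconds between polls (hypothetical default)
  local max_polls=${3:-120}   # give up after this many polls (hypothetical default)
  local polls=0
  local state
  while [ "$polls" -lt "$max_polls" ]; do
    state=$(yarn application -status "$app_id" 2>/dev/null | parse_final_state || true)
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED|KILLED) return 1 ;;
      *) polls=$((polls + 1)); sleep "$interval" ;;  # UNDEFINED or empty: not finished yet
    esac
  done
  return 2
}
```

A caller would typically guard the call, for example `if wait_for_app "$APP_ID"; then log "Job succeeded"; else log "Job failed or timed out"; fi`, so that a non-zero return does not trip `set -e`.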
Use Cases
- Launching PySpark batch jobs on YARN clusters
- Tuning Spark executor sizing, adaptive query execution, and dynamic allocation
- Automated daily data processing jobs
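When tuning the configuration, it can help to estimate what the script's settings ask of YARN at peak. The sketch below is a rough estimate under stated assumptions: it uses Spark's default memory overhead on YARN (10% of the heap, with a 384 MiB floor), one driver core (Spark's default), and ignores YARN's rounding of containers up to its allocation increment. `container_mib` is a hypothetical helper name.

```shell
# Approximate YARN container size in MiB for a given JVM heap size in MiB.
# Assumes the default overhead: max(384 MiB, 10% of heap), using integer arithmetic.
container_mib() {
  local heap_mib=$1
  local overhead=$(( heap_mib / 10 ))
  if [ "$overhead" -lt 384 ]; then overhead=384; fi
  echo $(( heap_mib + overhead ))
}

EXECUTOR_MIB=$(container_mib 8192)   # --executor-memory 8g
DRIVER_MIB=$(container_mib 4096)     # --driver-memory 4g
MAX_EXECUTORS=20                     # spark.dynamicAllocation.maxExecutors

# Peak request when dynamic allocation has scaled out fully.
PEAK_MIB=$(( EXECUTOR_MIB * MAX_EXECUTORS + DRIVER_MIB ))
PEAK_CORES=$(( 4 * MAX_EXECUTORS + 1 ))   # --executor-cores 4, plus one driver core

echo "Peak request: ${PEAK_MIB} MiB, ${PEAK_CORES} vcores"
# prints "Peak request: 184725 MiB, 81 vcores"
```

At roughly 180 GiB and 81 vcores, the job only reaches `maxExecutors` if the YARN queue has that much headroom; otherwise the extra executor requests simply stay pending.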