#data-engineering
19 snippets tagged with #data-engineering
Python ETL Pipeline Example
Complete extract-transform-load pipeline with error handling, logging, and incremental processing.
Best for: Automating data ingestion from CSV to warehouse
Read Large CSV in Chunks with Pandas
Process CSV files larger than RAM by reading in chunks — memory-efficient ETL pattern for data pipelines.
Best for: Processing multi-GB CSV files without running out of memory
Polars Lazy Query — Fast DataFrame Processing
Use Polars lazy evaluation for high-performance data transformations that outperform pandas.
Best for: High-performance data processing replacing pandas
PySpark DataFrame — Filter and Aggregate
Common PySpark DataFrame operations: filter, group by, window functions, and write to Parquet.
Best for: Large-scale data aggregation on distributed clusters
Airflow DAG with Python Operators
Create an Apache Airflow DAG with task dependencies, retries, and XCom data passing between tasks.
Best for: Orchestrating daily ETL pipelines
dbt Incremental Model Pattern
Build efficient dbt incremental models that process only new or changed data instead of full refreshes.
Best for: Efficient data warehouse builds processing only deltas
SQL Data Quality Checks and Assertions
Reusable SQL queries for data quality: null checks, uniqueness, referential integrity, and freshness.
Best for: Automated data quality gates in ETL pipelines
Snowflake MERGE with Slowly Changing Dim
Implement SCD Type 2 in Snowflake using MERGE to track historical changes in dimension tables.
Best for: Tracking full history of dimension changes
dbt Source Freshness and Testing
Configure dbt source freshness checks and schema tests to validate upstream data pipelines.
Best for: Ensuring upstream data sources are fresh
Bash ETL Pipeline Script
Build a complete ETL script in Bash with logging, error handling, notifications, and idempotent runs.
Best for: Automating daily data extract and load jobs
Cron Data Sync — Database to S3
Automated script to export database tables to compressed CSV and sync to S3 on a schedule.
Best for: Nightly database exports to cloud storage
Spark Submit — Job Launcher Script
Launch PySpark jobs with spark-submit including cluster configuration, dependencies, and monitoring.
Best for: Launching PySpark batch jobs on YARN clusters
dbt Run and Test — CI/CD Pipeline Script
Bash script for running dbt build with testing, documentation generation, and failure notifications.
Best for: Automating dbt builds in CI/CD pipelines
Kafka Consumer in Python — Stream Processing
Build a Kafka consumer in Python with offset management, error handling, and batch processing.
Best for: Real-time event processing from Kafka topics
Kafka Topic — Create and Manage with CLI
Create, describe, alter, and manage Kafka topics using the kafka-topics CLI with partitioning config.
Best for: Setting up Kafka topics for new data streams
DuckDB — Query Parquet Files with Python
Use DuckDB to query Parquet files and CSVs directly from Python without loading into memory first.
Best for: Ad-hoc analytics on Parquet files without Spark
PostgreSQL COPY — Fast CSV Import
Use PostgreSQL COPY command for high-speed bulk data loading from CSV files with error handling.
Best for: High-speed bulk data loading into PostgreSQL
Bash Pipeline Monitoring and Alerting
Monitor data pipeline health with row counts, runtime tracking, SLA checks, and Slack alerting.
Best for: Monitoring data pipeline health and freshness
Database Backup and Restore to S3
Automated PostgreSQL backup script with compression, S3 upload, retention policy, and restore commands.
Best for: Automated daily database backups to S3