#data-engineering

19 snippets tagged with #data-engineering

python · advanced

Python ETL Pipeline Example

Complete extract-transform-load pipeline with error handling, logging, and incremental processing.

Best for: Automating data ingestion from CSV to warehouse

#etl #pipeline

python · intermediate

Read Large CSV in Chunks with Pandas

Process CSV files larger than RAM by reading in chunks — memory-efficient ETL pattern for data pipelines.

Best for: Processing multi-GB CSV files without running out of memory

#pandas #csv
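
The chunked-reading idea can be sketched with the standard library alone. This is a stand-in for the pandas pattern (where `pandas.read_csv` takes a `chunksize` argument and returns an iterator of DataFrames), not the snippet itself; the `amount` column, the helper names, and the tiny in-memory file are all assumptions made for illustration.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator


def iter_chunks(rows: Iterable, chunk_size: int) -> Iterator[list]:
    """Yield lists of at most chunk_size items so only one chunk is held in memory."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def total_amount(csv_file, chunk_size: int = 2) -> float:
    """Sum the hypothetical 'amount' column one chunk at a time."""
    reader = csv.DictReader(csv_file)
    total = 0.0
    for chunk in iter_chunks(reader, chunk_size):
        total += sum(float(row["amount"]) for row in chunk)
    return total


# Tiny in-memory stand-in for a multi-GB file (hypothetical data).
sample = io.StringIO("id,amount\n1,10.5\n2,4.5\n3,5.0\n")
result = total_amount(sample)
```

Because each chunk is reduced to a running total before the next one is read, peak memory depends on `chunk_size` rather than on the file size.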

python · intermediate

Polars Lazy Query — Fast DataFrame Processing

Use Polars lazy evaluation for high-performance data transformations that often outperform pandas on large datasets.

Best for: High-performance data processing replacing pandas

#polars #dataframe

python · advanced

PySpark DataFrame — Filter and Aggregate

Common PySpark DataFrame operations: filter, group by, window functions, and write to Parquet.

Best for: Large-scale data aggregation on distributed clusters

#spark #pyspark

python · intermediate

Airflow DAG with Python Operators

Create an Apache Airflow DAG with task dependencies, retries, and XCom data passing between tasks.

Best for: Orchestrating daily ETL pipelines

#airflow #dag

sql · intermediate

dbt Incremental Model Pattern

Build efficient dbt incremental models that process only new or changed data instead of full refreshes.

Best for: Efficient data warehouse builds processing only deltas

#dbt #incremental
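
The "process only new or changed data" idea boils down to filtering the source by a high-water mark taken from the target. The sketch below imitates that logic with Python's built-in `sqlite3` rather than dbt itself (a dbt model would usually express the filter inside an `is_incremental()` block); the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE source_events (id INTEGER PRIMARY KEY, updated_at TEXT);
    CREATE TABLE target_events (id INTEGER PRIMARY KEY, updated_at TEXT);
    INSERT INTO source_events VALUES (1, '2024-01-01'), (2, '2024-01-02');
    """
)


def incremental_load(conn: sqlite3.Connection) -> int:
    """Copy only source rows newer than the target's current high-water mark."""
    rows = conn.execute(
        """
        SELECT id, updated_at
        FROM source_events
        WHERE updated_at > (
            SELECT COALESCE(MAX(updated_at), '') FROM target_events
        )
        """
    ).fetchall()
    # Upsert on the primary key so changed rows overwrite their old version.
    conn.executemany(
        "INSERT OR REPLACE INTO target_events (id, updated_at) VALUES (?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


first_run = incremental_load(conn)   # empty target: every source row is new
second_run = incremental_load(conn)  # nothing changed since the last run

# One new row and one changed row arrive in the source.
conn.execute("INSERT INTO source_events VALUES (3, '2024-01-03')")
conn.execute("UPDATE source_events SET updated_at = '2024-01-04' WHERE id = 1")
third_run = incremental_load(conn)
```

The strict `>` comparison skips rows that share the watermark's exact timestamp, so a real pipeline may prefer a small lookback window combined with the upsert.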

sql · intermediate

SQL Data Quality Checks and Assertions

Reusable SQL queries for data quality: null checks, uniqueness, referential integrity, and freshness.

Best for: Automated data quality gates in ETL pipelines

#sql #data-quality
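
As a rough illustration of the four check types named above, each check can be written as a query that returns the number of offending rows, with zero meaning "pass". The sketch runs them through Python's built-in `sqlite3`; the tables, the sample rows, and the freshness cutoff are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, email TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, loaded_at TEXT);
    INSERT INTO customers VALUES (1, 'a@example.com'), (2, NULL), (2, 'b@example.com');
    INSERT INTO orders VALUES (10, 1, '2024-01-02'), (11, 99, '2024-01-03');
    """
)

# Each query returns a single count of violations; 0 means the check passes.
CHECKS = {
    # Null check: required column must be populated.
    "null_emails": "SELECT COUNT(*) FROM customers WHERE email IS NULL",
    # Uniqueness: count keys that appear more than once.
    "duplicate_customer_ids": """
        SELECT COUNT(*) FROM (
            SELECT id FROM customers GROUP BY id HAVING COUNT(*) > 1
        )
    """,
    # Referential integrity: orders pointing at a missing customer.
    "orphan_orders": """
        SELECT COUNT(*) FROM orders o
        LEFT JOIN customers c ON c.id = o.customer_id
        WHERE c.id IS NULL
    """,
    # Freshness: 1 if the newest load is older than the assumed cutoff date.
    "stale_orders": """
        SELECT COUNT(*) FROM (SELECT MAX(loaded_at) AS latest FROM orders)
        WHERE latest < '2024-01-03'
    """,
}

results = {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}
failures = [name for name, count in results.items() if count > 0]
```

A pipeline gate could then abort the load whenever `failures` is non-empty.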

sql · advanced

Snowflake MERGE with Slowly Changing Dimensions

Implement SCD Type 2 in Snowflake using MERGE to track historical changes in dimension tables.

Best for: Tracking full history of dimension changes

#snowflake #merge
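
The Type 2 bookkeeping itself (expire the current row, then add a new version) is independent of Snowflake's MERGE syntax. The plain-Python sketch below shows only that logic, not the SQL statement; the `DimRow` fields and the `apply_scd2` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DimRow:
    customer_id: int
    city: str
    valid_from: str
    valid_to: Optional[str] = None  # None marks the current version


def apply_scd2(dim: list, customer_id: int, city: str, as_of: str) -> None:
    """Close the current version if the attribute changed, then append a new one."""
    current = next(
        (r for r in dim if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.city == city:
            return  # no change, so no new history row
        current.valid_to = as_of  # expire the old version
    dim.append(DimRow(customer_id, city, valid_from=as_of))


dim = []
apply_scd2(dim, 1, "Oslo", "2024-01-01")    # first sighting: opens a version
apply_scd2(dim, 1, "Oslo", "2024-02-01")    # unchanged: ignored
apply_scd2(dim, 1, "Bergen", "2024-03-01")  # change: closes Oslo, opens Bergen
```

After these calls the dimension holds two rows for customer 1: the expired Oslo version and the open Bergen version, which is the history a Type 2 dimension is meant to preserve.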

sql · beginner

dbt Source Freshness and Testing

Configure dbt source freshness checks and schema tests to validate upstream data pipelines.

Best for: Ensuring upstream data sources are fresh

#dbt #testing

bash · intermediate

Bash ETL Pipeline Script

Build a complete ETL script in Bash with logging, error handling, notifications, and idempotent runs.

Best for: Automating daily data extract and load jobs

#bash #etl

bash · intermediate

Cron Data Sync — Database to S3

Automated script to export database tables to compressed CSV and sync to S3 on a schedule.

Best for: Nightly database exports to cloud storage

#bash #cron

bash · advanced

Spark Submit — Job Launcher Script

Launch PySpark jobs with spark-submit including cluster configuration, dependencies, and monitoring.

Best for: Launching PySpark batch jobs on YARN clusters

#spark #pyspark

bash · beginner

dbt Run and Test — CI/CD Pipeline Script

Bash script for running dbt build with testing, documentation generation, and failure notifications.

Best for: Automating dbt builds in CI/CD pipelines

#dbt #bash

python · advanced

Kafka Consumer in Python — Stream Processing

Build a Kafka consumer in Python with offset management, error handling, and batch processing.

Best for: Real-time event processing from Kafka topics

#kafka #streaming

bash · beginner

Kafka Topic — Create and Manage with CLI

Create, describe, alter, and manage Kafka topics using the kafka-topics CLI with partitioning config.

Best for: Setting up Kafka topics for new data streams

#kafka #bash

python · beginner

DuckDB — Query Parquet Files with Python

Use DuckDB to query Parquet files and CSVs directly from Python without loading into memory first.

Best for: Ad-hoc analytics on Parquet files without Spark

#duckdb #parquet

sql · beginner

PostgreSQL COPY — Fast CSV Import

Use the PostgreSQL COPY command for high-speed bulk data loading from CSV files with error handling.

Best for: High-speed bulk data loading into PostgreSQL

#postgres #copy

bash · intermediate

Bash Pipeline Monitoring and Alerting

Monitor data pipeline health with row counts, runtime tracking, SLA checks, and Slack alerting.

Best for: Monitoring data pipeline health and freshness

#bash #monitoring

bash · beginner

Database Backup and Restore to S3

Automated PostgreSQL backup script with compression, S3 upload, retention policy, and restore commands.

Best for: Automated daily database backups to S3

#bash #backup