#data-engineering

19 snippets tagged with #data-engineering

python · advanced

Python ETL Pipeline Example

Complete extract-transform-load pipeline with error handling, logging, and incremental processing.

Best for: Automating data ingestion from CSV to warehouse

#etl #pipeline
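
As a rough illustration of the extract-transform-load shape described in this card, the following minimal sketch uses only the standard library. The `sales` table, its columns, and the idea of using the highest loaded `id` as the incremental watermark are assumptions made for the example, not details taken from the snippet itself.

```python
import csv
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")


def extract(lines):
    """Yield raw rows (dicts) from an iterable of CSV lines, e.g. an open file."""
    yield from csv.DictReader(lines)


def transform(rows, last_id):
    """Parse rows, skip ones that fail to parse, keep only ids above the watermark."""
    for row in rows:
        try:
            rec = (int(row["id"]), row["name"].strip(), float(row["amount"]))
        except (KeyError, ValueError, TypeError, AttributeError) as exc:
            log.warning("skipping bad row %r: %s", row, exc)
            continue
        if rec[0] > last_id:
            yield rec


def load(conn, records):
    """Insert records in one transaction and return how many were written."""
    records = list(records)
    with conn:  # commits on success, rolls back if the insert raises
        conn.executemany(
            "INSERT INTO sales (id, name, amount) VALUES (?, ?, ?)", records
        )
    return len(records)


def run(conn, lines):
    """Run one incremental pass: only rows newer than MAX(id) are loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    last_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM sales").fetchone()[0]
    loaded = load(conn, transform(extract(lines), last_id))
    log.info("loaded %d new rows (watermark was %d)", loaded, last_id)
    return loaded
```

Calling `run(conn, open("sales.csv", newline=""))` twice on the same file should load nothing the second time, because every `id` is already at or below the watermark.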

python · intermediate

Read Large CSV in Chunks with Pandas

Process CSV files larger than RAM by reading them in chunks, a memory-efficient ETL pattern for data pipelines.

Best for: Processing multi-GB CSV files without running out of memory

#pandas #csv
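
The chunking idea in this card can be sketched without pandas. With pandas, `pd.read_csv(path, chunksize=n)` returns an iterator of DataFrames that is consumed in the same loop shape; the version below swaps in the standard-library `csv` module so it runs anywhere. The `amount` column and the helper names are hypothetical.

```python
import csv
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size parsed rows from an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def total_amount(lines, chunk_size=10_000):
    """Sum one column chunk by chunk, so only one chunk is held in memory at a time."""
    total = 0.0
    for chunk in iter_chunks(lines, chunk_size):
        total += sum(float(row["amount"]) for row in chunk)
    return total
```

Passing an open file object as `lines` keeps the read lazy: the file is consumed only as each chunk is requested.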

python · intermediate

Polars Lazy Query — Fast DataFrame Processing

Use Polars lazy evaluation for high-performance data transformations that can outperform pandas on large datasets.

Best for: High-performance data processing replacing pandas

#polars #dataframe

python · advanced

PySpark DataFrame — Filter and Aggregate

Common PySpark DataFrame operations: filter, group by, window functions, and write to Parquet.

Best for: Large-scale data aggregation on distributed clusters

#spark #pyspark

python · intermediate

Airflow DAG with Python Operators

Create an Apache Airflow DAG with task dependencies, retries, and XCom data passing between tasks.

Best for: Orchestrating daily ETL pipelines

#airflow #dag

sql · intermediate

dbt Incremental Model Pattern

Build efficient dbt incremental models that process only new or changed data instead of full refreshes.

Best for: Efficient data warehouse builds processing only deltas

#dbt #incremental

sql · intermediate

SQL Data Quality Checks and Assertions

Reusable SQL queries for data quality: null checks, uniqueness, referential integrity, and freshness.

Best for: Automated data quality gates in ETL pipelines

#sql #data-quality
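
One way to turn checks like these into a pipeline gate is to write each check as a query that counts violating rows and fail when any count is non-zero. The sketch below covers null, uniqueness, and referential-integrity checks (freshness is left out) and runs them through Python's `sqlite3` module; the `orders` and `customers` tables and their columns are assumptions for illustration.

```python
import sqlite3

# Each check counts violating rows; a result of 0 means the check passes.
CHECKS = {
    "null_customer_id": "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL",
    "duplicate_order_id": """
        SELECT COUNT(*) FROM (
            SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1
        ) AS d
    """,
    "orphan_customer_id": """
        SELECT COUNT(*) FROM orders AS o
        LEFT JOIN customers AS c ON c.customer_id = o.customer_id
        WHERE o.customer_id IS NOT NULL AND c.customer_id IS NULL
    """,
}


def run_checks(conn):
    """Return {check_name: violation_count} for every check."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}


def assert_quality(conn):
    """Raise if any check finds violations, so a pipeline step can fail fast."""
    failures = {name: n for name, n in run_checks(conn).items() if n}
    if failures:
        raise ValueError(f"data quality checks failed: {failures}")
```

Keeping every check in the "count of bad rows" form means new checks can be added to `CHECKS` without changing the runner.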

sql · advanced

Snowflake MERGE with Slowly Changing Dimension

Implement SCD Type 2 in Snowflake using MERGE to track historical changes in dimension tables.

Best for: Tracking full history of dimension changes

#snowflake #merge
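
The row-versioning logic behind SCD Type 2 can be shown independently of Snowflake. The sketch below is plain Python, not a MERGE statement: when an incoming record differs from the current version, the current row is closed out and a new current row is appended. The field names (`key`, `attrs`, `valid_from`, `valid_to`, `is_current`) are hypothetical.

```python
from datetime import date


def apply_scd2(dimension, incoming, today):
    """Apply SCD Type 2 changes to an in-memory dimension.

    dimension: list of dicts with keys key, attrs, valid_from, valid_to, is_current
    incoming:  dict mapping key -> attrs from the latest source snapshot
    today:     date used to close old versions and open new ones
    """
    current = {row["key"]: row for row in dimension if row["is_current"]}
    for key, attrs in incoming.items():
        existing = current.get(key)
        if existing is not None and existing["attrs"] == attrs:
            continue  # unchanged: keep the current version as-is
        if existing is not None:
            # Changed: expire the old version instead of overwriting it.
            existing["valid_to"] = today
            existing["is_current"] = False
        dimension.append(
            {
                "key": key,
                "attrs": attrs,
                "valid_from": today,
                "valid_to": None,
                "is_current": True,
            }
        )
    return dimension
```

In a warehouse the same two actions (expire the old row, insert the new one) are what the MERGE-based approach has to express in SQL.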

sql · beginner

dbt Source Freshness and Testing

Configure dbt source freshness checks and schema tests to validate upstream data pipelines.

Best for: Ensuring upstream data sources are fresh

#dbt #testing

bash · intermediate

Bash ETL Pipeline Script

Build a complete ETL script in Bash with logging, error handling, notifications, and idempotent runs.

Best for: Automating daily data extract and load jobs

#bash #etl

bash · intermediate

Cron Data Sync — Database to S3

Automated script to export database tables to compressed CSV and sync to S3 on a schedule.

Best for: Nightly database exports to cloud storage

#bash #cron

bash · advanced

Spark Submit — Job Launcher Script

Launch PySpark jobs with spark-submit including cluster configuration, dependencies, and monitoring.

Best for: Launching PySpark batch jobs on YARN clusters

#spark #pyspark

bash · beginner

dbt Run and Test — CI/CD Pipeline Script

Bash script for running dbt build with testing, documentation generation, and failure notifications.

Best for: Automating dbt builds in CI/CD pipelines

#dbt #bash

python · advanced

Kafka Consumer in Python — Stream Processing

Build a Kafka consumer in Python with offset management, error handling, and batch processing.

Best for: Real-time event processing from Kafka topics

#kafka #streaming

bash · beginner

Kafka Topic — Create and Manage with CLI

Create, describe, alter, and manage Kafka topics using the kafka-topics CLI with partitioning config.

Best for: Setting up Kafka topics for new data streams

#kafka #bash

python · beginner

DuckDB — Query Parquet Files with Python

Use DuckDB to query Parquet files and CSVs directly from Python without loading into memory first.

Best for: Ad-hoc analytics on Parquet files without Spark

#duckdb #parquet

sql · beginner

PostgreSQL COPY — Fast CSV Import

Use PostgreSQL COPY command for high-speed bulk data loading from CSV files with error handling.

Best for: High-speed bulk data loading into PostgreSQL

#postgres #copy

bash · intermediate

Bash Pipeline Monitoring and Alerting

Monitor data pipeline health with row counts, runtime tracking, SLA checks, and Slack alerting.

Best for: Monitoring data pipeline health and freshness

#bash #monitoring
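
The row-count and runtime checks mentioned in this card reduce to comparing run statistics against thresholds. The snippet itself is a Bash script; the sketch below expresses the same idea in Python and returns alert messages rather than posting them to Slack. `RunStats`, `check_run`, and the threshold parameters are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    name: str
    row_count: int
    runtime_seconds: float


def check_run(stats, min_rows, max_runtime_seconds):
    """Return a list of alert messages; an empty list means the run is healthy."""
    alerts = []
    if stats.row_count < min_rows:
        alerts.append(
            f"{stats.name}: row count {stats.row_count} is below minimum {min_rows}"
        )
    if stats.runtime_seconds > max_runtime_seconds:
        alerts.append(
            f"{stats.name}: runtime {stats.runtime_seconds:.0f}s exceeds "
            f"SLA of {max_runtime_seconds:.0f}s"
        )
    return alerts
```

Returning messages instead of sending them keeps the check easy to test; the caller decides whether to log them, fail the job, or forward them to a chat webhook.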

bash · beginner

Database Backup and Restore to S3

Automated PostgreSQL backup script with compression, S3 upload, retention policy, and restore commands.

Best for: Automated daily database backups to S3

#bash #backup
