#etl
39 snippets tagged with #etl
SQL MERGE (Standard Upsert)
Use the SQL MERGE statement for atomic insert-or-update operations with matched/not-matched clauses.
Best for: Data warehouse loading
ETL Pipeline - Technique 39
Extract, transform, load (ETL).
Best for: database operations
Python ETL Pipeline Example
Complete extract-transform-load pipeline with error handling, logging, and incremental processing.
Best for: Automating data ingestion from CSV to warehouse
Python Batch Processing Script
Process large files in configurable batches with progress tracking, error handling, and resume support.
Best for: Processing large CSV files that don't fit in memory
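A minimal sketch of the batching idea, using only the standard library. The names `iter_batches` and `process_file`, the `amount` column, and the `skip_rows` resume offset are hypothetical and chosen for illustration; the snippet itself may handle progress tracking and resume differently.

```python
import csv
import io
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size rows without loading everything."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_file(handle, batch_size=2, skip_rows=0):
    """Process a CSV in batches and return (rows_processed, rows_failed).

    skip_rows is an assumed resume mechanism: a rerun passes the number of
    rows already handled so they are not processed twice.
    """
    reader = csv.DictReader(handle)
    rows = islice(reader, skip_rows, None)
    processed = failed = 0
    for batch in iter_batches(rows, batch_size):
        for row in batch:
            try:
                float(row["amount"])  # stand-in for the real per-row work
                processed += 1
            except ValueError:
                failed += 1  # count the bad row and keep going
    return processed, failed


sample = io.StringIO("id,amount\n1,10.5\n2,oops\n3,7\n4,1\n5,2\n")
result = process_file(sample)
```

Because only one batch is held in memory at a time, peak memory depends on `batch_size` rather than on file size.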
Database Sync Script in Python
Sync data between two databases with upsert logic, batch processing, and change detection.
Best for: Replicating data between databases
SQL Incremental Load Pattern
Incremental data load using watermark tracking to process only new and updated records efficiently.
Best for: Efficient warehouse loading without full reloads
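One way to picture watermark tracking is the sketch below, which runs against an in-memory SQLite database through Python's `sqlite3` module. The table names (`source_orders`, `target_orders`, `etl_watermark`) are hypothetical, `INSERT OR REPLACE` is SQLite-specific, and the comparison assumes ISO-8601 timestamps stored as text so that string order matches time order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE target_orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE etl_watermark (table_name TEXT PRIMARY KEY, last_loaded TEXT);
    INSERT INTO etl_watermark VALUES ('orders', '1970-01-01T00:00:00');
""")


def incremental_load(conn):
    """Copy only rows changed since the stored watermark, then advance it."""
    (watermark,) = conn.execute(
        "SELECT last_loaded FROM etl_watermark WHERE table_name = 'orders'"
    ).fetchone()
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM source_orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    if not rows:
        return 0
    with conn:  # one transaction: data and watermark move together
        conn.executemany(
            "INSERT OR REPLACE INTO target_orders VALUES (?, ?, ?)", rows
        )
        conn.execute(
            "UPDATE etl_watermark SET last_loaded = ? WHERE table_name = 'orders'",
            (max(r[2] for r in rows),),
        )
    return len(rows)


conn.executemany(
    "INSERT INTO source_orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T00:00:00"), (2, 20.0, "2024-01-02T00:00:00")],
)
first = incremental_load(conn)   # picks up both new rows
conn.execute(
    "UPDATE source_orders SET amount = 25.0, "
    "updated_at = '2024-01-03T00:00:00' WHERE id = 2"
)
second = incremental_load(conn)  # picks up only the updated row
```

Updating the watermark in the same transaction as the load means a failed run leaves the watermark untouched, so the next run retries the same window.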
SQL Data Deduplication Techniques
Remove duplicate records using ROW_NUMBER, DISTINCT ON, and self-join deduplication strategies.
Best for: Cleaning duplicate records in production databases
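As an illustration of the ROW_NUMBER strategy only, the sketch below keeps the newest row per email and deletes the rest. It runs on in-memory SQLite via Python's `sqlite3` module (window functions need SQLite 3.25 or later); the `customers` table and its columns are hypothetical. `DISTINCT ON` is PostgreSQL syntax and is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT);
    INSERT INTO customers VALUES
        (1, 'a@example.com', '2024-01-01'),
        (2, 'a@example.com', '2024-02-01'),
        (3, 'b@example.com', '2024-01-15');
""")

# Rank rows within each email, newest first, then delete everything
# except the row ranked 1 in its group.
conn.execute("""
    DELETE FROM customers
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY email
                       ORDER BY updated_at DESC, id DESC
                   ) AS rn
            FROM customers
        )
        WHERE rn > 1
    )
""")
remaining = conn.execute(
    "SELECT id, email FROM customers ORDER BY id"
).fetchall()
```

The `id DESC` tiebreaker makes the choice of survivor deterministic when two duplicates share the same `updated_at`.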
SQL Data Lineage Tracking
Track data lineage across ETL stages with metadata logging for debugging and audit trails.
Best for: Tracing data flow across pipeline stages
Read Large CSV in Chunks with Pandas
Process CSV files larger than RAM by reading them in chunks, a memory-efficient ETL pattern for data pipelines.
Best for: Processing multi-GB CSV files without running out of memory
Airflow DAG with Python Operators
Create an Apache Airflow DAG with task dependencies, retries, and XCom data passing between tasks.
Best for: Orchestrating daily ETL pipelines
Bash ETL Pipeline Script
Build a complete ETL script in Bash with logging, error handling, notifications, and idempotent runs.
Best for: Automating daily data extract and load jobs
Polars DataFrame Operations
High-performance DataFrame operations using Polars: filtering, groupby, joins, and lazy evaluation.
Best for: data transformation
Pandas Vectorised Operations vs Apply
Compare apply vs vectorised pandas operations for performance-critical column transformations.
Best for: feature engineering
Prefect ETL Flow with Tasks
Define a Prefect 2 flow with typed tasks, retries, and structured logging for ETL pipelines.
Best for: ETL orchestration
SQLite + Pandas Local Data Pipeline
Run a lightweight local ETL with SQLite and pandas: load CSV, transform, persist to SQLite.
Best for: local analytics
Multiprocessing Pool for ETL
Parallelise CPU-bound ETL transformations across multiple CPU cores using multiprocessing.Pool.
Best for: parallel file processing
Pydantic Models for ETL Validation
Parse and validate raw JSON records against Pydantic models before inserting into a database.
Best for: input validation
Flatten Nested JSON with pandas
Use pd.json_normalize to flatten deeply nested API responses into a flat DataFrame.
Best for: API response flattening
Structured Logging for Data Pipelines
Use Loguru to emit structured JSON logs with contextual fields from ETL pipeline stages.
Best for: pipeline observability
Pandas Method Chaining with .pipe()
Use the .pipe() method to create clean, readable pandas transformation chains.
Best for: clean ETL code
Read Files from S3 with fsspec
Access S3 files directly with fsspec and pandas without boto3 boilerplate.
Best for: cloud data access
Build a Data Lineage Graph with NetworkX
Track and visualise data lineage across ETL pipeline stages using a directed graph.
Best for: data governance
Pandas Explode List Column
Explode a column containing lists into separate rows, useful for normalising one-to-many relations.
Best for: array column expansion
Concat & Deduplicate DataFrames
Merge multiple DataFrames and remove duplicates by composite key for clean data consolidation.
Best for: data consolidation
Row Fingerprinting with hashlib
Generate deterministic hash fingerprints for each row to detect changes in incremental loads.
Best for: change data capture
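A rough sketch of the fingerprinting idea: serialise the tracked columns in a fixed order, hash the result, and compare hashes between loads. The helper names `row_fingerprint` and `changed_keys` are hypothetical, and SHA-256 over a JSON encoding is one reasonable choice rather than the snippet's confirmed method.

```python
import hashlib
import json


def row_fingerprint(row, columns):
    """Return a SHA-256 hex digest over the chosen columns in a fixed order."""
    # Encoding the values as a JSON list avoids delimiter collisions such as
    # ("ab", "c") versus ("a", "bc"); default=str covers dates and decimals.
    payload = json.dumps(
        [row.get(col) for col in columns], default=str, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_keys(old_rows, new_rows, key, columns):
    """Return keys of rows that are new or whose fingerprint differs."""
    old = {r[key]: row_fingerprint(r, columns) for r in old_rows}
    return [
        r[key] for r in new_rows
        if old.get(r[key]) != row_fingerprint(r, columns)
    ]


previous = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Bob", "city": "Paris"},
]
current = [
    {"id": 1, "name": "Ada", "city": "London"},   # unchanged
    {"id": 2, "name": "Bob", "city": "Berlin"},   # changed
    {"id": 3, "name": "Cy", "city": "Oslo"},      # new
]
to_load = changed_keys(previous, current, "id", ["name", "city"])
```

Storing the fingerprint alongside each loaded row lets the next run compare one short string per row instead of every column.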
Tenacity Retry for Pipeline Resilience
Add exponential backoff retries to flaky data pipeline steps using Tenacity.
Best for: resilient API calls
tqdm Progress Bars in Data Pipelines
Add progress bars to pandas operations, loops, and concurrent futures with tqdm.
Best for: ETL monitoring
Pandas .assign() for Immutable Chaining
Use DataFrame.assign() to add computed columns without mutating the original DataFrame.
Best for: immutable transforms
Async ETL Pipeline with asyncio
Run concurrent data fetches and transformations using asyncio.gather for high-throughput pipelines.
Best for: concurrent API ingestion
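The pattern can be sketched with the standard library alone. Here `fetch` is a hypothetical stand-in for a real network call, the source names are made up, and the semaphore limit is an assumed addition for capping concurrency.

```python
import asyncio


async def fetch(source):
    """Hypothetical stand-in for a network call; replace with a real client."""
    await asyncio.sleep(0.01)
    if source == "bad":
        raise ConnectionError(f"cannot reach {source}")
    return {"source": source, "rows": len(source)}


def transform(record):
    """Toy transformation applied to each successfully fetched record."""
    return {**record, "rows_doubled": record["rows"] * 2}


async def run_pipeline(sources, max_concurrency=5):
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(source):
        async with semaphore:  # cap the number of in-flight fetches
            return await fetch(source)

    # return_exceptions=True keeps one failed source from cancelling the rest;
    # results come back in the same order as the inputs.
    results = await asyncio.gather(
        *(guarded(s) for s in sources), return_exceptions=True
    )
    loaded = [transform(r) for r in results if not isinstance(r, Exception)]
    errors = [r for r in results if isinstance(r, Exception)]
    return loaded, errors


loaded, errors = asyncio.run(run_pipeline(["orders", "bad", "users"]))
```

Collecting failures separately, rather than letting the first exception propagate, leaves the retry or alerting decision to the caller.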
Read Multi-Sheet Excel Files
Load, merge, and process data from multiple Excel sheets using the pandas ExcelFile context manager.
Best for: Excel ETL
Polars Join Strategies
Perform inner, left, cross, and anti joins in Polars with optimal join strategies.
Best for: data enrichment
Polars Expressions API Patterns
Use the Polars expression API for complex column-level transformations without apply or loops.
Best for: column transformations
Pandera @check_input and @check_output
Decorate pipeline functions with Pandera schema validators to enforce input and output contracts.
Best for: contract testing
Pandas Conditional Join with merge + query
Perform range/conditional joins by merging on a common key and filtering with query expressions.
Best for: session attribution
Polars String Operations
Use the Polars .str namespace for fast, vectorised string cleaning and extraction.
Best for: data cleaning
Bulk Load CSV into PostgreSQL with COPY
Use psycopg2's copy_expert with PostgreSQL's COPY command for fast bulk CSV loads into a table.
Best for: high-speed bulk loads
Read NDJSON / JSON Lines Files
Efficiently read newline-delimited JSON (NDJSON) log files into a pandas DataFrame.
Best for: log file ingestion
Expand JSON Column into DataFrame Columns
Parse a JSON-string column and expand its keys into separate columns in one step.
Best for: JSON column expansion
dbt Python Model with pandas
Write a dbt Python model that runs on Databricks/Snowpark to transform DataFrames in the warehouse.
Best for: dbt Python models