# etl

Python ETL Pipeline Example

Complete extract-transform-load pipeline with error handling, logging, and incremental processing.

Best for: Automating data ingestion from CSV to warehouse

#etl #pipeline

Python Batch Processing Script

Process large files in configurable batches with progress tracking, error handling, and resume support.

Best for: Processing large CSV files that don't fit in memory

#batch-processing #python
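
A minimal sketch of batch processing with resume support, using only the standard library. The function names (`iter_batches`, `process_in_batches`), the `start_batch` checkpoint parameter, and the sample columns are assumptions made for illustration; persisting the checkpoint between runs is left to the caller.

```python
import csv
import io
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of up to batch_size rows from any iterator."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_in_batches(source, batch_size, handle_batch, start_batch=0):
    """Process a CSV stream batch by batch, skipping batches already done.

    Returns the index of the next batch to run, which a caller could store
    as a checkpoint and pass back as start_batch on the next run.
    """
    reader = csv.DictReader(source)
    next_batch = start_batch
    for index, batch in enumerate(iter_batches(reader, batch_size)):
        if index < start_batch:
            continue  # processed in an earlier run
        handle_batch(batch)
        next_batch = index + 1
    return next_batch


CSV_TEXT = "id,amount\n1,10\n2,20\n3,30\n4,40\n5,50\n"

# Full run: three batches of sizes 2, 2 and 1.
totals = []
checkpoint = process_in_batches(
    io.StringIO(CSV_TEXT),
    batch_size=2,
    handle_batch=lambda batch: totals.append(sum(int(r["amount"]) for r in batch)),
)

# Resumed run: pretend the first two batches were already handled.
resumed = []
process_in_batches(
    io.StringIO(CSV_TEXT),
    batch_size=2,
    handle_batch=lambda batch: resumed.append(sum(int(r["amount"]) for r in batch)),
    start_batch=2,
)
```

Only one batch is held in memory at a time, which is what makes the pattern suitable for files that do not fit in RAM.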

Database Sync Script in Python

Sync data between two databases with upsert logic, batch processing, and change detection.

Best for: Replicating data between databases

#database #sync

sql · intermediate

SQL Incremental Load Pattern

Incremental data load using watermark tracking to process only new and updated records efficiently.

Best for: Efficient warehouse loading without full reloads

#sql #incremental-load
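
One way to picture watermark tracking is the sketch below, which drives the SQL from Python's built-in `sqlite3` module so it can run anywhere. The table names (`source_orders`, `warehouse_orders`, `etl_watermark`) and the `updated_at` column are hypothetical; a real warehouse would use its own schema and SQL dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE warehouse_orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE etl_watermark (table_name TEXT PRIMARY KEY, last_loaded_at TEXT);
    INSERT INTO etl_watermark VALUES ('orders', '1970-01-01T00:00:00');
""")


def incremental_load(conn):
    """Copy only rows changed since the stored watermark, then advance it."""
    (watermark,) = conn.execute(
        "SELECT last_loaded_at FROM etl_watermark WHERE table_name = 'orders'"
    ).fetchone()
    changed = conn.execute(
        "SELECT id, amount, updated_at FROM source_orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    if changed:
        # INSERT OR REPLACE acts as a simple upsert keyed on the primary key.
        conn.executemany(
            "INSERT OR REPLACE INTO warehouse_orders VALUES (?, ?, ?)", changed
        )
        # The newest updated_at just loaded becomes the new watermark.
        conn.execute(
            "UPDATE etl_watermark SET last_loaded_at = ? WHERE table_name = 'orders'",
            (changed[-1][2],),
        )
    conn.commit()
    return len(changed)


conn.execute(
    "INSERT INTO source_orders VALUES "
    "(1, 10.0, '2024-01-01T09:00:00'), (2, 20.0, '2024-01-01T10:00:00')"
)
first_run = incremental_load(conn)  # loads both rows

conn.execute(
    "UPDATE source_orders SET amount = 25.0, updated_at = '2024-01-02T08:00:00' "
    "WHERE id = 2"
)
second_run = incremental_load(conn)  # loads only the updated row

loaded = conn.execute(
    "SELECT id, amount FROM warehouse_orders ORDER BY id"
).fetchall()
```

The sketch relies on ISO 8601 timestamps stored as text, which sort correctly as strings.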

sql · intermediate

SQL Data Deduplication Techniques

Remove duplicate records using ROW_NUMBER, DISTINCT ON, and self-join deduplication strategies.

Best for: Cleaning duplicate records in production databases

#sql #deduplication
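
The ROW_NUMBER strategy could look roughly like this, run here through Python's `sqlite3` module (window functions need SQLite 3.25 or newer). The `customers` table and its columns are made up for the example, and the rule "keep the most recently updated row per email" is an assumption about what counts as the surviving record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT);
    INSERT INTO customers (email, updated_at) VALUES
        ('a@example.com', '2024-01-01'),
        ('a@example.com', '2024-03-01'),
        ('b@example.com', '2024-02-01');
""")

# Rank rows within each email, newest first, and delete everything ranked > 1.
conn.execute("""
    DELETE FROM customers
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY email
                       ORDER BY updated_at DESC
                   ) AS rn
            FROM customers
        ) AS ranked
        WHERE rn > 1
    )
""")
conn.commit()

remaining = conn.execute(
    "SELECT email, updated_at FROM customers ORDER BY email"
).fetchall()
```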

sql · advanced

SQL Data Lineage Tracking

Track data lineage across ETL stages with metadata logging for debugging and audit trails.

Best for: Tracing data flow across pipeline stages

#lineage #metadata

Read Large CSV in Chunks with Pandas

Process CSV files larger than RAM by reading them in chunks, a memory-efficient ETL pattern for data pipelines.

Best for: Processing multi-GB CSV files without running out of memory

#pandas #csv
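
As a rough illustration, the chunked read can be combined with a running aggregate so that no more than one chunk is in memory at once. The `region` and `amount` columns and the helper name `total_by_region` are assumptions; an in-memory string stands in for a large file.

```python
import io

import pandas as pd


def total_by_region(csv_source, chunksize):
    """Sum `amount` per `region` while holding only one chunk in memory."""
    totals = {}
    # With chunksize set, read_csv returns an iterator of DataFrames.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        partial = chunk.groupby("region")["amount"].sum()
        for region, amount in partial.items():
            totals[region] = totals.get(region, 0) + amount
    return totals


sample = io.StringIO("region,amount\nnorth,10\nsouth,5\nnorth,7\nsouth,1\neast,3\n")
result = total_by_region(sample, chunksize=2)
```

This works for aggregates that can be merged across chunks (sums, counts, min/max); operations that need the whole dataset at once, such as a global sort, need a different approach.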

Airflow DAG with Python Operators

Create an Apache Airflow DAG with task dependencies, retries, and XCom data passing between tasks.

Best for: Orchestrating daily ETL pipelines

#airflow #dag

bash · intermediate

Bash ETL Pipeline Script

Build a complete ETL script in Bash with logging, error handling, notifications, and idempotent runs.

Best for: Automating daily data extract and load jobs

#bash #etl

Polars DataFrame Operations

High-performance DataFrame operations using Polars: filtering, groupby, joins, and lazy evaluation.

Best for: data transformation

#polars #dataframe

Pandas Vectorised Operations vs Apply

Compare apply vs vectorised pandas operations for performance-critical column transformations.

Best for: feature engineering

#pandas #vectorization

Prefect ETL Flow with Tasks

Define a Prefect 2 flow with typed tasks, retries, and structured logging for ETL pipelines.

Best for: ETL orchestration

#prefect #etl

SQLite + Pandas Local Data Pipeline

Run a lightweight local ETL with SQLite and pandas: load CSV, transform, persist to SQLite.

Best for: local analytics

#sqlite #pandas

Multiprocessing Pool for ETL

Parallelise CPU-bound ETL transformations across multiple CPU cores using multiprocessing.Pool.

Best for: parallel file processing

#multiprocessing #parallel
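
A small sketch of the idea, assuming each record can be transformed independently. `clean_record` is a stand-in for a genuinely CPU-bound transformation, and the record fields are hypothetical.

```python
from multiprocessing import Pool


def clean_record(record):
    """Stand-in for a CPU-bound transform: normalise one record."""
    return {"id": record["id"], "name": record["name"].strip().title()}


def transform_parallel(records, workers=2, chunksize=100):
    """Apply clean_record to every record across a pool of worker processes."""
    with Pool(processes=workers) as pool:
        # map preserves input order; chunksize reduces inter-process overhead.
        return pool.map(clean_record, records, chunksize=chunksize)


if __name__ == "__main__":
    raw = [
        {"id": 1, "name": "  ada lovelace "},
        {"id": 2, "name": "GRACE HOPPER"},
    ]
    print(transform_parallel(raw))
```

The worker function has to live at module level so it can be pickled, and the `if __name__ == "__main__":` guard keeps child processes from re-running the pool setup on platforms that spawn rather than fork.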

Pydantic Models for ETL Validation

Parse and validate raw JSON records against Pydantic models before inserting into a database.

Best for: input validation

#pydantic #validation

Flatten Nested JSON with pandas

Use pd.json_normalize to flatten deeply nested API responses into a flat DataFrame.

Best for: API response flattening

#pandas #json
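
For instance, with a made-up API payload (the `user` and `address` keys are illustrative only), `pd.json_normalize` turns each nested path into its own column:

```python
import pandas as pd

# Hypothetical nested API response: one dict per record.
records = [
    {"id": 1, "user": {"name": "Ada", "address": {"city": "London"}}},
    {"id": 2, "user": {"name": "Grace", "address": {"city": "New York"}}},
]

# sep controls how nested keys are joined into column names.
flat = pd.json_normalize(records, sep="_")
flat_columns = sorted(flat.columns)
```

Here the nested keys end up as `user_name` and `user_address_city`; with the default separator they would be `user.name` and `user.address.city`.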

Structured Logging for Data Pipelines

Use Loguru to emit structured JSON logs with contextual fields from ETL pipeline stages.

Best for: pipeline observability

#loguru #logging

Pandas Method Chaining with .pipe()

Use the .pipe() method to create clean, readable pandas transformation chains.

Best for: clean ETL code

#pandas #pipe
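
A brief sketch of such a chain, with two hypothetical step functions (`drop_missing_amounts`, `add_tax`) and invented column names:

```python
import pandas as pd


def drop_missing_amounts(df):
    """Remove rows where the amount is missing."""
    return df.dropna(subset=["amount"])


def add_tax(df, rate):
    """Return a copy with a tax-inclusive amount column added."""
    return df.assign(amount_with_tax=df["amount"] * (1 + rate))


raw = pd.DataFrame({"order_id": [1, 2, 3], "amount": [100.0, None, 50.0]})

# Each .pipe() call passes the DataFrame as the first argument of the step;
# extra keyword arguments are forwarded to the step function.
result = raw.pipe(drop_missing_amounts).pipe(add_tax, rate=0.25)
```

Each step is a plain function that takes and returns a DataFrame, so steps can be unit-tested individually and reordered without nesting calls.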

Read Files from S3 with fsspec

Access S3 files directly with fsspec and pandas without boto3 boilerplate.

Best for: cloud data access

#fsspec #s3

Build a Data Lineage Graph with NetworkX

Track and visualise data lineage across ETL pipeline stages using a directed graph.

Best for: data governance

#networkx #lineage

Pandas Explode List Column

Explode a column containing lists into separate rows, useful for normalising one-to-many relations.

Best for: array column expansion

#pandas #explode
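
A tiny example with invented order data shows the effect: one row per order becomes one row per item.

```python
import pandas as pd

# Hypothetical one-to-many data: each order holds a list of items.
orders = pd.DataFrame(
    {"order_id": [1, 2], "items": [["apple", "pear"], ["milk"]]}
)

# explode repeats the other columns for every element of the list;
# ignore_index=True renumbers the result instead of repeating the old index.
rows = orders.explode("items", ignore_index=True)
```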

Concat & Deduplicate DataFrames

Concatenate multiple DataFrames and drop duplicates on a composite key for clean data consolidation.

Best for: data consolidation

#pandas #deduplication
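
Sketched with two invented frames, where the composite key is assumed to be `customer_id` plus `order_date` and later frames are assumed to win on conflict:

```python
import pandas as pd

january = pd.DataFrame(
    {
        "customer_id": [1, 2],
        "order_date": ["2024-01-05", "2024-01-09"],
        "amount": [10, 20],
    }
)
corrections = pd.DataFrame(
    {
        "customer_id": [2, 3],
        "order_date": ["2024-01-09", "2024-01-11"],
        "amount": [25, 30],
    }
)

combined = (
    pd.concat([january, corrections], ignore_index=True)
    # keep="last" lets rows from later frames override earlier ones.
    .drop_duplicates(subset=["customer_id", "order_date"], keep="last")
    .reset_index(drop=True)
)
```

Switching to `keep="first"` would instead prefer the earliest frame in the list.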

Row Fingerprinting with hashlib

Generate deterministic hash fingerprints for each row to detect changes in incremental loads.

Best for: change data capture

#hashlib #fingerprint
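
One possible shape for this, assuming rows arrive as dictionaries and SHA-256 over a JSON serialisation is an acceptable fingerprint. The helper names `row_fingerprint` and `changed_keys` and the sample columns are hypothetical.

```python
import hashlib
import json


def row_fingerprint(row, columns):
    """Hash the selected column values in a fixed order.

    Serialising a list in column order keeps the hash independent of the
    dict's key order; default=str covers dates and other non-JSON types.
    """
    payload = json.dumps(
        [row.get(col) for col in columns], default=str, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_keys(previous, current, key, columns):
    """Return keys of rows that are new or whose fingerprint differs."""
    old = {row[key]: row_fingerprint(row, columns) for row in previous}
    return [
        row[key]
        for row in current
        if old.get(row[key]) != row_fingerprint(row, columns)
    ]


COLUMNS = ["id", "name", "city"]
previous = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Grace", "city": "New York"},
]
current = [
    {"city": "London", "name": "Ada", "id": 1},      # unchanged, keys reordered
    {"id": 2, "name": "Grace", "city": "Arlington"},  # updated
    {"id": 3, "name": "Alan", "city": "Wilmslow"},    # new
]
changed = changed_keys(previous, current, key="id", columns=COLUMNS)
```

In practice the previous fingerprints would be stored alongside the loaded rows rather than recomputed, so that only the incoming batch needs hashing.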

Tenacity Retry for Pipeline Resilience

Add exponential backoff retries to flaky data pipeline steps using Tenacity.

Best for: resilient API calls

#tenacity #retry

tqdm Progress Bars in Data Pipelines

Add progress bars to pandas operations, loops, and concurrent futures with tqdm.

Best for: ETL monitoring

#tqdm #progress

Pandas .assign() for Immutable Chaining

Use DataFrame.assign() to add computed columns without mutating the original DataFrame.

Best for: immutable transforms

#pandas #assign

Async ETL Pipeline with asyncio

Run concurrent data fetches and transformations using asyncio.gather for high-throughput pipelines.

Best for: concurrent API ingestion

#asyncio #async
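
A stripped-down sketch of the pattern, where `fetch_page` simulates a network call with `asyncio.sleep` instead of using a real HTTP client. The semaphore-based concurrency cap is an addition for illustration, not something the description requires.

```python
import asyncio


async def fetch_page(page):
    """Simulated API call; a real pipeline would await an HTTP client here."""
    await asyncio.sleep(0.01)
    return [{"page": page, "value": page * 10}]


async def run_pipeline(pages, max_concurrency=5):
    """Fetch all pages concurrently and flatten the results."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(page):
        # Cap the number of in-flight requests.
        async with semaphore:
            return await fetch_page(page)

    # gather returns results in the same order as the awaitables passed in.
    batches = await asyncio.gather(*(bounded_fetch(p) for p in pages))
    return [record for batch in batches for record in batch]


records = asyncio.run(run_pipeline([1, 2, 3]))
```

This helps when the work is I/O-bound; CPU-heavy transformations would still block the event loop and are better handed to a process pool.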

Read Multi-Sheet Excel Files

Load, merge, and process data from multiple Excel sheets using the pandas ExcelFile context manager.

Best for: Excel ETL

#pandas #excel

Polars Join Strategies

Perform inner, left, cross, and anti joins in Polars with optimal join strategies.

Best for: data enrichment

#polars #join

Polars Expressions API Patterns

Use the Polars expression API for complex column-level transformations without apply or loops.

Best for: column transformations

#polars #expressions

Pandera @check_input and @check_output

Decorate pipeline functions with Pandera schema validators to enforce input and output contracts.

Best for: contract testing

#pandera #validation

Pandas Conditional Join with merge + query

Perform range/conditional joins by merging on a common key and filtering with query expressions.

Best for: session attribution

#pandas #conditional-join
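
For session attribution, the approach might look like the following, with invented `events` and `sessions` frames: merge on `user_id`, then keep only the pairs where the event falls inside the session window.

```python
import pandas as pd

events = pd.DataFrame(
    {
        "user_id": [1, 1, 2],
        "event_time": pd.to_datetime(
            ["2024-01-01 10:05", "2024-01-01 12:30", "2024-01-01 09:15"]
        ),
    }
)
sessions = pd.DataFrame(
    {
        "user_id": [1, 1, 2],
        "session_id": ["s1", "s2", "s3"],
        "session_start": pd.to_datetime(
            ["2024-01-01 10:00", "2024-01-01 12:00", "2024-01-01 10:00"]
        ),
        "session_end": pd.to_datetime(
            ["2024-01-01 11:00", "2024-01-01 13:00", "2024-01-01 11:00"]
        ),
    }
)

attributed = (
    events.merge(sessions, on="user_id")  # every event x every session per user
    .query("session_start <= event_time <= session_end")  # keep in-window pairs
    .reset_index(drop=True)
)
```

The intermediate merge grows with events times sessions per key, so on large inputs it can be worth pre-filtering or considering `pd.merge_asof` where the condition is a nearest-match on one ordered column.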

Polars String Operations

Use the Polars .str namespace for fast, vectorised string cleaning and extraction.

Best for: data cleaning

#polars #strings

Bulk Load CSV into PostgreSQL with COPY

Use psycopg2's copy_expert to stream a CSV file into a PostgreSQL table with COPY, which is typically much faster than row-by-row INSERTs.

Best for: high-speed bulk loads

#psycopg2 #postgres

Read NDJSON / JSON Lines Files

Efficiently read newline-delimited JSON (NDJSON) log files into a pandas DataFrame.

Best for: log file ingestion

#pandas #ndjson
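
A short sketch using `pd.read_json` with `lines=True`; an in-memory buffer with made-up log records stands in for a log file path.

```python
import io

import pandas as pd

# Hypothetical NDJSON content: one JSON object per line.
ndjson = io.StringIO(
    '{"level": "INFO", "msg": "start", "duration_ms": 12}\n'
    '{"level": "ERROR", "msg": "failed", "duration_ms": 250}\n'
)

# lines=True tells pandas to parse each line as a separate JSON record.
logs = pd.read_json(ndjson, lines=True)
errors = logs[logs["level"] == "ERROR"]
```

For files too large to load at once, `read_json` also accepts `chunksize` together with `lines=True` and then yields DataFrames piece by piece.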

Expand JSON Column into DataFrame Columns

Parse a JSON-string column and expand its keys into separate columns in one step.

Best for: JSON column expansion

#pandas #json
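
One way this could be done, assuming a hypothetical `payload` column holding JSON strings and a default integer index on the frame:

```python
import json

import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2],
        "payload": ['{"plan": "pro", "seats": 5}', '{"plan": "free", "seats": 1}'],
    }
)

# Parse each JSON string into a dict, then let json_normalize build the columns.
expanded = pd.json_normalize(df["payload"].map(json.loads).tolist())

# Both frames share the default 0..n-1 index here, so a column-wise concat
# lines the rows up correctly.
result = pd.concat([df.drop(columns="payload"), expanded], axis=1)
```

If the original frame has a non-default index, setting `expanded.index = df.index` before the concat keeps the rows aligned.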