Read Large CSV in Chunks with Pandas
Process CSV files larger than RAM by reading them in chunks: a memory-efficient ETL pattern for data pipelines.
import pandas as pd
from typing import Iterator


def process_large_csv(
    filepath: str,
    chunk_size: int = 10_000,
) -> Iterator[pd.DataFrame]:
    """
    Stream a large CSV file in chunks.
    Each chunk is a DataFrame; process and discard it before loading the next.
    """
    reader = pd.read_csv(
        filepath,
        chunksize=chunk_size,
        # Requires pandas >= 2.0 with pyarrow installed; usually faster, less memory
        dtype_backend="pyarrow",
        low_memory=False,
    )
    for chunk in reader:
        # Drop rows missing required fields, then normalize emails
        chunk = chunk.dropna(subset=["id", "email"])
        chunk["email"] = chunk["email"].str.lower().str.strip()
        yield chunk


def main():
    total = 0
    for chunk in process_large_csv("users.csv"):
        total += len(chunk)
        # load_to_db(chunk)
        print(f"Processed {total:,} rows", end="\r")
    print(f"\nDone. Total: {total:,}")


if __name__ == "__main__":
    main()

Use Cases
- Processing multi-GB CSV files without running out of memory
- Streaming ETL pipelines with pandas
- Batch loading data into databases from large files
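For the database use case, the snippet leaves `load_to_db(chunk)` as a commented-out placeholder. One possible way to fill it in is sketched below, assuming a local SQLite database and a CSV with `id` and `email` columns. The function name `load_csv_to_sqlite`, the `users` table name, and the choice of SQLite are illustrative assumptions, not part of the original snippet.

```python
import sqlite3

import pandas as pd


def load_csv_to_sqlite(
    csv_path: str,
    db_path: str,
    table: str = "users",  # hypothetical table name
    chunk_size: int = 10_000,
) -> int:
    """Append cleaned CSV chunks to a SQLite table and return the rows written."""
    total = 0
    conn = sqlite3.connect(db_path)
    try:
        for chunk in pd.read_csv(csv_path, chunksize=chunk_size):
            # Same cleaning as the snippet: drop rows missing required fields
            chunk = chunk.dropna(subset=["id", "email"])
            if chunk.empty:
                continue  # nothing left to load from this chunk
            chunk = chunk.assign(email=chunk["email"].str.lower().str.strip())
            # if_exists="append" creates the table on the first write,
            # then adds rows on every later chunk
            chunk.to_sql(table, conn, if_exists="append", index=False)
            total += len(chunk)
        conn.commit()
    finally:
        conn.close()
    return total
```

Only one chunk is held in memory at a time, so peak memory depends on `chunk_size` rather than on the file size. For other databases, `to_sql` generally expects a SQLAlchemy connection instead of a `sqlite3` one.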