Data Engineering

Data pipelines, transformations, ETL, and big data processing snippets.

200 snippets

pythonbeginner

Pandas DataFrame Transformations

Common pandas DataFrame transformations including column operations, type casting, and string methods.

Best for: Cleaning raw data files for analysis

#pandas#dataframe
pythonbeginner

Pandas DataFrame Filtering Techniques

Filter DataFrames using boolean masks, query syntax, isin, between, and string matching methods.

Best for: Extracting subsets of data for reporting

#pandas#filtering
pythonintermediate

Pandas GroupBy Aggregation Examples

GroupBy operations with multiple aggregations, named aggregations, and transform for DataFrame analysis.

Best for: Sales reporting by region and time period

#pandas#groupby
pythonadvanced

Python ETL Pipeline Example

Complete extract-transform-load pipeline with error handling, logging, and incremental processing.

Best for: Automating data ingestion from CSV to warehouse

#etl#pipeline
pythonadvanced

Apache Airflow DAG Example

Airflow DAG with task dependencies, retries, SLA, and PythonOperator for daily data pipeline.

Best for: Orchestrating daily data pipelines

#airflow#dag
pythonadvanced

Spark SQL Query Example

PySpark DataFrame operations with SQL queries, window functions, and aggregations for big data.

Best for: Processing large-scale datasets with Spark

#spark#pyspark
pythonintermediate

Python Batch Processing Script

Process large files in configurable batches with progress tracking, error handling, and resume support.

Best for: Processing large CSV files that don't fit in memory

#batch-processing#python
pythonintermediate

Nested JSON Flattening in Python

Flatten deeply nested JSON structures into flat dictionaries suitable for DataFrames or CSV export.

Best for: Converting API responses to flat tables

#json#flattening
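
The flattening described in this card can be sketched with a short recursive function. This is a minimal illustration, not the snippet's actual code: the function name `flatten`, the dot separator, and the choice to index list items by position are all assumptions.

```python
from typing import Any, Dict


def flatten(record: Any, parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts and lists into a single-level dict.

    Hypothetical helper: keys are joined with `sep`, list items are keyed
    by their position. Empty dicts and lists produce no output keys.
    """
    items: Dict[str, Any] = {}
    if isinstance(record, dict):
        for key, value in record.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
            items.update(flatten(value, new_key, sep))
    elif isinstance(record, list):
        for index, value in enumerate(record):
            new_key = f"{parent_key}{sep}{index}" if parent_key else str(index)
            items.update(flatten(value, new_key, sep))
    else:
        # Leaf value: store it under the accumulated key path.
        items[parent_key] = record
    return items


flat = flatten({"id": 1, "user": {"name": "Ada", "tags": ["a", "b"]}})
```

Each resulting dict has only scalar values, so a list of them can be handed to `csv.DictWriter` or a DataFrame constructor.
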
pythonbeginner

Python CSV Processing Examples

Read, write, and transform CSV files using the csv module and pandas with encoding and dialect handling.

Best for: Reading and cleaning CSV data files

#csv#python
pythonintermediate

Data Validation with Pydantic

Validate and parse data records using Pydantic models with custom validators and error reporting.

Best for: Validating incoming data before warehouse loading

#validation#pydantic
pythonintermediate

Retry Logic for Data Pipelines

Configurable retry decorator with exponential backoff and jitter for resilient data pipeline tasks.

Best for: Resilient API calls in data pipelines

#retry#resilience
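
A retry decorator of the kind this card describes might look like the sketch below. The parameter names (`max_attempts`, `base_delay`, `max_delay`, `sleep`) are assumptions made for illustration, and the jitter strategy shown (a uniform draw between zero and the capped delay, often called "full jitter") is one common choice among several.

```python
import functools
import random
import time


def retry(max_attempts=3, base_delay=0.5, max_delay=30.0,
          exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function with exponential backoff and jitter.

    `sleep` is injectable so the delay can be stubbed out in tests.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # Out of attempts: surface the last error.
                    # Delay doubles each attempt, capped at max_delay.
                    cap = min(max_delay, base_delay * 2 ** (attempt - 1))
                    # Jitter spreads out retries from concurrent workers.
                    sleep(random.uniform(0, cap))
        return wrapper
    return decorator
```

Restricting `exceptions` to transient errors (for example `ConnectionError` or `TimeoutError`) avoids retrying failures that will never succeed, such as validation errors.
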
pythonadvanced

Database Sync Script in Python

Sync data between two databases with upsert logic, batch processing, and change detection.

Best for: Replicating data between databases

#database#sync
sqlintermediate

SQL Incremental Load Pattern

Incremental data load using watermark tracking to process only new and updated records efficiently.

Best for: Efficient warehouse loading without full reloads

#sql#incremental-load
sqlintermediate

SQL Data Deduplication Techniques

Remove duplicate records using ROW_NUMBER, DISTINCT ON, and self-join deduplication strategies.

Best for: Cleaning duplicate records in production databases

#sql#deduplication
pythonadvanced

Databricks Notebook Data Pipeline

Databricks notebook with Delta Lake reads, transformations, merge operations, and table optimization.

Best for: Medallion architecture data pipelines on Databricks

#databricks#delta-lake
pythonintermediate

Python Streaming Data Processing

Process streaming data with generators, windowed aggregation, and memory-efficient line-by-line reading.

Best for: Processing large event log files efficiently

#streaming#python
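
As a rough illustration of generator-based streaming with windowed aggregation, the sketch below counts events per fixed-size (tumbling) window. The `timestamp,event_name` line format, the function names, and the assumption that events arrive ordered by timestamp are all hypothetical choices for this example.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def read_events(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Lazily yield (timestamp, event_name) from 'timestamp,name' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines.
        timestamp, name = line.split(",", 1)
        yield int(timestamp), name


def tumbling_counts(events: Iterable[Tuple[int, str]],
                    window_seconds: int = 60) -> Iterator[Tuple[int, Counter]]:
    """Yield (window_start, counts) per window; assumes ordered timestamps."""
    current_start = None
    counts: Counter = Counter()
    for timestamp, name in events:
        window_start = timestamp - timestamp % window_seconds
        if current_start is not None and window_start != current_start:
            # Window closed: emit it and start a fresh counter.
            yield current_start, counts
            counts = Counter()
        current_start = window_start
        counts[name] += 1
    if current_start is not None:
        yield current_start, counts  # Emit the final, partially filled window.
```

Because a file object is itself an iterable of lines, passing an open file handle as `lines` keeps only one line and one window's counts in memory at a time.
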
sqladvanced

SQL Window Functions for Analytics

Advanced SQL window functions for running totals, rankings, moving averages, and gap analysis.

Best for: Building analytics dashboards with running totals

#sql#window-functions
sqlintermediate

SQL Schema Migration Pattern

Versioned schema migration scripts with forward and rollback support for database evolution.

Best for: Managing database schema changes across environments

#sql#migration
pythonintermediate

Pandas Merge and Join Examples

Combine DataFrames using merge, join, and concat with different join types and key handling.

Best for: Combining data from multiple sources

#pandas#merge
sqladvanced

dbt Model with Tests and Schema

A dbt SQL model with incremental materialization, schema tests, and source freshness checks.

Best for: Building analytics data models with dbt

#dbt#sql
pythonbeginner

Parquet File Read and Write in Python

Read and write Parquet files with pandas and PyArrow including partitioning and schema control.

Best for: Efficient columnar storage for analytics data

#parquet#pyarrow
sqladvanced

Change Data Capture Pattern in SQL

Implement change data capture with trigger-based auditing to track inserts, updates, and deletes.

Best for: Tracking all data changes for audit compliance

#cdc#audit
pythonintermediate

Python Data Profiling Script

Generate a data quality profile report with null counts, distributions, and anomaly detection.

Best for: Automated data quality reporting

#data-quality#profiling
pythonintermediate

Pandas Pivot and Unpivot Reshaping

Reshape DataFrames between wide and long formats using pivot, melt, and stack operations.

Best for: Reshaping data for reporting dashboards

#pandas#pivot
sqladvanced

Slowly Changing Dimension Type 2 in SQL

Implement SCD Type 2 to track historical changes in dimension tables with effective date ranges.

Best for: Tracking customer attribute changes over time

#scd#dimension
pythonintermediate

Data Quality Testing with Expectations

Define and run data quality expectations for automated validation in data pipelines.

Best for: Automated data quality gates in pipelines

#data-quality#testing
pythonintermediate

Pandas Time Series Analysis

Time series operations with resampling, rolling windows, date offsets, and period conversions.

Best for: Sales trend analysis with moving averages

#pandas#time-series
sqladvanced

SQL Data Lineage Tracking

Track data lineage across ETL stages with metadata logging for debugging and audit trails.

Best for: Tracing data flow across pipeline stages

#lineage#metadata
pythonbeginner

Pandas Null Handling Strategies

Comprehensive strategies for detecting, filling, and handling missing values in pandas DataFrames.

Best for: Cleaning datasets with missing values

#pandas#null
sqladvanced

SQL Window Functions for Analytics

Use window functions for running totals, rankings, moving averages, and gap detection in analytics.

Best for: Building cumulative revenue dashboards

#sql#window-functions
pythonintermediate

Read Large CSV in Chunks with Pandas

Process CSV files larger than RAM by reading in chunks — memory-efficient ETL pattern for data pipelines.

Best for: Processing multi-GB CSV files without running out of memory

#pandas#csv
pythonintermediate

Polars Lazy Query — Fast DataFrame Processing

Use Polars lazy evaluation for high-performance data transformations that can outperform pandas on many workloads.

Best for: High-performance data processing replacing pandas

#polars#dataframe
pythonadvanced

PySpark DataFrame — Filter and Aggregate

Common PySpark DataFrame operations: filter, group by, window functions, and write to Parquet.

Best for: Large-scale data aggregation on distributed clusters

#spark#pyspark
pythonintermediate

Airflow DAG with Python Operators

Create an Apache Airflow DAG with task dependencies, retries, and XCom data passing between tasks.

Best for: Orchestrating daily ETL pipelines

#airflow#dag
sqlintermediate

dbt Incremental Model Pattern

Build efficient dbt incremental models that process only new or changed data instead of full refreshes.

Best for: Efficient data warehouse builds processing only deltas

#dbt#incremental
sqlintermediate

SQL Data Quality Checks and Assertions

Reusable SQL queries for data quality: null checks, uniqueness, referential integrity, and freshness.

Best for: Automated data quality gates in ETL pipelines

#sql#data-quality
sqladvanced

Snowflake MERGE with Slowly Changing Dim

Implement SCD Type 2 in Snowflake using MERGE to track historical changes in dimension tables.

Best for: Tracking full history of dimension changes

#snowflake#merge
sqlbeginner

dbt Source Freshness and Testing

Configure dbt source freshness checks and schema tests to validate upstream data pipelines.

Best for: Ensuring upstream data sources are fresh

#dbt#testing
sqlintermediate

SQL Running Totals and Cumulative Metrics

Calculate running totals, cumulative counts, and percent-of-total using window functions and partitions.

Best for: Building cumulative revenue dashboards

#sql#window-functions
bashintermediate

Bash ETL Pipeline Script

Build a complete ETL script in Bash with logging, error handling, notifications, and idempotent runs.

Best for: Automating daily data extract and load jobs

#bash#etl
bashintermediate

Cron Data Sync — Database to S3

Automated script to export database tables to compressed CSV and sync to S3 on a schedule.

Best for: Nightly database exports to cloud storage

#bash#cron
bashadvanced

Spark Submit — Job Launcher Script

Launch PySpark jobs with spark-submit including cluster configuration, dependencies, and monitoring.

Best for: Launching PySpark batch jobs on YARN clusters

#spark#pyspark
bashbeginner

dbt Run and Test — CI/CD Pipeline Script

Bash script for running dbt build with testing, documentation generation, and failure notifications.

Best for: Automating dbt builds in CI/CD pipelines

#dbt#bash
pythonadvanced

Kafka Consumer in Python — Stream Processing

Build a Kafka consumer in Python with offset management, error handling, and batch processing.

Best for: Real-time event processing from Kafka topics

#kafka#streaming
bashbeginner

Kafka Topic — Create and Manage with CLI

Create, describe, alter, and manage Kafka topics using the kafka-topics CLI with partitioning config.

Best for: Setting up Kafka topics for new data streams

#kafka#bash
pythonbeginner

DuckDB — Query Parquet Files with Python

Use DuckDB to query Parquet files and CSVs directly from Python without loading into memory first.

Best for: Ad-hoc analytics on Parquet files without Spark

#duckdb#parquet
sqlbeginner

PostgreSQL COPY — Fast CSV Import

Use PostgreSQL COPY command for high-speed bulk data loading from CSV files with error handling.

Best for: High-speed bulk data loading into PostgreSQL

#postgres#copy
sqlintermediate

BigQuery — Partitioned and Clustered Tables

Create BigQuery tables with time partitioning and clustering for optimal query performance and cost.

Best for: Optimizing BigQuery costs with partition pruning

#bigquery#partitioning
bashintermediate

Bash Pipeline Monitoring and Alerting

Monitor data pipeline health with row counts, runtime tracking, SLA checks, and Slack alerting.

Best for: Monitoring data pipeline health and freshness

#bash#monitoring
bashbeginner

Database Backup and Restore to S3

Automated PostgreSQL backup script with compression, S3 upload, retention policy, and restore commands.

Best for: Automated daily database backups to S3

#bash#backup
pythonbeginner

Polars DataFrame

Data science technique: polars-dataframe

Best for: machine learning

#data#machine-learning
pythonintermediate

Dask Distributed

Data science technique: dask-distributed

Best for: machine learning

#data#machine-learning
pythonadvanced

Vaex Big Data

Data science technique: vaex-big-data

Best for: machine learning

#data#machine-learning
pythonbeginner

Modin Parallel

Data science technique: modin-parallel

Best for: machine learning

#data#machine-learning
pythonintermediate

Scikit-learn Pipeline

Data science technique: scikit-learn-pipeline

Best for: machine learning

#data#machine-learning
pythonadvanced

Scikit-learn Preprocessing

Data science technique: scikit-learn-preprocessing

Best for: machine learning

#data#machine-learning
pythonbeginner

Feature Engineering

Data science technique: feature-engineering

Best for: machine learning

#data#machine-learning
pythonintermediate

Feature Selection

Data science technique: feature-selection

Best for: machine learning

#data#machine-learning
pythonadvanced

Dimensionality Reduction

Data science technique: dimensionality-reduction

Best for: machine learning

#data#machine-learning
pythonbeginner

PCA Analysis

Data science technique: pca-analysis

Best for: machine learning

#data#machine-learning
pythonintermediate

t-SNE Visualization

Data science technique: tsne-visualization

Best for: machine learning

#data#machine-learning
pythonadvanced

Clustering K-Means

Data science technique: clustering-kmeans

Best for: machine learning

#data#machine-learning
pythonbeginner

Hierarchical Clustering

Data science technique: hierarchical-clustering

Best for: machine learning

#data#machine-learning
pythonintermediate

DBSCAN Clustering

Data science technique: dbscan-clustering

Best for: machine learning

#data#machine-learning
pythonadvanced

GMM Clustering

Data science technique: gmm-clustering

Best for: machine learning

#data#machine-learning
pythonbeginner

Regression Linear

Data science technique: regression-linear

Best for: machine learning

#data#machine-learning
pythonintermediate

Regression Logistic

Data science technique: regression-logistic

Best for: machine learning

#data#machine-learning
pythonadvanced

Regression Ridge

Data science technique: regression-ridge

Best for: machine learning

#data#machine-learning
pythonbeginner

Regression Lasso

Data science technique: regression-lasso

Best for: machine learning

#data#machine-learning
pythonintermediate

SVM Classification

Data science technique: svm-classification

Best for: machine learning

#data#machine-learning
pythonadvanced

Random Forest

Data science technique: random-forest

Best for: machine learning

#data#machine-learning
pythonbeginner

Gradient Boosting

Data science technique: gradient-boosting

Best for: machine learning

#data#machine-learning
pythonintermediate

XGBoost Advanced

Data science technique: xgboost-advanced

Best for: machine learning

#data#machine-learning
pythonadvanced

Neural Networks Basics

Data science technique: neural-networks-basics

Best for: machine learning

#data#machine-learning
pythonbeginner

Keras TensorFlow

Data science technique: keras-tensorflow

Best for: machine learning

#data#machine-learning
pythonintermediate

PyTorch Training

Data science technique: pytorch-training

Best for: machine learning

#data#machine-learning
pythonadvanced

JAX Numerical

Data science technique: jax-numerical

Best for: machine learning

#data#machine-learning
pythonbeginner

Cross Validation Advanced

Data science technique: cross-validation-advanced

Best for: machine learning

#data#machine-learning
pythonintermediate

Hyperparameter Tuning

Data science technique: hyperparameter-tuning

Best for: machine learning

#data#machine-learning
pythonadvanced

Grid Search

Data science technique: grid-search

Best for: machine learning

#data#machine-learning
pythonbeginner

Bayesian Optimization

Data science technique: bayesian-optimization

Best for: machine learning

#data#machine-learning
pythonintermediate

Ensemble Methods

Data science technique: ensemble-methods

Best for: machine learning

#data#machine-learning
pythonadvanced

Voting Classifiers

Data science technique: voting-classifiers

Best for: machine learning

#data#machine-learning
pythonbeginner

Stacking Models

Data science technique: stacking-models

Best for: machine learning

#data#machine-learning
pythonintermediate

Blending Predictions

Data science technique: blending-predictions

Best for: machine learning

#data#machine-learning
pythonadvanced

Time Series Forecasting

Data science technique: time-series-forecasting

Best for: machine learning

#data#machine-learning
pythonbeginner

Anomaly Detection

Data science technique: anomaly-detection

Best for: machine learning

#data#machine-learning
pythonintermediate

Imbalanced Learning SMOTE

Data science technique: imbalanced-learning-smote

Best for: machine learning

#data#machine-learning
pythonadvanced

Target Encoding

Data science technique: target-encoding

Best for: machine learning

#data#machine-learning
pythonbeginner

Feature Store Basics

Data science technique: feature-store-basics

Best for: machine learning

#data#machine-learning
pythonintermediate

Model Calibration

Data science technique: model-calibration

Best for: machine learning

#data#machine-learning
pythonadvanced

Confusion Matrix Analysis

Data science technique: confusion-matrix-analysis

Best for: machine learning

#data#machine-learning
pythonbeginner

Precision Recall Curve

Data science technique: precision-recall-curve

Best for: machine learning

#data#machine-learning
pythonintermediate

ROC AUC Evaluation

Data science technique: roc-auc-evaluation

Best for: machine learning

#data#machine-learning
pythonadvanced

Uplift Modeling

Data science technique: uplift-modeling

Best for: machine learning

#data#machine-learning
pythonbeginner

Recommendation Collaborative Filtering

Data science technique: recommendation-collaborative-filtering

Best for: machine learning

#data#machine-learning
pythonintermediate

NLP TF-IDF

Data science technique: nlp-tfidf

Best for: machine learning

#data#machine-learning
pythonadvanced

Topic Modeling LDA

Data science technique: topic-modeling-lda

Best for: machine learning

#data#machine-learning
pythonbeginner

Survival Analysis

Data science technique: survival-analysis

Best for: machine learning

#data#machine-learning
pythonintermediate

Causal Inference

Data science technique: causal-inference

Best for: machine learning

#data#machine-learning
pythonintermediate

Polars DataFrame Operations

High-performance DataFrame operations using Polars: filtering, groupby, joins, and lazy evaluation.

Best for: data transformation

#polars#dataframe
pythonintermediate

Dask Parallel DataFrame Processing

Process datasets larger than RAM using Dask's parallel, lazy DataFrame API.

Best for: out-of-core processing

#dask#parallel
pythonintermediate

SQLAlchemy Bulk Insert with Upsert

Efficiently bulk-insert rows with conflict resolution using SQLAlchemy Core and PostgreSQL.

Best for: idempotent loads

#sqlalchemy#postgres
pythonintermediate

Pandas Vectorised Operations vs Apply

Compare apply vs vectorised pandas operations for performance-critical column transformations.

Best for: feature engineering

#pandas#vectorization
pythonintermediate

Pandas Rolling & Expanding Windows

Compute moving averages, rolling sums, and cumulative stats on time-series data with pandas.

Best for: sales forecasting

#pandas#time-series
pythonintermediate

PyArrow Schema Enforcement

Define and enforce strict schemas on columnar data using PyArrow before writing to Parquet.

Best for: data lake storage

#pyarrow#parquet
pythonadvanced

Great Expectations Data Quality Suite

Define and run a Great Expectations validation suite to catch data quality issues early.

Best for: CI data validation

#great-expectations#data-quality
pythonintermediate

Pandas MultiIndex Stack & Unstack

Work with hierarchical MultiIndex DataFrames: pivoting with stack/unstack and cross-sectional slicing.

Best for: panel data

#pandas#multiindex
pythonintermediate

Prefect ETL Flow with Tasks

Define a Prefect 2 flow with typed tasks, retries, and structured logging for ETL pipelines.

Best for: ETL orchestration

#prefect#etl
pythonintermediate

Pandas Memory Reduction via Dtypes

Reduce DataFrame memory, often by 60-80% depending on the data, by downcasting numeric types and using categorical columns.

Best for: large dataset loading

#pandas#memory
pythonbeginner

Generate Synthetic Data with Faker

Create realistic test datasets for development and testing using the Faker library.

Best for: test data generation

#faker#testing
pythonbeginner

SQLite + Pandas Local Data Pipeline

Run a lightweight local ETL with SQLite and pandas: load CSV, transform, persist to SQLite.

Best for: local analytics

#sqlite#pandas
pythonintermediate

Multiprocessing Pool for ETL

Parallelise CPU-bound ETL transformations across multiple CPU cores using multiprocessing.Pool.

Best for: parallel file processing

#multiprocessing#parallel
pythonintermediate

Pandera DataFrame Schema Validation

Use Pandera to validate DataFrame schemas with type checks, value constraints, and custom checks.

Best for: pipeline input validation

#pandera#validation
pythonadvanced

PySpark Window Functions

Use PySpark window functions for running totals, rank, lag/lead, and percentile computations.

Best for: sales analytics

#pyspark#spark
pythonintermediate

Redis Cache-Aside Pattern in Python

Implement cache-aside (lazy loading) with Redis and Python to accelerate repeated database queries.

Best for: query caching

#redis#caching
pythonintermediate

Kafka Producer & Consumer in Python

Produce and consume JSON messages from Apache Kafka using the confluent-kafka Python client.

Best for: event streaming

#kafka#streaming
pythonbeginner

Pandas Categorical Encoding for ML

One-hot encode, label encode, and ordinal encode categorical columns using pandas and scikit-learn.

Best for: ML preprocessing

#pandas#encoding
pythonbeginner

Pandas String Operations

Clean, extract, and transform string columns using pandas .str accessor methods.

Best for: data cleaning

#pandas#strings
pythonadvanced

SQLAlchemy Async Session with asyncpg

Use SQLAlchemy 2.0 async sessions with asyncpg for non-blocking database access in async pipelines.

Best for: async web services

#sqlalchemy#async
pythonintermediate

Pydantic Models for ETL Validation

Parse and validate raw JSON records against Pydantic models before inserting into a database.

Best for: input validation

#pydantic#validation
pythonbeginner

Flatten Nested JSON with pandas

Use pd.json_normalize to flatten deeply nested API responses into a flat DataFrame.

Best for: API response flattening

#pandas#json
pythonintermediate

Statistical Analysis with SciPy

Run hypothesis tests, correlations, and descriptive statistics on dataset columns with SciPy.

Best for: A/B testing

#scipy#statistics
pythonbeginner

Structured Logging for Data Pipelines

Use Loguru to emit structured JSON logs with contextual fields from ETL pipeline stages.

Best for: pipeline observability

#loguru#logging
pythonadvanced

Delta Lake MERGE with Python

Perform ACID upserts on a Delta Lake table using the delta-rs Python binding.

Best for: lakehouse upserts

#delta-lake#upsert
pythonintermediate

Scikit-learn Feature Pipeline

Build a reproducible ML feature pipeline with ColumnTransformer, StandardScaler, and OneHotEncoder.

Best for: ML preprocessing

#scikit-learn#ml-pipeline
pythonintermediate

Pandas Method Chaining with .pipe()

Use the .pipe() method to create clean, readable pandas transformation chains.

Best for: clean ETL code

#pandas#pipe
pythonbeginner

Read Files from S3 with fsspec

Access S3 files directly with fsspec and pandas without boto3 boilerplate.

Best for: cloud data access

#fsspec#s3
pythonadvanced

Build a Data Lineage Graph with NetworkX

Track and visualise data lineage across ETL pipeline stages using a directed graph.

Best for: data governance

#networkx#lineage
pythonadvanced

PySpark Structured Streaming from Kafka

Consume a Kafka topic in real-time with PySpark Structured Streaming and write to Parquet.

Best for: real-time ETL

#pyspark#kafka
pythonbeginner

Pandas Explode List Column

Explode a column containing lists into separate rows, useful for normalising one-to-many relations.

Best for: array column expansion

#pandas#explode
pythonintermediate

Stream Large SQL Query in Chunks

Read millions of rows from SQL in memory-safe chunks using pandas read_sql with chunksize.

Best for: large table extraction

#pandas#sql
pythonbeginner

Dataclasses as Pipeline Data Models

Use Python dataclasses to define typed, immutable data models passed between pipeline stages.

Best for: typed pipeline stages

#dataclasses#typing
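
One way to read this card is sketched below: frozen dataclasses as the typed records handed from one stage to the next. The `RawOrder` and `CleanOrder` models and the `clean` stage are hypothetical names invented for this example.

```python
from dataclasses import dataclass, replace, FrozenInstanceError
from decimal import Decimal


@dataclass(frozen=True)
class RawOrder:
    """Record as extracted: every field is still a string."""
    order_id: str
    amount: str
    currency: str


@dataclass(frozen=True)
class CleanOrder:
    """Record after the transform stage, with parsed and normalised fields."""
    order_id: str
    amount_cents: int
    currency: str


def clean(raw: RawOrder) -> CleanOrder:
    """Transform stage: parse the amount and normalise the currency code."""
    # Decimal avoids float rounding surprises when converting to cents.
    cents = int(Decimal(raw.amount.strip()) * 100)
    return CleanOrder(raw.order_id, cents, raw.currency.upper())


order = clean(RawOrder("A-1", " 19.99 ", "usd"))
# frozen=True blocks attribute assignment; replace() returns a modified copy.
order_eur = replace(order, currency="EUR")
```

Giving each stage its own input and output type makes a mismatched hand-off visible to a type checker, and `frozen=True` prevents a later stage from silently mutating a record that an earlier stage still holds.
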
pythonbeginner

Pandas Time-Series Resampling

Resample time-series data from daily to weekly/monthly frequencies with aggregation functions.

Best for: time-series analytics

#pandas#time-series
pythonbeginner

Concat & Deduplicate DataFrames

Merge multiple DataFrames and remove duplicates by composite key for clean data consolidation.

Best for: data consolidation

#pandas#deduplication
pythonbeginner

Pandas .query() for Readable Filters

Use DataFrame.query() with expressions for cleaner, SQL-like row filtering syntax.

Best for: data filtering

#pandas#query
pythonintermediate

NumPy Advanced Indexing Patterns

Use fancy indexing, boolean masks, and np.where for fast array transformations without loops.

Best for: numerical computing

#numpy#indexing
pythonintermediate

OLS Regression with statsmodels

Fit and interpret an Ordinary Least Squares regression model with diagnostics using statsmodels.

Best for: econometric analysis

#statsmodels#regression
pythonintermediate

Pandas .eval() for Fast Column Computation

Use DataFrame.eval() for expressive, fast in-place column calculations using numexpr.

Best for: large DataFrame operations

#pandas#eval
pythonintermediate

Pandas merge_asof for Time-Based Joins

Perform an as-of join to match events to the most recent reference record within a time window.

Best for: tick data joins

#pandas#merge-asof
pythonbeginner

Pandas Category Dtype Optimization

Convert string columns to categorical dtype to dramatically reduce memory and speed up groupby.

Best for: memory optimization

#pandas#category
pythonbeginner

Pandas Wide to Long (melt)

Transform a wide-format DataFrame into long format using pd.melt for analytics and visualisation.

Best for: pivot table conversion

#pandas#melt
pythonintermediate

Row Fingerprinting with hashlib

Generate deterministic hash fingerprints for each row to detect changes in incremental loads.

Best for: change data capture

#hashlib#fingerprint
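
A deterministic row fingerprint could be computed roughly as follows. This is a sketch under stated assumptions: rows are dicts with string keys, SHA-256 over a canonical JSON encoding is the chosen hash, and `row_fingerprint` and `changed_keys` are hypothetical helper names.

```python
import hashlib
import json


def row_fingerprint(row, columns=None):
    """Return a hex SHA-256 digest that is stable across key order."""
    keys = sorted(columns if columns is not None else row)
    # Canonical form: sorted [key, value] pairs with fixed separators.
    # default=str stringifies values JSON cannot encode (e.g. dates).
    payload = json.dumps(
        [[key, row.get(key)] for key in keys],
        separators=(",", ":"),
        default=str,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_keys(old_rows, new_rows, key="id"):
    """Return keys of rows in new_rows that are new or differ from old_rows."""
    old = {row[key]: row_fingerprint(row) for row in old_rows}
    return [row[key] for row in new_rows
            if old.get(row[key]) != row_fingerprint(row)]
```

In an incremental load, storing the fingerprint alongside each loaded row lets the next run compare digests instead of comparing every column.
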
pythonintermediate

Pandas Custom Aggregation Functions

Pass custom lambda and named functions to .agg() for complex groupby aggregations.

Best for: HR analytics

#pandas#groupby
pythonintermediate

Tenacity Retry for Pipeline Resilience

Add exponential backoff retries to flaky data pipeline steps using Tenacity.

Best for: resilient API calls

#tenacity#retry
pythonadvanced

PyArrow Dataset Scan with Predicate Pushdown

Scan a partitioned Parquet dataset with column pruning and row-level predicate pushdown via PyArrow.

Best for: lakehouse queries

#pyarrow#parquet
pythonbeginner

Pandas Styled DataFrame Report

Apply conditional formatting to a pandas DataFrame for styled HTML reports with highlighting.

Best for: executive reporting

#pandas#styling
pythonbeginner

tqdm Progress Bars in Data Pipelines

Add progress bars to pandas operations, loops, and concurrent futures with tqdm.

Best for: ETL monitoring

#tqdm#progress
pythonintermediate

Property-Based Testing for Data Functions

Use Hypothesis to automatically generate edge-case test data for data transformation functions.

Best for: data function testing

#hypothesis#testing
pythonbeginner

Pandas .assign() for Immutable Chaining

Use DataFrame.assign() to add computed columns without mutating the original DataFrame.

Best for: immutable transforms

#pandas#assign
pythonintermediate

Async ETL Pipeline with asyncio

Run concurrent data fetches and transformations using asyncio.gather for high-throughput pipelines.

Best for: concurrent API ingestion

#asyncio#async
pythonintermediate

Pandas IntervalIndex for Binning

Use IntervalIndex and pd.cut to bin continuous variables into labelled categories.

Best for: grading systems

#pandas#binning
pythonintermediate

Grouped Time-Series with ffill

Forward-fill missing time-series values within groups to handle irregular measurement intervals.

Best for: IoT sensor data

#pandas#ffill
pythonintermediate

SQLModel CRUD Patterns

Use SQLModel (SQLAlchemy + Pydantic) to define models and run type-safe CRUD operations.

Best for: FastAPI backends

#sqlmodel#orm
pythonbeginner

Pandas Pivot Table Summary

Create multi-level summary pivot tables from transactional data using pd.pivot_table.

Best for: sales reporting

#pandas#pivot-table
pythonbeginner

Fast JSON Serialisation with orjson

Use orjson for faster JSON serialisation (often cited as 5-10x) of large Python dicts, dataclasses, and NumPy arrays.

Best for: high-throughput serialisation

#orjson#json
pythonbeginner

Pandas Named Aggregations

Use named aggregations in groupby().agg() to produce readable, self-documenting summary tables.

Best for: HR reporting

#pandas#groupby
pythonbeginner

Read Multi-Sheet Excel Files

Load, merge, and process data from multiple Excel sheets using the pandas ExcelFile context manager.

Best for: Excel ETL

#pandas#excel
pythonintermediate

Stratified Sampling with pandas

Draw a stratified random sample from a DataFrame, preserving class proportions for ML splits.

Best for: ML dataset splitting

#pandas#sampling
pythonintermediate

Polars Join Strategies

Perform inner, left, cross, and anti joins in Polars with optimal join strategies.

Best for: data enrichment

#polars#join
pythonadvanced

Ibis Portable DataFrame SQL

Write backend-agnostic analytics queries with Ibis that compile to DuckDB, BigQuery, or Spark.

Best for: portable analytics

#ibis#duckdb
pythonbeginner

DuckDB In-Memory Analytics

Run fast analytical SQL on pandas DataFrames or Parquet files without a server using DuckDB.

Best for: serverless analytics

#duckdb#analytics
pythonadvanced

Custom Pandas Accessor Extension

Create a reusable @pd.api.extensions.register_dataframe_accessor for domain-specific DataFrame methods.

Best for: domain-specific pandas

#pandas#extension
pythonintermediate

Timezone-Aware Timestamps in pandas

Convert naive timestamps to timezone-aware, handle DST transitions, and localise to UTC.

Best for: global event logs

#pandas#datetime
pythonintermediate

Polars Expressions API Patterns

Use Polars expression API for complex column-level transformations without apply or loops.

Best for: column transformations

#polars#expressions
pythonbeginner

Matplotlib Charts for Data Pipelines

Generate and save charts programmatically from pipeline output using matplotlib.

Best for: automated reports

#matplotlib#visualisation
pythonbeginner

Pandas Cross-Tabulation (crosstab)

Compute frequency and proportion cross-tabulations between two categorical columns.

Best for: categorical analysis

#pandas#crosstab
pythonbeginner

Pydantic Settings for Pipeline Config

Manage pipeline configuration from environment variables and .env files with Pydantic Settings.

Best for: pipeline configuration

#pydantic#config
pythonbeginner

Pandas read_csv with Explicit Dtypes

Specify column dtypes on CSV read to avoid costly inference and prevent silent type coercion.

Best for: fast CSV loading

#pandas#csv
pythonintermediate

Polars Lazy Scan of Parquet Files

Use Polars scan_parquet with predicate and projection pushdown for fast Parquet analytics.

Best for: lakehouse queries

#polars#parquet
pythonintermediate

Pareto / Cumulative Share Analysis

Calculate cumulative share (Pareto 80/20) of values for product or customer ranking analysis.

Best for: product analytics

#pandas#pareto
pythonintermediate

Pandas Apply with Chunked Progress

Apply a function to a large DataFrame in chunked batches to avoid memory spikes and track progress.

Best for: memory-safe transforms

#pandas#apply
pythonintermediate

Pandas Merge with Validation

Use the validate parameter of merge() to catch unexpected many-to-many relationships and duplicate keys in joins.

Best for: data integrity

#pandas#merge
pythonintermediate

Pandera @check_input and @check_output

Decorate pipeline functions with Pandera schema validators to enforce input and output contracts.

Best for: contract testing

#pandera#validation
pythonadvanced

NumPy Structured Arrays for Records

Use NumPy structured arrays to store heterogeneous record types efficiently without pandas overhead.

Best for: binary record storage

#numpy#structured-array
pythonintermediate

Pandas Conditional Join with merge + query

Perform range/conditional joins by merging on a common key and filtering with query expressions.

Best for: session attribution

#pandas#conditional-join
pythonbeginner

Polars String Operations

Use the Polars .str namespace for fast, vectorised string cleaning and extraction.

Best for: data cleaning

#polars#strings
pythonintermediate

Pandas GroupBy Transform Patterns

Use groupby().transform() to compute group-level statistics and broadcast them back to row level.

Best for: feature engineering

#pandas#groupby
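
As a rough illustration, with hypothetical `store` and `sales` columns: `transform` returns a result aligned to the original rows, so group statistics can be assigned straight back as new columns.

```python
import pandas as pd

# Hypothetical per-transaction sales for two stores.
df = pd.DataFrame({
    "store": ["a", "a", "b", "b"],
    "sales": [10.0, 30.0, 5.0, 15.0],
})

grouped = df.groupby("store")["sales"]

# Group mean broadcast to every row of that group.
df["store_mean"] = grouped.transform("mean")

# Each row's share of its own store's total.
df["share_of_store"] = df["sales"] / grouped.transform("sum")

print(df)
```

Unlike `agg`, which returns one row per group, `transform` keeps the original index and length.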
pythonintermediate

Bulk Load CSV into PostgreSQL with COPY

Use psycopg2's copy_expert to stream a CSV file into a PostgreSQL table with COPY, which is typically much faster than row-by-row inserts.

Best for: high-speed bulk loads

#psycopg2#postgres
pythonintermediate

Pandas Business Day Offsets

Compute business-day-adjusted dates using pandas offsets for financial and SLA calculations.

Best for: financial calendars

#pandas#datetime
pythonbeginner

Detect & Remove Pandas Duplicates

Find, count, and remove duplicate rows with flexible keep strategy and composite key support.

Best for: data cleaning

#pandas#duplicates
pythonadvanced

Efficient One-Hot Pivot with Sparse

Create a sparse user-item matrix from transaction logs for recommendation or ML use cases.

Best for: recommendation systems

#pandas#sparse
pythonintermediate

Pandas PeriodIndex for Fiscal Calendars

Use PeriodIndex for fiscal-period arithmetic, aggregations, and comparisons that go beyond plain datetime handling.

Best for: fiscal reporting

#pandas#period
pythonintermediate

Compare Two DataFrames for Changes

Detect row-level additions, deletions, and modifications between two DataFrame snapshots.

Best for: change data capture

#pandas#diff
pythonintermediate

Cumulative Max & Streak Detection

Detect streaks, new highs, and consecutive-day patterns in time-series using cummax and groupby.

Best for: sports analytics

#pandas#cummax
pythonbeginner

Read NDJSON / JSON Lines Files

Efficiently read newline-delimited JSON (NDJSON) log files into a pandas DataFrame.

Best for: log file ingestion

#pandas#ndjson
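
A minimal sketch, assuming an in-memory sample with hypothetical `event`, `level`, and `msg` keys in place of a real log file. `pd.read_json(..., lines=True)` parses one JSON object per line.

```python
import io

import pandas as pd

# Hypothetical NDJSON content; a real pipeline would pass a file path instead.
ndjson = (
    '{"event": "start", "level": "INFO", "msg": "pipeline started"}\n'
    '{"event": "fail", "level": "ERROR", "msg": "boom"}\n'
)

# lines=True tells pandas that each line is a separate JSON record.
logs = pd.read_json(io.StringIO(ndjson), lines=True)

print(logs)
```

For files too large to load at once, `read_json` also accepts `chunksize` together with `lines=True` and then yields DataFrames in batches.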
pythonintermediate

attrs Classes as Immutable Pipeline Records

Use attrs to create fast, validated, immutable record types for data pipeline stage outputs.

Best for: typed pipeline records

#attrs#data-modeling
pythonintermediate

Polars Pivot and Unpivot

Reshape a Polars DataFrame from long to wide (pivot) and wide to long (unpivot/melt).

Best for: data reshaping

#polars#pivot
pythonadvanced

Detect Overlapping Date Intervals

Identify overlapping time periods in a DataFrame (e.g., booking conflicts or subscription overlaps).

Best for: scheduling conflicts

#pandas#intervals
pythonintermediate

Pandas SwapLevel MultiIndex

Swap and sort MultiIndex levels in a hierarchical DataFrame for flexible aggregation.

Best for: hierarchical reporting

#pandas#multiindex
pythonbeginner

Pandas Rank with Tie-Breaking Methods

Apply different ranking strategies (min, dense, average) and handle ties in pandas.

Best for: leaderboards

#pandas#ranking
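
To make the difference between the methods concrete, here is a short sketch with hypothetical scores in which two entries tie for second place.

```python
import pandas as pd

# Hypothetical leaderboard: x and y are tied.
scores = pd.Series([90, 85, 85, 70], index=["w", "x", "y", "z"])

ranks = pd.DataFrame({
    "score": scores,
    # Tied entries share the lowest rank; the next rank is skipped.
    "min": scores.rank(method="min", ascending=False),
    # Tied entries share a rank; no ranks are skipped afterwards.
    "dense": scores.rank(method="dense", ascending=False),
    # Tied entries receive the mean of the ranks they span.
    "average": scores.rank(method="average", ascending=False),
})

print(ranks)
```

`method="first"` is another option when ties should be broken by order of appearance.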
pythonbeginner

Pandas nlargest / nsmallest

Efficiently retrieve the N largest or smallest rows without sorting the full DataFrame.

Best for: top-N queries

#pandas#top-n
pythonbeginner

Pandas Datetime Component Extraction

Extract year, month, day, hour, day-of-week and other components from a datetime column.

Best for: time-based features

#pandas#datetime
pythonbeginner

DataFrame to Dict Records

Convert DataFrames to lists of dicts for API responses, JSON export, or further processing.

Best for: API serialization

#pandas#records
pythonintermediate

Pandas Cartesian Feature Interaction

Generate pairwise feature interactions for ML by creating cross-product columns.

Best for: ML feature engineering

#pandas#feature-engineering
pythonintermediate

Pandas Rolling Correlation

Compute rolling Pearson correlation between two columns to detect shifting relationships over time.

Best for: regime detection

#pandas#rolling
pythonbeginner

Expand JSON Column into DataFrame Columns

Parse a JSON-string column and expand its keys into separate columns in one step.

Best for: JSON column expansion

#pandas#json
pythonbeginner

Pandas Forward Fill & Backward Fill

Propagate non-null values forward and backward to fill gaps in time-series or sparse data.

Best for: gap filling

#pandas#ffill
pythonbeginner

Value Counts with Normalisation

Compute frequency distributions and percentage breakdowns of categorical columns.

Best for: data profiling

#pandas#value-counts
pythonadvanced

dbt Python Model with pandas

Write a dbt Python model that runs on Databricks/Snowpark to transform DataFrames in the warehouse.

Best for: dbt Python models

#dbt#python-model