Data Engineering

Pandas DataFrame Transformations

Common pandas DataFrame transformations including column operations, type casting, and string methods.

Best for: Cleaning raw data files for analysis

#pandas#dataframe

Pandas DataFrame Filtering Techniques

Filter DataFrames using boolean masks, query syntax, isin, between, and string matching methods.

Best for: Extracting subsets of data for reporting

#pandas#filtering

Pandas GroupBy Aggregation Examples

GroupBy operations with multiple aggregations, named aggregations, and transform for DataFrame analysis.

Best for: Sales reporting by region and time period

Python ETL Pipeline Example

Complete extract-transform-load pipeline with error handling, logging, and incremental processing.

Best for: Automating data ingestion from CSV to warehouse

#etl#pipeline

Apache Airflow DAG Example

Airflow DAG with task dependencies, retries, SLA, and PythonOperator for daily data pipeline.

Best for: Orchestrating daily data pipelines

#airflow#dag

Spark SQL Query Example

PySpark DataFrame operations with SQL queries, window functions, and aggregations for big data.

Best for: Processing large-scale datasets with Spark

#spark#pyspark

Python Batch Processing Script

Process large files in configurable batches with progress tracking, error handling, and resume support.

Best for: Processing large CSV files that don't fit in memory

#batch-processing#python
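
As a rough illustration of the batching idea described in this entry, the sketch below reads rows lazily, groups them into fixed-size batches, records which batches failed, and can skip batches that an earlier run already finished. The names `iter_batches`, `process_in_batches`, and the `start_batch` resume parameter are assumptions made for this example, not the catalog's actual script.

```python
import csv
import io
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size rows, reading lazily from rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_in_batches(rows, batch_size, handle_batch, start_batch=0):
    """Run handle_batch on every batch and report what happened.

    start_batch is a hypothetical resume point: batches with a lower
    index are read but skipped. Returns (processed_indexes, failed_indexes).
    """
    processed, failed = [], []
    for index, batch in enumerate(iter_batches(rows, batch_size)):
        if index < start_batch:
            continue  # finished in an earlier run
        try:
            handle_batch(batch)
            processed.append(index)
        except Exception:
            # Keep going so one bad batch does not stop the whole file.
            failed.append(index)
    return processed, failed


# Usage with an in-memory CSV standing in for a large file on disk.
sample = io.StringIO("id,amount\n1,10\n2,20\n3,30\n4,40\n5,50\n")
totals = []
processed, failed = process_in_batches(
    csv.DictReader(sample),
    2,
    lambda batch: totals.append(sum(int(row["amount"]) for row in batch)),
)
```

Because `csv.DictReader` is itself an iterator, only one batch of rows is held in memory at a time. Persisting the last processed index between runs (for example in a small state file) is left out of the sketch.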

Nested JSON Flattening in Python

Flatten deeply nested JSON structures into flat dictionaries suitable for DataFrames or CSV export.

Best for: Converting API responses to flat tables

#json#flattening
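
A minimal sketch of what flattening can look like, using only the standard library. The function name `flatten`, the dotted-key convention, and the choice to index list items by position are assumptions for illustration; other conventions are equally valid.

```python
from typing import Any, Dict


def flatten(obj: Any, parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts and lists into a single-level dict.

    Nested keys are joined with sep; list items use their position as key.
    Empty dicts and lists are kept as values rather than dropped.
    """
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = ((str(i), value) for i, value in enumerate(obj))
    else:
        return {parent_key: obj}

    flat: Dict[str, Any] = {}
    for key, value in items:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, (dict, list)) and value:
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


record = {"id": 1, "user": {"name": "Ada", "tags": ["a", "b"]}, "meta": {}}
flat_record = flatten(record)
```

Each flattened record is a plain dict, so a list of them can be passed to `csv.DictWriter` or to a DataFrame constructor.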

Python CSV Processing Examples

Read, write, and transform CSV files using the csv module and pandas with encoding and dialect handling.

Best for: Reading and cleaning CSV data files

#csv#python

Data Validation with Pydantic

Validate and parse data records using Pydantic models with custom validators and error reporting.

Best for: Validating incoming data before warehouse loading

#validation#pydantic

Retry Logic for Data Pipelines

Configurable retry decorator with exponential backoff and jitter for resilient data pipeline tasks.

Best for: Resilient API calls in data pipelines

#retry#resilience
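
One way such a decorator could be written is sketched below. The parameter names (`max_attempts`, `base_delay`, `max_delay`, `jitter`, `sleep`) are assumptions for this example; the injectable `sleep` argument exists only so the behaviour can be checked without waiting.

```python
import functools
import random
import time


def retry(max_attempts=3, base_delay=0.5, max_delay=30.0, jitter=0.1,
          exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function with exponential backoff plus jitter.

    The delay doubles after every failed attempt, is capped at max_delay,
    and gets a random extra of up to jitter * delay added on top.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                    delay += random.uniform(0, jitter * delay)
                    sleep(delay)
        return wrapper
    return decorator


# Usage: a hypothetical flaky call that succeeds on the third attempt.
delays = []
calls = {"count": 0}


@retry(max_attempts=4, base_delay=1.0, jitter=0.0,
       exceptions=(ConnectionError,), sleep=delays.append)
def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky_fetch()
```

Restricting `exceptions` to errors that are genuinely transient avoids retrying bugs such as a `KeyError` in parsing code.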

Database Sync Script in Python

Sync data between two databases with upsert logic, batch processing, and change detection.

Best for: Replicating data between databases

#database#sync

SQL Incremental Load Pattern

Incremental data load using watermark tracking to process only new and updated records efficiently.

Best for: Efficient warehouse loading without full reloads

#sql#incremental-load

SQL Data Deduplication Techniques

Remove duplicate records using ROW_NUMBER, DISTINCT ON, and self-join deduplication strategies.

Best for: Cleaning duplicate records in production databases

#sql#deduplication

Databricks Notebook Data Pipeline

Databricks notebook with Delta Lake reads, transformations, merge operations, and table optimization.

Best for: Medallion architecture data pipelines on Databricks

#databricks#delta-lake

Python Streaming Data Processing

Process streaming data with generators, windowed aggregation, and memory-efficient line-by-line reading.

Best for: Processing large event log files efficiently

#streaming#python
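
The generator-based approach in this entry can be sketched roughly as follows. The line format (`epoch_seconds,event_type`), the function names, and the assumption that events arrive ordered by timestamp are all illustrative choices, not details taken from the catalog's snippet.

```python
import io
from collections import defaultdict


def parse_events(lines):
    """Lazily parse 'epoch_seconds,event_type' lines, skipping bad ones."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ts_text, _, event_type = line.partition(",")
        try:
            timestamp = int(ts_text)
        except ValueError:
            continue  # malformed timestamp: skip the line
        yield timestamp, event_type


def tumbling_counts(events, window_seconds):
    """Yield (window_start, counts_by_type) for each fixed-size window.

    Assumes events are ordered by timestamp, so only one window is
    held in memory at a time.
    """
    current_start = None
    counts = defaultdict(int)
    for timestamp, event_type in events:
        window_start = timestamp - timestamp % window_seconds
        if current_start is not None and window_start != current_start:
            yield current_start, dict(counts)
            counts = defaultdict(int)
        current_start = window_start
        counts[event_type] += 1
    if current_start is not None:
        yield current_start, dict(counts)


# Usage with an in-memory log; a real file object works the same way.
log = io.StringIO("0,click\n10,view\n59,click\n\nbad,line\n60,click\n125,view\n")
windows = list(tumbling_counts(parse_events(log), 60))
```

Both functions are generators, so a file opened with `open(path)` is consumed line by line and never loaded whole.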

SQL Window Functions for Analytics

Advanced SQL window functions for running totals, rankings, moving averages, and gap analysis.

Best for: Building analytics dashboards with running totals

#sql#window-functions

SQL Schema Migration Pattern

Versioned schema migration scripts with forward and rollback support for database evolution.

Best for: Managing database schema changes across environments

#sql#migration

Pandas Merge and Join Examples

Combine DataFrames using merge, join, and concat with different join types and key handling.

Best for: Combining data from multiple sources

#pandas#merge

dbt Model with Tests and Schema

A dbt SQL model with incremental materialization, schema tests, and source freshness checks.

Best for: Building analytics data models with dbt

#dbt#sql

Parquet File Read and Write in Python

Read and write Parquet files with pandas and PyArrow including partitioning and schema control.

Best for: Efficient columnar storage for analytics data

#parquet#pyarrow

Change Data Capture Pattern in SQL

Implement change data capture with trigger-based auditing to track inserts, updates, and deletes.

Best for: Tracking all data changes for audit compliance

#cdc#audit

Python Data Profiling Script

Generate a data quality profile report with null counts, distributions, and anomaly detection.

Best for: Automated data quality reporting

#data-quality#profiling

Pandas Pivot and Unpivot Reshaping

Reshape DataFrames between wide and long formats using pivot, melt, and stack operations.

Best for: Reshaping data for reporting dashboards

#pandas#pivot

Slowly Changing Dimension Type 2 in SQL

Implement SCD Type 2 to track historical changes in dimension tables with effective date ranges.

Best for: Tracking customer attribute changes over time

#scd#dimension

Data Quality Testing with Expectations

Define and run data quality expectations for automated validation in data pipelines.

Best for: Automated data quality gates in pipelines

#data-quality#testing

Pandas Time Series Analysis

Time series operations with resampling, rolling windows, date offsets, and period conversions.

Best for: Sales trend analysis with moving averages

#pandas#time-series

SQL Data Lineage Tracking

Track data lineage across ETL stages with metadata logging for debugging and audit trails.

Best for: Tracing data flow across pipeline stages

#lineage#metadata

Pandas Null Handling Strategies

Comprehensive strategies for detecting, filling, and handling missing values in pandas DataFrames.

Best for: Cleaning datasets with missing values

#pandas#null

SQL Window Functions for Analytics

Use window functions for running totals, rankings, moving averages, and gap detection in analytics.

Best for: Building cumulative revenue dashboards

#sql#window-functions

Read Large CSV in Chunks with Pandas

Process CSV files larger than RAM by reading in chunks — memory-efficient ETL pattern for data pipelines.

Best for: Processing multi-GB CSV files without running out of memory

#pandas#csv

Polars Lazy Query — Fast DataFrame Processing

Use Polars lazy evaluation for high-performance data transformations that can outperform pandas on many workloads.

Best for: High-performance data processing replacing pandas

#polars#dataframe

PySpark DataFrame — Filter and Aggregate

Common PySpark DataFrame operations: filter, group by, window functions, and write to Parquet.

Best for: Large-scale data aggregation on distributed clusters

#spark#pyspark

Airflow DAG with Python Operators

Create an Apache Airflow DAG with task dependencies, retries, and XCom data passing between tasks.

Best for: Orchestrating daily ETL pipelines

#airflow#dag

dbt Incremental Model Pattern

Build efficient dbt incremental models that process only new or changed data instead of full refreshes.

Best for: Efficient data warehouse builds processing only deltas

#dbt#incremental

SQL Data Quality Checks and Assertions

Reusable SQL queries for data quality: null checks, uniqueness, referential integrity, and freshness.

Best for: Automated data quality gates in ETL pipelines

#sql#data-quality

Snowflake MERGE with Slowly Changing Dim

Implement SCD Type 2 in Snowflake using MERGE to track historical changes in dimension tables.

Best for: Tracking full history of dimension changes

#snowflake#merge

sql / beginner

dbt Source Freshness and Testing

Configure dbt source freshness checks and schema tests to validate upstream data pipelines.

Best for: Ensuring upstream data sources are fresh

#dbt#testing

SQL Running Totals and Cumulative Metrics

Calculate running totals, cumulative counts, and percent-of-total using window functions and partitions.

Best for: Building cumulative revenue dashboards

#sql#window-functions

bash / intermediate

Bash ETL Pipeline Script

Build a complete ETL script in Bash with logging, error handling, notifications, and idempotent runs.

Best for: Automating daily data extract and load jobs

#bash#etl

bash / intermediate

Cron Data Sync — Database to S3

Automated script to export database tables to compressed CSV and sync to S3 on a schedule.

Best for: Nightly database exports to cloud storage

#bash#cron

bash / advanced

Spark Submit — Job Launcher Script

Launch PySpark jobs with spark-submit including cluster configuration, dependencies, and monitoring.

Best for: Launching PySpark batch jobs on YARN clusters

#spark#pyspark

bash / beginner

dbt Run and Test — CI/CD Pipeline Script

Bash script for running dbt build with testing, documentation generation, and failure notifications.

Best for: Automating dbt builds in CI/CD pipelines

#dbt#bash

Kafka Consumer in Python — Stream Processing

Build a Kafka consumer in Python with offset management, error handling, and batch processing.

Best for: Real-time event processing from Kafka topics

#kafka#streaming

bash / beginner

Kafka Topic — Create and Manage with CLI

Create, describe, alter, and manage Kafka topics using the kafka-topics CLI with partitioning config.

Best for: Setting up Kafka topics for new data streams

#kafka#bash

DuckDB — Query Parquet Files with Python

Use DuckDB to query Parquet files and CSVs directly from Python without loading into memory first.

Best for: Ad-hoc analytics on Parquet files without Spark

#duckdb#parquet

sql / beginner

PostgreSQL COPY — Fast CSV Import

Use PostgreSQL COPY command for high-speed bulk data loading from CSV files with error handling.

Best for: High-speed bulk data loading into PostgreSQL

#postgres#copy

BigQuery — Partitioned and Clustered Tables

Create BigQuery tables with time partitioning and clustering for optimal query performance and cost.

Best for: Optimizing BigQuery costs with partition pruning

#bigquery#partitioning

bash / intermediate

Bash Pipeline Monitoring and Alerting

Monitor data pipeline health with row counts, runtime tracking, SLA checks, and Slack alerting.

Best for: Monitoring data pipeline health and freshness

#bash#monitoring

bash / beginner

Database Backup and Restore to S3

Automated PostgreSQL backup script with compression, S3 upload, retention policy, and restore commands.

Best for: Automated daily database backups to S3

#bash#backup

Polars DataFrame

Data science technique: polars-dataframe

Best for: machine learning

Dask Distributed

Data science technique: dask-distributed

Best for: machine learning

Vaex Big Data

Data science technique: vaex-big-data

Best for: machine learning

Modin Parallel

Data science technique: modin-parallel

Best for: machine learning

Scikit-learn Pipeline

Data science technique: scikit-learn-pipeline

Best for: machine learning

Scikit-learn Preprocessing

Data science technique: scikit-learn-preprocessing

Best for: machine learning

Feature Engineering

Data science technique: feature-engineering

Best for: machine learning

Feature Selection

Data science technique: feature-selection

Best for: machine learning

Dimensionality Reduction

Data science technique: dimensionality-reduction

Best for: machine learning

PCA Analysis

Data science technique: pca-analysis

Best for: machine learning

t-SNE Visualization

Data science technique: tsne-visualization

Best for: machine learning

Clustering K-Means

Data science technique: clustering-kmeans

Best for: machine learning

Hierarchical Clustering

Data science technique: hierarchical-clustering

Best for: machine learning

DBSCAN Clustering

Data science technique: dbscan-clustering

Best for: machine learning

GMM Clustering

Data science technique: gmm-clustering

Best for: machine learning

Regression Linear

Data science technique: regression-linear

Best for: machine learning

Regression Logistic

Data science technique: regression-logistic

Best for: machine learning

Regression Ridge

Data science technique: regression-ridge

Best for: machine learning

Regression Lasso

Data science technique: regression-lasso

Best for: machine learning

SVM Classification

Data science technique: svm-classification

Best for: machine learning

Random Forest

Data science technique: random-forest

Best for: machine learning

Gradient Boosting

Data science technique: gradient-boosting

Best for: machine learning

XGBoost Advanced

Data science technique: xgboost-advanced

Best for: machine learning

Neural Networks Basics

Data science technique: neural-networks-basics

Best for: machine learning

Keras TensorFlow

Data science technique: keras-tensorflow

Best for: machine learning

PyTorch Training

Data science technique: pytorch-training

Best for: machine learning

JAX Numerical

Data science technique: jax-numerical

Best for: machine learning

Cross Validation Advanced

Data science technique: cross-validation-advanced

Best for: machine learning

Hyperparameter Tuning

Data science technique: hyperparameter-tuning

Best for: machine learning

Grid Search

Data science technique: grid-search

Best for: machine learning

Bayesian Optimization

Data science technique: bayesian-optimization

Best for: machine learning

Ensemble Methods

Data science technique: ensemble-methods

Best for: machine learning

Voting Classifiers

Data science technique: voting-classifiers

Best for: machine learning

Stacking Models

Data science technique: stacking-models

Best for: machine learning

Blending Predictions

Data science technique: blending-predictions

Best for: machine learning

Time Series Forecasting

Data science technique: time-series-forecasting

Best for: machine learning

Anomaly Detection

Data science technique: anomaly-detection

Best for: machine learning

Imbalanced Learning SMOTE

Data science technique: imbalanced-learning-smote

Best for: machine learning

Target Encoding

Data science technique: target-encoding

Best for: machine learning

Feature Store Basics

Data science technique: feature-store-basics

Best for: machine learning

Model Calibration

Data science technique: model-calibration

Best for: machine learning

Confusion Matrix Analysis

Data science technique: confusion-matrix-analysis

Best for: machine learning

Precision Recall Curve

Data science technique: precision-recall-curve

Best for: machine learning

ROC AUC Evaluation

Data science technique: roc-auc-evaluation

Best for: machine learning

Uplift Modeling

Data science technique: uplift-modeling

Best for: machine learning

Recommendation Collaborative Filtering

Data science technique: recommendation-collaborative-filtering

Best for: machine learning

NLP TF-IDF

Data science technique: nlp-tfidf

Best for: machine learning

Topic Modeling LDA

Data science technique: topic-modeling-lda

Best for: machine learning

Survival Analysis

Data science technique: survival-analysis

Best for: machine learning

Causal Inference

Data science technique: causal-inference

Best for: machine learning

Polars DataFrame Operations

High-performance DataFrame operations using Polars: filtering, groupby, joins, and lazy evaluation.

Best for: data transformation

#polars#dataframe

Dask Parallel DataFrame Processing

Process datasets larger than RAM using Dask's parallel, lazy DataFrame API.

Best for: out-of-core processing

#dask#parallel

SQLAlchemy Bulk Insert with Upsert

Efficiently bulk-insert rows with conflict resolution using SQLAlchemy Core and PostgreSQL.

Best for: idempotent loads

#sqlalchemy#postgres

Pandas Vectorised Operations vs Apply

Compare apply vs vectorised pandas operations for performance-critical column transformations.

Best for: feature engineering

#pandas#vectorization

Pandas Rolling & Expanding Windows

Compute moving averages, rolling sums, and cumulative stats on time-series data with pandas.

Best for: sales forecasting

#pandas#time-series

PyArrow Schema Enforcement

Define and enforce strict schemas on columnar data using PyArrow before writing to Parquet.

Best for: data lake storage

#pyarrow#parquet

Great Expectations Data Quality Suite

Define and run a Great Expectations validation suite to catch data quality issues early.

Best for: CI data validation

#great-expectations#data-quality

Pandas MultiIndex Stack & Unstack

Work with hierarchical MultiIndex DataFrames: pivoting with stack/unstack and cross-sectional slicing.

Best for: panel data

#pandas#multiindex

Prefect ETL Flow with Tasks

Define a Prefect 2 flow with typed tasks, retries, and structured logging for ETL pipelines.

Best for: ETL orchestration

#prefect#etl

Pandas Memory Reduction via Dtypes

Reduce DataFrame memory, by 60-80% in favorable cases, by downcasting numeric types and using categorical columns.

Best for: large dataset loading

#pandas#memory

Generate Synthetic Data with Faker

Create realistic test datasets for development and testing using the Faker library.

Best for: test data generation

#faker#testing

SQLite + Pandas Local Data Pipeline

Run a lightweight local ETL with SQLite and pandas: load CSV, transform, persist to SQLite.

Best for: local analytics

#sqlite#pandas

Multiprocessing Pool for ETL

Parallelise CPU-bound ETL transformations across multiple CPU cores using multiprocessing.Pool.

Best for: parallel file processing

#multiprocessing#parallel

Pandera DataFrame Schema Validation

Use Pandera to validate DataFrame schemas with type checks, value constraints, and custom checks.

Best for: pipeline input validation

#pandera#validation

PySpark Window Functions

Use PySpark window functions for running totals, rank, lag/lead, and percentile computations.

Best for: sales analytics

#pyspark#spark

Redis Cache-Aside Pattern in Python

Implement cache-aside (lazy loading) with Redis and Python to accelerate repeated database queries.

Best for: query caching

#redis#caching

Kafka Producer & Consumer in Python

Produce and consume JSON messages from Apache Kafka using the confluent-kafka Python client.

Best for: event streaming

#kafka#streaming

Pandas Categorical Encoding for ML

One-hot encode, label encode, and ordinal encode categorical columns using pandas and scikit-learn.

Best for: ML preprocessing

#pandas#encoding

Pandas String Operations

Clean, extract, and transform string columns using pandas .str accessor methods.

Best for: data cleaning

#pandas#strings

SQLAlchemy Async Session with asyncpg

Use SQLAlchemy 2.0 async sessions with asyncpg for non-blocking database access in async pipelines.

Best for: async web services

#sqlalchemy#async

Pydantic Models for ETL Validation

Parse and validate raw JSON records against Pydantic models before inserting into a database.

Best for: input validation

#pydantic#validation

Flatten Nested JSON with pandas

Use pd.json_normalize to flatten deeply nested API responses into a flat DataFrame.

Best for: API response flattening

#pandas#json

Statistical Analysis with SciPy

Run hypothesis tests, correlations, and descriptive statistics on dataset columns with SciPy.

Best for: A/B testing

#scipy#statistics

Structured Logging for Data Pipelines

Use Loguru to emit structured JSON logs with contextual fields from ETL pipeline stages.

Best for: pipeline observability

#loguru#logging

Delta Lake MERGE with Python

Perform ACID upserts on a Delta Lake table using the delta-rs Python binding.

Best for: lakehouse upserts

#delta-lake#upsert

Scikit-learn Feature Pipeline

Build a reproducible ML feature pipeline with ColumnTransformer, StandardScaler, and OneHotEncoder.

Best for: ML preprocessing

#scikit-learn#ml-pipeline

Pandas Method Chaining with .pipe()

Use the .pipe() method to create clean, readable pandas transformation chains.

Best for: clean ETL code

#pandas#pipe

Read Files from S3 with fsspec

Access S3 files directly with fsspec and pandas without boto3 boilerplate.

Best for: cloud data access

#fsspec#s3

Build a Data Lineage Graph with NetworkX

Track and visualise data lineage across ETL pipeline stages using a directed graph.

Best for: data governance

#networkx#lineage

PySpark Structured Streaming from Kafka

Consume a Kafka topic in real-time with PySpark Structured Streaming and write to Parquet.

Best for: real-time ETL

#pyspark#kafka

Pandas Explode List Column

Explode a column containing lists into separate rows, useful for normalising one-to-many relations.

Best for: array column expansion

#pandas#explode

Stream Large SQL Query in Chunks

Read millions of rows from SQL in memory-safe chunks using pandas read_sql with chunksize.

Best for: large table extraction

#pandas#sql

Dataclasses as Pipeline Data Models

Use Python dataclasses to define typed, immutable data models passed between pipeline stages.

Best for: typed pipeline stages

#dataclasses#typing
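
A small sketch of how frozen dataclasses might model records passed between stages. The `RawOrder` and `CleanOrder` types and their fields are hypothetical, chosen only to show the pattern of one immutable type per stage with validation in `__post_init__`.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class RawOrder:
    """Record as extracted: every field is still an unparsed string."""
    order_id: str
    amount: str
    order_date: str


@dataclass(frozen=True)
class CleanOrder:
    """Record after the transform stage: typed and validated."""
    order_id: str
    amount_cents: int
    order_date: date

    def __post_init__(self):
        if self.amount_cents < 0:
            raise ValueError("amount_cents must be non-negative")


def clean(raw: RawOrder) -> CleanOrder:
    """Transform stage: parse strings into typed fields."""
    return CleanOrder(
        order_id=raw.order_id.strip(),
        amount_cents=int(Decimal(raw.amount) * 100),
        order_date=date.fromisoformat(raw.order_date),
    )


order = clean(RawOrder(" A-1 ", "19.99", "2024-03-01"))
# Frozen instances cannot be mutated; replace() builds a modified copy.
discounted = replace(order, amount_cents=1500)
```

Giving each stage its own type makes it harder to pass unvalidated data downstream by accident, and `frozen=True` means a later stage cannot silently alter a record that an earlier stage produced.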

Pandas Time-Series Resampling

Resample time-series data from daily to weekly/monthly frequencies with aggregation functions.

Best for: time-series analytics

#pandas#time-series

Concat & Deduplicate DataFrames

Merge multiple DataFrames and remove duplicates by composite key for clean data consolidation.

Best for: data consolidation

#pandas#deduplication

Pandas .query() for Readable Filters

Use DataFrame.query() with expressions for cleaner, SQL-like row filtering syntax.

Best for: data filtering

#pandas#query

NumPy Advanced Indexing Patterns

Use fancy indexing, boolean masks, and np.where for fast array transformations without loops.

Best for: numerical computing

#numpy#indexing

OLS Regression with statsmodels

Fit and interpret an Ordinary Least Squares regression model with diagnostics using statsmodels.

Best for: econometric analysis

#statsmodels#regression

Pandas .eval() for Fast Column Computation

Use DataFrame.eval() for expressive, fast in-place column calculations using numexpr.

Best for: large DataFrame operations

#pandas#eval

Pandas merge_asof for Time-Based Joins

Perform an as-of join to match events to the most recent reference record within a time window.

Best for: tick data joins

#pandas#merge-asof

Pandas Category Dtype Optimization

Convert string columns to categorical dtype to dramatically reduce memory and speed up groupby.

Best for: memory optimization

#pandas#category

Pandas Wide to Long (melt)

Transform a wide-format DataFrame into long format using pd.melt for analytics and visualisation.

Best for: pivot table conversion

#pandas#melt

Row Fingerprinting with hashlib

Generate deterministic hash fingerprints for each row to detect changes in incremental loads.

Best for: change data capture

#hashlib#fingerprint
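
A minimal sketch of row fingerprinting, assuming rows are plain dicts with JSON-serialisable values (anything else is converted with `str`). The helper names `row_fingerprint` and `changed_keys`, and the use of SHA-256 over canonical JSON, are choices made for this example.

```python
import hashlib
import json


def row_fingerprint(row, columns=None):
    """Return a deterministic SHA-256 hex digest for a row.

    Keys are sorted and separators fixed so that the same values always
    serialise to the same bytes, whatever the dict's insertion order.
    """
    selected = {key: row[key] for key in (columns or sorted(row))}
    canonical = json.dumps(
        selected, sort_keys=True, separators=(",", ":"), default=str
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_keys(previous, current_rows, key_column):
    """Return keys of rows that are new or differ from the last load.

    previous maps each key to the fingerprint stored by the prior run.
    """
    changed = []
    for row in current_rows:
        if previous.get(row[key_column]) != row_fingerprint(row):
            changed.append(row[key_column])
    return changed


old_row = {"id": 1, "name": "Ada", "city": "Paris"}
stored = {1: row_fingerprint(old_row)}
incoming = [
    {"city": "Paris", "id": 1, "name": "Ada"},     # same values, new key order
    {"id": 2, "name": "Grace", "city": "Boston"},  # not seen before
]
to_load = changed_keys(stored, incoming, "id")
```

Passing `columns` restricts the fingerprint to business columns, which keeps audit fields such as a load timestamp from marking every row as changed.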

Pandas Custom Aggregation Functions

Pass custom lambda and named functions to .agg() for complex groupby aggregations.

Best for: HR analytics

Tenacity Retry for Pipeline Resilience

Add exponential backoff retries to flaky data pipeline steps using Tenacity.

Best for: resilient API calls

#tenacity#retry

PyArrow Dataset Scan with Predicate Pushdown

Scan a partitioned Parquet dataset with column pruning and row-level predicate pushdown via PyArrow.

Best for: lakehouse queries

#pyarrow#parquet

Pandas Styled DataFrame Report

Apply conditional formatting to a pandas DataFrame for styled HTML reports with highlighting.

Best for: executive reporting

#pandas#styling

tqdm Progress Bars in Data Pipelines

Add progress bars to pandas operations, loops, and concurrent futures with tqdm.

Best for: ETL monitoring

#tqdm#progress

Property-Based Testing for Data Functions

Use Hypothesis to automatically generate edge-case test data for data transformation functions.

Best for: data function testing

#hypothesis#testing

Pandas .assign() for Immutable Chaining

Use DataFrame.assign() to add computed columns without mutating the original DataFrame.

Best for: immutable transforms

#pandas#assign

Async ETL Pipeline with asyncio

Run concurrent data fetches and transformations using asyncio.gather for high-throughput pipelines.

Best for: concurrent API ingestion

#asyncio#async
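
The pattern named in this entry could look roughly like the sketch below. The `fetch` coroutine stands in for a real network call, and the source names, the `max_concurrency` limit, and the choice to collect errors rather than fail fast are assumptions for illustration.

```python
import asyncio


async def fetch(source, delay=0.01):
    """Stand-in for an API call; the 'broken' source simulates a failure."""
    await asyncio.sleep(delay)
    if source == "broken":
        raise RuntimeError(f"fetch failed for {source}")
    return {"source": source, "rows": [1, 2, 3]}


def transform(payload):
    """Reduce a fetched payload to a per-source total."""
    return {"source": payload["source"], "total": sum(payload["rows"])}


async def run_pipeline(sources, max_concurrency=2):
    """Fetch and transform all sources concurrently.

    A semaphore bounds the number of in-flight fetches, and
    return_exceptions=True lets one failure leave the others intact.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(source):
        async with semaphore:
            return transform(await fetch(source))

    results = await asyncio.gather(
        *(guarded(source) for source in sources), return_exceptions=True
    )
    succeeded = [r for r in results if not isinstance(r, Exception)]
    errors = [r for r in results if isinstance(r, Exception)]
    return succeeded, errors


succeeded, errors = asyncio.run(run_pipeline(["orders", "broken", "users"]))
```

`asyncio.gather` returns results in the order the awaitables were passed, so successes keep their input order even though the fetches overlap in time.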

Pandas IntervalIndex for Binning

Use IntervalIndex and pd.cut to bin continuous variables into labelled categories.

Best for: grading systems

#pandas#binning

Grouped Time-Series with ffill

Forward-fill missing time-series values within groups to handle irregular measurement intervals.

Best for: IoT sensor data

#pandas#ffill

SQLModel CRUD Patterns

Use SQLModel (SQLAlchemy + Pydantic) to define models and run type-safe CRUD operations.

Best for: FastAPI backends

#sqlmodel#orm

Pandas Pivot Table Summary

Create multi-level summary pivot tables from transactional data using pd.pivot_table.

Best for: sales reporting

#pandas#pivot-table

Fast JSON Serialisation with orjson

Use orjson for 5-10x faster JSON serialisation of large Python dicts, dataclasses, and NumPy arrays.

Best for: high-throughput serialisation

#orjson#json

Pandas Named Aggregations

Use named aggregations in groupby().agg() to produce readable, self-documenting summary tables.

Best for: HR reporting

Read Multi-Sheet Excel Files

Load, merge, and process data from multiple Excel sheets using the pandas ExcelFile context manager.

Best for: Excel ETL

#pandas#excel

Stratified Sampling with pandas

Draw a stratified random sample from a DataFrame, preserving class proportions for ML splits.

Best for: ML dataset splitting

#pandas#sampling

Polars Join Strategies

Perform inner, left, cross, and anti joins in Polars with optimal join strategies.

Best for: data enrichment

#polars#join

Ibis Portable DataFrame SQL

Write backend-agnostic analytics queries with Ibis that compile to DuckDB, BigQuery, or Spark.

Best for: portable analytics

#ibis#duckdb

DuckDB In-Memory Analytics

Run fast analytical SQL on pandas DataFrames or Parquet files without a server using DuckDB.

Best for: serverless analytics

#duckdb#analytics

Custom Pandas Accessor Extension

Create a reusable @pd.api.extensions.register_dataframe_accessor for domain-specific DataFrame methods.

Best for: domain-specific pandas

#pandas#extension

Timezone-Aware Timestamps in pandas

Convert naive timestamps to timezone-aware, handle DST transitions, and localise to UTC.

Best for: global event logs

#pandas#datetime

Polars Expressions API Patterns

Use the Polars expression API for complex column-level transformations without apply or loops.

Best for: column transformations

#polars#expressions

Matplotlib Charts for Data Pipelines

Generate and save charts programmatically from pipeline output using matplotlib.

Best for: automated reports

#matplotlib#visualisation

Pandas Cross-Tabulation (crosstab)

Compute frequency and proportion cross-tabulations between two categorical columns.

Best for: categorical analysis

#pandas#crosstab

Pydantic Settings for Pipeline Config

Manage pipeline configuration from environment variables and .env files with Pydantic Settings.

Best for: pipeline configuration

#pydantic#config

Pandas read_csv with Explicit Dtypes

Specify column dtypes on CSV read to avoid costly inference and prevent silent type coercion.

Best for: fast CSV loading

#pandas#csv

Polars Lazy Scan of Parquet Files

Use Polars scan_parquet with predicate and projection pushdown for fast Parquet analytics.

Best for: lakehouse queries

#polars#parquet

Pareto / Cumulative Share Analysis

Calculate cumulative share (Pareto 80/20) of values for product or customer ranking analysis.

Best for: product analytics

#pandas#pareto

Pandas Apply with Chunked Progress

Apply a function to a large DataFrame in chunked batches to avoid memory spikes and track progress.

Best for: memory-safe transforms

#pandas#apply

Pandas Merge with Validation

Use the validate parameter of merge() to catch unexpected many-to-many relationships or duplicate keys in joins.

Best for: data integrity

#pandas#merge

Pandera @check_input and @check_output

Decorate pipeline functions with Pandera schema validators to enforce input and output contracts.

Best for: contract testing

#pandera#validation

NumPy Structured Arrays for Records

Use NumPy structured arrays to store heterogeneous record types efficiently without pandas overhead.

Best for: binary record storage

#numpy#structured-array

Pandas Conditional Join with merge + query

Perform range/conditional joins by merging on a common key and filtering with query expressions.

Best for: session attribution

#pandas#conditional-join

Polars String Operations

Use the Polars .str namespace for fast, vectorised string cleaning and extraction.

Best for: data cleaning

#polars#strings

Pandas GroupBy Transform Patterns

Use groupby().transform() to compute group-level statistics and broadcast them back to row level.

Best for: feature engineering

Bulk Load CSV into PostgreSQL with COPY

Use psycopg2's copy_expert for the fastest possible bulk CSV load into a PostgreSQL table.

Best for: high-speed bulk loads

#psycopg2#postgres

Pandas Business Day Offsets

Compute business-day-adjusted dates using pandas offsets for financial and SLA calculations.

Best for: financial calendars

#pandas#datetime

Detect & Remove Pandas Duplicates

Find, count, and remove duplicate rows with flexible keep strategy and composite key support.

Best for: data cleaning

#pandas#duplicates

Efficient One-Hot Pivot with Sparse

Create a sparse user-item matrix from transaction logs for recommendation or ML use cases.

Best for: recommendation systems

#pandas#sparse

Pandas PeriodIndex for Fiscal Calendars

Use PeriodIndex for fiscal period arithmetic, aggregations, and comparisons beyond datetime.

Best for: fiscal reporting

#pandas#period

Compare Two DataFrames for Changes

Detect row-level additions, deletions, and modifications between two DataFrame snapshots.

Best for: change data capture

#pandas#diff

Cumulative Max & Streak Detection

Detect streaks, new highs, and consecutive-day patterns in time-series using cummax and groupby.

Best for: sports analytics

#pandas#cummax

Read NDJSON / JSON Lines Files

Efficiently read newline-delimited JSON (NDJSON) log files into a pandas DataFrame.

Best for: log file ingestion

#pandas#ndjson

attrs Classes as Immutable Pipeline Records

Use attrs to create fast, validated, immutable record types for data pipeline stage outputs.

Best for: typed pipeline records

#attrs#data-modeling

Polars Pivot and Unpivot

Reshape a Polars DataFrame from long to wide (pivot) and wide to long (unpivot/melt).

Best for: data reshaping

#polars#pivot

Detect Overlapping Date Intervals

Identify overlapping time periods in a DataFrame (e.g., booking conflicts or subscription overlaps).

Best for: scheduling conflicts

#pandas#intervals

Pandas SwapLevel MultiIndex

Swap and sort MultiIndex levels in a hierarchical DataFrame for flexible aggregation.

Best for: hierarchical reporting

#pandas#multiindex

Pandas Rank with Tie-Breaking Methods

Apply different ranking strategies (min, dense, average) and handle ties in pandas.

Best for: leaderboards

#pandas#ranking

Pandas nlargest / nsmallest

Efficiently retrieve the N largest or smallest rows without sorting the full DataFrame.

Best for: top-N queries

#pandas#top-n

Pandas Datetime Component Extraction

Extract year, month, day, hour, day-of-week and other components from a datetime column.

Best for: time-based features

#pandas#datetime

DataFrame to Dict Records

Convert DataFrames to lists of dicts for API responses, JSON export, or further processing.

Best for: API serialization

#pandas#records

Pandas Cartesian Feature Interaction

Generate pairwise feature interactions for ML by creating cross-product columns.

Best for: ML feature engineering

#pandas#feature-engineering

Pandas Rolling Correlation

Compute rolling Pearson correlation between two columns to detect shifting relationships over time.

Best for: regime detection

#pandas#rolling

Expand JSON Column into DataFrame Columns

Parse a JSON-string column and expand its keys into separate columns in one step.

Best for: JSON column expansion

#pandas#json

Pandas Forward Fill & Backward Fill

Propagate non-null values forward and backward to fill gaps in time-series or sparse data.

Best for: gap filling

#pandas#ffill

Value Counts with Normalisation

Compute frequency distributions and percentage breakdowns of categorical columns.

Best for: data profiling

#pandas#value-counts