Data Engineering
Data pipelines, transformations, ETL, and big data processing snippets.
200 snippets
Pandas DataFrame Transformations
Common pandas DataFrame transformations including column operations, type casting, and string methods.
Best for: Cleaning raw data files for analysis
Pandas DataFrame Filtering Techniques
Filter DataFrames using boolean masks, query syntax, isin, between, and string matching methods.
Best for: Extracting subsets of data for reporting
Pandas GroupBy Aggregation Examples
GroupBy operations with multiple aggregations, named aggregations, and transform for DataFrame analysis.
Best for: Sales reporting by region and time period
Python ETL Pipeline Example
Complete extract-transform-load pipeline with error handling, logging, and incremental processing.
Best for: Automating data ingestion from CSV to warehouse
Apache Airflow DAG Example
Airflow DAG with task dependencies, retries, an SLA, and PythonOperator for a daily data pipeline.
Best for: Orchestrating daily data pipelines
Spark SQL Query Example
PySpark DataFrame operations with SQL queries, window functions, and aggregations for big data.
Best for: Processing large-scale datasets with Spark
Python Batch Processing Script
Process large files in configurable batches with progress tracking, error handling, and resume support.
Best for: Processing large CSV files that don't fit in memory
Nested JSON Flattening in Python
Flatten deeply nested JSON structures into flat dictionaries suitable for DataFrames or CSV export.
Best for: Converting API responses to flat tables
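A minimal sketch of the flattening idea described above, using only the standard library. The function name `flatten`, the dot separator, and the choice to index list elements by position are assumptions made for illustration, not the catalogued snippet itself.

```python
from __future__ import annotations

from typing import Any


def flatten(obj: Any, parent_key: str = "", sep: str = ".") -> dict[str, Any]:
    """Flatten nested dicts and lists into a single-level dict.

    Nested keys are joined with `sep`; list elements use their index as key.
    Empty dicts and empty lists contribute no keys (an assumption of this sketch).
    """
    items: dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
            items.update(flatten(value, new_key, sep))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            new_key = f"{parent_key}{sep}{index}" if parent_key else str(index)
            items.update(flatten(value, new_key, sep))
    else:
        # Leaf value: store it under the accumulated key path.
        items[parent_key] = obj
    return items


# Hypothetical API response used only for illustration.
record = {"id": 1, "user": {"name": "Ada", "tags": ["a", "b"]}}
flat = flatten(record)
```

Each flattened dict can then be written as one CSV row or collected into a list to build a DataFrame.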
Python CSV Processing Examples
Read, write, and transform CSV files using the csv module and pandas with encoding and dialect handling.
Best for: Reading and cleaning CSV data files
Data Validation with Pydantic
Validate and parse data records using Pydantic models with custom validators and error reporting.
Best for: Validating incoming data before warehouse loading
Retry Logic for Data Pipelines
Configurable retry decorator with exponential backoff and jitter for resilient data pipeline tasks.
Best for: Resilient API calls in data pipelines
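One possible shape for such a decorator is sketched below with the standard library only. The parameter names (`max_attempts`, `base_delay`, `max_delay`, `sleep`) are hypothetical, and the "full jitter" strategy (sleeping a random time between zero and the capped exponential delay) is one common choice among several.

```python
import functools
import random
import time


def retry(max_attempts=3, base_delay=0.5, max_delay=30.0,
          exceptions=(Exception,), sleep=time.sleep):
    """Retry a function with exponential backoff and full jitter.

    `sleep` is injectable so the delay can be stubbed out in tests.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # Out of attempts: surface the last error.
                    # Exponential backoff: base, 2*base, 4*base, ... capped.
                    delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                    # Full jitter: pick a random wait in [0, delay].
                    sleep(random.uniform(0, delay))
        return wrapper
    return decorator


# Illustration: a hypothetical task that fails twice, then succeeds.
calls = []
delays = []


@retry(max_attempts=3, base_delay=0.01, exceptions=(ConnectionError,),
       sleep=delays.append)
def flaky_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky_fetch()
```

Restricting `exceptions` to transient error types keeps the decorator from retrying on bugs such as a `TypeError`.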
Database Sync Script in Python
Sync data between two databases with upsert logic, batch processing, and change detection.
Best for: Replicating data between databases
SQL Incremental Load Pattern
Incremental data load using watermark tracking to process only new and updated records efficiently.
Best for: Efficient warehouse loading without full reloads
SQL Data Deduplication Techniques
Remove duplicate records using ROW_NUMBER, DISTINCT ON, and self-join deduplication strategies.
Best for: Cleaning duplicate records in production databases
Databricks Notebook Data Pipeline
Databricks notebook with Delta Lake reads, transformations, merge operations, and table optimization.
Best for: Medallion architecture data pipelines on Databricks
Python Streaming Data Processing
Process streaming data with generators, windowed aggregation, and memory-efficient line-by-line reading.
Best for: Processing large event log files efficiently
SQL Window Functions for Analytics
Advanced SQL window functions for running totals, rankings, moving averages, and gap analysis.
Best for: Building analytics dashboards with running totals
SQL Schema Migration Pattern
Versioned schema migration scripts with forward and rollback support for database evolution.
Best for: Managing database schema changes across environments
Pandas Merge and Join Examples
Combine DataFrames using merge, join, and concat with different join types and key handling.
Best for: Combining data from multiple sources
dbt Model with Tests and Schema
A dbt SQL model with incremental materialization, schema tests, and source freshness checks.
Best for: Building analytics data models with dbt
Parquet File Read and Write in Python
Read and write Parquet files with pandas and PyArrow including partitioning and schema control.
Best for: Efficient columnar storage for analytics data
Change Data Capture Pattern in SQL
Implement change data capture with trigger-based auditing to track inserts, updates, and deletes.
Best for: Tracking all data changes for audit compliance
Python Data Profiling Script
Generate a data quality profile report with null counts, distributions, and anomaly detection.
Best for: Automated data quality reporting
Pandas Pivot and Unpivot Reshaping
Reshape DataFrames between wide and long formats using pivot, melt, and stack operations.
Best for: Reshaping data for reporting dashboards
Slowly Changing Dimension Type 2 in SQL
Implement SCD Type 2 to track historical changes in dimension tables with effective date ranges.
Best for: Tracking customer attribute changes over time
Data Quality Testing with Expectations
Define and run data quality expectations for automated validation in data pipelines.
Best for: Automated data quality gates in pipelines
Pandas Time Series Analysis
Time series operations with resampling, rolling windows, date offsets, and period conversions.
Best for: Sales trend analysis with moving averages
SQL Data Lineage Tracking
Track data lineage across ETL stages with metadata logging for debugging and audit trails.
Best for: Tracing data flow across pipeline stages
Pandas Null Handling Strategies
Comprehensive strategies for detecting, filling, and handling missing values in pandas DataFrames.
Best for: Cleaning datasets with missing values
SQL Window Functions for Analytics
Use window functions for running totals, rankings, moving averages, and gap detection in analytics.
Best for: Building cumulative revenue dashboards
Read Large CSV in Chunks with Pandas
Process CSV files larger than RAM by reading in chunks — memory-efficient ETL pattern for data pipelines.
Best for: Processing multi-GB CSV files without running out of memory
Polars Lazy Query — Fast DataFrame Processing
Use Polars lazy evaluation for high-performance data transformations that can outperform pandas on many workloads.
Best for: High-performance data processing replacing pandas
PySpark DataFrame — Filter and Aggregate
Common PySpark DataFrame operations: filter, group by, window functions, and write to Parquet.
Best for: Large-scale data aggregation on distributed clusters
Airflow DAG with Python Operators
Create an Apache Airflow DAG with task dependencies, retries, and XCom data passing between tasks.
Best for: Orchestrating daily ETL pipelines
dbt Incremental Model Pattern
Build efficient dbt incremental models that process only new or changed data instead of full refreshes.
Best for: Efficient data warehouse builds processing only deltas
SQL Data Quality Checks and Assertions
Reusable SQL queries for data quality: null checks, uniqueness, referential integrity, and freshness.
Best for: Automated data quality gates in ETL pipelines
Snowflake MERGE with Slowly Changing Dimension
Implement SCD Type 2 in Snowflake using MERGE to track historical changes in dimension tables.
Best for: Tracking full history of dimension changes
dbt Source Freshness and Testing
Configure dbt source freshness checks and schema tests to validate upstream data pipelines.
Best for: Ensuring upstream data sources are fresh
SQL Running Totals and Cumulative Metrics
Calculate running totals, cumulative counts, and percent-of-total using window functions and partitions.
Best for: Building cumulative revenue dashboards
Bash ETL Pipeline Script
Build a complete ETL script in Bash with logging, error handling, notifications, and idempotent runs.
Best for: Automating daily data extract and load jobs
Cron Data Sync — Database to S3
Automated script to export database tables to compressed CSV and sync to S3 on a schedule.
Best for: Nightly database exports to cloud storage
Spark Submit — Job Launcher Script
Launch PySpark jobs with spark-submit including cluster configuration, dependencies, and monitoring.
Best for: Launching PySpark batch jobs on YARN clusters
dbt Run and Test — CI/CD Pipeline Script
Bash script for running dbt build with testing, documentation generation, and failure notifications.
Best for: Automating dbt builds in CI/CD pipelines
Kafka Consumer in Python — Stream Processing
Build a Kafka consumer in Python with offset management, error handling, and batch processing.
Best for: Real-time event processing from Kafka topics
Kafka Topic — Create and Manage with CLI
Create, describe, alter, and manage Kafka topics using the kafka-topics CLI with partitioning config.
Best for: Setting up Kafka topics for new data streams
DuckDB — Query Parquet Files with Python
Use DuckDB to query Parquet files and CSVs directly from Python without loading them into memory first.
Best for: Ad-hoc analytics on Parquet files without Spark
PostgreSQL COPY — Fast CSV Import
Use PostgreSQL COPY command for high-speed bulk data loading from CSV files with error handling.
Best for: High-speed bulk data loading into PostgreSQL
BigQuery — Partitioned and Clustered Tables
Create BigQuery tables with time partitioning and clustering for optimal query performance and cost.
Best for: Optimizing BigQuery costs with partition pruning
Bash Pipeline Monitoring and Alerting
Monitor data pipeline health with row counts, runtime tracking, SLA checks, and Slack alerting.
Best for: Monitoring data pipeline health and freshness
Database Backup and Restore to S3
Automated PostgreSQL backup script with compression, S3 upload, retention policy, and restore commands.
Best for: Automated daily database backups to S3
Polars DataFrame
Data science technique: polars-dataframe
Best for: machine learning
Dask Distributed
Data science technique: dask-distributed
Best for: machine learning
Vaex Big Data
Data science technique: vaex-big-data
Best for: machine learning
Modin Parallel
Data science technique: modin-parallel
Best for: machine learning
Scikit-learn Pipeline
Data science technique: scikit-learn-pipeline
Best for: machine learning
Scikit-learn Preprocessing
Data science technique: scikit-learn-preprocessing
Best for: machine learning
Feature Engineering
Data science technique: feature-engineering
Best for: machine learning
Feature Selection
Data science technique: feature-selection
Best for: machine learning
Dimensionality Reduction
Data science technique: dimensionality-reduction
Best for: machine learning
PCA Analysis
Data science technique: pca-analysis
Best for: machine learning
t-SNE Visualization
Data science technique: tsne-visualization
Best for: machine learning
K-Means Clustering
Data science technique: clustering-kmeans
Best for: machine learning
Hierarchical Clustering
Data science technique: hierarchical-clustering
Best for: machine learning
DBSCAN Clustering
Data science technique: dbscan-clustering
Best for: machine learning
GMM Clustering
Data science technique: gmm-clustering
Best for: machine learning
Linear Regression
Data science technique: regression-linear
Best for: machine learning
Logistic Regression
Data science technique: regression-logistic
Best for: machine learning
Ridge Regression
Data science technique: regression-ridge
Best for: machine learning
Lasso Regression
Data science technique: regression-lasso
Best for: machine learning
SVM Classification
Data science technique: svm-classification
Best for: machine learning
Random Forest
Data science technique: random-forest
Best for: machine learning
Gradient Boosting
Data science technique: gradient-boosting
Best for: machine learning
XGBoost Advanced
Data science technique: xgboost-advanced
Best for: machine learning
Neural Networks Basics
Data science technique: neural-networks-basics
Best for: machine learning
Keras TensorFlow
Data science technique: keras-tensorflow
Best for: machine learning
PyTorch Training
Data science technique: pytorch-training
Best for: machine learning
JAX Numerical
Data science technique: jax-numerical
Best for: machine learning
Cross Validation Advanced
Data science technique: cross-validation-advanced
Best for: machine learning
Hyperparameter Tuning
Data science technique: hyperparameter-tuning
Best for: machine learning
Grid Search
Data science technique: grid-search
Best for: machine learning
Bayesian Optimization
Data science technique: bayesian-optimization
Best for: machine learning
Ensemble Methods
Data science technique: ensemble-methods
Best for: machine learning
Voting Classifiers
Data science technique: voting-classifiers
Best for: machine learning
Stacking Models
Data science technique: stacking-models
Best for: machine learning
Blending Predictions
Data science technique: blending-predictions
Best for: machine learning
Time Series Forecasting
Data science technique: time-series-forecasting
Best for: machine learning
Anomaly Detection
Data science technique: anomaly-detection
Best for: machine learning
Imbalanced Learning with SMOTE
Data science technique: imbalanced-learning-smote
Best for: machine learning
Target Encoding
Data science technique: target-encoding
Best for: machine learning
Feature Store Basics
Data science technique: feature-store-basics
Best for: machine learning
Model Calibration
Data science technique: model-calibration
Best for: machine learning
Confusion Matrix Analysis
Data science technique: confusion-matrix-analysis
Best for: machine learning
Precision Recall Curve
Data science technique: precision-recall-curve
Best for: machine learning
ROC AUC Evaluation
Data science technique: roc-auc-evaluation
Best for: machine learning
Uplift Modeling
Data science technique: uplift-modeling
Best for: machine learning
Recommendation Collaborative Filtering
Data science technique: recommendation-collaborative-filtering
Best for: machine learning
NLP TF-IDF
Data science technique: nlp-tfidf
Best for: machine learning
Topic Modeling with LDA
Data science technique: topic-modeling-lda
Best for: machine learning
Survival Analysis
Data science technique: survival-analysis
Best for: machine learning
Causal Inference
Data science technique: causal-inference
Best for: machine learning
Polars DataFrame Operations
High-performance DataFrame operations using Polars: filtering, groupby, joins, and lazy evaluation.
Best for: data transformation
Dask Parallel DataFrame Processing
Process datasets larger than RAM using Dask's parallel, lazy DataFrame API.
Best for: out-of-core processing
SQLAlchemy Bulk Insert with Upsert
Efficiently bulk-insert rows with conflict resolution using SQLAlchemy Core and PostgreSQL.
Best for: idempotent loads
Pandas Vectorised Operations vs Apply
Compare apply vs vectorised pandas operations for performance-critical column transformations.
Best for: feature engineering
Pandas Rolling & Expanding Windows
Compute moving averages, rolling sums, and cumulative stats on time-series data with pandas.
Best for: sales forecasting
PyArrow Schema Enforcement
Define and enforce strict schemas on columnar data using PyArrow before writing to Parquet.
Best for: data lake storage
Great Expectations Data Quality Suite
Define and run a Great Expectations validation suite to catch data quality issues early.
Best for: CI data validation
Pandas MultiIndex Stack & Unstack
Work with hierarchical MultiIndex DataFrames: pivoting with stack/unstack and cross-sectional slicing.
Best for: panel data
Prefect ETL Flow with Tasks
Define a Prefect 2 flow with typed tasks, retries, and structured logging for ETL pipelines.
Best for: ETL orchestration
Pandas Memory Reduction via Dtypes
Reduce DataFrame memory by 60-80% by downcasting numeric types and using categorical columns.
Best for: large dataset loading
Generate Synthetic Data with Faker
Create realistic test datasets for development and testing using the Faker library.
Best for: test data generation
SQLite + Pandas Local Data Pipeline
Run a lightweight local ETL with SQLite and pandas: load CSV, transform, persist to SQLite.
Best for: local analytics
Multiprocessing Pool for ETL
Parallelise CPU-bound ETL transformations across multiple CPU cores using multiprocessing.Pool.
Best for: parallel file processing
Pandera DataFrame Schema Validation
Use Pandera to validate DataFrame schemas with type checks, value constraints, and custom checks.
Best for: pipeline input validation
PySpark Window Functions
Use PySpark window functions for running totals, rank, lag/lead, and percentile computations.
Best for: sales analytics
Redis Cache-Aside Pattern in Python
Implement cache-aside (lazy loading) with Redis and Python to accelerate repeated database queries.
Best for: query caching
Kafka Producer & Consumer in Python
Produce and consume JSON messages from Apache Kafka using the confluent-kafka Python client.
Best for: event streaming
Pandas Categorical Encoding for ML
One-hot encode, label encode, and ordinal encode categorical columns using pandas and scikit-learn.
Best for: ML preprocessing
Pandas String Operations
Clean, extract, and transform string columns using pandas .str accessor methods.
Best for: data cleaning
SQLAlchemy Async Session with asyncpg
Use SQLAlchemy 2.0 async sessions with asyncpg for non-blocking database access in async pipelines.
Best for: async web services
Pydantic Models for ETL Validation
Parse and validate raw JSON records against Pydantic models before inserting into a database.
Best for: input validation
Flatten Nested JSON with pandas
Use pd.json_normalize to flatten deeply nested API responses into a flat DataFrame.
Best for: API response flattening
Statistical Analysis with SciPy
Run hypothesis tests, correlations, and descriptive statistics on dataset columns with SciPy.
Best for: A/B testing
Structured Logging for Data Pipelines
Use Loguru to emit structured JSON logs with contextual fields from ETL pipeline stages.
Best for: pipeline observability
Delta Lake MERGE with Python
Perform ACID upserts on a Delta Lake table using the delta-rs Python binding.
Best for: lakehouse upserts
Scikit-learn Feature Pipeline
Build a reproducible ML feature pipeline with ColumnTransformer, StandardScaler, and OneHotEncoder.
Best for: ML preprocessing
Pandas Method Chaining with .pipe()
Use the .pipe() method to create clean, readable pandas transformation chains.
Best for: clean ETL code
Read Files from S3 with fsspec
Access S3 files directly with fsspec and pandas without boto3 boilerplate.
Best for: cloud data access
Build a Data Lineage Graph with NetworkX
Track and visualise data lineage across ETL pipeline stages using a directed graph.
Best for: data governance
PySpark Structured Streaming from Kafka
Consume a Kafka topic in real-time with PySpark Structured Streaming and write to Parquet.
Best for: real-time ETL
Pandas Explode List Column
Explode a column containing lists into separate rows, useful for normalising one-to-many relations.
Best for: array column expansion
Stream Large SQL Query in Chunks
Read millions of rows from SQL in memory-safe chunks using pandas read_sql with chunksize.
Best for: large table extraction
Dataclasses as Pipeline Data Models
Use Python dataclasses to define typed, immutable data models passed between pipeline stages.
Best for: typed pipeline stages
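A small sketch of what a typed, immutable record passed between stages might look like. The `OrderRecord` fields and the `apply_discount` stage are hypothetical names chosen for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from datetime import date


@dataclass(frozen=True)
class OrderRecord:
    """Immutable record handed from one pipeline stage to the next."""
    order_id: int
    customer: str
    amount: float
    order_date: date


def apply_discount(record: OrderRecord, rate: float) -> OrderRecord:
    """Transform stage: return a new record instead of mutating the input."""
    return replace(record, amount=round(record.amount * (1 - rate), 2))


raw = OrderRecord(order_id=1, customer="acme", amount=100.0,
                  order_date=date(2024, 1, 1))
discounted = apply_discount(raw, rate=0.1)

# Attempting to mutate a frozen instance raises FrozenInstanceError.
try:
    raw.amount = 0.0
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because `frozen=True` blocks attribute assignment, every stage has to return a new record, which makes it easier to reason about what each stage changed.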
Pandas Time-Series Resampling
Resample time-series data from daily to weekly/monthly frequencies with aggregation functions.
Best for: time-series analytics
Concat & Deduplicate DataFrames
Merge multiple DataFrames and remove duplicates by composite key for clean data consolidation.
Best for: data consolidation
Pandas .query() for Readable Filters
Use DataFrame.query() with expressions for cleaner, SQL-like row filtering syntax.
Best for: data filtering
NumPy Advanced Indexing Patterns
Use fancy indexing, boolean masks, and np.where for fast array transformations without loops.
Best for: numerical computing
OLS Regression with statsmodels
Fit and interpret an Ordinary Least Squares regression model with diagnostics using statsmodels.
Best for: econometric analysis
Pandas .eval() for Fast Column Computation
Use DataFrame.eval() for expressive, fast in-place column calculations using numexpr.
Best for: large DataFrame operations
Pandas merge_asof for Time-Based Joins
Perform an as-of join to match events to the most recent reference record within a time window.
Best for: tick data joins
Pandas Category Dtype Optimization
Convert string columns to categorical dtype to dramatically reduce memory and speed up groupby.
Best for: memory optimization
Pandas Wide to Long (melt)
Transform a wide-format DataFrame into long format using pd.melt for analytics and visualisation.
Best for: pivot table conversion
Row Fingerprinting with hashlib
Generate deterministic hash fingerprints for each row to detect changes in incremental loads.
Best for: change data capture
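The fingerprinting idea can be sketched as follows, assuming rows arrive as plain dicts. The helper name `row_fingerprint`, the choice of SHA-256, and the JSON canonicalisation are assumptions for illustration; any stable serialisation and hash would serve.

```python
import hashlib
import json


def row_fingerprint(row: dict, columns=None) -> str:
    """Return a deterministic SHA-256 hex digest for one row.

    Keys are sorted so the digest does not depend on dict ordering;
    `columns` optionally restricts the hash to a subset of fields.
    """
    keys = sorted(columns if columns is not None else row)
    payload = json.dumps(
        [[key, row.get(key)] for key in keys],
        default=str,            # Fall back to str() for dates, Decimals, etc.
        separators=(",", ":"),  # Compact, stable formatting.
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical rows: same content in a different key order, then a changed value.
old = {"id": 7, "email": "a@example.com", "plan": "free"}
same = {"plan": "free", "id": 7, "email": "a@example.com"}
new = {"id": 7, "email": "a@example.com", "plan": "pro"}

unchanged = row_fingerprint(old) == row_fingerprint(same)
changed = row_fingerprint(old) != row_fingerprint(new)
```

In an incremental load, the stored fingerprint for each key can be compared with the fingerprint of the incoming row, and only rows whose digest differs need to be written.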
Pandas Custom Aggregation Functions
Pass custom lambda and named functions to .agg() for complex groupby aggregations.
Best for: HR analytics
Tenacity Retry for Pipeline Resilience
Add exponential backoff retries to flaky data pipeline steps using Tenacity.
Best for: resilient API calls
PyArrow Dataset Scan with Predicate Pushdown
Scan a partitioned Parquet dataset with column pruning and row-level predicate pushdown via PyArrow.
Best for: lakehouse queries
Pandas Styled DataFrame Report
Apply conditional formatting to a pandas DataFrame for styled HTML reports with highlighting.
Best for: executive reporting
tqdm Progress Bars in Data Pipelines
Add progress bars to pandas operations, loops, and concurrent futures with tqdm.
Best for: ETL monitoring
Property-Based Testing for Data Functions
Use Hypothesis to automatically generate edge-case test data for data transformation functions.
Best for: data function testing
Pandas .assign() for Immutable Chaining
Use DataFrame.assign() to add computed columns without mutating the original DataFrame.
Best for: immutable transforms
Async ETL Pipeline with asyncio
Run concurrent data fetches and transformations using asyncio.gather for high-throughput pipelines.
Best for: concurrent API ingestion
Pandas IntervalIndex for Binning
Use IntervalIndex and pd.cut to bin continuous variables into labelled categories.
Best for: grading systems
Grouped Time-Series with ffill
Forward-fill missing time-series values within groups to handle irregular measurement intervals.
Best for: IoT sensor data
SQLModel CRUD Patterns
Use SQLModel (SQLAlchemy + Pydantic) to define models and run type-safe CRUD operations.
Best for: FastAPI backends
Pandas Pivot Table Summary
Create multi-level summary pivot tables from transactional data using pd.pivot_table.
Best for: sales reporting
Fast JSON Serialisation with orjson
Use orjson for 5-10x faster JSON serialisation of large Python dicts, dataclasses, and NumPy arrays.
Best for: high-throughput serialisation
Pandas Named Aggregations
Use named aggregations in groupby().agg() to produce readable, self-documenting summary tables.
Best for: HR reporting
Read Multi-Sheet Excel Files
Load, merge, and process data from multiple Excel sheets using the pandas ExcelFile context manager.
Best for: Excel ETL
Stratified Sampling with pandas
Draw a stratified random sample from a DataFrame, preserving class proportions for ML splits.
Best for: ML dataset splitting
Polars Join Strategies
Perform inner, left, cross, and anti joins in Polars with optimal join strategies.
Best for: data enrichment
Ibis Portable DataFrame SQL
Write backend-agnostic analytics queries with Ibis that compile to DuckDB, BigQuery, or Spark.
Best for: portable analytics
DuckDB In-Memory Analytics
Run fast analytical SQL on pandas DataFrames or Parquet files without a server using DuckDB.
Best for: serverless analytics
Custom Pandas Accessor Extension
Create a reusable @pd.api.extensions.register_dataframe_accessor for domain-specific DataFrame methods.
Best for: domain-specific pandas
Timezone-Aware Timestamps in pandas
Convert naive timestamps to timezone-aware, handle DST transitions, and localise to UTC.
Best for: global event logs
Polars Expressions API Patterns
Use Polars expression API for complex column-level transformations without apply or loops.
Best for: column transformations
Matplotlib Charts for Data Pipelines
Generate and save charts programmatically from pipeline output using matplotlib.
Best for: automated reports
Pandas Cross-Tabulation (crosstab)
Compute frequency and proportion cross-tabulations between two categorical columns.
Best for: categorical analysis
Pydantic Settings for Pipeline Config
Manage pipeline configuration from environment variables and .env files with Pydantic Settings.
Best for: pipeline configuration
Pandas read_csv with Explicit Dtypes
Specify column dtypes on CSV read to avoid costly inference and prevent silent type coercion.
Best for: fast CSV loading
Polars Lazy Scan of Parquet Files
Use Polars scan_parquet with predicate and projection pushdown for fast Parquet analytics.
Best for: lakehouse queries
Pareto / Cumulative Share Analysis
Calculate cumulative share (Pareto 80/20) of values for product or customer ranking analysis.
Best for: product analytics
Pandas Apply with Chunked Progress
Apply a function to a large DataFrame in chunked batches to avoid memory spikes and track progress.
Best for: memory-safe transforms
Pandas Merge with Validation
Use merge() validate parameter to catch unexpected many-to-many or missing key issues in joins.
Best for: data integrity
Pandera @check_input and @check_output
Decorate pipeline functions with Pandera schema validators to enforce input and output contracts.
Best for: contract testing
NumPy Structured Arrays for Records
Use NumPy structured arrays to store heterogeneous record types efficiently without pandas overhead.
Best for: binary record storage
Pandas Conditional Join with merge + query
Perform range/conditional joins by merging on a common key and filtering with query expressions.
Best for: session attribution
Polars String Operations
Use the Polars .str namespace for fast, vectorised string cleaning and extraction.
Best for: data cleaning
Pandas GroupBy Transform Patterns
Use groupby().transform() to compute group-level statistics and broadcast them back to row level.
Best for: feature engineering
Bulk Load CSV into PostgreSQL with COPY
Use psycopg2's copy_expert for fast bulk CSV loads into a PostgreSQL table.
Best for: high-speed bulk loads
Pandas Business Day Offsets
Compute business-day-adjusted dates using pandas offsets for financial and SLA calculations.
Best for: financial calendars
Detect & Remove Pandas Duplicates
Find, count, and remove duplicate rows with flexible keep strategy and composite key support.
Best for: data cleaning
Efficient One-Hot Pivot with Sparse
Create a sparse user-item matrix from transaction logs for recommendation or ML use cases.
Best for: recommendation systems
Pandas PeriodIndex for Fiscal Calendars
Use PeriodIndex for fiscal period arithmetic, aggregations, and comparisons beyond datetime.
Best for: fiscal reporting
Compare Two DataFrames for Changes
Detect row-level additions, deletions, and modifications between two DataFrame snapshots.
Best for: change data capture
Cumulative Max & Streak Detection
Detect streaks, new highs, and consecutive-day patterns in time-series using cummax and groupby.
Best for: sports analytics
Read NDJSON / JSON Lines Files
Efficiently read newline-delimited JSON (NDJSON) log files into a pandas DataFrame.
Best for: log file ingestion
attrs Classes as Immutable Pipeline Records
Use attrs to create fast, validated, immutable record types for data pipeline stage outputs.
Best for: typed pipeline records
Polars Pivot and Unpivot
Reshape a Polars DataFrame from long to wide (pivot) and wide to long (unpivot/melt).
Best for: data reshaping
Detect Overlapping Date Intervals
Identify overlapping time periods in a DataFrame (e.g., booking conflicts or subscription overlaps).
Best for: scheduling conflicts
Pandas SwapLevel MultiIndex
Swap and sort MultiIndex levels in a hierarchical DataFrame for flexible aggregation.
Best for: hierarchical reporting
Pandas Rank with Tie-Breaking Methods
Apply different ranking strategies (min, dense, average) and handle ties in pandas.
Best for: leaderboards
Pandas nlargest / nsmallest
Efficiently retrieve the N largest or smallest rows without sorting the full DataFrame.
Best for: top-N queries
Pandas Datetime Component Extraction
Extract year, month, day, hour, day-of-week and other components from a datetime column.
Best for: time-based features
DataFrame to Dict Records
Convert DataFrames to lists of dicts for API responses, JSON export, or further processing.
Best for: API serialization
Pandas Cartesian Feature Interaction
Generate pairwise feature interactions for ML by creating cross-product columns.
Best for: ML feature engineering
Pandas Rolling Correlation
Compute rolling Pearson correlation between two columns to detect shifting relationships over time.
Best for: regime detection
Expand JSON Column into DataFrame Columns
Parse a JSON-string column and expand its keys into separate columns in one step.
Best for: JSON column expansion
Pandas Forward Fill & Backward Fill
Propagate non-null values forward and backward to fill gaps in time-series or sparse data.
Best for: gap filling
Value Counts with Normalisation
Compute frequency distributions and percentage breakdowns of categorical columns.
Best for: data profiling
dbt Python Model with pandas
Write a dbt Python model that runs on Databricks/Snowpark to transform DataFrames in the warehouse.
Best for: dbt Python models