python · advanced
PyArrow Dataset Scan with Predicate Pushdown
Scan a partitioned Parquet dataset with column pruning and row-level predicate pushdown via PyArrow.
import pyarrow.dataset as ds

# Discover the Hive-partitioned dataset (directories such as year=2024/).
dataset = ds.dataset('s3://my-bucket/warehouse/events/', format='parquet', partitioning='hive')

# Read only the listed columns; the filter is pushed down into the scan.
table = dataset.to_table(
    columns=['user_id', 'action', 'ts'],
    filter=(ds.field('year') == 2024) & (ds.field('action') == 'purchase'),
)

print(table.schema)
print(f'Rows read: {table.num_rows:,}')

Use Cases
- lakehouse queries
- column pruning
- efficient S3 reads