Python · Advanced

PyArrow Dataset Scan with Predicate Pushdown

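Scan a Hive-partitioned Parquet dataset with PyArrow, reading only the requested columns (column pruning) and pushing the filter down to the scan so that non-matching partitions and row groups can be skipped before any rows are materialized.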

python
import pyarrow.dataset as ds

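# Hive-style layout: partition values live in directory names such as year=2024/
# (this assumes `year` is a partition key of the dataset).
dataset = ds.dataset('s3://my-bucket/warehouse/events/', format='parquet', partitioning='hive')

# Only the listed columns are read; the filter is evaluated during the scan,
# so `year` can be filtered on without appearing in the output.
table = dataset.to_table(
    columns=['user_id', 'action', 'ts'],
    filter=(ds.field('year') == 2024) & (ds.field('action') == 'purchase'),
)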

print(table.schema)
print(f'Rows read: {table.num_rows:,}')

Use Cases

  • lakehouse queries
  • column pruning
  • efficient S3 reads
