python · intermediate

Polars Lazy Scan of Parquet Files

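Use Polars scan_parquet to build a lazy query over Parquet files. The optimizer pushes the filter (predicate pushdown) and the column selection (projection pushdown) into the scan, so row groups ruled out by the filter can be skipped and only the needed columns are read.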

python
import polars as pl

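# Build a lazy query; nothing is read until collect() is called.
lf = (
    pl.scan_parquet('s3://bucket/events/*.parquet')
    # Predicate pushdown: the filter is applied during the scan.
    .filter(
        (pl.col('year') == 2024) &
        (pl.col('action').is_in(['purchase', 'refund']))
    )
    # Projection pushdown: only these columns (plus 'year' for the filter) are read.
    .select(['user_id', 'action', 'amount', 'ts'])
    .group_by(['user_id', 'action'])
    .agg([
        pl.col('amount').sum().alias('total'),
        pl.col('ts').max().alias('last_event'),
    ])
)

# Run on the streaming engine so the data is processed in batches.
# Newer Polars releases deprecate streaming=True in favor of collect(engine='streaming').
result = lf.collect(streaming=True)
print(result.head())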

Use Cases

  • lakehouse queries
  • predicate pushdown
  • efficient Parquet reads
