Parquet File Read and Write in Python

Read and write Parquet files with pandas and PyArrow, including partitioned datasets, column selection, row filtering, and schema and metadata inspection.

python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "id": range(1, 1001),
    "name": [f"item_{i}" for i in range(1, 1001)],
    "value": [i * 1.5 for i in range(1, 1001)],
    "category": ["A", "B", "C", "D"] * 250,
})

# Write single Parquet file
df.to_parquet("output.parquet", engine="pyarrow", index=False)

# Write partitioned Parquet dataset
df.to_parquet(
    "output_partitioned/",
    engine="pyarrow",
    partition_cols=["category"],
    index=False,
)

# Read Parquet file
df_read = pd.read_parquet("output.parquet")

# Read specific columns only
df_partial = pd.read_parquet("output.parquet", columns=["id", "value"])

# Read with a filter; on a partitioned dataset, a filter on the
# partition column means only matching directories are scanned
df_filtered = pd.read_parquet(
    "output_partitioned/",
    filters=[("category", "==", "A")],
)

# Inspect schema without loading data
schema = pq.read_schema("output.parquet")
print(schema)

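# Get row count from the file footer without loading data
meta = pq.read_metadata("output.parquet")
# serialized_size is the size of the footer metadata, not of the whole file
print(f"Rows: {meta.num_rows}, Metadata size: {meta.serialized_size} bytes")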

Use Cases

  • Efficient columnar storage for analytics data
  • Partitioned data lakes for query performance
  • Schema inspection and metadata analysis
