python / beginner
Parquet File Read and Write in Python
Read and write Parquet files with pandas and PyArrow including partitioning and schema control.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.DataFrame({
    "id": range(1, 1001),
    "name": [f"item_{i}" for i in range(1, 1001)],
    "value": [i * 1.5 for i in range(1, 1001)],
    "category": ["A", "B", "C", "D"] * 250,
})
# Write single Parquet file
df.to_parquet("output.parquet", engine="pyarrow", index=False)
# Write partitioned Parquet dataset (one subdirectory per category value)
df.to_parquet(
    "output_partitioned/",
    engine="pyarrow",
    partition_cols=["category"],
    index=False,
)
# Read Parquet file
df_read = pd.read_parquet("output.parquet")
# Read specific columns only
df_partial = pd.read_parquet("output.parquet", columns=["id", "value"])
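# Read with a filter; because "category" is the partition column,
# non-matching partition directories are skipped entirely (partition pruning)
df_filtered = pd.read_parquet(
    "output_partitioned/",
    filters=[("category", "==", "A")],
)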
# Inspect schema without loading data
schema = pq.read_schema("output.parquet")
print(schema)
# Get row count without loading data
# Note: serialized_size is the size of the metadata footer, not of the whole file
meta = pq.read_metadata("output.parquet")
print(f"Rows: {meta.num_rows}, Metadata size: {meta.serialized_size} bytes")
Use Cases
- Efficient columnar storage for analytics data
- Partitioned data lakes for query performance
- Schema inspection and metadata analysis