pythonbeginner
DuckDB — Query Parquet Files with Python
Use DuckDB to query Parquet files and CSVs directly from Python without loading into memory first.
pythonPress ⌘/Ctrl + Shift + C to copy
import duckdb
# Query Parquet files directly (no loading required)
result = duckdb.sql("""
SELECT
region,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order
FROM 'data/orders/*.parquet'
WHERE order_date >= '2024-01-01'
GROUP BY region
ORDER BY total_revenue DESC
""").fetchdf()
print(result)
# Query CSV files
users = duckdb.sql("""
SELECT * FROM read_csv_auto('data/users.csv')
WHERE signup_date >= '2024-01-01'
LIMIT 100
""")
# Join Parquet + CSV
result = duckdb.sql("""
SELECT u.name, u.email, SUM(o.amount) AS total_spent
FROM 'data/orders/*.parquet' o
JOIN read_csv_auto('data/users.csv') u
ON o.user_id = u.id
GROUP BY u.name, u.email
HAVING total_spent > 1000
ORDER BY total_spent DESC
""")
# Export results
result.write_parquet("output/top_customers.parquet")
result.write_csv("output/top_customers.csv")
# Persistent database
con = duckdb.connect("analytics.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS summary AS SELECT * FROM result")
con.close()
# Query S3 directly (with httpfs extension)
# duckdb.sql("INSTALL httpfs; LOAD httpfs;")
# duckdb.sql("SELECT * FROM 's3://bucket/data/*.parquet' LIMIT 10")Use Cases
- Ad-hoc analytics on Parquet files without Spark
- Lightweight local data exploration
- Joining multiple file formats without ETL
Tags
Related Snippets
Similar patterns you can reuse in the same workflow.
pythonadvanced
Python ETL Pipeline Example
Complete extract-transform-load pipeline with error handling, logging, and incremental processing.
Best for: Automating data ingestion from CSV to warehouse
#etl#pipeline
pythonadvanced
Kafka Consumer in Python — Stream Processing
Build a Kafka consumer in Python with offset management, error handling, and batch processing.
Best for: Real-time event processing from Kafka topics
#kafka#streaming
pythonbeginner
DuckDB In-Memory Analytics
Run fast analytical SQL on pandas DataFrames or Parquet files without a server using DuckDB.
Best for: serverless analytics
#duckdb#analytics
pythonbeginner
Pandas DataFrame Transformations
Common pandas DataFrame transformations including column operations, type casting, and string methods.
Best for: Cleaning raw data files for analysis
#pandas#dataframe