Introduction
As a backend engineer, have you ever had this experience: the PM drops a 2 GB CSV file on you and says, "Can you pull last month's user behavior stats?" You open Python, import pandas as pd, call pd.read_csv(), and then your machine's fans spin up while memory usage climbs straight to 8 GB...
DuckDB exists to rescue exactly this scenario. It is an embedded analytical database: there is no server to install, it can query CSV, Parquet, and even pandas DataFrames directly with SQL, and it is typically several times faster than pandas.
In this article I'll start from installation and walk through what DuckDB can do for local data analysis.
What Is DuckDB?
DuckDB calls itself "the SQLite of analytics," and the positioning is spot on:
- Embedded: no server to install; like SQLite, it runs inside your program
- Columnar storage: optimized for analytical queries (aggregations, scans)
- Vectorized execution: processes data in batches, exploiting the CPU's SIMD instructions
- Direct file querying: CSV, Parquet, JSON, with no import step
Installation
# Python
pip install duckdb
# CLI tool
# macOS
brew install duckdb
# Linux
wget https://github.com/duckdb/duckdb/releases/download/v1.0.0/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
sudo mv duckdb /usr/local/bin/
Quick Start
import duckdb
# Query a CSV file directly (no import needed!)
result = duckdb.sql("""
SELECT
category,
COUNT(*) AS count,
AVG(price) AS avg_price,
SUM(quantity) AS total_qty
FROM 'sales_data.csv'
GROUP BY category
ORDER BY total_qty DESC
""")
print(result)
# Or use the CLI
# $ duckdb
# D SELECT COUNT(*) FROM 'sales_data.csv';
Yes, it really is that simple. No CREATE TABLE, no COPY; you put the file path straight into the SQL.
Querying Files Directly
CSV Files
import duckdb
# Basic query
duckdb.sql("SELECT * FROM 'data/users.csv' LIMIT 10").show()
# Auto-detect the delimiter and column types
duckdb.sql("""
SELECT * FROM read_csv_auto('data/users.csv',
header=true,
dateformat='%Y-%m-%d'
) LIMIT 10
""").show()
# Query multiple CSV files (glob patterns supported)
duckdb.sql("""
SELECT
filename,
COUNT(*) AS rows
FROM read_csv_auto('data/logs_*.csv', filename=true)
GROUP BY filename
""").show()
# Handle messy CSVs (specify column types explicitly)
duckdb.sql("""
SELECT * FROM read_csv('messy_data.csv',
columns={'id': 'INTEGER', 'name': 'VARCHAR', 'amount': 'DOUBLE'},
header=true,
ignore_errors=true, -- skip malformed rows
null_padding=true
)
""").show()
Parquet Files
Parquet is a columnar storage format, and it pairs with DuckDB for the best performance:
# Query Parquet directly
duckdb.sql("SELECT * FROM 'data/events.parquet' LIMIT 10").show()
# Parquet column pruning (only the needed columns are read, so it's extremely fast)
duckdb.sql("""
SELECT user_id, event_type, COUNT(*)
FROM 'data/events.parquet'
WHERE event_date >= '2024-01-01'
GROUP BY user_id, event_type
""").show()
# Query remote Parquet (S3)
duckdb.sql("""
INSTALL httpfs;
LOAD httpfs;
SET s3_region='ap-northeast-1';
SET s3_access_key_id='your-key';
SET s3_secret_access_key='your-secret';
SELECT COUNT(*) FROM 's3://my-bucket/data/events/*.parquet';
""")
# Convert CSV to Parquet (a blazing-fast conversion tool)
duckdb.sql("""
COPY (SELECT * FROM 'big_file.csv')
TO 'big_file.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")
JSON Files
# Query JSON Lines format
duckdb.sql("""
SELECT
json_extract_string(data, '$.user.name') AS name,
json_extract(data, '$.metrics.page_views')::INTEGER AS views
FROM read_json_auto('data/events.jsonl')
WHERE json_extract_string(data, '$.event_type') = 'page_view'
""").show()
# Or, for structured JSON
duckdb.sql("""
SELECT * FROM read_json_auto('data/users.json')
""").show()
DuckDB vs pandas: Performance
Let's run a concrete comparison:
import duckdb
import pandas as pd
import time
import numpy as np
# Generate 10 million rows of test data
print("Generating test data...")
n = 10_000_000
df = pd.DataFrame({
'user_id': np.random.randint(1, 100000, n),
'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], n),
'amount': np.random.uniform(10, 1000, n).round(2),
'date': pd.date_range('2024-01-01', periods=n, freq='s')
})
# Save as both CSV and Parquet
df.to_csv('test_data.csv', index=False)
df.to_parquet('test_data.parquet', index=False)
# === Test 1: reading CSV ===
start = time.time()
pdf = pd.read_csv('test_data.csv')
pandas_read = time.time() - start
start = time.time()
ddf = duckdb.sql("SELECT * FROM 'test_data.csv'").fetchdf()
duck_read = time.time() - start
print(f"Read CSV - pandas: {pandas_read:.2f}s, DuckDB: {duck_read:.2f}s")
# === Test 2: GROUP BY aggregation ===
start = time.time()
result_pd = df.groupby('category').agg(
count=('amount', 'count'),
total=('amount', 'sum'),
avg=('amount', 'mean')
).reset_index()
pandas_group = time.time() - start
start = time.time()
result_duck = duckdb.sql("""
SELECT
category,
COUNT(*) AS count,
SUM(amount) AS total,
AVG(amount) AS avg
FROM df
GROUP BY category
""").fetchdf()
duck_group = time.time() - start
print(f"GROUP BY - pandas: {pandas_group:.2f}s, DuckDB: {duck_group:.2f}s")
# === Test 3: complex query ===
start = time.time()
result_pd = (df[df['amount'] > 500]
.groupby(['category', df['date'].dt.date])
.agg(
count=('amount', 'count'),
total=('amount', 'sum')
)
.reset_index()
.sort_values('total', ascending=False)
.head(20))
pandas_complex = time.time() - start
start = time.time()
result_duck = duckdb.sql("""
SELECT
category,
date::DATE AS day,
COUNT(*) AS count,
SUM(amount) AS total
FROM df
WHERE amount > 500
GROUP BY category, day
ORDER BY total DESC
LIMIT 20
""").fetchdf()
duck_complex = time.time() - start
print(f"Complex - pandas: {pandas_complex:.2f}s, DuckDB: {duck_complex:.2f}s")
Typical results (they vary with hardware):
Read CSV - pandas: 12.3s, DuckDB: 3.1s
GROUP BY - pandas: 0.8s, DuckDB: 0.2s
Complex - pandas: 1.5s, DuckDB: 0.3s
DuckDB is usually 3-10x faster, and it uses less memory too.
Seamless pandas Integration
The coolest part of DuckDB is that it can query pandas DataFrames directly, with no conversion step:
import duckdb
import pandas as pd
# Create pandas DataFrames
users = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
'city': ['Taipei', 'Tokyo', 'Taipei', 'Osaka', 'Tokyo']
})
orders = pd.DataFrame({
'order_id': range(1, 11),
'user_id': [1, 2, 1, 3, 4, 2, 5, 1, 3, 4],
'amount': [100, 200, 150, 300, 250, 180, 120, 400, 220, 310]
})
# Query the pandas DataFrames directly with SQL (the variable name is the table name)
result = duckdb.sql("""
SELECT
u.name,
u.city,
COUNT(o.order_id) AS order_count,
SUM(o.amount) AS total_spent,
AVG(o.amount) AS avg_order
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.name, u.city
ORDER BY total_spent DESC
""")
# Convert back to a pandas DataFrame
df_result = result.fetchdf()
print(df_result)
# Or just show()
result.show()
This means you can mix pandas's data-wrangling strengths with SQL's query power: use pandas for data cleaning and DuckDB SQL for analysis, getting the best of both worlds.
Hands-On: Analyzing Server Logs
Suppose you have a batch of Nginx access logs that were converted to CSV:
import duckdb
# Create a persistent database (optional)
con = duckdb.connect("analysis.duckdb")
# Import the log data
con.execute("""
CREATE TABLE IF NOT EXISTS access_logs AS
SELECT * FROM read_csv_auto('logs/access_*.csv', filename=true)
""")
# Inspect the schema
con.sql("DESCRIBE access_logs").show()
# Hourly request volume trend
con.sql("""
SELECT
date_trunc('hour', timestamp) AS hour,
COUNT(*) AS requests,
COUNT(CASE WHEN status >= 500 THEN 1 END) AS errors,
ROUND(AVG(response_time_ms), 2) AS avg_response_ms,
ROUND(percentile_cont(0.95) WITHIN GROUP (ORDER BY response_time_ms), 2) AS p95_ms
FROM access_logs
WHERE timestamp >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY hour
ORDER BY hour
""").show()
# Top 10 slowest endpoints
con.sql("""
SELECT
method,
path,
COUNT(*) AS requests,
ROUND(AVG(response_time_ms), 2) AS avg_ms,
ROUND(percentile_cont(0.99) WITHIN GROUP (ORDER BY response_time_ms), 2) AS p99_ms,
MAX(response_time_ms) AS max_ms
FROM access_logs
WHERE timestamp >= CURRENT_DATE - INTERVAL '1 day'
GROUP BY method, path
HAVING COUNT(*) >= 10
ORDER BY p99_ms DESC
LIMIT 10
""").show()
# HTTP status code distribution
con.sql("""
SELECT
status,
COUNT(*) AS count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS percentage
FROM access_logs
GROUP BY status
ORDER BY status
""").show()
# Requests per IP (spot crawlers or attacks)
con.sql("""
SELECT
client_ip,
COUNT(*) AS requests,
COUNT(DISTINCT path) AS unique_paths,
MIN(timestamp) AS first_seen,
MAX(timestamp) AS last_seen
FROM access_logs
WHERE timestamp >= CURRENT_DATE - INTERVAL '1 day'
GROUP BY client_ip
HAVING COUNT(*) > 1000
ORDER BY requests DESC
LIMIT 20
""").show()
# Export the analysis results to Parquet
con.sql("""
COPY (
SELECT
date_trunc('hour', timestamp) AS hour,
path,
COUNT(*) AS requests,
AVG(response_time_ms) AS avg_ms
FROM access_logs
GROUP BY hour, path
) TO 'analysis_results.parquet' (FORMAT PARQUET)
""")
con.close()
Advanced DuckDB Features
Window Functions
-- Per-user spending rank and running totals
SELECT
user_id,
order_date,
amount,
SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS running_total,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY amount DESC) AS rank_by_amount,
amount - LAG(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS diff_from_prev
FROM orders;
CTEs and Recursive Queries
-- Organize a complex analysis with CTEs
WITH daily_stats AS (
SELECT
date_trunc('day', created_at) AS day,
COUNT(DISTINCT user_id) AS dau,
COUNT(*) AS events,
SUM(amount) AS revenue
FROM events
GROUP BY day
),
weekly_avg AS (
SELECT
day,
dau,
revenue,
AVG(dau) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS dau_7day_avg,
AVG(revenue) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rev_7day_avg
FROM daily_stats
)
SELECT * FROM weekly_avg
ORDER BY day DESC
LIMIT 30;
PIVOT and UNPIVOT
-- DuckDB supports PIVOT natively (many databases don't)
PIVOT orders
ON category
USING SUM(amount)
GROUP BY date_trunc('month', order_date);
-- The result looks something like:
-- month | Electronics | Clothing | Books
-- 2024-01-01 | 15000 | 3200 | 800
-- 2024-02-01 | 18000 | 4100 | 950
Exporting to Various Formats
import duckdb
con = duckdb.connect()
# Export to CSV
con.sql("COPY (SELECT * FROM 'data.parquet' WHERE amount > 100) TO 'filtered.csv' (HEADER)")
# Export to Parquet (compressed)
con.sql("COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)")
# Export to JSON
con.sql("COPY (SELECT * FROM 'data.parquet' LIMIT 100) TO 'sample.json' (FORMAT JSON, ARRAY true)")
# Export to Excel (requires the spatial extension)
con.sql("INSTALL spatial; LOAD spatial;")
con.sql("COPY (SELECT * FROM 'data.parquet') TO 'report.xlsx' (FORMAT GDAL, DRIVER 'xlsx')")
When Should You Use DuckDB?
Best-fit scenarios
- Ad-hoc data analysis: the PM drops a CSV on you and you need answers fast
- ETL pipelines: CSV → clean → transform → Parquet
- Local data exploration: given a new dataset, probe its structure and distributions with SQL
- Replacing complex pandas aggregations: GROUP BY + JOIN + window functions read more naturally in SQL
- Jupyter Notebook analysis: works very well with the JupySQL extension
Poor-fit scenarios
- High-concurrency OLTP: not what it was designed for
- Real-time stream processing: it is a batch engine
- Multiple concurrent writers: a limitation of the embedded model
Wrapping Up
DuckDB changed my data-analysis workflow. Faced with a large file, I used to either wait out pandas or load the data into PostgreSQL before querying. Now it's a single duckdb.sql("SELECT ... FROM 'file.csv'") and done.
Its relationship with pandas is complementary, not a replacement. Use pandas for cleaning and transformation (.fillna(), .apply(), the things SQL is bad at) and DuckDB for aggregate analysis (GROUP BY, JOIN, window functions, the things SQL excels at). Used together, they are at their most effective.
If you haven't tried DuckDB yet, I strongly recommend running it once on whatever CSV you have at hand. The first time you see the speed difference, it is genuinely startling.
Further Reading
- DuckDB official documentation
- SQL syntax supported by DuckDB
- JupySQL — SQL inside Jupyter
- MotherDuck — the cloud version of DuckDB
- ClickHouse — another powerful analytical database (server-scale)