Introduction

As a backend engineer, have you ever had this experience? A PM drops a 2 GB CSV file on you and says, "Give me the user-behavior stats for last month." You open Python, `import pandas as pd`, call `pd.read_csv()`, and your machine's fans start screaming while memory usage shoots up to 8 GB...

DuckDB exists to rescue exactly this scenario. It is an embedded analytical database: there is no server to install, it can query CSV, Parquet, and even pandas DataFrames directly with SQL, and it is typically several times faster than pandas.

In this article I'll start from installation and walk through what DuckDB can do for local data analysis.


What Is DuckDB?

DuckDB calls itself "the SQLite of analytics", and the positioning is spot on:

  • Embedded: no server to install; it lives inside your program, just like SQLite
  • Columnar storage: optimized for analytical queries (aggregations, scans)
  • Vectorized execution: processes data in batches, exploiting the CPU's SIMD instructions
  • Direct file querying: CSV, Parquet, and JSON, with no import step

Installation

# Python
pip install duckdb

# CLI tool
# macOS
brew install duckdb

# Linux
wget https://github.com/duckdb/duckdb/releases/download/v1.0.0/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
sudo mv duckdb /usr/local/bin/

A Quick Taste

import duckdb

# Query a CSV file directly (no import step required!)
result = duckdb.sql("""
    SELECT
        category,
        COUNT(*) AS count,
        AVG(price) AS avg_price,
        SUM(quantity) AS total_qty
    FROM 'sales_data.csv'
    GROUP BY category
    ORDER BY total_qty DESC
""")
print(result)

# Or from the CLI:
# $ duckdb
# D SELECT COUNT(*) FROM 'sales_data.csv';

Yes, it really is that simple. No CREATE TABLE, no COPY; just put the file path straight into the SQL.


Querying Various File Formats Directly

CSV files

import duckdb

# Basic query
duckdb.sql("SELECT * FROM 'data/users.csv' LIMIT 10").show()

# Auto-detect delimiter and column types
duckdb.sql("""
    SELECT *
    FROM read_csv_auto('data/users.csv',
        header=true,
        dateformat='%Y-%m-%d'
    )
    LIMIT 10
""").show()

# Query multiple CSV files at once (glob patterns supported)
duckdb.sql("""
    SELECT filename, COUNT(*) AS rows
    FROM read_csv_auto('data/logs_*.csv', filename=true)
    GROUP BY filename
""").show()

# Handle a messy CSV (explicit column types)
duckdb.sql("""
    SELECT *
    FROM read_csv('messy_data.csv',
        columns={'id': 'INTEGER', 'name': 'VARCHAR', 'amount': 'DOUBLE'},
        header=true,
        ignore_errors=true,  -- skip malformed rows
        null_padding=true
    )
""").show()

Parquet files

Parquet is a columnar storage format, and it is the best-performing pairing with DuckDB:

# Query Parquet directly
duckdb.sql("SELECT * FROM 'data/events.parquet' LIMIT 10").show()

# Parquet column pruning (reads only the columns you need -- extremely fast)
duckdb.sql("""
    SELECT user_id, event_type, COUNT(*)
    FROM 'data/events.parquet'
    WHERE event_date >= '2024-01-01'
    GROUP BY user_id, event_type
""").show()

# Query remote Parquet on S3
duckdb.sql("""
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_region='ap-northeast-1';
    SET s3_access_key_id='your-key';
    SET s3_secret_access_key='your-secret';

    SELECT COUNT(*) FROM 's3://my-bucket/data/events/*.parquet';
""")

# CSV to Parquet (a blazing-fast converter)
duckdb.sql("""
    COPY (SELECT * FROM 'big_file.csv')
    TO 'big_file.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)
""")

JSON files

# Query JSON Lines (one JSON object per line); read_ndjson_objects exposes
# each line as a raw `json` column for json_extract to work on
duckdb.sql("""
    SELECT
        json_extract_string(json, '$.user.name') AS name,
        json_extract(json, '$.metrics.page_views')::INTEGER AS views
    FROM read_ndjson_objects('data/events.jsonl')
    WHERE json_extract_string(json, '$.event_type') = 'page_view'
""").show()

# Or, for structured JSON, let DuckDB unpack the columns automatically
duckdb.sql("""
    SELECT * FROM read_json_auto('data/users.json')
""").show()


DuckDB vs pandas: A Performance Comparison

Let's run an actual comparison:

import duckdb
import pandas as pd
import time
import numpy as np

# Generate 10 million rows of test data
print("Generating test data...")
n = 10_000_000
df = pd.DataFrame({
    'user_id': np.random.randint(1, 100000, n),
    'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], n),
    'amount': np.random.uniform(10, 1000, n).round(2),
    'date': pd.date_range('2024-01-01', periods=n, freq='s')
})

# Save as both CSV and Parquet
df.to_csv('test_data.csv', index=False)
df.to_parquet('test_data.parquet', index=False)

# === Test 1: reading CSV ===
start = time.time()
pdf = pd.read_csv('test_data.csv')
pandas_read = time.time() - start

start = time.time()
ddf = duckdb.sql("SELECT * FROM 'test_data.csv'").fetchdf()
duck_read = time.time() - start

print(f"Read CSV - pandas: {pandas_read:.2f}s, DuckDB: {duck_read:.2f}s")

# === Test 2: GROUP BY aggregation ===
start = time.time()
result_pd = df.groupby('category').agg(
    count=('amount', 'count'),
    total=('amount', 'sum'),
    avg=('amount', 'mean')
).reset_index()
pandas_group = time.time() - start

start = time.time()
result_duck = duckdb.sql("""
    SELECT category,
           COUNT(*) AS count,
           SUM(amount) AS total,
           AVG(amount) AS avg
    FROM df
    GROUP BY category
""").fetchdf()
duck_group = time.time() - start

print(f"GROUP BY - pandas: {pandas_group:.2f}s, DuckDB: {duck_group:.2f}s")

# === Test 3: a more complex query ===
start = time.time()
result_pd = (df[df['amount'] > 500]
             .groupby(['category', df['date'].dt.date])
             .agg(
                 count=('amount', 'count'),
                 total=('amount', 'sum')
             )
             .reset_index()
             .sort_values('total', ascending=False)
             .head(20))
pandas_complex = time.time() - start

start = time.time()
result_duck = duckdb.sql("""
    SELECT category,
           date::DATE AS day,
           COUNT(*) AS count,
           SUM(amount) AS total
    FROM df
    WHERE amount > 500
    GROUP BY category, day
    ORDER BY total DESC
    LIMIT 20
""").fetchdf()
duck_complex = time.time() - start

print(f"Complex - pandas: {pandas_complex:.2f}s, DuckDB: {duck_complex:.2f}s")

Typical results (they will vary with your hardware):

Read CSV - pandas: 12.3s, DuckDB: 3.1s
GROUP BY - pandas: 0.8s,  DuckDB: 0.2s
Complex  - pandas: 1.5s,  DuckDB: 0.3s

DuckDB is usually 3-10x faster, and it uses less memory as well.


Seamless pandas Integration

The coolest part of DuckDB is that it can query pandas DataFrames directly, with no conversion step:

import duckdb
import pandas as pd

# Build pandas DataFrames
users = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Carol', 'Dave', 'Eve'],
    'city': ['Taipei', 'Tokyo', 'Taipei', 'Osaka', 'Tokyo']
})

orders = pd.DataFrame({
    'order_id': range(1, 11),
    'user_id': [1, 2, 1, 3, 4, 2, 5, 1, 3, 4],
    'amount': [100, 200, 150, 300, 250, 180, 120, 400, 220, 310]
})

# Query the pandas DataFrames directly with SQL (the variable name is the table name)
result = duckdb.sql("""
    SELECT u.name,
           u.city,
           COUNT(o.order_id) AS order_count,
           SUM(o.amount) AS total_spent,
           AVG(o.amount) AS avg_order
    FROM users u
    LEFT JOIN orders o ON u.id = o.user_id
    GROUP BY u.name, u.city
    ORDER BY total_spent DESC
""")

# Convert back to a pandas DataFrame
df_result = result.fetchdf()
print(df_result)

# Or just show() it
result.show()

This means you can mix pandas's data-wrangling muscle with SQL's query power: clean the data with pandas, analyze it with DuckDB SQL, and get the best of both worlds.


Hands-On: Analyzing Server Logs

Suppose you have a batch of Nginx access logs converted to CSV:

import duckdb

# Create a persistent database (optional)
con = duckdb.connect("analysis.duckdb")

# Import the log data
con.execute("""
    CREATE TABLE IF NOT EXISTS access_logs AS
    SELECT * FROM read_csv_auto('logs/access_*.csv', filename=true)
""")

# Inspect the schema
con.sql("DESCRIBE access_logs").show()

# Hourly request-volume trend
con.sql("""
    SELECT
        date_trunc('hour', timestamp) AS hour,
        COUNT(*) AS requests,
        COUNT(CASE WHEN status >= 500 THEN 1 END) AS errors,
        ROUND(AVG(response_time_ms), 2) AS avg_response_ms,
        ROUND(percentile_cont(0.95) WITHIN GROUP (ORDER BY response_time_ms), 2) AS p95_ms
    FROM access_logs
    WHERE timestamp >= CURRENT_DATE - INTERVAL '7 days'
    GROUP BY hour
    ORDER BY hour
""").show()

# Top 10 slowest endpoints
con.sql("""
    SELECT
        method,
        path,
        COUNT(*) AS requests,
        ROUND(AVG(response_time_ms), 2) AS avg_ms,
        ROUND(percentile_cont(0.99) WITHIN GROUP (ORDER BY response_time_ms), 2) AS p99_ms,
        MAX(response_time_ms) AS max_ms
    FROM access_logs
    WHERE timestamp >= CURRENT_DATE - INTERVAL '1 day'
    GROUP BY method, path
    HAVING COUNT(*) >= 10
    ORDER BY p99_ms DESC
    LIMIT 10
""").show()

# HTTP status-code distribution
con.sql("""
    SELECT
        status,
        COUNT(*) AS count,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS percentage
    FROM access_logs
    GROUP BY status
    ORDER BY status
""").show()

# Requests per IP (spot crawlers and attacks)
con.sql("""
    SELECT
        client_ip,
        COUNT(*) AS requests,
        COUNT(DISTINCT path) AS unique_paths,
        MIN(timestamp) AS first_seen,
        MAX(timestamp) AS last_seen
    FROM access_logs
    WHERE timestamp >= CURRENT_DATE - INTERVAL '1 day'
    GROUP BY client_ip
    HAVING COUNT(*) > 1000
    ORDER BY requests DESC
    LIMIT 20
""").show()

# Export the analysis results to Parquet
con.sql("""
    COPY (
        SELECT
            date_trunc('hour', timestamp) AS hour,
            path,
            COUNT(*) AS requests,
            AVG(response_time_ms) AS avg_ms
        FROM access_logs
        GROUP BY hour, path
    ) TO 'analysis_results.parquet' (FORMAT PARQUET)
""")

con.close()


Advanced DuckDB Features

Window functions

-- Per-user spending rankings and running totals
SELECT
    user_id,
    order_date,
    amount,
    SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS running_total,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY amount DESC) AS rank_by_amount,
    amount - LAG(amount) OVER (PARTITION BY user_id ORDER BY order_date) AS diff_from_prev
FROM orders;

CTEs and recursive queries

-- Use CTEs to organize a complex analysis
WITH daily_stats AS (
    SELECT
        date_trunc('day', created_at) AS day,
        COUNT(DISTINCT user_id) AS dau,
        COUNT(*) AS events,
        SUM(amount) AS revenue
    FROM events
    GROUP BY day
),
weekly_avg AS (
    SELECT
        day,
        dau,
        revenue,
        AVG(dau) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS dau_7day_avg,
        AVG(revenue) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rev_7day_avg
    FROM daily_stats
)
SELECT * FROM weekly_avg
ORDER BY day DESC
LIMIT 30;

PIVOT and UNPIVOT

-- DuckDB supports PIVOT natively (many databases don't)
PIVOT orders
ON category
USING SUM(amount)
GROUP BY date_trunc('month', order_date);

-- The result looks like:
-- month      | Electronics | Clothing | Books
-- 2024-01-01 | 15000       | 3200     | 800
-- 2024-02-01 | 18000       | 4100     | 950

Exporting to various formats

import duckdb

con = duckdb.connect()

# Export to CSV
con.sql("COPY (SELECT * FROM 'data.parquet' WHERE amount > 100) TO 'filtered.csv' (HEADER)")

# Export to Parquet (compressed)
con.sql("COPY (SELECT * FROM 'data.csv') TO 'data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)")

# Export to JSON
con.sql("COPY (SELECT * FROM 'data.parquet' LIMIT 100) TO 'sample.json' (FORMAT JSON, ARRAY true)")

# Export to Excel (requires the spatial extension)
con.sql("INSTALL spatial; LOAD spatial;")
con.sql("COPY (SELECT * FROM 'data.parquet') TO 'report.xlsx' (FORMAT GDAL, DRIVER 'xlsx')")


When Should You Use DuckDB?

Where it shines

  • Ad-hoc data analysis: a PM drops a CSV on you and you need answers fast
  • ETL pipelines: CSV → clean → transform → Parquet
  • Local data exploration: given a new dataset, probe its structure and distributions with SQL first
  • Replacing complex pandas aggregations: GROUP BY + JOIN + window functions read more naturally in SQL
  • Jupyter Notebook analysis: pairs nicely with the JupySQL extension

Where it doesn't fit

  • High-concurrency OLTP: not what it was designed for
  • Real-time stream processing: it is a batch engine
  • Many concurrent writers: a limitation of being embedded

Wrapping Up

DuckDB has changed my data-analysis workflow. Faced with a big file, I used to either wait on pandas or load the data into PostgreSQL before querying. Now it's a one-liner: duckdb.sql("SELECT ... FROM 'file.csv'").

Its relationship with pandas is complementary, not competitive. Use pandas for cleaning and transformation (.fillna(), .apply(), and the other things SQL is bad at), and DuckDB for aggregate analysis (GROUP BY, JOIN, and window functions, SQL's strong suits). Used together, they are at their most efficient.

If you haven't tried DuckDB yet, I strongly recommend running it once on a CSV you have lying around. The first time you see the speed gap, it really will startle you.

Further Reading