
DuckLake: Lightweight Lakehouse Power for DuckDB Users

Updated: May 30, 2025

By: Joseph Horace

#ducklake
#duckdb
#ducklake tutorial
#ducklake setup
#install ducklake
#duckdb lakehouse
#duckdb extension
#time travel queries
#duckdb snapshots
#duckdb partitioning

Table of Contents

  1. Introduction
  2. What Is DuckLake?
  3. Key Features of DuckLake
  4. 🤝 Why Use DuckLake?
  5. Setting Up DuckLake
  6. Choosing a Catalog Database
  7. Tutorial: Using DuckLake Step-by-Step
  8. 🧑‍💻 Open Source and Easy to Adopt
  9. 🧠 Final Thoughts

Introduction

DuckDB is well-known for being a fast, in-process analytical database, but what if you want to scale your storage, collaborate across users, or keep track of historical data? That’s where DuckLake comes in. DuckLake turns DuckDB into a flexible, versioned lakehouse system that supports concurrent access, schema evolution, and snapshotting—all while keeping things simple and fast.

What Is DuckLake?

DuckLake is a powerful open-source extension that transforms DuckDB into a fully capable lakehouse engine. The term "DuckLake" can refer to:

  • The DuckLake format: a specification for structuring datasets in a versioned, partitioned layout.
  • The DuckLake DuckDB extension: a plugin that enables DuckDB to read from and write to DuckLake datasets.
  • A DuckLake dataset: your actual tables and metadata stored according to the format.

With DuckLake, you can go beyond DuckDB’s default single-file setup and manage scalable, multi-file datasets that support collaboration and time travel.

Key Features of DuckLake

💾 Seamless Data Lake Management

DuckLake supports time travel queries, schema changes, and intelligent partitioning—letting you manage growing datasets with ease.
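
For example, once a catalog is attached you can declare a partition key so that newly written data files are organized by that column. A minimal sketch with a hypothetical events table (syntax per the DuckLake extension; check the docs for your installed version):

-- Lay out future data files by event_date
ALTER TABLE events SET PARTITIONED BY (event_date);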

🪶 Lightweight Snapshots at Scale

Take unlimited snapshots of your tables without worrying about expensive compaction. DuckLake stores only the metadata it needs to make each version queryable.

🔒 ACID Transactions for Consistency

DuckLake brings full ACID compliance to DuckDB tables, ensuring that even complex, multi-table operations remain consistent in concurrent or shared environments.
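
For example, a transaction spanning two hypothetical tables commits in full or not at all:

BEGIN TRANSACTION;
UPDATE devices SET status = 'retired' WHERE device_id = 101;
INSERT INTO device_audit VALUES (101, 'retired', now());
COMMIT;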

⚡ Performance-Driven Design

With support for statistics-based filter pushdown, DuckLake optimizes queries under the hood, making even large datasets responsive and fast.
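
DuckLake keeps statistics for each data file in the catalog, so a selective filter lets it skip files whose value ranges cannot match. A sketch with a hypothetical readings table:

-- Only data files whose device_id range can contain 101 are read
SELECT count(*) FROM readings WHERE device_id = 101;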

🤝 Why Use DuckLake?

If you're looking for a lightweight, SQL-native lakehouse solution, DuckLake is a natural fit. It extends DuckDB in three key ways:

  • Enables shared access across multiple DuckDB instances—ideal for teams or distributed workflows.
  • Supports partitioned, multi-file storage, avoiding the limitations of single database files.
  • Adds support for time travel, letting you query historical versions of your tables without backups.

You don’t need a heavyweight system or an external catalog to get started. DuckLake handles everything using DuckDB itself.

Setting Up DuckLake

To get started, you need to install the DuckLake extension in your DuckDB environment.

Step 1: Install DuckDB (if not already installed)

Visit the DuckDB installation page and install it via your preferred method (CLI, Python, etc.).

Step 2: Install the DuckLake Extension

You can install the DuckLake extension directly from the DuckDB CLI:

INSTALL ducklake;
LOAD ducklake;

Choosing a Catalog Database

When setting up DuckLake, pick a catalog database based on your concurrency needs and existing infrastructure:

DuckDB (single client)

INSTALL ducklake;
ATTACH 'ducklake:metadata.ducklake' AS my_ducklake;
USE my_ducklake;
  • Only one process can open this catalog at a time.

SQLite (multiple local clients)

INSTALL ducklake;
INSTALL sqlite;
ATTACH 'ducklake:sqlite:metadata.sqlite' AS my_ducklake (DATA_PATH 'data_files/');
USE my_ducklake;
  • DuckDB repeatedly attaches and detaches the SQLite database under the hood to cope with SQLite’s single-writer lock.

PostgreSQL (multi-user/remote clients)

INSTALL ducklake;
INSTALL postgres;
-- Ensure `ducklake_catalog` exists in Postgres (v12+)
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost' AS my_ducklake (DATA_PATH 'data_files/');
USE my_ducklake;
  • Provides full transactional support for concurrent users.

MySQL (multi-user/remote clients)

INSTALL ducklake;
INSTALL mysql;
-- Ensure `ducklake_catalog` exists in MySQL (v8+)
ATTACH 'ducklake:mysql:db=ducklake_catalog host=localhost' AS my_ducklake (DATA_PATH 'data_files/');
USE my_ducklake;
  • Leverages MySQL’s client-server model for concurrent access.
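
Across these options, DATA_PATH is the directory where DuckLake writes the actual table data as Parquet files; the catalog database stores only metadata such as snapshots, schemas, and file lists. When DATA_PATH is omitted, as in the DuckDB example above, DuckLake picks a default location relative to the metadata file.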

Tutorial: Using DuckLake Step-by-Step

Let’s walk through a simple but powerful use case: versioning and updating a dataset.

1. Attach a DuckLake Catalog

ATTACH 'ducklake:catalog.ducklake' AS lakehouse (DATA_PATH 'data/warehouse/');
USE lakehouse;

2. Create a Table and Insert Data

CREATE TABLE lakehouse.devices(device_id INTEGER, status TEXT);
INSERT INTO lakehouse.devices VALUES (101, 'active'), (102, 'standby');
FROM lakehouse.devices;

Output:

┌────────────┬──────────┐
│ device_id  │  status  │
├────────────┼──────────┤
│      101   │ active   │
│      102   │ standby  │
└────────────┴──────────┘

3. Update a Row

UPDATE lakehouse.devices SET status = 'offline' WHERE device_id = 102;
FROM lakehouse.devices;
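
Output:

┌────────────┬──────────┐
│ device_id  │  status  │
├────────────┼──────────┤
│      101   │ active   │
│      102   │ offline  │
└────────────┴──────────┘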

4. Time Travel to an Earlier Version

DuckLake maintains a history of database snapshots, each representing a self-consistent state, and preserves the associated changesets until you explicitly remove them through compaction. Time travel lets you query or mount the database exactly as it was at any recorded snapshot, specified by version ID or by timestamp.

Example

FROM lakehouse.devices AT (VERSION => 2);

This queries the devices table as it existed at snapshot version 2. The same AT clause works in any query, against any table:

  • Query by snapshot version
SELECT *
FROM orders
AT (VERSION => 5);

Retrieves the contents of orders as it existed in snapshot version 5.

  • Query by timestamp
SELECT * 
FROM orders 
AT (TIMESTAMP => now() - INTERVAL '2 days');

Returns the state of orders from exactly two days ago (provided a matching snapshot exists).

  • Attach at a specific snapshot version
ATTACH 'ducklake:product_catalog.db' AS historical_catalog (SNAPSHOT_VERSION 5);
USE historical_catalog;

Mounts the product_catalog.db catalog at version 5 under the alias historical_catalog.

  • Attach at a specific point in time
ATTACH 'ducklake:inventory.db' AS past_inventory (SNAPSHOT_TIME '2025-05-28 12:00:00');
USE past_inventory;

Mounts inventory.db as it appeared on May 28, 2025 at noon under the alias past_inventory.

With time travel, you can audit prior data states, reproduce past analyses, or rebuild ETL pipelines based on any snapshot—without altering the current dataset.
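
Because every snapshot stays queryable, you can even diff two versions of a table in plain SQL. A sketch using the tutorial’s devices table, assuming snapshots 2 and 3 both exist:

-- Rows present in snapshot 3 but not in snapshot 2
SELECT * FROM lakehouse.devices AT (VERSION => 3)
EXCEPT
SELECT * FROM lakehouse.devices AT (VERSION => 2);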

  • Listing Snapshots

Snapshots record every commit in DuckLake. Each snapshot represents a set of changes—such as creating tables, inserting or deleting rows, or altering schemas. All modifications to the database happen within a snapshot.

To list snapshots, attach your catalog and query the snapshots function:

ATTACH 'ducklake:archive.db' AS archive;
SELECT * FROM archive.snapshots();
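
Each row describes one snapshot; expect columns such as the snapshot ID, its commit timestamp, the schema version, and a summary of the changes it recorded (the exact columns may vary between DuckLake versions).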

5. Add a New Column

ALTER TABLE lakehouse.devices ADD COLUMN last_checked TIMESTAMP;
FROM lakehouse.devices;
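
Existing rows show NULL for the new column until you backfill it, for example:

UPDATE lakehouse.devices SET last_checked = now();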

6. Track Row-Level Changes

DuckLake can report the rows that changed between two snapshots. The table_changes function takes the table name plus the starting and ending snapshot versions:

FROM lakehouse.table_changes('devices', 2, 3);

🧑‍💻 Open Source and Easy to Adopt

DuckLake is fully open source and available under the permissive MIT license. You can inspect the format, extend it, or integrate it into your data stack freely.

🧠 Final Thoughts

DuckLake bridges the gap between small-scale DuckDB use cases and full-scale data lake needs. Whether you're a solo developer managing analytics in a notebook, or a team building a scalable data platform, DuckLake gives you the structure, speed, and flexibility to get the job done.

It’s lakehouse tech — the DuckDB way.

About the Author

Joseph Horace

Horace is a dedicated software developer with a deep passion for technology and problem-solving. With years of experience developing robust and scalable applications, Horace specializes in building user-friendly solutions with modern technologies. His expertise spans multiple areas of software development, with a focus on high-quality code and seamless user experiences. Horace believes in continuous learning and enjoys sharing insights with the community through contributions and collaborations. When not coding, he enjoys exploring new technologies and staying current on industry trends.