
DuckLake: Lightweight Lakehouse Power for DuckDB Users

Updated: May 30, 2025

By: Joseph Horace

#ducklake
#duckdb
#ducklake tutorial
#ducklake setup
#install ducklake
#duckdb lakehouse
#duckdb extension
#time travel queries
#duckdb snapshots
#duckdb partitioning

Table of Contents

  1. Introduction
  2. What Is DuckLake?
  3. Key Features of DuckLake
  4. 🤝 Why Use DuckLake?
  5. Setting Up DuckLake
  6. Choosing a Catalog Database
  7. Tutorial: Using DuckLake Step-by-Step
  8. 🧑‍💻 Open Source and Easy to Adopt
  9. 🧠 Final Thoughts

Introduction

DuckDB is well-known for being a fast, in-process analytical database, but what if you want to scale your storage, collaborate across users, or keep track of historical data? That’s where DuckLake comes in. DuckLake turns DuckDB into a flexible, versioned lakehouse system that supports concurrent access, schema evolution, and snapshotting—all while keeping things simple and fast.

What Is DuckLake?

DuckLake is a powerful open-source extension that transforms DuckDB into a fully capable lakehouse engine. The term "DuckLake" can refer to:

  • The DuckLake format: a specification for structuring datasets in a versioned, partitioned layout.
  • The DuckLake DuckDB extension: a plugin that enables DuckDB to read from and write to DuckLake datasets.
  • A DuckLake dataset: your actual tables and metadata stored according to the format.

With DuckLake, you can go beyond DuckDB’s default single-file setup and manage scalable, multi-file datasets that support collaboration and time travel.

Key Features of DuckLake

💾 Seamless Data Lake Management

DuckLake supports time travel queries, schema changes, and intelligent partitioning—letting you manage growing datasets with ease.
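
For example, once a catalog is attached you can declare a partition key so that newly written data files are organized by that column. A minimal sketch with a hypothetical events table (syntax per the DuckLake extension; check the docs for your installed version):

-- Lay out future data files by event_date
ALTER TABLE events SET PARTITIONED BY (event_date);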

🪶 Lightweight Snapshots at Scale

Take unlimited snapshots of your tables without worrying about expensive compaction. DuckLake stores only the metadata it needs to make each version queryable.

🔒 ACID Transactions for Consistency

DuckLake brings full ACID compliance to DuckDB tables, ensuring that even complex, multi-table operations remain consistent in concurrent or shared environments.
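
For example, a transaction spanning two hypothetical tables commits in full or not at all:

BEGIN TRANSACTION;
UPDATE devices SET status = 'retired' WHERE device_id = 101;
INSERT INTO device_audit VALUES (101, 'retired', now());
COMMIT;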

⚡ Performance-Driven Design

With support for statistics-based filter pushdown, DuckLake optimizes queries under the hood, making even large datasets responsive and fast.
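
DuckLake keeps statistics for each data file in the catalog, so a selective filter lets it skip files whose value ranges cannot match. A sketch with a hypothetical readings table:

-- Only data files whose device_id range can contain 101 are read
SELECT count(*) FROM readings WHERE device_id = 101;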

🤝 Why Use DuckLake?

If you're looking for a lightweight, SQL-native lakehouse solution, DuckLake is a natural fit. It extends DuckDB in three key ways:

  • Enables shared access across multiple DuckDB instances—ideal for teams or distributed workflows.
  • Supports partitioned, multi-file storage, avoiding the limitations of single database files.
  • Adds support for time travel, letting you query historical versions of your tables without backups.

You don’t need a heavyweight system or an external catalog to get started. DuckLake handles everything using DuckDB itself.

Setting Up DuckLake

To get started, you need to install the DuckLake extension in your DuckDB environment.

Step 1: Install DuckDB (if not already installed)

Visit the DuckDB installation page and install it via your preferred method (CLI, Python, etc.).

Step 2: Install the DuckLake Extension

You can install the DuckLake extension directly from the DuckDB CLI:

INSTALL ducklake;
LOAD ducklake;

Choosing a Catalog Database

When setting up DuckLake, pick a catalog database based on your concurrency needs and existing infrastructure:

DuckDB (single client)

INSTALL ducklake;
ATTACH 'ducklake:metadata.ducklake' AS my_ducklake;
USE my_ducklake;
  • Only one process can open this catalog at a time.

SQLite (multiple local clients)

INSTALL ducklake;
INSTALL sqlite;
ATTACH 'ducklake:sqlite:metadata.sqlite' AS my_ducklake (DATA_PATH 'data_files/');
USE my_ducklake;
  • DuckDB repeatedly attaches and detaches the SQLite database under the hood to cope with SQLite’s single-writer lock.

PostgreSQL (multi-user/remote clients)

INSTALL ducklake;
INSTALL postgres;
-- Ensure `ducklake_catalog` exists in Postgres (v12+)
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost' AS my_ducklake (DATA_PATH 'data_files/');
USE my_ducklake;
  • Provides full transactional support for concurrent users.

MySQL (multi-user/remote clients)

INSTALL ducklake;
INSTALL mysql;
-- Ensure `ducklake_catalog` exists in MySQL (v8+)
ATTACH 'ducklake:mysql:db=ducklake_catalog host=localhost' AS my_ducklake (DATA_PATH 'data_files/');
USE my_ducklake;
  • Leverages MySQL’s client-server model for concurrent access.
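
Across these options, DATA_PATH is the directory where DuckLake writes the actual table data as Parquet files; the catalog database stores only metadata such as snapshots, schemas, and file lists. When DATA_PATH is omitted, as in the DuckDB example above, DuckLake picks a default location relative to the metadata file.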

Tutorial: Using DuckLake Step-by-Step

Let’s walk through a simple but powerful use case: versioning and updating a dataset.

1. Attach a DuckLake Catalog

ATTACH 'ducklake:catalog.ducklake' AS lakehouse (DATA_PATH 'data/warehouse/');
USE lakehouse;

2. Create a Table and Insert Data

CREATE TABLE lakehouse.devices(device_id INTEGER, status TEXT);
INSERT INTO lakehouse.devices VALUES (101, 'active'), (102, 'standby');
FROM lakehouse.devices;

Output:

┌────────────┬──────────┐
│ device_id  │  status  │
├────────────┼──────────┤
│      101   │ active   │
│      102   │ standby  │
└────────────┴──────────┘

3. Update a Row

UPDATE lakehouse.devices SET status = 'offline' WHERE device_id = 102;
FROM lakehouse.devices;
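
Output:

┌────────────┬──────────┐
│ device_id  │  status  │
├────────────┼──────────┤
│      101   │ active   │
│      102   │ offline  │
└────────────┴──────────┘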

4. Time Travel to an Earlier Version

DuckLake maintains a history of database snapshots, each representing a self-consistent state, and preserves the associated changesets until you explicitly remove them through compaction. Time travel lets you query or mount the database exactly as it was at any recorded snapshot, specified by version ID or by timestamp.

Example

FROM lakehouse.devices AT (VERSION => 2);

This queries the devices table as it existed at snapshot version 2. The same AT clause works in any query, against any table:

  • Query by snapshot version
SELECT *
FROM orders
AT (VERSION => 5);

Retrieves the contents of orders as it existed in snapshot version 5.

  • Query by timestamp
SELECT * 
FROM orders 
AT (TIMESTAMP => now() - INTERVAL '2 days');

Returns the state of orders from exactly two days ago (provided a matching snapshot exists).

  • Attach at a specific snapshot version
ATTACH 'ducklake:product_catalog.db' AS historical_catalog (SNAPSHOT_VERSION 5);
USE historical_catalog;

Mounts the product_catalog.db catalog at version 5 under the alias historical_catalog.

  • Attach at a specific point in time
ATTACH 'ducklake:inventory.db' AS past_inventory (SNAPSHOT_TIME '2025-05-28 12:00:00');
USE past_inventory;

Mounts inventory.db as it appeared on May 28, 2025 at noon under the alias past_inventory.

With time travel, you can audit prior data states, reproduce past analyses, or rebuild ETL pipelines based on any snapshot—without altering the current dataset.
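
Because every snapshot stays queryable, you can even diff two versions of a table in plain SQL. A sketch using the tutorial’s devices table, assuming snapshots 2 and 3 both exist:

-- Rows present in snapshot 3 but not in snapshot 2
SELECT * FROM lakehouse.devices AT (VERSION => 3)
EXCEPT
SELECT * FROM lakehouse.devices AT (VERSION => 2);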

  • Listing Snapshots

Snapshots record every commit in DuckLake. Each snapshot represents a set of changes—such as creating tables, inserting or deleting rows, or altering schemas. All modifications to the database happen within a snapshot.

To list snapshots, attach your catalog and query the snapshots function:

ATTACH 'ducklake:archive.db' AS archive;
SELECT * FROM archive.snapshots();
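
Each row describes one snapshot; expect columns such as the snapshot ID, its commit timestamp, the schema version, and a summary of the changes it recorded (the exact columns may vary between DuckLake versions).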

5. Add a New Column

ALTER TABLE lakehouse.devices ADD COLUMN last_checked TIMESTAMP;
FROM lakehouse.devices;
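
Existing rows show NULL for the new column until you backfill it, for example:

UPDATE lakehouse.devices SET last_checked = now();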

6. Track Row-Level Changes

DuckLake can report the rows that changed between two snapshots. The table_changes function takes the table name plus the starting and ending snapshot versions:

FROM lakehouse.table_changes('devices', 2, 3);

🧑‍💻 Open Source and Easy to Adopt

DuckLake is fully open source and available under the permissive MIT license. You can inspect the format, extend it, or integrate it into your data stack freely.

🧠 Final Thoughts

DuckLake bridges the gap between small-scale DuckDB use cases and full-scale data lake needs. Whether you're a solo developer managing analytics in a notebook, or a team building a scalable data platform, DuckLake gives you the structure, speed, and flexibility to get the job done.

It’s lakehouse tech — the DuckDB way.

About the Author

Joseph Horace

Horace is a dedicated software developer with a deep passion for technology and problem-solving. With years of experience developing robust and scalable applications, Horace specializes in building user-friendly solutions with modern technologies. His expertise spans multiple areas of software development, with a focus on high-quality code and seamless user experiences. Horace believes in continuous learning and enjoys sharing insights with the community through contributions and collaborations. When not coding, he enjoys exploring new technologies and staying current on industry trends.