Javier Bonilla

Data

5

m

How DuckDB powers file uploads at Querio

No server, no dataframes, no .duckdb file and no lock

Querio works on top of data warehouses and databases, but our users had long been asking to be able to upload excels and CSVs into the product to analyze them in tandem to their warehouse data. Querio, as you’ve probably heard a couple of times from Rami, is a notebook-based product. So the immediate reflex to enable this workflow is to take the upload, drop it somewhere like S3, and read_csv into a pandas or polars dataframe in the runtime to work on top of. It’s one line, it feels almost free and it’ll work great in demos.

Once we fended off the reflex and thought about it a bit more, we realized it would break a pretty engrained pattern of our product. Everything else in Querio starts with SQL against a datasource that has a catalog. A dataframe in the runtime makes uploads a different kind of thing than every other datasource. We already have a highly optimized agent harness that knows how to work in this pattern very effectively, introducing an un-cataloged datasource floating in runtime memory wouldn’t be business as usual, so we’d be teaching it a second way to work for one feature. Not to mention the gnarliness of having the manage the state of that dataframe, it lives in the kernel, if the kernel restarts well there goes your dataframe. So yeah, the dataframe path looks lighter at face value, but it actually bolts a parallel subsystem onto the product.

Instead, we decided to make files look like any other datasource in Querio, and that’s where our first pretty reasonable implementation with DuckDB came in. A single .querio/uploads.duckdb per workspace. kind="duckdb", normal connection string, no special-casing. Each upload became a real table: CREATE OR REPLACE TABLE ... AS SELECT * FROM read_csv_auto(...) inside a BEGIN/COMMIT. Query path was just a normal DuckDBEngine opening the same file. This made files look and behave the same as any other datasource in Querio, literally identical to working with a MotherDuck warehouse.

We shipped it and… it broke. It turns out that a DuckDB database file takes a process-level lock: one read-write process at a time. Some of this was our own fault and some of it was just the nature of the thing. The self-inflicted part: we couldn’t get multiple readers at once because we missed setting ?access_mode=read_only in the connection string, an easy fix in hindsight. The fundamental part is the one that actually hurt tho. A query holding the file read-write blocked any upload and the reverse, and two uploads couldn’t run at the same time in a single workspace. No flag fixes that, it’s just how the lock works. So basically, say goodbye to using this feature if literally any other user in your workspace is touching it at that point in time. This worked great for a one-user-workspace and did not scale beyond that. Vamos!

We loved the fundamental idea but we evidently had to redesign this thing. The key insight was that a :memory: DuckDB connection takes no file lock at all, so stop sharing a locked database file and make the only shared state plain Parquet on disk. Something that was trivial given how we’ve architected every Querio workspace to be a file system. Storage flipped from one locked .duckdb to one files/<table>.parquet per upload. Files are the state and the database is the throwaway instead. The writing is driven by an in-memory DuckDB that does COPY ... TO a temp Parquet, then os.replace() swaps it into place. A reader opens either the complete old file or the complete new one, never a half-written one. The write-to-temp-then-rename is what replaced the BEGIN/COMMIT, the filesystem rename is the transaction now. The reading gives every query its own per-process :memory: connection with CREATE VIEW ... read_parquet(...). Readers never contend and the writer never blocks them. Instead of solving DuckDB’s concurrency model, we redesigned around it and engineered it out of existence.

To the more security focused engineers, this will sound like a threat vector. DuckDB can read s3://, http://, gcs://, azure://, so a read_parquet('s3://attacker/...') inside our servers is an exfiltration and SSRF vector once we run user-generated SQL. So we slammed that door shut. We disabled the remote filesystems outright (disabled_filesystems = HTTPFileSystem, S3FileSystem, GCSFileSystem, AzureFileSystem), whitelisted allowed_directories to scratch and files only (no traversal), disabled all extension loading and added resource caps on conversion. And so none of this can be undone from inside a query, we set lock_configuration=true, the SQL can’t unlock its own cage. The demo works with none of this, prod assumes someone (at least our pen testers) is going to try s3:// and shuts it before they do.

This doesn’t come without caps, like a 10M row, 1024 column and 1GB limit. But lets be real, this is the upload path, not the warehouse path. Warehouses and data engineers aren’t going anywhere any time soon, we scoped this tool to do its job. And before you yell at me, there is S3 somewhere at some point, if the box gets cooked we don’t lose all of our users’ files. We back up all of our users’ file systems in S3, we just don’t pull from it live to let you analyze your CSV. So that’s the story of how DuckDB powers file uploads at Querio, the cleanest system came from us refusing the urge to add a subsystem, and we only found it by building the heavier broken one first. Big shout out to Nik here who made us think harder and led the design and implementation of this whole thing.

Written by