Cache results by backend

Pick the right cache class for your source and target backends

This guide shows you how to pick a cache class and attach it to an expression. Xorq ships four cache classes, and the right one depends on two questions: should the cache invalidate when source data changes, and where should the cached result live?

Caching is lazy. Attaching .cache(...) to an expression does nothing until you call .execute(), at which point Xorq either writes the result to storage (miss) or reads it back (hit).

Prerequisites

Xorq installed (Install Xorq)
A backend connection (Connect to a backend)

Steps

1. Pick the cache class

Answer two questions and read the class off the table:

Goal	Store in a backend	Store as Parquet on disk
Recompute when source data changes	`SourceCache`	`ParquetCache`
Keep the first result, even if data changes	`SourceSnapshotCache`	`ParquetSnapshotCache`

Rules of thumb:

Store in a backend when downstream steps run on that backend anyway, or when the cache should live in a shared database.
Store as Parquet when the source backend is in-memory and you want the cache to outlive it, or when no database is involved.
Pick a Snapshot variant when you’d rather serve stale data than pay for recomputation.

2. Cache into a backend with `SourceCache`

SourceCache stores the result as a table in whatever backend you pass as source. The cache key includes source-data metadata, so changes upstream produce a new key and force recomputation.

import xorq.api as xo
from xorq.caching import SourceCache

ddb = xo.duckdb.connect()
penguins = xo.examples.penguins.fetch(backend=ddb)

expr = (
    penguins.group_by("species")
    .agg(avg_mass=xo._.body_mass_g.mean())
    .cache(SourceCache.from_kwargs(source=ddb))
)
print(expr.execute())

     species     avg_mass
0     Adelie  3700.662252
1  Chinstrap  3733.088235
2     Gentoo  5076.016260

Calling .cache() with no arguments does the same thing, using the expression’s own backend as the source.

3. Cache to disk with `ParquetCache`

ParquetCache writes the result as a Parquet file, by default under ~/.cache/xorq/.

from xorq.caching import ParquetCache

con = xo.connect()

expr = (
    penguins.filter(xo._.sex.notnull())
    .cache(ParquetCache.from_kwargs(source=con))
)
print(expr.count().execute())

To put the files somewhere else, pass base_path=Path("/some/dir") to from_kwargs, or set the XORQ_CACHE_DIR environment variable.

Note that penguins lives in DuckDB while source here is the embedded backend. That’s deliberate: for ParquetCache the source backend isn’t where the data lives, it’s only what writes the Parquet file on a miss and reads it back on a hit. The cache itself lives on disk, so the source can be any backend. (For SourceCache, by contrast, the cache is a table in the source backend, so the choice matters more.)

4. Freeze results with the Snapshot variants

SourceSnapshotCache and ParquetSnapshotCache compute the cache key from the expression structure only: table name, path, schema. The key stays stable when the underlying data changes, so the first cached result is served until you delete it.

from xorq.caching import SourceSnapshotCache

frozen = (
    penguins.group_by("island")
    .agg(n=xo._.count())
    .cache(SourceSnapshotCache.from_kwargs(source=ddb))
)
print(frozen.execute())

      island    n
0  Torgersen   52
1     Biscoe  168
2      Dream  124

Use this when reproducibility beats freshness, like model training inputs or an expensive backfill you don’t want to rerun.

5. Combine with `into_backend()`: DuckDB source, DataFusion target

A common pattern: read and filter in DuckDB, move the result to the embedded DataFusion backend for further work, and cache the transferred result as Parquet so the transfer doesn’t repeat.

from xorq.caching import ParquetCache

con = xo.connect()

pipeline = (
    penguins.filter(xo._.species == "Adelie")
    .into_backend(con, "adelie")
    .cache(ParquetCache.from_kwargs(source=con))
    .group_by("island")
    .agg(avg_bill=xo._.bill_length_mm.mean())
)
print(pipeline.execute())

      island   avg_bill
0  Torgersen  38.950980
1      Dream  38.501786
2     Biscoe  38.975000

On the second .execute(), the DuckDB read and the cross-backend transfer are skipped entirely; the pipeline starts from the Parquet file.

6. Check the backend caveats

Automatic invalidation relies on backend-specific change signals, and they’re not all equally sharp:

Backend	Change signal	Caveat
PostgreSQL	`reltuples` estimate	Updates after `ANALYZE`/autovacuum; an insert may not invalidate immediately. Run `ANALYZE <table>` to force it.
Snowflake	`LAST_ALTERED` timestamp	Reliable; updates on any DDL or DML.
DuckDB	File metadata or data bytes	In-memory tables hash the actual data, which is precise but costly for large tables.
SQLite (on disk)	`COUNT(*)` and `MAX(id)`	Needs an `id` column; misses updates that change neither count nor max id.
PyIceberg	Snapshot IDs	Tied to Iceberg’s snapshot model.

If your backend’s signal is too coarse for your use case, a Snapshot variant plus explicit cache deletion gives you manual control.

Prerequisites

Steps

1. Pick the cache class

2. Cache into a backend with SourceCache

3. Cache to disk with ParquetCache

4. Freeze results with the Snapshot variants

5. Combine with into_backend(): DuckDB source, DataFusion target

6. Check the backend caveats

See also

2. Cache into a backend with `SourceCache`

3. Cache to disk with `ParquetCache`

5. Combine with `into_backend()`: DuckDB source, DataFusion target