This guide shows you how to pick a cache class and attach it to an expression. Xorq ships four cache classes, and the right one depends on two questions: should the cache invalidate when source data changes, and where should the cached result live?
Caching is lazy. Attaching .cache(...) to an expression does nothing until you call .execute(), at which point Xorq either writes the result to storage (miss) or reads it back (hit).
Steps
1. Pick the cache class
Answer two questions and read the class off the table:
Rules of thumb:
- Store in a backend when downstream steps run on that backend anyway, or when the cache should live in a shared database.
- Store as Parquet when the source backend is in-memory and you want the cache to outlive it, or when no database is involved.
- Pick a Snapshot variant when you’d rather serve stale data than pay for recomputation.
2. Cache into a backend with SourceCache
SourceCache stores the result as a table in whatever backend you pass as source. The cache key includes source-data metadata, so changes upstream produce a new key and force recomputation.
import xorq.api as xo
from xorq.caching import SourceCache
ddb = xo.duckdb.connect()
penguins = xo.examples.penguins.fetch(backend=ddb)
expr = (
penguins.group_by("species")
.agg(avg_mass=xo._.body_mass_g.mean())
.cache(SourceCache.from_kwargs(source=ddb))
)
print(expr.execute())
species avg_mass
0 Adelie 3700.662252
1 Chinstrap 3733.088235
2 Gentoo 5076.016260
Calling .cache() with no arguments does the same thing, using the expression’s own backend as the source.
3. Cache to disk with ParquetCache
ParquetCache writes the result as a Parquet file, by default under ~/.cache/xorq/.
from xorq.caching import ParquetCache
con = xo.connect()
expr = (
penguins.filter(xo._.sex.notnull())
.cache(ParquetCache.from_kwargs(source=con))
)
print(expr.count().execute())
To put the files somewhere else, pass base_path=Path("/some/dir") to from_kwargs, or set the XORQ_CACHE_DIR environment variable.
Note that penguins lives in DuckDB while source here is the embedded backend. That’s deliberate: for ParquetCache the source backend isn’t where the data lives, it’s only what writes the Parquet file on a miss and reads it back on a hit. The cache itself lives on disk, so the source can be any backend. (For SourceCache, by contrast, the cache is a table in the source backend, so the choice matters more.)
4. Freeze results with the Snapshot variants
SourceSnapshotCache and ParquetSnapshotCache compute the cache key from the expression structure only: table name, path, schema. The key stays stable when the underlying data changes, so the first cached result is served until you delete it.
from xorq.caching import SourceSnapshotCache
frozen = (
penguins.group_by("island")
.agg(n=xo._.count())
.cache(SourceSnapshotCache.from_kwargs(source=ddb))
)
print(frozen.execute())
island n
0 Torgersen 52
1 Biscoe 168
2 Dream 124
Use this when reproducibility beats freshness, like model training inputs or an expensive backfill you don’t want to rerun.
5. Combine with into_backend(): DuckDB source, DataFusion target
A common pattern: read and filter in DuckDB, move the result to the embedded DataFusion backend for further work, and cache the transferred result as Parquet so the transfer doesn’t repeat.
from xorq.caching import ParquetCache
con = xo.connect()
pipeline = (
penguins.filter(xo._.species == "Adelie")
.into_backend(con, "adelie")
.cache(ParquetCache.from_kwargs(source=con))
.group_by("island")
.agg(avg_bill=xo._.bill_length_mm.mean())
)
print(pipeline.execute())
island avg_bill
0 Biscoe 38.975000
1 Torgersen 38.950980
2 Dream 38.501786
On the second .execute(), the DuckDB read and the cross-backend transfer are skipped entirely; the pipeline starts from the Parquet file.
6. Check the backend caveats
Automatic invalidation relies on backend-specific change signals, and they’re not all equally sharp:
| PostgreSQL |
reltuples estimate |
Updates after ANALYZE/autovacuum; an insert may not invalidate immediately. Run ANALYZE <table> to force it. |
| Snowflake |
LAST_ALTERED timestamp |
Reliable; updates on any DDL or DML. |
| DuckDB |
File metadata or data bytes |
In-memory tables hash the actual data, which is precise but costly for large tables. |
| SQLite (on disk) |
COUNT(*) and MAX(id) |
Needs an id column; misses updates that change neither count nor max id. |
| PyIceberg |
Snapshot IDs |
Tied to Iceberg’s snapshot model. |
If your backend’s signal is too coarse for your use case, a Snapshot variant plus explicit cache deletion gives you manual control.