Cache API overview

Strategy/storage matrix, how to build a cache, and backend invalidation signals for the Xorq cache classes

The classes exported from xorq.caching are named combinations of a strategy (how the cache key is computed) and a storage (where the cached data lives). For the conceptual treatment of those two axes, see Caching.

Class matrix

Cache class	Strategy	Storage
`SourceCache`	`ModificationTimeStrategy`	`SourceStorage`
`SourceSnapshotCache`	`SnapshotStrategy`	`SourceStorage`
`ParquetCache`	`ModificationTimeStrategy`	`ParquetStorage`
`ParquetSnapshotCache`	`SnapshotStrategy`	`ParquetStorage`
`ParquetTTLSnapshotCache`	`SnapshotStrategy`	`ParquetTTLStorage`
`GCSCache`	`ModificationTimeStrategy`	`GCStorage`

ModificationTimeStrategy folds backend-specific change metadata into the key, so the cache invalidates when the source data changes.
SnapshotStrategy keys on expression structure only (table name, path, schema), so the first cached result is served until you delete it.
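
The difference between the two strategies can be illustrated with a minimal sketch. This is not Xorq's implementation: the function names `snapshot_key` and `modification_time_key`, the `change_signal` parameter, and the use of SHA-256 are all assumptions made for illustration. It only shows why one key moves when the source changes and the other does not.

```python
import hashlib


def _digest(parts) -> str:
    # Hash a deterministic text representation of the key parts.
    return hashlib.sha256(repr(parts).encode()).hexdigest()


def snapshot_key(table_name: str, path: str, schema: dict) -> str:
    # Hypothetical snapshot-style key: expression structure only.
    return _digest((table_name, path, sorted(schema.items())))


def modification_time_key(table_name: str, path: str, schema: dict, change_signal) -> str:
    # Hypothetical modification-time-style key: structure plus a change signal
    # supplied by the backend (for example, a last-modified timestamp).
    return _digest((table_name, path, sorted(schema.items()), change_signal))


schema = {"species": "string", "n": "int64"}
args = ("penguins", "/data/penguins.parquet", schema)

# The source changes between two runs: only the change signal differs.
mtime_key_before = modification_time_key(*args, change_signal=1_700_000_000)
mtime_key_after = modification_time_key(*args, change_signal=1_700_000_500)
snap_key_before = snapshot_key(*args)
snap_key_after = snapshot_key(*args)

print(mtime_key_before != mtime_key_after)  # new key, so the old entry is no longer used
print(snap_key_before == snap_key_after)    # same key, so the first cached result is reused
```

Under these assumptions, a changed signal produces a new key and therefore a cache miss, while the snapshot key stays fixed until the expression structure itself changes.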

Building a cache

Build any cache with from_kwargs(...) and attach it with .cache(...). The keyword arguments are forwarded to the storage; the accepted arguments and their defaults are documented on each class’s reference page (linked in the matrix above). The example below caches a small real dataset into the embedded (local DataFusion) backend with SourceCache, then to a local Parquet file with ParquetCache:

import xorq.api as xo
from xorq.caching import SourceCache, ParquetCache

con = xo.connect()  # embedded DataFusion backend, runs locally
penguins = xo.examples.penguins.fetch(backend=con)
expr = penguins.group_by("species").agg(n=xo._.count())

# Cache as a table in the source backend
expr.cache(SourceCache.from_kwargs(source=con)).execute()

	species	n
0	Gentoo	124
1	Adelie	152
2	Chinstrap	68

# Cache to a local Parquet file (default ~/.cache/xorq/parquet/)
expr.cache(ParquetCache.from_kwargs(source=con)).execute()

	species	n
0	Gentoo	124
1	Adelie	152
2	Chinstrap	68

GCSCache follows the same shape but writes to a Google Cloud Storage bucket, so it needs a bucket name and isn’t runnable locally:

from xorq.caching import GCSCache

# Requires a GCS bucket reachable via gcsfs
expr.cache(GCSCache.from_kwargs(bucket_name="my-bucket", source=con))

SourceStorage checks existence with key in source.tables and writes the result as a table named after the key. ParquetStorage checks whether the .parquet file for the key exists; on a miss, it writes to a .tmp file first and atomically renames it on success, so a failed write never leaves a partial cache file. ParquetTTLStorage additionally treats a file older than its ttl as expired.
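
The write-then-rename pattern and the TTL check can be sketched with the standard library alone. The helpers `write_atomically` and `is_cached` below are hypothetical names, the payload is placeholder bytes rather than real Parquet, and using the file's mtime as its age is an assumption; the sketch illustrates the technique and is not the Xorq storage code.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def write_atomically(path: Path, data: bytes) -> None:
    # Write to a sibling .tmp file, then rename over the final name.
    # os.replace is atomic when both paths are on the same filesystem.
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_bytes(data)
    os.replace(tmp, path)


def is_cached(path: Path, ttl_seconds: Optional[float] = None) -> bool:
    # Existence check, optionally treating files older than the TTL as expired.
    if not path.exists():
        return False
    if ttl_seconds is None:
        return True
    age = time.time() - path.stat().st_mtime
    return age <= ttl_seconds


cache_dir = Path(tempfile.mkdtemp())
target = cache_dir / "abc123.parquet"  # hypothetical cache key as the file name

miss = is_cached(target)                 # nothing written yet
write_atomically(target, b"placeholder bytes, not real Parquet")
hit = is_cached(target)                  # file now exists

# Backdate the file by an hour to simulate an entry older than a 60-second TTL.
an_hour_ago = time.time() - 3600
os.utime(target, (an_hour_ago, an_hour_ago))
expired = not is_cached(target, ttl_seconds=60)
```

Because readers only ever see the final file name after the rename, an interrupted write leaves at most a stray .tmp file and never a truncated cache entry.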

Backend invalidation signals

ModificationTimeStrategy reads a per-backend change signal into the key hash:

Backend	Invalidation signal	ADBC ingestion	Notes
Postgres	`reltuples` from `pg_class`	Yes (`PgADBC`)	Estimate; may lag writes until `ANALYZE` runs
Snowflake	`LAST_ALTERED` timestamp	Yes (`SnowflakeADBC`)	Updates on any DDL/DML
BigQuery	`last_modified_time` from `__TABLES__`	—
DuckDB	File metadata / data bytes	No (direct register)	In-memory tables are hashed by their data bytes
DataFusion / Xorq	Data bytes (in-memory)	No (direct register)
SQLite	`COUNT(*)` and `MAX(id)` (on-disk) / data bytes (in-memory)	Yes (`SQLiteADBC`)	On-disk requires an `id` column
PyIceberg	Snapshot IDs	No (uses `create_table`)	Tied to Iceberg’s snapshot model
Deferred file reads (parquet, CSV)	File mtime, size, inode number	—
Deferred URL reads (S3/GCS)	`last_modified`, `size`, `e_tag`	—
Deferred URL reads (HTTP)	`Last-Modified`, `Content-Length`	—
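
As a rough illustration of the deferred-file-read row, the sketch below collects a file's mtime, size, and inode number with `os.stat` and shows that the tuple changes after the file is modified. The helper name `file_change_signal` is hypothetical, and how Xorq combines these fields into the key hash is not shown here.

```python
import os
import tempfile


def file_change_signal(path: str) -> tuple:
    # Collect the three stat fields that identify a file's current state.
    st = os.stat(path)
    return (st.st_mtime, st.st_size, st.st_ino)


# Create a small CSV file to stand in for a deferred file read.
fd, csv_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(b"a,b\n")

signal_before = file_change_signal(csv_path)

# Append a row: the size changes, so the signal changes.
with open(csv_path, "ab") as f:
    f.write(b"1,2\n")

signal_after = file_change_signal(csv_path)

print(signal_before != signal_after)
```

Any key that folds this tuple in would differ before and after the append, which is the behavior the table describes for ModificationTimeStrategy.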

For a frozen result regardless of source changes, use a snapshot variant (SourceSnapshotCache, ParquetSnapshotCache).