Caching

Xorq provides a caching system that enables efficient iterative development of ML pipelines. The caching system is designed to optimize performance, reduce computational overhead, and provide automatic invalidation when upstream data changes.

A quick end-to-end example, caching a filtered Postgres table:

import xorq.api as xo
from xorq.caching import SourceStorage

# Connect to source database
pg = xo.postgres.connect_env()
con = xo.connect()

# Create source storage
storage = SourceStorage(source=con)

# Register table from postgres and cache it
batting = pg.table("batting")
expr = batting.filter(batting.yearID == 2015)

# Cache the filtered data in the source backend
cached = expr.cache(storage=storage)  # cache expression

# Execute the query - results will be cached
result = xo.execute(cached)
Overview
The caching system in Xorq allows you to:
- Cache results from upstream query engines, storing intermediate results to avoid recomputation
- Persist data locally or in remote storage, with a choice between in-memory, disk-based, and remote storage
- Automatically invalidate caches when source data changes, ensuring data freshness without manual intervention
- Chain caches across multiple engines, enabling complex pipelines with multiple caching layers
Core Concepts
Lazy Evaluation and Caching
Xorq operations are lazy by default: they don't execute until you call .execute(). This lazy evaluation works hand in hand with the caching system. Note that .cache() in Xorq is also lazy, a deviation from Ibis's cache, where calling the method eagerly executes the expression.
# Operations are lazy until execute() is called
recent_batting = (
    batting[batting.yearID > 2010]
    .select(['playerID', 'yearID', 'teamID', 'G', 'AB', 'R', 'H'])
)

# Execute to see results - this is when caching can be applied
result = recent_batting.execute()
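Because .cache() is itself lazy, you can keep composing on top of a cached expression before anything runs. A minimal sketch, reusing storage from the opening example:

# .cache() is lazy: it returns a new expression and runs nothing yet
cached = recent_batting.cache(storage=storage)

# Keep building on the cached expression; still nothing has executed
top_hitters = cached.order_by(cached.H.desc()).limit(10)

# Only now does the pipeline run, populating the cache on first use
df = top_hitters.execute()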
Cache Keys and Hashing
Xorq uses different hashing strategies to determine when cached data is still valid. Cache keys are generated using cityhash based on different components depending on the storage type:
| Storage Type | Hash Components |
|---|---|
| In-Memory | Data bytes + Schema |
| Disk-Based | Query plan + Schema |
| Remote | Table metadata |
Additionally, when data freshness is required (i.e., automatic cache invalidation on data changes), the last-modified time of the underlying files is also factored in, for both disk-based and remote storage.
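To make the idea concrete, here is a purely illustrative sketch of deriving a disk-based cache key from a query plan plus schema. This is not xorq's internal code; the key format and the use of the cityhash package are assumptions, though the prefix matches the XORQ_CACHE_KEY_PREFIX variable shown at the end of this page:

from cityhash import CityHash64  # illustrative choice; not xorq's internal module

def disk_cache_key(query_plan: str, schema: str, prefix: str = "letsql_cache-") -> str:
    # Hash the query plan together with the schema, per the table above
    digest = CityHash64((query_plan + schema).encode("utf-8"))
    return f"{prefix}{digest:016x}"

key = disk_cache_key(
    "SELECT * FROM batting WHERE yearID = 2015",
    "playerID:string,yearID:int64",
)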
Storage Types
Xorq supports four types of storage, each optimized for different use cases:
SourceStorage
SourceStorage provides automatic cache invalidation when upstream data changes (a behavior sketch follows the feature list):
Key Features:
- Automatically invalidates cache when upstream data changes
- Persistence depends on the source backend
- Supports both remote (Snowflake, Postgres) and in-process (pandas, DuckDB) backends
- Ideal for production pipelines where data freshness is critical
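The opening example on this page already uses SourceStorage; the sketch below spells out the invalidation behavior, reusing expr and storage from that example (the upstream modification step is hypothetical):

# Reusing `expr` and `storage` from the opening example
cached = expr.cache(storage=storage)

xo.execute(cached)  # first run: computes and populates the cache
xo.execute(cached)  # cache hit while the upstream "batting" table is unchanged

# If rows are later added to "batting" in Postgres, the cache key no longer
# matches, so the next execution recomputes and refreshes the cache.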
SnapshotStorage
SnapshotStorage provides caching without automatic invalidation:
from xorq.caching import SourceSnapshotStorage

# Create snapshot storage
storage = SourceSnapshotStorage(source=con)

# Cache data without automatic invalidation
cached_snapshot = expr.cache(storage=storage)
Key Features:
- No automatic invalidation
- Ideal for one-off analyses or when you want manual control over cache invalidation (see the sketch after this list)
- Persistence depends on source backend
- Useful for exploratory data analysis where you want to preserve intermediate results
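By contrast with SourceStorage, a snapshot stays pinned to the data as it was when first cached. A brief sketch, reusing cached_snapshot from above:

# Computes once and stores the snapshot
cached_snapshot.execute()

# ...even if the upstream table changes afterwards...
cached_snapshot.execute()  # still served from the stored snapshot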
ParquetStorage
ParquetStorage is a special case of SourceStorage that persists data as Parquet files:
from pathlib import Path
from xorq.caching import ParquetStorage

# Create a storage for cached data
cache_storage = ParquetStorage(source=con, base_path=Path.cwd())

# Cache the results as Parquet files
cached_awards = pg.table("awards_players").cache(storage=cache_storage)

# The next execution will use the cached Parquet data
result = cached_awards.execute()
Key Features:
- Caches results as Parquet files on local disk
- Uses source backend for writing and reading
- Ensures durable persistence across sessions (see the inspection sketch after this list)
- Excellent for iterative development workflows
- Supports compression and efficient columnar storage
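Because the cache is ordinary Parquet on disk, you can inspect it directly. A sketch; the exact directory layout under base_path is an assumption:

from pathlib import Path

# List cached Parquet files under the storage's base_path
for path in Path.cwd().glob("**/*.parquet"):
    print(path.name, path.stat().st_size, "bytes")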
ParquetSnapshotStorage
ParquetSnapshotStorage combines Parquet file persistence with snapshot-style caching (no automatic invalidation):
from pathlib import Path
from xorq.caching import ParquetSnapshotStorage

# Create a snapshot storage for Parquet files
cache_storage = ParquetSnapshotStorage(source=con, base_path=Path.cwd())

# Cache the results as Parquet files without automatic invalidation
cached_analysis = expr.cache(storage=cache_storage)

# Subsequent runs will use the cached Parquet data
result = cached_analysis.execute()
Key Features:
- Caches results as Parquet files on local disk
- No automatic invalidation - manual control over cache lifecycle
- Ideal for reproducible research and analysis where you want fixed snapshots
Multi-Engine Caching
Xorq excels at caching data across different backends using into_backend():
# Read from Postgres and cache in xorq backend
awards = pg.table("awards_players").into_backend(con, "awards")

# Perform operations and cache
cached_join = (
    expr.join(awards, ['playerID', 'yearID'])
    .cache(storage=ParquetStorage(con))
)

# Move to DuckDB for specific operations
ddb = xo.duckdb.connect()
ddb_summary = cached_join.into_backend(ddb, "ddb_awards")
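From here, downstream operations run on the DuckDB engine while the joined result stays cached as Parquet. A short sketch; the aggregation is illustrative and assumes the yearID column survives the join:

# Aggregate on DuckDB; the cached join is read once and reused across runs
per_year = ddb_summary.group_by("yearID").agg(rows=ddb_summary.count())
result = per_year.execute()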
Cache Environment Variables
Xorq caching behavior can be configured through environment variables:
export XORQ_CACHE_DIR=~/.cache/xorq
export XORQ_DEFAULT_RELATIVE_PATH=parquet
export XORQ_CACHE_KEY_PREFIX=letsql_cache-
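These can also be set from Python before xorq is imported. A sketch; it is an assumption that the variables are read once at import/configuration time:

import os

# Configure the cache for this process
os.environ["XORQ_CACHE_DIR"] = "/tmp/xorq-cache"
os.environ["XORQ_CACHE_KEY_PREFIX"] = "letsql_cache-"

import xorq.api as xo  # subsequent caching uses the configured location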