Caching System

xorq provides a caching system for efficient iterative development of ML pipelines. It lets you:

  • Cache results from upstream query engines
  • Persist data locally or in remote storage
  • Automatically invalidate cache when source data changes
  • Chain caches across multiple engines

Storage Types

xorq supports two main types of cache storage, SourceStorage and SnapshotStorage, plus ParquetStorage, which is a special case of SourceStorage:

1. SourceStorage

  • Automatically invalidates cache when upstream data changes
  • Persistence depends on the source backend
  • Supports both remote (Snowflake, Postgres) and in-process (pandas, DuckDB) backends

import xorq as xo
from xorq.caching import SourceStorage

# Connect to source database
pg = xo.postgres.connect_env()
con = xo.connect()  # empty connection

# Create source storage
storage = SourceStorage(source=con)

# Reference the batting table in Postgres (caching is applied below)
batting = pg.table("batting")

# Cache the filtered data in the source backend
cached = (
    batting.filter(batting.yearID == 2015)
    .cache(storage=storage)  # cache expression
)

# Execute the query - results will be cached
result = xo.execute(cached)

2. SnapshotStorage

  • No automatic invalidation
  • Ideal for one-off analyses
  • Persistence depends on the source backend
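
The difference between automatic invalidation and snapshot behaviour can be illustrated with a toy model. The sketch below is not xorq's implementation: `ToyCache`, `source_key`, and `snapshot_key` are hypothetical names, and a plain dict stands in for the upstream table. It assumes that an invalidating cache folds a source version marker into its key, while a snapshot cache keys on the expression alone.

```python
import hashlib


def source_key(expr: str, source_version: int) -> str:
    """Key that changes whenever the upstream data version changes."""
    payload = f"{expr}|v{source_version}".encode()
    return hashlib.sha256(payload).hexdigest()


def snapshot_key(expr: str) -> str:
    """Key that depends only on the expression, ignoring upstream changes."""
    return hashlib.sha256(expr.encode()).hexdigest()


class ToyCache:
    """Dictionary-backed cache that counts how often it recomputes."""

    def __init__(self):
        self.store = {}
        self.computations = 0

    def get(self, key, compute):
        if key not in self.store:
            self.store[key] = compute()
            self.computations += 1
        return self.store[key]


# Hypothetical upstream table with a version marker bumped on each write.
table = {"version": 1, "rows": [2014, 2015, 2015]}
expr = "batting.filter(yearID == 2015)"


def run_query():
    return [year for year in table["rows"] if year == 2015]


invalidating = ToyCache()
snapshot = ToyCache()

first_inv = invalidating.get(source_key(expr, table["version"]), run_query)
first_snap = snapshot.get(snapshot_key(expr), run_query)

# Upstream data changes: a new row arrives and the version is bumped.
table["rows"].append(2015)
table["version"] += 1

second_inv = invalidating.get(source_key(expr, table["version"]), run_query)
second_snap = snapshot.get(snapshot_key(expr), run_query)
```

In this toy model the invalidating cache recomputes after the upstream write and sees the new row, while the snapshot cache keeps returning the result it stored first.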

3. ParquetStorage

  • Special case of SourceStorage
  • Caches results as Parquet files on local disk
  • Uses source backend for writing and reading
  • Ensures durable persistence
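
Durable on-disk persistence means a cached result outlives the object (or process) that wrote it. The sketch below illustrates that idea with the standard library only. It writes JSON files rather than Parquet to stay dependency-free, and `ToyFileCache` and the key `batting_2015` are hypothetical names, not part of xorq.

```python
import json
import tempfile
from pathlib import Path


class ToyFileCache:
    """Stores each cached result as one file under a base directory."""

    def __init__(self, base_path):
        self.base_path = Path(base_path)
        self.base_path.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        return self.base_path / f"{key}.json"

    def exists(self, key):
        return self._path(key).exists()

    def put(self, key, rows):
        self._path(key).write_text(json.dumps(rows))

    def get(self, key):
        return json.loads(self._path(key).read_text())


with tempfile.TemporaryDirectory() as tmp:
    # First "session": compute the result and write it to disk.
    writer = ToyFileCache(tmp)
    writer.put("batting_2015", [{"playerID": "a01", "yearID": 2015}])

    # Second "session": a fresh object finds the file without recomputing.
    reader = ToyFileCache(tmp)
    hit = reader.exists("batting_2015")
    restored = reader.get("batting_2015")
```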

Hashing Strategies

Cache invalidation uses different hashing strategies based on the storage type:

Storage Type    Hash Components
In-Memory       Data bytes + Schema
Disk-Based      Query plan + Schema
Remote          Table metadata + Last modified time
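
As a rough illustration of the table above, each strategy can be thought of as hashing a different set of components into the cache key. The functions below are hypothetical and use only the standard library. They sketch the idea and are not xorq's hashing code. The inputs (raw bytes, a query-plan string, a metadata dict, a timestamp string) are assumed stand-ins for whatever the backend exposes.

```python
import hashlib
import json


def _digest(*parts: bytes) -> str:
    """Hash length-prefixed parts so that part boundaries are unambiguous."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


def _schema_bytes(schema: list) -> bytes:
    # Schema is an ordered list of (column, type) pairs; order is preserved.
    return json.dumps(schema).encode()


def in_memory_key(data: bytes, schema: list) -> str:
    """In-memory: data bytes + schema."""
    return _digest(data, _schema_bytes(schema))


def disk_key(query_plan: str, schema: list) -> str:
    """Disk-based: query plan + schema."""
    return _digest(query_plan.encode(), _schema_bytes(schema))


def remote_key(table_metadata: dict, last_modified: str) -> str:
    """Remote: table metadata + last modified time."""
    meta = json.dumps(table_metadata, sort_keys=True).encode()
    return _digest(meta, last_modified.encode())


schema = [("playerID", "string"), ("yearID", "int64")]

# Different data bytes give different in-memory keys.
key_a = in_memory_key(b"row-bytes-v1", schema)
key_b = in_memory_key(b"row-bytes-v2", schema)

# A newer modification time gives a different remote key.
remote_old = remote_key({"name": "batting"}, "2024-01-01T00:00:00Z")
remote_new = remote_key({"name": "batting"}, "2024-06-01T00:00:00Z")
```

Under these assumptions, a cache entry is reused only while every component of its key is unchanged. Any change to the data bytes, the query plan, the schema, or the modification time produces a new key, so the old entry is no longer matched.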