Caching

Xorq provides a caching system that enables efficient iterative development of ML pipelines. The caching system is designed to avoid redundant computation and to automatically invalidate cached results when upstream data changes.

Overview

The caching system in Xorq allows you to:

  • Cache results from upstream query engines, storing intermediate results to avoid recomputation
  • Persist data in memory, on local disk, or in remote storage
  • Automatically invalidate caches when source data changes, ensuring freshness without manual intervention
  • Chain caches across multiple engines, enabling complex pipelines with multiple caching layers

Core concepts

Lazy evaluation and caching

Xorq operations are lazy by default: they don't execute until you call .execute(). This lazy evaluation works hand in hand with the caching system:

Note

The lazy .cache() in Xorq deviates from Ibis's cache(), which eagerly executes the expression when called.

import xorq.api as xo

# Connect to Postgres and register the source table
pg = xo.postgres.connect_env()
batting = pg.table("batting")

# Operations are lazy until execute() is called
recent_batting = (
    batting[batting.yearID > 2010]
    .select(['playerID', 'yearID', 'teamID', 'G', 'AB', 'R', 'H'])
)

# Execute to see results - this is when caching can be applied
result = recent_batting.execute()
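
Because .cache() is itself lazy, it can be inserted into the chain without triggering execution. A minimal sketch, assuming a cache object (storage) like the ones constructed in the sections below:

# .cache() is lazy too: it only marks the expression for caching
cached = recent_batting.cache(cache=storage)  # no computation happens here

# Executing computes the result and populates the cache
df = cached.execute()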

Cache keys and hashing

Xorq uses different hashing strategies to determine whether cached data is still valid. Cache keys are generated with cityhash from components that depend on the cache type:

Storage Type   Hash Components
In-Memory      Data bytes + Schema
Disk-Based     Query plan + Schema
Remote         Table metadata

Additionally, when data freshness is required (that is, automatic cache invalidation on data changes), Xorq also folds the last-modified time of the underlying files into the key, for both disk-based and remote storage.
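
As an illustration only (the actual key derivation is an xorq internal and may differ), a disk-based key could be computed along these lines with the third-party cityhash package; the letsql_cache- prefix is the default shown in the environment variables section below:

from cityhash import CityHash64

def disk_cache_key(query_plan: bytes, schema: bytes, prefix: str = "letsql_cache-") -> str:
    # Disk-based keys hash the serialized query plan together with the schema
    digest = CityHash64(query_plan + schema)
    return f"{prefix}{digest:x}"

print(disk_cache_key(b"Filter(yearID == 2015)", b"playerID: string, yearID: int64"))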

Cache types

Xorq supports four types of cache, each optimized for different use cases:

SourceCache

SourceCache provides automatic cache invalidation when upstream data changes:

import xorq.api as xo
from xorq.caching import SourceCache

# Connect to source database
pg = xo.postgres.connect_env()
con = xo.connect()  

# Create source storage
storage = SourceCache.from_kwargs(source=con)

# Register table from postgres and cache it
batting = pg.table("batting")

expr = batting.filter(batting.yearID == 2015)

# Cache the filtered data in the source backend
cached = expr.cache(cache=storage)

# Execute the query - results will be cached
result = xo.execute(cached)

Key Features:

  • Automatically invalidates cache when upstream data changes
  • Persistence depends on the source backend
  • Supports both remote (Snowflake, Postgres) and in-process (pandas, DuckDB) backends
  • Ideal for production pipelines where data freshness is critical
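
The invalidation behavior is easiest to see across two runs:

result_1 = xo.execute(cached)  # computes and populates the cache

# ... rows are inserted into the upstream `batting` table in Postgres ...

result_2 = xo.execute(cached)  # the cache key no longer matches, so this recomputes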

SourceSnapshotCache

SourceSnapshotCache provides caching without automatic invalidation:

import xorq.api as xo
from xorq.caching import SourceSnapshotCache

# Connect to source database
pg = xo.postgres.connect_env()
con = xo.connect()

# Create snapshot storage
storage = SourceSnapshotCache.from_kwargs(source=con)

# Register table from postgres
batting = pg.table("batting")
expr = batting.filter(batting.yearID == 2015)

# Cache data without automatic invalidation
cached_snapshot = expr.cache(cache=storage)

Key Features:

  • No automatic invalidation
  • Ideal for one-off analyses or when you want manual control over cache invalidation
  • Persistence depends on source backend
  • Useful for exploratory data analysis where you want to preserve intermediate results
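
The contrast with SourceCache shows up on re-execution:

result_1 = cached_snapshot.execute()  # computes and populates the cache

# ... the upstream `batting` table changes in Postgres ...

result_2 = cached_snapshot.execute()  # still served from the snapshot, no recompute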

ParquetCache

ParquetCache is a special case of SourceCache that persists data as Parquet files:

import xorq.api as xo
from pathlib import Path
from xorq.caching import ParquetCache

# Connect to source database
pg = xo.postgres.connect_env()
con = xo.connect()

# Create a storage for cached data
cache_storage = ParquetCache.from_kwargs(source=con, base_path=Path.cwd())

# Cache the results as Parquet files
cached_awards = pg.table("awards_players").cache(cache=cache_storage)

# The next execution will use the cached Parquet data
result = cached_awards.execute()

Key Features:

  • Caches results as Parquet files on local disk
  • Uses source backend for writing and reading
  • Ensures durable persistence across sessions
  • Excellent for iterative development workflows
  • Supports compression and efficient columnar storage
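
Because the cache is just Parquet on disk, you can inspect what was written. A sketch, assuming the files land under the base_path passed above (the exact directory layout and file naming are internal details):

from pathlib import Path

# List cached Parquet files under the cache directory
for path in sorted(Path.cwd().rglob("*.parquet")):
    print(path.relative_to(Path.cwd()), path.stat().st_size, "bytes")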

ParquetSnapshotCache

ParquetSnapshotCache combines Parquet file persistence with snapshot-style caching (no automatic invalidation):

import xorq.api as xo
from pathlib import Path
from xorq.caching import ParquetSnapshotCache

# Connect to source database
pg = xo.postgres.connect_env()
con = xo.connect()

# Create a snapshot storage for Parquet files
cache_storage = ParquetSnapshotCache.from_kwargs(source=con, base_path=Path.cwd())

# Register table from postgres
batting = pg.table("batting")
expr = batting.filter(batting.yearID == 2015)

# Cache the results as Parquet files without automatic invalidation
cached_analysis = expr.cache(cache=cache_storage)

# Subsequent runs will use the cached Parquet data
result = cached_analysis.execute()

Key Features:

  • Caches results as Parquet files on local disk
  • No automatic invalidation - manual control over cache lifecycle
  • Ideal for reproducible research and analysis where you want fixed snapshots
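
Manual invalidation then amounts to removing the snapshot files yourself. A hypothetical sketch, again assuming the cached files live under base_path; in real code you would narrow the glob to the cache's own files:

from pathlib import Path

# Deleting the snapshot files forces recomputation on the next execute()
for path in Path.cwd().rglob("*.parquet"):
    path.unlink()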

Multi-engine caching

Xorq excels at caching data across different backends using into_backend():

import xorq.api as xo
from xorq.caching import ParquetCache

# Connect to source database
pg = xo.postgres.connect_env()
con = xo.connect()

# Register table from postgres
batting = pg.table("batting")
expr = batting.filter(batting.yearID == 2015)

# Read from Postgres and cache in xorq backend
awards = pg.table("awards_players").into_backend(con, "awards")

# Perform operations and cache
cached_join = (
    expr.join(awards, ['playerID', 'yearID'])
    .cache(cache=ParquetCache.from_kwargs(source=con))
)

# Move to DuckDB for specific operations
ddb = xo.duckdb.connect()
ddb_summary = cached_join.into_backend(ddb, "ddb_awards")
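
Downstream work now runs in DuckDB while the upstream join is served from the Parquet cache. For example, using the Ibis-style expression API that xorq exposes:

# Aggregate in DuckDB; the cached join is read back rather than recomputed
summary = (
    ddb_summary.group_by("awardID")
    .agg(n=ddb_summary.count())
    .execute()
)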

Cache environment variables

Xorq caching behavior can be configured through environment variables:

export XORQ_CACHE_DIR=~/.cache/xorq
export XORQ_DEFAULT_RELATIVE_PATH=parquet
export XORQ_CACHE_KEY_PREFIX=letsql_cache-
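
Judging by their names, XORQ_CACHE_DIR sets where cached artifacts are written, XORQ_DEFAULT_RELATIVE_PATH the default subdirectory for Parquet caches, and XORQ_CACHE_KEY_PREFIX the prefix prepended to generated cache keys. The same settings can be applied from Python, assuming (as with most environment-based configuration) they are read when xorq is first imported:

import os
from pathlib import Path

# Must run before importing xorq
os.environ["XORQ_CACHE_DIR"] = str(Path.home() / ".cache" / "xorq")
os.environ["XORQ_DEFAULT_RELATIVE_PATH"] = "parquet"
os.environ["XORQ_CACHE_KEY_PREFIX"] = "letsql_cache-"

import xorq.api as xo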