Cache expression results

This tutorial shows you how Xorq’s caching system works through hands-on examples. You’ll see cache hits and misses in real time, and understand when Xorq reuses results versus recomputing them.

After completing this tutorial, you’ll know how to use caching to speed up your workflows.

Why caching matters

Running the same query twice shouldn’t mean doing the work twice. Xorq caches expression results, so a repeated query reads its result from the cache instead of recomputing it.

This is especially powerful for expensive operations:

- Loading large datasets from remote databases
- Training machine learning models
- Calling external APIs
- Running complex aggregations

Smart caching

Xorq uses content-addressed hashing to determine if an expression matches cached results. Same computation = same hash = cache hit.
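
To make the idea concrete, here is a toy sketch of content-addressed keys that uses only the Python standard library. It is not Xorq's hashing code: the `expression_key` helper and the dictionary used to describe a computation are assumptions made for illustration.

```python
import hashlib
import json


def expression_key(description: dict) -> str:
    """Return a stable hex digest for a description of a computation.

    Hypothetical helper: Xorq's real hashing logic is not shown here.
    """
    # Sorting keys makes the serialization independent of dict ordering.
    canonical = json.dumps(description, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = expression_key(
    {"table": "iris", "op": "filter", "column": "sepal_length", "gt": 6}
)
same = expression_key(
    {"gt": 6, "column": "sepal_length", "op": "filter", "table": "iris"}
)
changed = expression_key(
    {"table": "iris", "op": "filter", "column": "sepal_length", "gt": 6.5}
)

print(first == same)     # same computation, same key
print(first == changed)  # different threshold, different key
```

Because the key depends only on the content of the description, two separately built but identical descriptions map to the same cache entry, while any change to the computation produces a new key.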

How to follow along

Run the code examples in order using any of these methods:

Python interactive shell (recommended): Open a terminal, run python, then copy and paste each code block.
Jupyter notebook: Create a new notebook and run each code block in a separate cell.
Python script: Copy all code blocks into a .py file and run it with python script.py.

The code blocks build on each other. Variables like iris, storage, and cached_expr are created in earlier blocks and used in later ones.

Set up caching

You’ll start by connecting to a backend and setting up a cache storage location.

import xorq.api as xo
from xorq.caching import SourceCache


con = xo.connect()


storage = SourceCache.from_kwargs(source=con)

print(f"Connected to: {con}")
print("Cache storage ready!")

1: Connect to the embedded backend where cached data is stored.
2: Create a SourceCache object that manages the cache.

SourceCache stores cached results in your backend as tables. When you run an expression with .cache(), Xorq saves the results and reuses them on subsequent runs.
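
The idea of keeping cached results as tables inside a backend can be sketched with the standard library's `sqlite3` module. This is only an analogy, assuming a hypothetical `cached_query` helper and a hand-picked cache table name; it does not show how SourceCache names or manages its tables.

```python
import sqlite3

# A throwaway in-memory database stands in for the backend.
toy_backend = sqlite3.connect(":memory:")
toy_backend.execute("CREATE TABLE measurements (sepal_length REAL)")
toy_backend.executemany(
    "INSERT INTO measurements VALUES (?)", [(5.1,), (6.3,), (7.0,)]
)


def cached_query(backend, cache_table, select_sql):
    """Run select_sql once, then serve later calls from cache_table."""
    exists = backend.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (cache_table,),
    ).fetchone()
    if exists is None:
        # Cache miss: materialize the query result as a table.
        backend.execute(f'CREATE TABLE "{cache_table}" AS {select_sql}')
        status = "miss"
    else:
        # Cache hit: the table already holds the result.
        status = "hit"
    rows = backend.execute(f'SELECT * FROM "{cache_table}"').fetchall()
    return status, rows


query = "SELECT sepal_length FROM measurements WHERE sepal_length > 6"
first_status, first_rows = cached_query(toy_backend, "cache_long_sepals", query)
second_status, second_rows = cached_query(toy_backend, "cache_long_sepals", query)
print(first_status, second_status)
```

The first call materializes the result as a table and the second call only reads that table back, which mirrors the compute-once, reuse-later behavior described above.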

Cache your first expression

Now you’ll build an expression and add caching to it.


iris = xo.examples.iris.fetch(backend=con)


cached_expr = (
    iris
    .filter(xo._.sepal_length > 6)
    .cache(cache=storage)
)

print(f"Expression with caching: {type(cached_expr)}")

1: Load the iris dataset.
2: Build a filter expression.
3: Add caching with .cache(cache=storage).

The .cache() method tells Xorq to store results from this expression. On the first run, Xorq computes and caches the results. On subsequent runs, it retrieves them directly from cache.

Observe cache miss (first run)

You’ll execute the expression for the first time. This will be a cache miss because Xorq has to compute the results.

import time


print("First execution (cache miss)...")
start = time.time()


result1 = cached_expr.execute()


elapsed = time.time() - start
print(f"✗ Cache miss: computed in {elapsed:.4f} seconds")
print(f"Result shape: {result1.shape}")
print("\nFirst few rows:")
print(result1.head(3))

1: Start timing the execution.
2: Execute the expression, which triggers computation and caching.
3: Print how long it took.

Since this is the first run, Xorq computed the filter operation and stored the results in cache.

Observe cache hit (second run)

Now you can run the same expression again. This time you’ll see a cache hit.


print("\nSecond execution (cache hit)...")
start = time.time()


result2 = cached_expr.execute()


elapsed = time.time() - start
print(f"✓ Cache hit: returned in {elapsed:.4f} seconds")
print(f"Results match: {result1.equals(result2)}")

1: Time the second execution.
2: Run the same expression again.
3: See how much faster it was.

The second execution should be significantly faster because Xorq fetched results from cache instead of recomputing the filter operation.

Cache key

Xorq computes a hash from your expression’s structure and data sources. If the expression is identical, then the hash matches, and you get a cache hit.

Understand cache invalidation

What happens if you change the expression? You’ll modify the filter and see cache invalidation in action.


modified_expr = (
    iris
    .filter(xo._.sepal_length > 6.5)
    .cache(cache=storage)
)


print("Modified expression (different filter)...")
start = time.time()
result3 = modified_expr.execute()
elapsed = time.time() - start


print(f"✗ Cache miss: computed in {elapsed:.4f} seconds")
print(f"Different result shape: {result3.shape}")

1: Create a new expression with a different filter threshold.
2: The threshold changed from > 6 to > 6.5, which makes this a different computation.
3: Execute the modified expression.
4: Cache miss because the expression changed.

Since you changed the filter threshold, Xorq computed a different hash. The cache from the previous expression doesn’t match, so Xorq recomputed.

Compare multiple runs

You’ll run several executions and see the timing difference between cache hits and misses.


def time_execution(expr, label):
    start = time.time()
    result = expr.execute()
    elapsed = time.time() - start
    print(f"{label}: {elapsed:.4f}s - {len(result)} rows")
    return elapsed


# cached_expr was already executed above, so it would hit the cache on
# every run. Use a threshold that has not been cached yet, so that the
# first run is a real miss.
fresh_expr = (
    iris
    .filter(xo._.sepal_length > 5.5)
    .cache(cache=storage)
)

print("\nTiming comparison:")
print("-" * 50)


t1 = time_execution(fresh_expr, "Run 1 (miss)")
t2 = time_execution(fresh_expr, "Run 2 (hit) ")
t3 = time_execution(fresh_expr, "Run 3 (hit) ")


speedup = t1 / t2 if t2 > 0 else float('inf')
print(f"\nSpeedup from caching: {speedup:.1f}x faster")

1: Create a helper function to time executions.
2: Print a header for the comparison.
3: Run the same expression three times.
4: Calculate the speedup from caching.

The first execution is a cache miss (slower), but the second and third are cache hits (much faster). This shows how caching eliminates redundant computation.
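
Single wall-clock measurements can be noisy, particularly for work that finishes in a few milliseconds. One way to get steadier numbers for the cache-hit case is to repeat the call and take the median. The sketch below uses only the standard library; `median_runtime` and `stand_in_workload` are hypothetical names, and the workload is a placeholder for a call such as `cached_expr.execute()`.

```python
import statistics
import time


def median_runtime(func, repeats=5):
    """Call func `repeats` times and return the median wall-clock time."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def stand_in_workload():
    # Placeholder for something like `cached_expr.execute()`.
    return sum(i * i for i in range(100_000))


print(f"Median of 5 runs: {median_runtime(stand_in_workload):.4f}s")
```

With a cached expression, time the first (miss) run on its own before using a helper like this, since every repeat after the first would be a cache hit.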

Cache storage

SourceCache keeps cached data in your backend as tables. Make sure you have enough storage space for cached results, especially with large datasets.

Chain cached expressions

You can cache multiple steps in a pipeline. Each cached expression can reuse results from previous runs.


step1 = iris.filter(xo._.sepal_length > 5).cache(cache=storage)


step2 = step1.group_by("species").agg(
    avg_width=xo._.sepal_width.mean()
).cache(cache=storage)


print("First execution of step2...")
result_a = step2.execute()


print("\nSecond execution of step2...")
result_b = step2.execute()

print("\nBoth steps now cached!")
print(result_a)

1: Cache the filtered dataset.
2: Build on the cached result and cache the aggregation too.
3: First execution caches both steps.
4: Second execution hits cache for both steps.

When you cache multiple steps, Xorq can reuse intermediate results, making complex pipelines faster on repeated runs.
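
The benefit of caching intermediate steps can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not Xorq's mechanism: `cached_step`, the string keys, and the sample numbers are all invented for illustration.

```python
compute_log = []   # records which steps actually ran
step_cache = {}    # maps a step key to its stored result


def cached_step(key, compute):
    """Return the stored result for key, computing it only on a miss."""
    if key not in step_cache:
        step_cache[key] = compute()
        compute_log.append(key)
    return step_cache[key]


sepal_lengths = [4.9, 5.4, 6.1, 6.7, 7.2]


def filtered():
    return cached_step(
        "filter>5", lambda: [x for x in sepal_lengths if x > 5]
    )


def mean_of_filtered():
    return cached_step(
        "filter>5|mean", lambda: sum(filtered()) / len(filtered())
    )


def max_of_filtered():
    return cached_step("filter>5|max", lambda: max(filtered()))


mean_of_filtered()   # computes the filter, then the mean
mean_of_filtered()   # both steps are served from the cache
max_of_filtered()    # new final step, but the filter is reused

print(compute_log)
```

The log shows the filter running only once even though two different downstream steps depend on it, which is the kind of reuse that caching an intermediate step makes possible.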

Complete example

Here’s a full caching workflow in one place:

import xorq.api as xo
from xorq.caching import SourceCache

# Set up connection and load data
con = xo.connect()
storage = SourceCache.from_kwargs(source=con)
iris = xo.examples.iris.fetch(backend=con)

# Build cached expression
cached_expr = (
    iris
    .filter(xo._.sepal_length > 6)
    .cache(cache=storage)
)

# First run: cache miss
result1 = cached_expr.execute()
print("First run complete (cached)")

# Second run: cache hit
result2 = cached_expr.execute()
print("Second run complete (from cache)")

Next steps

Now you understand how caching works. Continue learning:

Switch backends shows how caching works when moving data between engines
Your first build explains how cached expressions become portable artifacts
Optimize pipeline performance covers advanced caching strategies