This tutorial shows you how Xorq’s caching system works through hands-on examples. You’ll see cache hits and misses in real time, and understand when Xorq reuses results versus recomputing them.
After completing this tutorial, you’ll know how to use caching to speed up your workflows.
Why caching matters
Running the same query twice shouldn’t mean doing the work twice. Xorq caches expression results, so a repeated query reads its result from the cache instead of recomputing it.
This is especially powerful for expensive operations:
- Loading large datasets from remote databases
- Training machine learning models
- Calling external APIs
- Running complex aggregations
Tip: Smart caching
Xorq uses content-addressed hashing to determine if an expression matches cached results. Same computation = same hash = cache hit.
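To make the idea concrete, here is a toy sketch of content-addressed keys using only the Python standard library. It is not Xorq's implementation: the `cache_key` helper and the dictionary layout used to describe an expression are hypothetical, chosen only to show that an identical description yields an identical hash while any change yields a different one.

```python
import hashlib
import json


def cache_key(description: dict) -> str:
    """Hash a canonical JSON form of a computation's description (hypothetical helper)."""
    # sort_keys makes the serialization independent of dict insertion order
    canonical = json.dumps(description, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two descriptions of the same computation, written in a different key order
expr_a = {"source": "iris", "op": "filter", "column": "sepal_length", "gt": 6}
expr_b = {"gt": 6, "column": "sepal_length", "op": "filter", "source": "iris"}
# The same computation with a different threshold
expr_c = {"source": "iris", "op": "filter", "column": "sepal_length", "gt": 7}

key_a, key_b, key_c = cache_key(expr_a), cache_key(expr_b), cache_key(expr_c)
print(key_a == key_b)  # same computation, same key
print(key_a == key_c)  # different computation, different key
```

The key depends only on what is being computed, not on when or where it is requested, which is what lets a later run find an earlier result.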
How to follow along
Run the code examples in order using any of these methods:
Python interactive shell (recommended): Open a terminal, run python, then copy and paste each code block.
Jupyter notebook: Create a new notebook and run each code block in a separate cell.
Python script: Copy all code blocks into a .py file and run it with python script.py.
The code blocks build on each other. Variables like iris, storage, and cached_expr are created in earlier blocks and used in later ones.
Set up caching
You’ll start by connecting to a backend and setting up a cache storage location.
import xorq.api as xo
from xorq.caching import SourceCache

con = xo.connect()
storage = SourceCache.from_kwargs(source=con)

print(f"Connected to: {con}")
print("Cache storage ready!")

1. Connect to the embedded backend where cached data is stored.
2. Create a SourceCache object that manages the cache.
SourceCache stores cached results in your backend as tables. When you run an expression with .cache(), Xorq saves the results and reuses them on subsequent runs.
Cache your first expression
Now you’ll build an expression and add caching to it.
The .cache() method tells Xorq to store results from this expression. On the first run, Xorq computes and caches the results. On subsequent runs, it retrieves them directly from cache.
Observe cache miss (first run)
You’ll execute the expression for the first time. This will be a cache miss: Xorq has to compute the results.
1. Execute the expression, which triggers computation and caching.
2. Print how long it took.
Since this is the first run, Xorq computed the filter operation and stored the results in cache.
Observe cache hit (second run)
Now you can run the same expression again. This time you’ll see a cache hit.
print("\nSecond execution (cache hit)...")
start = time.time()
result2 = cached_expr.execute()
elapsed = time.time() - start
print(f"✓ Cache hit: returned in {elapsed:.4f} seconds")
print(f"Results match: {result1.equals(result2)}")

1. Time the second execution.
2. Run the same expression again.
3. See how much faster it was.
The second execution should be significantly faster because Xorq fetched results from cache instead of recomputing the filter operation.
Note: Cache key
Xorq computes a hash from your expression’s structure and data sources. If the expression is identical, then the hash matches, and you get a cache hit.
Understand cache invalidation
What happens if you change the expression? You’ll modify the filter and see cache invalidation in action.
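The effect can be modeled with a toy cache in plain Python. This is not Xorq's implementation: `ToyCache`, its `run` method, and the dictionary description of an expression are hypothetical stand-ins, and the sample values are made up. In this sketch, changing the filter threshold produces a new key, so the modified expression misses on its first run and hits on the runs after that.

```python
import hashlib
import json


class ToyCache:
    """Hypothetical in-memory cache keyed by a hash of the expression description."""

    def __init__(self):
        self.store = {}
        self.log = []  # records "miss" or "hit" for each run

    def run(self, description, compute):
        key = hashlib.sha256(
            json.dumps(description, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key in self.store:
            self.log.append("hit")
        else:
            self.log.append("miss")
            self.store[key] = compute()  # compute once, then reuse
        return self.store[key]


sepal_lengths = [5.1, 6.3, 7.2, 6.8, 7.7]  # made-up sample values
cache = ToyCache()

original = {"op": "filter", "column": "sepal_length", "gt": 6}
modified = {"op": "filter", "column": "sepal_length", "gt": 7}

cache.run(original, lambda: [x for x in sepal_lengths if x > 6])  # miss
for _ in range(3):
    # New threshold, new key: the first run misses, later runs hit
    result = cache.run(modified, lambda: [x for x in sepal_lengths if x > 7])

print(cache.log)         # ['miss', 'miss', 'hit', 'hit']
print(result)            # [7.2, 7.7]
print(len(cache.store))  # 2 entries: one per distinct expression
```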
The first execution is a cache miss (slower), but the second and third are cache hits (much faster). This shows how caching eliminates redundant computation.
Warning: Cache storage
SourceCache keeps cached data in your backend as tables. Make sure you have enough storage space for cached results, especially with large datasets.
Chain cached expressions
You can cache multiple steps in a pipeline. Each cached expression can reuse results from previous runs.
step1 = iris.filter(xo._.sepal_length > 5).cache(cache=storage)
step2 = step1.group_by("species").agg(
    avg_width=xo._.sepal_width.mean()
).cache(cache=storage)

print("First execution of step2...")
result_a = step2.execute()

print("\nSecond execution of step2...")
result_b = step2.execute()

print("\nBoth steps now cached!")
print(result_a)

1. Cache the filtered dataset.
2. Build on the cached result and cache the aggregation too.
3. First execution caches both steps.
4. Second execution hits cache for both steps.
When you cache multiple steps, Xorq can reuse intermediate results, making complex pipelines faster on repeated runs.
Complete example
Here’s a full caching workflow in one place:
import xorq.api as xo
from xorq.caching import SourceCache

# Set up connection and load data
con = xo.connect()
storage = SourceCache.from_kwargs(source=con)
iris = xo.examples.iris.fetch(backend=con)

# Build cached expression
cached_expr = (
    iris
    .filter(xo._.sepal_length > 6)
    .cache(cache=storage)
)

# First run: cache miss
result1 = cached_expr.execute()
print("First run complete (cached)")

# Second run: cache hit
result2 = cached_expr.execute()
print("Second run complete (from cache)")
Next steps
Now you understand how caching works. Continue learning:
Switch backends shows how caching works when moving data between engines
Your first build explains how cached expressions become portable artifacts