Every operation in Xorq produces an expression — a lazy, immutable description of a computation. Before anything executes, Xorq classifies each expression into an ExprKind. This classification drives how expressions are hashed, cached, serialized, and executed across backends.
The expression kinds
Three of the kinds are structural — they fall out of the shape of the expression graph itself:
| Kind | Description | Unbound? |
|---|---|---|
| Source | A leaf node: raw data with no transformations | No |
| Expr | Bound data with transformations applied | No |
| UnboundExpr | A template expression not tied to any backend | Yes |
Two more are tag-driven: they are recognized by domain-specific tags that higher-level tools (the ML pipeline API, the catalog, semantic layers) attach to the outermost node:
| Kind | Description | Produced by |
|---|---|---|
| ExprBuilder | Carries builder metadata that can recover a domain object (for example, a fitted ML pipeline) | xo.Pipeline fit/predict, semantic-layer queries |
| Composed | Carries a catalog tag, marking a reference to another catalog entry | catalog add / composition |
You can inspect the kind of any expression using the .ls accessor:
import xorq.api as xo
# Source — just data, no transforms
t = xo.memtable({"a": [1, 2, 3], "b": ["x", "y", "z"]})
print(t.ls.kind) # ExprKind.Source
# Expr — transforms applied to bound data
filtered = t.filter(xo._.a > 1)
print(filtered.ls.kind) # ExprKind.Expr
# UnboundExpr — schema template, no concrete data
template = xo.table(schema={"a": "int64", "b": "string"})
print(template.ls.kind) # ExprKind.UnboundExpr
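To make the classification rules concrete, here is a toy model in plain Python. It is not Xorq's implementation: the `Node` class, the tag names `"builder"` and `"catalog"`, and the order in which the checks run are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    SOURCE = "source"
    EXPR = "expr"
    UNBOUND_EXPR = "unbound_expr"
    EXPR_BUILDER = "expr_builder"
    COMPOSED = "composed"


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for one operation in an expression graph."""
    op: str                       # e.g. "InMemoryTable", "Filter", "UnboundTable"
    children: tuple = ()
    tags: frozenset = frozenset()


def leaves(node):
    """Yield every leaf node reachable from `node` (including itself)."""
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


def classify(node):
    # Tag-driven kinds: only the outermost node's tags are consulted.
    # Checking "catalog" before "builder" is an assumption of this sketch.
    if "catalog" in node.tags:
        return Kind.COMPOSED
    if "builder" in node.tags:
        return Kind.EXPR_BUILDER
    # Structural kinds: derived from the shape of the graph.
    if any(leaf.op == "UnboundTable" for leaf in leaves(node)):
        return Kind.UNBOUND_EXPR
    if not node.children:
        return Kind.SOURCE
    return Kind.EXPR


data = Node("InMemoryTable")
filtered = Node("Filter", children=(data,))
template = Node("Filter", children=(Node("UnboundTable"),))
predicted = Node("Predict", children=(data,), tags=frozenset({"builder"}))

for node in (data, filtered, template, predicted):
    print(node.op, "->", classify(node).name)
```

The point of the sketch is the split between the two families: tags are read from the outermost node only, while the structural kinds depend on what sits at the leaves.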
Source expressions
A Source is a leaf node in the expression graph — it represents data without any transformations. Sources are the starting points of every pipeline.
In-memory tables
The simplest source. Data lives in the current process and is wrapped in an InMemoryTable operation.
import xorq.api as xo
t = xo.memtable({"name": ["alice", "bob"], "score": [95, 87]})
print(t.ls.kind) # ExprKind.Source
Database tables
A table registered in a backend (DuckDB, DataFusion, Postgres, etc.). Backed by a DatabaseTable operation that holds a reference to the backend connection.
import xorq.api as xo
con = xo.duckdb.connect()
batting = xo.examples.batting.fetch(backend=con)
print(batting.ls.kind) # ExprKind.Source
Deferred reads
A lazy pointer to a file (Parquet, CSV). The file isn’t read until execution. Backed by a Read operation.
import xorq.api as xo
# A local Parquet file, fetched once via pins
path = xo.options.pins.get_path("batting")
# With an explicit schema the file isn't read until execute()
expr = xo.deferred_read_parquet(
    path,
    schema={
        "playerID": "string",
        "yearID": "int64",
        "teamID": "string",
        "G": "int64",
        "AB": "float64",
        "H": "float64",
    },
)
print(expr.ls.kind) # ExprKind.Source
Cached nodes
When you call .cache() on an expression, Xorq wraps it in a CachedNode. The cache itself is a source — it acts as a materialization boundary that breaks the expression graph into independent segments.
import xorq.api as xo
from xorq.caching import ParquetCache
con = xo.connect()
ddb = xo.duckdb.connect()
cache = ParquetCache.from_kwargs(source=con)
# The cached expression is a Source, even though it wraps transforms
cached = (
    xo.examples.batting.fetch(backend=ddb)
    .filter(xo._.yearID > 2010)
    .cache(cache=cache)
)
print(cached.ls.kind) # ExprKind.Source
print(cached.ls.is_cached) # True
This is an important property: caching resets the expression kind back to Source. Downstream transforms on a cached expression produce a new Expr that depends on the cache, not on the original upstream query.
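The boundary behavior can be pictured with a small stand-alone model. This is only a sketch: the `Node` class and the `segment` helper are hypothetical, and the op names are reused from this page purely as labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for one operation in an expression graph."""
    op: str
    children: tuple = ()


def segment(node):
    """Collect the ops in the segment rooted at `node`.

    Any CachedNode below the root is treated as a leaf, so the walk never
    descends into the query that produced the cache.
    """
    ops = [node.op]
    for child in node.children:
        if child.op == "CachedNode":
            ops.append(child.op)  # boundary: record it, do not descend
        else:
            ops.extend(segment(child))
    return ops


table = Node("DatabaseTable")
upstream = Node("Filter", (table,))
cached = Node("CachedNode", (upstream,))
downstream = Node("Aggregate", (cached,))

print(segment(downstream))  # ['Aggregate', 'CachedNode']
print(segment(cached))      # ['CachedNode', 'Filter', 'DatabaseTable']
```

In this model the downstream segment sees the cache as its only input, which mirrors the statement above that downstream transforms depend on the cache and not on the original upstream query.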
Remote tables
Created by .into_backend(), a RemoteTable transfers data between backends via Arrow record batches. Like a cache, it acts as a source boundary.
import xorq.api as xo
ddb = xo.duckdb.connect()
con = xo.connect() # DataFusion
# Move filtered DuckDB data into DataFusion
t = (
    xo.examples.batting.fetch(backend=ddb)
    .filter(xo._.yearID == 2015)
    .into_backend(con, "local_batting")
)
print(t.ls.kind) # ExprKind.Source
UnboundExpr — template expressions
An UnboundExpr is an expression built on an UnboundTable instead of a concrete data source. It defines a transformation as a reusable template — a schema in, a schema out, and the operations in between.
Creating unbound expressions
Use xo.table() with a schema to create a template:
import xorq.api as xo
# Define a template: takes int64 column "a", filters, adds a computed column
template = xo.table(schema={"a": "int64"})
transform = template.filter(template.a > 0).mutate(doubled=template.a * 2)
print(transform.ls.kind) # ExprKind.UnboundExpr
Unbinding bound expressions
You can convert any bound expression into an unbound one with .unbind(). This strips the backend connection and replaces DatabaseTable nodes with UnboundTable nodes:
import xorq.api as xo
ddb = xo.duckdb.connect()
bound = xo.examples.batting.fetch(backend=ddb).filter(xo._.yearID > 2010)
unbound = bound.unbind()
print(bound.ls.kind) # ExprKind.Expr
print(unbound.ls.kind) # ExprKind.UnboundExpr
# Schema is preserved
assert unbound.schema() == bound.schema()
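Conceptually, unbinding is a graph rewrite that swaps bound leaves for templates and leaves everything else alone. The following stand-alone sketch illustrates that idea only; the `Node` class and this `unbind` function are hypothetical and do not reflect how Xorq implements `.unbind()`.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for one operation in an expression graph."""
    op: str
    schema: tuple = ()              # ((column, dtype), ...)
    backend: Optional[str] = None   # only bound leaf tables carry a backend
    children: tuple = ()


def unbind(node):
    """Return a copy of the graph with bound leaf tables replaced by templates."""
    if node.op == "DatabaseTable":
        # Keep the schema, drop the backend reference.
        return Node("UnboundTable", schema=node.schema)
    return replace(node, children=tuple(unbind(c) for c in node.children))


table = Node("DatabaseTable", schema=(("yearID", "int64"),), backend="duckdb")
bound = Node("Filter", schema=table.schema, children=(table,))
unbound = unbind(bound)

print(unbound.children[0].op)          # UnboundTable
print(unbound.children[0].backend)     # None
print(unbound.schema == bound.schema)  # True
print(bound.children[0].op)            # DatabaseTable (original is untouched)
```

Because the nodes are immutable, the rewrite produces a new graph and the bound expression remains usable.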
Why unbound expressions matter
Unbound expressions are the foundation of several Xorq features:
- Flight UDXFs: Define a transformation template that runs on an Arrow Flight server. The input data is streamed as record batches, and the unbound expression describes what to do with it.
- Serialization: The expression YAML format stores unbound expressions so they can be loaded and bound to a different backend later.
- Schema validation: ExprMetadata exposes schema_in (the UnboundTable’s schema) and schema_out (the result schema), enabling compile-time checks before execution.
The .ls.metadata accessor returns the ExprMetadata for any expression:
import xorq.api as xo
template = xo.table(schema={"a": "int64"})
transform = template.filter(template.a > 0)
metadata = transform.ls.metadata
print(metadata.kind) # ExprKind.UnboundExpr
print(metadata.schema_in) # Schema with column "a" (int64)
print(metadata.schema_out) # Schema with column "a" (int64)
unbound_expr
ibis.Schema {
a int64
}
ibis.Schema {
a int64
}
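As an illustration of the kind of pre-execution check that schema_in makes possible, the sketch below compares an incoming schema against a template's input schema. Plain dicts stand in for ibis.Schema, `check_input` is a hypothetical helper, and the choice to allow extra columns is an assumption of this example.

```python
def check_input(schema_in, incoming):
    """Return a list of problems found when matching `incoming` to `schema_in`."""
    problems = []
    for column, dtype in schema_in.items():
        if column not in incoming:
            problems.append(f"missing column {column!r}")
        elif incoming[column] != dtype:
            problems.append(
                f"column {column!r}: expected {dtype}, got {incoming[column]}"
            )
    return problems


schema_in = {"a": "int64"}

print(check_input(schema_in, {"a": "int64", "b": "string"}))  # []
print(check_input(schema_in, {"a": "string"}))
print(check_input(schema_in, {"b": "string"}))
```

An empty list means the data can be bound to the template; anything else can be reported before any backend does work.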
ExprBuilder — expressions that carry a domain object
An ExprBuilder is an expression whose outermost node carries a builder tag — domain-specific metadata that lets Xorq recover the higher-level object the expression came from. The expression still executes like any other; the tag is extra information riding alongside it.
The most common source is the ML pipeline API. Fitting a pipeline and calling .predict() produces an expression tagged with the fitted pipeline’s steps, features, and target:
import xorq.api as xo
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
train = xo.memtable(
    {
        "feature_0": [1.0, 2.0, 3.0, 4.0],
        "feature_1": [4.0, 5.0, 6.0, 7.0],
        "target": [0.0, 1.0, 0.0, 1.0],
    },
    name="train",
)
sk_pipe = make_pipeline(StandardScaler(), LinearRegression())
fitted = xo.Pipeline.from_instance(sk_pipe).fit(
    train, features=("feature_0", "feature_1"), target="target"
)
predict = fitted.predict(train)
print(predict.ls.kind) # ExprKind.ExprBuilder
The recovered metadata lives on .ls.metadata.builders:
for builder in predict.ls.metadata.builders:
    print(builder["type"], builder["steps"], "->", builder["target"])
fitted_pipeline ('standardscaler', 'linearregression') -> target
This is what lets the catalog persist a fitted pipeline as an artifact and rebuild it later: the tag describes how to reconstruct the domain object, not just the rows it produces. Semantic-layer queries (boring-semantic-layer) produce ExprBuilder expressions the same way.
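One way to picture "a tag that describes how to reconstruct the domain object" is a registry keyed by the tag's type. The sketch below is purely illustrative: `REBUILDERS`, `register`, and `recover` are hypothetical names, and only the keys `type`, `steps`, and `target` come from the builder output shown above.

```python
# Hypothetical registry: maps a builder tag's "type" to a rebuild function.
REBUILDERS = {}


def register(type_name):
    """Decorator that records a rebuild function for one tag type."""
    def decorator(func):
        REBUILDERS[type_name] = func
        return func
    return decorator


@register("fitted_pipeline")
def rebuild_pipeline(tag):
    # A real rebuild would restore fitted estimators; this toy returns a summary.
    return f"Pipeline({' -> '.join(tag['steps'])}) predicting {tag['target']!r}"


def recover(tag):
    """Rebuild the domain object described by a builder tag."""
    return REBUILDERS[tag["type"]](tag)


tag = {
    "type": "fitted_pipeline",
    "steps": ("standardscaler", "linearregression"),
    "target": "target",
}
print(recover(tag))  # Pipeline(standardscaler -> linearregression) predicting 'target'
```

The expression's rows play no part in the lookup; the tag alone selects and parameterizes the rebuild.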
Composed — references to catalog entries
A Composed expression carries a catalog tag. When you add an expression to a catalog and then reference it from another expression, the reference is tagged so the catalog can resolve it back to the stored entry instead of inlining the whole subgraph. Composition is a catalog-level concern — see the catalog documentation for the full workflow.
Inspecting expressions with .ls
Every expression has an .ls accessor (the LETSQL accessor) that exposes introspection properties:
| Property | Type | Description |
|---|---|---|
| .ls.kind | ExprKind | Source, Expr, UnboundExpr, ExprBuilder, or Composed |
| .ls.metadata | ExprMetadata | Full metadata (kind, schema_in, schema_out, builders) |
| .ls.unwrapped | ops.Node | Underlying op with any wrapper layers stripped |
| .ls.backends | tuple | All backend connections used in the expression graph |
| .ls.is_multiengine | bool | Whether the expression spans multiple backends (False for memtable and unbound expressions, which have no backend) |
| .ls.is_cached | bool | Whether the root op is a CachedNode |
| .ls.has_cached | bool | Whether any CachedNode exists in the graph |
| .ls.cached_nodes | tuple | All CachedNode operations in the graph |
| .ls.cache | Cache | The cache object if is_cached, else None |
| .ls.uncached | Expr | Expression with all cache nodes removed |
| .ls.tokenized | str | Content hash of the expression |
| .ls.cache_exists() | bool \| None | Whether the cache is materialized (None if not cached) |
import xorq.api as xo
from xorq.caching import ParquetCache
ddb = xo.duckdb.connect()
con = xo.connect()
cache = ParquetCache.from_kwargs(source=con)
expr = (
    xo.examples.batting.fetch(backend=ddb)
    .filter(xo._.yearID > 2010)
    .into_backend(con, "batting_local")
    .cache(cache=cache)
)
print(expr.ls.kind) # ExprKind.Source (cache is a source)
print(expr.ls.is_cached) # True
print(expr.ls.is_multiengine) # True (DuckDB + DataFusion)
print(len(expr.ls.backends)) # 2
print(expr.ls.cache_exists()) # False (not yet executed)
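The relationship between backends and is_multiengine can be modeled as a graph walk that collects distinct backends. This is a toy sketch over plain dicts, with hypothetical `backends` and `is_multiengine` functions that only mimic the accessor names; it makes no claim about how Xorq computes them.

```python
def backends(node):
    """Collect the distinct backend names used anywhere in the graph."""
    found = set()
    if node.get("backend") is not None:
        found.add(node["backend"])
    for child in node.get("children", ()):
        found |= backends(child)
    return found


def is_multiengine(node):
    """True when the graph touches more than one backend."""
    return len(backends(node)) > 1


duck_table = {"op": "DatabaseTable", "backend": "duckdb"}
moved = {"op": "RemoteTable", "backend": "datafusion", "children": [duck_table]}
memtable = {"op": "InMemoryTable"}  # no backend attached

print(sorted(backends(moved)))   # ['datafusion', 'duckdb']
print(is_multiengine(moved))     # True
print(is_multiengine(memtable))  # False
```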
How kind affects the system
The expression kind isn’t just a label — it drives behavior across Xorq:
| Aspect | Source | Expr | UnboundExpr |
|---|---|---|---|
| Hashing | Hash of the data reference (table name, path, etc.) | Hash of the full operation graph | Hash includes UnboundTable schema |
| Serialization | Stored as a data source reference in expr.yaml | Full operation tree serialized | Stored with schema_in for rebinding |
| Caching | Can be the result of a cache | Can be cached (wrapping in CachedNode) | Can’t be cached (no data to materialize) |
| Execution | Read data from backend | Execute operation graph | Must bind to data first |
| Build system | Leaf node in the build graph | Intermediate node | Template, not directly buildable |
The tag-driven kinds (ExprBuilder, Composed) execute, hash, and serialize like the structural expression underneath them; the tag adds the metadata the ML pipeline API and catalog use to recover or resolve the higher-level object.
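The hashing differences between the structural kinds can be illustrated with a small content-hash sketch. Everything here is an assumption for demonstration: the dict-based nodes, the `digest` and `token` helpers, and the use of SHA-256 over JSON are not taken from Xorq.

```python
import hashlib
import json


def digest(payload):
    """Stable SHA-256 hex digest of a JSON-serializable payload."""
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


def token(node):
    op = node["op"]
    if op in ("DatabaseTable", "Read"):
        # Source: hash the data reference only (table name, path, ...).
        return digest({"op": op, "ref": node["ref"]})
    if op == "UnboundTable":
        # Template leaf: the schema is what identifies it.
        return digest({"op": op, "schema": node["schema"]})
    # Expr: hash the op, its arguments, and the tokens of all children.
    return digest(
        {
            "op": op,
            "args": node.get("args"),
            "children": [token(child) for child in node["children"]],
        }
    )


table = {"op": "DatabaseTable", "ref": "batting"}
q1 = {"op": "Filter", "args": "yearID > 2010", "children": [table]}
q2 = {"op": "Filter", "args": "yearID > 2010", "children": [table]}
q3 = {"op": "Filter", "args": "yearID > 2015", "children": [table]}
t_int = {"op": "UnboundTable", "schema": {"a": "int64"}}
t_str = {"op": "UnboundTable", "schema": {"a": "string"}}

print(token(q1) == token(q2))        # True: same graph, same hash
print(token(q1) == token(q3))        # False: different predicate
print(token(t_int) == token(t_str))  # False: different schema
```

Hashing children recursively means a change anywhere in the graph changes the root token, which is the property a content-addressed cache relies on.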