Expression types

Every operation in Xorq produces an expression — a lazy, immutable description of a computation. Before anything executes, Xorq classifies each expression into an ExprKind. This classification drives how expressions are hashed, cached, serialized, and executed across backends.

The expression kinds

Three of the kinds are structural — they fall out of the shape of the expression graph itself:

Kind	Meaning	Has input schema?
Source	A leaf node — raw data with no transformations	No
Expr	Bound data with transformations applied	No
UnboundExpr	A template expression not tied to any backend	Yes

Two more are tag-driven — they’re recognized by domain-specific tags that higher-level tools (the ML pipeline API, the catalog, semantic layers) attach to the outermost node:

Kind	Meaning	Produced by
ExprBuilder	Carries builder metadata that can recover a domain object (for example, a fitted ML pipeline)	`xo.Pipeline` fit/predict, semantic-layer queries
Composed	Carries a catalog tag, marking a reference to another catalog entry	`catalog add` / composition

You can inspect the kind of any expression using the .ls accessor:

import xorq.api as xo

# Source — just data, no transforms
t = xo.memtable({"a": [1, 2, 3], "b": ["x", "y", "z"]})
print(t.ls.kind)  # ExprKind.Source

# Expr — transforms applied to bound data
filtered = t.filter(xo._.a > 1)
print(filtered.ls.kind)  # ExprKind.Expr

# UnboundExpr — schema template, no concrete data
template = xo.table(schema={"a": "int64", "b": "string"})
print(template.ls.kind)  # ExprKind.UnboundExpr

source
expr
unbound_expr

Source expressions

A Source is a leaf node in the expression graph — it represents data without any transformations. Sources are the starting points of every pipeline.

In-memory tables

The simplest source. Data lives in the current process and is wrapped in an InMemoryTable operation.

import xorq.api as xo

t = xo.memtable({"name": ["alice", "bob"], "score": [95, 87]})
print(t.ls.kind)  # ExprKind.Source

source

Database tables

A table registered in a backend (DuckDB, DataFusion, Postgres, etc.). Backed by a DatabaseTable operation that holds a reference to the backend connection.

import xorq.api as xo

con = xo.duckdb.connect()
batting = xo.examples.batting.fetch(backend=con)
print(batting.ls.kind)  # ExprKind.Source

source

Deferred reads

A lazy pointer to a file (Parquet, CSV). The file isn’t read until execution. Backed by a Read operation.

import xorq.api as xo

# A local Parquet file, fetched once via pins
path = xo.options.pins.get_path("batting")

# With an explicit schema the file isn't read until execute()
expr = xo.deferred_read_parquet(
    path,
    schema={
        "playerID": "string",
        "yearID": "int64",
        "teamID": "string",
        "G": "int64",
        "AB": "float64",
        "H": "float64",
    },
)
print(expr.ls.kind)  # ExprKind.Source

source

Cached nodes

When you call .cache() on an expression, Xorq wraps it in a CachedNode. The cache itself is a source — it acts as a materialization boundary that breaks the expression graph into independent segments.

import xorq.api as xo
from xorq.caching import ParquetCache

con = xo.connect()
ddb = xo.duckdb.connect()
cache = ParquetCache.from_kwargs(source=con)

# The cached expression is a Source, even though it wraps transforms
cached = (
    xo.examples.batting.fetch(backend=ddb)
    .filter(xo._.yearID > 2010)
    .cache(cache=cache)
)
print(cached.ls.kind)       # ExprKind.Source
print(cached.ls.is_cached)  # True

source
True

This is an important property: caching resets the expression kind back to Source. Downstream transforms on a cached expression produce a new Expr that depends on the cache, not on the original upstream query.

Remote tables

Created by .into_backend(), a RemoteTable transfers data between backends via Arrow record batches. Like a cache, it acts as a source boundary.

import xorq.api as xo

ddb = xo.duckdb.connect()
con = xo.connect()  # DataFusion

# Move filtered DuckDB data into DataFusion
t = (
    xo.examples.batting.fetch(backend=ddb)
    .filter(xo._.yearID == 2015)
    .into_backend(con, "local_batting")
)
print(t.ls.kind)  # ExprKind.Source

source
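
The boundary behavior described in this section can be pictured with a small stand-alone model. This is a sketch in plain Python, not Xorq's implementation: the `Node` class, the `BOUNDARY_OPS` set, and the rule order in `structural_kind` are assumptions made for illustration, and only the operation names come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    SOURCE = "source"
    EXPR = "expr"
    UNBOUND_EXPR = "unbound_expr"


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for one operation in an expression graph."""
    op: str  # e.g. "DatabaseTable", "Filter", "CachedNode"
    children: "tuple[Node, ...]" = ()


# Assumption: these two operations act as source boundaries.
BOUNDARY_OPS = {"CachedNode", "RemoteTable"}


def has_unbound_leaf(node: Node) -> bool:
    """True if any leaf below (or at) this node is an UnboundTable."""
    if not node.children:
        return node.op == "UnboundTable"
    return any(has_unbound_leaf(child) for child in node.children)


def structural_kind(root: Node) -> Kind:
    # A boundary at the root counts as a source regardless of what it wraps.
    if root.op in BOUNDARY_OPS:
        return Kind.SOURCE
    if has_unbound_leaf(root):
        return Kind.UNBOUND_EXPR
    if not root.children:
        return Kind.SOURCE
    return Kind.EXPR


table = Node("DatabaseTable")
filtered = Node("Filter", (table,))
cached = Node("CachedNode", (filtered,))
downstream = Node("Limit", (cached,))
template = Node("Filter", (Node("UnboundTable"),))

for name, node in [("table", table), ("filtered", filtered),
                   ("cached", cached), ("downstream", downstream),
                   ("template", template)]:
    print(name, structural_kind(node).value)
```

In this model, wrapping `filtered` in a `CachedNode` turns it back into a source, and a transform applied on top of the cache is an ordinary `Expr` again.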

Expr — transformed expressions

An Expr is any expression that applies one or more transformations to bound data. It has at least one source upstream and at least one operation (filter, project, aggregate, join, etc.) on top.

import xorq.api as xo

t = xo.memtable({"a": [1, 2, 3], "b": [10, 20, 30]})

# Each of these produces an Expr
filtered = t.filter(xo._.a > 1)
projected = t.select("a")
aggregated = t.aggregate(total=t.b.sum())
sorted_t = t.order_by("a")
limited = t.limit(10)

# All are ExprKind.Expr
for expr in [filtered, projected, aggregated, sorted_t, limited]:
    print(expr.ls.kind)  # ExprKind.Expr

expr
expr
expr
expr
expr

Transforms compose naturally by chaining. Each method returns a new immutable expression:

import xorq.api as xo

ddb = xo.duckdb.connect()

result = (
    xo.examples.batting.fetch(backend=ddb)
    .filter(xo._.yearID > 2010)
    .select("playerID", "yearID", "teamID", "H", "AB")
    .mutate(avg=xo._.H / xo._.AB)
    .order_by(xo._.avg.desc())
    .limit(20)
)
print(result.ls.kind)  # ExprKind.Expr

expr

Available transform operations

Method	Operation	Description
`.filter()`	Filter	Select rows matching boolean predicates
`.select()`	Project	Choose or compute columns
`.mutate()`	Project	Add or replace columns
`.aggregate()`	Aggregate	Group-by and reduce
`.order_by()`	Sort	Sort rows
`.limit()`	Limit	Take first N rows
`.join()`	JoinChain	Combine tables
`.union()`	Union	Stack rows from two tables
`.distinct()`	Distinct	Deduplicate rows
`.drop()`	DropColumns	Remove columns
`.fill_null()`	FillNull	Replace nulls
`.dropna()`	DropNull	Remove rows with nulls
`.sample()`	Sample	Random subset
`.unnest()`	TableUnnest	Flatten array column

`UnboundExpr` — template expressions

An UnboundExpr is an expression built on an UnboundTable instead of a concrete data source. It defines a transformation as a reusable template — a schema in, a schema out, and the operations in between.

Creating unbound expressions

Use xo.table() with a schema to create a template:

import xorq.api as xo

# Define a template: takes int64 column "a", filters, adds a computed column
template = xo.table(schema={"a": "int64"})
transform = template.filter(template.a > 0).mutate(doubled=template.a * 2)

print(transform.ls.kind)  # ExprKind.UnboundExpr

unbound_expr

Unbinding bound expressions

You can convert any bound expression into an unbound one with .unbind(). This strips the backend connection and replaces DatabaseTable nodes with UnboundTable nodes:

import xorq.api as xo

ddb = xo.duckdb.connect()
bound = xo.examples.batting.fetch(backend=ddb).filter(xo._.yearID > 2010)

unbound = bound.unbind()
print(bound.ls.kind)    # ExprKind.Expr
print(unbound.ls.kind)  # ExprKind.UnboundExpr

# Schema is preserved
assert unbound.schema() == bound.schema()

expr
unbound_expr
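
Conceptually, unbinding is a tree rewrite. The sketch below models it in plain Python under stated assumptions: the `Node` dataclass and the `unbind` function here are hypothetical stand-ins, not Xorq's classes, and the model only covers the DatabaseTable case named above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Node:
    """Hypothetical operation node carrying a schema and an optional backend."""
    op: str
    schema: tuple = ()            # (column, dtype) pairs
    backend: "str | None" = None  # only set on bound leaf tables
    children: "tuple[Node, ...]" = ()


def unbind(node: Node) -> Node:
    """Return a copy of the graph with bound tables swapped for templates."""
    if node.op == "DatabaseTable":
        # Keep the schema, drop the backend reference.
        return Node("UnboundTable", schema=node.schema)
    return replace(node, children=tuple(unbind(c) for c in node.children))


schema = (("playerID", "string"), ("yearID", "int64"))
bound = Node(
    "Filter",
    schema=schema,
    children=(Node("DatabaseTable", schema=schema, backend="duckdb"),),
)
unbound = unbind(bound)

print(unbound.children[0].op)          # leaf is now an UnboundTable
print(unbound.schema == bound.schema)  # schema is preserved
print(bound.children[0].op)            # the original graph is untouched
```

Because the nodes are frozen, the rewrite builds a new graph and leaves the bound expression as it was, which mirrors the immutability of expressions described earlier.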

Why unbound expressions matter

Unbound expressions are the foundation of several Xorq features:

Flight UDXFs: Define a transformation template that runs on an Arrow Flight server. The input data is streamed as record batches, and the unbound expression describes what to do with it.
Serialization: The expression YAML format stores unbound expressions so they can be loaded and bound to a different backend later.
Schema validation: ExprMetadata exposes schema_in (the UnboundTable’s schema) and schema_out (the result schema), so input and output schemas can be checked before anything executes.

The .ls.metadata accessor returns the ExprMetadata for any expression:

import xorq.api as xo

template = xo.table(schema={"a": "int64"})
transform = template.filter(template.a > 0)

metadata = transform.ls.metadata
print(metadata.kind)       # ExprKind.UnboundExpr
print(metadata.schema_in)  # Schema with column "a" (int64)
print(metadata.schema_out) # Schema with column "a" (int64)

unbound_expr
ibis.Schema {
  a  int64
}
ibis.Schema {
  a  int64
}
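
One way to use schema_in for a pre-execution check is to compare it against the schema of the data you intend to bind. The helper below is a hypothetical sketch that works on plain dicts of column name to type name; `schema_mismatches` is not a Xorq function, and treating extra columns in the candidate as acceptable is an assumption.

```python
def schema_mismatches(schema_in: dict, candidate: dict) -> list:
    """List the ways `candidate` fails to satisfy `schema_in`.

    Both arguments map column names to type names, e.g. {"a": "int64"}.
    Extra columns in `candidate` are ignored (an assumption of this sketch).
    """
    problems = []
    for column, dtype in schema_in.items():
        if column not in candidate:
            problems.append(f"missing column {column!r}")
        elif candidate[column] != dtype:
            problems.append(
                f"column {column!r}: expected {dtype}, got {candidate[column]}"
            )
    return problems


template_in = {"a": "int64"}

print(schema_mismatches(template_in, {"a": "int64", "b": "string"}))
print(schema_mismatches(template_in, {"a": "string"}))
print(schema_mismatches(template_in, {"b": "string"}))
```

An empty list means the candidate provides every column the template expects with the expected type.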

Tags: the mechanism behind tag-driven kinds

Both tag-driven kinds come from the same primitive — a tag node wrapped around an expression. A tag is metadata (a name plus arbitrary keyword values) that rides on top of the expression without changing the rows it produces.

Any table exposes two methods to attach one:

Method	Node	Affects content hash?
`.tag(name, **kwargs)`	`Tag`	No — stripped before hashing
`.hashing_tag(name, **kwargs)`	`HashingTag`	Yes — preserved during hashing

import xorq.api as xo

t = xo.memtable({"a": [1, 2, 3]})

tagged = t.tag("my_marker", note="anything")
print(tagged.ls.metadata.root_tag)  # "my_marker"

my_marker

The difference is what happens at hash time. A plain Tag is invisible to the content hash — two expressions that differ only by a Tag cache to the same key. A HashingTag is folded into the hash, so different tag metadata produces distinct hashes (and distinct cache entries):

import xorq.api as xo

t = xo.memtable({"a": [1, 2, 3]})

# Plain Tag: hash unchanged
print(t.ls.tokenized == t.tag("x").ls.tokenized)          # True

# HashingTag: hash changes with the metadata
print(t.ls.tokenized == t.hashing_tag("x").ls.tokenized)  # False

True
False
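
The hashing rule can be illustrated with a toy content hash. This is a sketch of the idea only: the `Node` class, the `tag` / `hashing_tag` helpers, and the SHA-256-over-JSON scheme are assumptions chosen for illustration and say nothing about how Xorq actually computes `.ls.tokenized`.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical single-child operation node."""
    op: str
    args: tuple = ()
    child: "Node | None" = None


def tag(node, name, **kwargs):
    return Node("Tag", (name, tuple(sorted(kwargs.items()))), node)


def hashing_tag(node, name, **kwargs):
    return Node("HashingTag", (name, tuple(sorted(kwargs.items()))), node)


def strip_plain_tags(node):
    """Drop every plain Tag; keep HashingTag nodes in place."""
    if node is None:
        return None
    if node.op == "Tag":
        return strip_plain_tags(node.child)
    return Node(node.op, node.args, strip_plain_tags(node.child))


def to_payload(node):
    if node is None:
        return None
    return [node.op, list(node.args), to_payload(node.child)]


def content_hash(node):
    payload = json.dumps(to_payload(strip_plain_tags(node)))
    return hashlib.sha256(payload.encode()).hexdigest()


t = Node("InMemoryTable", ("a",))

print(content_hash(t) == content_hash(tag(t, "x", note="anything")))
print(content_hash(t) == content_hash(hashing_tag(t, "x")))
print(content_hash(hashing_tag(t, "x", v=1)) == content_hash(hashing_tag(t, "x", v=2)))
```

The first comparison holds because the plain tag is removed before hashing; the other two fail because the HashingTag and its metadata are part of the hashed payload.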

When Xorq classifies an expression, it walks the outermost chain of Tag / HashingTag nodes and looks the tag name up in two registries: catalog tags yield Composed, builder tags (registered by the ML pipeline API, semantic layers, or third-party plugins) yield ExprBuilder. An unrecognized tag is transparent — the expression falls through to its structural kind.
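
The lookup described above can be sketched as a short walk over the outer tag chain. Everything named here is hypothetical: the registry contents (`catalog_entry`, `fitted_pipeline`), the `Node` class, and the choice to return on the first recognized tag are assumptions, and the structural fallback is reduced to a leaf-or-not test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical single-child operation node."""
    op: str
    tag_name: "str | None" = None  # only meaningful for Tag / HashingTag
    child: "Node | None" = None


TAG_OPS = {"Tag", "HashingTag"}

# Assumed registry contents, for illustration only.
CATALOG_TAGS = {"catalog_entry"}
BUILDER_TAGS = {"fitted_pipeline"}


def classify(root: Node) -> str:
    node = root
    # Walk only the outermost run of tag nodes.
    while node is not None and node.op in TAG_OPS:
        if node.tag_name in CATALOG_TAGS:
            return "composed"
        if node.tag_name in BUILDER_TAGS:
            return "expr_builder"
        node = node.child  # unrecognized tag: transparent, keep walking
    # Simplified structural fallback: a leaf is a source, anything else an expr.
    return "source" if node.child is None else "expr"


table = Node("InMemoryTable")
projected = Node("Project", child=table)

print(classify(Node("Tag", "my_marker", table)))
print(classify(Node("HashingTag", "fitted_pipeline", projected)))
print(classify(Node("Tag", "catalog_entry", projected)))
print(classify(Node("Tag", "my_marker", Node("Tag", "fitted_pipeline", table))))
```

An expression tagged only with `my_marker` keeps its structural kind, while a recognized tag anywhere in the outer chain decides the kind.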

`ExprBuilder` — expressions that carry a domain object

An ExprBuilder is an expression whose outermost node carries a builder tag — domain-specific metadata that lets Xorq recover the higher-level object the expression came from. The expression still executes like any other; the tag is extra information riding alongside it.

The most common source is the ML pipeline API. Fitting a pipeline and calling .predict() produces an expression tagged with the fitted pipeline’s steps, features, and target:

import xorq.api as xo
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

train = xo.memtable(
    {
        "feature_0": [1.0, 2.0, 3.0, 4.0],
        "feature_1": [4.0, 5.0, 6.0, 7.0],
        "target": [0.0, 1.0, 0.0, 1.0],
    },
    name="train",
)

sk_pipe = make_pipeline(StandardScaler(), LinearRegression())
fitted = xo.Pipeline.from_instance(sk_pipe).fit(
    train, features=("feature_0", "feature_1"), target="target"
)
predict = fitted.predict(train)

print(predict.ls.kind)  # ExprKind.ExprBuilder

expr_builder

The recovered metadata lives on .ls.metadata.builders:

for builder in predict.ls.metadata.builders:
    print(builder["type"], builder["steps"], "->", builder["target"])

fitted_pipeline ('standardscaler', 'linearregression') -> target

This is what lets the catalog persist a fitted pipeline as an artifact and rebuild it later: the tag describes how to reconstruct the domain object, not just the rows it produces. Semantic-layer queries (boring-semantic-layer) produce ExprBuilder expressions the same way.

`Composed` — references to catalog entries

A Composed expression carries a catalog tag. When you add an expression to a catalog and then reference it from another expression, the reference is tagged so the catalog can resolve it back to the stored entry instead of inlining the whole subgraph. Composition is a catalog-level concern — see the catalog documentation for the full workflow.

Inspecting expressions with `.ls`

Every expression has an .ls accessor (the LETSQL accessor) that exposes introspection properties:

Property	Type	Description
`.ls.kind`	`ExprKind`	Source, Expr, UnboundExpr, ExprBuilder, or Composed
`.ls.metadata`	`ExprMetadata`	Full metadata (kind, schema_in, schema_out, root_tag, builders)
`.ls.unwrapped`	`ops.Node`	Underlying op with any wrapper layers stripped
`.ls.backends`	`tuple`	All backend connections used in the expression graph
`.ls.is_multiengine`	`bool`	Whether the expression spans multiple backends (`False` for memtable and unbound expressions, which have no backend)
`.ls.is_cached`	`bool`	Whether the root op is a CachedNode
`.ls.has_cached`	`bool`	Whether any CachedNode exists in the graph
`.ls.cached_nodes`	`tuple`	All CachedNode operations in the graph
`.ls.cache`	`Cache`	The cache object if `is_cached`, else None
`.ls.uncached`	`Expr`	Expression with all cache nodes removed
`.ls.tokenized`	`str`	Content hash of the expression
`.ls.cache_exists()`	`bool | None`	Whether the cache is materialized (`None` if not cached)

import xorq.api as xo
from xorq.caching import ParquetCache

ddb = xo.duckdb.connect()
con = xo.connect()
cache = ParquetCache.from_kwargs(source=con)

expr = (
    xo.examples.batting.fetch(backend=ddb)
    .filter(xo._.yearID > 2010)
    .into_backend(con, "batting_local")
    .cache(cache=cache)
)

print(expr.ls.kind)            # ExprKind.Source (cache is a source)
print(expr.ls.is_cached)       # True
print(expr.ls.is_multiengine)  # True (DuckDB + DataFusion)
print(len(expr.ls.backends))   # 2
print(expr.ls.cache_exists())  # False (not yet executed)

source
True
True
2
False
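
The difference between the root-level and graph-level cache properties is easy to confuse, so here is a stand-alone model of it. The functions below are hypothetical re-implementations over a toy `Node` class, written to mirror the descriptions in the table; they are not the accessor's code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Node:
    """Hypothetical operation node."""
    op: str
    children: "tuple[Node, ...]" = ()


def walk(node):
    """Yield the node and everything beneath it, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def is_cached(root):
    return root.op == "CachedNode"  # looks at the root only


def has_cached(root):
    return any(n.op == "CachedNode" for n in walk(root))  # looks everywhere


def cached_nodes(root):
    return tuple(n for n in walk(root) if n.op == "CachedNode")


def uncached(root):
    """Rebuild the graph with every CachedNode replaced by what it wraps."""
    if root.op == "CachedNode":
        return uncached(root.children[0])
    return replace(root, children=tuple(uncached(c) for c in root.children))


inner = Node("RemoteTable", (Node("Filter", (Node("DatabaseTable"),)),))
cached = Node("CachedNode", (inner,))
downstream = Node("Limit", (cached,))

print(is_cached(cached), has_cached(cached))
print(is_cached(downstream), has_cached(downstream))
print(len(cached_nodes(downstream)))
print(uncached(downstream) == Node("Limit", (inner,)))
```

A transform applied on top of a cache is not itself cached at the root, but the cache is still present in its graph.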

How kind affects the system

The expression kind isn’t just a label — it drives behavior across Xorq:

Concern	Source	Expr	UnboundExpr
Hashing	Hash of the data reference (table name, path, etc.)	Hash of the full operation graph	Hash includes UnboundTable schema
Serialization	Stored as a data source reference in `expr.yaml`	Full operation tree serialized	Stored with `schema_in` for rebinding
Caching	Can be the result of a cache	Can be cached (wrapped in a CachedNode)	Can’t be cached (no data to materialize)
Execution	Read data from backend	Execute operation graph	Must bind to data first
Build system	Leaf node in the build graph	Intermediate node	Template — not directly buildable

The tag-driven kinds (ExprBuilder, Composed) execute, hash, and serialize like the structural expression underneath them; the tag adds the metadata the ML pipeline API and catalog use to recover or resolve the higher-level object.