Expression types

Every operation in Xorq produces an expression — a lazy, immutable description of a computation. Before anything executes, Xorq classifies each expression into an ExprKind. This classification drives how expressions are hashed, cached, serialized, and executed across backends.

The expression kinds

Three of the kinds are structural — they fall out of the shape of the expression graph itself:

| Kind | Meaning | Has input schema? |
| --- | --- | --- |
| Source | A leaf node: raw data with no transformations | No |
| Expr | Bound data with transformations applied | No |
| UnboundExpr | A template expression not tied to any backend | Yes |

Two more are tag-driven. They are recognized by domain-specific tags that higher-level tools (the ML pipeline API, the catalog, semantic layers) attach to the outermost node:

| Kind | Meaning | Produced by |
| --- | --- | --- |
| ExprBuilder | Carries builder metadata that can recover a domain object (for example, a fitted ML pipeline) | xo.Pipeline fit/predict, semantic-layer queries |
| Composed | Carries a catalog tag, marking a reference to another catalog entry | catalog add / composition |

You can inspect the kind of any expression using the .ls accessor:

import xorq.api as xo

# Source — just data, no transforms
t = xo.memtable({"a": [1, 2, 3], "b": ["x", "y", "z"]})
print(t.ls.kind)  # ExprKind.Source

# Expr — transforms applied to bound data
filtered = t.filter(xo._.a > 1)
print(filtered.ls.kind)  # ExprKind.Expr

# UnboundExpr — schema template, no concrete data
template = xo.table(schema={"a": "int64", "b": "string"})
print(template.ls.kind)  # ExprKind.UnboundExpr
source
expr
unbound_expr

Source expressions

A Source is a leaf node in the expression graph — it represents data without any transformations. Sources are the starting points of every pipeline.

In-memory tables

The simplest source. Data lives in the current process and is wrapped in an InMemoryTable operation.

import xorq.api as xo

t = xo.memtable({"name": ["alice", "bob"], "score": [95, 87]})
print(t.ls.kind)  # ExprKind.Source
source

Database tables

A table registered in a backend (DuckDB, DataFusion, Postgres, etc.). Backed by a DatabaseTable operation that holds a reference to the backend connection.

import xorq.api as xo

con = xo.duckdb.connect()
batting = xo.examples.batting.fetch(backend=con)
print(batting.ls.kind)  # ExprKind.Source
source

Deferred reads

A lazy pointer to a file (Parquet, CSV). The file isn’t read until execution. Backed by a Read operation.

import xorq.api as xo

# A local Parquet file, fetched once via pins
path = xo.options.pins.get_path("batting")

# With an explicit schema the file isn't read until execute()
expr = xo.deferred_read_parquet(
    path,
    schema={
        "playerID": "string",
        "yearID": "int64",
        "teamID": "string",
        "G": "int64",
        "AB": "float64",
        "H": "float64",
    },
)
print(expr.ls.kind)  # ExprKind.Source
source

Cached nodes

When you call .cache() on an expression, Xorq wraps it in a CachedNode. The cache itself is a source — it acts as a materialization boundary that breaks the expression graph into independent segments.

import xorq.api as xo
from xorq.caching import ParquetCache

con = xo.connect()
ddb = xo.duckdb.connect()
cache = ParquetCache.from_kwargs(source=con)

# The cached expression is a Source, even though it wraps transforms
cached = (
    xo.examples.batting.fetch(backend=ddb)
    .filter(xo._.yearID > 2010)
    .cache(cache=cache)
)
print(cached.ls.kind)       # ExprKind.Source
print(cached.ls.is_cached)  # True
source
True

This is an important property: caching resets the expression kind back to Source. Downstream transforms on a cached expression produce a new Expr that depends on the cache, not on the original upstream query.
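
One way to picture this boundary behaviour is a toy classifier over a miniature expression graph. The sketch below is a simplified model for intuition only: `Node`, `structural_kind`, and the `role` strings are hypothetical names and do not mirror Xorq's internals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """A toy expression node (hypothetical, not a Xorq class)."""
    role: str                        # "data", "unbound", "transform", or "cache"
    parent: Optional["Node"] = None  # the single upstream node, if any


def structural_kind(node: Node) -> str:
    """Return a toy kind name for the expression rooted at `node`."""
    # Data leaves and cache boundaries both count as sources.
    if node.role in ("data", "cache"):
        return "source"
    if node.role == "unbound":
        return "unbound_expr"
    # A transform stays unbound if its input is unbound; otherwise it is an expr.
    upstream = structural_kind(node.parent)
    return "unbound_expr" if upstream == "unbound_expr" else "expr"


table = Node("data")
filtered = Node("transform", parent=table)
cached = Node("cache", parent=filtered)         # wraps transforms, yet acts as a source
downstream = Node("transform", parent=cached)   # a new expr that depends on the cache

kinds = [structural_kind(n) for n in (table, filtered, cached, downstream)]
print(kinds)  # ['source', 'expr', 'source', 'expr']
```

In this model the classifier never looks past a cache node, which is the sense in which a cache splits the graph into independent segments.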

Remote tables

Created by .into_backend(), a RemoteTable transfers data between backends via Arrow record batches. Like a cache, it acts as a source boundary.

import xorq.api as xo

ddb = xo.duckdb.connect()
con = xo.connect()  # DataFusion

# Move filtered DuckDB data into DataFusion
t = (
    xo.examples.batting.fetch(backend=ddb)
    .filter(xo._.yearID == 2015)
    .into_backend(con, "local_batting")
)
print(t.ls.kind)  # ExprKind.Source
source

Expr — transformed expressions

An Expr is any expression that applies one or more transformations to bound data. It has at least one source upstream and at least one operation (filter, project, aggregate, join, etc.) on top.

import xorq.api as xo

t = xo.memtable({"a": [1, 2, 3], "b": [10, 20, 30]})

# Each of these produces an Expr
filtered = t.filter(xo._.a > 1)
projected = t.select("a")
aggregated = t.aggregate(total=t.b.sum())
sorted_t = t.order_by("a")
limited = t.limit(10)

# All are ExprKind.Expr
for expr in [filtered, projected, aggregated, sorted_t, limited]:
    print(expr.ls.kind)  # ExprKind.Expr
expr
expr
expr
expr
expr

Transforms compose naturally by chaining. Each method returns a new immutable expression:

import xorq.api as xo

ddb = xo.duckdb.connect()

result = (
    xo.examples.batting.fetch(backend=ddb)
    .filter(xo._.yearID > 2010)
    .select("playerID", "yearID", "teamID", "H", "AB")
    .mutate(avg=xo._.H / xo._.AB)
    .order_by(xo._.avg.desc())
    .limit(20)
)
print(result.ls.kind)  # ExprKind.Expr
expr

Available transform operations

| Method | Operation | Description |
| --- | --- | --- |
| .filter() | Filter | Select rows matching boolean predicates |
| .select() | Project | Choose or compute columns |
| .mutate() | Project | Add or replace columns |
| .aggregate() | Aggregate | Group-by and reduce |
| .order_by() | Sort | Sort rows |
| .limit() | Limit | Take first N rows |
| .join() | JoinChain | Combine tables |
| .union() | Union | Stack rows from two tables |
| .distinct() | Distinct | Deduplicate rows |
| .drop() | DropColumns | Remove columns |
| .fill_null() | FillNull | Replace nulls |
| .dropna() | DropNull | Remove rows with nulls |
| .sample() | Sample | Random subset |
| .unnest() | TableUnnest | Flatten array column |

UnboundExpr — template expressions

An UnboundExpr is an expression built on an UnboundTable instead of a concrete data source. It defines a transformation as a reusable template — a schema in, a schema out, and the operations in between.

Creating unbound expressions

Use xo.table() with a schema to create a template:

import xorq.api as xo

# Define a template: takes int64 column "a", filters, adds a computed column
template = xo.table(schema={"a": "int64"})
transform = template.filter(template.a > 0).mutate(doubled=template.a * 2)

print(transform.ls.kind)  # ExprKind.UnboundExpr
unbound_expr

Unbinding bound expressions

You can convert any bound expression into an unbound one with .unbind(). This strips the backend connection and replaces DatabaseTable nodes with UnboundTable nodes:

import xorq.api as xo

ddb = xo.duckdb.connect()
bound = xo.examples.batting.fetch(backend=ddb).filter(xo._.yearID > 2010)

unbound = bound.unbind()
print(bound.ls.kind)    # ExprKind.Expr
print(unbound.ls.kind)  # ExprKind.UnboundExpr

# Schema is preserved
assert unbound.schema() == bound.schema()
expr
unbound_expr

Why unbound expressions matter

Unbound expressions are the foundation of several Xorq features:

  • Flight UDXFs: Define a transformation template that runs on an Arrow Flight server. The input data is streamed as record batches, and the unbound expression describes what to do with it.
  • Serialization: The expression YAML format stores unbound expressions so they can be loaded and bound to a different backend later.
  • Schema validation: ExprMetadata exposes schema_in (the UnboundTable’s schema) and schema_out (the result schema), enabling compile-time checks before execution.

The .ls.metadata accessor returns the ExprMetadata for any expression:

import xorq.api as xo

template = xo.table(schema={"a": "int64"})
transform = template.filter(template.a > 0)

metadata = transform.ls.metadata
print(metadata.kind)       # ExprKind.UnboundExpr
print(metadata.schema_in)  # Schema with column "a" (int64)
print(metadata.schema_out) # Schema with column "a" (int64)
unbound_expr
ibis.Schema {
  a  int64
}
ibis.Schema {
  a  int64
}
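
The schema-validation idea can be sketched without any backend: compare the template's input schema against the schema of the data you intend to bind. The sketch below uses plain dictionaries in place of Schema objects, and `check_bindable` is a hypothetical helper, not part of Xorq.

```python
def check_bindable(schema_in: dict, actual: dict) -> list:
    """Return the problems that would prevent binding; an empty list means OK."""
    problems = []
    for column, dtype in schema_in.items():
        if column not in actual:
            problems.append(f"missing column {column!r}")
        elif actual[column] != dtype:
            problems.append(
                f"column {column!r}: expected {dtype}, got {actual[column]}"
            )
    return problems


# Stand-in for transform.ls.metadata.schema_in from the example above
schema_in = {"a": "int64"}

ok = check_bindable(schema_in, {"a": "int64", "b": "string"})
bad = check_bindable(schema_in, {"a": "string"})

print(ok)   # []
print(bad)  # ["column 'a': expected int64, got string"]
```

Extra columns in the candidate data are allowed here; only the columns the template declares are checked. Whether a real binding step is that permissive is an assumption of this sketch.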

Tags: the mechanism behind tag-driven kinds

Both tag-driven kinds come from the same primitive — a tag node wrapped around an expression. A tag is metadata (a name plus arbitrary keyword values) that rides on top of the expression without changing the rows it produces.

Any table exposes two methods to attach one:

| Method | Node | Affects content hash? |
| --- | --- | --- |
| .tag(name, **kwargs) | Tag | No, stripped before hashing |
| .hashing_tag(name, **kwargs) | HashingTag | Yes, preserved during hashing |

import xorq.api as xo

t = xo.memtable({"a": [1, 2, 3]})

tagged = t.tag("my_marker", note="anything")
print(tagged.ls.metadata.root_tag)  # "my_marker"
my_marker

The difference is what happens at hash time. A plain Tag is invisible to the content hash — two expressions that differ only by a Tag cache to the same key. A HashingTag is folded into the hash, so different tag metadata produces distinct hashes (and distinct cache entries):

import xorq.api as xo

t = xo.memtable({"a": [1, 2, 3]})

# Plain Tag: hash unchanged
print(t.ls.tokenized == t.tag("x").ls.tokenized)          # True

# HashingTag: hash changes with the metadata
print(t.ls.tokenized == t.hashing_tag("x").ls.tokenized)  # False
True
False
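
A toy model makes the distinction concrete. The sketch below hashes a nested-tuple stand-in for an expression, skipping plain tags while keeping hashing tags. `content_hash` and the tuple encoding are hypothetical and far simpler than Xorq's real tokenization.

```python
import hashlib


def content_hash(node) -> str:
    """Hash a toy expression encoded as (kind, payload, child)."""
    kind, payload, child = node
    if kind == "tag":
        # Plain tags are stripped: hash the wrapped expression instead.
        return content_hash(child)
    child_hash = content_hash(child) if child is not None else ""
    raw = f"{kind}|{payload}|{child_hash}".encode()
    return hashlib.sha256(raw).hexdigest()


table = ("table", "a:int64", None)
plain = ("tag", "x", table)            # invisible to the hash
hashing = ("hashing_tag", "x", table)  # folded into the hash
other = ("hashing_tag", "y", table)    # different metadata, different hash

print(content_hash(plain) == content_hash(table))    # True
print(content_hash(hashing) == content_hash(table))  # False
print(content_hash(hashing) == content_hash(other))  # False
```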

When Xorq classifies an expression, it walks the outermost chain of Tag / HashingTag nodes and looks the tag name up in two registries: catalog tags yield Composed, builder tags (registered by the ML pipeline API, semantic layers, or third-party plugins) yield ExprBuilder. An unrecognized tag is transparent — the expression falls through to its structural kind.
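
That lookup can be sketched as a small function. This is a toy model under stated assumptions: the registry contents, the tag names `catalog_entry` and `fitted_pipeline`, and the `classify_tagged` helper are illustrative only. The rule that the first recognized tag wins is also an assumption, and the real precedence may differ.

```python
# Hypothetical registries; Xorq's real tag names may differ.
CATALOG_TAGS = {"catalog_entry"}
BUILDER_TAGS = {"fitted_pipeline"}


def classify_tagged(tag_chain, structural_kind):
    """Classify an expression from its outermost tags inward.

    tag_chain: tag names, outermost first.
    structural_kind: the kind of the untagged expression underneath.
    Assumption: the first recognized tag decides the kind.
    """
    for name in tag_chain:
        if name in CATALOG_TAGS:
            return "composed"
        if name in BUILDER_TAGS:
            return "expr_builder"
        # Unrecognized tags are transparent: keep walking inward.
    return structural_kind


print(classify_tagged(["fitted_pipeline"], "expr"))               # expr_builder
print(classify_tagged(["catalog_entry"], "expr"))                 # composed
print(classify_tagged(["my_marker"], "source"))                   # source
print(classify_tagged(["my_marker", "fitted_pipeline"], "expr"))  # expr_builder
```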

ExprBuilder — expressions that carry a domain object

An ExprBuilder is an expression whose outermost node carries a builder tag — domain-specific metadata that lets Xorq recover the higher-level object the expression came from. The expression still executes like any other; the tag is extra information riding alongside it.

The most common source is the ML pipeline API. Fitting a pipeline and calling .predict() produces an expression tagged with the fitted pipeline’s steps, features, and target:

import xorq.api as xo
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

train = xo.memtable(
    {
        "feature_0": [1.0, 2.0, 3.0, 4.0],
        "feature_1": [4.0, 5.0, 6.0, 7.0],
        "target": [0.0, 1.0, 0.0, 1.0],
    },
    name="train",
)

sk_pipe = make_pipeline(StandardScaler(), LinearRegression())
fitted = xo.Pipeline.from_instance(sk_pipe).fit(
    train, features=("feature_0", "feature_1"), target="target"
)
predict = fitted.predict(train)

print(predict.ls.kind)  # ExprKind.ExprBuilder
expr_builder

The recovered metadata lives on .ls.metadata.builders:

for builder in predict.ls.metadata.builders:
    print(builder["type"], builder["steps"], "->", builder["target"])
fitted_pipeline ('standardscaler', 'linearregression') -> target

This is what lets the catalog persist a fitted pipeline as an artifact and rebuild it later: the tag describes how to reconstruct the domain object, not just the rows it produces. Semantic-layer queries (boring-semantic-layer) produce ExprBuilder expressions the same way.

Composed — references to catalog entries

A Composed expression carries a catalog tag. When you add an expression to a catalog and then reference it from another expression, the reference is tagged so the catalog can resolve it back to the stored entry instead of inlining the whole subgraph. Composition is a catalog-level concern — see the catalog documentation for the full workflow.

Inspecting expressions with .ls

Every expression has an .ls accessor (the LETSQL accessor) that exposes introspection properties:

| Property | Type | Description |
| --- | --- | --- |
| .ls.kind | ExprKind | Source, Expr, UnboundExpr, ExprBuilder, or Composed |
| .ls.metadata | ExprMetadata | Full metadata (kind, schema_in, schema_out, root_tag, builders) |
| .ls.unwrapped | ops.Node | Underlying op with any wrapper layers stripped |
| .ls.backends | tuple | All backend connections used in the expression graph |
| .ls.is_multiengine | bool | Whether the expression spans multiple backends (False for memtable or unbound expressions, which have no backend) |
| .ls.is_cached | bool | Whether the root op is a CachedNode |
| .ls.has_cached | bool | Whether any CachedNode exists in the graph |
| .ls.cached_nodes | tuple | All CachedNode operations in the graph |
| .ls.cache | Cache | The cache object if is_cached, else None |
| .ls.uncached | Expr | Expression with all cache nodes removed |
| .ls.tokenized | str | Content hash of the expression |
| .ls.cache_exists() | bool or None | Whether the cache is materialized (None if not cached) |

import xorq.api as xo
from xorq.caching import ParquetCache

ddb = xo.duckdb.connect()
con = xo.connect()
cache = ParquetCache.from_kwargs(source=con)

expr = (
    xo.examples.batting.fetch(backend=ddb)
    .filter(xo._.yearID > 2010)
    .into_backend(con, "batting_local")
    .cache(cache=cache)
)

print(expr.ls.kind)            # ExprKind.Source (cache is a source)
print(expr.ls.is_cached)       # True
print(expr.ls.is_multiengine)  # True (DuckDB + DataFusion)
print(len(expr.ls.backends))   # 2
print(expr.ls.cache_exists())  # False (not yet executed)
source
True
True
2
False

How kind affects the system

The expression kind isn’t just a label — it drives behavior across Xorq:

| Concern | Source | Expr | UnboundExpr |
| --- | --- | --- | --- |
| Hashing | Hash of the data reference (table name, path, etc.) | Hash of the full operation graph | Hash includes UnboundTable schema |
| Serialization | Stored as a data source reference in expr.yaml | Full operation tree serialized | Stored with schema_in for rebinding |
| Caching | Can be the result of a cache | Can be cached (wrapping in CachedNode) | Can’t be cached (no data to materialize) |
| Execution | Read data from backend | Execute operation graph | Must bind to data first |
| Build system | Leaf node in the build graph | Intermediate node | Template, not directly buildable |

The tag-driven kinds (ExprBuilder, Composed) execute, hash, and serialize like the structural expression underneath them; the tag adds the metadata the ML pipeline API and catalog use to recover or resolve the higher-level object.