Quickstart
Installation
Xorq can be installed using pip:
pip install xorq
Or using nix to drop into an IPython shell:
nix run github:xorq-labs/xorq
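To confirm the install, you can check the installed version from Python using only the standard library (a quick sanity check):

from importlib.metadata import version

print(version("xorq"))  # prints the installed xorq package version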
Quick Start: 4 Steps to Your First Pipeline
Step 1: Initialize a Project
The fastest way to get started with Xorq is to use the xorq init command:
xorq init -t penguins -p penguins_example
cd penguins_example
This creates a complete ML pipeline example built on the Palmer Penguins dataset, demonstrating key Xorq features: machine learning, caching, and lineage tracking.
Step 2: Build Your Expression
Convert your pipeline into a serialized, executable format:
xorq build expr.py
Output:
Building expr from expr.py
Written 'expr' to builds/7061dd65ff3c
builds/7061dd65ff3c
Step 3: Run Your Pipeline
Execute your built pipeline:
# Run and see results
xorq run builds/7061dd65ff3c
# Save to file
xorq run builds/7061dd65ff3c -o predictions.parquet
# Run with limit for testing
xorq run builds/7061dd65ff3c --limit 10
Step 4: Serve Your Pipelines
To serve your pipeline as an endpoint, you can use the xorq serve-unbound command:
xorq serve-unbound builds/7061dd65ff3c --host localhost --port 8001 --cache-dir penguins_example b2370a29c19df8e1e639c63252dacd0e
# This replaces a specific node hash with an exchanger input and serves the unbound expr as a do_exchange endpoint
That’s it! You’ve built and run your first Xorq ML pipeline.
Understanding the Generated Pipeline
The template creates an expr.py file that demonstrates a complete ML workflow. Let's walk through the key components:
1. Data Loading and Preparation
import sklearn
from sklearn.linear_model import LogisticRegression
import xorq as xo
from xorq.caching import ParquetStorage
from xorq.expr.ml.pipeline_lib import Pipeline
= ("bill_length_mm", "bill_depth_mm")
features = "species"
target = "https://storage.googleapis.com/letsql-pins/penguins/20250703T145709Z-c3cde/penguins.parquet" data_url
2. Data Splitting
def gen_splits(expr, test_size=0.2, random_seed=42, **split_kwargs):
    row_number = "row_number"
    yield from (
        expr.drop(row_number)
        for expr in xo.train_test_splits(
            expr.mutate(**{row_number: xo.row_number()}),
            unique_key=row_number,
            test_sizes=test_size,
            random_seed=random_seed,
            **split_kwargs,
        )
    )


def get_penguins_splits(storage=None, **split_kwargs):
    t = (
        xo.deferred_read_parquet(
            con=xo.duckdb.connect(),
            path=data_url,
            table_name="t",
        )
        .select(features + (target,))
        .drop_null()
    )
    (train, test) = (
        expr.cache(storage or ParquetStorage())
        for expr in gen_splits(t, **split_kwargs)
    )
    return (train, test)
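Because each split is wrapped in expr.cache(...), the read and split work is materialized once and reused on later runs. A usage sketch, assuming the definitions above are in scope:

# First call materializes the splits; repeated calls hit the Parquet cache
train, test = get_penguins_splits(storage=ParquetStorage())
print(train.execute().shape, test.execute().shape)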
3. Deferred ML Pipeline
The key is converting scikit-learn pipelines to deferred expressions:
# Configure hyperparameters
params = {"logistic__C": 1e-4}

# Get train/test splits (still deferred!)
(train, test) = get_penguins_splits()

# Create and convert pipeline
sklearn_pipeline = make_pipeline(params=params)
xorq_pipeline = Pipeline.from_instance(sklearn_pipeline)

# Create fitted pipeline expression (no computation yet!)
fitted_pipeline = xorq_pipeline.fit(train, features=features, target=target)

# Get prediction expression
expr = test_predicted = fitted_pipeline.predict(test[list(features)])
fitted_pipeline is still just an expression: no actual training has happened yet. The computation is deferred until you call .execute() or run via the CLI.
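The make_pipeline helper referenced above lives in the template's expr.py and is not shown here. A minimal sketch of what such a helper might look like, assuming a two-step scikit-learn pipeline whose second step is named logistic so that the logistic__C key applies:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline as SklearnPipeline  # aliased to avoid clashing with xorq's Pipeline
from sklearn.preprocessing import StandardScaler

def make_pipeline(params=None):
    # Step name "logistic" matches the "logistic__C" hyperparameter key
    pipeline = SklearnPipeline(
        [("scaler", StandardScaler()), ("logistic", LogisticRegression())]
    )
    if params:
        pipeline.set_params(**params)
    return pipeline

Pipeline.from_instance then wraps whatever scikit-learn pipeline you hand it, so the exact steps are up to you.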
CLI Commands Deep Dive
Build Options
# Basic build
xorq build expr.py
# Build with specific expression name
xorq build expr.py -e my_expr
# Build with profile
xorq build expr.py --profile production
Run Options
# Different output formats
xorq run builds/HASH -o results.csv --format csv
xorq run builds/HASH -o results.json --format json
xorq run builds/HASH -o results.parquet --format parquet
# Control output size
xorq run builds/HASH --limit 100
Inspecting Builds
# View build contents
ls builds/7061dd65ff3c/
# Shows: expr.yaml, *.sql, metadata.json, deferred_reads.yaml
# Check expression definition
cat builds/7061dd65ff3c/expr.yaml
# View generated SQL
cat builds/7061dd65ff3c/*.sql
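The same inspection can be done programmatically with the standard library:

from pathlib import Path

build_dir = Path("builds/7061dd65ff3c")
for artifact in sorted(build_dir.iterdir()):
    print(artifact.name)  # expr.yaml, *.sql, metadata.json, deferred_reads.yaml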
Serving Pipelines as Endpoints
Basic Catalog Server with Arrow Flight
Start a Flight server:
# Start server
xorq serve --port 8001
Then connect and execute expressions:
import xorq as xo

# Connect to Flight server
client = xo.flight.client.FlightClient(port=8001)

# Create a simple expression
data_url = "https://storage.googleapis.com/letsql-pins/penguins/20250703T145709Z-c3cde/penguins.parquet"

expr = (
    xo.deferred_read_parquet(
        con=xo.duckdb.connect(),
        path=data_url,
        table_name="penguins",
    )
    .select("bill_length_mm", "bill_depth_mm", "species")  # Match schema order
    .drop_null()
    .limit(5)
)

print("Executing via Flight do_exchange...")
fut, rbr = client.do_exchange("default", expr)
result_df = rbr.read_pandas()
print(result_df)
Serving Built Pipelines with serve-unbound
For deployments, you can serve a specific built expression as an endpoint using serve-unbound
. This allows you to expose a particular expression as a Catalog service:
xorq serve-unbound builds/7061dd65ff3c --host localhost --port 8001 --cache-dir penguins_example b2370a29c19df8e1e639c63252dacd0e
Understanding the command:
- builds/7061dd65ff3c: your built pipeline directory
- --host localhost --port 8001: server configuration
- --cache-dir penguins_example: directory for caching results
- b2370a29c19df8e1e639c63252dacd0e: the specific node hash to serve
Finding the Node Hash
The node hash (like b2370a29c19df8e1e639c63252dacd0e) identifies a specific expression node in your pipeline. You can find this hash using:
import sys

import dask

sys.path.append('penguins_example')
from expr import expr  # or your specific expression

# Get the hash for any expression
node_hash = dask.base.tokenize(expr)
print(f"Node hash: {node_hash}")
This hash represents the unique identity of your expression, including its computation graph and dependencies. When you serve this specific node, clients can query exactly that expression endpoint.
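Because dask.base.tokenize is deterministic, the same expression always yields the same hash; a quick illustration with plain Python objects:

import dask

# The same input always tokenizes to the same hash
assert dask.base.tokenize([1, 2, 3]) == dask.base.tokenize([1, 2, 3])
# Different inputs tokenize differently
assert dask.base.tokenize([1, 2, 3]) != dask.base.tokenize([3, 2, 1])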
Exploring Pipeline Lineage
One of Xorq’s most powerful features is automatic lineage tracking:
from xorq.common.utils.lineage_utils import build_column_trees, print_tree

# Visualize complete lineage
print_tree(build_column_trees(expr)['predicted'])
This shows the complete computational graph from raw data to predictions, including data loading, splitting, caching, and model execution.
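The indexing above suggests build_column_trees returns a mapping from output column to lineage tree, so you can also walk every column; a sketch under that assumption:

trees = build_column_trees(expr)
for column, tree in trees.items():
    print(f"=== {column} ===")
    print_tree(tree)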
Next Steps
Common Patterns
# Development cycle
xorq init -t penguins -p my_project
cd my_project
# Edit expr.py
xorq build expr.py
xorq run builds/HASH --limit 10 # test
xorq run builds/HASH -o final_results.parquet # production
# Batch processing
for file in data/*.csv; do
  xorq run builds/HASH --input "$file" -o "results/$(basename "$file" .csv).parquet"
done
# API serving
xorq serve-unbound builds/HASH --host 0.0.0.0 --port 8001 --cache-dir cache NODE_HASH
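If you prefer to orchestrate batch runs from Python instead of the shell, the same loop can be written with subprocess (a sketch reusing the CLI flags shown above):

import subprocess
from pathlib import Path

for csv_path in sorted(Path("data").glob("*.csv")):
    out = Path("results") / f"{csv_path.stem}.parquet"
    # Equivalent to the bash loop in the batch-processing pattern
    subprocess.run(
        ["xorq", "run", "builds/HASH", "--input", str(csv_path), "-o", str(out)],
        check=True,
    )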