Quickstart

Installation

Xorq can be installed using pip:

pip install xorq

Or using nix to drop into an IPython shell:

nix run github:xorq-labs/xorq

Quick Start: 4 Steps to Your First Pipeline

Step 1: Initialize a Project

The fastest way to get started with Xorq is to use the xorq init command:

xorq init -t penguins -p penguins_example
cd penguins_example

This creates a complete ML pipeline example built around the Palmer Penguins dataset, demonstrating key Xorq features including machine learning, caching, and lineage tracking.

Step 2: Build Your Expression

Convert your pipeline into a serialized, executable format:

xorq build expr.py

Output:

Building expr from expr.py
Written 'expr' to builds/7061dd65ff3c
builds/7061dd65ff3c

Step 3: Run Your Pipeline

Execute your built pipeline:

# Run and see results
xorq run builds/7061dd65ff3c

# Save to file
xorq run builds/7061dd65ff3c -o predictions.parquet

# Run with limit for testing
xorq run builds/7061dd65ff3c --limit 10

Step 4: Serve Your Pipelines

To serve your pipeline as an endpoint, you can use the xorq serve-unbound command:

xorq serve-unbound builds/7061dd65ff3c --host localhost --port 8001 --cache-dir penguins_example b2370a29c19df8e1e639c63252dacd0e
# Replaces the given node hash with an exchanger input and serves the unbound expression via Flight do_exchange

That’s it! You’ve built and run your first Xorq ML pipeline.

Understanding the Generated Pipeline

The template creates an expr.py file that demonstrates a complete ML workflow. Let’s walk through the key components:

1. Data Loading and Preparation

import sklearn
from sklearn.linear_model import LogisticRegression
import xorq as xo
from xorq.caching import ParquetStorage
from xorq.expr.ml.pipeline_lib import Pipeline

features = ("bill_length_mm", "bill_depth_mm")
target = "species"
data_url = "https://storage.googleapis.com/letsql-pins/penguins/20250703T145709Z-c3cde/penguins.parquet"

2. Data Splitting

def gen_splits(expr, test_size=0.2, random_seed=42, **split_kwargs):
    row_number = "row_number"
    yield from (
        expr.drop(row_number)
        for expr in xo.train_test_splits(
            expr.mutate(**{row_number: xo.row_number()}),
            unique_key=row_number,
            test_sizes=test_size,
            random_seed=random_seed,
            **split_kwargs,
        )
    )

def get_penguins_splits(storage=None, **split_kwargs):
    t = (
        xo.deferred_read_parquet(
            con=xo.duckdb.connect(),
            path=data_url,
            table_name="t",
        )
        .select(features + (target,))
        .drop_null()
    )
    (train, test) = (
        expr.cache(storage or ParquetStorage())
        for expr in gen_splits(t, **split_kwargs)
    )
    return (train, test)
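
If you want to sanity-check the splits interactively, here is a minimal sketch; it assumes you run it from the directory containing the generated expr.py so the template's helpers are importable:

from expr import get_penguins_splits  # helper defined in the template's expr.py

# Both splits are still deferred expressions; nothing has been read or cached yet.
(train, test) = get_penguins_splits()

# Executing a small preview triggers the deferred read, split, and cache.
print(train.limit(5).execute())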

3. Deferred ML Pipeline

The key is converting scikit-learn pipelines to deferred expressions:

# Configure hyperparameters
params = {"logistic__C": 1e-4}

# Get train/test splits (still deferred!)
(train, test) = get_penguins_splits()

# Create and convert the pipeline; make_pipeline here is a helper defined
# earlier in the template's expr.py (not sklearn.pipeline.make_pipeline)
sklearn_pipeline = make_pipeline(params=params)
xorq_pipeline = Pipeline.from_instance(sklearn_pipeline)

# Create fitted pipeline expression (no computation yet!)
fitted_pipeline = xorq_pipeline.fit(train, features=features, target=target)

# Get the prediction expression; naming it `expr` lets `xorq build` find it by default
expr = test_predicted = fitted_pipeline.predict(test[list(features)])

fitted_pipeline is still just an expression: no actual training has happened yet. The computation is deferred until you call .execute() or run the built pipeline via the CLI.
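
To trigger the computation from Python rather than the CLI, execute the prediction expression (for example at the bottom of expr.py, or after importing it in a session):

# Executing the prediction expression fits the pipeline on the train split
# and scores the test split; this is where the work actually happens.
predictions = expr.execute()
print(predictions.head())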

CLI Commands Deep Dive

Build Options

# Basic build
xorq build expr.py

# Build with specific expression name
xorq build expr.py -e my_expr

# Build with profile
xorq build expr.py --profile production

Run Options

# Different output formats
xorq run builds/HASH -o results.csv --format csv
xorq run builds/HASH -o results.json --format json
xorq run builds/HASH -o results.parquet --format parquet

# Control output size
xorq run builds/HASH --limit 100

Inspecting Builds

# View build contents
ls builds/7061dd65ff3c/
# Shows: expr.yaml, *.sql, metadata.json, deferred_reads.yaml

# Check expression definition
cat builds/7061dd65ff3c/expr.yaml

# View generated SQL
cat builds/7061dd65ff3c/*.sql

Serving Pipelines as Endpoints

Basic Catalog Server with Arrow Flight

Start a Flight server:

# Start server
xorq serve --port 8001

Then connect and execute expressions:

import xorq as xo

# Connect to Flight server
client = xo.flight.client.FlightClient(port=8001)

# Create a simple expression
data_url = "https://storage.googleapis.com/letsql-pins/penguins/20250703T145709Z-c3cde/penguins.parquet"

expr = (
    xo.deferred_read_parquet(
        con=xo.duckdb.connect(),
        path=data_url,
        table_name="penguins",
    )
    .select("bill_length_mm", "bill_depth_mm", "species")  # Match schema order
    .drop_null()
    .limit(5)
)

print("Executing via Flight do_exchange...")
fut, rbr = client.do_exchange("default", expr)
result_df = rbr.read_pandas()
print(result_df)

Serving Built Pipelines with serve-unbound

For deployments, you can serve a specific built expression as an endpoint using serve-unbound. This allows you to expose a particular expression as a Catalog service:

xorq serve-unbound builds/7061dd65ff3c --host localhost --port 8001 --cache-dir penguins_example b2370a29c19df8e1e639c63252dacd0e

Understanding the command:

  • builds/7061dd65ff3c: Your built pipeline directory
  • --host localhost --port 8001: Server configuration
  • --cache-dir penguins_example: Directory for caching results
  • b2370a29c19df8e1e639c63252dacd0e: The specific node hash to serve

Finding the Node Hash

The node hash (like b2370a29c19df8e1e639c63252dacd0e) identifies a specific expression node in your pipeline. You can find this hash using:

import dask
import sys
sys.path.append('penguins_example')
from expr import expr  # or your specific expression

# Get the hash for any expression
node_hash = dask.base.tokenize(expr)
print(f"Node hash: {node_hash}")

This hash represents the unique identity of your expression, including its computation graph and dependencies. When you serve this specific node, clients can query exactly that expression endpoint.
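
Because the hash is derived from the expression graph itself, an unchanged expression always tokenizes to the same value, while any change to the graph produces a new hash. Continuing from the snippet above:

# The same expression graph always produces the same token...
assert dask.base.tokenize(expr) == dask.base.tokenize(expr)

# ...while a modified graph (here, with an added limit) produces a different one.
print(dask.base.tokenize(expr))
print(dask.base.tokenize(expr.limit(10)))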

Exploring Pipeline Lineage

One of Xorq’s most powerful features is automatic lineage tracking:

from xorq.common.utils.lineage_utils import build_column_trees, print_tree

# Visualize complete lineage
print_tree(build_column_trees(expr)['predicted'])

This shows the complete computational graph from raw data to predictions, including data loading, splitting, caching, and model execution.
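
build_column_trees appears to return a mapping from output column name to its lineage tree (as the ['predicted'] lookup above suggests), so you can also walk the lineage of every column at once:

from xorq.common.utils.lineage_utils import build_column_trees, print_tree

# Print one lineage tree per output column of the expression.
for column, tree in build_column_trees(expr).items():
    print(f"Lineage for {column}:")
    print_tree(tree)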

Next Steps

  • Experiment: Modify expr.py and rebuild to see changes
  • Learn more: Check Profiles for backend configuration
  • Get help: Join our community for support

Common Patterns

# Development cycle
xorq init -t penguins -p my_project
cd my_project
# Edit expr.py
xorq build expr.py
xorq run builds/HASH --limit 10  # test
xorq run builds/HASH -o final_results.parquet  # production

# Batch processing
for file in data/*.csv; do
  xorq run builds/HASH --input "$file" -o "results/$(basename "$file" .csv).parquet"
done

# API serving
xorq serve-unbound builds/HASH --host 0.0.0.0 --port 8001 --cache-dir cache NODE_HASH