Train your first model
This tutorial walks you through training your first classification model with Xorq. You’ll use the Iris dataset to build a flower species classifier and see how deferred execution works in action.
After completing this tutorial, you’ll know how to wrap scikit-learn pipelines with Xorq and make predictions using deferred execution.
Prerequisites
You need:
- Xorq installed: pip install "xorq[examples]"
- Basic familiarity with scikit-learn
How to follow along
This tutorial builds code incrementally. Each section adds to the same Python file (train_classifier.py). You can:
- Python interactive shell (recommended): Open a terminal, run python, then copy and paste each code block sequentially
- Run as a script: Add each code block to train_classifier.py, then run python train_classifier.py after each section
- Jupyter notebook: Create a new notebook and run each code block in a separate cell
The code blocks build on each other. Variables like iris, xorq_pipeline, and fitted_pipeline are created in earlier blocks and used in later ones.
When you call .fit() in Xorq, training doesn’t happen immediately. Xorq builds a computation graph. Training only runs when you call .execute(). This lets Xorq cache trained models and reuse them across runs.
Load data and define your target
Here’s what you need to know: you’ll load the Iris dataset and separate features from the target.
# train_classifier.py
import xorq.api as xo
iris = xo.examples.iris.fetch()
target = "species"
features = tuple(iris.drop(target).schema())
print(f"Loaded {iris.count().execute()} rows")
print(f"Target: {target}")
print(f"Features: {len(features)} columns")- 1
- Load the Iris dataset. This returns an expression, not actual data.
- 2
- Define what you’re predicting (target) and what the model uses (features).
- 3
- Check what you loaded.
You’ll see:
Loaded 150 rows
Target: species
Features: 4 columns
What just happened? The iris variable is an expression. It’s a description of data, not the data itself. Nothing executes until you call .execute(). This is deferred execution in action.
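To see that boundary yourself, split the row count from the check above into two steps. This is an optional aside that reuses only calls already shown:

row_count_expr = iris.count()    # just a description of the count; nothing has run yet
print(row_count_expr.execute())  # 150 -- the count actually runs here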
Once you’ve got your data loaded, you can move on to building the pipeline.
Build and wrap your pipeline
Now you’ll create a scikit-learn pipeline and wrap it with Xorq. Think of it this way: you’re taking a standard scikit-learn workflow and adding Xorq’s caching layer.
Add this to train_classifier.py:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline as SklearnPipeline
from xorq.expr.ml import Pipeline
sklearn_pipeline = SklearnPipeline([
('scaler', StandardScaler()),
('classifier', KNeighborsClassifier(n_neighbors=5))
])
xorq_pipeline = Pipeline.from_instance(sklearn_pipeline)
print("Pipeline wrapped with Xorq!")- 1
- Create a standard scikit-learn pipeline. It normalizes features, then classifies with k-nearest neighbors.
- 2
-
Wrap it with
Pipeline.from_instance(). This adds deferred execution.
Why normalize first? K-nearest neighbors is distance-based. If one feature ranges from 0 to 100 and another from 0 to 1, then the first feature dominates. Normalization fixes this.
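If you want to see what the scaler does on its own, here's a tiny standalone check with plain scikit-learn. It isn't part of train_classifier.py, and the numbers are made up purely for illustration:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: without scaling, the first would dominate KNN distances.
X = np.array([[100.0, 0.1], [50.0, 0.5], [0.0, 0.9]])
print(StandardScaler().fit_transform(X))  # each column now has mean 0 and unit variance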
Train the model (but don’t execute yet)
Here’s the thing: when you call .fit(), you’re not actually training yet. You’re describing the training.
Add this to train_classifier.py:
fitted_pipeline = xorq_pipeline.fit(
iris,
features=features,
target=target
)
print("Training described (but hasn't run yet)!")- 1
- Describe the training operation. Xorq builds a graph node, but doesn’t execute.
The key insight: You’ve told Xorq what to do, but it hasn’t done it yet. Training only happens when you call .execute().
Understanding this timing helps you optimize workflows. You can describe complex pipelines, then execute once.
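For example, you could describe a second fit on just two of the features. Because nothing executes yet, the extra description is essentially free until something downstream calls .execute(). This is an optional aside that reuses only the .fit() signature shown above:

# Another deferred fit, using only the first two feature columns -- still just a description
fitted_small = xorq_pipeline.fit(iris, features=features[:2], target=target)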
Make predictions and see results
Now for the payoff. You’ll make predictions, and that’s when everything executes.
Add this to train_classifier.py:
def as_struct(expr, name=None):
    struct = xo.struct({c: expr[c] for c in expr.columns})
    if name:
        struct = struct.name(name)
    return struct
ORIGINAL_ROW = "original_row"
predictions_expr = (
iris.mutate(as_struct(iris, name=ORIGINAL_ROW))
.pipe(fitted_pipeline.predict)
.drop(target)
.unpack(ORIGINAL_ROW)
)
predictions = predictions_expr.execute()
print("\nFirst 10 predictions:")
print(predictions[["species", "predicted"]].head(10))- 1
- Create a helper that packages data into a struct. This preserves original values.
- 2
- Build the prediction expression. Still deferred.
- 3
- Execute! Training and prediction happen now, in one optimized pass.
You’ll see:
First 10 predictions:
species predicted
0 setosa setosa
1 setosa setosa
2 setosa setosa
3 setosa setosa
...
What does success look like? Your predictions match the actual species. When you called .execute(), Xorq trained the pipeline and made predictions in one shot.
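If you're curious what the struct column holds before .unpack() restores the original columns, you can peek at it. This optional check assumes .execute() returns a pandas DataFrame, which the indexing above already relies on:

packed = iris.mutate(as_struct(iris, name=ORIGINAL_ROW)).execute()
print(packed[ORIGINAL_ROW].head(3))  # each value bundles one row's original columns into a single struct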
Next, you’ll check how accurate your model is.
Check your accuracy
Now check how well your model performed. Add this to train_classifier.py:
accuracy_expr = (
predictions_expr
.mutate(correct=xo._.species == xo._.predicted)
.agg(
total=xo._.species.count(),
correct_count=xo._.correct.sum().cast("int64"),
)
.mutate(accuracy=xo._.correct_count / xo._.total)
)
result = accuracy_expr.execute()
accuracy = result["accuracy"][0]
correct = result["correct_count"][0]
total = result["total"][0]
print(f"\nAccuracy: {accuracy:.1%}")
print(f"Got {correct} out of {total} correct")- 1
- Build an accuracy calculation. Create a boolean for correct predictions, count them, compute the ratio.
- 2
- Execute and display results.
You’ll typically see:
Accuracy: 96.7%
Got 145 out of 150 correct
Why this pattern matters: evaluation is also deferred. You describe the metrics, then execute once, so evaluation stays in the same optimized graph as training and prediction.
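As a quick sanity check, you can recompute the same number eagerly from the predictions DataFrame you already executed (again assuming .execute() returns a pandas DataFrame):

eager_accuracy = (predictions["species"] == predictions["predicted"]).mean()
print(f"Eager check: {eager_accuracy:.1%}")  # should match the deferred result above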
Complete example
Here’s everything in one file. This is what you built:
# train_classifier.py
import xorq.api as xo
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline as SklearnPipeline
from xorq.expr.ml import Pipeline
# Load data
iris = xo.examples.iris.fetch()
target = "species"
features = tuple(iris.drop(target).schema())
# Build and wrap pipeline
sklearn_pipeline = SklearnPipeline([
('scaler', StandardScaler()),
('classifier', KNeighborsClassifier(n_neighbors=5))
])
xorq_pipeline = Pipeline.from_instance(sklearn_pipeline)
# Train (deferred)
fitted_pipeline = xorq_pipeline.fit(iris, features=features, target=target)
# Predict (deferred)
def as_struct(expr, name=None):
    struct = xo.struct({c: expr[c] for c in expr.columns})
    if name:
        struct = struct.name(name)
    return struct
ORIGINAL_ROW = "original_row"
predictions_expr = (
iris.mutate(as_struct(iris, name=ORIGINAL_ROW))
.pipe(fitted_pipeline.predict)
.drop(target)
.unpack(ORIGINAL_ROW)
)
# Evaluate (deferred)
accuracy_expr = (
predictions_expr
.mutate(correct=xo._.species == xo._.predicted)
.agg(
total=xo._.species.count(),
correct_count=xo._.correct.sum().cast("int64"),
)
.mutate(accuracy=xo._.correct_count / xo._.total)
)
# Execute and show results
result = accuracy_expr.execute()
accuracy = result["accuracy"][0]
correct = result["correct_count"][0]
total = result["total"][0]
print(f"Accuracy: {accuracy:.1%}")
print(f"Got {correct} out of {total} correct")
# Show sample predictions
predictions = predictions_expr.execute()
print("\nSample predictions:")
print(predictions[["species", "predicted"]].head(10))Run it:
python train_classifier.py

Notice how you built multiple deferred operations (training, prediction, accuracy) and only executed them at the end. Xorq optimized the entire graph.
What you learned
You built a complete ML workflow with deferred execution. Here’s what you accomplished:
- Loaded data as expressions (deferred)
- Wrapped a scikit-learn pipeline with Xorq
- Trained a model without immediate execution
- Made predictions using the struct pattern
- Evaluated accuracy with deferred metrics
The key insight? Deferred execution lets you describe complex workflows, then Xorq handles optimization and caching automatically.
Next steps
Now that you’ve trained your first model, continue learning:
- Split data for training shows how to create proper train/test splits
- Understand the catalog explains how to save and reuse trained models
- Cache ML computations covers caching strategies for expensive operations