Build a semantic catalog

Use the Boring Semantic Layer (BSL) to define a flights model, query it, and catalog it as a recoverable artifact

This tutorial shows you how to build a semantic model over an FAA-style flights dataset using the Boring Semantic Layer (BSL), query it through Xorq, and store it in the catalog so anyone (or any future build) can recover the model and issue new queries against it.

After completing this tutorial, you'll understand how to define dimensions and measures with BSL, query them through Xorq, and round-trip the model through the catalog so a downstream consumer can recover it and ask their own questions.

Prerequisites

You need:

  • A working Python installation
  • The uv package manager (used to set up the project below)

Set up a project directory

catalog.add(...) needs a pyproject.toml in the working directory (or an ancestor) so it can pin the xorq version embedded in the entry. Create a fresh project and pull in xorq with the BSL and DuckDB extras:

mkdir flights-tutorial && cd flights-tutorial
uv init --bare
uv add "xorq[bsl,duckdb]"
printf '\n[tool.setuptools]\npy-modules = []\n' >> pyproject.toml

The rest of the tutorial assumes commands are run from inside flights-tutorial/.

Note: Why --bare?

Plain uv init drops a sample main.py next to your pyproject.toml and runs git init. Both bite you later: setuptools’ auto-discovery sees main.py + flights_catalog.py and refuses to build a wheel for catalog.add(...) (multiple top-level modules), and the empty git repo (no HEAD) makes xorq’s import-time git probe write fatal: ambiguous argument 'HEAD' to stderr on every run. --bare skips both — only pyproject.toml is created.

Note: Why the py-modules = [] line?

catalog.add(...) builds a wheel of your project to embed in the catalog entry as a dep-pinning artifact. With no [tool.setuptools] config, setuptools auto-discovers top-level .py modules and refuses to build the wheel as soon as it finds more than one — and this tutorial gives you two (flights_catalog.py and recover_flights.py). py-modules = [] tells setuptools “no modules in the wheel,” so the wheel builds empty. That’s fine here: xorq only needs the wheel for its dependency metadata, not to redistribute your scripts.
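For reference, the printf from the setup step leaves exactly this table at the end of pyproject.toml; that is the whole fix:

```toml
[tool.setuptools]
py-modules = []
```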

Note: Activate the project venv

uv add sets up a .venv in flights-tutorial/. Activate it once in your shell so plain python uses it:

source .venv/bin/activate

The rest of the tutorial assumes the venv is active.

Note: Why DuckDB?

The catalog stores the in-memory flights table as a parquet file alongside the entry. Reading that parquet back when you recover the model uses DuckDB by default, so it has to be installed.

Note: Why a project directory?

If you run the script from a directory without a pyproject.toml, catalog.add(...) raises cannot locate a pyproject.toml .... uv init creates one for you; if you’re not using uv, an empty pyproject.toml next to your script is enough — or pass project_path= to catalog.add(...) explicitly.

Tip: What is BSL?

The Boring Semantic Layer is a small, declarative semantic layer that lets you attach dimensions (groupings) and measures (aggregations) to a table once, then issue many different queries without repeating the SQL or the Python. Xorq integrates with BSL so a SemanticModel can be stamped onto an expression and stored in the catalog.

What you’ll build

A reusable semantic model over flights data with:

  • Three dimensions: origin, destination, carrier
  • Three measures: flight_count, avg_dep_delay, total_distance

You’ll query it two different ways, then catalog the model so a colleague can pull it down and ask their own questions — without ever seeing your original Python file.

Create the flights dataset

Start with a small FAA-style flights table. The columns mirror what you’d find in the FAA On-Time Performance dataset (or nycflights13): an origin airport, a destination airport, a carrier code, departure delay in minutes, and route distance in miles.

Create a file called flights_catalog.py:

# flights_catalog.py
import xorq.api as xo


flights = xo.memtable(
    {
        "origin":      ["JFK", "LAX", "ORD", "JFK", "LAX", "ORD", "JFK", "LAX"],
        "destination": ["LAX", "ORD", "JFK", "ORD", "JFK", "LAX", "LAX", "JFK"],
        "carrier":     ["AA",  "UA",  "AA",  "UA",  "AA",  "UA",  "AA",  "UA"],
        "dep_delay":   [10.0, -5.0,  30.0,  15.0, -2.0,  45.0,   5.0,  20.0],
        "distance":    [2475, 1745,   740,  1300, 2475,  1745,  2475,  2475],
    },
    name="flights",
)
1. xo.memtable builds a deferred Xorq table from inline data. Nothing executes yet — flights is an expression you can pass to BSL.
Note: Why an in-memory table?

For the tutorial we keep the data inline so you can run the whole thing offline. In production, swap xo.memtable(...) for a real source — con.read_parquet(...), a Postgres table, or a pinned dataset. The semantic model on top doesn’t change.

Define the semantic model

A BSL SemanticModel is a table plus a vocabulary of dimensions and measures. You build it by wrapping the Xorq expression with to_semantic_table and chaining .with_dimensions(...) and .with_measures(...).

Append to flights_catalog.py:

from boring_semantic_layer import to_semantic_table


flights_model = (
    to_semantic_table(flights)
    .with_dimensions(
        origin=lambda t: t.origin,
        destination=lambda t: t.destination,
        carrier=lambda t: t.carrier,
    )
    .with_measures(
        flight_count=lambda t: t.count(),
        avg_dep_delay=lambda t: t.dep_delay.mean(),
        total_distance=lambda t: t.distance.sum(),
    )
)

print("Dimensions:", tuple(flights_model.dimensions))
print("Measures:  ", tuple(flights_model.measures))
1. Each dimension and measure is a lambda that takes the table and returns an expression. BSL stores the lambda — it doesn’t run it until you query.
Tip: Dimensions vs. measures

A dimension is a column you can group by (or filter on). A measure is an aggregation: counts, means, sums, anything that collapses rows. The split is what lets BSL turn query(dimensions=..., measures=...) into the right group_by(...).agg(...) for you.
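The shape of that idea can be sketched in plain Python (a toy illustration, not BSL's implementation): a registry of measure lambdas, plus a query helper that groups rows by a dimension and applies only the measures you name.

```python
from collections import defaultdict

# A handful of rows standing in for the flights table.
rows = [
    {"origin": "JFK", "dep_delay": 10.0},
    {"origin": "JFK", "dep_delay": -2.0},
    {"origin": "LAX", "dep_delay": -5.0},
]

# The registry: each measure is a lambda over a group of rows,
# defined once and reused by every query.
measures = {
    "flight_count": lambda group: len(group),
    "avg_dep_delay": lambda group: sum(r["dep_delay"] for r in group) / len(group),
}

def query(rows, dimension, measure_names):
    # Group rows by the dimension value...
    groups = defaultdict(list)
    for r in rows:
        groups[r[dimension]].append(r)
    # ...then evaluate only the requested measures per group.
    return {
        key: {name: measures[name](group) for name in measure_names}
        for key, group in sorted(groups.items())
    }

result = query(rows, "origin", ("flight_count", "avg_dep_delay"))
```

The registry is why a second query is cheap: the aggregation is named once, and every caller asks for it by name.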

Tip: Run as you go

flights_catalog.py grows section by section through the rest of the tutorial. After each addition, run python flights_catalog.py from your project directory — the output shown beneath each block (dimensions, measures, query results) is what you’ll see when you do. The complete script is consolidated at the end in Putting it all together.

Query the model

A model is useless until you ask it questions. flights_model.query(...) returns a regular Xorq expression — the same kind you’d get from table.group_by(...).agg(...) — so .execute() runs it on your backend.

Add a first query: average departure delay by origin airport.

by_origin = flights_model.query(
    dimensions=("origin",),
    measures=("flight_count", "avg_dep_delay"),
).order_by("origin")

print(by_origin.execute())
Note: The same query, without BSL

Because query(...) lowers to ordinary Xorq, the equivalent without the semantic layer is just group_by + agg:

by_origin_plain = flights.group_by("origin").agg(
    flight_count=flights.count(),
    avg_dep_delay=flights.dep_delay.mean(),
)

by_origin_plain.execute() returns the same DataFrame. The semantic-layer version pays off as soon as you have a second query: dep_delay.mean() doesn’t have to be re-typed (or kept in sync between callers), and consumers ask for avg_dep_delay by name without knowing how it’s computed.

You see one row per origin airport, with the count and average delay:

  origin  flight_count  avg_dep_delay
0    JFK             3      10.000000
1    LAX             3       4.333333
2    ORD             2      37.500000
Note: Row order

BSL doesn’t sort the result — the row order you see depends on the backend’s hash layout. The .order_by("origin") above is what makes this output reproducible; without it, the rows can come back in any order. Every query(...) block below this point is sorted for the same reason.

Now ask a different question — total distance flown by each carrier:

by_carrier = flights_model.query(
    dimensions=("carrier",),
    measures=("flight_count", "total_distance"),
).order_by("carrier")

print(by_carrier.execute())
  carrier  flight_count  total_distance
0      AA             4            8165
1      UA             4            7265

Notice: same model, two completely different queries. The model is the contract — query(...) is the conversation.

Tip: What if you ask for something that doesn’t exist?

Try requesting a dimension or measure the model never registered:

flights_model.query(dimensions=("airport",), measures=("flight_count",)).execute()

You get an error immediately, not silently-wrong results:

XorqTypeError: Column 'airport' is not found in table. Existing columns:
'origin', 'destination', 'carrier', 'dep_delay', 'distance'.

This is the second payoff of the semantic layer. Dimensions and measures are a closed vocabulary: airport doesn’t exist, so the query fails before it touches the data. Without the model, a typo in group_by("airport") would give the same error — but a typo in a hand-written measure (say, dep_delay.mean() vs. dep_delay.sum()) wouldn’t fail at all; it would just return a quietly-wrong number. By naming the aggregation avg_dep_delay once, on the model, every caller gets the right one or none at all.
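A toy sketch of that closed-vocabulary check (illustrative only, not BSL's code): requested names are validated against the registered vocabulary before any data is read.

```python
# The model's registered vocabulary, mirroring the tutorial's model.
REGISTERED = {
    "dimensions": {"origin", "destination", "carrier"},
    "measures": {"flight_count", "avg_dep_delay", "total_distance"},
}

def validate(kind, requested, registered=REGISTERED):
    # Reject unknown names up front, before touching any rows.
    unknown = sorted(set(requested) - registered[kind])
    if unknown:
        raise KeyError(
            f"Unknown {kind} {unknown}; existing: {sorted(registered[kind])}"
        )
    return True

validate("measures", ("flight_count",))   # fine
# validate("dimensions", ("airport",))    # raises KeyError before any scan
```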

Catalog the model

To preserve the model itself — not just one query result — turn it into a Xorq expression that carries the BSL metadata via to_tagged(flights_model), then add that expression to the catalog. The catalog is git-backed, so the directory you point it at becomes a versioned store of every entry you add.

from pathlib import Path

from boring_semantic_layer import to_tagged
from xorq.catalog.catalog import Catalog


flights_model_expr = to_tagged(flights_model)


catalog_dir = Path("flights-catalog")
catalog = Catalog.from_repo_path(catalog_dir, init=True)


catalog.add(flights_model_expr, aliases=("flights-model",), sync=False)

print("Catalog at:", catalog_dir.absolute())
print("Aliases:   ", catalog.list_aliases())
1. to_tagged(flights_model) serializes the dimensions, measures, and underlying table into a Xorq expression with BSL metadata attached. We bind it to flights_model_expr to make the role explicit: it’s the expression form of the model, ready for the catalog. You’re cataloging the model itself, not the result of one of its queries.
2. A stable path inside the project directory. Put it anywhere you like — but a real folder (not a temp dir) is what lets the next script find the catalog by path. Catalog.from_repo_path(..., init=True) initializes a fresh git repo there.
3. The alias flights-model is the human-readable handle. Internally each entry has a content-addressed hash; the alias just points at it.
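The alias/hash relationship can be sketched in a few lines of plain Python (an illustration of content addressing in general, not xorq's actual storage scheme):

```python
import hashlib

def content_hash(payload: bytes) -> str:
    # Short sha256 prefix; enough for the illustration.
    return hashlib.sha256(payload).hexdigest()[:12]

store, aliases = {}, {}

entry = b"serialized flights model expression"  # stand-in payload
digest = content_hash(entry)
store[digest] = entry                # entries live under their content hash
aliases["flights-model"] = digest    # the alias is just a pointer to one

# Resolving the alias recovers the entry; identical content always lands
# on the same hash, so re-adding it is a no-op.
recovered = store[aliases["flights-model"]]
```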
Note: Why go through the catalog?

The catalog speaks the language of expressions: schemas, lineage, content hashes, deferred reads. By tagging the model and storing it as a catalog entry, the model travels through the same pipes as everything else — xorq build, xorq run, lineage tools — and from_tagged lets you get the rich Python object back when you need it.

Tip: Pointing at a real catalog

For team use, point catalog_dir at a checked-out git repo instead of a local folder, push commits with sync=True, and your colleagues can clone it. Aliases survive across machines because they live in git.

Recover the model from a separate script

Here’s the payoff. The catalog is now persisted at ./flights-catalog/ — a regular git directory you could commit, push, share, or back up. Switch hats: imagine you’re a different person on the team. You have access to that directory, but you’ve never seen flights_catalog.py and don’t know how the model was built. All you have is the alias.

Create a new file alongside flights_catalog.py, called recover_flights.py:

# recover_flights.py
from pathlib import Path

from boring_semantic_layer import from_tagged
from xorq.catalog.catalog import Catalog


catalog = Catalog.from_repo_path(Path("flights-catalog"), init=False)


flights_entry = catalog.get_catalog_entry("flights-model", maybe_alias=True)
flights_model = from_tagged(flights_entry.expr)

print("Recovered type:    ", type(flights_model).__name__)
print("Recovered dims:    ", tuple(flights_model.dimensions))
print("Recovered measures:", tuple(flights_model.measures))


by_destination = flights_model.query(
    dimensions=("destination",),
    measures=("flight_count", "total_distance"),
).order_by("destination")

print(by_destination.execute())
1. init=False opens the existing catalog at the path. Note what’s not imported: nothing from flights_catalog.py, no to_semantic_table, no inline flights data. The catalog directory is the only handoff.
2. flights_entry is the catalog handle — content hash, alias, sidecar metadata, and the cataloged expression on .expr. from_tagged(...) reads the BSL metadata off that expression and reconstructs a live SemanticModel: same dimensions, same measures, same underlying table.
3. A brand-new query that the original Python file never even mentioned. The model’s vocabulary is enough.

Run it:

python recover_flights.py
Recovered type:     SemanticModel
Recovered dims:     ('origin', 'destination', 'carrier')
Recovered measures: ('flight_count', 'avg_dep_delay', 'total_distance')
  destination  flight_count  total_distance
0         JFK             3            5690
1         LAX             3            6695
2         ORD             2            3045

This is the property that makes the catalog interesting: you stored the model, and any consumer with access to the catalog directory can ask anything the model’s dimensions and measures are designed to answer — without seeing or running your original code.

Putting it all together

Two scripts, one shared catalog directory.

flights_catalog.py — defines the model, queries it, publishes it:

# flights_catalog.py
from pathlib import Path

from boring_semantic_layer import to_semantic_table, to_tagged

import xorq.api as xo
from xorq.catalog.catalog import Catalog


# 1. Source table
flights = xo.memtable(
    {
        "origin":      ["JFK", "LAX", "ORD", "JFK", "LAX", "ORD", "JFK", "LAX"],
        "destination": ["LAX", "ORD", "JFK", "ORD", "JFK", "LAX", "LAX", "JFK"],
        "carrier":     ["AA",  "UA",  "AA",  "UA",  "AA",  "UA",  "AA",  "UA"],
        "dep_delay":   [10.0, -5.0,  30.0,  15.0, -2.0,  45.0,   5.0,  20.0],
        "distance":    [2475, 1745,   740,  1300, 2475,  1745,  2475,  2475],
    },
    name="flights",
)

# 2. Semantic model
flights_model = (
    to_semantic_table(flights)
    .with_dimensions(
        origin=lambda t: t.origin,
        destination=lambda t: t.destination,
        carrier=lambda t: t.carrier,
    )
    .with_measures(
        flight_count=lambda t: t.count(),
        avg_dep_delay=lambda t: t.dep_delay.mean(),
        total_distance=lambda t: t.distance.sum(),
    )
)

# 3. Query
print(flights_model.query(
    dimensions=("origin",),
    measures=("flight_count", "avg_dep_delay"),
).order_by("origin").execute())

print(flights_model.query(
    dimensions=("carrier",),
    measures=("flight_count", "total_distance"),
).order_by("carrier").execute())

# 4. Tag the model and add it to the catalog
flights_model_expr = to_tagged(flights_model)

catalog_dir = Path("flights-catalog")
catalog = Catalog.from_repo_path(catalog_dir, init=True)
catalog.add(flights_model_expr, aliases=("flights-model",), sync=False)

recover_flights.py — reads the catalog from scratch, recovers the model, runs a new query:

# recover_flights.py
from pathlib import Path

from boring_semantic_layer import from_tagged
from xorq.catalog.catalog import Catalog


catalog = Catalog.from_repo_path(Path("flights-catalog"), init=False)
flights_entry = catalog.get_catalog_entry("flights-model", maybe_alias=True)
flights_model = from_tagged(flights_entry.expr)

print(flights_model.query(
    dimensions=("destination",),
    measures=("flight_count", "total_distance"),
).order_by("destination").execute())

Run them:

python flights_catalog.py
python recover_flights.py

What you learned

  • The Boring Semantic Layer turns a Xorq table into a SemanticModel with named dimensions and measures.
  • flights_model.query(dimensions=..., measures=...) produces an ordinary Xorq expression, so .execute() runs on any Xorq backend.
  • Asking for a dimension or measure the model didn’t register raises an error before any data is touched — typos in measure definitions can’t return quietly-wrong numbers.
  • to_tagged(flights_model) produces a catalog-ready expression, and from_tagged(flights_entry.expr) reconstructs the live SemanticModel on the other side.

The point of the BSL + catalog combination is decoupling: the team that owns the data publishes the model once, and every downstream user gets a typed, queryable object instead of a frozen result set.

Next steps