Working with the catalog

Share a catalog to GitHub, accept changes from a collaborator via pull request, and swap profiles when recovering entries

In Build a semantic catalog you tagged a BSL flights model and dropped it into a local catalog. That catalog is a regular git repository—every entry, alias, and revision is a git commit. The moment you push it to a remote, anyone with access can clone it, query the model, propose changes, and run the model against their own backend.

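This tutorial walks you through the collaboration loop end-to-end. You’ll play both sides: User A, who publishes the catalog and merges the change, and User B, who clones it, queries the model, and proposes a change through a pull request.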

You’ll also see how to swap the connection profile at recovery time, so a downstream user can run the cataloged expression against their own backend without modifying the entry.

Prerequisites

You need:

  • Completed Build a semantic catalog, with the same model definition and project layout. This tutorial recreates the catalog at a stable path, since the foundation tutorial used a temp directory.

  • User A’s flights-tutorial/ project directory from the foundation tutorial. This tutorial assumes it lives at ~/flights-tutorial/ so it sits as a sibling of User B’s ~/flights-tutorial-userb/, which you’ll create later in this tutorial. If you put it somewhere else, either move it now (mv path/to/flights-tutorial ~/flights-tutorial) or substitute your actual path wherever you see ~/flights-tutorial below.

  • The sqlite extra installed in that project. The foundation tutorial installed xorq[bsl,duckdb]; add sqlite here to demonstrate the profile-swap section against a different backend than the one User A built the entry with. From inside ~/flights-tutorial/:

    uv add "xorq[bsl,duckdb,sqlite]"

  • Git installed locally and authenticated with GitHub (the gh command-line tool is convenient but not required).

Note: A catalog is a git repo

Xorq’s catalog stores entries as files in a git repository: catalog.yaml is the index, aliases/ holds alias pointers, entries/ holds the entry zips themselves, and metadata/ holds a sidecar yaml per entry. Every catalog.add(...) is one git commit. That means GitHub’s permission model, branch protection rules, and pull requests Just Work—there’s no separate object store, no extra service to provision.

Publish the catalog to GitHub (User A)

Tip: Follow along without GitHub

You don’t need a GitHub account to work through this tutorial: a local bare git repo behaves the same way for clone/push/pull. Each remote-using step below comes in a GitHub variant and a Local bare repo variant; pick one and stay on it.

Recreate the catalog at a stable path (~/work/flights-catalog-usera) so the rest of the tutorial has somewhere persistent to point at, then re-add the flights model the same way the foundation tutorial did. Save the snippet below as publish_catalog.py in User A’s ~/flights-tutorial/ project directory and run it with uv run python publish_catalog.py from there:

# publish_catalog.py
from pathlib import Path

from boring_semantic_layer import to_semantic_table, to_tagged
import xorq.api as xo
from xorq.catalog.catalog import Catalog

catalog_dir = Path("~/work/flights-catalog-usera").expanduser()
catalog_dir.parent.mkdir(parents=True, exist_ok=True)

catalog = Catalog.from_repo_path(catalog_dir, init=True)

# Same memtable + semantic model as the foundation tutorial
flights = xo.memtable(
    {
        "origin":      ["JFK", "LAX", "ORD", "JFK", "LAX", "ORD", "JFK", "LAX"],
        "destination": ["LAX", "ORD", "JFK", "ORD", "JFK", "LAX", "LAX", "JFK"],
        "carrier":     ["AA",  "UA",  "AA",  "UA",  "AA",  "UA",  "AA",  "UA"],
        "dep_delay":   [10.0, -5.0,  30.0,  15.0, -2.0,  45.0,   5.0,  20.0],
        "distance":    [2475, 1745,   740,  1300, 2475,  1745,  2475,  2475],
    },
    name="flights",
)
flights_model = (
    to_semantic_table(flights)
    .with_dimensions(
        origin=lambda t: t.origin,
        destination=lambda t: t.destination,
        carrier=lambda t: t.carrier,
    )
    .with_measures(
        flight_count=lambda t: t.count(),
        avg_dep_delay=lambda t: t.dep_delay.mean(),
        total_distance=lambda t: t.distance.sum(),
    )
)
flights_model_expr = to_tagged(flights_model)
catalog.add(flights_model_expr, aliases=("flights-model",), sync=False)

The catalog_dir.parent.mkdir(...) call is there because Catalog.from_repo_path(..., init=True) creates the leaf directory but not its parent, so ~/work/ has to exist first.

Note: Expect wheel-build output on catalog.add(...)

Each catalog.add(...) packages your project as a wheel and stores it inside the entry, so the cataloged expression keeps a frozen record of the dependencies it was built against. You’ll see Building wheel..., running egg_info, and Successfully built ...whl in the output, plus a UserWarning about local filesystem paths from the inline memtable—both are expected and not errors.

Wire up a remote. The first push needs -u to set upstream tracking:

GitHub: Create an empty repository on GitHub (web UI: “New repository” → leave empty), then run these in the catalog directory:

cd ~/work/flights-catalog-usera
git remote add origin https://github.com/<you>/flights-catalog.git
git push -u origin main

Or, equivalently, with the gh command-line tool from inside the catalog directory—--source=. means “create the repo from this working directory,” so the cd matters:

cd ~/work/flights-catalog-usera
gh repo create <you>/flights-catalog --public --source=. --remote=origin --push

Local bare repo: Initialize a local bare repo to act as the “remote”. It has the same git semantics, and no GitHub account is needed:

git init --bare ~/work/flights-catalog-remote.git

Then, in the catalog directory (note: this is ~/work/flights-catalog-usera, not the bare repo you just created):

cd ~/work/flights-catalog-usera
git remote add origin "file://$HOME/work/flights-catalog-remote.git"
git push -u origin main

That first push has to use raw git: catalog.push() doesn’t add remotes or set upstream tracking, so there’s nothing for it to push to until git remote add + git push -u have run once. Every subsequent publish can use catalog.push(), which runs git push against every remote configured on the repo. Either append it to the bottom of publish_catalog.py (where catalog is already in scope), or run it as its own one-off—save the snippet below as push_catalog.py in User A’s ~/flights-tutorial/ project and run uv run python push_catalog.py from there:

# push_catalog.py
from pathlib import Path

from xorq.catalog.catalog import Catalog

catalog = Catalog.from_repo_path(Path("~/work/flights-catalog-usera").expanduser(), init=False)
catalog.push()
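
Because catalog.push() has nothing to push to until a remote exists, a small guard can check for one first. The sketch below is a standard-library illustration only: it reads .git/config directly, so it assumes a regular (non-bare) clone whose config lives at that path. The helper name configured_remotes and the example URL are made up for this sketch and are not part of Xorq.

```python
import re
import tempfile
from pathlib import Path

# Git records each remote as a section header such as: [remote "origin"]
REMOTE_SECTION = re.compile(r'^\s*\[remote "([^"]+)"\]', re.MULTILINE)


def configured_remotes(repo_dir: Path) -> list[str]:
    """Return the remote names declared in repo_dir/.git/config (empty if none)."""
    config_path = repo_dir / ".git" / "config"
    if not config_path.is_file():
        return []
    return REMOTE_SECTION.findall(config_path.read_text())


# Demo against a throwaway config file; the URL is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / ".git").mkdir()
    (repo / ".git" / "config").write_text("[core]\n\tbare = false\n")
    before = configured_remotes(repo)
    with (repo / ".git" / "config").open("a") as fh:
        fh.write('[remote "origin"]\n\turl = https://example.invalid/flights-catalog.git\n')
    after = configured_remotes(repo)

print(before, after)
```

If the list comes back empty for the catalog directory, run git remote add and git push -u once before relying on catalog.push().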

Verify the remote sees what you expect. On GitHub:

gh repo view --web   # opens the repo on GitHub

You should see catalog.yaml (the index) at the top level, plus aliases/, entries/, and metadata/ directories. Click into aliases/ and you’ll see flights-model.zip—the alias from the foundation tutorial.

On the local bare repo, there is no working tree to ls, so list its files at main directly:

git -C ~/work/flights-catalog-remote.git ls-tree -r main

You should see catalog.yaml, aliases/flights-model.zip, an entries/<hash>.zip, and a matching metadata/<hash>.zip.metadata.yaml—the same files that would appear under “Files” on GitHub.
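
The pairing between entry zips and their metadata sidecars can be checked mechanically. The sketch below takes a list of paths (for example, the output of git ls-tree -r --name-only main split into lines) and reports any entries/<hash>.zip without a matching metadata/<hash>.zip.metadata.yaml. The helper name and the placeholder hashes abc123 and def456 are assumptions for illustration.

```python
def entries_missing_sidecars(paths: list[str]) -> list[str]:
    """Return entries/<hash>.zip paths that lack metadata/<hash>.zip.metadata.yaml."""
    present = set(paths)
    missing = []
    for path in paths:
        if path.startswith("entries/") and path.endswith(".zip"):
            name = path[len("entries/"):]
            if f"metadata/{name}.metadata.yaml" not in present:
                missing.append(path)
    return missing


# Placeholder hashes stand in for real entry hashes.
complete = [
    "aliases/flights-model.zip",
    "catalog.yaml",
    "entries/abc123.zip",
    "metadata/abc123.zip.metadata.yaml",
]
incomplete = complete + ["entries/def456.zip"]

print(entries_missing_sidecars(complete))
print(entries_missing_sidecars(incomplete))
```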

Set up User B’s project

Switch hats. User B is on a different machine (or pretending to be—same laptop, different working directory and venv). Give them their own uv project so they aren’t sharing User A’s pyproject.toml or .venv:

mkdir ~/flights-tutorial-userb && cd ~/flights-tutorial-userb
uv init --bare
uv add "xorq[bsl,duckdb,sqlite]"
printf '\n[tool.setuptools]\npy-modules = []\n' >> pyproject.toml

This is the same setup the foundation tutorial walked you through for User A, just under a different directory name. From here on, every User B script runs with uv run python <script>.py from inside ~/flights-tutorial-userb/. The uv run command picks up that project’s .venv automatically, so you never have to deactivate User A’s venv to run User B’s code.

Tip: Two projects, no venv switching

With User A’s flights-tutorial/ and User B’s flights-tutorial-userb/, you have two .venvs on disk. uv run python script.py looks at the pyproject.toml of the directory you’re in and uses that venv—so as long as you cd into the right project before each command, the right venv is used. No source, no deactivate.
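
Conceptually, choosing the project from the current directory amounts to finding the nearest pyproject.toml. The sketch below illustrates that idea with the standard library, assuming the search also considers parent directories. It is not uv’s or Xorq’s actual lookup code, and find_project_root is a hypothetical name.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_project_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above start that holds a pyproject.toml."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / "pyproject.toml").is_file():
            return candidate
    return None


# Two sibling projects, mirroring flights-tutorial/ and flights-tutorial-userb/.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp).resolve()
    for name in ("flights-tutorial", "flights-tutorial-userb"):
        (base / name).mkdir()
        (base / name / "pyproject.toml").touch()
    scripts = base / "flights-tutorial-userb" / "scripts"
    scripts.mkdir()
    root_a = find_project_root(base / "flights-tutorial").name
    root_b = find_project_root(scripts).name

print(root_a, root_b)
```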

Clone the catalog (User B)

User B clones the catalog with one call:

GitHub: Save the snippet below as clone_catalog.py in ~/flights-tutorial-userb/ and run uv run python clone_catalog.py from there:

# clone_catalog.py
from pathlib import Path

from xorq.catalog.catalog import Catalog

catalog = Catalog.clone_from(
    "https://github.com/<you>/flights-catalog.git",
    Path("~/work/flights-catalog-userb").expanduser(),
)

print("Aliases:", catalog.list_aliases())
# ['flights-model']

The same thing is available from the command line—pick one or the other; running both fails on the second clone because the directory already exists:

uv run xorq catalog clone https://github.com/<you>/flights-catalog.git --path ~/work/flights-catalog-userb
uv run xorq catalog --path ~/work/flights-catalog-userb list-aliases

Local bare repo: Save the snippet below as clone_catalog.py in ~/flights-tutorial-userb/ and run uv run python clone_catalog.py from there:

# clone_catalog.py
from pathlib import Path

from xorq.catalog.catalog import Catalog

catalog = Catalog.clone_from(
    f"file://{Path('~/work/flights-catalog-remote.git').expanduser()}",
    Path("~/work/flights-catalog-userb").expanduser(),
)

print("Aliases:", catalog.list_aliases())
# ['flights-model']

The same thing is available from the command line—pick one or the other; running both fails on the second clone because the directory already exists:

uv run xorq catalog clone "file://$HOME/work/flights-catalog-remote.git" --path ~/work/flights-catalog-userb
uv run xorq catalog --path ~/work/flights-catalog-userb list-aliases

User B never saw User A’s Python file, never saw the original to_semantic_table(...) call. All they have is a git clone—and that’s enough.

Recover and query the model (User B)

Recover the model the same way the foundation tutorial did. Because the catalog is plain git, the entry contents arrived during clone_from, so from_tagged can read them immediately. Save the snippet below as recover_model.py in ~/flights-tutorial-userb/ and run it with uv run python recover_model.py:

# recover_model.py
from pathlib import Path

from boring_semantic_layer import from_tagged
from xorq.catalog.catalog import Catalog

catalog = Catalog.from_repo_path(Path("~/work/flights-catalog-userb").expanduser(), init=False)

flights_entry = catalog.get_catalog_entry("flights-model", maybe_alias=True)
flights_model = from_tagged(flights_entry.expr)

print(
    flights_model.query(
        dimensions=("origin",),
        measures=("flight_count", "avg_dep_delay"),
    ).order_by("origin").execute()
)

The recovered SemanticModel has the same dimensions and measures User A defined. The data User A used (the inline xo.memtable(...) from the foundation tutorial) is serialized inside the entry, so the query runs locally—no shared filesystem, no out-of-band data transfer.

Propose a change via pull request (User B)

User B wants to publish a refined view: same model, but filtered to American Airlines only. Because the catalog is a git repo, they branch, commit, push, and open a PR—exactly like any other code change.

cd ~/work/flights-catalog-userb
git checkout -b add-aa-only-model

Build the new entry in Python. User B has the same flights data as User A—it’s the inline memtable from the foundation tutorial—so they reconstruct it the same way. Save the snippet below as add_aa_model.py in ~/flights-tutorial-userb/:

# add_aa_model.py
from pathlib import Path

from boring_semantic_layer import to_semantic_table, to_tagged
import xorq.api as xo
from xorq.catalog.catalog import Catalog

catalog = Catalog.from_repo_path(Path("~/work/flights-catalog-userb").expanduser(), init=False)

flights = xo.memtable(
    {
        "origin":      ["JFK", "LAX", "ORD", "JFK", "LAX", "ORD", "JFK", "LAX"],
        "destination": ["LAX", "ORD", "JFK", "ORD", "JFK", "LAX", "LAX", "JFK"],
        "carrier":     ["AA",  "UA",  "AA",  "UA",  "AA",  "UA",  "AA",  "UA"],
        "dep_delay":   [10.0, -5.0,  30.0,  15.0, -2.0,  45.0,   5.0,  20.0],
        "distance":    [2475, 1745,   740,  1300, 2475,  1745,  2475,  2475],
    },
    name="flights",
)

# Same model shape, restricted to AA
aa_flights = flights.filter(flights.carrier == "AA")
aa_model = (
    to_semantic_table(aa_flights)
    .with_dimensions(
        origin=lambda t: t.origin,
        destination=lambda t: t.destination,
    )
    .with_measures(
        flight_count=lambda t: t.count(),
        avg_dep_delay=lambda t: t.dep_delay.mean(),
    )
)

aa_model_expr = to_tagged(aa_model)
catalog.add(aa_model_expr, aliases=("flights-aa-only",), sync=False)

Run it from User B’s project—uv run puts you in that project’s venv, and catalog.add(...) finds ~/flights-tutorial-userb/pyproject.toml from cwd to build the dependency-pinning wheel:

cd ~/flights-tutorial-userb
uv run python add_aa_model.py

That commits a new entry on the add-aa-only-model branch in your clone—sync=False deliberately keeps it local so you can review and push the branch yourself in the next step.

Note: sync=False

Passing sync=False keeps the add local—it commits to the working branch but doesn’t push to the remote. You’ll push the branch yourself in the next step, after reviewing the diff.
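
When reviewing the new entry, it helps to know what numbers to expect from it. The sketch below recomputes flight_count and avg_dep_delay for the AA rows in plain Python, using the same values as the inline memtable. It does not touch Xorq or BSL, so treat it as a cross-check on the arithmetic rather than a demonstration of the catalog API.

```python
from collections import defaultdict

# Columns copied from the tutorial's inline flights memtable.
origins = ["JFK", "LAX", "ORD", "JFK", "LAX", "ORD", "JFK", "LAX"]
carriers = ["AA", "UA", "AA", "UA", "AA", "UA", "AA", "UA"]
dep_delays = [10.0, -5.0, 30.0, 15.0, -2.0, 45.0, 5.0, 20.0]

# Keep only AA rows, then group their delays by origin.
delays_by_origin = defaultdict(list)
for origin, carrier, delay in zip(origins, carriers, dep_delays):
    if carrier == "AA":
        delays_by_origin[origin].append(delay)

# flight_count is a row count; avg_dep_delay is the mean of dep_delay.
aa_summary = {
    origin: {"flight_count": len(delays), "avg_dep_delay": sum(delays) / len(delays)}
    for origin, delays in sorted(delays_by_origin.items())
}

for origin, measures in aa_summary.items():
    print(origin, measures)
```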

Push the feature branch and open the PR. Run these from User B’s catalog directory (~/work/flights-catalog-userb).

GitHub:

git log --oneline -3                    # confirm: "add: <hash> (aliases flights-aa-only)"
git push -u origin add-aa-only-model
gh pr create --title "Add AA-only flights model" --body "Adds an AA-filtered view of the flights model under alias flights-aa-only."

User A reviews the PR on GitHub. Because each catalog.add(...) is a single commit, the diff is small and readable: a new alias under aliases/, a new entry under entries/, a new sidecar under metadata/, and an update to catalog.yaml.

Local bare repo:

git log --oneline -3                    # confirm: "add: <hash> (aliases flights-aa-only)"
git push -u origin add-aa-only-model

There’s no PR—User A reviews the change as a regular fetched branch (see the next section). The diff is the same: a new alias under aliases/, a new entry under entries/, a new sidecar under metadata/, and an update to catalog.yaml.

Merge the PR and pull the changes (User A)

GitHub: User A reviews the diff, approves the PR, and clicks Merge pull request in the GitHub UI. (The gh command-line equivalent is gh pr merge --merge <pr-number> from a clone, but most readers will do this step in the web UI.)

Local bare repo: There’s no PR UI to merge through, so User A fetches the branch into their catalog and merges by hand:

cd ~/work/flights-catalog-usera
git fetch origin add-aa-only-model
git diff main..origin/add-aa-only-model    # review the change
git merge --no-ff origin/add-aa-only-model -m "Merge: add AA-only flights model"
git push origin main

--no-ff keeps the merge commit so the history matches what GitHub would have produced from a “Merge pull request” click.

Once main has moved on the remote, User A pulls. catalog.pull() runs git pull against every git remote, fast-forwarding the local main. Save the snippet below as pull_catalog.py in User A’s ~/flights-tutorial/ project and run it with uv run python pull_catalog.py from there:

# pull_catalog.py
from pathlib import Path

from xorq.catalog.catalog import Catalog

catalog_a = Catalog.from_repo_path(Path("~/work/flights-catalog-usera").expanduser(), init=False)
catalog_a.pull()

print("Aliases now:", catalog_a.list_aliases())
# ['flights-model', 'flights-aa-only']

Note: On the local-bare-repo path, pull() is effectively a no-op

You merged in your local clone and pushed up—your local main is already at the merged tip, so there’s nothing for pull() to fast-forward. The script is still worth running because it’s the same one-liner User A would use after a teammate clicked “Merge pull request” on GitHub; here it just confirms the new alias is visible.

The same alias is now visible everywhere the catalog is cloned. Anyone with access can recover and query flights-aa-only exactly the way they recover flights-model.

Swap the profile at recovery time

The catalog stores expressions, not connections. When User A built the entry they used the default Xorq backend (a xorq_datafusion session); when User B recovers it, they may want to execute against a different profile—perhaps a SQLite database they’ve configured locally, a Postgres instance with extra resources, or a Snowflake warehouse.

A profile in Xorq is a named connection configuration: con_name plus connection kwargs, serialized to disk. Save one with Profile.from_con(con).save(alias=...), load it with Profile.load(...).
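
To make “a named connection configuration, serialized to disk” concrete, here is a toy stand-in built from the standard library. It is not Xorq’s Profile class: the SketchProfile name, the JSON file format, the explicit directory argument, and the database kwarg are all assumptions chosen for illustration, and Xorq’s real storage format and location may differ.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class SketchProfile:
    """Toy stand-in for the idea of a profile: a backend name plus connection kwargs."""

    con_name: str
    kwargs: dict = field(default_factory=dict)

    def save(self, directory: Path, alias: str) -> Path:
        # One file per alias, so the alias alone is enough to find it again.
        path = directory / f"{alias}.json"
        path.write_text(json.dumps(asdict(self)))
        return path

    @classmethod
    def load(cls, directory: Path, alias: str) -> "SketchProfile":
        return cls(**json.loads((directory / f"{alias}.json").read_text()))


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    # The kwargs here are hypothetical, not Xorq's actual SQLite settings.
    SketchProfile("sqlite", {"database": ":memory:"}).save(store, "local_dev_sqlite")
    restored = SketchProfile.load(store, "local_dev_sqlite")

print(restored)
```

The point of the design is that the alias is the only thing a later script needs to know; the backend name and kwargs travel with the saved file.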

This section demonstrates the swap by moving to SQLite—a genuinely different backend than the default, no server to provision, and the adbc-driver-sqlite connector already came in via the sqlite extra in the prereqs. Save the snippet below as profile_swap.py in ~/flights-tutorial-userb/ and run it with uv run python profile_swap.py:

# profile_swap.py
from pathlib import Path

from xorq.vendor.ibis.backends.profiles import Profile
import xorq.api as xo
from xorq.catalog.catalog import Catalog

catalog = Catalog.from_repo_path(Path("~/work/flights-catalog-userb").expanduser(), init=False)

# Capture User B's preferred connection as a named profile
sqlite_con = xo.sqlite.connect()                 # in-memory SQLite
Profile.from_con(sqlite_con).save(alias="local_dev_sqlite", clobber=True)

# Later—possibly in a different script—load the profile and bind the entry to it
profile = Profile.load("local_dev_sqlite")
con = profile.get_con()
expr = catalog.load("flights-model", con=con)

print("Executing against backend:", con.name)
print(
    expr.group_by("origin")
        .agg(
            flight_count=expr.count(),
            avg_dep_delay=expr.dep_delay.mean(),
        )
        .order_by("origin")
        .execute()
)

Output:

Executing against backend: sqlite
  origin  flight_count  avg_dep_delay
0    JFK             3      10.000000
1    LAX             3       4.333333
2    ORD             2      37.500000

The Executing against backend: sqlite line is the proof—User A cataloged the entry against xorq_datafusion, User B loaded it bound to a SQLite connection, and .execute() shipped the work to SQLite. catalog.load(name, con=...) returns the underlying Xorq expression—the flights table, in this case—bound to whichever connection you pass; you compose any group-by / aggregation you like on top, and .execute() runs it on the chosen backend. The entry on disk is unchanged; the swap is purely a runtime decision.
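
The figures in that output can be cross-checked against SQLite directly with Python’s built-in sqlite3 module. The sketch below loads the same origin and dep_delay values into an in-memory database and runs a hand-written aggregation. The SQL is an assumption about what an equivalent query looks like; the statement Xorq actually generates for SQLite may differ.

```python
import sqlite3

# origin and dep_delay values copied from the tutorial's inline memtable.
rows = [
    ("JFK", 10.0), ("LAX", -5.0), ("ORD", 30.0), ("JFK", 15.0),
    ("LAX", -2.0), ("ORD", 45.0), ("JFK", 5.0), ("LAX", 20.0),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flights (origin TEXT, dep_delay REAL)")
con.executemany("INSERT INTO flights VALUES (?, ?)", rows)

# Hand-written SQL standing in for the group-by / aggregation in profile_swap.py.
result = con.execute(
    """
    SELECT origin, COUNT(*) AS flight_count, AVG(dep_delay) AS avg_dep_delay
    FROM flights
    GROUP BY origin
    ORDER BY origin
    """
).fetchall()
con.close()

for row in result:
    print(row)
```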

Note: catalog.load vs from_tagged

from_tagged(entry.expr) rebuilds the BSL SemanticModel so you can call .query(...) against it—that’s the right tool when you want the semantic-layer interface back. catalog.load(name, con=...) skips the BSL layer and gives you the underlying Xorq expression bound to a connection of your choosing—that’s the right tool when you want to redirect execution to a specific backend without touching the entry. They compose: profiles for the connection, BSL for the dimensions and measures.

What you learned

  • A Xorq catalog is a git repository: every catalog.add(...) is one commit, and the diff is small enough to review on GitHub.
  • Sharing the catalog is catalog.push() (after a one-time git remote add origin + git push -u origin main); cloning uses Catalog.clone_from(...).
  • Collaboration uses the GitHub workflow you already know: branch, commit, push, open a PR. The reviewer sees alias and entry files in the diff; merging makes the new alias available everywhere the catalog is cloned.
  • from_tagged(flights_entry.expr) recovers the BSL model on the consumer side—the same call as in the foundation tutorial, regardless of where the entry came from.
  • catalog.load(name, con=...) rebinds a cataloged expression to a different connection at recovery time. Combined with named Profiles, downstream users can pick their own execution backend—SQLite, Postgres, anything Xorq supports—without modifying the entry.

Next steps