Pins: versioned data artifacts

Xorq uses pins to provide named, versioned access to shared datasets, trained models, and code modules. Instead of shipping large files in the repository or requiring manual downloads, artifacts are stored in a public cloud bucket and accessed by name.

Overview

The pins system in Xorq allows you to:

  • Access shared datasets by name without managing file paths or URLs
  • Load pre-trained ML models for use in pipelines
  • Pin specific versions of artifacts for reproducible pipelines
  • Cache downloads locally so repeated access is fast

Quick start

import xorq.api as xo

# Fetch an example dataset as a table expression
t = xo.examples.iris.fetch()

xo.examples.<name>.fetch() downloads the artifact via pins, reads it into a table expression on the default backend, and caches the download locally. Subsequent calls use the cached copy.

How it works

Xorq wraps the pins Python library with a preconfigured connection to a public GCS bucket (letsql-pins). No authentication is required.

The configuration lives in xo.options.pins:

Setting          Default                                 Description
protocol         "gcs"                                   Storage protocol
path             "letsql-pins"                           GCS bucket name
storage_options  {"cache_timeout": 0, "token": "anon"}   Anonymous access; always check remote freshness
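
For orientation, these settings line up with the arguments of the pins library's board constructor. The sketch below assumes Xorq passes them through unchanged, which this page does not confirm; PINS_DEFAULTS and make_board are illustrative names, not part of Xorq. pins is imported lazily so the snippet loads even where the package is not installed.

```python
# Hypothetical mirror of the defaults in the table above (not Xorq's own object).
PINS_DEFAULTS = {
    "protocol": "gcs",
    "path": "letsql-pins",
    "storage_options": {"cache_timeout": 0, "token": "anon"},
}


def make_board(settings=PINS_DEFAULTS):
    """Build a pins board from the settings (assumed equivalent to get_board())."""
    import pins  # lazy import: needs the pins package, GCS support, and network access

    return pins.board(
        settings["protocol"],
        settings["path"],
        storage_options=settings["storage_options"],
    )
```

Calling make_board() should give a board comparable to the one returned by xo.options.pins.get_board(), if the assumption above holds.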

get_path(name, board=None, **kwargs)

Downloads the named pin and returns a local path. Extra keyword arguments (such as version) are forwarded to the underlying pins library via board.pin_download.

# Get the latest version
path = xo.options.pins.get_path("diamonds")

# Pin a specific version for reproducibility
path = xo.options.pins.get_path("hackernews_lib", version="20250820T111457Z-1d66a")
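
The version string looks like a UTC creation timestamp followed by a short content hash. That reading is an assumption, though it matches the pin_meta output for iris on this page (created '20240716T095120Z', hash 'd77c9'). A small standard-library sketch of splitting such a string; parse_pin_version is a hypothetical helper, not part of pins or Xorq.

```python
from datetime import datetime, timezone


def parse_pin_version(version):
    """Split '<timestamp>-<hash>' into a UTC datetime and the short hash."""
    stamp, _, short_hash = version.partition("-")
    created = datetime.strptime(stamp, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
    return created, short_hash


created, short_hash = parse_pin_version("20250820T111457Z-1d66a")
print(created.isoformat(), short_hash)  # 2025-08-20T11:14:57+00:00 1d66a
```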

Once you have a local path, hand it to any backend reader and execute:

con = xo.connect()
t = con.read_parquet(xo.options.pins.get_path("batting"))
t.execute()
playerID yearID stint teamID lgID G AB R H X2B ... RBI SB CS BB SO IBB HBP SH SF GIDP
0 abercda01 1871 1 TRO NA 1 4.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN
1 addybo01 1871 1 RC1 NA 25 118.0 30.0 32.0 6.0 ... 13.0 8.0 1.0 4.0 0.0 NaN NaN NaN NaN NaN
2 allisar01 1871 1 CL1 NA 29 137.0 28.0 40.0 4.0 ... 19.0 3.0 1.0 2.0 5.0 NaN NaN NaN NaN NaN
3 allisdo01 1871 1 WS3 NA 27 133.0 28.0 44.0 10.0 ... 27.0 1.0 1.0 0.0 2.0 NaN NaN NaN NaN NaN
4 ansonca01 1871 1 RC1 NA 25 120.0 29.0 39.0 11.0 ... 16.0 6.0 2.0 2.0 1.0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
101327 zitoba01 2015 1 OAK AL 3 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
101328 zobribe01 2015 1 OAK AL 67 235.0 39.0 63.0 20.0 ... 33.0 1.0 1.0 33.0 26.0 2.0 0.0 0.0 3.0 5.0
101329 zobribe01 2015 2 KCA AL 59 232.0 37.0 66.0 16.0 ... 23.0 2.0 3.0 29.0 30.0 1.0 1.0 0.0 2.0 3.0
101330 zuninmi01 2015 1 SEA AL 112 350.0 28.0 61.0 11.0 ... 28.0 0.0 1.0 21.0 132.0 0.0 5.0 8.0 2.0 6.0
101331 zychto01 2015 1 SEA AL 13 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

101332 rows × 22 columns

get_board()

Returns the underlying pins board object for advanced operations such as listing available pins or reading metadata.

board = xo.options.pins.get_board()
board.pin_list()        # list all available pins
board.pin_meta("iris")  # get metadata for a pin
Meta(title='iris: a pinned 150 x 5 DataFrame', description=None, created='20240716T095120Z', pin_hash='d77c9966e54405d9', file='iris.csv', file_size=3858, type='csv', api_version=1, version=Version(created=datetime.datetime(2024, 7, 16, 9, 51, 20), hash='d77c9'), tags=None, name='iris', user={}, local={})
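
The fields of the returned metadata object can be read as attributes. The sketch below builds a one-line summary from a few of them; describe_pin is a hypothetical helper, and the stand-in object carries values copied from the output above so the snippet runs offline.

```python
from types import SimpleNamespace


def describe_pin(meta):
    """Return a one-line summary from a pins metadata object."""
    return f"{meta.name}: {meta.type}, {meta.file_size} bytes, created {meta.created}"


# Stand-in with values copied from the pin_meta("iris") output above;
# with network access, pass board.pin_meta("iris") instead.
iris_meta = SimpleNamespace(
    name="iris", type="csv", file_size=3858, created="20240716T095120Z"
)
print(describe_pin(iris_meta))  # iris: csv, 3858 bytes, created 20240716T095120Z
```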

Available pins

Pin                               Format         Description
"iris"                            CSV            Classic iris dataset
"diamonds"                        Parquet        Diamonds pricing dataset
"penguins"                        Parquet        Palmer penguins dataset
"batting"                         Parquet        Baseball batting statistics
"lending-club"                    Parquet        Lending Club loan data
"bank-marketing"                  CSV            Bank marketing dataset
"hn-fetcher-input-small.parquet"  Parquet        HackerNews sample data
"hn_tfidf_fitted_model"           Binary         Pre-trained TF-IDF model
"hn_sentiment_reg"                Binary         Pre-trained XGBoost sentiment model
"hackernews_lib"                  Python module  HackerNews pipeline code (versioned)
"diamonds-model"                  JSON           XGBoost model for predicting diamond price

Common patterns

Loading datasets

The preferred way to load example datasets is xo.examples.<name>.fetch():

import xorq.api as xo

# Fetch with default backend
t = xo.examples.diamonds.fetch()

# Fetch with a specific backend
con = xo.connect()
t = xo.examples.diamonds.fetch(backend=con)

This calls xo.options.pins.get_path("diamonds") under the hood, picks the reader that matches the file's format (for example, CSV or Parquet), and returns a table expression.
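
Format-based reader selection could look roughly like the sketch below. The suffix-to-method mapping is an assumption made for illustration, not a copy of Xorq's implementation; reader_name_for and READERS are hypothetical names, and only read_parquet appears elsewhere on this page.

```python
from pathlib import Path

# Assumed mapping from file suffix to backend reader method name.
READERS = {".csv": "read_csv", ".parquet": "read_parquet"}


def reader_name_for(path):
    """Pick the reader method name for a downloaded pin by file suffix."""
    suffix = Path(path).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"no reader registered for {suffix!r}") from None


print(reader_name_for("iris.csv"))                     # read_csv
print(reader_name_for("/tmp/cache/diamonds.parquet"))  # read_parquet
```

Given a connection con and a local path, getattr(con, reader_name_for(path))(path) would then produce the table expression.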

Loading ML models

Pre-trained models are pinned as binary artifacts and loaded by path:

import pathlib

import xorq.api as xo

TFIDF_MODEL_PATH = pathlib.Path(
    xo.options.pins.get_path("hn_tfidf_fitted_model")
)
XGB_MODEL_PATH = pathlib.Path(
    xo.options.pins.get_path("hn_sentiment_reg")
)

Loading versioned code modules

Python modules can be pinned and loaded with a specific version, ensuring pipeline reproducibility:

import xorq.api as xo
from xorq.common.utils.import_utils import import_python

m = import_python(
    xo.options.pins.get_path("hackernews_lib", version="20250820T111457Z-1d66a")
)
# m is now a module with functions defined in the pinned file

# Everything the pinned module brought into the namespace:
print([name for name in dir(m) if not name.startswith("_")])
['Path', 'base_api_url', 'curry', 'do_hackernews_fetcher_udxf', 'functools', 'get_hackernews_item', 'get_hackernews_maxitem', 'get_hackernews_stories', 'get_hackernews_stories_batch', 'get_json', 'json', 'pd', 'requests', 'schema_in', 'schema_out', 'simple_disk_cache', 'toolz', 'xo']
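
Importing a module from a file path can be done with the standard library alone. The sketch below shows that general technique, not import_python's actual implementation; load_module_from_path and the throwaway demo_lib.py file are illustrative stand-ins for a downloaded pin.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(path, module_name="pinned_module"):
    """Import the Python file at `path` as a module object."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demonstrate with a throwaway file standing in for a downloaded pin.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "demo_lib.py"
    source.write_text(
        "base_api_url = 'https://example.invalid'\n\ndef double(x):\n    return 2 * x\n"
    )
    m = load_module_from_path(source)

print(m.double(21))  # 42
public = [name for name in dir(m) if not name.startswith("_")]
print(public)  # ['base_api_url', 'double']
```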

Caching behavior

The pins library caches downloads automatically in a platform-specific cache directory (typically ~/.cache/pins/ on Linux). With cache_timeout set to 0 in Xorq’s default configuration, the library checks the remote on every call and downloads again only when the pin has changed; otherwise it serves the cached copy.
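
The check-then-serve behavior can be pictured with a toy model. This is a conceptual illustration only, not how pins implements its cache; RevalidatingCache, fake_download, and the version labels are all hypothetical.

```python
class RevalidatingCache:
    """Toy model: ask the remote for the latest version on every call,
    but download only when that version is not cached yet."""

    def __init__(self, latest_version, download):
        self._latest_version = latest_version  # callable: name -> version string
        self._download = download              # callable: (name, version) -> local path
        self._paths = {}                       # (name, version) -> cached local path

    def get_path(self, name):
        version = self._latest_version(name)   # remote check happens on every call
        key = (name, version)
        if key not in self._paths:              # cache miss: new pin or new version
            self._paths[key] = self._download(name, version)
        return self._paths[key]


downloads = []
remote = {"iris": "v1"}


def fake_download(name, version):
    downloads.append((name, version))
    return f"/cache/{name}/{version}"


cache = RevalidatingCache(remote.get, fake_download)
cache.get_path("iris")
cache.get_path("iris")   # remote unchanged: served from cache
remote["iris"] = "v2"
cache.get_path("iris")   # remote changed: downloaded again
print(downloads)         # [('iris', 'v1'), ('iris', 'v2')]
```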