A quick tour of Xorq’s core features. This tutorial covers multi-engine operations, caching, UDXF nodes and Profiles.
Installation
pip install 'xorq[duckdb,examples]'
Working with Data
Xorq leverages Ibis for working with data. For example:
import xorq as xo
from xorq import _

penguins = xo.examples.penguins.fetch(backend=xo.pandas.connect())

# Build a query - nothing executes yet
clean_penguins = (
    penguins
    .filter(
        _.bill_length_mm.isnull() == False,
        _.sex.isnull() == False
    )
    .select(['species', 'island', 'sex', 'bill_length_mm', 'body_mass_g'])
    .mutate(
        body_mass_kg=_.body_mass_g / 1000
    )
)

# Execute to see results
clean_penguins.execute()
       species     island     sex  bill_length_mm  body_mass_g  body_mass_kg
0       Adelie  Torgersen    male            39.1       3750.0         3.750
1       Adelie  Torgersen  female            39.5       3800.0         3.800
2       Adelie  Torgersen  female            40.3       3250.0         3.250
3       Adelie  Torgersen  female            36.7       3450.0         3.450
4       Adelie  Torgersen    male            39.3       3650.0         3.650
...        ...        ...     ...             ...          ...           ...
328  Chinstrap      Dream    male            55.8       4000.0         4.000
329  Chinstrap      Dream  female            43.5       3400.0         3.400
330  Chinstrap      Dream    male            49.6       3775.0         3.775
331  Chinstrap      Dream    male            50.8       4100.0         4.100
332  Chinstrap      Dream  female            50.2       3775.0         3.775

333 rows × 6 columns
For more info on data operations and transformations with Ibis, check the Ibis docs.
Multi-Engine Operations
A core feature of Xorq is the ability to move data between backends using into_backend():
# Move pandas data to DuckDB
ddb = xo.duckdb.connect()
ddb_penguins = clean_penguins.into_backend(ddb, "penguins_clean")

# Now use DuckDB for analytics
species_stats = (
    ddb_penguins
    .group_by(['species', 'island'])
    .agg([
        _.bill_length_mm.mean().name('avg_bill_length'),
        _.body_mass_g.mean().name('avg_body_mass'),
        _.count().name('penguin_count')
    ])
)

# Join across the same backend
enriched = (
    ddb_penguins
    .join(species_stats, ['species', 'island'], how='left')
    .mutate(
        bill_vs_avg=_.bill_length_mm - _.avg_bill_length,
        mass_vs_avg=_.body_mass_g - _.avg_body_mass
    )
)

enriched.execute()
       species     island     sex  bill_length_mm  body_mass_g  body_mass_kg species_right island_right  avg_bill_length  avg_body_mass  penguin_count  bill_vs_avg  mass_vs_avg
0       Adelie  Torgersen    male            39.1       3750.0         3.750        Adelie    Torgersen        39.038298    3708.510638             47     0.061702    41.489362
1       Adelie  Torgersen  female            39.5       3800.0         3.800        Adelie    Torgersen        39.038298    3708.510638             47     0.461702    91.489362
2       Adelie  Torgersen  female            40.3       3250.0         3.250        Adelie    Torgersen        39.038298    3708.510638             47     1.261702  -458.510638
3       Adelie  Torgersen  female            36.7       3450.0         3.450        Adelie    Torgersen        39.038298    3708.510638             47    -2.338298  -258.510638
4       Adelie  Torgersen    male            39.3       3650.0         3.650        Adelie    Torgersen        39.038298    3708.510638             47     0.261702   -58.510638
...        ...        ...     ...             ...          ...           ...           ...          ...              ...            ...            ...          ...          ...
328  Chinstrap      Dream    male            55.8       4000.0         4.000     Chinstrap        Dream        48.833824    3733.088235             68     6.966176   266.911765
329  Chinstrap      Dream  female            43.5       3400.0         3.400     Chinstrap        Dream        48.833824    3733.088235             68    -5.333824  -333.088235
330  Chinstrap      Dream    male            49.6       3775.0         3.775     Chinstrap        Dream        48.833824    3733.088235             68     0.766176    41.911765
331  Chinstrap      Dream    male            50.8       4100.0         4.100     Chinstrap        Dream        48.833824    3733.088235             68     1.966176   366.911765
332  Chinstrap      Dream  female            50.2       3775.0         3.775     Chinstrap        Dream        48.833824    3733.088235             68     1.366176    41.911765

333 rows × 13 columns
Caching
Xorq provides a caching mechanism that can write to Parquet or to a source backend such as DuckDB or Postgres to avoid re-computation:
from pathlib import Path

from xorq.caching import ParquetStorage

con = xo.connect()

# Set up caching
cache_storage = ParquetStorage(source=con, base_path=Path.cwd())

# Cache the result
cached_enriched = enriched.cache(storage=cache_storage)

# First execution computes and caches
result1 = cached_enriched.execute()

# Second execution uses cache
result2 = cached_enriched.execute()  # Much faster
UDXF (User-Defined Exchange Functions)
A UDXF is a user-defined function that spins up an ephemeral Arrow Flight server and executes through a DoExchange call.
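At its core, a UDXF wraps an ordinary DataFrame-in, DataFrame-out function; the Flight server applies that function to the data exchanged over DoExchange. The sketch below shows only the plain-pandas transform such a node would wrap, without xorq's registration wiring (the registration API itself is not shown here, since its exact signature varies):

```python
import pandas as pd


def standardize_mass(df: pd.DataFrame) -> pd.DataFrame:
    """A DataFrame-in, DataFrame-out transform of the shape a UDXF wraps.

    Adds a z-scored body-mass column; the input frame is not mutated.
    """
    out = df.copy()
    mean = out["body_mass_g"].mean()
    std = out["body_mass_g"].std()
    out["body_mass_z"] = (out["body_mass_g"] - mean) / std
    return out


# Local check of the transform itself, independent of any Flight server:
sample = pd.DataFrame({"body_mass_g": [3750.0, 3800.0, 3250.0]})
print(standardize_mass(sample)["body_mass_z"].tolist())
```

Because the transform is just a pure function on a DataFrame, it can be unit-tested locally before being exposed as a UDXF node.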
Profiles
The Profiles API supports serialization to and deserialization from YAML, environment-variable substitution, and security checks to prevent sensitive information from being stored in plain text.
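To illustrate the environment-variable substitution, a profile file might look roughly like the fragment below. This is a hypothetical sketch: the field names are illustrative, not xorq's actual schema, and only the ${...} placeholders reflect the substitution behavior described above.

```yaml
# Hypothetical profile file -- field names are illustrative only
con_name: postgres
kwargs:
  host: ${POSTGRES_HOST}
  user: ${POSTGRES_USER}
  password: ${POSTGRES_PASSWORD}  # substituted from the environment at load time,
                                  # so the secret is never stored in plain text
  database: penguins
```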
Summary
You’ve seen:
Multi-engine operations for moving data between backends with into_backend()
Deferred caching for storing the results of expensive computations
UDXF nodes for creating reusable transformation functions
Profiles for serializing backend connection configuration to YAML