10-minute tour of xorq

This tutorial will walk you through the key features of xorq, a data processing library that enables you to work seamlessly across multiple data engines.

Installation

First, install xorq using pip. We’ll include the duckdb extra to enable the duckdb backend:

pip install 'xorq[duckdb]'

Setting up Connections

Let’s start by creating connections to different backends:

import xorq as xo
from xorq import _  # import the underscore accessor for column reference

# Create connections to different backends
con = xo.connect()  # xorq's main connection
ddb = xo.duckdb.connect()  # DuckDB connection
pg = xo.postgres.connect_examples()  # Postgres connection

In this section, we:

  • Import xorq and the deferred _ accessor, which lets us refer to columns of an expression
  • Create a xorq connection (backed by the xorq backend)
  • Create a DuckDB connection
  • Create a Postgres connection

Note that you can create a custom Postgres connection by specifying the different parameters, for example:

pg = xo.postgres.connect(
  host="localhost",
  port=5432,
  user="postgres",
  password="postgres",
  database="ibis_testing"
)

Reading Data

Now let’s read some data into xorq:

# Read a parquet file using xorq
path = xo.config.options.pins.get_path("batting")
batting = con.read_parquet(path, table_name="batting")

xorq can read data from various sources. Here we’re reading a Parquet file directly. The table_name parameter specifies how this table will be referenced inside the con backend.

Basic Operations

Let’s perform some basic data operations:

# Filtering and selection
recent_batting = (
    batting.filter(_.yearID > 2010)  # filter for recent years
    .select(['playerID', 'yearID', 'teamID', 'G', 'AB', 'R', 'H'])  # select specific columns
)

# Execute to see results
recent_batting.execute()
playerID yearID teamID G AB R H
0 utleych01 2013 PHI 131 476.0 73.0 135.0
1 valaich01 2013 MIA 22 64.0 4.0 14.0
2 valbulu01 2013 CHN 108 331.0 34.0 72.0
3 valdejo02 2013 NYN 66 133.0 16.0 25.0
4 valdelu01 2013 BAL 4 0.0 0.0 0.0
... ... ... ... ... ... ... ...
7122 ugglada01 2013 ATL 136 448.0 60.0 80.0
7123 uptonbj01 2013 ATL 126 391.0 30.0 72.0
7124 uptonju01 2013 ATL 149 558.0 94.0 147.0
7125 uribeju01 2013 LAN 132 388.0 47.0 108.0
7126 urruthe01 2013 BAL 24 58.0 5.0 16.0

7127 rows × 7 columns

Note that xorq operations are lazy: nothing runs until you call .execute(), which evaluates the expression and returns a pandas DataFrame.
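The deferred-execution idea can be sketched in plain Python (this is a conceptual illustration, not xorq's actual implementation): an expression object records operations, and nothing is computed until execute() is called.

```python
# Minimal sketch of deferred execution: operations are recorded, not applied.
class Deferred:
    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops  # recorded operations, not yet run

    def filter(self, predicate):
        # return a new expression; no computation happens here
        return Deferred(self.data, self.ops + (("filter", predicate),))

    def execute(self):
        # only now do the recorded operations actually run
        rows = self.data
        for kind, fn in self.ops:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
        return rows

expr = Deferred([{"yearID": 2009}, {"yearID": 2012}])
recent = expr.filter(lambda r: r["yearID"] > 2010)  # still nothing executed
print(recent.execute())  # [{'yearID': 2012}]
```

Because the expression is just a recorded plan until execution, xorq can inspect, optimize, cache, or ship it to another backend before any data moves.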

Multi-Engine Operations

One of xorq’s powerful features is the ability to move data between different backends using into_backend(). This method converts an expression from one backend into a table in another backend, using a PyArrow RecordBatchReader as an intermediate format:

# Read a table from Postgres and move it to xorq's backend
awards = pg.table("awards_players").into_backend(con, "awards")  # bring into xorq backend

# Perform a join between the two tables
player_awards = (
    recent_batting.join(
        awards,
        ['playerID', 'yearID'],  # join keys
        how='left'  # left join
    )
    .select([
        'playerID',
        'yearID',
        'teamID',
        'awardID',
        'G',
        'AB',
        'H'
    ])
)

player_awards.execute()
playerID yearID teamID awardID G AB H
0 jonesad01 2013 BAL Gold Glove 160 653.0 186.0
1 cruzne02 2011 TEX ALCS MVP 124 475.0 125.0
2 ellsbja01 2011 BOS Comeback Player of the Year 158 660.0 212.0
3 gonzaad01 2011 BOS Gold Glove 159 630.0 213.0
4 wietema01 2011 BAL Gold Glove 139 500.0 131.0
... ... ... ... ... ... ... ...
7196 tracych01 2013 WAS None 92 129.0 26.0
7197 tulowtr01 2013 COL None 126 446.0 139.0
7198 turneju01 2013 NYN None 86 200.0 56.0
7199 uribeju01 2013 LAN None 132 388.0 108.0
7200 urruthe01 2013 BAL None 24 58.0 16.0

7201 rows × 7 columns

into_backend() is particularly useful when you want to:

  • Move data between different database engines
  • Combine data from multiple sources
  • Avoid writing intermediate results to disk
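The streaming hand-off behind into_backend() can be sketched in plain Python (a conceptual sketch, not xorq's internals): the source backend yields record batches one at a time, and the target consumes them as they arrive, so the full table is never materialized to disk in between.

```python
# Sketch of a batch-streaming hand-off between two backends.
def source_batches(rows, batch_size=2):
    # stands in for a PyArrow RecordBatchReader produced by the source backend
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

def load_into_target(batches):
    # stands in for the target backend registering the stream as a table
    table = []
    for batch in batches:
        table.extend(batch)  # each batch is handled as it arrives
    return table

rows = [("jonesad01", 2013), ("cruzne02", 2011), ("ellsbja01", 2011)]
print(load_into_target(source_batches(rows)))
```

In the real library the intermediate format is a PyArrow RecordBatchReader, which both source and target backends understand, so the transfer stays in memory and in Arrow's columnar layout.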

Leveraging Different Backends

Different backends have different strengths. Let’s use DuckDB for some aggregations:

# Move data to DuckDB for additional processing
ddb_awards = player_awards.into_backend(ddb, "ddb_awards")

# Perform DuckDB-specific operations
ddb_summary = (
    ddb_awards.group_by(['yearID', 'awardID'])
    .agg([
        _.count().name('player_count'),
        _.G.mean().name('avg_games'),
        _.H.mean().name('avg_hits'),
    ])
    .order_by(['yearID', 'awardID'])
)

print("Award summary from DuckDB:")
ddb_summary.execute()
Award summary from DuckDB:
yearID awardID player_count avg_games avg_hits
0 2011 ALCS MVP 1 124.000000 125.000000
1 2011 All-Star Game MVP 1 162.000000 170.000000
2 2011 Babe Ruth Award 1 97.000000 99.000000
3 2011 Branch Rickey Award 1 132.000000 145.000000
4 2011 Comeback Player of the Year 2 151.500000 179.500000
... ... ... ... ... ...
80 2015 Rookie of the Year 2 125.000000 131.000000
81 2015 Silver Slugger 18 144.555556 160.722222
82 2015 TSN Major League Player of the Year 1 158.000000 184.000000
83 2015 World Series MVP 1 142.000000 138.000000
84 2015 None 1448 45.295580 25.419199

85 rows × 5 columns
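What the group_by/agg pipeline computes can be sketched in plain Python (an illustration of the semantics, not how DuckDB executes it): rows are bucketed by the grouping key, then count and mean are computed within each bucket.

```python
# Sketch of the group-by aggregation: count and average games per (year, award).
from collections import defaultdict

def summarize(rows):
    groups = defaultdict(list)
    for year, award, games in rows:
        groups[(year, award)].append(games)
    return {
        key: {"player_count": len(g), "avg_games": sum(g) / len(g)}
        for key, g in sorted(groups.items())
    }

rows = [
    (2011, "Gold Glove", 160),
    (2011, "Gold Glove", 139),
    (2011, "ALCS MVP", 124),
]
print(summarize(rows))
```

DuckDB performs the same logical operation, but vectorized and in parallel over columnar data, which is why it is a good choice for aggregation-heavy steps.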

Caching Expressions

xorq provides caching capabilities to optimize performance:

from pathlib import Path
from xorq.caching import ParquetStorage

# Create a storage for cached data
cache_storage = ParquetStorage(source=con, base_path=Path.cwd())

# Cache the results
cached_awards = player_awards.cache(storage=cache_storage)

# The next execution will use the cached data
cached_awards.execute()
playerID yearID teamID awardID G AB H
0 berkmla01 2011 SLN Comeback Player of the Year 145 488.0 147.0
1 verlaju01 2011 DET Cy Young Award 34 4.0 0.0
2 buehrma01 2011 CHA Gold Glove 31 2.0 1.0
3 markani01 2011 BAL Gold Glove 160 641.0 182.0
4 ethiean01 2011 LAN Gold Glove 135 487.0 142.0
... ... ... ... ... ... ... ...
7196 yateski01 2015 TBA None 20 0.0 0.0
7197 ynoara01 2015 COL None 72 127.0 33.0
7198 youngch04 2015 NYA None 140 318.0 80.0
7199 zimmejo02 2015 WAS None 33 63.0 10.0
7200 zychto01 2015 SEA None 13 0.0 0.0

7201 rows × 7 columns
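The cache() pattern can be sketched generically (a simplified model, not xorq's ParquetStorage, which persists results as Parquet files): the first execute() materializes the result to storage, and later executions read the stored copy instead of recomputing the upstream pipeline.

```python
# Sketch of result caching: compute once, then serve from storage.
import json
import tempfile
from pathlib import Path

class CachedExpr:
    def __init__(self, compute, cache_path):
        self.compute = compute            # the expensive upstream pipeline
        self.cache_path = Path(cache_path)

    def execute(self):
        if self.cache_path.exists():
            return json.loads(self.cache_path.read_text())  # cache hit
        result = self.compute()                             # cache miss: run pipeline
        self.cache_path.write_text(json.dumps(result))      # persist for next time
        return result

calls = []
expr = CachedExpr(lambda: calls.append(1) or [1, 2, 3],
                  Path(tempfile.mkdtemp()) / "result.json")
expr.execute()          # computes and writes the cache
expr.execute()          # served from storage
print(len(calls))       # the pipeline ran only once: 1
```

This is why repeated executions of cached_awards are cheap: the join and the cross-backend transfer only happen on the first run.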

Key Takeaways

  • xorq provides a consistent interface across different data engines
  • Operations are lazy until .execute() is called
  • Data can be moved between backends using into_backend()
  • Caching helps optimize performance for frequently used queries
  • Different backends can be leveraged for their specific strengths