# 10-minute tour of xorq
This tutorial will walk you through the key features of xorq, a data processing library that enables you to work seamlessly across multiple data engines.
## Installation

First, install xorq using pip. We’ll include the `duckdb` extra to enable the DuckDB backend:

```bash
pip install 'xorq[duckdb]'
```
## Setting up Connections
Let’s start by creating connections to different backends:
```python
import xorq as xo
from xorq import _  # import the underscore accessor for column references

# Create connections to different backends
con = xo.connect()                   # xorq's main connection
ddb = xo.duckdb.connect()            # DuckDB connection
pg = xo.postgres.connect_examples()  # Postgres connection
```
In this section, we:

- Import xorq and its deferred `_` object, which lets us refer to columns in expressions
- Create a xorq connection (backed by the xorq backend)
- Create a DuckDB connection
- Create a Postgres connection
Note that you can create a custom Postgres connection by specifying the connection parameters yourself, for example:
```python
pg = xo.postgres.connect(
    host="localhost",
    port=5432,
    user="postgres",
    password="postgres",
    database="ibis_testing",
)
```
## Reading Data
Now let’s read some data into xorq:
```python
# Read a parquet file using xorq
path = xo.config.options.pins.get_path("batting")
batting = con.read_parquet(path, table_name="batting")
```
xorq can read data from various sources. Here we’re reading a Parquet file directly. The `table_name` parameter specifies how this table will be referenced inside the `con` backend.
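Because xorq vendors the Ibis expression API, the registered table can also be looked up by name and inspected. A minimal sketch, assuming the standard Ibis-style `con.table()` and `.schema()` methods (the `batting_ref` name is illustrative):

```python
# Look up the registered table by name and inspect its columns
batting_ref = con.table("batting")
print(batting_ref.schema())
```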
## Basic Operations
Let’s perform some basic data operations:
```python
# Filtering and selection
recent_batting = (
    batting.filter(batting.yearID > 2010)  # filter for recent years
    .select(['playerID', 'yearID', 'teamID', 'G', 'AB', 'R', 'H'])  # select specific columns
)

# Execute to see results
recent_batting.execute()
```
|   | playerID | yearID | teamID | G | AB | R | H |
|---|---|---|---|---|---|---|---|
| 0 | utleych01 | 2013 | PHI | 131 | 476.0 | 73.0 | 135.0 |
| 1 | valaich01 | 2013 | MIA | 22 | 64.0 | 4.0 | 14.0 |
| 2 | valbulu01 | 2013 | CHN | 108 | 331.0 | 34.0 | 72.0 |
| 3 | valdejo02 | 2013 | NYN | 66 | 133.0 | 16.0 | 25.0 |
| 4 | valdelu01 | 2013 | BAL | 4 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 7122 | ugglada01 | 2013 | ATL | 136 | 448.0 | 60.0 | 80.0 |
| 7123 | uptonbj01 | 2013 | ATL | 126 | 391.0 | 30.0 | 72.0 |
| 7124 | uptonju01 | 2013 | ATL | 149 | 558.0 | 94.0 | 147.0 |
| 7125 | uribeju01 | 2013 | LAN | 132 | 388.0 | 47.0 | 108.0 |
| 7126 | urruthe01 | 2013 | BAL | 24 | 58.0 | 5.0 | 16.0 |

7127 rows × 7 columns
Note that xorq operations are lazy: they don’t execute until you call `execute()`, which returns a pandas DataFrame.
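Because nothing runs until `execute()`, expressions compose freely. A minimal sketch building on `recent_batting` (the `top_hitters` name and the thresholds are illustrative):

```python
# Keep composing; no query has run yet
top_hitters = (
    recent_batting.filter(_.AB > 400)  # only players with enough at-bats
    .order_by(_.H.desc())              # rank by hits
    .limit(5)
)

top_hitters.execute()  # the entire pipeline runs once, here
```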
## Multi-Engine Operations

One of xorq’s powerful features is the ability to move data between different backends using `into_backend()`. This method converts an expression from one backend into a table in another backend, using a PyArrow RecordBatchReader as an intermediate format:
```python
# Read a table from Postgres and move it into xorq's backend
awards = pg.table("awards_players").into_backend(con, "awards")

# Perform a join between the two tables
player_awards = (
    recent_batting.join(
        awards,
        ['playerID', 'yearID'],  # join keys
        how='left'               # left join
    )
    .select([
        'playerID',
        'yearID',
        'teamID',
        'awardID',
        'G',
        'AB',
        'H',
    ])
)

player_awards.execute()
```
|   | playerID | yearID | teamID | awardID | G | AB | H |
|---|---|---|---|---|---|---|---|
| 0 | jonesad01 | 2013 | BAL | Gold Glove | 160 | 653.0 | 186.0 |
| 1 | cruzne02 | 2011 | TEX | ALCS MVP | 124 | 475.0 | 125.0 |
| 2 | ellsbja01 | 2011 | BOS | Comeback Player of the Year | 158 | 660.0 | 212.0 |
| 3 | gonzaad01 | 2011 | BOS | Gold Glove | 159 | 630.0 | 213.0 |
| 4 | wietema01 | 2011 | BAL | Gold Glove | 139 | 500.0 | 131.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 7196 | tracych01 | 2013 | WAS | None | 92 | 129.0 | 26.0 |
| 7197 | tulowtr01 | 2013 | COL | None | 126 | 446.0 | 139.0 |
| 7198 | turneju01 | 2013 | NYN | None | 86 | 200.0 | 56.0 |
| 7199 | uribeju01 | 2013 | LAN | None | 132 | 388.0 | 108.0 |
| 7200 | urruthe01 | 2013 | BAL | None | 24 | 58.0 | 16.0 |

7201 rows × 7 columns
`into_backend()` is particularly useful when you want to:

- Move data between different database engines
- Combine data from multiple sources
- Avoid writing intermediate results to disk
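Since the hand-off happens as Arrow record batches, you can also consume that stream directly. A minimal sketch using the Ibis-style `to_pyarrow_batches()` method; treat its availability on your xorq version as an assumption:

```python
# Stream the joined result as Arrow record batches without materializing it
reader = player_awards.to_pyarrow_batches()  # assumed Ibis-style API
for batch in reader:
    print(batch.num_rows)  # process each batch incrementally
```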
## Leveraging Different Backends
Different backends have different strengths. Let’s use DuckDB for some aggregations:
```python
# Move data to DuckDB for additional processing
ddb_awards = player_awards.into_backend(ddb, "ddb_awards")

# Perform DuckDB-specific operations
ddb_summary = (
    ddb_awards.group_by(['yearID', 'awardID'])
    .agg([
        _.count().name('player_count'),
        _.G.mean().name('avg_games'),
        _.H.mean().name('avg_hits'),
    ])
    .order_by(['yearID', 'awardID'])
)

print("Award summary from DuckDB:")
ddb_summary.execute()
```
Award summary from DuckDB:
|   | yearID | awardID | player_count | avg_games | avg_hits |
|---|---|---|---|---|---|
| 0 | 2011 | ALCS MVP | 1 | 124.000000 | 125.000000 |
| 1 | 2011 | All-Star Game MVP | 1 | 162.000000 | 170.000000 |
| 2 | 2011 | Babe Ruth Award | 1 | 97.000000 | 99.000000 |
| 3 | 2011 | Branch Rickey Award | 1 | 132.000000 | 145.000000 |
| 4 | 2011 | Comeback Player of the Year | 2 | 151.500000 | 179.500000 |
| ... | ... | ... | ... | ... | ... |
| 80 | 2015 | Rookie of the Year | 2 | 125.000000 | 131.000000 |
| 81 | 2015 | Silver Slugger | 18 | 144.555556 | 160.722222 |
| 82 | 2015 | TSN Major League Player of the Year | 1 | 158.000000 | 184.000000 |
| 83 | 2015 | World Series MVP | 1 | 142.000000 | 138.000000 |
| 84 | 2015 | None | 1448 | 45.295580 | 25.419199 |

85 rows × 5 columns
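`into_backend()` composes at any point in a pipeline, so the DuckDB aggregation can itself be moved onward. A minimal sketch (the `"summary"` table name and the filter are illustrative):

```python
# Move the DuckDB summary back into the xorq backend for further work
summary_in_xorq = ddb_summary.into_backend(con, "summary")
summary_in_xorq.filter(_.player_count > 1).execute()
```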
## Caching Expressions
xorq provides caching capabilities to optimize performance:
```python
from pathlib import Path

from xorq.caching import ParquetStorage

# Create a storage for cached data
cache_storage = ParquetStorage(source=con, base_path=Path.cwd())

# Cache the results
cached_awards = player_awards.cache(storage=cache_storage)

# The next execution will use the cached data
cached_awards.execute()
```
|   | playerID | yearID | teamID | awardID | G | AB | H |
|---|---|---|---|---|---|---|---|
| 0 | berkmla01 | 2011 | SLN | Comeback Player of the Year | 145 | 488.0 | 147.0 |
| 1 | verlaju01 | 2011 | DET | Cy Young Award | 34 | 4.0 | 0.0 |
| 2 | buehrma01 | 2011 | CHA | Gold Glove | 31 | 2.0 | 1.0 |
| 3 | markani01 | 2011 | BAL | Gold Glove | 160 | 641.0 | 182.0 |
| 4 | ethiean01 | 2011 | LAN | Gold Glove | 135 | 487.0 | 142.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 7196 | yateski01 | 2015 | TBA | None | 20 | 0.0 | 0.0 |
| 7197 | ynoara01 | 2015 | COL | None | 72 | 127.0 | 33.0 |
| 7198 | youngch04 | 2015 | NYA | None | 140 | 318.0 | 80.0 |
| 7199 | zimmejo02 | 2015 | WAS | None | 33 | 63.0 | 10.0 |
| 7200 | zychto01 | 2015 | SEA | None | 13 | 0.0 | 0.0 |

7201 rows × 7 columns
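A cached expression behaves like any other, so downstream work reads from the cached Parquet data instead of re-running the Postgres join. A minimal sketch (the `award_counts` name and aggregation are illustrative):

```python
# Downstream aggregation reuses the cache rather than recomputing the join
award_counts = (
    cached_awards.group_by('awardID')
    .agg(n_players=_.playerID.nunique())
)

award_counts.execute()
```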
## Key Takeaways
- xorq provides a consistent interface across different data engines
- Operations are lazy until `.execute()` is called
- Data can be moved between backends using `into_backend()`
- Caching helps optimize performance for frequently used queries
- Different backends can be leveraged for their specific strengths