This tutorial will walk you through the key features of xorq, a data processing library that enables you to work seamlessly across multiple data engines.
Installation
First, install xorq using pip. We’ll include the duckdb extra to enable the DuckDB backend:
pip install 'xorq[duckdb]'
Setting up Connections
Let’s start by creating connections to different backends:
import xorq as xo
from xorq import _ # import the deferred accessor for column references
# Create connections to different backends
con = xo.connect() # xorq's main connection
ddb = xo.duckdb.connect() # DuckDB connection
pg = xo.postgres.connect_examples() # Postgres connection
In this section, we:
- Import xorq and its deferred _ object, which allows us to refer to columns
- Create a xorq connection (backed by the xorq backend)
- Create a DuckDB connection
- Create a Postgres connection
Note that you can also create a custom Postgres connection by specifying the connection parameters explicitly, for example:
pg = xo.postgres.connect(
host="localhost",
port=5432,
user="postgres",
password="postgres",
database="ibis_testing"
)
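Whichever way you connect, a quick sanity check helps. A minimal sketch, assuming the connection exposes the standard list_tables() method:
# Sanity-check the connection by listing the tables it can see
print(pg.list_tables())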
Reading Data
Now let’s read some data into xorq:
# Read a parquet file using xorq
path = xo.config.options.pins.get_path("batting")
batting = con.read_parquet(path, table_name="batting")
xorq can read data from various sources. Here we’re reading a Parquet file directly. The table_name parameter specifies how this table will be referenced inside the con backend.
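The other connections expose the same reader interface, so you can load a file into whichever engine will do the processing. A minimal sketch, assuming the DuckDB connection from above supports read_parquet as well (the table name batting_ddb is illustrative):
# Hypothetical: register the same Parquet file with the DuckDB backend
batting_ddb = ddb.read_parquet(path, table_name="batting_ddb")
print(batting_ddb.schema())  # inspect column types without executing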
Basic Operations
Let’s perform some basic data operations:
# Filtering and selection
recent_batting = (
batting[batting.yearID > 2010] # filter for recent years
.select(['playerID', 'yearID', 'teamID', 'G', 'AB', 'R', 'H']) # select specific columns
)
# Execute to see results
recent_batting.execute()
The output is:
playerID yearID teamID G AB R H
0 utleych01 2013 PHI 131 476.0 73.0 135.0
1 valaich01 2013 MIA 22 64.0 4.0 14.0
2 valbulu01 2013 CHN 108 331.0 34.0 72.0
3 valdejo02 2013 NYN 66 133.0 16.0 25.0
4 valdelu01 2013 BAL 4 0.0 0.0 0.0
... ... ... ... ... ... ... ...
7122 ugglada01 2013 ATL 136 448.0 60.0 80.0
7123 uptonbj01 2013 ATL 126 391.0 30.0 72.0
7124 uptonju01 2013 ATL 149 558.0 94.0 147.0
7125 uribeju01 2013 LAN 132 388.0 47.0 108.0
7126 urruthe01 2013 BAL 24 58.0 5.0 16.0
[7127 rows x 7 columns]
Note that xorq operations are lazy: they don’t execute until you call execute(), which returns a pandas DataFrame.
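You can see the laziness directly by building on an existing expression; a small sketch (the name top_hitters is illustrative):
# Building further expressions still does not touch the data
top_hitters = recent_batting.order_by(_.H.desc()).limit(5)
print(type(top_hitters))            # an expression, not a DataFrame
print(type(top_hitters.execute()))  # a pandas DataFrame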
Multi-Engine Operations
One of xorq’s powerful features is the ability to move data between different backends using into_backend(). This method converts an expression from one backend into a table in another backend, using a PyArrow RecordBatchReader as an intermediate format:
# Read a table from Postgres and move it to xorq's backend
awards = pg.table("awards_players").into_backend(con, "awards") # bring into xorq backend
# Perform a join between the two tables
player_awards = (
recent_batting.join(
awards,
['playerID', 'yearID'], # join keys
how='left' # left join
)
.select([
'playerID',
'yearID',
'teamID',
'awardID',
'G',
'AB',
'H'
])
)
player_awards.execute()
The corresponding output is:
playerID yearID teamID awardID G AB H
0 victosh01 2011 PHI Branch Rickey Award 132 519.0 145.0
1 aybarer01 2011 LAA Gold Glove 143 556.0 155.0
2 parrage01 2011 ARI Gold Glove 141 445.0 130.0
3 rodnefe01 2012 TBA Comeback Player of the Year 76 0.0 0.0
4 molinya01 2012 SLN Gold Glove 138 505.0 159.0
... ... ... ... ... ... ... ...
7196 sanchjo01 2013 PIT None 6 5.0 0.0
7197 saundmi01 2013 SEA None 132 406.0 96.0
7198 selleju01 2013 LAN None 27 69.0 13.0
7199 skipwky01 2013 MIA None 4 3.0 0.0
7200 sogarer01 2013 OAK None 130 368.0 98.0
[7201 rows x 7 columns]
into_backend() is particularly useful when you want to:
- Move data between different database engines
- Combine data from multiple sources
- Avoid writing intermediate results to disk
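For example, the same one-liner works toward any connected engine. A hedged sketch that brings the Postgres table straight into DuckDB instead (the table name awards_ddb is illustrative):
# Hypothetical: land the Postgres table in DuckDB rather than in con
awards_ddb = pg.table("awards_players").into_backend(ddb, "awards_ddb")
awards_ddb.limit(3).execute()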
Leveraging Different Backends
Different backends have different strengths. Let’s use DuckDB for some aggregations:
# Move data to DuckDB for additional processing
ddb_awards = player_awards.into_backend(ddb, "ddb_awards")
# Perform DuckDB-specific operations
ddb_summary = (
ddb_awards.group_by(['yearID', 'awardID'])
.agg([
_.count().name('player_count'),
_.G.mean().name('avg_games'),
_.H.mean().name('avg_hits'),
])
.order_by(['yearID', 'awardID'])
)
print("Award summary from DuckDB:")
ddb_summary.execute()
The output:
yearID awardID ... avg_games avg_hits
0 2011 ALCS MVP ... 124.000000 125.000000
1 2011 All-Star Game MVP ... 162.000000 170.000000
2 2011 Babe Ruth Award ... 97.000000 99.000000
3 2011 Branch Rickey Award ... 132.000000 145.000000
4 2011 Comeback Player of the Year ... 151.500000 179.500000
.. ... ... ... ... ...
80 2015 Rookie of the Year ... 125.000000 131.000000
81 2015 Silver Slugger ... 144.555556 160.722222
82 2015 TSN Major League Player of the Year ... 158.000000 184.000000
83 2015 World Series MVP ... 142.000000 138.000000
84 2015 None ... 45.295580 25.419199
[85 rows x 5 columns]
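Since ddb_summary is itself just another deferred expression, you can keep refining it before executing. A small sketch that drops the unawarded (None) groups; the name top_awards is illustrative:
# Hypothetical refinement: the five awards with the highest average hits
top_awards = (
    ddb_summary.filter(_.awardID.notnull())
    .order_by(_.avg_hits.desc())
    .limit(5)
)
top_awards.execute()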
Caching Expressions
xorq provides caching capabilities to optimize performance:
from pathlib import Path
from xorq.caching import ParquetStorage
# Create a storage for cached data
cache_storage = ParquetStorage(source=con, path=Path.cwd())
# Cache the results
cached_awards = player_awards.cache(storage=cache_storage)
# The next execution will use the cached data
cached_awards.execute()
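To convince yourself the cache is being used, one rough check is to time a repeat run; a sketch that assumes the cached Parquet written above is still valid:
import time
start = time.perf_counter()
cached_awards.execute()  # should now read from the Parquet cache
print(f"cached run took {time.perf_counter() - start:.2f}s")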
Key Takeaways
- xorq provides a consistent interface across different data engines
- Operations are lazy until .execute() is called
- Data can be moved between backends using into_backend()
- Caching helps optimize performance for frequently used queries
- Different backends can be leveraged for their specific strengths