# penguins_analysis.py
import pandas as pd
import xorq as xo


def analyze_penguin_health(df: pd.DataFrame) -> pd.DataFrame:
    """
    Analyze penguin health based on body mass and bill dimensions.
    This function demonstrates a typical ML preprocessing pipeline.
    """
    def calculate_health_score(row):
        if pd.isna(row['body_mass_g']) or pd.isna(row['bill_length_mm']):
            return None
        # Simple health score based on mass-to-bill ratio
        mass_score = min(row['body_mass_g'] / 1000, 10)   # Scale to 0-10
        bill_score = min(row['bill_length_mm'] / 10, 10)  # Scale to 0-10
        return (mass_score + bill_score) / 2

    def categorize_size(mass):
        if pd.isna(mass):
            return "unknown"
        elif mass < 3500:
            return "small"
        elif mass < 4500:
            return "medium"
        else:
            return "large"

    return df.assign(
        health_score=df.apply(calculate_health_score, axis=1),
        size_category=df['body_mass_g'].apply(categorize_size),
        processed_at=pd.Timestamp.now()
    )


# Load penguins data
pandas_con = xo.pandas.connect()
penguins_data = xo.examples.penguins.fetch(backend=pandas_con)

# Filter out rows with missing critical values
filtered_penguins = penguins_data.filter(
    penguins_data.bill_length_mm.notnull(),
    penguins_data.bill_depth_mm.notnull(),
    penguins_data.flipper_length_mm.notnull(),
    penguins_data.body_mass_g.notnull(),
)

# Define schemas for the UDXF
input_schema = xo.schema({
    "species": str,
    "island": str,
    "bill_length_mm": float,
    "bill_depth_mm": float,
    "flipper_length_mm": float,
    "body_mass_g": float,
    "sex": str,
    "year": int,
})

output_schema = xo.schema({
    "species": str,
    "island": str,
    "bill_length_mm": float,
    "bill_depth_mm": float,
    "flipper_length_mm": float,
    "body_mass_g": float,
    "sex": str,
    "year": int,
    "health_score": float,
    "size_category": str,
    "processed_at": "timestamp",
})

# Create the UDXF expression - this is what gets built
expr = xo.expr.relations.flight_udxf(
    filtered_penguins,
    process_df=analyze_penguin_health,
    maybe_schema_in=input_schema,
    maybe_schema_out=output_schema,
    con=xo.connect(),
    make_udxf_kwargs={
        "name": "analyze_penguin_health",
        "command": "penguin_health_analyzer",
    },
)
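Before building, it can be useful to sanity-check the expression locally. The sketch below assumes the definitions above are in scope and that xorq expressions expose the Ibis-style .execute() method (xorq builds on Ibis); confirm against your installed version:
# Optional sanity check: materialize the pipeline locally before building
result = expr.execute()
print(result[["species", "health_score", "size_category"]].head())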
Build
The Xorq CLI provides the build command to compile Xorq expressions into reusable artifacts that can be executed later. This separation enables:
- Replicability: Define transformations once and execute them consistently across different environments with guaranteed identical results.
- Serialization: Store complex queries as compiled artifacts that can be shared, versioned, and executed without the original code.
- Performance optimization: Pre-compile expressions to avoid repeated parsing and optimization at runtime.
Prerequisites
Before starting, make sure you have Xorq installed:
pip install xorq
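To confirm the package is installed, standard pip tooling works:
pip show xorq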
Building Expressions
The build command compiles a Xorq expression into a reusable artifact that can be executed later.
Basic Usage
The basic syntax for the build command is:
xorq build <script_path> -e <expression_name> --builds-dir <output_directory>
Where:
- <script_path> is the path to your Python script containing the expression
- <expression_name> is the name of the variable holding the xorq expression
- <builds-dir> is where the artifacts will be generated (defaults to "builds")
Example: Penguins Data Processing
Let's walk through a complete example using the penguins dataset. The penguins_analysis.py script shown at the top of this page defines a Xorq expression with a UDXF for data processing.
Now, let’s build this expression using the CLI:
xorq build penguins_analysis.py -e expr --builds-dir builds
This command will:
1. Load the penguins_analysis.py script
2. Find the expr variable containing the UDXF expression
3. Generate artifacts based on the expression
4. Save them to the builds directory
You should see output similar to:
Building expr from penguins_analysis.py
Written 'expr' to builds/f02d28198715
Understanding Build Artifacts
After building, you’ll find several artifacts in the build directory:
builds/f02d28198715/
├── expr.yaml # YAML expression definition
├── metadata.json # Build metadata and checksums
├── deferred_reads.yaml # Data source configurations
└── profiles.yaml # Backend connection profiles
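These artifacts are plain text (YAML, JSON, and SQL), so you can inspect or diff them like any other versioned file, for example:
cat builds/f02d28198715/metadata.json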
expr.yaml contains the complete expression definition:
# Example snippet from expr.yaml
expression:
  op: RemoteTable
  table: ibis_rbr-placeholder_row5rfsuc5dino
  schema_ref: schema_0
  profile: feda6956a9ca4d2bda0fbc8e775042c3_3
  remote_expr:
    op: FlightUDXF
    name: ibis_rbr-placeholder_xcbivckvizcild
    schema_ref: schema_0
    input_expr:
      node_ref: fa2a3654153da35ad09d4a9c84ea14c8
    udxf: gAWVfRoAAAAAAACMF2Nsb3VkcGlja2xlLmNsb3
    make_server: gAWV7wMAAAAAAACMF2Nsb3VkcGlja2x
    make_connection: gAWVFAAAAAAAAACMBHhvcnGUjAd
    do_instrument_reader: false
deferred_reads.yaml tracks data sources:
reads:
  penguins-8e3342f56fda839fd4878058ca4394f9:
    engine: pandas
    profile_name: 08f39a9ca2742d208a09d0ee9c7756c0_1
    relations:
    - penguins-8e3342f56fda839fd4878058ca4394f9
    options:
      method_name: read_parquet
      name: penguins
      read_kwargs:
      - source: path/to/penguins
      - table_name: penguins
    sql_file: f7bbe97f6c5e.sql
Advanced Build Options
Custom Build Directory
Specify a custom location for build artifacts:
xorq build penguins_analysis.py -e expr --builds-dir /path/to/custom/builds
Cache Directory
Specify where intermediate cache files should be stored:
xorq build penguins_analysis.py -e expr --cache-dir /path/to/cache
Multiple Expressions
You can build different expressions from the same script:
# In penguins_analysis.py, add another expression:
# Simple aggregation expression
summary_expr = filtered_penguins.group_by("species").agg(
    avg_mass=filtered_penguins.body_mass_g.mean(),
    count=filtered_penguins.count()
)
Build each expression separately:
xorq build penguins_analysis.py -e expr --builds-dir builds
xorq build penguins_analysis.py -e summary_expr --builds-dir builds
The build step separates the definition of your data transformation from its execution, enabling reproducible and portable data pipelines.
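To close the loop, a compiled artifact can be executed without the original script. As an assumption to verify against your installed version (see xorq --help), recent xorq releases expose a run subcommand that takes a build path:
xorq run builds/f02d28198715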