Build

The Xorq CLI provides the build command to compile Xorq expressions into reusable artifacts that can be executed later. This separation of definition from execution is what makes pipelines reproducible and portable.

Prerequisites

Before starting, make sure you have Xorq installed:

pip install xorq
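
To confirm the installation succeeded, you can print the installed version; this one-liner uses only the Python standard library:

python -c "import importlib.metadata; print(importlib.metadata.version('xorq'))"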

Building Expressions

The build command loads a Python script, finds the named expression variable, and serializes it into a set of artifacts on disk that can be executed later.

Basic Usage

The basic syntax for the build command is:

xorq build <script_path> -e <expression_name> --builds-dir <output_directory>

Where:

- <script_path> is the path to the Python script containing the expression
- <expression_name> is the name of the variable holding the Xorq expression
- --builds-dir is the directory where artifacts will be written (defaults to “builds”)
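
For example, given a minimal script (a hypothetical file name, reusing the examples API that appears later in this guide):

# my_pipeline.py
import xorq as xo

penguins = xo.examples.penguins.fetch(backend=xo.pandas.connect())
expr = penguins.filter(penguins.body_mass_g > 4000)

you would build the expr variable with:

xorq build my_pipeline.py -e expr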

Example: Penguins Data Processing

Let’s walk through a complete example using the penguins dataset. First, we’ll write a script that defines a Xorq expression with a UDXF (user-defined exchange function) for data processing:

# penguins_analysis.py

import pandas as pd
import xorq as xo

def analyze_penguin_health(df: pd.DataFrame) -> pd.DataFrame:
    """
    Analyze penguin health based on body mass and bill dimensions.
    This function demonstrates a typical ML preprocessing pipeline.
    """
    def calculate_health_score(row):
        if pd.isna(row['body_mass_g']) or pd.isna(row['bill_length_mm']):
            return None
        
        # Simple health score based on mass-to-bill ratio
        mass_score = min(row['body_mass_g'] / 1000, 10)  # Scale to 0-10
        bill_score = min(row['bill_length_mm'] / 10, 10)  # Scale to 0-10
        return (mass_score + bill_score) / 2
    
    def categorize_size(mass):
        if pd.isna(mass):
            return "unknown"
        elif mass < 3500:
            return "small"
        elif mass < 4500:
            return "medium" 
        else:
            return "large"
    
    return df.assign(
        health_score=df.apply(calculate_health_score, axis=1),
        size_category=df['body_mass_g'].apply(categorize_size),
        processed_at=pd.Timestamp.now()
    )

# Load penguins data
pandas_con = xo.pandas.connect()
penguins_data = xo.examples.penguins.fetch(backend=pandas_con)

# Filter out rows with missing critical values
filtered_penguins = penguins_data.filter(
    penguins_data.bill_length_mm.notnull(),
    penguins_data.bill_depth_mm.notnull(),
    penguins_data.flipper_length_mm.notnull(),
    penguins_data.body_mass_g.notnull(),
)

# Define schemas for the UDXF
input_schema = xo.schema({
    "species": str,
    "island": str,
    "bill_length_mm": float,
    "bill_depth_mm": float,
    "flipper_length_mm": float,
    "body_mass_g": float,
    "sex": str,
    "year": int
})

output_schema = xo.schema({
    "species": str,
    "island": str,
    "bill_length_mm": float,
    "bill_depth_mm": float,
    "flipper_length_mm": float,
    "body_mass_g": float,
    "sex": str,
    "year": int,
    "health_score": float,
    "size_category": str,
    "processed_at": "timestamp"
})

# Create the UDXF expression - this is what gets built
expr = xo.expr.relations.flight_udxf(
    filtered_penguins,
    process_df=analyze_penguin_health,
    maybe_schema_in=input_schema,
    maybe_schema_out=output_schema,
    con=xo.connect(),
    make_udxf_kwargs={
        "name": "analyze_penguin_health",
        "command": "penguin_health_analyzer"
    }
)
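
Because analyze_penguin_health is ordinary pandas code, you can sanity-check it on a tiny DataFrame before building. A minimal sketch (the row values are illustrative; note that importing the module also runs its top-level pipeline code):

import pandas as pd
from penguins_analysis import analyze_penguin_health

sample = pd.DataFrame({
    "species": ["Adelie"],
    "island": ["Torgersen"],
    "bill_length_mm": [39.1],
    "bill_depth_mm": [18.7],
    "flipper_length_mm": [181.0],
    "body_mass_g": [3750.0],
    "sex": ["male"],
    "year": [2007],
})

# Expect a health_score around 3.8 and a "medium" size category
print(analyze_penguin_health(sample)[["health_score", "size_category"]])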

Now, let’s build this expression using the CLI:

xorq build penguins_analysis.py -e expr --builds-dir builds

This command will:

1. Load the penguins_analysis.py script
2. Find the expr variable containing the UDXF expression
3. Generate artifacts based on the expression
4. Save them to the builds directory

You should see output similar to:

Building expr from penguins_analysis.py
Written 'expr' to builds/f02d28198715

Understanding Build Artifacts

After building, you’ll find several artifacts in the hash-named directory reported by the CLI:

builds/f02d28198715/
├── expr.yaml          # YAML expression definition
├── metadata.json      # Build metadata and checksums
├── deferred_reads.yaml # Data source configurations
└── profiles.yaml      # Backend connection profiles

expr.yaml contains the complete expression definition:

# Example snippet from expr.yaml
expression:
  op: RemoteTable
  table: ibis_rbr-placeholder_row5rfsuc5dino
  schema_ref: schema_0
  profile: feda6956a9ca4d2bda0fbc8e775042c3_3
  remote_expr:
    op: FlightUDXF
    name: ibis_rbr-placeholder_xcbivckvizcild
    schema_ref: schema_0
    input_expr:
      node_ref: fa2a3654153da35ad09d4a9c84ea14c8
    udxf: gAWVfRoAAAAAAACMF2Nsb3VkcGlja2xlLmNsb3
    make_server: gAWV7wMAAAAAAACMF2Nsb3VkcGlja2x
    make_connection: gAWVFAAAAAAAAACMBHhvcnGUjAd
    do_instrument_reader: false

deferred_reads.yaml tracks data sources:

reads:
  penguins-8e3342f56fda839fd4878058ca4394f9:
    engine: pandas
    profile_name: 08f39a9ca2742d208a09d0ee9c7756c0_1
    relations:
    - penguins-8e3342f56fda839fd4878058ca4394f9
    options:
      method_name: read_parquet
      name: penguins
      read_kwargs:
      - source: path/to/penguins
      - table_name: penguins
    sql_file: f7bbe97f6c5e.sql
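
With these artifacts in place, the expression can be executed later without the original script. Assuming your Xorq version ships the run command (check xorq --help), execution looks like:

xorq run builds/f02d28198715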

Advanced Build Options

Custom Build Directory

Specify a custom location for build artifacts:

xorq build penguins_analysis.py -e expr --builds-dir /path/to/custom/builds

Cache Directory

Specify where intermediate cache files should be stored:

xorq build penguins_analysis.py -e expr --cache-dir /path/to/cache
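
The cache directory can be combined with the other options in a single invocation:

xorq build penguins_analysis.py -e expr --builds-dir builds --cache-dir /path/to/cache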

Multiple Expressions

You can build different expressions from the same script:

# In penguins_analysis.py, add another expression:

# Simple aggregation expression
summary_expr = filtered_penguins.group_by("species").agg(
    avg_mass=filtered_penguins.body_mass_g.mean(),
    count=filtered_penguins.count()
)

Build each expression separately:

xorq build penguins_analysis.py -e expr --builds-dir builds
xorq build penguins_analysis.py -e summary_expr --builds-dir builds
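
Since each build lands in its own hash-named directory, a quick way to inventory them is to read each build’s metadata.json. A minimal sketch, assuming only that the file is plain JSON as described above:

import json
from pathlib import Path

for build_dir in Path("builds").iterdir():
    meta_file = build_dir / "metadata.json"
    if meta_file.exists():
        meta = json.loads(meta_file.read_text())
        print(build_dir.name, sorted(meta.keys()))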

The build step separates the definition of your data transformation from its execution, enabling reproducible and portable data pipelines.