>>> import pandas as pd
>>> from xorq.expr.udf import make_pandas_udf
>>> import xorq.expr.datatypes as dt
>>> import xorq as xo
make_pandas_udf
xorq.expr.udf.make_pandas_udf(
    fn,
    schema,
    return_type,
    database=None,
    catalog=None,
    name=None,
    **kwargs,
)
Create a scalar User-Defined Function (UDF) that operates on pandas DataFrames.
This function creates a scalar UDF: it yields one output value per input row, while your function itself runs on whole batches, with PyArrow arrays converted to pandas DataFrames for processing. It’s ideal for operations that benefit from pandas’ rich functionality and are easier to express as DataFrame operations.
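In practice this means your function receives a pandas DataFrame whose columns are the schema keys and should return something pandas can align row-wise. A minimal sketch of that contract (the function and its columns are illustrative, not part of the xorq API):
>>> def row_total(df):
...     # df is a pandas DataFrame with one column per schema key;
...     # return a pandas Series with one value per input row
...     return df.sum(axis=1)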
Parameters
Name | Type | Description | Default |
---|---|---|---|
fn | callable | The function to be executed. Should accept a pandas DataFrame and return a pandas Series or scalar value. | required |
schema | Schema | The input schema defining column names and their data types. | required |
return_type | DataType | The return data type of the UDF. | required |
database | str | Database name for the UDF namespace. | None |
catalog | str | Catalog name for the UDF namespace. | None |
name | str | Name of the UDF. If None, generates a name from the function. | None |
**kwargs | | Additional configuration parameters (e.g., volatility settings). | {} |
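Extra keyword arguments are forwarded to the underlying UDF registration. As a hypothetical sketch of the volatility hint the table alludes to (the `volatility` keyword name and the `my_fn`/`my_schema` values are assumptions for illustration, not documented xorq parameters; check your backend's options):
>>> udf = make_pandas_udf(
...     fn=my_fn,                # hypothetical function defined elsewhere
...     schema=my_schema,        # hypothetical input schema
...     return_type=dt.float64,
...     volatility="immutable",  # hypothetical kwarg for a deterministic function
... )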
Returns
Type | Description |
---|---|
callable | A UDF constructor that can be used in expressions via its .on_expr() method. |
Examples
Creating a UDF that calculates penguin bill ratio:
>>> # Load penguins dataset
>>> penguins = xo.examples.penguins.fetch(backend=xo.connect())
>>> # Define the function
>>> def bill_ratio(df):
...     return df['bill_length_mm'] / df['bill_depth_mm']
>>> # Create UDF
>>> schema = penguins.select(['bill_length_mm', 'bill_depth_mm']).schema()
>>> bill_ratio_udf = make_pandas_udf(
...     fn=bill_ratio,
...     schema=schema,
...     return_type=dt.float64,
...     name="bill_ratio"
... )
>>> # Apply to table
>>> result = penguins.mutate(
...     bill_ratio=bill_ratio_udf.on_expr(penguins)
... ).execute()
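The expression built by .on_expr() is an ordinary column expression, so the same UDF can be used outside of mutate, for instance in a filter (a sketch reusing bill_ratio_udf from above):
>>> # Keep rows whose bill is more than three times as long as it is deep
>>> long_billed = penguins.filter(
...     bill_ratio_udf.on_expr(penguins) > 3
... ).execute()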
Creating a UDF for penguin size classification:
>>> def classify_penguin_size(df):
...     def size_category(row):
...         mass = row['body_mass_g']
...         flipper = row['flipper_length_mm']
...         if pd.isna(mass) or pd.isna(flipper):
...             return 'Unknown'
...         # Simple size classification based on body mass and flipper length
...         if mass > 4500 and flipper > 210:
...             return 'Large'
...         elif mass < 3500 and flipper < 190:
...             return 'Small'
...         else:
...             return 'Medium'
...     return df.apply(size_category, axis=1)
>>> size_schema = penguins.select(['body_mass_g', 'flipper_length_mm']).schema()
>>> size_udf = make_pandas_udf(
...     fn=classify_penguin_size,
...     schema=size_schema,
...     return_type=dt.string,
...     name="classify_size"
... )
>>> # Apply size classification
>>> result = penguins.mutate(
...     size_category=size_udf.on_expr(penguins)
... ).execute()
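After .execute() the result is a plain pandas DataFrame, so the derived column can be inspected with ordinary pandas calls:
>>> # Tally the derived categories with plain pandas
>>> counts = result['size_category'].value_counts()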
Creating a UDF for complex penguin feature engineering:
>>> def penguin_features(df):
...     # Create multiple derived features
...     features = pd.DataFrame(index=df.index)
...     # Bill area
...     features['bill_area'] = df['bill_length_mm'] * df['bill_depth_mm']
...     # Body condition index
...     features['body_condition'] = df['body_mass_g'] / (df['flipper_length_mm'] ** 2)
...     # Aspect ratio of bill
...     features['bill_aspect_ratio'] = df['bill_length_mm'] / df['bill_depth_mm']
...     # Return as a concatenated string for this example
...     return features.apply(
...         lambda row: f"area:{row['bill_area']:.1f}_bci:{row['body_condition']:.4f}_ratio:{row['bill_aspect_ratio']:.2f}",
...         axis=1,
...     )
>>> all_measurements = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
>>> features_schema = penguins.select(all_measurements).schema()
>>> features_udf = make_pandas_udf(
...     fn=penguin_features,
...     schema=features_schema,
...     return_type=dt.string,
...     name="penguin_features"
... )
>>> # Apply feature engineering
>>> result = penguins.mutate(
...     derived_features=features_udf.on_expr(penguins)
... ).execute()
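Since this example packs several features into one string, they can be split back out with plain pandas after execution; in practice you might instead define one UDF per derived column:
>>> # Recover the three packed fields as separate columns
>>> parts = result['derived_features'].str.split('_', expand=True)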
Notes
- The function receives a pandas DataFrame where columns correspond to the schema keys
- The function should return a pandas Series or scalar value compatible with return_type
- PyArrow arrays are automatically converted to pandas and back for seamless integration
- Use this when you need pandas-specific functionality like string operations, datetime handling, or complex data manipulations (see the sketch below)
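To illustrate the last point, a sketch of a UDF built on pandas’ vectorized string methods (it reuses the penguins table from the examples; the function and UDF names are illustrative):
>>> def species_label(df):
...     # Pandas-specific string handling: strip whitespace, then uppercase
...     return df['species'].str.strip().str.upper()
>>> species_schema = penguins.select(['species']).schema()
>>> species_udf = make_pandas_udf(
...     fn=species_label,
...     schema=species_schema,
...     return_type=dt.string,
...     name="species_label"
... )
>>> labeled = penguins.mutate(label=species_udf.on_expr(penguins)).execute()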
See Also
scalar : For PyArrow-based scalar UDFs with potentially better performance
make_pandas_expr_udf : For UDFs that need pre-computed values
agg : For aggregation functions