>>> import pyarrow.compute as pc
>>> from xorq.expr.udf import agg
>>> import xorq.expr.datatypes as dt
>>> import xorq as xo
agg.pyarrow

pyarrow(fn=None, name=None, signature=None, **kwargs)
Decorator for creating PyArrow-based aggregation functions.
This method creates high-performance aggregation UDFs that operate directly on PyArrow arrays using PyArrow compute functions. It’s ideal for numerical computations and operations that benefit from vectorized processing.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| fn | callable | The aggregation function. Should accept PyArrow arrays and return a scalar or array result using PyArrow compute functions. | None |
| name | str | Name of the UDF. If None, uses the function name. | None |
| signature | Signature | Function signature specification for type checking. | None |
| **kwargs | | Additional configuration parameters, such as volatility settings. | {} |
Returns
| Name | Type | Description |
|---|---|---|
| | callable | A UDF decorator that can be applied to functions, or, if fn is provided, the wrapped UDF function. |
Examples
Creating a PyArrow aggregation for penguin bill measurements:
>>> # Load penguins dataset
>>> penguins = xo.examples.penguins.fetch(backend=xo.connect())
>>> @agg.pyarrow
... def bill_length_range(arr: dt.float64) -> dt.float64:
...     return pc.subtract(pc.max(arr), pc.min(arr))
>>> # Calculate bill length range by species
>>> result = penguins.group_by("species").agg(
...     length_range=bill_length_range(penguins.bill_length_mm)
... ).execute()
Creating a weighted average for penguin measurements:
>>> @agg.pyarrow
... def weighted_avg_flipper(lengths: dt.float64, weights: dt.float64) -> dt.float64:
...     return pc.divide(
...         pc.sum(pc.multiply(lengths, weights)),
...         pc.sum(weights)
...     )
>>> # Weighted average flipper length by body mass
>>> result = penguins.group_by("species").agg(
...     weighted_flipper=weighted_avg_flipper(
...         penguins.flipper_length_mm,
...         penguins.body_mass_g
...     )
... ).execute()
Notes
- PyArrow UDAFs typically offer the best performance for numerical operations
- Functions receive PyArrow arrays and should use PyArrow compute functions
- The return type should be compatible with PyArrow scalar types
- Use this for high-performance aggregations on large datasets
See Also
agg.pandas_df : For pandas DataFrame-based aggregations
agg.builtin : For database-native aggregate functions