agg.pyarrow

pyarrow(fn=None, name=None, signature=None, **kwargs)

Decorator for creating PyArrow-based aggregation functions.

This method creates high-performance aggregation UDFs that operate directly on PyArrow arrays using PyArrow compute functions. It’s ideal for numerical computations and operations that benefit from vectorized processing.

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| fn | callable | The aggregation function. Should accept PyArrow arrays and return a scalar or array result using PyArrow compute functions. | None |
| name | str | Name of the UDF. If None, uses the function name. | None |
| signature | Signature | Function signature specification for type checking. | None |
| `**kwargs` | | Additional configuration parameters, such as volatility settings. | {} |

Returns

| Name | Type | Description |
|------|------|-------------|
| | callable | A UDF decorator that can be applied to functions, or, if fn is provided, the wrapped UDF function. |
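
A minimal sketch of the two calling forms implied by this return value. Passing fn directly is documented above; the keyword form @agg.pyarrow(name=...) (calling the decorator factory with fn omitted) is an assumption based on that factory behavior:

>>> import pyarrow.compute as pc
>>> from xorq.expr.udf import agg
>>> import xorq.expr.datatypes as dt
>>> def mean_flipper(arr: dt.float64) -> dt.float64:
...     return pc.mean(arr)
>>> mean_flipper_udaf = agg.pyarrow(mean_flipper)  # fn provided: the wrapped UDF is returned
>>> @agg.pyarrow(name="flipper_mean")  # fn omitted: a decorator is returned (name= assumed supported)
... def named_mean_flipper(arr: dt.float64) -> dt.float64:
...     return pc.mean(arr)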

Examples

Creating a PyArrow aggregation for penguin bill measurements:

>>> import pyarrow.compute as pc
>>> from xorq.expr.udf import agg
>>> import xorq.expr.datatypes as dt
>>> import xorq as xo
>>> # Load penguins dataset
>>> penguins = xo.examples.penguins.fetch(backend=xo.connect())
>>> @agg.pyarrow
... def bill_length_range(arr: dt.float64) -> dt.float64:
...     return pc.subtract(pc.max(arr), pc.min(arr))
>>> # Calculate bill length range by species
>>> result = penguins.group_by("species").agg(
...     length_range=bill_length_range(penguins.bill_length_mm)
... ).execute()

Creating a weighted average for penguin measurements:

>>> @agg.pyarrow
... def weighted_avg_flipper(lengths: dt.float64, weights: dt.float64) -> dt.float64:
...     return pc.divide(
...         pc.sum(pc.multiply(lengths, weights)),
...         pc.sum(weights)
...     )
>>> # Weighted average flipper length by body mass
>>> result = penguins.group_by("species").agg(
...     weighted_flipper=weighted_avg_flipper(
...         penguins.flipper_length_mm,
...         penguins.body_mass_g
...     )
... ).execute()
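
The same UDAF can also be applied as a whole-table aggregate. This is a sketch assuming the table expression supports agg without a preceding group_by, as ibis-style tables do:

>>> overall = penguins.agg(
...     overall_length_range=bill_length_range(penguins.bill_length_mm)
... ).execute()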

Notes

  • PyArrow UDAFs typically offer the best performance for numerical operations
  • Functions receive PyArrow arrays and should use PyArrow compute functions
  • The return type should be compatible with PyArrow scalar types
  • Use this for high-performance aggregations on large datasets

See Also

agg.pandas_df : For pandas DataFrame-based aggregations
agg.builtin : For database-native aggregate functions