>>> import pyarrow.compute as pc
>>> from xorq.expr.udf import agg
>>> import xorq.expr.datatypes as dt
>>> import xorq as xo
agg.pyarrow

pyarrow(fn=None, name=None, signature=None, **kwargs)
Decorator for creating PyArrow-based aggregation functions.
This method creates high-performance aggregation UDFs that operate directly on PyArrow arrays using PyArrow compute functions. It’s ideal for numerical computations and operations that benefit from vectorized processing.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| fn | callable | The aggregation function. Should accept PyArrow arrays and return a scalar or array result using PyArrow compute functions. | None |
| name | str | Name of the UDF. If None, uses the function name. | None |
| signature | Signature | Function signature specification for type checking. | None |
| **kwargs | | Additional configuration parameters, such as volatility settings. | {} |
Returns
| Name | Type | Description |
|---|---|---|
| | callable | A UDF decorator that can be applied to functions, or, if fn is provided, the wrapped UDF function. |
Examples
Creating a PyArrow aggregation for penguin bill measurements:
>>> # Load penguins dataset
>>> penguins = xo.examples.penguins.fetch(backend=xo.connect())
>>> @agg.pyarrow
... def bill_length_range(arr: dt.float64) -> dt.float64:
...     return pc.subtract(pc.max(arr), pc.min(arr))
>>> # Calculate bill length range by species
>>> result = penguins.group_by("species").agg(
...     length_range=bill_length_range(penguins.bill_length_mm)
... ).execute()
Creating a weighted average for penguin measurements:
>>> @agg.pyarrow
... def weighted_avg_flipper(lengths: dt.float64, weights: dt.float64) -> dt.float64:
...     return pc.divide(
...         pc.sum(pc.multiply(lengths, weights)),
...         pc.sum(weights)
...     )
>>> # Weighted average flipper length by body mass
>>> result = penguins.group_by("species").agg(
...     weighted_flipper=weighted_avg_flipper(
...         penguins.flipper_length_mm,
...         penguins.body_mass_g
...     )
... ).execute()
Notes
- PyArrow UDAFs typically offer the best performance for numerical operations
- Functions receive PyArrow arrays and should use PyArrow compute functions
- The return type should be compatible with PyArrow scalar types
- Use this for high-performance aggregations on large datasets
See Also
agg.pandas_df : For pandas DataFrame-based aggregations
agg.builtin : For database-native aggregate functions