User-Defined Functions
User-defined functions (UDFs) in xorq streamline data pipelines by:
- Reducing pipeline complexity: UDFs allow you to embed sophisticated logic directly in your data processing workflow, eliminating the need for separate processing steps or microservices.
- Maintaining data locality: Process data where it resides without moving it between environments, reducing latency and resource usage.
- Enabling code reuse: Encapsulate complex logic in functions that can be used across multiple pipelines and projects.
- Simplifying ML workflows: Seamlessly integrate model training and inference within your data pipeline, reducing the complexity of MLOps.
Overview
xorq supports three types of user-defined functions (UDFs):
- Scalar UDFs: Process data row by row
- UDAFs: Aggregate functions that process groups of rows
- UDWFs: Window functions that operate over partitions and frames
All UDFs integrate with xorq’s execution engine for optimal performance.
Scalar UDFs
The simplest type: a scalar UDF processes one row at a time and returns one value per row.
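To make the row-at-a-time contract concrete, here is a minimal plain-Python sketch. It deliberately does not use xorq's API: the function `bill_ratio` and the sample rows are hypothetical, and in xorq the same logic would be registered through the library's own UDF helpers.

```python
# Conceptual sketch only: plain Python, not xorq's API.
# A scalar UDF maps each input row to exactly one output value.

def bill_ratio(length_mm: float, depth_mm: float) -> float:
    """Hypothetical scalar function: one row in, one value out."""
    return length_mm / depth_mm

# Hypothetical sample data, one dict per row.
rows = [
    {"length_mm": 40.0, "depth_mm": 20.0},
    {"length_mm": 45.0, "depth_mm": 15.0},
]

# The function is applied independently to every row,
# so the output has the same number of rows as the input.
ratios = [bill_ratio(r["length_mm"], r["depth_mm"]) for r in rows]
```

Because each row is handled independently, a function of this shape needs no state shared between rows.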
UDAFs (Aggregation Functions)
UDAFs process groups of rows and produce one aggregate value per group.
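The grouping behavior can be illustrated with a small plain-Python sketch. This is not xorq's API: the aggregate `value_range` and the sample rows are hypothetical and only show the "many rows in, one value out" shape of an aggregate function.

```python
# Conceptual sketch only: plain Python, not xorq's API.
from collections import defaultdict

def value_range(values):
    """Hypothetical aggregate: many rows in, one value out."""
    return max(values) - min(values)

# Hypothetical sample data: (group key, value).
rows = [("a", 1.0), ("a", 4.0), ("b", 10.0), ("b", 12.5)]

# Collect the values that belong to each group.
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# The aggregate is called once per group and returns a single value.
result = {key: value_range(vals) for key, vals in groups.items()}
```

The output has one entry per group rather than one per input row, which is the key difference from a scalar UDF.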
UDWFs (Window Functions)
UDWFs process partitions of data, taking the window's ordering and frame into account.
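As a rough illustration of partitions, ordering, and frames, the plain-Python sketch below computes a trailing mean over a two-row frame within each partition. It does not use xorq's API. The helper `trailing_mean`, the frame size, and the sample rows are all assumptions made for the example.

```python
# Conceptual sketch only: plain Python, not xorq's API.
from itertools import groupby
from operator import itemgetter

def trailing_mean(values, frame_size):
    """Hypothetical window function: mean over the current row and
    up to frame_size - 1 preceding rows."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - frame_size + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out

# Hypothetical sample data: (partition key, order key, value).
rows = [("a", 2, 20.0), ("a", 1, 10.0), ("b", 1, 5.0),
        ("a", 3, 60.0), ("b", 2, 7.0)]

# Sort by partition key, then by order key within each partition.
ordered = sorted(rows, key=itemgetter(0, 1))

result = []
for key, part in groupby(ordered, key=itemgetter(0)):
    part = list(part)
    means = trailing_mean([v for _, _, v in part], frame_size=2)
    # A window function returns one value per input row.
    result.extend((k, t, m) for (k, t, _), m in zip(part, means))
```

Unlike an aggregate, the window function keeps every input row; each row's value depends on its neighbors in the ordered partition.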
Expr Scalar UDF
Expr Scalar UDFs let you incorporate pre-computed values (such as trained models) into UDF execution, which is particularly useful for machine learning workflows. The next example trains an XGBoost model on data from the Lending Club.
This pattern enables an end-to-end ML workflow where:
- The model is trained once using aggregated data
- The trained model is serialized and passed to the prediction UDF
- Predictions are made in the query execution context without manual intervention
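The three steps above can be sketched in plain Python. This is a conceptual stand-in, not xorq's API and not an XGBoost model: the "model" is just a mean threshold, and the names `train`, `predict`, and the sample loan amounts are hypothetical. The point is only the shape of the flow: train once, serialize, then hand the serialized model to the prediction function.

```python
# Conceptual sketch only: plain Python, not xorq's API or XGBoost.
import pickle

def train(amounts):
    """Hypothetical training step: runs once over the training data
    and returns the fitted model as bytes."""
    threshold = sum(amounts) / len(amounts)
    return pickle.dumps({"threshold": threshold})

def predict(model_bytes, amounts):
    """Hypothetical prediction step: deserializes the model once,
    then scores each row."""
    model = pickle.loads(model_bytes)
    return [amount > model["threshold"] for amount in amounts]

# Hypothetical loan amounts used for training.
training_amounts = [1000.0, 2000.0, 9000.0]

# 1. Train once on the training data.
model_bytes = train(training_amounts)

# 2-3. The serialized model is passed to the prediction function,
# which scores new rows without any manual hand-off.
predictions = predict(model_bytes, [500.0, 5000.0])
```

Passing the model as serialized bytes keeps the prediction function free of hidden global state, so the model can travel together with the rest of the pipeline.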