ML functions

Helper functions and classes for machine learning.

train_test_splits

train_test_splits(
    table,
    unique_key,
    test_sizes,
    num_buckets=10000,
    random_seed=None,
)

Generates multiple train/test splits of an Ibis table for different test sizes.

This function splits an Ibis table into multiple subsets based on a unique key or combination of keys and a list of test sizes. It uses a hashing function to convert the unique key into an integer, then applies a modulo operation to split the data into buckets. Each subset of data is defined by a range of buckets determined by the cumulative sum of the test sizes.
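The hash-and-bucket mechanism described above can be illustrated without xorq. The sketch below is a hypothetical, plain-Python rendition of the idea (not xorq's implementation): each key is hashed to an integer, reduced modulo `num_buckets`, and each split claims the bucket range given by the cumulative sum of the test sizes.

```python
# Sketch of deterministic hash-bucket splitting (illustrative only,
# not xorq's actual code). The seed is mixed into the hashed value so
# different seeds yield different, but still reproducible, splits.
import hashlib
from itertools import accumulate


def bucket(key, num_buckets=10000, random_seed=0):
    # Hash the seed-prefixed key to an integer, then bin it.
    digest = hashlib.md5(f"{random_seed}-{key}".encode()).hexdigest()
    return int(digest, 16) % num_buckets


def split_rows(rows, key_fn, test_sizes, num_buckets=10000, random_seed=0):
    # Cumulative bucket boundaries, e.g. [0.2, 0.3, 0.5] -> [2000, 5000, 10000].
    bounds = [round(c * num_buckets) for c in accumulate(test_sizes)]
    splits = [[] for _ in test_sizes]
    for row in rows:
        b = bucket(key_fn(row), num_buckets, random_seed)
        # A row belongs to the first split whose upper bound exceeds its bucket.
        idx = next(i for i, hi in enumerate(bounds) if b < hi)
        splits[idx].append(row)
    return splits


rows = list(range(100))
train, test = split_rows(rows, str, [0.8, 0.2], num_buckets=10, random_seed=42)
assert len(train) + len(test) == 100  # mutually exclusive and exhaustive
```

Because membership depends only on the hashed key, the same row always lands in the same split for a given seed, which is what makes the splits reproducible across runs.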

Parameters

Name Type Description Default
table ir.Table The input Ibis table to be split. required
unique_key str | tuple[str] | list[str] The column name(s) that uniquely identify each row in the table. This unique_key is used to create a deterministic split of the dataset through a hashing process. required
test_sizes Iterable[float] | float An iterable of floats representing the desired proportions for data splits. Each value should be between 0 and 1, and their sum must equal 1. The order of test sizes determines the order of the generated subsets. If a single float is passed, it is treated as the test size and a traditional train/test split of (1 - test_size, test_size) is returned. required
num_buckets int The number of buckets into which the data can be binned after being hashed (default is 10000). It controls how finely the data is divided during the split process. Adjusting num_buckets can affect the granularity and efficiency of the splitting operation, balancing between accuracy and computational efficiency. 10000
random_seed int | None Seed for the random number generator. If provided, ensures reproducibility of the split (default is None). None

Returns

Name Type Description
Iterator[ir.Table] An iterator yielding Ibis table expressions, each representing a mutually exclusive subset of the original table based on the specified test sizes.

Raises

Name Type Description
ValueError If any value in test_sizes is not between 0 and 1. If test_sizes does not sum to 1. If num_buckets is not an integer greater than 1.

Examples

>>> import xorq as ls
>>> table = ls.memtable({"key": range(100), "value": range(100, 200)})
>>> unique_key = "key"
>>> test_sizes = [0.2, 0.3, 0.5]
>>> splits = ls.train_test_splits(table, unique_key, test_sizes, num_buckets=10, random_seed=42)
>>> for i, split_table in enumerate(splits):
...     print(f"Split {i+1} size: {split_table.count().execute()}")
...     print(split_table.execute())
Split 1 size: 29
    key  value
0     0    100
1     4    104
2     5    105
3     7    107
4     8    108
5    10    110
6    16    116
7    20    120
8    21    121
9    23    123
10   24    124
11   29    129
12   45    145
13   49    149
14   50    150
15   54    154
16   59    159
17   63    163
18   64    164
19   68    168
20   75    175
21   76    176
22   85    185
23   86    186
24   88    188
25   89    189
26   90    190
27   93    193
28   98    198
Split 2 size: 23
    key  value
0     1    101
1     2    102
2     3    103
3    13    113
4    14    114
5    19    119
6    22    122
7    31    131
8    32    132
9    47    147
10   53    153
11   56    156
12   57    157
13   58    158
14   62    162
15   65    165
16   66    166
17   67    167
18   70    170
19   71    171
20   80    180
21   92    192
22   96    196
Split 3 size: 48
    key  value
0     6    106
1     9    109
2    11    111
3    12    112
4    15    115
5    17    117
6    18    118
7    25    125
8    26    126
9    27    127
10   28    128
11   30    130
12   33    133
13   34    134
14   35    135
15   36    136
16   37    137
17   38    138
18   39    139
19   40    140
20   41    141
21   42    142
22   43    143
23   44    144
24   46    146
25   48    148
26   51    151
27   52    152
28   55    155
29   60    160
30   61    161
31   69    169
32   72    172
33   73    173
34   74    174
35   77    177
36   78    178
37   79    179
38   81    181
39   82    182
40   83    183
41   84    184
42   87    187
43   91    191
44   94    194
45   95    195
46   97    197
47   99    199

Step

Step()

A single step in a machine learning pipeline that wraps a scikit-learn estimator.

This class represents an individual processing step that can either transform data (transformers like StandardScaler, SelectKBest) or make predictions (classifiers like KNeighborsClassifier, LinearSVC). Steps can be combined into Pipeline objects to create complex ML workflows.

Parameters

Name Type Description Default
typ type The scikit-learn estimator class (must inherit from BaseEstimator). required
name str A unique name for this step. If None, generates a name from the class name and ID. required
params_tuple tuple Tuple of (parameter_name, parameter_value) pairs for the estimator. Parameters are automatically sorted for consistency. required

Attributes

Name Type Description
typ type The scikit-learn estimator class.
name str The unique name for this step in the pipeline.
params_tuple tuple Sorted tuple of parameter key-value pairs.

Examples

Create a scaler step:

>>> from xorq.ml import Step
>>> from sklearn.preprocessing import StandardScaler
>>> scaler_step = Step(typ=StandardScaler, name="scaler")
>>> scaler_step.instance
StandardScaler()

Create a classifier step with parameters:

>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn_step = Step(
...     typ=KNeighborsClassifier,
...     name="knn",
...     params_tuple=(("n_neighbors", 5), ("weights", "uniform"))
... )
>>> knn_step.instance
KNeighborsClassifier()

Notes

  • The Step class is frozen (immutable) using attrs.
  • All estimators must inherit from sklearn.base.BaseEstimator.
  • Parameter tuples are automatically sorted for hash consistency.
  • Steps can be fitted to data using the fit() method which returns a FittedStep.
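The note on parameter sorting can be shown in plain Python: sorting the (name, value) pairs means two Steps built with the same parameters in a different order produce identical tuples, and therefore identical hashes. This is a sketch of the idea, not xorq's code.

```python
# Sketch: sorting parameter pairs makes the tuple order-independent,
# so equal parameter sets always hash the same way.
def normalize(params):
    return tuple(sorted(params.items()))


a = normalize({"weights": "uniform", "n_neighbors": 5})
b = normalize({"n_neighbors": 5, "weights": "uniform"})
assert a == b
assert hash(a) == hash(b)
```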

Methods

Name Description
fit Fit this step to the given expression data.
from_fit_predict Create a Step from custom fit and predict functions.
from_instance_name Create a Step from an existing scikit-learn estimator instance.
from_name_instance Create a Step from a name and estimator instance.
set_params Create a new Step with updated parameters.

fit

fit(expr, features=None, target=None, storage=None, dest_col=None)

Fit this step to the given expression data.

Parameters

Name Type Description Default
expr Expr The xorq expression containing the training data. required
features tuple of str Column names to use as features. If None, infers from expr.columns. None
target str Target column name. Required for prediction steps. None
storage Storage Storage backend for caching fitted models. None
dest_col str Destination column name for transformed output. None

Returns

Name Type Description
FittedStep A fitted step that can transform or predict on new data.

from_fit_predict

from_fit_predict(fit, predict, return_type, klass_name=None, name=None)

Create a Step from custom fit and predict functions.

Parameters

Name Type Description Default
fit callable Function to fit the model. required
predict callable Function to make predictions. required
return_type DataType The return type for predictions. required
klass_name str Name for the generated estimator class. None
name str Name for the step. None

Returns

Name Type Description
Step A new Step with a dynamically created estimator type.
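Since the docstring says the returned Step carries a dynamically created estimator type, one way such a class can be synthesized from standalone fit and predict functions is with `type()`. The sketch below illustrates the pattern only; it is not xorq's implementation, and the model (predicting the mean of the target) is purely hypothetical.

```python
# Illustration of wrapping free fit/predict functions in a dynamically
# created estimator-like class (not xorq's actual from_fit_predict code).
def fit(X, y):
    # "Train" by remembering the mean of the target (toy model).
    return sum(y) / len(y)


def predict(model, X):
    # Predict the stored mean for every row.
    return [model] * len(X)


def make_estimator(fit, predict, klass_name="CustomEstimator"):
    def _fit(self, X, y):
        self.model_ = fit(X, y)
        return self

    def _predict(self, X):
        return predict(self.model_, X)

    # type() builds a new class whose methods delegate to the functions.
    return type(klass_name, (), {"fit": _fit, "predict": _predict})


Estimator = make_estimator(fit, predict)
est = Estimator().fit([[1], [2], [3]], [1.0, 2.0, 3.0])
assert est.predict([[4], [5]]) == [2.0, 2.0]
```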

from_instance_name

from_instance_name(instance, name=None)

Create a Step from an existing scikit-learn estimator instance.

Parameters

Name Type Description Default
instance object A scikit-learn estimator instance. required
name str Name for the step. If None, generates from instance class name. None

Returns

Name Type Description
Step A new Step wrapping the estimator instance.

from_name_instance

from_name_instance(name, instance)

Create a Step from a name and estimator instance.

Parameters

Name Type Description Default
name str Name for the step. required
instance object A scikit-learn estimator instance. required

Returns

Name Type Description
Step A new Step wrapping the estimator instance.

set_params

set_params(**kwargs)

Create a new Step with updated parameters.

Parameters

Name Type Description Default
**kwargs Parameter names and values to update. {}

Returns

Name Type Description
Step A new Step instance with updated parameters.

Examples

>>> knn_step = Step(typ=KNeighborsClassifier, name="knn")
>>> updated_step = knn_step.set_params(n_neighbors=10, weights="distance")

Pipeline

Pipeline()

A machine learning pipeline that chains multiple processing steps together.

This class provides a xorq-native implementation that wraps scikit-learn pipelines, enabling deferred execution and integration with xorq expressions. The pipeline can contain both transform steps (data preprocessing) and a final prediction step.

Parameters

Name Type Description Default
steps tuple of Step Sequence of Step objects that make up the pipeline. required

Attributes

Name Type Description
steps tuple of Step The sequence of processing steps.
instance sklearn.pipeline.Pipeline The equivalent scikit-learn Pipeline instance.
transform_steps tuple of Step All steps except the final prediction step (if any).
predict_step Step or None The final step if it has a predict method, otherwise None.

Examples

Create a pipeline from scikit-learn estimators:

>>> from xorq.ml import Pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.neighbors import KNeighborsClassifier
>>> import sklearn.pipeline
>>> sklearn_pipeline = sklearn.pipeline.Pipeline([
...     ("scaler", StandardScaler()),
...     ("knn", KNeighborsClassifier(n_neighbors=5))
... ])
>>> xorq_pipeline = Pipeline.from_instance(sklearn_pipeline)

Fit and predict with xorq expressions:

>>> # Assuming train and test are xorq expressions
>>> fitted = xorq_pipeline.fit(train, features=("feature1", "feature2"), target="target")  # quartodoc: +SKIP
>>> predictions = fitted.predict(test)  # quartodoc: +SKIP

Update pipeline parameters:

>>> updated_pipeline = xorq_pipeline.set_params(knn__n_neighbors=10)  

Notes

  • The Pipeline class is frozen (immutable) using attrs.
  • Pipelines automatically detect transform vs predict steps based on method availability.
  • The fit() method returns a FittedPipeline that can transform and predict on new data.
  • Parameter updates use sklearn’s parameter naming convention (step__parameter).

Methods

Name Description
fit Fit the pipeline to training data.
from_instance Create a Pipeline from an existing scikit-learn Pipeline.

fit

fit(expr, features=None, target=None)

Fit the pipeline to training data.

This method sequentially fits each step in the pipeline, using the output of each transform step as input to the next step.

Parameters

Name Type Description Default
expr Expr The xorq expression containing training data. required
features tuple of str Column names to use as features. If None, infers from expr columns excluding the target. None
target str Target column name. Required if pipeline has a prediction step. None

Returns

Name Type Description
FittedPipeline A fitted pipeline that can transform and predict on new data.

Raises

Name Type Description
ValueError If target is not provided but pipeline has a prediction step.

Examples

>>> fitted = pipeline.fit(
...     train_data,
...     features=("sepal_length", "sepal_width"),
...     target="species"
... )  # quartodoc: +SKIP

from_instance

from_instance(instance)

Create a Pipeline from an existing scikit-learn Pipeline.

Parameters

Name Type Description Default
instance sklearn.pipeline.Pipeline A fitted or unfitted scikit-learn pipeline. required

Returns

Name Type Description
Pipeline A new xorq Pipeline wrapping the scikit-learn pipeline.

Examples

>>> import sklearn.pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import SVC
>>> sklearn_pipe = sklearn.pipeline.Pipeline([
...     ("scaler", StandardScaler()),
...     ("svc", SVC())
... ])
>>> xorq_pipe = Pipeline.from_instance(sklearn_pipe)