Generates multiple train/test splits of an Ibis table for different test sizes.
This function splits an Ibis table into multiple subsets based on a unique key or combination of keys and a list of test sizes. It uses a hashing function to convert the unique key into an integer, then applies a modulo operation to split the data into buckets. Each subset of data is defined by a range of buckets determined by the cumulative sum of the test sizes.
Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| unique_key | str or iterable of str | The column name(s) that uniquely identify each row in the table. The unique key is hashed to produce a deterministic split of the dataset. | required |
| test_sizes | float or iterable of float | The desired proportions for the data splits. Each value must be between 0 and 1, and their sum must equal 1. The order of the test sizes determines the order of the generated subsets. If a single float is passed, it is treated as the test size and a traditional train/test split of (1 - test_size, test_size) is returned. | required |
| num_buckets | int | The number of buckets into which the hashed data is binned. It controls how finely the data is divided during the split, balancing granularity against computational efficiency. | 10000 |
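To make the splitting mechanics concrete, here is a minimal, self-contained sketch of the bucketing arithmetic described above. The names `bucket_of` and `split_bounds` are illustrative only and are not part of the library; the actual implementation performs the hashing and bucket comparison inside the backend expression.

```python
import hashlib
from itertools import accumulate

def bucket_of(key: str, num_buckets: int = 10000) -> int:
    # Hash the unique key deterministically, then map the integer to a bucket.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_buckets

def split_bounds(test_sizes, num_buckets: int = 10000):
    # Cumulative sums of the proportions define each subset's bucket range.
    edges = [0.0, *accumulate(test_sizes)]
    return [
        (round(lo * num_buckets), round(hi * num_buckets))
        for lo, hi in zip(edges, edges[1:])
    ]

# Three subsets: buckets [0, 7000), [7000, 8500), [8500, 10000)
print(split_bounds([0.7, 0.15, 0.15]))

# A row belongs to the subset whose bucket range contains its hashed key.
print(bucket_of("user-42"))
```

Because the bucket assignment depends only on the hashed key, the same row always lands in the same subset across runs, which is what makes the split deterministic.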
A single step in a machine learning pipeline that wraps a scikit-learn estimator.
This class represents an individual processing step that can either transform data (transformers like StandardScaler, SelectKBest) or make predictions (classifiers like KNeighborsClassifier, LinearSVC). Steps can be combined into Pipeline objects to create complex ML workflows.
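For example, a step wrapping a nearest-neighbours classifier might be declared as follows. This is a hypothetical sketch: the `xorq.expr.ml` import path and the bare `Step(KNeighborsClassifier)` constructor call are assumptions for illustration, not the documented signature.

```python
from sklearn.neighbors import KNeighborsClassifier

from xorq.expr.ml import Step  # hypothetical import path

# Declare a pipeline step that wraps the estimator class; the estimator's
# default parameters are used unless overridden.
knn_step = Step(KNeighborsClassifier)
```

The parameter listing that follows shows the defaults of a wrapped KNeighborsClassifier as rendered by its estimator representation.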
Parameters

| Parameter | Value |
|-----------|-------|
| n_neighbors | 5 |
| weights | 'uniform' |
| algorithm | 'auto' |
| leaf_size | 30 |
| p | 2 |
| metric | 'minkowski' |
| metric_params | None |
| n_jobs | None |
Notes
- The Step class is frozen (immutable) using attrs.
- All estimators must inherit from sklearn.base.BaseEstimator.
- Parameter tuples are automatically sorted for hash consistency.
- Steps can be fitted to data using the fit() method, which returns a FittedStep (see the sketch below).
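Fitting a single step against an expression might look like the following. This is a sketch under stated assumptions: the `features=` and `target=` keyword names, the `predict()` method on the resulting FittedStep, and `xo.memtable` behaving like `ibis.memtable` are not confirmed by this page.

```python
from sklearn.neighbors import KNeighborsClassifier

import xorq as xo
from xorq.expr.ml import Step  # hypothetical import path

# Toy training expression; xo.memtable is assumed to mirror ibis.memtable.
train = xo.memtable(
    {
        "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "x2": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
        "label": ["a", "a", "b", "b", "a", "b"],
    }
)

step = Step(KNeighborsClassifier)

# fit() is documented to return a FittedStep; the keyword names are assumptions.
fitted = step.fit(train, features=("x1", "x2"), target="label")
predicted = fitted.predict(train[["x1", "x2"]])
```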
A machine learning pipeline that chains multiple processing steps together.
This class provides a xorq-native implementation that wraps scikit-learn pipelines, enabling deferred execution and integration with xorq expressions. The pipeline can contain both transform steps (data preprocessing) and a final prediction step.
Parameters
| Name | Type | Description | Default |
|------|------|-------------|---------|
| steps | tuple of Step | Sequence of Step objects that make up the pipeline. | |
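As an illustration of how such a pipeline might be assembled from an existing scikit-learn pipeline, here is a hedged sketch. The `Pipeline.from_instance` constructor, the top-level `xo.train_test_splits` export, the `xorq.expr.ml` import path, and the `fit()`/`predict()` argument names are assumptions for illustration and may differ in the actual API.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

import xorq as xo
from xorq.expr.ml import Pipeline  # hypothetical import path

# Toy data; xo.memtable is assumed to behave like ibis.memtable.
t = xo.memtable(
    {
        "row_id": list(range(12)),
        "bill_length": [39.1, 46.5, 49.3, 38.9, 41.1, 50.0,
                        37.8, 45.2, 48.7, 40.3, 42.0, 47.5],
        "bill_depth": [18.7, 17.9, 19.9, 17.8, 18.2, 19.5,
                       18.1, 17.6, 19.8, 18.9, 18.4, 17.7],
        "species": ["adelie", "chinstrap", "gentoo", "adelie", "adelie", "gentoo",
                    "adelie", "chinstrap", "gentoo", "adelie", "adelie", "chinstrap"],
    }
)

# Deterministic split keyed on row_id, using the function described above
# (assumed to be exposed at the top level of the package).
train, test = xo.train_test_splits(t, unique_key="row_id", test_sizes=0.25)

# An ordinary scikit-learn pipeline: a transform step plus a final predictor.
sk_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))

# Wrap it as a xorq Pipeline and fit/predict against xorq expressions;
# execution remains deferred until the result is actually materialized.
pipeline = Pipeline.from_instance(sk_pipeline)
fitted = pipeline.fit(train, features=("bill_length", "bill_depth"), target="species")
predictions = fitted.predict(test[["bill_length", "bill_depth"]])
```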