>>> import xorq.api as xo
>>> unique_key = "key"
>>> table = xo.memtable({unique_key: range(100), "value": range(100, 200)})
>>> test_sizes = [0.2, 0.3, 0.5]
>>> col = xo.expr.ml.calc_split_column(table, unique_key, test_sizes, num_buckets=10, random_seed=42, name="my-split")

calc_split_column
calc_split_column(
table,
unique_key,
test_sizes,
num_buckets=10000,
random_seed=None,
name='split',
)
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| table | ir.Table | The input Ibis table to be split. | required |
| unique_key | str \| tuple[str] \| list[str] \| Selector | The column name(s) that uniquely identify each row in the table. This unique_key is used to create a deterministic split of the dataset through a hashing process. | required |
| test_sizes | Iterable[float] | An iterable of floats giving the desired proportion of each split. Each value must be between 0 and 1, and the values must sum to 1. The order of test_sizes determines the order of the generated subsets. If a single float is passed, it is treated as the test size and a traditional train/test split of (1-test_size, test_size) is returned. | required |
| num_buckets | int | The number of buckets into which rows are binned after their keys are hashed. It controls the granularity of the split: more buckets allow the requested proportions to be represented more finely, while fewer buckets give a coarser split. | 10000 |
| random_seed | int \| None | Seed for the random number generator. If provided, ensures reproducibility of the split. | None |
| name | str | Name for the returned IntegerColumn (default is “split”). | 'split' |
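
The hashing process described for unique_key, test_sizes, and num_buckets can be illustrated with a minimal pure-Python sketch. It is not xorq's implementation: the helper name `sketch_split_index`, the use of SHA-256, and the way the seed is mixed into the key are all assumptions made for illustration. It only shows the general idea of hashing a key into one of num_buckets buckets and mapping bucket ranges to split indices.

```python
import hashlib
from bisect import bisect_right
from itertools import accumulate


def sketch_split_index(key, test_sizes, num_buckets=10000, random_seed=None):
    """Hypothetical helper: map one key to a split index via hash bucketing."""
    # A single float is treated as the test size of a train/test split.
    if isinstance(test_sizes, float):
        test_sizes = (1 - test_sizes, test_sizes)
    sizes = list(test_sizes)

    # Assumption: the seed is mixed into the hashed value so that different
    # seeds give different (but individually reproducible) assignments.
    payload = f"{random_seed}:{key}".encode()
    bucket = int(hashlib.sha256(payload).hexdigest(), 16) % num_buckets

    # Cumulative bucket boundaries, e.g. [0.2, 0.3, 0.5] with 10000 buckets
    # becomes [2000, 5000, 10000].
    bounds = [round(c * num_buckets) for c in accumulate(sizes)]

    # The split index is the first range whose upper boundary exceeds the bucket.
    return min(bisect_right(bounds, bucket), len(sizes) - 1)


# Assign 1000 keys to three splits and count the members of each split.
indices = [sketch_split_index(k, [0.2, 0.3, 0.5], random_seed=42) for k in range(1000)]
counts = [indices.count(i) for i in range(3)]
```

Because the assignment depends only on the key, the seed, and the arguments, the same row always lands in the same split, and the splits are mutually exclusive by construction.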
Returns
| Name | Type | Description |
|---|---|---|
| | ibis.IntegerColumn | A column of split indices representing mutually exclusive subsets of the original table, based on the specified test sizes. |
Raises
| Name | Type | Description |
|---|---|---|
| | ValueError | If any value in test_sizes is not between 0 and 1; if test_sizes does not sum to 1; or if num_buckets is not an integer greater than 1. |
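
The three documented error conditions can be expressed as a small validator. This is a sketch under stated assumptions, not the library's code: the name `check_split_args` is hypothetical, the bounds are assumed to be inclusive, and the sum is compared with a floating-point tolerance, neither of which the documentation above confirms.

```python
import math


def check_split_args(test_sizes, num_buckets):
    """Hypothetical validator mirroring the documented ValueError conditions."""
    sizes = list(test_sizes)
    # Assumption: the bounds 0 and 1 are inclusive.
    if not all(0 <= s <= 1 for s in sizes):
        raise ValueError("every value in test_sizes must be between 0 and 1")
    # Assumption: the sum is checked with a tolerance, since values such as
    # 0.1 + 0.2 + 0.7 are not exactly 1 in floating point.
    if not math.isclose(sum(sizes), 1.0):
        raise ValueError("test_sizes must sum to 1")
    if not isinstance(num_buckets, int) or num_buckets <= 1:
        raise ValueError("num_buckets must be an integer greater than 1")


check_split_args([0.2, 0.3, 0.5], num_buckets=10)  # passes silently
```

Calling a check like this before building the split expression surfaces bad arguments early, rather than as a skewed or empty subset later.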