train_test_splits

train_test_splits(
    table,
    unique_key,
    test_sizes,
    num_buckets=10000,
    random_seed=None,
)

Generates multiple train/test splits of an Ibis table for different test sizes.

This function splits an Ibis table into multiple subsets based on a unique key or combination of keys and a list of test sizes. It uses a hashing function to convert the unique key into an integer, then applies a modulo operation to split the data into buckets. Each subset of data is defined by a range of buckets determined by the cumulative sum of the test sizes.
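
Conceptually, each unique key is hashed to an integer, binned into one of num_buckets buckets via a modulo, and consecutive bucket ranges are assigned to the splits according to the cumulative sums of test_sizes. The sketch below illustrates that idea in plain Python; it uses SHA-256 as a stand-in hash and is not xorq's actual implementation.

import hashlib
from itertools import accumulate

def bucket_of(key, num_buckets=10000, random_seed=0):
    # Hash the key (salted with the seed) and bin it into num_buckets buckets.
    digest = hashlib.sha256(f"{random_seed}-{key}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def split_index(key, test_sizes=(0.2, 0.3, 0.5), num_buckets=10000, random_seed=0):
    # Cumulative sums of test_sizes give the upper bucket bound of each split.
    bounds = [round(c * num_buckets) for c in accumulate(test_sizes)]
    bucket = bucket_of(key, num_buckets, random_seed)
    return next(i for i, upper in enumerate(bounds) if bucket < upper)

# Deterministic: the same key always lands in the same split.
assert split_index("user-42") == split_index("user-42")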

Parameters

| Name | Type | Description | Default |
|------|------|-------------|---------|
| table | ir.Table | The input Ibis table to be split. | required |
| unique_key | str \| tuple[str] \| list[str] | The column name(s) that uniquely identify each row in the table. The unique key is used to create a deterministic split of the dataset through a hashing process. | required |
| test_sizes | Iterable[float] \| float | An iterable of floats representing the desired proportions for the data splits. Each value must be between 0 and 1, and the values must sum to 1. The order of the test sizes determines the order of the generated subsets. If a single float is passed, it is treated as the test size and a traditional train/test split of (1 - test_size, test_size) is returned (see the example after this table). | required |
| num_buckets | int | The number of buckets into which the data is binned after hashing. This controls how finely the data is divided during the split, balancing accuracy against computational efficiency. | 10000 |
| random_seed | int \| None | Seed for the random number generator. If provided, ensures reproducibility of the split. | None |
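As a quick illustration of the float shorthand for test_sizes (assuming, as described above, that it yields exactly two subsets in (train, test) order):

>>> import xorq as ls
>>> t = ls.memtable({"key": range(100), "value": range(100, 200)})
>>> train, test = ls.train_test_splits(t, unique_key="key", test_sizes=0.25, random_seed=0)
>>> n_train, n_test = train.count().execute(), test.count().execute()  # roughly 75 and 25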

Returns

| Type | Description |
|------|-------------|
| Iterator[ir.Table] | An iterator yielding Ibis table expressions, each representing a mutually exclusive subset of the original table based on the specified test sizes. |

Raises

| Type | Description |
|------|-------------|
| ValueError | If any value in test_sizes is not between 0 and 1, if test_sizes does not sum to 1, or if num_buckets is not an integer greater than 1. |
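Illustrative sketch of the validation described above (assuming the error surfaces when the splits are constructed):

>>> import xorq as ls
>>> t = ls.memtable({"key": range(10)})
>>> try:
...     splits = list(ls.train_test_splits(t, "key", test_sizes=[0.5, 0.6]))
... except ValueError:
...     pass  # test_sizes must sum to 1, so a ValueError is expected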

Examples

>>> import xorq as ls
>>> table = ls.memtable({"key": range(100), "value": range(100,200)})
>>> unique_key = "key"
>>> test_sizes = [0.2, 0.3, 0.5]
>>> splits = ls.train_test_splits(table, unique_key, test_sizes, num_buckets=10, random_seed=42)
>>> for i, split_table in enumerate(splits):
...     print(f"Split {i+1} size: {split_table.count().execute()}")
...     print(split_table.execute())
Split 1 size: 29
    key  value
0     0    100
1     4    104
2     5    105
3     7    107
4     8    108
5    10    110
6    16    116
7    20    120
8    21    121
9    23    123
10   24    124
11   29    129
12   45    145
13   49    149
14   50    150
15   54    154
16   59    159
17   63    163
18   64    164
19   68    168
20   75    175
21   76    176
22   85    185
23   86    186
24   88    188
25   89    189
26   90    190
27   93    193
28   98    198
Split 2 size: 23
    key  value
0     1    101
1     2    102
2     3    103
3    13    113
4    14    114
5    19    119
6    22    122
7    31    131
8    32    132
9    47    147
10   53    153
11   56    156
12   57    157
13   58    158
14   62    162
15   65    165
16   66    166
17   67    167
18   70    170
19   71    171
20   80    180
21   92    192
22   96    196
Split 3 size: 48
    key  value
0     6    106
1     9    109
2    11    111
3    12    112
4    15    115
5    17    117
6    18    118
7    25    125
8    26    126
9    27    127
10   28    128
11   30    130
12   33    133
13   34    134
14   35    135
15   36    136
16   37    137
17   38    138
18   39    139
19   40    140
20   41    141
21   42    142
22   43    143
23   44    144
24   46    146
25   48    148
26   51    151
27   52    152
28   55    155
29   60    160
30   61    161
31   69    169
32   72    172
33   73    173
34   74    174
35   77    177
36   78    178
37   79    179
38   81    181
39   82    182
40   83    183
41   84    184
42   87    187
43   91    191
44   94    194
45   95    195
46   97    197
47   99    199