Tables are one of the core data structures in Ibis.
Table
Table(arg)
An immutable and lazy dataframe.
Analogous to a SQL table or a pandas DataFrame. A table expression contains an ordered set of named columns, each with a single known type. Unless explicitly ordered with .order_by(), the order of rows is undefined.
Table immutability means that the data underlying an Ibis Table cannot be modified: every method on a Table returns a new Table with those changes. Laziness means that an Ibis Table expression does not run your computation every time you call one of its methods. Instead, it is a symbolic expression that represents a set of operations to be performed, which is typically translated into a SQL query. That SQL query is then executed on a backend, where the data actually lives. The result (now small enough to be manageable) can then be materialized back into Python as a pandas/pyarrow/Python DataFrame/Column/scalar.
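A minimal sketch of what this means in practice (assuming, as in ibis, that execute() materializes the result into a pandas DataFrame; output omitted):

>>> import xorq as xo
>>> t = xo.memtable({"a": [1, 2, 3]})
>>> expr = t.mutate(b=t.a * 2)  # returns a new Table; t itself is unchanged
>>> expr  # still a lazy, symbolic expression; nothing has run yet
>>> df = expr.execute()  # only now is the query compiled and run on the backend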
You will not create Table objects directly. Instead, you will create one from a pandas DataFrame, pyarrow table, Polars table, or raw Python dicts/lists with xorq.memtable(df).
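For example, a minimal sketch of constructing tables with memtable (the dict form mirrors the sample example further below):

>>> import pandas as pd
>>> import xorq as xo
>>> df = pd.DataFrame({"species": ["Adelie", "Gentoo"], "mass_g": [3700, 5000]})
>>> from_pandas = xo.memtable(df)  # from a pandas DataFrame
>>> from_dict = xo.memtable({"x": [1, 2, 3], "y": ["a", "b", "c"]})  # from raw Python dicts/lists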
>>> import xorq as xo
>>> xo.options.interactive = True
>>> t = xo.examples.penguins.fetch(deferred=False)
>>> expr = t.alias("pingüinos").sql('SELECT * FROM "pingüinos" LIMIT 5')
>>> expr
Exception: DataFusion error: ArrowError(ParseError("Error while parsing value NA for column 2 at line 4"), None)
as_scalar
as_scalar()
Inform ibis that the table expression should be treated as a scalar.
Note that the table must have exactly one column and one row for this to work. If the table has more than one column, an error will be raised at expression construction time. If the table has more than one row, an error will be raised by the backend when the expression is executed.
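A minimal sketch of the intended usage (assuming ibis-style order_by, limit, select, and filter methods), where a one-column, one-row table is used as a scalar subquery:

>>> import xorq as xo
>>> t = xo.memtable({"island": ["Biscoe", "Dream", "Biscoe"], "mass": [4500, 3800, 5200]})
>>> heaviest = t.order_by(t.mass.desc()).limit(1)  # exactly one row
>>> # heaviest.select("island") has one column and one row, so as_scalar()
>>> # lets it be compared against a column like a scalar value
>>> t.filter(t.island == heaviest.select("island").as_scalar())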
Determines which duplicates to keep.

- "first": Drop duplicates except for the first occurrence.
- "last": Drop duplicates except for the last occurrence.
- None: Drop all duplicates.
'first'
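This keep parameter appears to belong to a distinct-style deduplication method; a minimal sketch, assuming an ibis-style distinct(on=..., keep=...) signature:

>>> import xorq as xo
>>> t = xo.memtable({"x": [1, 1, 2], "y": ["a", "b", "c"]})
>>> t.distinct(on="x", keep="first")  # keep the first row for each duplicated x
>>> t.distinct(on="x", keep=None)  # drop every row whose x value is duplicated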
Examples
>>> import xorq as xo
>>> import xorq.examples as ex
>>> import xorq.selectors as s
>>> xo.options.interactive = True
>>> t = ex.penguins.fetch()
>>> t
Exception: DataFusion error: ArrowError(ParseError("Error while parsing value NA for column 2 at line 4"), None)
Sampling is by definition a random operation. Some backends support specifying a seed for repeatable results, but not all backends support that option. And some backends (duckdb, for example) do support specifying a seed but may still not have repeatable results in all cases.
In all cases, results are backend-specific. An execution against one backend is unlikely to sample the same rows when executed against a different backend, even with the same seed set.
The sampling method to use. The default is “row”, which includes each row with a probability of fraction. If method is “block”, some backends may instead sample a fraction of blocks of rows (where “block” is a backend-dependent definition). This is identical to “row” for backends lacking a blockwise sampling implementation. For those coming from SQL, “row” and “block” correspond to “bernoulli” and “system” respectively in a TABLESAMPLE clause.
An optional random seed to use, for repeatable sampling. The range of possible seed values is backend specific (most support at least [0, 2**31 - 1]). Backends that never support specifying a seed for repeatable sampling will error appropriately. Note that some backends (like DuckDB) do support specifying a seed, but may still not have repeatable results in all cases.
>>> import xorq as xo
>>> xo.options.interactive = True
>>> t = xo.memtable({"x": [1, 2, 3, 4], "y": ["a", "b", "c", "d"]})
>>> t
┏━━━━━━━┳━━━━━━━━┓
┃ x     ┃ y      ┃
┡━━━━━━━╇━━━━━━━━┩
│ int64 │ string │
├───────┼────────┤
│     1 │ a      │
│     2 │ b      │
│     3 │ c      │
│     4 │ d      │
└───────┴────────┘
Sample approximately half the rows, with a seed specified for reproducibility.
>>> t.sample(0.5, seed=1234)
Translation to backend failed
Error message: UnsupportedOperationError('`Table.sample` with a random seed is unsupported')
Expression repr follows:
r0 := InMemoryTable
  data:
    PandasDataFrameProxy:
       x  y
    0  1  a
    1  2  b
    2  3  c
    3  4  d

Sample[r0, fraction=0.5, method='row', seed=1234]
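Since the default backend above rejects a seed, a minimal sketch of sampling without one (results are nondeterministic, so output is omitted):

>>> t.sample(0.5)  # row-wise sampling with probability 0.5, no seed
>>> t.sample(0.25, method="block")  # block sampling where the backend supports it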
select
select(*exprs, **named_exprs)
Compute a new table expression using exprs and named_exprs.
Passing an aggregate function to this method will broadcast the aggregate’s value over the number of rows in the table and automatically construct a window function expression. See the examples section for more details.
For backwards compatibility, the keyword argument exprs is reserved and cannot be used to name an expression. This behavior will be removed in v4.
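A minimal sketch of select with positional and named expressions, including an aggregate that is broadcast over all rows as a window function (using the deferred _ from xorq, as in the example below):

>>> import xorq as xo
>>> from xorq import _
>>> t = xo.memtable({"g": ["a", "a", "b"], "v": [1, 2, 3]})
>>> t.select("g", doubled=_.v * 2)  # positional name plus a named expression
>>> t.select("g", demeaned=_.v - _.v.mean())  # aggregate broadcast as a window function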
>>> import xorq as xo
>>> from xorq import _
>>> xo.options.interactive = True
>>> t = xo.examples.penguins.fetch(table_name="penguins", deferred=False)
>>> expr = t.sql(
...     """
...     SELECT island, mean(bill_length_mm) AS avg_bill_length
...     FROM penguins
...     GROUP BY 1
...     ORDER BY 2 DESC
...     """
... )
>>> expr
Exception: DataFusion error: Shared(ArrowError(ParseError("Error while parsing value NA for column 2 at line 4"), None))
Mix and match ibis expressions with SQL queries
>>> t = xo.examples.penguins.fetch(table_name="penguins", deferred=False)
>>> expr = t.sql(
...     """
...     SELECT island, mean(bill_length_mm) AS avg_bill_length
...     FROM penguins
...     GROUP BY 1
...     ORDER BY 2 DESC
...     """
... )
>>> expr = expr.mutate(
...     island=_.island.lower(),
...     avg_bill_length=_.avg_bill_length.round(1),
... )
>>> expr
Exception: DataFusion error: Shared(ArrowError(ParseError("Error while parsing value NA for column 2 at line 4"), None))
Because ibis expressions aren’t named, they aren’t visible to subsequent .sql calls. Use the alias method to assign a name to an expression.
>>> expr.alias("b").sql("SELECT * FROM b WHERE avg_bill_length > 40")
Exception: DataFusion error: Shared(ArrowError(ParseError("Error while parsing value NA for column 2 at line 4"), None))