Switch between backends

This tutorial shows you how to run the same expression on different execution engines. You’ll learn when to choose each backend and see how Xorq moves data between them using Apache Arrow.

After completing this tutorial, you’ll know how to pick the right backend for your workload.

Why switch backends?

Different backends excel at different tasks. DuckDB handles analytical queries efficiently, Pandas works well for small datasets and prototyping, and DataFusion, the engine behind Xorq’s embedded backend, supports custom UDFs.

Xorq lets you write your expression once and run it anywhere. Same code, different engines.

Zero-copy transfers

Xorq uses Apache Arrow to move data between backends without serialization overhead. This makes backend switching fast and memory-efficient.

You’ll see this portability in action by running the same expression across three backends: embedded, DuckDB, and Pandas. Start with the default.

Run on the embedded backend

You’ll start with Xorq’s default embedded backend. This uses a modified DataFusion engine optimized for Arrow operations.

import xorq.api as xo

con = xo.connect()  # 1

iris = xo.examples.iris.fetch(backend=con)  # 2

expr = (  # 3
    iris
    .filter(xo._.sepal_length > 6)
    .group_by("species")
    .agg(avg_width=xo._.sepal_width.mean())
)

result = expr.execute()  # 4
print(f"Backend: {con}")
print(result)

1: Connect to the embedded backend (DataFusion-based).
2: Load the iris dataset into this backend.
3: Build a filter and aggregation expression.
4: Execute on the embedded backend.

The embedded backend is the default. It’s fast, supports all Xorq features, and doesn’t require external setup.

Switch to DuckDB

Now you’ll run the same expression on DuckDB. DuckDB excels at analytical queries and works well with larger datasets.


duckdb_con = xo.duckdb.connect()  # 1

iris_duck = xo.examples.iris.fetch(backend=duckdb_con)  # 2

duck_expr = (  # 3
    iris_duck
    .filter(xo._.sepal_length > 6)
    .group_by("species")
    .agg(avg_width=xo._.sepal_width.mean())
)

duck_result = duck_expr.execute()  # 4
print(f"\nBackend: {duckdb_con}")
print(duck_result)

1: Connect to DuckDB (in-memory by default).
2: Load iris data into DuckDB.
3: Build the same expression as before.
4: Execute on DuckDB.

Notice that the expression logic is identical. Only the connection, and the table loaded through it, changed.
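One way to make the "write once" idea explicit is to wrap the expression in a function that takes a table. The sketch below is an illustration, not part of Xorq: wide_sepal_summary and its min_length parameter are hypothetical names, and it assumes the table exposes its columns as attributes (table.sepal_length).

```python
def wide_sepal_summary(table, min_length=6):
    """Average sepal width per species for rows with sepal_length > min_length.

    `table` is expected to be a table expression like the iris tables in
    this tutorial. Attribute-style column access is an assumption here.
    """
    return (
        table
        .filter(table.sepal_length > min_length)
        .group_by("species")
        .agg(avg_width=table.sepal_width.mean())
    )

# Usage with the tables loaded in the sections above (requires xorq):
# result = wide_sepal_summary(iris).execute()
# duck_result = wide_sepal_summary(iris_duck).execute()
```

With a helper like this, switching engines comes down to passing in a table that was loaded through a different connection.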

In-memory vs persistent

This DuckDB connection is in-memory. To use a persistent database file, pass database="my_db.duckdb" to connect().

Switch to Pandas

Pandas suits small datasets and interactive analysis. Next, run the same expression there.


pandas_con = xo.pandas.connect()  # 1

iris_pandas = xo.examples.iris.fetch(backend=pandas_con)  # 2

pandas_expr = (  # 3
    iris_pandas
    .filter(xo._.sepal_length > 6)
    .group_by("species")
    .agg(avg_width=xo._.sepal_width.mean())
)

pandas_result = pandas_expr.execute()  # 4
print(f"\nBackend: {pandas_con}")
print(pandas_result)

1: Connect to Pandas backend.
2: Load data into Pandas.
3: Same expression, different backend.
4: Execute on Pandas.

The Pandas backend is perfect for prototyping and working with small datasets that fit in memory.
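Because all three runs describe the same query, their results should agree. A quick check can confirm that, keeping in mind that engines may return groups in a different order and that floating-point averages may differ in the last digits. The helper below is a sketch using only the standard library. rows_match is a hypothetical name, and it assumes each result has been converted to a list of row dictionaries, for example with to_dict("records") if execute() returns a pandas DataFrame.

```python
import math

def rows_match(rows_a, rows_b, key="species", rel_tol=1e-9):
    """Return True if two lists of row dicts hold the same data.

    Rows are paired on `key` (assumed unique per row), so row order does not
    matter. Floats are compared with a relative tolerance; all other values
    must be equal.
    """
    by_key_a = {row[key]: row for row in rows_a}
    by_key_b = {row[key]: row for row in rows_b}
    if by_key_a.keys() != by_key_b.keys():
        return False
    for k, row_a in by_key_a.items():
        row_b = by_key_b[k]
        if row_a.keys() != row_b.keys():
            return False
        for col, val_a in row_a.items():
            val_b = row_b[col]
            if isinstance(val_a, float) and isinstance(val_b, float):
                if not math.isclose(val_a, val_b, rel_tol=rel_tol):
                    return False
            elif val_a != val_b:
                return False
    return True

# Usage with the results from the sections above (assuming DataFrames):
# rows_match(result.to_dict("records"), duck_result.to_dict("records"))
```

Pairing rows on the grouping column avoids relying on any particular output order, which the engines are not required to share.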

So far, you’ve loaded data separately into each backend. But what if you start analysis in one backend and need to switch to another mid-workflow? That’s where data transfer comes in.

Move data between backends

Sometimes you need to move data from one backend to another. Xorq makes this easy with .into_backend().


con = xo.connect()  # 1
duckdb_con = xo.duckdb.connect()

data_in_embedded = xo.examples.iris.fetch(backend=con)  # 2

data_in_duckdb = data_in_embedded.into_backend(duckdb_con)  # 3

result = data_in_duckdb.filter(xo._.sepal_length > 6).execute()  # 4

print(f"Original backend: {con}")
print(f"Moved to backend: {duckdb_con}")
print(f"Result shape: {result.shape}")

1: Connect to both backends.
2: Load data into the embedded backend.
3: Move the data to DuckDB using .into_backend().
4: Now you can run queries in DuckDB.

.into_backend() transfers data between backends using Arrow’s zero-copy protocol. This is fast even for large datasets.

When to move data

Move data to a different backend when you need specific features (like DuckDB’s AsOf joins) or better performance for your query type.

Compare backend performance

Time the same query on each backend to compare their performance characteristics.

import time

def time_query(backend):
    """Run the benchmark query on `backend`; return (seconds, row count)."""
    iris = xo.examples.iris.fetch(backend=backend)
    expr = (
        iris
        .filter(xo._.sepal_length > 5)
        .group_by("species")
        .agg(
            count=xo._.species.count(),
            avg_width=xo._.sepal_width.mean(),
        )
    )

    # perf_counter() is a monotonic, high-resolution clock meant for timing.
    # Only execution is timed; loading the data happens before the clock starts.
    start = time.perf_counter()
    result = expr.execute()
    elapsed = time.perf_counter() - start

    return elapsed, len(result)

con = xo.connect()  # 1
duckdb_con = xo.duckdb.connect()
pandas_con = xo.pandas.connect()

print("Timing comparison:")  # 2
print("-" * 50)

t1, rows1 = time_query(con)  # 3
print(f"Embedded:  {t1:.4f}s - {rows1} rows")

t2, rows2 = time_query(duckdb_con)
print(f"DuckDB:    {t2:.4f}s - {rows2} rows")

t3, rows3 = time_query(pandas_con)
print(f"Pandas:    {t3:.4f}s - {rows3} rows")

1: Connect to all three backends.
2: Print a comparison header.
3: Time the same query on each backend.

Iris has only 150 rows, so these timings mostly reflect fixed overhead and the differences between backends are minimal. With larger datasets, DuckDB and the embedded backend typically outperform Pandas.
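A single timed run on a dataset this small is noisy, so it can help to repeat the measurement and keep the fastest run. The sketch below uses the standard library’s timeit module. best_of is a hypothetical helper name, and it accepts any zero-argument callable, such as the execute method of an expression.

```python
import timeit

def best_of(fn, repeats=5):
    """Call fn() `repeats` times and return the fastest run in seconds.

    The minimum is used because slower runs usually reflect interference
    from other processes rather than the work being measured.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    # number=1 means each timing covers exactly one call to fn.
    return min(timeit.repeat(fn, number=1, repeat=repeats))

# Usage with an expression from this tutorial (requires xorq):
# print(f"Embedded: {best_of(expr.execute):.4f}s")
```

Note that repeated runs may benefit from warm caches, so the first run can be slower than the reported minimum.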

Choose the right backend

Here’s when to use each backend:

Embedded (DataFusion):
- Default choice for most workloads.
- Excellent UDF support.
- Fast analytical queries.
- No external dependencies.

DuckDB:
- Analytical queries on moderate-to-large datasets.
- AsOf joins and time-series operations.
- Efficient with Parquet files.
- Persistent storage needs.

Pandas:
- Small datasets (<1GB).
- Interactive prototyping.
- Integration with existing Pandas code.
- Quick exploration.
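The guidance above can be condensed into a small rule-of-thumb function. This is only a sketch of one possible reading of the list: suggest_backend, its parameters, and the order in which the rules are applied are all assumptions made for illustration, and the 1 GB cutoff is taken from the Pandas entry above.

```python
ONE_GB = 1024 ** 3

def suggest_backend(
    size_bytes,
    needs_asof_join=False,
    needs_persistence=False,
    existing_pandas_code=False,
):
    """Return "duckdb", "pandas", or "embedded" from simple heuristics."""
    # Feature requirements come first: these point to DuckDB in the list above.
    if needs_asof_join or needs_persistence:
        return "duckdb"
    # Pandas is suggested only for small data that plugs into Pandas code.
    if size_bytes < ONE_GB and existing_pandas_code:
        return "pandas"
    # Otherwise fall back to the default embedded backend.
    return "embedded"

print(suggest_backend(50 * 1024 ** 2, existing_pandas_code=True))
print(suggest_backend(5 * ONE_GB, needs_asof_join=True))
print(suggest_backend(5 * ONE_GB))
```

Treat the output as a starting point. Timing a representative query on your own data is a more reliable guide than any fixed threshold.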

Backend capabilities

Not all backends support every operation. For example, some complex window functions might work in DuckDB but not in Pandas. Check the documentation if you hit an unsupported operation error.
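If an operation may be unsupported on your preferred backend, one defensive pattern is to try backends in order and fall back when one fails. The sketch below is generic Python rather than a Xorq feature. execute_with_fallback is a hypothetical helper, and because this tutorial does not name the exception Xorq raises for unsupported operations, the exception types are a parameter, with NotImplementedError as a placeholder default.

```python
def execute_with_fallback(attempts, unsupported_errors=(NotImplementedError,)):
    """Run the first attempt that succeeds.

    `attempts` is a list of (label, zero-argument callable) pairs, tried in
    order. Returns (label, result) for the first callable that does not raise
    one of `unsupported_errors`. Other exceptions propagate unchanged.
    """
    failures = []
    for label, run in attempts:
        try:
            return label, run()
        except unsupported_errors as exc:
            failures.append(f"{label}: {exc}")
    raise RuntimeError("no backend could run the query: " + "; ".join(failures))

# Usage with expressions from this tutorial (requires xorq):
# label, df = execute_with_fallback([
#     ("pandas", pandas_expr.execute),
#     ("duckdb", duck_expr.execute),
# ])
```

Catching only the named exception types keeps genuine bugs, such as a misspelled column, from being silently masked by the fallback.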

Now that you understand when to use each backend, here’s a complete workflow that ties everything together.

Complete example

Here’s a full example showing backend switching:

import xorq.api as xo

# Connect to backends
embedded = xo.connect()
duckdb = xo.duckdb.connect()

# Load data in embedded backend
data = xo.examples.iris.fetch(backend=embedded)

# Build expression
expr = (
    data
    .filter(xo._.sepal_length > 6)
    .group_by("species")
    .agg(avg_width=xo._.sepal_width.mean())
)

# Execute on embedded backend
result1 = expr.execute()
print("Embedded result:", result1)

# Move to DuckDB and execute there
data_in_duck = data.into_backend(duckdb)
expr_duck = (
    data_in_duck
    .filter(xo._.sepal_length > 6)
    .group_by("species")
    .agg(avg_width=xo._.sepal_width.mean())
)
result2 = expr_duck.execute()
print("DuckDB result:", result2)

Next steps

Now you know how to switch backends. Continue learning:

Your first build shows how to package expressions for deployment across backends
Optimize pipeline performance covers backend selection strategies
Switch backends dynamically teaches advanced backend-switching patterns