Data Lineage in Xorq

Data lineage in Xorq gives you visibility into your data transformations and computations, down to the level of individual columns. This guide shows you how to track, visualize, and understand the flow of data through your pipelines.

What is Data Lineage?

Data lineage tracks how data flows through your computational pipeline, showing:

  • Column-level dependencies: Which source columns affect each output column
  • Transformation steps: What operations were applied at each stage
  • Data provenance: The complete history of how data was derived
  • Computational graph: Visual representation of your pipeline’s structure
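As a rough illustration of what column-level dependencies mean, the sketch below models lineage as a plain dictionary and resolves a derived column back to its source columns. The DEPENDENCIES mapping and the source_columns helper are hypothetical names used only for illustration; they are not part of Xorq, and the column names simply mirror the example pipeline in this guide.

```python
# Toy model of column-level lineage: each derived column maps to the
# columns it is computed from. This dict is written by hand for
# illustration; it is NOT produced by xorq.
DEPENDENCIES = {
    "discount_value": ["price", "discount"],
    "total_discount": ["discount_value"],
    "total_price": ["price"],
}


def source_columns(column, deps=DEPENDENCIES):
    """Return the set of source (non-derived) columns that feed `column`."""
    parents = deps.get(column)
    if parents is None:
        # Not derived from anything else, so it is a source column itself.
        return {column}
    sources = set()
    for parent in parents:
        sources |= source_columns(parent, deps)
    return sources


print(sorted(source_columns("total_discount")))  # ['discount', 'price']
print(sorted(source_columns("total_price")))     # ['price']
```

A real lineage tree carries more than this mapping, since it also records which operation was applied at each step.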

Getting Started

Basic Setup

First, import the necessary lineage utilities:

import xorq as xo
import xorq.expr.datatypes as dt
from xorq.common.utils.lineage_utils import build_column_trees, print_tree

Simple Example

Let’s start with a basic pipeline that demonstrates lineage tracking:

# Create a sample dataset
con = xo.connect()

sales_table = xo.memtable(
    {
        "order_id": [1, 2, 1, 2],
        "price": [100.0, 150.0, 200.0, 250.0],
        "discount": [0.1, 0.2, 0.15, 0.1],
    },
    name="sales",
)

# Define a UDF for discount calculation
@xo.udf.make_pandas_udf(
    schema=xo.schema({"price": float, "discount": float}),
    return_type=dt.float,
    name="calculate_discount_value",
)
def calculate_discount_value(df):
    return df["price"] * df["discount"]

# Build a pipeline with transformations
sales_with_discount = sales_table.mutate(
    discount_value=calculate_discount_value.on_expr(sales_table)
)

# Aggregate the results
expr = sales_with_discount.group_by("order_id").agg(
    total_discount=xo._.discount_value.sum(),
    total_price=xo._.price.sum(),
)
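To make the expected result concrete, here is a plain-Python sketch of the same computation, written without Xorq so it can be checked by hand. The rows list and totals dictionary are illustrative names, not Xorq objects; the arithmetic follows the UDF and aggregation defined above.

```python
from collections import defaultdict

# Same sample data as the sales memtable, one dict per row.
rows = [
    {"order_id": 1, "price": 100.0, "discount": 0.1},
    {"order_id": 2, "price": 150.0, "discount": 0.2},
    {"order_id": 1, "price": 200.0, "discount": 0.15},
    {"order_id": 2, "price": 250.0, "discount": 0.1},
]

# Accumulate per-order totals, mirroring group_by("order_id").agg(...).
totals = defaultdict(lambda: {"total_discount": 0.0, "total_price": 0.0})
for row in rows:
    # Equivalent of the calculate_discount_value UDF for a single row.
    discount_value = row["price"] * row["discount"]
    group = totals[row["order_id"]]
    group["total_discount"] += discount_value
    group["total_price"] += row["price"]

for order_id, agg in sorted(totals.items()):
    print(order_id, round(agg["total_discount"], 2), agg["total_price"])
# 1 40.0 300.0
# 2 55.0 400.0
```

Note that total_discount depends on both price and discount, while total_price depends on price alone. That is the distinction the lineage trees make visible.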

Building and Visualizing Lineage Trees

Column-Level Lineage

Use build_column_trees() to build one lineage tree per output column of your expression:

# Build lineage trees for all columns
column_trees = build_column_trees(expr)

# Examine each column's lineage
for column, tree in column_trees.items():
    print(f"Lineage for column '{column}':")
    print_tree(tree)
    print("\n")

Running this prints one tree per output column:

Lineage for column 'order_id':
Field:order_id #1
└── Field:order_id #2
    └── InMemoryTable #3


Lineage for column 'total_discount':
Sum #1
└── Field:discount_value #2
    └── calculate_discount_value #3
        ├── Field:price #4
        │   └── InMemoryTable #5
        └── Field:discount #6
            └── ↻ see #5


Lineage for column 'total_price':
Sum #1
└── Field:price #2
    └── Field:price #3
        └── InMemoryTable #4
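In the total_discount tree, the line "↻ see #5" appears to be a back-reference: both Field:price and Field:discount read from the same in-memory table, so the second visit points to the node already printed as #5 instead of repeating it. The sketch below reproduces that numbering scheme with a hypothetical Node class and render function. It is an assumption about how such output can be generated, not Xorq's implementation, and the objects returned by build_column_trees may be structured differently.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # keep identity semantics so shared nodes stay distinguishable
class Node:
    """Hypothetical lineage node: a label plus the nodes it depends on."""
    label: str
    children: list = field(default_factory=list)


def render(root):
    """Return the tree as text lines, numbering nodes on first visit and
    replacing repeat visits of a shared node with a back-reference."""
    numbers = {}  # id(node) -> number assigned on first visit
    lines = []

    def visit(node, prefix="", branch=""):
        if id(node) in numbers:
            lines.append(f"{prefix}{branch}↻ see #{numbers[id(node)]}")
            return
        numbers[id(node)] = len(numbers) + 1
        lines.append(f"{prefix}{branch}{node.label} #{numbers[id(node)]}")
        # Indent children: blanks under a last child, a bar under others.
        if branch == "":
            child_prefix = prefix
        elif branch == "└── ":
            child_prefix = prefix + "    "
        else:
            child_prefix = prefix + "│   "
        for i, child in enumerate(node.children):
            is_last = i == len(node.children) - 1
            visit(child, child_prefix, "└── " if is_last else "├── ")

    visit(root)
    return lines


# Rebuild the total_discount lineage by hand; `table` is shared by two fields.
table = Node("InMemoryTable")
price = Node("Field:price", [table])
discount = Node("Field:discount", [table])
udf = Node("calculate_discount_value", [price, discount])
tree = Node("Sum", [Node("Field:discount_value", [udf])])

print("\n".join(render(tree)))
```

Because table is the same object under both fields, the second visit prints a reference to #5 rather than a new node.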