Data Lineage in Xorq
Data lineage is one of Xorq’s most powerful features: it gives you visibility into how each column in a result is transformed and computed. This guide shows you how to track, visualize, and understand the flow of data through your pipelines.
What is Data Lineage?
Data lineage tracks how data flows through your computational pipeline, showing:
- Column-level dependencies: Which source columns affect each output column
- Transformation steps: What operations were applied at each stage
- Data provenance: The complete history of how data was derived
- Computational graph: Visual representation of your pipeline’s structure
Getting Started
Basic Setup
First, import the necessary lineage utilities:
import xorq as xo
import xorq.expr.datatypes as dt
from xorq.common.utils.lineage_utils import build_column_trees, print_tree
Simple Example
Let’s start with a basic pipeline that demonstrates lineage tracking:
# Create a sample dataset
con = xo.connect()

sales_table = xo.memtable(
    {
        "order_id": [1, 2, 1, 2],
        "price": [100.0, 150.0, 200.0, 250.0],
        "discount": [0.1, 0.2, 0.15, 0.1],
    },
    name="sales",
)

# Define a UDF for discount calculation
@xo.udf.make_pandas_udf(
    schema=xo.schema({"price": float, "discount": float}),
    return_type=dt.float,
    name="calculate_discount_value",
)
def calculate_discount_value(df):
    return df["price"] * df["discount"]

# Build a pipeline with transformations
sales_with_discount = sales_table.mutate(
    discount_value=calculate_discount_value.on_expr(sales_table)
)

# Aggregate the results
expr = sales_with_discount.group_by("order_id").agg(
    total_discount=xo._.discount_value.sum(),
    total_price=xo._.price.sum(),
)
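To make concrete what this pipeline computes, the sketch below reproduces the same arithmetic in plain Python, without Xorq. It is only an illustration under that assumption: the helper name `aggregate_sales` is hypothetical and not part of Xorq, and the rows are simply the sample data from the memtable above.

```python
from collections import defaultdict

# Same sample rows as the memtable above, one dict per row.
rows = [
    {"order_id": 1, "price": 100.0, "discount": 0.1},
    {"order_id": 2, "price": 150.0, "discount": 0.2},
    {"order_id": 1, "price": 200.0, "discount": 0.15},
    {"order_id": 2, "price": 250.0, "discount": 0.1},
]


def aggregate_sales(rows):
    """Hypothetical helper: per-order sums of price and of price * discount."""
    totals = defaultdict(lambda: {"total_discount": 0.0, "total_price": 0.0})
    for row in rows:
        # Per-row value that the calculate_discount_value UDF computes.
        discount_value = row["price"] * row["discount"]
        totals[row["order_id"]]["total_discount"] += discount_value
        totals[row["order_id"]]["total_price"] += row["price"]
    return dict(totals)


result = aggregate_sales(rows)
for order_id, totals in sorted(result.items()):
    print(order_id, totals)
```

Note that `total_discount` is derived from both `price` and `discount`, while `total_price` is derived from `price` alone. That difference in dependencies is what column-level lineage makes visible.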
Building and Visualizing Lineage Trees
Column-Level Lineage
Use build_column_trees() to create lineage trees for each column in your expression:
# Build lineage trees for all columns
column_trees = build_column_trees(expr)

# Examine each column's lineage
for column, tree in column_trees.items():
    print(f"Lineage for column '{column}':")
    print_tree(tree)
    print("\n")
Lineage for column 'order_id':
Field:order_id #1
└── Field:order_id #2
    └── InMemoryTable #3

Lineage for column 'total_discount':
Sum #1
└── Field:discount_value #2
    └── calculate_discount_value #3
        ├── Field:price #4
        │   └── InMemoryTable #5
        └── Field:discount #6
            └── ↻ see #5

Lineage for column 'total_price':
Sum #1
└── Field:price #2
    └── Field:price #3
        └── InMemoryTable #4
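One way to read these trees: each leaf is a source table, and the Field nodes directly above a leaf name the source columns that feed the output column. The sketch below models the trees for total_discount and total_price with a small hypothetical `Node` class and walks them to collect those source columns. `Node` and `source_columns` are assumptions made for illustration; they are not Xorq's internal representation, nor the objects returned by build_column_trees().

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a lineage tree node; not Xorq's own type."""
    label: str
    children: list = field(default_factory=list)


def source_columns(root):
    """Return the sorted names of Field nodes whose only child is a leaf (a source table)."""
    found = set()

    def walk(node):
        is_field = node.label.startswith("Field:")
        sits_on_leaf = len(node.children) == 1 and not node.children[0].children
        if is_field and sits_on_leaf:
            found.add(node.label.split(":", 1)[1])
        for child in node.children:
            walk(child)

    walk(root)
    return sorted(found)


# The shared table node mirrors the "see #5" back-reference in the printed tree.
table = Node("InMemoryTable")

total_discount = Node("Sum", [
    Node("Field:discount_value", [
        Node("calculate_discount_value", [
            Node("Field:price", [table]),
            Node("Field:discount", [table]),
        ]),
    ]),
])

total_price = Node("Sum", [
    Node("Field:price", [
        Node("Field:price", [table]),
    ]),
])

print(source_columns(total_discount))
print(source_columns(total_price))
```

In this toy model, total_discount resolves to the discount and price source columns, while total_price resolves to price only, matching the printed trees.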