Overview

In this tutorial (Part 2 of our series), you’ll learn how to:

  • Load the labeled HackerNews data from Part 1
  • Split the data into training and testing sets
  • Apply TF-IDF vectorization to the text data
  • Build deferred pipelines with fit-transform operations
  • Prepare the transformed data for model training

Prerequisites

This tutorial assumes you have completed Part 1 of the series, which produced the labeled HackerNews data we load below.

Installation and Imports

First, make sure you have the required packages:

pip install xorq pandas scikit-learn 

Then import the necessary modules:

import xorq as xo
import xorq.expr.datatypes as dt
from sklearn.feature_extraction.text import TfidfVectorizer
from xorq.caching import ParquetStorage
from xorq.common.utils.defer_utils import deferred_read_parquet
from xorq.common.utils.import_utils import import_python
from xorq.expr.ml import (
    deferred_fit_transform_series_sklearn,
    train_test_splits,
)


# Import the helper modules we used in Part 1
m = import_python(xo.options.pins.get_path("hackernews_lib"))
o = import_python(xo.options.pins.get_path("openai_lib"))

Setting Up the TF-IDF Transformation

Now, let’s define our TF-IDF transformer using xorq’s deferred operations:

# Define which column we want to transform
transform_col = "title"
transformed_col = f"{transform_col}_transformed"

# Create a deferred TF-IDF transformer
deferred_fit_transform_tfidf = deferred_fit_transform_series_sklearn(
    col=transform_col,
    cls=TfidfVectorizer,
    return_type=dt.Array(dt.float64),
)

The deferred_fit_transform_series_sklearn function wraps scikit-learn’s TfidfVectorizer in a deferred operation: nothing is fitted when this line runs. The fit happens later, against the training data flowing through the pipeline, and the fitted transform turns each title into a vector of numerical features.
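
For intuition, here is the eager scikit-learn equivalent of what the deferred operation will do at execution time (a standalone sketch with made-up titles):

from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "Show HN: A tiny TF-IDF demo",
    "Ask HN: How do you vectorize text?",
]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(titles)  # sparse matrix, one row per title
print(matrix.shape)                        # (2, vocabulary_size)
print(matrix.toarray()[0][:5])             # first 5 feature weights of the first title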

Loading the Labeled Data

Let’s initialize the backend and load our data from Part 1:

# Initialize the backend
con = xo.connect()
storage = ParquetStorage(source=con)

# Define the input dataset name
name = "hn-fetcher-input-large"

# Load the data
raw_expr = (
    deferred_read_parquet(
        con,
        xo.options.pins.get_path(name),
        name,
    )
    .pipe(m.do_hackernews_fetcher_udxf)
)

# Process the data as we did in Part 1
processed_expr = (
    raw_expr
    .filter(xo._.text.notnull())
    .pipe(o.do_hackernews_sentiment_udxf, con=con)
    .cache(storage=storage)  # reuse the ParquetStorage defined above
    .filter(~xo._.sentiment.contains("ERROR"))
    .mutate(
        sentiment_int=xo._.sentiment.cases(
            {"POSITIVE": 2, "NEUTRAL": 1, "NEGATIVE": 0}.items()
        ).cast(int)
    )
)
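
Before splitting, you can sanity-check the pipeline by materializing a few rows. This sketch assumes the ibis-style select/limit/execute methods that xorq expressions expose:

# Materialize a handful of rows to verify the sentiment labels look right
processed_expr.select("title", "sentiment", "sentiment_int").limit(5).execute()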

Splitting Data into Train and Test Sets

Before applying our TF-IDF transformation, we’ll split the data into training and testing sets:

# Split into train (60%) and test (40%) sets
(train_expr, test_expr) = processed_expr.pipe(
    train_test_splits,
    unique_key="id",
    test_sizes=(0.6, 0.4),
    random_seed=42,
)

The train_test_splits function assigns each record to exactly one split, keyed on the unique ‘id’ column, so no row can appear in both sets. The random seed makes the assignment deterministic and reproducible across runs.
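
xorq’s internals may differ, but the idea behind keying the split on a unique id can be shown with a pure-Python sketch: hash the (seed, id) pair into [0, 1) and threshold on the test fraction, so every id lands deterministically in exactly one split:

import hashlib

def assign_split(unique_id, test_fraction=0.4, seed=42):
    # Hash (seed, id) into a float in [0, 1); ids below the threshold go to test
    digest = hashlib.md5(f"{seed}-{unique_id}".encode()).hexdigest()
    bucket = int(digest, 16) / 16**32
    return "test" if bucket < test_fraction else "train"

print(assign_split(12345))  # the same id and seed always yield the same assignment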

Building the Deferred TF-IDF Pipeline

Now let’s build our deferred TF-IDF pipeline:

# Fit TF-IDF on the training data: returns the deferred model, the fit UDAF, and the transform UDF
(deferred_tfidf_model, tfidf_udaf, deferred_tfidf_transform) = (
    deferred_fit_transform_tfidf(
        train_expr,
        storage=storage,
    )
)

# Apply the transformation to the training data
train_tfidf_transformed = train_expr.mutate(
    **{transformed_col: deferred_tfidf_transform.on_expr}
)

Applying the Transformation to Test Data

Similarly, we can apply the same transformation to the test data:

# Apply the transformation to the test data
test_tfidf_transformed = test_expr.mutate(
    **{transformed_col: deferred_tfidf_transform.on_expr}
)

Notice that we reuse the same deferred_tfidf_transform UDF, which applies the vectorizer fitted on the training data. The test data is therefore transformed with exactly the same vocabulary and IDF weights, with no information leaking from the test set into the fit.
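
This mirrors the standard scikit-learn discipline of fitting on the training text only and merely transforming the test text (a standalone sketch for comparison):

from sklearn.feature_extraction.text import TfidfVectorizer

train_titles = ["Show HN: my new project", "Ask HN: best editor?"]
test_titles = ["Show HN: another project"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_titles)  # learn vocabulary + IDF from train only
test_matrix = vectorizer.transform(test_titles)        # reuse the fit; never refit on test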

Executing and Examining the Transformed Data

Now let’s execute our pipeline and examine the transformed data:

# Execute the transformation on the training data
train_transformed = train_tfidf_transformed.execute()

# Check the dimensions and structure of the transformed data
print(f"Number of training samples: {len(train_transformed)}")
print(f"Original title example: {train_transformed['title'].iloc[0]}")
print(f"Vector dimensions: {len(train_transformed[transformed_col].iloc[0])}")

# You can also examine specific feature values if needed
print(f"First 5 feature values: {train_transformed[transformed_col].iloc[0][:5]}")
Number of training samples: 381
Original title example: Show HN: Xenoeye – high performance network traffic analyzer (OSS, *flow-based)
Vector dimensions: 1489
First 5 feature values: [np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0)]
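
Most entries in a TF-IDF vector are zero, as the first five values above suggest. To see the informative entries, pull out the non-zero weights of a row (a sketch assuming train_transformed and transformed_col from the steps above):

import numpy as np

vec = np.asarray(train_transformed[transformed_col].iloc[0])
nonzero = np.flatnonzero(vec)
print(f"Non-zero features: {len(nonzero)} of {len(vec)}")
print(vec[nonzero][:5].round(3))  # the first few non-zero TF-IDF weights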

Summary

Congratulations! In this second part of our tutorial series, you’ve:

  1. Set up a deferred TF-IDF transformation pipeline
  2. Split your data into training and testing sets
  3. Applied the TF-IDF transformation to both sets
  4. Examined the transformed data
  5. Cached intermediate results with ParquetStorage so they can be reused

Next Steps

In the next tutorial (Part 3), we’ll use the transformed data to train an XGBoost model for sentiment classification. We’ll build on the same deferred pipeline pattern to create an end-to-end machine learning workflow.
