XGBoost Training
Part 3: Learn how to train XGBoost models, make predictions, and evaluate model performance using xorq’s deferred pipelines
Overview
In this tutorial (Part 3 of our series), you’ll learn how to:
- Define deferred model training and prediction operations
- Split data into train and test sets
- Train an XGBoost model with TF-IDF
- Make predictions on both training and test data
- Evaluate model performance
Prerequisites
Installation and Imports
First, ensure you have the required packages:
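The original install command was stripped from this page; based on the tools used in this part, it presumably looks something like the following (the exact package set beyond `xorq` and `xgboost` is an assumption):

```shell
# Assumed package set for this part of the series:
# xorq for deferred pipelines, xgboost for the model,
# scikit-learn for TF-IDF and evaluation metrics.
pip install xorq xgboost scikit-learn
```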
Then import the necessary modules:
Setting Up Deferred Operations
Defining Model Training and Prediction Functions
Let’s define functions for training and making predictions with XGBoost:
The `fit_xgboost_model` function trains an XGBoost model on the provided features and target, and the `predict_xgboost_model` function applies the trained model to new data to generate predictions.
Note that we’re using `multi:softmax` as the objective function, since we have three sentiment classes (POSITIVE=2, NEUTRAL=1, NEGATIVE=0).
Now, let’s set up our deferred operations for both the TF-IDF transformation and XGBoost prediction:
The `deferred_fit_predict` function creates a deferred operation that will:
- Fit a model using the specified `fit` function on the training data
- Create a prediction operation that can be applied to any dataset
Unlike the TF-IDF transformation (which we covered in detail in Part 2), model training is implemented as an aggregate function rather than a UDXF function. This is because training involves aggregating across the entire dataset to learn patterns, while transformation is applied row by row.
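To make the “deferred” idea concrete, here is a minimal pure-Python sketch of the pattern. This illustrates only the semantics (lazy, fit-once, predict-anywhere); it is not xorq’s actual implementation, and the names and data layout are invented:

```python
def deferred_fit_predict(train_features, train_target, fit, predict):
    """Capture a fit/predict pair lazily: nothing is trained until first use."""
    state = {}

    def fitted_model():
        # Train at most once, on first use (mirrors deferred execution).
        if "model" not in state:
            state["model"] = fit(train_features, train_target)
        return state["model"]

    def predict_on(features):
        # A prediction operation that can be applied to any dataset.
        return predict(fitted_model(), features)

    return fitted_model, predict_on
```

Nothing runs when `deferred_fit_predict` is called; training happens only when a prediction (or the model itself) is actually requested, and the fitted model is then reused.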
Loading and Preparing the Data
Let’s load and prepare our data, similar to what we did in the previous parts:
Splitting the Data into Train and Test Sets
Before training our model, we’ll split the data into training and testing sets:
The `train_test_splits` function in xorq ensures a proper, deterministic split of your data. It works by using a hashing function to convert the unique key (`id` in our case) into an integer, then applying a modulo operation to split the data into buckets.
Having a unique key field is essential as it allows xorq to deterministically order the table and assign records to either the training or test set. This approach ensures that:
- The same record will always end up in the same split when using the same random seed
- The splitting is distributed evenly across the dataset
- Records are not duplicated across splits
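The hash-and-modulo idea can be sketched in plain Python. This is a conceptual illustration only, not xorq’s exact hashing scheme; the function names and bucket count are invented:

```python
import hashlib

def split_bucket(key, random_seed=42, num_buckets=10_000):
    # Hash the unique key together with the seed to an integer...
    digest = hashlib.md5(f"{random_seed}-{key}".encode()).hexdigest()
    # ...then take a modulo to land in one of num_buckets buckets.
    return int(digest, 16) % num_buckets

def assign_split(key, test_size=0.2, random_seed=42, num_buckets=10_000):
    # Records whose bucket falls below the test threshold go to the test
    # set; the same key and seed always produce the same assignment.
    threshold = test_size * num_buckets
    return "test" if split_bucket(key, random_seed, num_buckets) < threshold else "train"
```

Because the assignment depends only on the key and the seed, each record lands in exactly one split, and re-running the pipeline reproduces the same split.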
Applying TF-IDF Transformation
Let’s apply the TF-IDF transformation to our training data:
We’re using the same TF-IDF approach we explored in Part 2, fitting on the training data to create a vocabulary and then transforming the documents into numerical feature vectors. This step is necessary to convert the text into a format that our XGBoost model can process.
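The fit-on-train / transform-anywhere pattern can be illustrated with scikit-learn’s `TfidfVectorizer` (the example documents below are invented; Part 2 covers the actual xorq-based transformation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "great product, works well",
    "terrible, broke after a day",
    "it is okay, nothing special",
]

# Fit the vocabulary and IDF weights on the *training* documents only...
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# ...then reuse the fitted vectorizer to map unseen documents
# into the same feature space.
X_new = vectorizer.transform(["works great"])
```

Fitting only on the training data keeps the vocabulary fixed, so train, test, and future data all share one feature space and no information leaks from the test set.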
Training the XGBoost Model
Now, let’s train our XGBoost model on the transformed training data:
Unlike the transformation step, model training is implemented as an aggregate function (`xgb_udaf`). This is an important distinction:
- Transformation (UDF): Operates row by row, applying the same function to each record independently
- Training (UDAF): Aggregates across the entire dataset, learning patterns from all records collectively
The `deferred_fit_predict_xgb` function returns three key components:
- `deferred_xgb_model`: an Expr that, when executed, returns the trained model
- `xgb_udaf`: the User-Defined Aggregate Function that performs the training
- `deferred_xgb_predict`: the scalar UDF that takes an Expr as input (i.e., an `ExprScalarUDF`)
Making Predictions on Test Data
Similarly, we’ll apply both the TF-IDF transformation and XGBoost prediction to our test data:
Note the use of the seemingly superfluous `.into_backend(xo.connect())` call. It is currently necessary to ensure proper handling of the data types during the prediction process, and should eventually be fixed; see the GitHub issue for more information.
Evaluating Model Performance
Let’s execute our pipeline and evaluate the model’s performance:
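Evaluation boils down to comparing the predicted labels against the true ones. Assuming scikit-learn for the metrics (the label arrays below are invented for illustration):

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true and predicted sentiment labels
# (NEGATIVE=0, NEUTRAL=1, POSITIVE=2).
y_true = [2, 1, 0, 2, 1, 0, 2, 0, 1, 2]
y_pred = [2, 1, 0, 1, 1, 0, 2, 0, 0, 2]

# Overall fraction of correct predictions: 8 of 10 -> 0.8
print(accuracy_score(y_true, y_pred))

# Per-class precision, recall, and F1
print(classification_report(
    y_true, y_pred,
    target_names=["NEGATIVE", "NEUTRAL", "POSITIVE"],
))
```

With imbalanced sentiment classes, the per-class breakdown from `classification_report` is usually more informative than accuracy alone.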
Summary and Next Steps
Congratulations! In this third part of our tutorial series, you’ve:
- Created deferred operations for model training and prediction
- Split data into training and testing sets
- Applied TF-IDF transformation to convert text to features
- Trained an XGBoost model for sentiment classification
- Made predictions on both training and test data
- Evaluated model performance using various metrics
- Applied the model to make predictions on new data
In the next tutorial (Part 4), we’ll explore how to deploy our trained model for real-time predictions using xorq’s Flight serving capabilities.