Data Labeling w/ LLMs
This is part 1/4.
Overview
In this tutorial, you'll learn how to:

- Set up xorq and configure the necessary components
- Fetch data from the HackerNews API
- Use OpenAI's gpt-3.5-turbo model to automatically label data with sentiment analysis
- Create a labeled dataset ready for future machine learning tasks
Prerequisites
- Python 3.8+ installed on your system
- An OpenAI API key for the sentiment labeling
- Basic understanding of Python and data processing pipelines
Make sure to set your OpenAI API key in your environment:
```bash
export OPENAI_API_KEY=your_api_key
```
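If you want to fail fast when the key is missing, a quick check like the following works (a minimal sketch; the only assumption is the variable name, taken from the export above):

```python
import os

# Abort early if the OpenAI API key is not set in the environment
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; export it before running the pipeline")
```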
Installation and Imports
First, install xorq and the required dependencies:
```bash
pip install xorq pandas
```
Then import the necessary modules:

```python
import pandas as pd
import xorq as xo
import xorq.expr.datatypes as dt
from xorq.caching import ParquetStorage
from xorq.common.utils.import_utils import import_python

m = import_python(xo.options.pins.get_path("hackernews_lib", version="20250604T223424Z-2e578"))
o = import_python(xo.options.pins.get_path("openai_lib", version="20250604T223419Z-0ce44"))
```

The imported modules `m` (hackernews_lib) and `o` (openai_lib) contain utility functions for:

- Connecting to the HackerNews Firebase API
- Fetching and processing HackerNews stories
- Making calls to OpenAI's API for sentiment analysis
- Processing the response into structured data
You’ll need to ensure these files are accessible in your environment or create them based on the code snippets in the Appendix.
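For intuition, here is a minimal sketch of the kind of call openai_lib wraps. The function name `label_sentiment` and the prompt text are illustrative assumptions; only the model (gpt-3.5-turbo) and the POSITIVE/NEUTRAL/NEGATIVE label set come from this tutorial.

```python
from openai import OpenAI

def label_sentiment(text: str) -> str:
    """Illustrative sketch: classify one story's sentiment with gpt-3.5-turbo."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the sentiment of the following HackerNews story "
                    "as POSITIVE, NEUTRAL, or NEGATIVE. Reply with one word."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()
```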
Defining the HackerNews Fetcher
We'll define a user-defined exchange function (UDXF) that fetches HackerNews stories:
```python
do_hackernews_fetcher_udxf = xo.expr.relations.flight_udxf(
    process_df=m.get_hackernews_stories_batch,
    maybe_schema_in=m.schema_in.to_pyarrow(),
    maybe_schema_out=m.schema_out.to_pyarrow(),
    name="HackerNewsFetcher",
)
```
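`process_df` is just a pandas-DataFrame-in, pandas-DataFrame-out function that the Flight server applies to each batch. A toy stand-in could look like this (purely illustrative; the pinned module's real fetcher adds batching, retries, and schema handling):

```python
import pandas as pd
import requests

# Public HackerNews Firebase endpoint for a single item
HN_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"

def toy_get_stories_batch(df: pd.DataFrame) -> pd.DataFrame:
    # The input carries two columns, maxitem and n: fetch the n items before maxitem
    maxitem, n = int(df["maxitem"].iloc[0]), int(df["n"].iloc[0])
    items = (requests.get(HN_ITEM_URL.format(i)).json() for i in range(maxitem - n, maxitem))
    return pd.DataFrame([item for item in items if item and item.get("type") == "story"])
```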
Setting Up the Backend and Storage
Let’s initialize the xorq backend and storage:
With the large input, the code below will attempt to download ~100k items from the HackerNews API, which can take a long time. To run the tutorial on a smaller dataset, leave the `name` variable set to "hn-fetcher-input-small".
= "hn-fetcher-input-small" # or use hn-fetcher-input-large
name = xo.connect()
con = ParquetStorage(source=con) storage
Building the Data Pipeline
Now, let’s set up our complete data pipeline:
```python
# Start by reading the pinned input for the fetcher
raw_expr = (
    xo.deferred_read_parquet(
        # this reads a table with two columns: maxitem and n
        con,
        xo.options.pins.get_path(name),
        name,
    )
    # Pipe into the HackerNews fetcher to get the full stories
    .pipe(do_hackernews_fetcher_udxf)
)
```
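If you want to sanity-check the fetch stage on its own before adding labeling, you can execute `raw_expr` directly (optional; this runs the fetcher):

```python
# Optional: materialize just the fetch stage and inspect the raw stories
raw_df = raw_expr.execute()
print(raw_df.shape)
print(raw_df.columns.tolist())
```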
```python
# Build the complete pipeline with filtering, labeling, and caching
t = (
    raw_expr
    # Filter stories with text
    .filter(xo._.text.notnull())
    # Apply model-assisted labeling with OpenAI
    .pipe(o.do_hackernews_sentiment_udxf, con=con)
    # Cache the labeled data to Parquet
    .cache(storage=ParquetStorage(con))
    # Filter out any labeling errors
    .filter(~xo._.sentiment.contains("ERROR"))
    # Convert sentiment strings to integer codes (useful for future ML tasks)
    .mutate(
        sentiment_int=xo._.sentiment.cases(
            {"POSITIVE": 2, "NEUTRAL": 1, "NEGATIVE": 0}.items()
        ).cast(int)
    )
)
```
Execute and Inspect the Labeled Data
Now let’s execute the pipeline to get our labeled DataFrame:
```python
# Execute the pipeline and get the final DataFrame
labeled_df = t.execute()

# Inspect the results
print(labeled_df[["id", "title", "sentiment", "sentiment_int"]].head())
```
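A quick sanity check on the labels is worthwhile before using them downstream:

```python
# Distribution of the string labels and the dtype of the integer codes
print(labeled_df["sentiment"].value_counts())
print(labeled_df["sentiment_int"].dtype)
```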
Summary
Congratulations! You’ve now: 1. Set up xorq 2. Fetched data from the HackerNews API 3. Set up local caching with ParquetStorage
3. Used OpenAI GPT to automatically label the data with sentiment analysis 4. Created a labeled dataset ready for future machine learning tasks
Next Steps
With this labeled dataset, you can now proceed to:

- Split the data into train/test sets for model development (see the sketch after this list)
- Apply text preprocessing and feature extraction techniques
- Train and evaluate various machine learning models
- Perform data analysis to gain insights about sentiment patterns in HackerNews stories
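For the first of these steps, a conventional stratified split with scikit-learn (an extra dependency, not installed above) might look like:

```python
from sklearn.model_selection import train_test_split

# Stratify on the integer labels so both splits keep the same class balance
train_df, test_df = train_test_split(
    labeled_df,
    test_size=0.2,
    random_state=42,
    stratify=labeled_df["sentiment_int"],
)
```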
Troubleshooting
- API Rate Limiting: If you encounter rate limiting from OpenAI or HackerNews, adjust the `wait_random_exponential` parameters in the helper functions (see the sketch after this list).
- Missing Files: Ensure the helper modules are in the correct locations or create them using the provided code snippets.
- OpenAI API Key Issues: Verify your API key is correctly set and has sufficient credits.
- Data Quality: Check for missing values or unexpected content in the fetched data before processing.
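For the rate-limiting item: `wait_random_exponential` is tenacity's backoff strategy, so loosening it means passing a longer maximum wait and more attempts. A sketch (wrapper function and values are illustrative, not the helpers' actual code):

```python
from tenacity import retry, stop_after_attempt, wait_random_exponential

# Retry with randomized exponential backoff: up to 60s between tries, 6 attempts
@retry(wait=wait_random_exponential(multiplier=1, max=60), stop=stop_after_attempt(6))
def call_with_backoff(fn, *args, **kwargs):
    return fn(*args, **kwargs)
```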