Data Labeling w/ LLMs
This is part 1/4.
Overview
In this tutorial, you'll learn how to:

- Set up xorq and configure the necessary components
- Fetch data from the HackerNews API
- Use OpenAI's gpt-3.5-turbo model to automatically label data with sentiment analysis
- Create a labeled dataset ready for future machine learning tasks
Prerequisites
- Python 3.8+ installed on your system
- An OpenAI API key for the sentiment labeling
- Basic understanding of Python and data processing pipelines
Make sure to set your OpenAI API key in your environment:
```bash
export OPENAI_API_KEY=your_api_key
```
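If you want to fail fast when the key is missing, a quick check like the following works (a minimal sketch; the only assumption is the variable name, taken from the export above):

```python
import os

# Abort early if the OpenAI API key is not set in the environment
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; export it before running the pipeline")
```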
Installation and Imports
First, install xorq and the required dependencies:
```bash
pip install xorq pandas
```
Then import the necessary modules:

```python
import pandas as pd
import xorq as xo
import xorq.expr.datatypes as dt
from xorq.caching import ParquetStorage
from xorq.common.utils.import_utils import import_python

m = import_python(xo.options.pins.get_path("hackernews_lib", version="20250604T223424Z-2e578"))
o = import_python(xo.options.pins.get_path("openai_lib", version="20250604T223419Z-0ce44"))
```

The imported modules `m` (hackernews_lib) and `o` (openai_lib) contain utility functions for:

- Connecting to the HackerNews Firebase API
- Fetching and processing HackerNews stories
- Making calls to OpenAI's API for sentiment analysis
- Processing the response into structured data
You’ll need to ensure these files are accessible in your environment or create them based on the code snippets in the Appendix.
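For intuition, here is a minimal sketch of the kind of call openai_lib wraps. The function name `label_sentiment` and the prompt text are illustrative assumptions; only the model (gpt-3.5-turbo) and the POSITIVE/NEUTRAL/NEGATIVE label set come from this tutorial.

```python
from openai import OpenAI

def label_sentiment(text: str) -> str:
    """Illustrative sketch: classify one story's sentiment with gpt-3.5-turbo."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the sentiment of the following HackerNews story "
                    "as POSITIVE, NEUTRAL, or NEGATIVE. Reply with one word."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()
```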
Defining the HackerNews Fetcher
We'll define a user-defined exchange function (UDXF) that fetches HackerNews stories:
```python
do_hackernews_fetcher_udxf = xo.expr.relations.flight_udxf(
    process_df=m.get_hackernews_stories_batch,
    maybe_schema_in=m.schema_in.to_pyarrow(),
    maybe_schema_out=m.schema_out.to_pyarrow(),
    name="HackerNewsFetcher",
)
```
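`process_df` is just a pandas-DataFrame-in, pandas-DataFrame-out function that the Flight server applies to each batch. A toy stand-in could look like this (purely illustrative; the pinned module's real fetcher adds batching, retries, and schema handling):

```python
import pandas as pd
import requests

# Public HackerNews Firebase endpoint for a single item
HN_ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"

def toy_get_stories_batch(df: pd.DataFrame) -> pd.DataFrame:
    # The input carries two columns, maxitem and n: fetch the n items before maxitem
    maxitem, n = int(df["maxitem"].iloc[0]), int(df["n"].iloc[0])
    items = (requests.get(HN_ITEM_URL.format(i)).json() for i in range(maxitem - n, maxitem))
    return pd.DataFrame([item for item in items if item and item.get("type") == "story"])
```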
Setting Up the Backend and Storage
Let’s initialize the xorq backend and storage:
With the large input, the code below will attempt to download ~100k items from the HackerNews API, which can take a long time. To run the tutorial on a smaller dataset, leave the `name` variable set to "hn-fetcher-input-small".
= "hn-fetcher-input-small" # or use hn-fetcher-input-large
name = xo.connect()
con = ParquetStorage(source=con) storage
Building the Data Pipeline
Now, let’s set up our complete data pipeline:
```python
# Start by reading the pinned input for the fetcher
raw_expr = (
    xo.deferred_read_parquet(
        # this reads a table with two columns: maxitem and n
        con,
        xo.options.pins.get_path(name),
        name,
    )
    # Pipe into the HackerNews fetcher to get the full stories
    .pipe(do_hackernews_fetcher_udxf)
)
```
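If you want to sanity-check the fetch stage on its own before adding labeling, you can execute `raw_expr` directly (optional; this runs the fetcher):

```python
# Optional: materialize just the fetch stage and inspect the raw stories
raw_df = raw_expr.execute()
print(raw_df.shape)
print(raw_df.columns.tolist())
```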
```python
# Build the complete pipeline with filtering, labeling, and caching
t = (
    raw_expr
    # Filter stories with text
    .filter(xo._.text.notnull())
    # Apply model-assisted labeling with OpenAI
    .pipe(o.do_hackernews_sentiment_udxf, con=con)
    # Cache the labeled data to Parquet
    .cache(storage=ParquetStorage(con))
    # Filter out any labeling errors
    .filter(~xo._.sentiment.contains("ERROR"))
    # Convert sentiment strings to integer codes (useful for future ML tasks)
    .mutate(
        sentiment_int=xo._.sentiment.cases(
            {"POSITIVE": 2, "NEUTRAL": 1, "NEGATIVE": 0}.items()
        ).cast(int)
    )
)
```
Execute and Inspect the Labeled Data
Now let’s execute the pipeline to get our labeled DataFrame:
```python
# Execute the pipeline and get the final DataFrame
labeled_df = t.execute()

# Inspect the results
print(labeled_df[["id", "title", "sentiment", "sentiment_int"]].head())
```
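A quick sanity check on the labels is worthwhile before using them downstream:

```python
# Distribution of the string labels and the dtype of the integer codes
print(labeled_df["sentiment"].value_counts())
print(labeled_df["sentiment_int"].dtype)
```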
Summary
Congratulations! You’ve now: 1. Set up xorq 2. Fetched data from the HackerNews API 3. Set up local caching with ParquetStorage
3. Used OpenAI GPT to automatically label the data with sentiment analysis 4. Created a labeled dataset ready for future machine learning tasks
Next Steps
With this labeled dataset, you can now proceed to:

- Split the data into train/test sets for model development (see the sketch after this list)
- Apply text preprocessing and feature extraction techniques
- Train and evaluate various machine learning models
- Perform data analysis to gain insights about sentiment patterns in HackerNews stories
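For the first of these steps, a conventional stratified split with scikit-learn (an extra dependency, not installed above) might look like:

```python
from sklearn.model_selection import train_test_split

# Stratify on the integer labels so both splits keep the same class balance
train_df, test_df = train_test_split(
    labeled_df,
    test_size=0.2,
    random_state=42,
    stratify=labeled_df["sentiment_int"],
)
```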
Troubleshooting
- API Rate Limiting: If you encounter rate limiting from OpenAI or HackerNews, adjust the `wait_random_exponential` parameters in the helper functions (see the sketch after this list).
- Missing Files: Ensure the helper modules are in the correct locations or create them using the provided code snippets.
- OpenAI API Key Issues: Verify your API key is correctly set and has sufficient credits.
- Data Quality: Check for missing values or unexpected content in the fetched data before processing.
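For the rate-limiting item: `wait_random_exponential` is tenacity's backoff strategy, so loosening it means passing a longer maximum wait and more attempts. A sketch (wrapper function and values are illustrative, not the helpers' actual code):

```python
from tenacity import retry, stop_after_attempt, wait_random_exponential

# Retry with randomized exponential backoff: up to 60s between tries, 6 attempts
@retry(wait=wait_random_exponential(multiplier=1, max=60), stop=stop_after_attempt(6))
def call_with_backoff(fn, *args, **kwargs):
    return fn(*args, **kwargs)
```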