
Overview

In this tutorial, you’ll learn how to:

  • Set up xorq and configure the necessary components
  • Fetch data from the HackerNews API
  • Use OpenAI's gpt-3.5-turbo model to automatically label data with sentiment analysis
  • Create a labeled dataset ready for future machine learning tasks

Prerequisites

Make sure to set your OpenAI API key in your environment:

export OPENAI_API_KEY=your_api_key
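
If you want to fail fast when the key is missing, a quick check from Python (standard library only) looks like this:

import os

# Stop early with a clear message if the OpenAI key has not been exported
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; export it before running this tutorial")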

Installation and Imports

First, install xorq and the required dependencies:

pip install xorq pandas 

Then import the necessary modules:

import pandas as pd
import xorq as xo
import xorq.expr.datatypes as dt

from xorq.caching import ParquetStorage
from xorq.common.utils.import_utils import import_python

m = import_python(xo.options.pins.get_path("hackernews_lib"))
o = import_python(xo.options.pins.get_path("openai_lib"))

The imported modules m (hackernews_lib) and o (openai_lib) contain utility functions for:

  • Connecting to the HackerNews Firebase API
  • Fetching and processing HackerNews stories
  • Making calls to OpenAI’s API for sentiment analysis
  • Processing the response into structured data

You’ll need to ensure these files are accessible in your environment or create them based on the code snippets in the Appendix.
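
If you need to write these helpers yourself (the Appendix covers them in full), note that the pipeline below only touches a small surface area of each module. A rough, non-authoritative sketch of the expected interface, with names inferred from the calls used in this tutorial, looks like:

# hackernews_lib: expected interface (sketch only)
import pandas as pd

schema_in = ...   # xorq schema with columns maxitem (int64) and n (int64)
schema_out = ...  # xorq schema describing full HackerNews items (id, title, text, ...)

def get_hackernews_stories_batch(df: pd.DataFrame) -> pd.DataFrame:
    """Take a batch of (maxitem, n) rows, fetch the corresponding items from the
    HackerNews Firebase API, and return them as a DataFrame matching schema_out."""
    ...

# openai_lib: expected interface (sketch only)
def do_hackernews_sentiment_udxf(expr, con):
    """Wrap a gpt-3.5-turbo sentiment call as a flight UDXF and return the
    expression with an added sentiment string column."""
    ...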

Defining the HackerNews Fetcher

We’ll define a User-Defined Exchange Function (UDXF) that fetches HackerNews stories:

do_hackernews_fetcher_udxf = xo.expr.relations.flight_udxf(
    process_df=m.get_hackernews_stories_batch,
    maybe_schema_in=m.schema_in.to_pyarrow(),
    maybe_schema_out=m.schema_out.to_pyarrow(),
    name="HackerNewsFetcher",
)
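
flight_udxf wraps a plain DataFrame-in, DataFrame-out function so it can run as an Arrow Flight exchange step inside an expression. Because process_df here is an ordinary pandas function, you can sanity-check it on a tiny batch before wiring it into the pipeline. This is only a sketch: the maxitem value is illustrative, and the call hits the live HackerNews API:

import pandas as pd

# A one-row batch matching schema_in (maxitem, n); the values are illustrative only
sample_batch = pd.DataFrame({"maxitem": [43083845], "n": [5]})

stories = m.get_hackernews_stories_batch(sample_batch)
print(stories.columns.tolist())
print(stories.head())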

Setting Up the Backend and Storage

Let’s initialize the xorq backend and storage:

The code below will attempt to download roughly 100k items from the HackerNews API, which can take a long time. If you want to run the tutorial on a smaller dataset instead, set the name variable below to "hn-fetcher-input-small".

name = "hn-fetcher-input-large" # or use hn-fercher-input-small to avoid downloading all data
con = xo.connect()
storage = ParquetStorage(source=con)

Building the Data Pipeline

Now, let’s set up our complete data pipeline:

# Start by reading the pinned input that tells the fetcher which items to download
raw_expr = (
    xo.deferred_read_parquet(
        con,
        xo.options.pins.get_path(name),  # path to a pinned Parquet file with two columns: maxitem and n
        name,
    )
    # Pipe into the HackerNews fetcher to get the full stories
    .pipe(do_hackernews_fetcher_udxf)
)

# Build complete pipeline with filtering, labeling, and caching
t = (
    raw_expr
    # Filter stories with text
    .filter(xo._.text.notnull())
    # Apply model-assisted labeling with OpenAI
    .pipe(o.do_hackernews_sentiment_udxf, con=con)
    # Cache the labeled data to Parquet
    .cache(storage=storage)
    # Filter out any labeling errors
    .filter(~xo._.sentiment.contains("ERROR"))
    # Convert sentiment strings to integer codes (useful for future ML tasks)
    .mutate(
        sentiment_int=xo._.sentiment.cases(
            {"POSITIVE": 2, "NEUTRAL": 1, "NEGATIVE": 0}.items()
        ).cast(int)
    )
)
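
Nothing has run yet at this point; t is a deferred expression. Assuming the usual ibis-style expression API that xorq exposes, you can inspect the expected output schema without triggering the fetch or the OpenAI calls:

# Inspect the deferred pipeline's output schema (no data is fetched yet)
print(t.schema())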

Execute and Inspect the Labeled Data

Now let’s execute the pipeline to get our labeled DataFrame:

# Execute the pipeline and get the final DataFrame
labeled_df = t.execute()

# Inspect the results
print(labeled_df[["id", "title", "sentiment", "sentiment_int"]].head())

This will output something like:

         id                                              title sentiment  sentiment_int
0  43083439  Show HN: Xenoeye – high performance network tr...  POSITIVE              2
1  43083558                   Toronto 'Plane Crash Submissions  NEGATIVE              0
2  43083656  Show HN: Generic and variadic printing library...  POSITIVE              2
3  43083755  Show HN: WebMorph – Automate Your Website Tran...  POSITIVE              2
4  43083845                 Ask HN: Small Ideas vs. Big Ideas?  POSITIVE              2
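
Because labeled_df is a plain pandas DataFrame at this point, you can persist it for later work with standard pandas I/O (the file name below is just an example):

# Save the labeled dataset for reuse in downstream ML experiments
labeled_df.to_parquet("hn_sentiment_labeled.parquet", index=False)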

Summary

Congratulations! You’ve now:

  1. Set up xorq
  2. Fetched data from the HackerNews API
  3. Set up local caching with ParquetStorage
  4. Used OpenAI GPT to automatically label the data with sentiment analysis
  5. Created a labeled dataset ready for future machine learning tasks

Next Steps

With this labeled dataset, you can now proceed to:

  • Split the data into train/test sets for model development (see the sketch after this list)
  • Apply text preprocessing and feature extraction techniques
  • Train and evaluate various machine learning models
  • Perform data analysis to gain insights about sentiment patterns in HackerNews stories
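
As a starting point for the first item, here is a minimal train/test split sketch using scikit-learn (not among this tutorial's dependencies, so install it first with pip install scikit-learn):

from sklearn.model_selection import train_test_split

# Stratify on the integer labels so both splits keep the class balance
train_df, test_df = train_test_split(
    labeled_df,
    test_size=0.2,
    stratify=labeled_df["sentiment_int"],
    random_state=42,
)
print(len(train_df), len(test_df))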

Further Reading

Appendix

Helper Modules Structure

Troubleshooting