Data Labeling w/ LLMs
Learn how to fetch HackerNews data and automatically label data with sentiment using OpenAI GPT models
This is part 1/4.
Overview
In this tutorial, you’ll learn how to:
- Set up xorq and configure the necessary components
- Fetch data from the HackerNews API
- Use OpenAI gpt-3.5-turbo model to automatically label data with sentiment analysis
- Create a labeled dataset ready for future machine learning tasks
Prerequisites
Make sure to set your OpenAI API key in your environment:
Installation and Imports
First, install xorq and the required dependencies:
Then import the necessary modules:
The imported modules m
(hackernews_lib) and o
(openai_lib) contain utility
functions for:
- Connecting to the HackerNews Firebase API
- Fetching and processing HackerNews stories
- Making calls to OpenAI’s API for sentiment analysis
- Processing the response into structured data
You’ll need to ensure these files are accessible in your environment or create them based on the code snippets in the Appendix.
Defining the HackerNews Fetcher
We’ll define a User-Defined Exchanger Function (UDXF) that fetches HackerNews stories:
Setting Up the Backend and Storage
Let’s initialize the xorq backend and storage:
The below code will attempt to download ~100k items from HackerNew API that can
take a long time. If you want to just run the tutorial with a smaller data,
change the variable name
of the code below to "hn-fetcher-input-small"
Building the Data Pipeline
Now, let’s set up our complete data pipeline:
Execute and Inspect the Labeled Data
Now let’s execute the pipeline to get our labeled DataFrame:
This will output something like:
Summary
Congratulations! You’ve now:
- Set up xorq
- Fetched data from the HackerNews API
- Set up local caching with
ParquetStorage
- Used OpenAI GPT to automatically label the data with sentiment analysis
- Created a labeled dataset ready for future machine learning tasks
Next Steps
With this labeled dataset, you can now proceed to:
- Split the data into train/test sets for model development
- Apply text preprocessing and feature extraction techniques
- Train and evaluate various machine learning models
- Perform data analysis to gain insights about sentiment patterns in HackerNews stories