This assignment is due on Monday, October 7, 2024 before 11:59PM.
You can download the materials for this assignment here:

Homework 2: Prompting and Fine-tuning

In this homework, we’re going to use OpenAI’s API to generate text adventure game components automatically. Starting with the prompting ideas from class and generating descriptions using the Playground, we’ll show how to fine-tune models to perform specific tasks. In particular, you will generate room descriptions and item properties for text adventure games.

Learning Objectives

For this assignment, we will check your ability to:

  • Use the OpenAI API for few-shot prompting of GPT models
  • Use the OpenAI API to fine-tune earlier GPT models
  • Set up data for fine-tuning
  • Compare the output of an early fine-tuned model to modern few-shot output

Getting Started

If you haven’t already done so, please complete the in-class activity on Generating Room Descriptions. This will give you a good idea of how the model should be prompted without dealing with the API.

Models

OpenAI has several different models. You will probably see the chat models gpt-4o and gpt-4o-mini, but there are older models like gpt-3.5-turbo and the completions model davinci-002 (GPT-3). These differ from each other in several dimensions:

  • The context length (how long each message can be, and how many messages of history the conversation can have)
  • The number of model parameters (a larger number of parameters tends to result in higher-quality output)
  • The speed of the model (gpt-3.5-turbo generates output more quickly)
  • The cost of the model (gpt-4o is more expensive)
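
For orientation, here is a minimal sketch of calling one of these chat models through the openai Python package (v1+). The model name, system message, and prompt are placeholders, and it assumes your OPENAI_API_KEY is set in the environment.

```python
# A minimal sketch of a chat completion call with the openai Python package (v1+).
# Assumes the OPENAI_API_KEY environment variable is set; the model name and
# prompt below are placeholders you can swap out.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # try gpt-4o or gpt-3.5-turbo to compare cost/quality/speed
    messages=[
        {"role": "system", "content": "You are a text adventure game designer."},
        {"role": "user", "content": "Describe a dusty wizard's tower in two sentences."},
    ],
)
print(response.choices[0].message.content)
```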

Prompt Design

You can design prompts to get GPT to do all sorts of surprising things. For instance, GPT-3/4 can perform few-shot learning: given a few examples of a task, it can “learn” the pattern very quickly and then be used for classification tasks. It often helps to tell the model explicitly what you want it to do. Use some of the tips and tricks we talked about in class.
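
One way to set this up with the chat API is to pass the in-context examples as alternating user/assistant messages, as in the hedged sketch below; the example rooms and descriptions are invented for illustration.

```python
# A sketch of few-shot prompting: a couple of in-context examples followed by
# the new case we want the model to complete. The examples are invented here.
from openai import OpenAI

client = OpenAI()

few_shot_messages = [
    {"role": "system", "content": "Write one-sentence room descriptions for a fantasy text adventure."},
    # In-context examples demonstrating the pattern:
    {"role": "user", "content": "Room: throne room"},
    {"role": "assistant", "content": "Gilded banners hang above a cracked marble throne."},
    {"role": "user", "content": "Room: dungeon cell"},
    {"role": "assistant", "content": "Rusted chains dangle from walls slick with moss."},
    # The new input the model should follow the pattern for:
    {"role": "user", "content": "Room: alchemist's workshop"},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=few_shot_messages)
print(response.choices[0].message.content)
```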

Fine-Tuning

Next, we’ll take a look at how to fine-tune the OpenAI models to perform a specific task. You can use few-shot learning when you have a few dozen training examples, and you can use fine-tuning when you have several hundred. With a few hundred training examples, it isn’t possible to fit them all into a prompt, since GPT-3 has a limit of 2,048 tokens per prompt.
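
For legacy completions models such as davinci-002, the training data is a JSONL file of prompt/completion pairs. The sketch below shows the general shape, assuming that format; the two example rows are invented, and the notebook walks you through building the real file from the LIGHT data.

```python
# A minimal sketch of the JSONL training-data format for fine-tuning a legacy
# completions model such as davinci-002. The example rows are invented here;
# your actual rows will come from the LIGHT data prepared in the notebook.
import json

examples = [
    {"prompt": "Location: Throne Room\nDescription:",
     "completion": " Gilded banners hang above a cracked marble throne.\n"},
    {"prompt": "Location: Dungeon Cell\nDescription:",
     "completion": " Rusted chains dangle from walls slick with moss.\n"},
]

# One JSON object per line; completions start with a space and end with a stop sequence.
with open("room_descriptions_train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```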

For your homework, you’ll fine-tune GPT-3 to generate different parts of text adventure games. Specifically, we’ll train GPT-3 to:

  1. Generate descriptions of locations
  2. Predict an item’s properties
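
Once the JSONL file exists, the general workflow is to upload it, start a fine-tuning job, and then query the resulting model. The sketch below follows that flow with the openai Python package (v1+); the file name and fine-tuned model name are placeholders, so defer to the notebook for the exact calls.

```python
# A hedged sketch of launching a fine-tuning job and calling the result with
# the openai Python package (v1+). File and model names are placeholders.
from openai import OpenAI

client = OpenAI()

# Upload the JSONL training file prepared above.
training_file = client.files.create(
    file=open("room_descriptions_train.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job on the legacy GPT-3 base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="davinci-002",
)
print(job.id, job.status)

# Once the job finishes, query the fine-tuned model via the completions endpoint.
# Replace the model name below with the one reported by the finished job.
response = client.completions.create(
    model="ft:davinci-002:your-org::abc123",  # hypothetical fine-tuned model name
    prompt="Location: Wizard's Tower\nDescription:",
    max_tokens=60,
    stop=["\n"],
)
print(response.choices[0].text)
```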

Data

We are going to use a text adventure dataset that was developed by Facebook AI Research for their paper Learning to Speak and Act in a Fantasy Text Adventure Game.

Here’s the paper’s abstract:

We introduce a large-scale crowdsourced text adventure game as a research platform for studying grounded dialogue. In it, agents can perceive, emote, and act whilst conducting dialogue with other agents. Models and humans can both act as characters within the game. We describe the results of training state-of-the-art generative and retrieval models in this setting. We show that in addition to using past dialogue, these models are able to effectively use the state of the underlying world to condition their predictions. In particular, we show that grounding on the details of the local environment, including location descriptions, and the objects (and their affordances) and characters (and their previous actions) present within it allows better predictions of agent behavior and dialogue. We analyze the ingredients necessary for successful grounding in this setting, and how each of these factors relate to agents that can talk and act successfully.

Their data is called the LIGHT dataset (Learning in Interactive Games with Humans and Text). It contains 663 locations, 3462 objects and 1755 characters. I have divided this data into training/dev/test splits. We will use this data to fine-tune GPT-3 to generate descriptions of rooms and items.

Jupyter Notebook

You will be working on this Jupyter Notebook for Fine-Tuning/Prompting on LIGHT Environment Data, which you can run in your VS Code environment or upload to Google Colab or DeepNote to do the assignment online.

In addition to working your way through the Jupyter Notebook, I recommend reading the OpenAI documentation, and trying the examples in the Playground.

What to submit

You should submit your completed Jupyter Notebook to Blackboard. You can work in pairs.

Grading

  • Run fine-tuning code for room descriptions (1 pt)
  • Fine-tune an additional model for item properties
    • Set up training data (5 pts)
    • Fine-tune the model (5 pts)
    • Call the model (7 pts, one per property)
  • Call the few-shot model for item properties
    • Try multiple prompts (5 pts)
    • Zero-shot, one-shot, and five-shot prompts – submit prompt/output pairs (6 pts)
  • Evaluation
    • Implement precision and recall using scikit-learn (2 pts; see the sketch after this list)
    • Run precision and recall over your fine-tuned item model (1 pt)
    • Run precision and recall over your one-shot item model (1 pt)
    • Comparison questions (6 pts)
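
For the evaluation portion, one reasonable setup is to encode each item’s gold and predicted properties as binary indicator vectors and score them with scikit-learn, as in the sketch below; the property vocabulary and predictions shown are invented for illustration.

```python
# A hedged sketch of precision/recall with scikit-learn for the item-property
# task, assuming properties are encoded as binary indicator vectors.
# The property names and example predictions below are invented for illustration.
from sklearn.metrics import precision_score, recall_score

# Gold vs. predicted properties for three items over a fixed property vocabulary,
# e.g. [gettable, drinkable, edible, weapon, wearable, container, surface].
y_true = [
    [1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1],
]
y_pred = [
    [1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1],
]

# Micro-averaging pools all property decisions before computing the scores.
print("precision:", precision_score(y_true, y_pred, average="micro"))
print("recall:", recall_score(y_true, y_pred, average="micro"))
```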

Recommended readings

OpenAI API Documentation
Language Models are Few-Shot Learners - Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. NeurIPS 2020.
Learning to Speak and Act in a Fantasy Text Adventure Game - Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktäschel, Douwe Kiela, Arthur Szlam, Jason Weston. EMNLP 2019.
Generating Interactive Worlds with Text - Angela Fan, Jack Urbanek, Pratik Ringshia, Emily Dinan, Emma Qian, Siddharth Karamcheti, Shrimai Prabhumoye, Douwe Kiela, Tim Rocktäschel, Arthur Szlam, Jason Weston. AAAI 2020.