Now that you know how to prompt an LLM from HW2, we will be using some guided story generation techniques from Module 2. In this homework, you will be following a generation pipeline inspired by the Plan-and-Write system. In their work, they generated keywords from a title and then generated a story from the keywords. They tried both dynamic and static schemas to integrate the planning into their generation pipeline. This homework will focus on the “static” schema but use a pre-trained LLM instead of their RNN model.
For this assignment, we will check your ability to:
Like in the last homework, you will be using a Jupyter notebook, but instead of using OpenAI’s suite of models, we’re going to use Meta’s Llama 2 via HuggingFace🤗. Again, you can run it in your VS Code environment or upload it to Google Colab or DeepNote to do the assignment online.
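To get you started, here is a minimal sketch of loading a Llama 2 chat model through the transformers library. The checkpoint name, device settings, and generation parameters are assumptions, not the notebook's exact configuration, so follow whatever the notebook specifies.

```python
# Minimal sketch of loading Llama 2 via HuggingFace transformers.
# The checkpoint name and settings below are assumptions; the chat
# checkpoints require approved access on the HuggingFace Hub.
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")  # needs accelerate

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Quick smoke test: generate a short continuation.
out = generator("Once upon a time", max_new_tokens=50, do_sample=True)
print(out[0]["generated_text"])
```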
You will be using a portion of the data from the original Plan-and-Write work. I have already set up the data in the notebook. You will be using the stories, their titles, and the keywords they extracted, and you will only be looking at 20 stories from the dataset.
In the notebook, you are given a series of functions that will retrieve the story data for you.
- load_data will return a list of all of the data in the file.
- get_story will return a list of the sentences in the story.
- get_title will return the title of a story from a given line.
- get_keywords will return the keywords of a story from a given line.

I have taken 20 stories from the original dataset for you to work with.
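For orientation, here is one way those helpers might be called together. The exact signatures and return shapes are assumptions inferred from the descriptions above, so defer to the notebook.

```python
# Assumed usage of the provided helpers; signatures and return types are
# inferred from the descriptions above, not guaranteed to match the notebook.
data = load_data()               # list of all lines in the data file
line = data[1]                   # indices [1:21] hold the 20 assignment stories

title = get_title(line)          # the story's title
keywords = get_keywords(line)    # the keywords extracted for that story
sentences = get_story(line)      # the story as a list of (five) sentences
```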
You will be generating stories for all 20 prompts in two ways (40 generated stories in total):
- Uncontrolled generation: prompt the model with only the title.
- Controlled generation: prompt the model with the title and the extracted keywords, following the static Plan-and-Write schema.
You are welcome to use any prompting techniques (e.g., zero-shot, few-shot, chain-of-thought). Like in HW2, it will be beneficial for you to try multiple prompts until you get the best results, even if it’s just changing the wording of the prompt. However, you are only required to show your final prompt for both conditioned and unconditioned generation.
Note that the 20 assignment stories are at indices [1:21] of the reader in the load_data() function. You can use a story from any other index outside of [1:21] for your prompts; one possible setup is sketched below.
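As a rough illustration only (the prompt wording and helper usage are assumptions, and generator refers to the HuggingFace pipeline sketched earlier), the two generation conditions might look something like this:

```python
# Sketch of the two generation conditions. `generator` is the HuggingFace
# pipeline loaded earlier and `data` comes from load_data(); the prompt
# wording is only an example.
def uncontrolled_prompt(title):
    # Title only: no keywords shown to the model.
    return f"Write a five-sentence story titled '{title}'.\nStory:"

def controlled_prompt(title, keywords):
    # Title plus keywords, mirroring the static Plan-and-Write schema.
    # Assumes get_keywords returns a list of strings.
    return (f"Write a five-sentence story titled '{title}' that uses the "
            f"keywords {', '.join(keywords)}.\nStory:")

uncontrolled_stories, controlled_stories = [], []
for line in data[1:21]:  # the 20 assignment stories
    title, keywords = get_title(line), get_keywords(line)
    u = generator(uncontrolled_prompt(title), max_new_tokens=200)[0]["generated_text"]
    c = generator(controlled_prompt(title, keywords), max_new_tokens=200)[0]["generated_text"]
    uncontrolled_stories.append(u)
    controlled_stories.append(c)
```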
You will evaluate the stories in a few different ways:
a) BLEU - precision of n-grams
b) ROUGE - recall of n-grams
N-grams are a common unit in NLP for talking about words that appear next to each other, where the n denotes how many words. For example, the sentence “The dog was really happy” contains:
- 1-grams (unigrams): “The”, “dog”, “was”, “really”, and “happy”
- 2-grams (bigrams): “The dog”, “dog was”, “was really”, and “really happy”
- 3-grams (trigrams): “The dog was”, “dog was really”, and “was really happy”
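If you want to enumerate n-grams programmatically, NLTK's ngrams utility is one option (shown here purely as an illustration):

```python
# Illustration only: enumerating the bigrams of a tokenized sentence with NLTK.
from nltk.util import ngrams

tokens = "The dog was really happy".split()
print(list(ngrams(tokens, 2)))
# [('The', 'dog'), ('dog', 'was'), ('was', 'really'), ('really', 'happy')]
```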
BLEU and ROUGE are common evaluation metrics used in NLP. BLEU was created to evaluate how accurate machine translation (computational translation of one human language to another) systems were. ROUGE was created to evaluate generated summaries of text.
You will implement BLEU and ROUGE using the following libraries:
You should calculate BLEU-1, BLEU-2, ROUGE-1, ROUGE-2, and ROUGE-L, comparing both the controlled generations and the uncontrolled generations against the original stories. Specifically, the BLEU score will be modified n-gram precision. Calculate these scores over each pair of sentences in the data and then average across the 20 stories. You will be implementing the BLEU and ROUGE functions to compare the sentences in the stories one-by-one and return an average across the 5 sentences.
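One possible way to compute these scores uses NLTK's sentence_bleu (which is built on modified n-gram precision) and the rouge-score package. Treat this as a sketch under those assumptions; your notebook may specify different libraries or function stubs.

```python
# Sketch of per-sentence BLEU/ROUGE for one story, assuming NLTK and the
# rouge-score package; the notebook's required libraries may differ.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def score_story(generated_sents, original_sents):
    """Compare sentences pairwise and average each metric across the story."""
    bleu1, bleu2, r1, r2, rl = [], [], [], [], []
    for gen, ref in zip(generated_sents, original_sents):
        ref_tokens, gen_tokens = [ref.split()], gen.split()
        bleu1.append(sentence_bleu(ref_tokens, gen_tokens,
                                   weights=(1, 0, 0, 0), smoothing_function=smooth))
        bleu2.append(sentence_bleu(ref_tokens, gen_tokens,
                                   weights=(0.5, 0.5, 0, 0), smoothing_function=smooth))
        rouge = scorer.score(ref, gen)     # (target, prediction)
        r1.append(rouge["rouge1"].recall)  # ROUGE as n-gram recall
        r2.append(rouge["rouge2"].recall)
        rl.append(rouge["rougeL"].recall)
    n = max(len(bleu1), 1)
    return {name: sum(vals) / n for name, vals in
            zip(["BLEU-1", "BLEU-2", "ROUGE-1", "ROUGE-2", "ROUGE-L"],
                [bleu1, bleu2, r1, r2, rl])}
```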
Your generated stories will be stored in uncontrolled_stories and controlled_stories. Before running these generated stories through BLEU/ROUGE, you will need to cut off the prompt, since the model returns the entire string: your prompt plus the generated story after it.
Keep adjusting the prompt until you can consistently generate 5-sentence stories. If you have tried several approaches and still cannot get 5 sentences, evaluate whatever sentences it does generate; you can “pad” the story with empty strings so that BLEU/ROUGE can still compare sentence-by-sentence.
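One way to handle both post-processing steps (stripping the prompt and padding to five sentences) is sketched below. The function name and the naive period-based sentence split are illustrative choices, not requirements.

```python
# Illustrative post-processing: strip the prompt prefix and pad to 5 sentences.
def clean_generation(full_text, prompt, n_sentences=5):
    """Strip the echoed prompt and pad/truncate to n_sentences sentences."""
    # The HuggingFace pipeline echoes the prompt, so drop it from the front.
    story = full_text[len(prompt):].strip() if full_text.startswith(prompt) else full_text.strip()
    # Naive sentence split on periods; swap in a real sentence tokenizer if you prefer.
    sents = [s.strip() + "." for s in story.split(".") if s.strip()][:n_sentences]
    # Pad with empty strings so BLEU/ROUGE can still compare sentence-by-sentence.
    return sents + [""] * (n_sentences - len(sents))
```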
Please answer the following questions in a separate document and save it as a PDF. Each answer should be a few sentences long.
You should submit:
Plan-And-Write: Towards Better Automatic Storytelling - Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, Rui Yan. AAAI 2019.
Llama 2: Open Foundation and Fine-Tuned Chat Models - Hugo Touvron, et al. arXiv 2023.