The assignment is due on Monday, May 6, 2024 before 11:59PM.
Submission Link: https://classroom.google.com/c/NjUwNDE2MzEwMzQx/a/Njc2MTc5MzM5MDM1/details

Please be sure to double check the academic integrity and generative AI policies listed on the syllabus.
If you are having a lot of trouble getting Llama-2 to run, feel free to use another model. Just be sure to specify what model you're using and use the same model for all questions of the homework.
For example, ChatGPT gives you some free credit when you sign up with a new account. You could use that instead, and make another account if you need more credit and don't want to spend money refilling it.
Materials for this assignment:

Homework 3: Prompt Engineering

Learning Objectives

  • Recall how to evaluate generated output
  • Identify what prompting techniques produce better output
  • Determine when LLMs like Llama-2 would be worth using

Helpful Resources

Other ways of prompting

What to do

Start with this notebook, which also contains the data, and change the prompts given to the model to answer the questions below. Any time we ask for a prompt, please be sure to keep all of the cells containing your prompt text in the notebook. Copy the model's output into the document where you answer the questions below. (This preserves the output in case the notebook is accidentally rerun.) The numbers of prompts suggested below are minimums.

The task you will do is called the Story Cloze Test. In a cloze test, a segment of text is removed and the test taker is asked to fill in the blank. In the Story Cloze Test, the ending of a 5-sentence story is missing and the model has to decide which of 2 candidate sentences is the better ending. Examples of the task can be found here: https://cs.rochester.edu/nlp/rocstories/

All of the questions will be in relation to the Story Cloze Test. You will be doing a variation of this task: you will give the model the first 4 sentences of the story, generate the last sentence, and compare it with the “right” ending.

You will be using various prompting techniques to get the large language model (LLM) Llama-2 to do the story completion.

Select one story from the dataset to use as the story you’re evaluating on for all of the questions below.
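If you are working outside the provided notebook, a minimal sketch of loading Llama-2 through Hugging Face transformers and asking it, zero-shot, for the fifth sentence might look like the following. The model name, story text, and generation settings are placeholders; adapt them to the model and story you actually choose.

```python
# Minimal sketch (not the assignment notebook): load a Llama-2 chat checkpoint and
# ask it, zero-shot, to write the fifth sentence of a story. All specifics below
# (model name, story text, token budget) are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumes you have access to this checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")  # device_map needs accelerate

# The first four sentences of the story you selected (placeholder text).
story_context = (
    "Anna had practiced piano for months. "
    "Her recital was finally here. "
    "She walked onto the stage and sat at the bench. "
    "Her hands shook as she began to play."
)

# Plain zero-shot prompt: just ask for the fifth sentence, no examples.
prompt = (
    "Here are the first four sentences of a short story:\n"
    f"{story_context}\n"
    "Write a single fifth sentence that ends the story."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, not the prompt itself.
ending = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)
print(ending)
```

The same skeleton can be reused for the role, few-shot, and chain-of-thought questions below: only the prompt string needs to change.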

  1. (4 points, 2 points per evaluation) Provide 2 different ways you might objectively evaluate the output that the model generates. Pick one of these methods and use it for the rest of the questions to compare the outputs. If you find a method that wasn’t mentioned in the class slides, please cite your source! (One possible metric is sketched after this question list.)
  2. (2 points) First, try the generation task using zero-shot prompting in the plainest way possible. Just ask the model to do the task. Don’t give any examples for how to do it, don’t use any fancy prompting techniques. Just ask it like you’re asking a human to do the task. We will refer to this as your baseline.
    a. Provide 2 prompts.
  3. In class, we talked about providing a “role” to the model as part of the instructions.
    a. (2 pts) Provide 2 prompts trying this out.
    b. (2 pts) How does this compare to the output from the baseline? Use your objective measure from question 1 and also use your intuition for a more “subjective measure”.
  4. What happens when you use examples (i.e., few-shot prompting, where you pass some examples of how to do the task in addition to the instructions)?
    a. (4 pts) Provide 4 prompts trying out different numbers of examples.
    b. (2 pts) How does changing the number of examples affect performance? Use your objective measure and your subjective measure.
    c. (1 pt) Is there a cut-off where it isn’t helpful to have more examples?
  5. Consider chain-of-thought prompting, where you get the model to “show its work” to produce better results.
    a. (2 pts) Provide 2 prompts trying chain-of-thought prompting.
    b. (2 pts) How does this compare to your baseline? Use your objective measure and your subjective measure.
  6. Llama-2 is not considered “state of the art”, but it could still have its uses. Recall some of the tasks you looked into in HW1.
    a. (3 pts) What types of NLP tasks do you think Llama-2 would be good at? Why?
    b. (3 pts) What would it be bad at? Why?
    c. (2 pts) Was Llama-2 good at this task? Why do you think that?
  7. (2 pts) What types of “everyday” tasks (e.g., doing taxes, writing code) would Llama-2 be good at? Why?
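For Question 1, one objective measure you might consider (a sketch only, not the required choice) is n-gram overlap between the generated ending and the dataset's “right” ending, for example ROUGE-L via the rouge-score package:

```python
# Sketch of one possible objective measure: ROUGE-L overlap between the model's
# ending and the reference ending from the dataset. Assumes the rouge-score
# package is installed (pip install rouge-score). The two endings are placeholders.
from rouge_score import rouge_scorer

reference_ending = "She finished the piece and the audience applauded."               # from the dataset
generated_ending = "Anna played beautifully and everyone clapped when she finished."  # from the model

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(reference_ending, generated_ending)
print(scores["rougeL"].fmeasure)  # F1-style overlap in [0, 1]; higher means closer to the reference
```

Other reasonable choices include BLEU or embedding-based similarity; whichever you pick, remember to cite the source if it wasn’t covered in class.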

Extra credit

  • Play around with different sampling strategies. Use this guide for implementing them using Hugging Face. Pick one of the prompts that you used above and keep it fixed as you switch between sampling methods. Try 3 different sampling methods. (A sketch of a few options appears after this list.)
    a. (1 pt) What are the different ways you tried?
    b. (6 pts, 2 pts each) How does it affect the generation?
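For the extra credit, a sketch of three sampling configurations using the transformers generate() API follows. It reuses the model, tokenizer, and prompt from the earlier sketch, and the specific hyperparameter values are placeholders to experiment with, not recommendations.

```python
# Sketch of three sampling strategies via Hugging Face generate(). Assumes
# `model`, `tokenizer`, and `prompt` are defined as in the earlier sketch.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# 1. Temperature sampling: rescale the distribution before sampling.
temp_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.7)

# 2. Top-k sampling: sample only from the k most likely next tokens.
top_k_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_k=50)

# 3. Nucleus (top-p) sampling: sample from the smallest token set whose
#    cumulative probability exceeds p (top_k=0 disables top-k filtering).
top_p_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9, top_k=0)

for ids in (temp_ids, top_k_ids, top_p_ids):
    print(tokenizer.decode(ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```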

What to turn in

  • The code that you used for evaluation
  • A document with the answers to your questions, all of your prompts, and the output from the prompts

Grading

  • Question 1 - 4 points
  • Question 2 - 2 points
  • Question 3 - 4 points
  • Question 4 - 7 points
  • Question 5 - 4 points
  • Question 6 - 8 points
  • Question 7 - 2 points
  • Extra Credit - 7 points