The assignment is due on Tuesday, May 5, 2026 before 11:59PM.
Submission Link (in-person section): https://blackboard.umbc.edu/ultra/courses/_96481_1/outline/assessment/test/_8407416_1?courseId=_96481_1&gradeitemView=details
Submission Link (online section): https://blackboard.umbc.edu/ultra/courses/_98413_1/outline/assessment/test/_8528735_1?courseId=_98413_1&gradeitemView=details

Please be sure to double check the academic integrity and generative AI policies listed on the syllabus.
I have set the notebook to use Mistral-7b-instruct, but feel free to use another model. Just be sure to specify what model you're using and use the same model for all questions of the homework.
Materials for this assignment:

Homework 3: Prompt Engineering and Ethics

Learning Objectives

  • Recall how to evaluate generated output
  • Identify what prompting techniques produce better output
  • Determine when smaller LLMs like Mistral/Llama-2/etc. would be worth using
  • Consider the ethical implications of using an LLM

Part 1: Prompting

Helpful Resources

Other ways of prompting

What to do

IMPORTANT! Before you start, please note that the number of tokens generated is automatically printed after the model finishes generating. Please write down this number every time you run the model, even if you don’t report that prompt for this homework. You will need the sum of these counts for Part 2; a small bookkeeping sketch follows.

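One low-effort way to track these counts is to keep a running list in its own notebook cell. This is a minimal sketch; the variable name is a hypothetical suggestion, not part of the provided notebook.

```python
# Running tally of generated-token counts across every model run.
# The name `token_counts` is a hypothetical suggestion.
token_counts = []

# After each run, record the number the notebook prints, e.g.:
token_counts.append(142)  # hypothetical count from one run

# Total you will need for Part 2:
print("Total tokens generated:", sum(token_counts))
```
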
Start with this notebook and change the prompts of the model to answer the questions below. This notebook also has the data. Any time we ask for a prompt, please be sure to keep all of the notebook cells containing your prompt text. Copy the output from the model into the document where you answer the questions below. (This will preserve the output in case the notebook is accidentally rerun.) The numbers of prompts requested below are minimums.

All of the questions relate to the following task: for a given set of steps from a WikiHow article, you will predict the title of the article. That is, you will generate a title and compare it with the original title.

You will be using various prompting techniques to get the large language model (LLM) to do the title prediction.

Select one article from the dataset provided in the notebook to use as the data you’re evaluating on for all of the questions below.

  1. (2 points for explaining why your evaluation metric is reasonable) You will need objective evaluation metrics to determine how well your prompting is doing. You will use two objective evaluation metrics throughout this homework. Take the time now to implement both in your code (see the evaluation sketch after this list).
    a. For the first metric, use BLEU. BLEU is a common evaluation metric in NLP, originally created to evaluate how accurate machine translation (the computational translation of one human language to another) systems were. You can call BLEU using this library: https://www.nltk.org/api/nltk.translate.bleu_score.html
    b. For the second evaluation metric, you will pick your own. If you find a method that wasn’t mentioned in the class slides, please cite your source!
  2. First, try the generation task using zero-shot prompting in the plainest way possible. Just ask the model to do the task. Don’t give any examples for how to do it, don’t use any fancy prompting techniques. Just ask it like you’re asking a human to do the task. We will refer to this as your baseline.
    a. (2 pts) Provide 2 prompts.
    b. (2 pts) Run your 2 implemented objective evaluation metrics and report your scores.
  3. In class, we talked about providing a “role” to the model as part of the instructions.
    a. (2 pts) Provide 2 prompts trying this out.
    b. (2 pts) How does this compare to the output from the baseline? Use both objective measures from question 1 and also use your intuition for a more “subjective measure”.
  4. What happens when you use few-shot prompting? (A prompt-construction sketch appears after this list.)
    a. (4 pts) Provide 4 prompts trying out different numbers of examples (i.e., different “n-shot” prompts).
    b. (2 pts) How does changing the number of examples affect performance? Use your objective measure and your subjective measure.
    c. (1 pt) Is there a cut-off where it isn’t helpful to have more examples?
  5. Consider chain-of-thought prompting, where you get the model to “show its work” to produce better results.
    a. (2 pts) Provide 2 prompts trying chain-of-thought prompting.
    b. (2 pts) How does this compare to your baseline? Use your objective measures and your subjective measure.
  6. Try two other techniques that we talked about in class.
    a. (2 pts) Provide both of those prompts.
    b. (2 pts) How do these compare to your baseline? Use your objective measures and your subjective measure.
  7. These models are not considered “state of the art”, but they could still have their uses. Recall some of the tasks you looked into in HW1.
    a. (3 pts) What types of NLP tasks do you think the model you chose would be good at? Why?
    b. (3 pts) What would it be bad at? Why?
    c. (2 pts) Was the model you chose good at this task? Why do you think that?
  8. (2 pts) What types of “everyday” tasks (e.g., doing taxes, writing code) would the model you chose be good at? Why?
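
For question 1, here is a minimal sketch of what the BLEU call might look like, assuming NLTK is installed (`pip install nltk`). The helper name and tokenization choices are illustrative assumptions, not requirements, and your second metric is still up to you. Note that titles are short, so smoothing and lower-order n-gram weights keep scores from collapsing to zero.

```python
# Minimal BLEU sketch for question 1a. The function name and
# tokenization are illustrative assumptions, not part of the notebook.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def title_bleu(reference_title, generated_title):
    """Score a generated title against the original WikiHow title."""
    reference = [reference_title.lower().split()]  # BLEU expects a list of references
    hypothesis = generated_title.lower().split()
    return sentence_bleu(
        reference,
        hypothesis,
        weights=(0.5, 0.5),  # unigrams and bigrams only; titles are short
        smoothing_function=SmoothingFunction().method1,  # avoid zero scores
    )

print(title_bleu("How to Make French Toast", "How to Cook French Toast"))
```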
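
For questions 2–5, one way to organize your prompts is a single template function. The step text, example demonstrations, and helper below are hypothetical placeholders; substitute the article you selected from the notebook’s dataset.

```python
# Illustrative prompt builder for the zero-shot baseline (question 2)
# and n-shot prompts (question 4). Everything here is a sketch.
def build_prompt(steps, examples=()):
    """Build a zero-shot (no examples) or n-shot title-prediction prompt."""
    prompt = "Predict the title of the WikiHow article with these steps.\n\n"
    for ex_steps, ex_title in examples:  # few-shot demonstrations, if any
        prompt += f"Steps: {ex_steps}\nTitle: {ex_title}\n\n"
    prompt += f"Steps: {steps}\nTitle:"
    return prompt

# Zero-shot baseline:
print(build_prompt("Crack the eggs. Whisk with milk. ..."))

# 1-shot variant:
demos = [("Boil water. Add pasta. ...", "How to Cook Pasta")]
print(build_prompt("Crack the eggs. Whisk with milk. ...", demos))

# For chain-of-thought (question 5), you might instead end the prompt with
# something like: "First reason step by step about what these steps
# accomplish, then give the title."
```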

Extra credit

  • Play around with different sampling strategies. Use this guide for implementing them using Hugging Face. Pick one of the prompts that you used above and keep it fixed as you switch among different ways of sampling. Try 3 different ways of sampling (a sketch follows this list).
    a. (1 pt) What are the different ways you tried?
    b. (3 pts, 1 pt each) How does each affect the generation?
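
As a rough sketch, here are three decoding strategies, assuming a Hugging Face `model` and `tokenizer` are already loaded as in the provided notebook (`prompt` stands in for whichever prompt you froze for this comparison). The `generate()` keyword arguments shown are standard Transformers options.

```python
# Three decoding strategies for the extra credit. `model`, `tokenizer`,
# and `prompt` are assumed to come from the provided notebook.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# 1. Greedy decoding: always take the most likely next token.
out_greedy = model.generate(**inputs, max_new_tokens=30, do_sample=False)

# 2. Top-k sampling: sample from the k most likely next tokens.
out_topk = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_k=50)

# 3. Nucleus (top-p) sampling: sample from the smallest token set whose
#    cumulative probability exceeds p.
out_topp = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9)

print(tokenizer.decode(out_greedy[0], skip_special_tokens=True))
```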

Part 2: Ethics

Although you are using a smaller model than some state-of-the-art models and you’re using a quantized (condensed) version of it, there can still be some ethical ramifications. In this part, you will be considering some of these.

  1. (2 points) Using this calculator, approximate how much energy you used for this homework. Select the model you used (or the closest relative to the model you used) and enter the total number of tokens generated across all of the prompts you ran, whether or not you reported them in the homework. Every call to the model adds to your total.
    Under “Environmental Impact”, select the “Grid Region” of “United States (Average)”. Report your total Carbon Emissions.
    The average household in the US generates 0.345 kg CO2 from electricity in a year. Compare this number to the number you calculated.
  2. (2 points) Were there any toxic generations to any of the prompts you provided for Part 1? Discuss them if so. If not, why do you think you didn’t see any?
  3. (2 points) Considering how the model did across your prompts, do you believe that it was trained on WikiHow data? Why or why not?
  4. (2 points) Regardless of your answer to the previous question, let’s say the model was trained on WikiHow data. How might you mitigate the impact?

What to turn in

  • The code that you used for evaluation
  • A document with the answers to your questions, all of your prompts, and the output from the prompts

Grading

Part 1: Prompting

  • Question 1 - 2 points
  • Question 2 - 4 points
  • Question 3 - 4 points
  • Question 4 - 7 points
  • Question 5 - 4 points
  • Question 6 - 4 points
  • Question 7 - 8 points
  • Question 8 - 2 points
  • Extra Credit - 4 points

Part 2: Ethics

  • Question 1 - 2 points
  • Question 2 - 2 points
  • Question 3 - 2 points
  • Question 4 - 2 points