Each of the following scenarios describes the output of a (made-up) classifier: a list of the correct labels and the corresponding predictions. For each scenario, compute accuracy, recall, precision, and F1 score. You may verify your answers with code, but you must show work, or at least some intermediate steps, to receive full credit on this problem.
Slides from 2/18 and 2/20 will be the most useful for this question.
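For reference, with TP, FP, TN, and FN denoting the counts of true positives, false positives, true negatives, and false negatives for the class of interest, the standard definitions are:

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = (2 * precision * recall) / (precision + recall)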
a) (4 points total) A binary classification result, where the correct labels are
[T, T, F, T, F, T, F, T]
and the predicted labels are
[T, F, T, T, F, F, F, T].
Assume T means “true” (the desired class) and F (“false”) is the “default” class.
Compute precision, recall, and F1 with respect to T.

b) (4 points total) A binary classification result, where the correct labels are
[T, F, F, F, F, F, F, T]
and the predicted labels are
[F, T, F, F, F, F, F, F].
Assume T means “true” (the desired class) and F (“false”) is the “default” class.
Compute precision, recall, and F1 with respect to T.

c) (8 points total) A multiclass classification result, where the correct labels are
[T, F, M, F, F, F, M, T]
and the predicted labels are
[F, T, M, F, F, F, F, T].
Assume T means “true,” M means “maybe,” and F (“false”) is the “default” class.
Do not include the F class in the precision, recall, and F1 computations.

d) (8 points total) A multiclass classification result, where the correct labels are
[C, C, A, C, C, C, C, C, B, A, C, C, C]
and the predicted labels are
[C, C, C, C, C, C, C, C, B, A, A, C, C].
In this example, there is no “good” default option, so we can consider A, B, and C to be all possible classes/labels of interest.
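If you do check your work in code, a minimal sketch using sklearn might look like the following; the gold and pred lists here are placeholders, not the labels from the parts above.

# Sketch for verifying hand-computed metrics with sklearn.
# The gold/pred lists are placeholders; substitute each part's labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

gold = ["T", "T", "F", "F"]  # placeholder correct labels
pred = ["T", "F", "F", "T"]  # placeholder predicted labels

print("accuracy: ", accuracy_score(gold, pred))
# Binary parts (a) and (b): score with respect to the desired class T.
print("precision:", precision_score(gold, pred, pos_label="T"))
print("recall:   ", recall_score(gold, pred, pos_label="T"))
print("F1:       ", f1_score(gold, pred, pos_label="T"))
# Multiclass parts (c) and (d): pass labels=[...] listing only the classes of
# interest, plus average=None (per class) or average="macro".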
One very important aspect of NLP is the ability to identify good and reasonable baseline systems. These baselines help you contextualize any progress (or harm) your proposed approach makes. Coming up with them is not always easy. In classification, however, a very common one is the “most frequent class” baseline: it identifies the most common label y' in the training set and then, when presented with any evaluation instance, simply returns y'.
The purpose of this question is to (1) set up a naive baseline to compare against the model you'll build in Question 3, and (2) implement evaluation metrics in Python.
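As an illustration of the idea (the function and variable names here are illustrative, not a required structure):

# Sketch: a most frequent class baseline.
from collections import Counter

def fit_most_frequent_class(train_labels):
    """Return the most common label y' in the training set."""
    return Counter(train_labels).most_common(1)[0][0]

def predict_most_frequent_class(y_prime, instances):
    """Ignore the instances entirely; return y' for each one."""
    return [y_prime for _ in instances]

# Illustrative usage on made-up labels:
y_prime = fit_most_frequent_class(["a", "b", "a"])
print(predict_most_frequent_class(y_prime, ["x1", "x2"]))  # ['a', 'a']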
a) (1 point) As a toy problem, consider a small training set with 5 instances whose class labels are [α, α, β, γ, α], respectively. What label would the most frequent class baseline return?
b) (12 points) Working from the GLUE RTE corpus, calculate a most frequent class baseline on the training set for predicting entailment. Implement these 5 metrics: accuracy, macro precision, macro recall, micro precision, and micro recall. I strongly recommend using the existing implementations of accuracy, recall, and precision in the Python library sklearn, e.g., sklearn.metrics.precision_score and sklearn.metrics.recall_score. Then evaluate your baseline “model” on only the dev/validation set using the 5 metrics.
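A minimal sketch of the evaluation step, assuming you already have the validation-set gold labels and your baseline's predictions in two parallel lists (gold_dev and baseline_preds are illustrative names, and the label strings shown are placeholders):

# Sketch: the 5 metrics via sklearn, on placeholder data.
from sklearn.metrics import accuracy_score, precision_score, recall_score

gold_dev = ["entailment", "not_entailment", "entailment"]  # placeholder gold labels
baseline_preds = ["entailment"] * len(gold_dev)            # most frequent class repeated

print("accuracy:       ", accuracy_score(gold_dev, baseline_preds))
print("macro precision:", precision_score(gold_dev, baseline_preds, average="macro", zero_division=0))
print("macro recall:   ", recall_score(gold_dev, baseline_preds, average="macro", zero_division=0))
print("micro precision:", precision_score(gold_dev, baseline_preds, average="micro", zero_division=0))
print("micro recall:   ", recall_score(gold_dev, baseline_preds, average="micro", zero_division=0))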
Turn in your code (5 points), these 5 scores (5 points), and a brief paragraph (2 points) that both describes what you observe from using this baseline and analyzes how reasonable its predictions are.
Now you will train a basic recurrent neural network and compare it to your baseline. Please note that you will be graded on how well you implement your model and on your analysis of how well it did, not on the performance of the model itself.
Join the two sentences of each input pair with a || separator. Be sure to keep the sentences together as a single input, but treat the || as its own token type. Build your recurrent network with torch.nn.RNN, using a hidden size of 128 (see the sketch below). Train another type of neural network and repeat steps 4-6 on this new network. Then answer the following questions:
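For concreteness, here is a minimal sketch of such a model. It assumes inputs are batches of padded token IDs; the embedding layer, vocabulary size, and linear output head are illustrative choices, and only torch.nn.RNN and the hidden size of 128 come from the instructions above.

# Sketch: a recurrent classifier built on torch.nn.RNN (hidden size 128).
import torch
import torch.nn as nn

class RNNClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_size=128, num_classes=2):
        super().__init__()
        # The || separator is just one more ID in the vocabulary.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        embedded = self.embed(token_ids)
        _, h_n = self.rnn(embedded)      # h_n: (1, batch, hidden_size)
        return self.out(h_n.squeeze(0))  # logits: (batch, num_classes)

# Illustrative usage with a toy batch of random token IDs:
model = RNNClassifier(vocab_size=1000)
logits = model(torch.randint(0, 1000, (4, 20)))
print(logits.shape)  # torch.Size([4, 2])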