Each of the following scenarios describes the result of some (made up) classifier: there is a list of the correct labels and the corresponding predictions. For each scenario, compute accuracy, recall, precision, and F1 score. You may verify your answers with code (see the sketch after part d), but you must show work, or some intermediate steps, to receive full credit on this problem.
Slides from 2/12 and 2/14 will be the most useful for this question.
a) (4 points) A binary classification result, where the correct labels are [T, T, F, T, F, T, F, T] and the predicted labels are [T, F, T, T, F, F, F, T]. Assume T means “true” (the desired class) and F (“false”) is the “default” class. Compute precision, recall, and F1 with respect to the T class.
b) (4 points) A binary classification result, where the correct labels are [T, F, F, F, F, F, F, T] and the predicted labels are [F, T, F, F, F, F, F, F]. Assume T means “true” (the desired class) and F (“false”) is the “default” class. Compute precision, recall, and F1 with respect to the T class.
c) (8 points) A multiclass classification result, where the correct labels are [T, F, M, F, F, F, M, T] and the predicted labels are [F, T, M, F, F, F, F, T]. Assume T means “true,” M means “maybe,” and F (“false”) is the “default” class. Do not include the F class in the precision, recall, and F1 computations.

d) (8 points) A multiclass classification result, where the correct labels are [C, C, A, C, C, C, C, C, B, A, C, C, C] and the predicted labels are [C, C, C, C, C, C, C, C, B, A, A, C, C]. In this example, there is no “good” default option, so we can consider A, B, and C to be all possible classes/labels of interest.
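Should you choose to verify with code, a minimal sketch along these lines works; sklearn.metrics implements all four measures (the lists shown are the ones from part a):

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # Labels from part (a); swap in the lists from whichever part you are checking.
    y_true = ["T", "T", "F", "T", "F", "T", "F", "T"]
    y_pred = ["T", "F", "T", "T", "F", "F", "F", "T"]

    print("accuracy :", accuracy_score(y_true, y_pred))
    # pos_label selects which label counts as the positive class (here, T).
    print("precision:", precision_score(y_true, y_pred, pos_label="T"))
    print("recall   :", recall_score(y_true, y_pred, pos_label="T"))
    print("F1       :", f1_score(y_true, y_pred, pos_label="T"))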
One very important aspect of NLP is the ability to identify good and reasonable baseline systems. These baselines help you contextualize any progress (or harm) your proposed approach makes. Coming up with them is not always easy. However, in classification, a very common one is the “most frequent class” baseline: it identifies the most common label y' in the training set and then, when presented with any evaluation instance, simply returns y'.
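A minimal sketch of such a baseline (the class and variable names here are illustrative, not a required interface):

    from collections import Counter

    class MostFrequentClassBaseline:
        """Predicts the single most common training label for every instance."""

        def fit(self, train_labels):
            # most_common(1) returns [(label, count)] for the top label.
            self.label = Counter(train_labels).most_common(1)[0][0]
            return self

        def predict(self, instances):
            # The inputs are ignored entirely; always return the memorized label.
            return [self.label] * len(instances)

    baseline = MostFrequentClassBaseline().fit(["x", "y", "x", "x"])
    print(baseline.predict(["anything", "at all"]))  # ['x', 'x']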
For this question, I strongly recommend using the existing implementations of accuracy, recall, and precision in the Python library sklearn, e.g., sklearn.metrics.precision_score and sklearn.metrics.recall_score.
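Both functions take an average argument, which is how you select the macro and micro variants asked for in part b below; for example, with placeholder label lists:

    from sklearn.metrics import precision_score, recall_score

    # Placeholder labels; in part (b) these would come from the RTE dev set.
    y_true = ["entailment", "not_entailment", "entailment", "not_entailment"]
    y_pred = ["entailment", "entailment", "entailment", "entailment"]

    # average="macro": score each class separately, then take the unweighted mean.
    # average="micro": pool all decisions, then compute one global score.
    print(precision_score(y_true, y_pred, average="macro", zero_division=0))
    print(precision_score(y_true, y_pred, average="micro"))
    print(recall_score(y_true, y_pred, average="macro"))
    print(recall_score(y_true, y_pred, average="micro"))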
a) (1 point) Consider a small training set with 5 instances. If the class labels for those 5 instances are [α, α, β, γ, α], respectively, what label would the most frequent class baseline return?
b) (12 points) Working from the GLUE RTE corpus, implement a most frequent class baseline for predicting entailment. Evaluate this baseline on the dev set only, using 5 measures: accuracy, macro precision, macro recall, micro precision, and micro recall.
Turn in your code (5 points), these 5 scores (5 points), and a brief paragraph (2 points) describing what you observe from using this baseline and analyzing how reasonable its predictions are.
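If you need a programmatic copy of the corpus, one option is the Hugging Face datasets library (an assumption here, not a requirement; any prepared copy of RTE works):

    from collections import Counter
    from datasets import load_dataset

    # GLUE RTE: labels are 0 (entailment) and 1 (not_entailment).
    rte = load_dataset("glue", "rte")
    dev_labels = rte["validation"]["label"]
    print(Counter(dev_labels))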
Starting with prepped data like you did in the “Knowledge Check: Data Prep” assignment, train a basic neural network and compare it to your baseline.
Use torch.nn.RNN with a hidden size of 128.

Train another type of neural network and repeat steps 4-6 on this new network. How does this network compare to the previous network and the most frequent class baseline? Why do you think it performed that way?