Compute the token-overlap F1 between a predicted answer and a reference answer — the standard span-QA metric (SQuAD).
Signature: def token_f1(prediction: list, reference: list) -> float
Use Counter & Counter (multiset intersection) to count common tokens with multiplicity.
precision = common / len(prediction)
recall = common / len(reference)
F1 = 2 * p * r / (p + r), where p is precision and r is recall.
Return 0.0 if there are no common tokens (or either input is empty).
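The steps above can be sketched as follows. This is a minimal version that assumes both inputs are already tokenized lists of hashable tokens (for example, strings). It does no text normalization; the official SQuAD evaluation lowercases and strips punctuation and articles before tokenizing, and that step is left out here.

```python
from collections import Counter


def token_f1(prediction: list, reference: list) -> float:
    """Token-overlap F1 between two token lists (no normalization applied)."""
    # Either input empty: precision or recall is undefined, so return 0.0.
    if not prediction or not reference:
        return 0.0

    # Multiset intersection keeps, per token, the minimum of the two counts.
    overlap = Counter(prediction) & Counter(reference)
    common = sum(overlap.values())

    # No shared tokens: return 0.0 and avoid dividing by zero in the F1 formula.
    if common == 0:
        return 0.0

    precision = common / len(prediction)
    recall = common / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative inputs, not taken from any dataset.
    print(token_f1(["the", "cat", "sat"], ["the", "cat", "sat", "down"]))
    print(token_f1(["a", "a", "b"], ["a", "b", "b"]))
    print(token_f1([], ["a"]))
```

In the first call, three tokens are shared, so precision is 3/3 and recall is 3/4, giving F1 = 6/7. In the second, the multiset intersection counts "a" once and "b" once, so repeated tokens are not over-credited and F1 = 2/3.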