Rank | ∆ | Team | Evidence F1 | ∆ | Accuracy | ∆ | FEVER Score | ∆ |
---|---|---|---|---|---|---|---|---|
1 | | UNC-NLP | 0.5322 | +0.0026 | 0.6798 | -0.0023 | 0.6398 | -0.0023 |
2 | | UCL Machine Reading Group | 0.3521 | +0.0024 | 0.6744 | -0.0018 | 0.6234 | -0.0019 |
3 | | Athene UKP TU Darmstadt | 0.3733 | +0.0036 | 0.6522 | -0.0024 | 0.6132 | -0.0026 |
4 | | Papelo | 0.6471 | -0.0013 | 0.6074 | -0.0034 | 0.5704 | -0.0032 |
5 | | SWEEPer | 0.2994 | +0.0025 | 0.5964 | -0.0009 | 0.4986 | -0.0009 |
6 | | ColumbiaNLP | 0.3547 | +0.0014 | 0.5728 | -0.0018 | 0.4888 | -0.0018 |
7 | | The Ohio State University | 0.5854 | +0.0001 | 0.4989 | -0.0022 | 0.4322 | -0.0020 |
8 | | GESIS Cologne | 0.1981 | +0.0021 | 0.5395 | -0.0021 | 0.4058 | -0.0019 |
9 | +1 | nayeon7lee | 0.4929 | +0.0017 | 0.5125 | -0.0001 | 0.3858 | -0.0002 |
10 | -1 | FujiXerox | 0.1657 | +0.0008 | 0.4677 | -0.0037 | 0.3850 | -0.0032 |
11 | | JanK | 0.4218 | +0.0008 | 0.4978 | -0.0023 | 0.3831 | -0.0020 |
12 | | Directed Acyclic Graph | 0.4295 | +0.0018 | 0.5122 | -0.0014 | 0.3824 | -0.0009 |
13 | | jg | 0.2117 | +0.0030 | 0.5404 | +0.0007 | 0.3721 | +0.0009 |
14 | +1 | SIRIUS-LTG-UIO | 0.3037 | +0.0018 | 0.4898 | +0.0012 | 0.3664 | +0.0010 |
15 | -1 | Py.ro | 0.2977 | +0.0015 | 0.4318 | -0.0030 | 0.3630 | -0.0028 |
16 | | hanshan | 0.0000 | +0.0000 | 0.3307 | -0.0038 | 0.2982 | -0.0038 |
17 | | lisizhen | 0.3971 | -0.0001 | 0.4517 | -0.0021 | 0.2898 | -0.0024 |
18 | | HZ | 0.3722 | +0.0000 | 0.3333 | +0.0000 | 0.2867 | +0.0000 |
19 | | UCSB | 0.1255 | +0.0014 | 0.5070 | -0.0010 | 0.2835 | -0.0005 |
20 | | FEVER Baseline | 0.1866 | +0.0040 | 0.4892 | +0.0008 | 0.2771 | +0.0026 |
21 | | ankur-umbc | 0.3699 | +0.0003 | 0.4489 | +0.0000 | 0.2369 | -0.0007 |
22 | | m6.ub.6m.bu | 0.1673 | +0.0008 | 0.5722 | -0.0010 | 0.2275 | -0.0015 |
23 | | ubub.bubu.61 | 0.1678 | +0.0010 | 0.5528 | -0.0014 | 0.2154 | -0.0017 |
24 | | mithunpaul08 | 0.1866 | +0.0040 | 0.3715 | +0.0022 | 0.1928 | +0.0028 |
NB: Participants will be allowed a limited number of submissions per system: multiple submissions are allowed, but only the final one will be scored.
The purpose of the FEVER challenge is to evaluate the ability of a system to verify information using evidence from Wikipedia.
Our scoring considers classification accuracy and evidence recall. The scoring program can be found on this GitHub repo with a more detailed explanation available in the readme.
Only the first 5 predicted sentences of evidence will be considered for scoring; additional evidence will be discarded without penalty. In the blind test set, all claims can be sufficiently verified with at most 5 sentences of evidence. For a detailed description of the data annotation process and baseline results, see the paper.
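As a rough illustration of the scoring rule (this is not the official scorer; the function and argument names here are assumptions), a single claim earns FEVER-score credit only if the label is correct and, for verifiable claims, at least one complete gold evidence set appears among the first 5 predicted sentences:

```python
def fever_correct(gold_label, gold_evidence_sets, predicted_label, predicted_evidence):
    """Return True if one claim earns FEVER-score credit (illustrative sketch).

    gold_evidence_sets: list of evidence sets, each a list of (page, line_id) pairs.
    predicted_evidence: list of (page, line_id) pairs; only the first 5 count.
    """
    if predicted_label != gold_label:
        return False
    if gold_label == "NOT ENOUGH INFO":
        return True  # no evidence is required for unverifiable claims
    kept = set(map(tuple, predicted_evidence[:5]))
    # Credit requires at least one gold evidence set fully contained in the kept sentences.
    return any(all(tuple(e) in kept for e in evidence_set)
               for evidence_set in gold_evidence_sets)
```

Note that a partial evidence set earns no credit: every sentence in at least one gold set must be retrieved.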
The data will be distributed in JSONL format with one example per line (see http://jsonlines.org for more details).
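Since each line is a standalone JSON object, the file can be streamed with the standard `json` module (a minimal sketch; the helper name is an assumption):

```python
import json

def read_jsonl(path):
    """Yield one parsed example per line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip any blank lines
                yield json.loads(line)
```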
In addition to the task-specific dataset, the full set of Wikipedia pages (segmented at the sentence level) can be found on the resources page.
The training and development data will contain 4 fields:
- `id`: The ID of the claim.
- `label`: The annotated label for the claim. Can be one of SUPPORTS|REFUTES|NOT ENOUGH INFO.
- `claim`: The text of the claim.
- `evidence`: A list of evidence sets (lists of `[Annotation ID, Evidence ID, Wikipedia URL, sentence ID]` tuples), or a `[Annotation ID, Evidence ID, null, null]` tuple if the label is NOT ENOUGH INFO.

Below are examples of the data structures for each of the three labels.
{
    "id": 62037,
    "label": "SUPPORTS",
    "claim": "Oliver Reed was a film actor.",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 0]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 3],
            [<annotation_id>, <evidence_id>, "Gladiator_-LRB-2000_film-RRB-", 0]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 2],
            [<annotation_id>, <evidence_id>, "Castaway_-LRB-film-RRB-", 0]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 1]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 6]
        ]
    ]
}
{
    "id": 78526,
    "label": "REFUTES",
    "claim": "Lorelai Gilmore's father is named Robert.",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, "Lorelai_Gilmore", 3]
        ]
    ]
}
{
    "id": 137637,
    "label": "NOT ENOUGH INFO",
    "claim": "Henri Christophe is recognized for building a palace in Milot.",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, null, null]
        ]
    ]
}
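Given the structure above, the distinct Wikipedia pages cited for a claim can be collected by iterating over its evidence sets (a sketch; the helper name is an assumption, and the null page/sentence entries of NOT ENOUGH INFO claims are skipped):

```python
def evidence_pages(example):
    """Collect the distinct Wikipedia page titles cited in an example's evidence."""
    pages = set()
    for evidence_set in example["evidence"]:
        for _annotation_id, _evidence_id, page, _sentence_id in evidence_set:
            if page is not None:  # null for NOT ENOUGH INFO claims
                pages.add(page)
    return sorted(pages)
```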
The test data will follow the same format as the training/development examples, with the label and evidence fields removed.
{
"id": 78526,
"claim": "Lorelai Gilmore's father is named Robert."
}
Predictions should be submitted as a JSONL file (e.g. `predictions.jsonl`), with one claim object per line and predicted evidence given as a list of `[Page, Line ID]` tuples.
Each JSON object should adhere to the following format (with line breaks removed), and the order of the claims must be preserved.
{
"id": 78526,
"predicted_label": "REFUTES",
"predicted_evidence": [
["Lorelai_Gilmore", 3]
]
}
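A file in this format can be assembled with `json.dumps`, which emits each object on a single line (a sketch; the function name and the example prediction list are illustrative):

```python
import json

def write_predictions(path, predictions):
    """Write one prediction object per line, preserving claim order."""
    with open(path, "w", encoding="utf-8") as f:
        for pred in predictions:
            f.write(json.dumps(pred) + "\n")

# Illustrative usage with a single prediction:
preds = [{"id": 78526,
          "predicted_label": "REFUTES",
          "predicted_evidence": [["Lorelai_Gilmore", 3]]}]
```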