Rank | ∆ | Team | Evidence F1 | ∆ | Accuracy | ∆ | FEVER Score | ∆ |
---|---|---|---|---|---|---|---|---|
1 | | UNC-NLP | 0.5322 | +0.0026 | 0.6798 | -0.0023 | 0.6398 | -0.0023 |
2 | | UCL Machine Reading Group | 0.3521 | +0.0024 | 0.6744 | -0.0018 | 0.6234 | -0.0019 |
3 | | Athene UKP TU Darmstadt | 0.3733 | +0.0036 | 0.6522 | -0.0024 | 0.6132 | -0.0026 |
4 | | Papelo | 0.6471 | -0.0013 | 0.6074 | -0.0034 | 0.5704 | -0.0032 |
5 | | SWEEPer | 0.2994 | +0.0025 | 0.5964 | -0.0009 | 0.4986 | -0.0009 |
6 | | ColumbiaNLP | 0.3547 | +0.0014 | 0.5728 | -0.0018 | 0.4888 | -0.0018 |
7 | | The Ohio State University | 0.5854 | +0.0001 | 0.4989 | -0.0022 | 0.4322 | -0.0020 |
8 | | GESIS Cologne | 0.1981 | +0.0021 | 0.5395 | -0.0021 | 0.4058 | -0.0019 |
9 | +1 | nayeon7lee | 0.4929 | +0.0017 | 0.5125 | -0.0001 | 0.3858 | -0.0002 |
10 | -1 | FujiXerox | 0.1657 | +0.0008 | 0.4677 | -0.0037 | 0.3850 | -0.0032 |
11 | | JanK | 0.4218 | +0.0008 | 0.4978 | -0.0023 | 0.3831 | -0.0020 |
12 | | Directed Acyclic Graph | 0.4295 | +0.0018 | 0.5122 | -0.0014 | 0.3824 | -0.0009 |
13 | | jg | 0.2117 | +0.0030 | 0.5404 | +0.0007 | 0.3721 | +0.0009 |
14 | +1 | SIRIUS-LTG-UIO | 0.3037 | +0.0018 | 0.4898 | +0.0012 | 0.3664 | +0.0010 |
15 | -1 | Py.ro | 0.2977 | +0.0015 | 0.4318 | -0.0030 | 0.3630 | -0.0028 |
16 | | hanshan | 0.0000 | +0.0000 | 0.3307 | -0.0038 | 0.2982 | -0.0038 |
17 | | lisizhen | 0.3971 | -0.0001 | 0.4517 | -0.0021 | 0.2898 | -0.0024 |
18 | | HZ | 0.3722 | +0.0000 | 0.3333 | +0.0000 | 0.2867 | +0.0000 |
19 | | UCSB | 0.1255 | +0.0014 | 0.5070 | -0.0010 | 0.2835 | -0.0005 |
20 | | FEVER Baseline | 0.1866 | +0.0040 | 0.4892 | +0.0008 | 0.2771 | +0.0026 |
21 | | ankur-umbc | 0.3699 | +0.0003 | 0.4489 | +0.0000 | 0.2369 | -0.0007 |
22 | | m6.ub.6m.bu | 0.1673 | +0.0008 | 0.5722 | -0.0010 | 0.2275 | -0.0015 |
23 | | ubub.bubu.61 | 0.1678 | +0.0010 | 0.5528 | -0.0014 | 0.2154 | -0.0017 |
24 | | mithunpaul08 | 0.1866 | +0.0040 | 0.3715 | +0.0022 | 0.1928 | +0.0028 |
NB: Participants will be allowed a limited number of submissions per system: multiple submissions are allowed, but only the final one will be scored.
The purpose of the FEVER challenge is to evaluate the ability of a system to verify information using evidence from Wikipedia.
Our scoring considers classification accuracy and evidence recall. The scoring program can be found on this GitHub repo with a more detailed explanation available in the readme.
Only the first 5 predicted sentences of evidence will be considered for scoring; additional evidence will be discarded without penalty. In the blind test set, all claims can be sufficiently verified with at most 5 sentences of evidence. For a detailed description of the data annotation process and baseline results, see the paper.
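As a rough illustration of the scoring rule (this is not the official scorer; the function and argument names here are assumptions), a single claim earns FEVER-score credit only if the label is correct and, for verifiable claims, at least one complete gold evidence set appears among the first 5 predicted sentences:

```python
def fever_correct(gold_label, gold_evidence_sets, predicted_label, predicted_evidence):
    """Return True if one claim earns FEVER-score credit (illustrative sketch).

    gold_evidence_sets: list of evidence sets, each a list of (page, line_id) pairs.
    predicted_evidence: list of (page, line_id) pairs; only the first 5 count.
    """
    if predicted_label != gold_label:
        return False
    if gold_label == "NOT ENOUGH INFO":
        return True  # no evidence is required for unverifiable claims
    kept = set(map(tuple, predicted_evidence[:5]))
    # Credit requires at least one gold evidence set fully contained in the kept sentences.
    return any(all(tuple(e) in kept for e in evidence_set)
               for evidence_set in gold_evidence_sets)
```

Note that a partial evidence set earns no credit: every sentence in at least one gold set must be retrieved.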
The data will be distributed in JSONL format with one example per line (see http://jsonlines.org for more details).
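Since each line is a standalone JSON object, the file can be streamed with the standard `json` module (a minimal sketch; the helper name is an assumption):

```python
import json

def read_jsonl(path):
    """Yield one parsed example per line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip any blank lines
                yield json.loads(line)
```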
In addition to the task-specific dataset, the full set of Wikipedia pages (segmented at the sentence level) can be found on the resources page.
The training and development data will contain 4 fields:
- `id`: The ID of the claim.
- `label`: The annotated label for the claim. Can be one of SUPPORTS|REFUTES|NOT ENOUGH INFO.
- `claim`: The text of the claim.
- `evidence`: A list of evidence sets (lists of `[Annotation ID, Evidence ID, Wikipedia URL, sentence ID]` tuples), or a `[Annotation ID, Evidence ID, null, null]` tuple if the label is NOT ENOUGH INFO.

Below are examples of the data structures for each of the three labels.
{
    "id": 62037,
    "label": "SUPPORTS",
    "claim": "Oliver Reed was a film actor.",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 0]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 3],
            [<annotation_id>, <evidence_id>, "Gladiator_-LRB-2000_film-RRB-", 0]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 2],
            [<annotation_id>, <evidence_id>, "Castaway_-LRB-film-RRB-", 0]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 1]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 6]
        ]
    ]
}
{
    "id": 78526,
    "label": "REFUTES",
    "claim": "Lorelai Gilmore's father is named Robert.",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, "Lorelai_Gilmore", 3]
        ]
    ]
}
{
    "id": 137637,
    "label": "NOT ENOUGH INFO",
    "claim": "Henri Christophe is recognized for building a palace in Milot.",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, null, null]
        ]
    ]
}
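Given the structure above, the distinct Wikipedia pages cited for a claim can be collected by iterating over its evidence sets (a sketch; the helper name is an assumption, and the null page/sentence entries of NOT ENOUGH INFO claims are skipped):

```python
def evidence_pages(example):
    """Collect the distinct Wikipedia page titles cited in an example's evidence."""
    pages = set()
    for evidence_set in example["evidence"]:
        for _annotation_id, _evidence_id, page, _sentence_id in evidence_set:
            if page is not None:  # null for NOT ENOUGH INFO claims
                pages.add(page)
    return sorted(pages)
```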
The test data will follow the same format as the training/development examples, with the label and evidence fields removed.
{
"id": 78526,
"claim": "Lorelai Gilmore's father is named Robert."
}
Predictions should be submitted as a JSONL file (e.g. `predictions.jsonl`), with one claim object per line and predicted evidence given as a list of `[Page, Line ID]` tuples.
Each JSON object should adhere to the following format (with line breaks removed), and the order of the claims must be preserved.
{
"id": 78526,
"predicted_label": "REFUTES",
"predicted_evidence": [
["Lorelai_Gilmore", 3]
]
}
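A file in this format can be assembled with `json.dumps`, which emits each object on a single line (a sketch; the function name and the example prediction list are illustrative):

```python
import json

def write_predictions(path, predictions):
    """Write one prediction object per line, preserving claim order."""
    with open(path, "w", encoding="utf-8") as f:
        for pred in predictions:
            f.write(json.dumps(pred) + "\n")

# Illustrative usage with a single prediction:
preds = [{"id": 78526,
          "predicted_label": "REFUTES",
          "predicted_evidence": [["Lorelai_Gilmore", 3]]}]
```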