Shared Task

Preliminary Leaderboard

These shared task results are preliminary pending a further round of annotation.
#    User            Team Name                    Evidence F1   Label Accuracy   FEVER Score
1    chaonan99       UNC-NLP                      0.5296        0.6821           0.6421
2    tyoneda         UCL Machine Reading Group    0.3497        0.6762           0.6252
3    littsler        Athene UKP TU Darmstadt      0.3697        0.6546           0.6158
4    papelo                                       0.6485        0.6108           0.5736
5    chidey                                       0.2969        0.5972           0.4994
6    Tuhin           ColumbiaNLP                  0.3533        0.5745           0.4906
7    nanjiang        The Ohio State University    0.5853        0.5012           0.4342
8    wotto           gesis cologne                0.1960        0.5415           0.4077
9    tomoki          Fujixerox                    0.1649        0.4713           0.3881
10   nayeon7lee                                   0.4912        0.5125           0.3859
11   JanK                                         0.4210        0.5002           0.3850
12   anikethjr       Directed Acyclic Graph       0.4277        0.5136           0.3833
13   jg                                           0.2087        0.5397           0.3713
14   pyro                                         0.2962        0.4348           0.3658
15   SIRIUS          SIRIUS-LTG-UIO               0.3019        0.4887           0.3655
16   hanshan                                      0.0000        0.3345           0.3020
17   lisizhen                                     0.3973        0.4538           0.2922
18   hz66pasa        HZ                           0.3722        0.3333           0.2867
19   guancheng_ren   UCSB                         0.1241        0.5080           0.2840
20   jamesthorne     FEVER Baseline               0.1826        0.4884           0.2745
21   ankur-umbc                                   0.3695        0.4489           0.2376
22   m6.ub.6m.bu                                  0.1665        0.5732           0.2289
23   ubub.bubu.61                                 0.1668        0.5542           0.2171
24   mithunpaul08                                 0.1826        0.3694           0.1900

Key Dates

  • Challenge Launch: 3 April 2018
  • Testing Begins: 24 July 2018
  • Submission Closes: 27 July 2018
  • Results Announced: 30 July 2018
  • System Descriptions Due for Workshop: 10 August 2018
  • Winners Announced: 1 November 2018 (at EMNLP)

NB: Participants will be allowed a limited number of submissions per system: multiple submissions are permitted, but only the final one will be scored and counted.

Task Definition

The purpose of the FEVER challenge is to evaluate the ability of a system to verify information using evidence from Wikipedia.

  • Given a factual claim involving one or more entities (resolvable to Wikipedia pages), the system must extract textual evidence (sets of sentences from Wikipedia pages) that support or refute the claim.
  • Using this evidence, label the claim as Supported or Refuted given the evidence, or NotEnoughInfo if there isn't sufficient evidence to either support or refute it (an illustrative sketch of this input/output contract follows this list).
  • A claim's evidence may consist of multiple sentences that provide the stated label only when examined together (e.g. for the claim “Oliver Reed was a film actor.”, one piece of evidence can be the set {“Oliver Reed starred in Gladiator”, “Gladiator is a film released in 2000”}).
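To make the required input and output concrete, a verification system can be thought of as a function of the following shape. This is only an illustrative sketch (the function name and types are not part of the task); the label strings and the (page, sentence ID) evidence representation follow the data format described further down this page.

from typing import List, Tuple

LABELS = {"SUPPORTS", "REFUTES", "NOT ENOUGH INFO"}


def verify(claim: str) -> Tuple[str, List[Tuple[str, int]]]:
    """Given a claim, return a label from LABELS together with the retrieved
    evidence as (wikipedia_page, sentence_id) pairs (empty for NOT ENOUGH INFO).

    Placeholder stub for illustration only.
    """
    raise NotImplementedError


# e.g. verify("Oliver Reed was a film actor.") might return
# ("SUPPORTS", [("Oliver_Reed", 3), ("Gladiator_-LRB-2000_film-RRB-", 0)])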

Scoring

Our scoring considers both classification accuracy and evidence recall. The scoring program can be found in this GitHub repo, with a more detailed explanation available in the README.

  • We will only award points for accuracy if the correct evidence is found. We refer to this as the FEVER Score (a minimal sketch of this scoring logic follows this list).
  • For a claim, we consider the correct evidence to be found if at least one complete set of annotated sentences is returned (the annotated data may contain multiple sets of evidence, each of which is sufficient to support or refute a claim).
  • Only the first 5 predicted evidence sentences will be considered for scoring. Additional evidence will be discarded without penalty. In the blind test set, all claims can be sufficiently verified with at most 5 sentences of evidence.
  • The scorer will also produce diagnostic scores (F1, Precision, Recall and Accuracy). These will not be considered for the competition other than to rank two submissions with equal FEVER Scores.
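The official scorer is the one in the GitHub repository linked above; the snippet below is only a minimal sketch of the FEVER Score logic described in this list. It assumes gold evidence is given as sets of (page, sentence ID) pairs and, since NOT ENOUGH INFO claims have no evidence to retrieve, scores them on the label alone.

MAX_EVIDENCE = 5  # only the first 5 predicted sentences count


def evidence_found(gold_evidence_sets, predicted_evidence):
    """True if at least one complete gold evidence set is covered by the
    first MAX_EVIDENCE predicted (page, sentence_id) pairs."""
    predicted = set(map(tuple, predicted_evidence[:MAX_EVIDENCE]))
    return any(set(map(tuple, gold_set)) <= predicted
               for gold_set in gold_evidence_sets)


def fever_score_instance(gold_label, gold_evidence_sets,
                         predicted_label, predicted_evidence):
    """1 if the claim counts towards the FEVER Score, else 0: the label must
    be correct and, for SUPPORTS/REFUTES claims, a complete gold evidence
    set must appear among the first MAX_EVIDENCE predicted sentences."""
    if predicted_label != gold_label:
        return 0
    if gold_label == "NOT ENOUGH INFO":
        return 1
    return int(evidence_found(gold_evidence_sets, predicted_evidence))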

Baseline system

For a detailed description of the data annotation process and baseline results see the paper.

Data Format

The data will be distributed in JSONL format with one example per line (see http://jsonlines.org for more details).

In addition to the task-specific dataset, the full set of Wikipedia pages (segmented at the sentence level) can be found on the data page.
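For illustration, the snippet below shows one way to read a JSONL file in Python; the file name is a placeholder for wherever you save the data from the data page.

import json


def read_jsonl(path):
    """Yield one JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# "train.jsonl" is a placeholder file name for the downloaded training data.
claims = list(read_jsonl("train.jsonl"))
print(claims[0]["claim"], claims[0]["label"])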

Training/Development Data format

The training and development data will contain 4 fields:

  • id: The ID of the claim
  • label: The annotated label for the claim. Can be one of SUPPORTS|REFUTES|NOT ENOUGH INFO.
  • claim: The text of the claim.
  • evidence: A list of evidence sets (lists of [Annotation ID, Evidence ID, Wikipedia URL, sentence ID] tuples), or an [Annotation ID, Evidence ID, null, null] tuple if the label is NOT ENOUGH INFO. A sketch of parsing this structure follows the examples below.
    (The Annotation ID and Evidence ID fields are for internal use only and are not used for scoring; they may help debug or correct annotation issues at a later point in time.)

Below are examples of the data structures for each of the three labels.

Supports Example

{
    "id": 62037,
    "label": "SUPPORTS",
    "claim": "Oliver Reed was a film actor.",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 0]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 3],
            [<annotation_id>, <evidence_id>, "Gladiator_-LRB-2000_film-RRB-", 0]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 2],
            [<annotation_id>, <evidence_id>, "Castaway_-LRB-film-RRB-", 0]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 1]
        ],
        [
            [<annotation_id>, <evidence_id>, "Oliver_Reed", 6]
        ]
    ]
}
              

Refutes Example

{
    "id": 78526,
    "label": "REFUTES",
    "claim": "Lorelai Gilmore's father is named Robert.",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, "Lorelai_Gilmore", 3]
        ]
    ]
}
              

NotEnoughInfo Example

{
    "id": 137637,
    "label": "NOT ENOUGH INFO",
    "claim": "Henri Christophe is recognized for building a palace in Milot.",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, null, null]
        ]
    ]
}
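To make the evidence structure concrete, here is a minimal sketch of flattening the gold evidence of one example into (page, sentence ID) pairs. It is illustrative only and simply follows the field layout shown in the examples above.

def gold_evidence_sets(example):
    """Convert the 'evidence' field into a list of evidence sets, each a list
    of (wikipedia_page, sentence_id) pairs. NOT ENOUGH INFO claims (null page
    and sentence) yield an empty list."""
    sets = []
    for evidence_set in example["evidence"]:
        pairs = [(page, sentence_id)
                 for _annotation_id, _evidence_id, page, sentence_id in evidence_set
                 if page is not None]
        if pairs:
            sets.append(pairs)
    return sets


# For the SUPPORTS example above this yields
# [[("Oliver_Reed", 0)],
#  [("Oliver_Reed", 3), ("Gladiator_-LRB-2000_film-RRB-", 0)],
#  ...]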
              

Test Data format

The test data will follow the same format as the training/development examples, with the label and evidence fields removed.

{
    "id": 78526,
    "claim": "Lorelai Gilmore's father is named Robert."
}
              

Answer Submission Instructions

  • Go to Codalab
  • Create a team/system account
  • Submit your answers file as a JSONL document (the file name must be predictions.jsonl), with one claim object per line and the predicted evidence given as a list of [Page, Line ID] pairs. Each JSON object should adhere to the following format (with line breaks removed), and the order of the claims must be preserved; a short sketch of writing such a file follows the example below.
  • For the competition, we will limit the maximum number of submissions to 10 per user/team.
{
    "id": 78526,
    "predicted_label": "REFUTES",
    "predicted_evidence": [
        ["Lorelai_Gilmore", 3]
    ]
}
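
For illustration, a predictions.jsonl file in this format could be written as follows; the predictions list is a placeholder for your system's output and must keep the claims in the same order as the test file.

import json

# Placeholder predictions: one object per test claim, in test-file order.
predictions = [
    {"id": 78526,
     "predicted_label": "REFUTES",
     "predicted_evidence": [["Lorelai_Gilmore", 3]]},
]

# Write one JSON object per line, with no extra line breaks.
with open("predictions.jsonl", "w", encoding="utf-8") as f:
    for prediction in predictions:
        f.write(json.dumps(prediction) + "\n")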