FEVER 2.0 Shared Task

Citations

Adversarial attacks against Fact Extraction and VERification
James Thorne, Andreas Vlachos
Contains guidelines for the FEVER 2.0 Shared Task

@misc{Thorne2019adversarial,
    title={Adversarial attacks against {Fact Extraction and VERification}},
    author={James Thorne and Andreas Vlachos},
    year={2019},
    eprint={1903.05543},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

The FEVER2.0 Shared Task
James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos and Arpit Mittal

@inproceedings{Thorne19FEVER2,
    author = {Thorne, James and Vlachos, Andreas and Cocarascu, Oana and Christodoulopoulos, Christos and Mittal, Arpit},
    title = {The {FEVER2.0} Shared Task},
    booktitle = {Proceedings of the Second Workshop on {Fact Extraction and VERification (FEVER)}},
    year = {2019}
}

Results

Builders

Docker images for 3 new systems were received and are compared against the top 4 systems from the 2018 FEVER shared task (marked with *). A test set predictions file was received from CalcWorks, but no docker image was submitted.
Rank  System     Resilience (%)  FEVER Score (%)
1     Papelo*    37.31           57.36
2     UCL MR*    35.83           62.52
3     Dominiks   35.82           68.46
4     CUNLP      32.92           67.08
5     UNC*       30.47           64.21
6     Athene*    25.35           61.58
7     GPLSI      19.63           58.07
8     Baseline   11.06           27.45
9     CalcWorks  DNQ             33.56

Breakers

Breaker data was received from 4 participants and was split between a public dev set and a private test set. The split was a balanced sample of the data, and each instance was annotated to ensure grammaticality and instance correctness. Papelo's submission contained only NotEnoughInfo claims and therefore did not qualify for the shared task. The potency for Papelo is reported, but it is not included in the calculations of the systems' resilience.
Rank  Team        # Test Instances  # Breaks from Valid Instances  Raw Potency (%)  Correct Rate (%)  Potency (%)
1     TMLab        79                402                           78.80            84.81             66.83
2     CUNLP       501               2219                           68.51            81.44             55.79
3     NbAuzDrLqg  102                401                           79.66            64.71             51.54
4     Baseline    498               1976                           60.34            82.33             49.68
DNQ   Papelo        -                  -                           71.20            91.00             64.79

Fixers

Using the dev set from the Breakers' submissions, participants were invited to fix their systems (making them more resilient). A submission was received from CUNLP.
Rank  Team   FEVER Score Before (%)  Resilience Before (%)  FEVER Score After (%)  Resilience After (%)
1     CUNLP  67.08                   32.92                  68.80                  36.61

Task Definition

The FEVER 2.0 Shared Task will build upon work from the first shared task in a Build it, Break it, Fix it setting. The shared task will comprise three phases. In the first phase, Builders build systems for solving the first FEVER shared task dataset. The highest-scoring systems from the first shared task will be used as baselines, and we will also invite new participants to develop new systems.

In the second phase, Breakers are tasked with generating adversarial examples to fool the existing systems. We consider only novel claims (i.e. not contained in the original FEVER dataset) labelled Supports, Refutes or NotEnoughInfo. Supported or refuted claims must be accompanied by evidence from the Wikipedia dump used in the original task (claims labelled NotEnoughInfo do not require evidence). The Breakers will have access to the systems so that they can generate claims that are challenging for the Builders' systems. Alongside the labels and evidence for each claim, Breakers will be asked to provide meta-information regarding the type of attack they are introducing. The Breakers will be invited to submit up to a fixed number of claims as their entry to the shared task. We welcome both manual (through the use of our annotation interface) and automated methods for this phase. Half of the claims generated by the Breakers will be retained as a hold-out blind test set, and the remaining half will be released to the participants to fix their systems. The blind set will be manually evaluated by the organisers for quality assurance.

In the final phase of the shared task, the original Builders or teams of dedicated Fixers must incorporate the new data generated by the Breakers to improve the systems' classification performance.

Key Dates

  • Challenge Launch: 7 March 2019
  • Round 1:
      • Builders Prediction Submissions: 15 April 2019
      • Builders Docker Submissions: 30 April 2019
  • Round 2:
      • Breakers Sample Submissions: 30 April 2019
      • Breakers Final Submissions: 7 June 2019
  • Round 3:
      • Breaking Instances Release: 21 June 2019
      • Fixers Submissions: 21 July 2019
  • Results + Workshop:
      • Results Announced: 26 July 2019
      • System Descriptions Due for Workshop: 30 August 2019
      • Winners Announced: 3 November (EMNLP-IJCNLP)

Instructions for Builders

Builders will be creating systems that can solve the original FEVER task. Participants in this category are also encouraged to participate as Fixers for their own systems.

Training Data & Other Resources

The FEVER dataset can be found on our Dataset page. The page contains examples of the data structures for each of the three labels. Existing implementations for the FEVER 1.0 task can be found on the FEVER 1.0 Task page.

Submission

Participants must submit their predictions to the new FEVERlab leaderboard for scoring. We also invite participants to make their systems available to the Breakers by creating a docker image (sample docker image) and submitting it to the FEVERlab page. The Shared Task organisers will host the docker images and keep them private by mediating access through the Shared Task server. Throughout the shared task, Builders should be able to provide support to Breakers or Fixers that use their system through the FEVER slack channel.

Scoring

A baseline performance of builder systems will be measured using predictions against the FEVER test set -- these results will be displayed on the new FEVERlab leaderboard alongside the Codalab entries for the original FEVER 1.0 task. After Builders submit docker images and the Breakers have submitted adversarial instances, we will measure the builders' resilience to adversarial examples. The results will be presented in a new leaderboard.
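For reference, the sketch below illustrates the per-instance logic behind the FEVER score used throughout: the predicted label must match and, for Supported/Refuted claims, at least one complete gold evidence set must appear among the predicted evidence. The five-sentence evidence cap, the label strings, and the (page, sentence) evidence representation are assumptions carried over from the FEVER 1.0 task; the official scorer remains the reference implementation.

# A minimal sketch of the per-instance FEVER score check, following the
# FEVER 1.0 definition. The five-sentence evidence cap and the evidence
# representation as (page, sentence id) pairs are assumptions based on the
# original task; use the official scorer for actual evaluation.
def fever_score_instance(predicted_label, predicted_evidence,
                         gold_label, gold_evidence_sets, max_evidence=5):
    if predicted_label != gold_label:
        return False
    # NotEnoughInfo claims are scored on the label alone.
    if gold_label == "NOT ENOUGH INFO":
        return True
    # Supported/Refuted claims must also recover at least one complete
    # gold evidence set within the first max_evidence predicted sentences.
    predicted = set(map(tuple, predicted_evidence[:max_evidence]))
    return any(set(map(tuple, gold_set)) <= predicted
               for gold_set in gold_evidence_sets)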

Test Data format

The data format submitted by the Breakers (see below) will be the same as the FEVER 1.0 task.

Instructions for Breakers

Breakers will be generating adversarial claims in an attempt to break as many Builders' systems as possible. The adversarial claims can be generated manually or automatically, and participants are free to choose specific systems to target. All three types of claims are allowed (Supported, Refuted or NotEnoughInfo), but Supported and Refuted claims have to be accompanied by at least one evidence sentence (from the FEVER 1.0 pre-processed Wikipedia dump).

[optional] Training Data

At the launch of the challenge, we will release additional annotation artefacts to support adversarial attacks. This will incorporate the mutations that were used to generate the FEVER claims.

Data format

Each adversarial claim submitted has to match the format of the FEVER 1.0 claims, with the addition of an attack field containing meta-information about how the attack was generated. We will provide a list of expected values at the challenge launch, e.g.:

{
    "id": 78526,
    "label": "REFUTES",
    "claim": "Lorelai Gilmore's father is named Robert.",
    "attack": "Entity replacement",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, "Lorelai_Gilmore", 3]
        ]
    ]
}
              

See the definition of the FEVER 1.0 task for more details.
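As an illustration only, the snippet below checks claims against the format shown above. It assumes submissions are provided as JSON Lines, with the field names and label strings taken from the example; the file name is hypothetical and the organisers' guidelines remain authoritative.

# Illustrative check of breaker claims against the format above (assumptions:
# JSON Lines input, field names as in the example, FEVER 1.0 label strings).
import json

REQUIRED_FIELDS = {"id", "label", "claim", "attack", "evidence"}
LABELS = {"SUPPORTS", "REFUTES", "NOT ENOUGH INFO"}

def check_claim(line):
    claim = json.loads(line)
    missing = REQUIRED_FIELDS - claim.keys()
    assert not missing, "missing fields: %s" % missing
    assert claim["label"] in LABELS, "unknown label: %s" % claim["label"]
    # Supported/Refuted claims need at least one evidence sentence.
    if claim["label"] != "NOT ENOUGH INFO":
        assert any(claim["evidence"]), "evidence required"
    return claim

with open("breaker_submission.jsonl") as f:  # hypothetical file name
    claims = [check_claim(line) for line in f]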

Submission

In order to register as a Breaker for the FEVER 2.0 task, each participant will have to submit, by 30 April 2019, a sample of 50 examples that will be manually evaluated by the organisers of the task. For the final submission, participants will have to submit a balanced dataset of up to 1000 examples, 50% of which will be given to Builders as development data; the other 50% will be manually evaluated for accuracy of claim labels and evidence and used as the final test set.

Scoring

Breakers will be scored on the potency of the adversarial instances that they submit. Potency is an inverted FEVER score based on the number of systems that incorrectly classify the claims that meet the data guidelines. For a formal definition, see Section 3 of this paper: https://arxiv.org/abs/1903.05543.
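Consistent with the Breakers results table above, the adjusted potency is the raw potency scaled by the correct rate; for example, TMLab's 78.80% raw potency and 84.81% correct rate give 66.83%. The one-liner below is only a sketch of that relationship; the paper gives the formal definition, including how raw potency is derived from an inverted FEVER score.

# Adjusted potency = raw potency x correct rate (a sketch consistent with
# the Breakers results table; see Section 3 of the paper for the formal
# definition of raw potency as an inverted FEVER score).
def adjusted_potency(raw_potency, correct_rate):
    return raw_potency * correct_rate

print(adjusted_potency(0.7880, 0.8481))  # ~0.6683, i.e. TMLab's 66.83%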

Instructions for Fixers

Fixers will be working on correcting errors specific to types of (or individual) adversarial attacks. This round is open to everyone, regardless of participation in previous rounds. Builders are invited to submit improved systems based on the Breaker data; alternatively, Fixers can collaborate with one or more Builders, using one of the published systems, and submit improved solutions as a new team.

Systems

The following FEVER2.0 systems are open to Fixers; each Builder has agreed either to collaborate or to release code.

  • GPLSI - GitHub
  • Saarland - GitHub
  • Columbia - open to collaboration - contact Tuhin Chakrabarty or Christopher Hidey via slack
The following baseline systems are also open to Fixers. The original systems have been forked and modified to run inside docker containers and to implement both the web API and batch-mode predictions.

Development Data

The development dataset based on the Breakers' submissions is now available from the FEVER 2.0 Dataset page. All submissions (except the rule-based baseline) have been manually annotated for correctness.

Data format

Systems will be provided a set of unlabeled claims and will be scored on their ability to correctly identify evidence and label the claim. The data format provided to the fixers will be the same as the FEVER 1.0 task: i.e. the attack metadata will not be provided to the systems at test time.
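The sketch below illustrates, using the field names from the breaker format earlier, how an instance maps to the blind test input: only the claim id and text are visible to systems, while the label, evidence, and attack metadata are withheld and used for scoring.

# Illustrative only: what a fixer system sees at test time, assuming the
# field names from the breaker format above (label, evidence and attack
# metadata are withheld and used for scoring).
def to_blind_instance(breaker_claim):
    return {"id": breaker_claim["id"], "claim": breaker_claim["claim"]}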

Submission

Fixers will be invited to submit a docker image following the same guidelines as the Builders. Both the predict.sh batch mode and the web API for single-instance prediction must be implemented. For more information about the web API, see the existing systems, a sample submission, or the GitHub page. The submission section on the FEVERlab page will be opened on the 20th of June.
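As a rough illustration of the single-instance web API, the Flask sketch below exposes a prediction endpoint. The route name, payload shape, port, and the placeholder system are assumptions made for this sketch; the sample submission on GitHub defines the actual interface that must be implemented.

# A hypothetical single-instance prediction endpoint, for illustration only.
# The route, payload shape and port are assumptions; follow the sample
# submission on GitHub for the required interface.
from flask import Flask, jsonify, request

class PlaceholderSystem:
    """Stand-in for a fixer's actual model (hypothetical)."""
    def predict(self, claim):
        return "NOT ENOUGH INFO", []

system = PlaceholderSystem()
app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    instance = request.get_json()  # e.g. {"id": ..., "claim": "..."}
    label, evidence = system.predict(instance["claim"])
    return jsonify({"predicted_label": label,
                    "predicted_evidence": evidence})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)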

Scoring

Participants will be scored based on their improvement on the final test set of the Breakers' adversarial examples, as well as on their score on the FEVER 1.0 test set. The leaderboard will display all scores.