FEVER 2.0 Shared Task

Key Dates

  • Challenge Launch: 7 March 2019
  • Round 1
      • Builders Prediction Submissions: 15 April 2019
      • Builders Docker Submissions: 30 April 2019
  • Round 2
      • Breakers Sample Submissions: 30 April 2019
      • Breakers Final Submissions: 7 June 2019
  • Round 3
      • Breaking Instances Release: 21 June 2019
      • Fixers Submissions: 21 July 2019
  • Results + Workshop
      • Results Announced: 26 July 2019
      • System Descriptions Due for Workshop: 30 August 2019
      • Winners Announced: 3-4 November 2019 (EMNLP-IJCNLP)

Task Definition

The FEVER 2.0 Shared Task will build upon work from the first shared task in a Build-it, Break-it, Fix-it setting. The shared task will comprise three phases. In the first phase, Builders build systems for solving the first FEVER shared task dataset. The highest-scoring systems from the first shared task will be used as baselines, and we will also invite new participants to develop new systems.

In the second phase, Breakers are tasked with generating adversarial examples to fool the existing systems. We consider only novel claims (i.e. not contained in the original FEVER dataset) labelled Supports, Refutes or NotEnoughInfo. Supported or refuted claims must be accompanied by evidence from the Wikipedia dump used in the original task (claims labelled NotEnoughInfo do not require evidence). The Breakers will have access to the systems so that they can generate claims which are challenging for the Builders. Alongside the label and evidence for each claim, Breakers will be asked to provide meta-information regarding the type of attack they are introducing. Breakers will be invited to submit up to a fixed number of claims as their entry to the shared task. We welcome both manual (through the use of our annotation interface) and automated methods for this phase. Half of the claims generated by the Breakers will be retained as a blind hold-out test set and the remaining half will be released to the participants to fix their systems. The blind set will be manually evaluated by the organisers for quality assurance.

In the final phase of the shared task, the original Builders or teams of dedicated Fixers must incorporate the new data generated by the Breakers to improve the systems' classification performance.

Instructions for Builders

Builders will be creating systems that can solve the original FEVER task. Participants in this category are also encouraged to participate as Fixers for their own systems.

Training Data

The definition of the FEVER 1.0 task includes examples of the data structures for each of the three labels.

Resources

Existing implementations from the FEVER 1.0 shared task, as well as the FEVER dataset, can be found on our resources page.

Submission

Participants must submit their predictions to the new FEVERlab leaderboard for scoring. We also invite participants to make their systems available to the Breakers by creating a docker image (sample docker image) and submitting it to the FEVERlab page. The Shared Task organisers will host the docker images and keep them private, mediating access through the Shared Task server. Throughout the shared task, Builders should be available to support Breakers or Fixers using their system via the FEVER Slack channel.
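
For illustration only, a single-instance prediction service might look like the minimal sketch below. The /predict endpoint, the request/response schema, and the predict_claim function are assumptions made for this example, not the required interface; consult the sample docker image for the interface the organisers expect.

# Minimal sketch of a builder's prediction web API, assuming Flask.
# The /predict route, payload shape, and predict_claim() are
# hypothetical placeholders; see the sample docker image for the
# actual required interface.
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_claim(claim: str) -> dict:
    # Placeholder: a real system would run document retrieval,
    # sentence selection and claim verification here.
    return {"predicted_label": "NOT ENOUGH INFO", "predicted_evidence": []}

@app.route("/predict", methods=["POST"])
def predict():
    claims = request.get_json()["claims"]
    return jsonify({"predictions": [predict_claim(c) for c in claims]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)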

Scoring

Baseline performance of Builder systems will be measured using predictions on the FEVER test set -- these results will be displayed on the new FEVERlab leaderboard alongside the Codalab entries for the original FEVER 1.0 task. After the Builders have submitted docker images and the Breakers have submitted adversarial instances, we will measure the Builders' resilience to adversarial examples. The results will be presented in a new leaderboard.

Test Data format

The test data submitted by the Breakers (see below) will use the same format as the FEVER 1.0 task.

Instructions for Breakers

Breakers will be generating adversarial claims in an attempt to break as many Builders' systems as possible. The adversarial claims can be generated manually or automatically, and participants are free to choose specific systems to target. All three types of claims are allowed (Supported, Refuted or NotEnoughInfo), but Supported and Refuted claims have to be accompanied by at least one evidence sentence (from the FEVER 1.0 pre-processed Wikipedia dump).
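
As a toy illustration of an automated attack, the sketch below produces a candidate Refuted claim by swapping one entity in a claim known to be Supported. The claim text and replacement table are invented for this example, and a real submission would still need valid evidence from the FEVER 1.0 Wikipedia dump.

# Toy "entity replacement" attack: start from a SUPPORTED claim and
# swap one entity to obtain a candidate REFUTES claim. The claim and
# the replacement table are invented for illustration.
import random

def entity_replacement(claim: str, replacements: dict[str, list[str]]) -> str:
    for entity, alternatives in replacements.items():
        if entity in claim:
            return claim.replace(entity, random.choice(alternatives))
    return claim

supported = "Lorelai Gilmore's father is named Richard."
candidate = entity_replacement(supported, {"Richard": ["Robert", "Rory"]})
print(candidate)  # e.g. "Lorelai Gilmore's father is named Robert."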

[optional] Training Data

At the launch of the challenge, we will release additional annotation artefacts to support adversarial attacks. These artefacts will include the mutations that were used to generate the FEVER claims.

Data format

Each adversarial claim submitted has to match the format of the FEVER 1.0 claims, with the addition of an attack field containing meta-information on how the attack was generated. We will provide a list of expected values at the challenge launch, e.g.:

{
    "id": 78526,
    "label": "REFUTES",
    "claim": "Lorelai Gilmore's father is named Robert.",
    "attack": "Entity replacement",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, "Lorelai_Gilmore", 3]
        ]
    ]
}

See the definition of the FEVER 1.0 task for more details.
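
Before uploading, it may help to sanity-check that each record carries the fields shown above. The validator below is a sketch under those assumptions; the organisers' actual validation rules, and the exact label strings used in the data files, should be confirmed against the FEVER 1.0 resources, and the filename here is a placeholder.

# Sketch of a client-side structural check for breaker claims, assuming
# the fields from the example above and FEVER 1.0's label strings.
# It checks structure only, not label or evidence correctness.
import json

REQUIRED = {"id", "label", "claim", "attack", "evidence"}
LABELS = {"SUPPORTS", "REFUTES", "NOT ENOUGH INFO"}

def check_claim(record: dict) -> list[str]:
    errors = [f"missing field: {field}" for field in REQUIRED - record.keys()]
    if record.get("label") not in LABELS:
        errors.append(f"unknown label: {record.get('label')}")
    if record.get("label") in {"SUPPORTS", "REFUTES"} and not any(record.get("evidence", [])):
        errors.append("verifiable claims need at least one evidence sentence")
    return errors

with open("submission.jsonl") as f:  # hypothetical filename
    for line_no, line in enumerate(f, 1):
        for error in check_claim(json.loads(line)):
            print(f"line {line_no}: {error}")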

Submission

In order to register as a Breaker for the FEVER 2.0 task, each participant must submit a sample of 50 examples by 30 April 2019; these will be manually evaluated by the organisers of the task. For the final submission, participants will submit a balanced dataset of up to 1000 examples: 50% will be given to Builders as development data, and the other 50% will be manually evaluated for accuracy of claim labels and evidence and used as the final test set.

Scoring

Breakers will be scored on the potency of the adversarial instances they submit: an inverted FEVER score based on the number of systems that incorrectly classify the claims that meet the data guidelines. For a formal definition, see Section 3 of this paper: https://arxiv.org/abs/1903.05543.
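
Informally, a Breaker's potency can be read as one minus the mean FEVER score the Builder systems achieve on that Breaker's well-formed claims. The sketch below computes this simplified quantity; it assumes gold evidence has been pre-simplified to groups of (page, sentence_id) pairs and ignores the paper's adjustment for claims that fail the data guidelines, so Section 3 of the paper above remains the authoritative definition.

# Simplified potency: 1 minus the mean per-system FEVER score on a
# breaker's well-formed claims. Assumes gold["evidence"] has been
# pre-simplified to groups of (page, sentence_id) pairs; the linked
# paper's Section 3 gives the authoritative definition.
def fever_score(prediction: dict, gold: dict) -> float:
    if prediction["predicted_label"] != gold["label"]:
        return 0.0
    if gold["label"] == "NOT ENOUGH INFO":
        return 1.0
    predicted = set(map(tuple, prediction["predicted_evidence"]))
    # Correct only if at least one full gold evidence group was retrieved.
    return float(any(
        all(tuple(sentence) in predicted for sentence in group)
        for group in gold["evidence"]
    ))

def potency(system_predictions: list[list[dict]], gold_claims: list[dict]) -> float:
    per_system = [
        sum(fever_score(p, g) for p, g in zip(predictions, gold_claims)) / len(gold_claims)
        for predictions in system_predictions
    ]
    return 1.0 - sum(per_system) / len(per_system)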

Instructions for Fixers

Fixers will be working on correcting errors specific to types of (or individual) adversarial attacks. This round is open to everyone, regardless of participation in previous rounds. Builders are invited to submit improved systems based on Breaker data; alternatively, Fixers can collaborate with one or more Builders, using one of the published systems, and submit improved solutions as a new team.

Systems

The following FEVER 2.0 systems are open to Fixers; each Builder has agreed to collaborate or to release code.

  • GPLSI - GitHub
  • Saarland - GitHub
  • Columbia - open to collaboration - contact Tuhin Chakrabarty or Christopher Hidey via Slack

The following baseline systems are also open to Fixers. The original systems have been forked and modified to run inside docker containers (original systems and papers can be downloaded from the resources page) and implement both the web API and batch-mode predictions.

  • UNC-NLP - GitHub
  • UCL Machine Reading Group - GitHub
  • Athene UKP TU Darmstadt - GitHub
  • Papelo (NEC Labs America) - GitHub

Development Data

The development dataset based on Breakers' submissions is now available from the resources page. All submissions (except the rule-based baseline) have been manually annotated for correctness.

Data format

Systems will be provided a set of unlabeled claims and will be scored on their ability to correctly identify evidence and label the claim. The data format provided to the fixers will be the same as the FEVER 1.0 task: i.e. the attack metadata will not be provided to the systems at test time.
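
As a concrete illustration of that stripping step, the snippet below reduces a labelled breaker file to the unlabeled form a Fixer's system would see at test time. The field names follow the breaker example above; the filenames are placeholders.

# Illustration of the test-time format: labels, evidence, and attack
# metadata are dropped, leaving only the claim id and text.
# Filenames are hypothetical.
import json

with open("breaker_claims.jsonl") as fin, open("test_input.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        fout.write(json.dumps({"id": record["id"], "claim": record["claim"]}) + "\n")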

Submission

Fixers will be invited to submit a docker image following the same guidelines as the Builders. Both the predict.sh batch mode and the web API for single-instance prediction must be implemented. For more information about the web API, see the existing systems, a sample submission, or the GitHub page. The submission section on the FEVERlab page will open on 20 June.
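
For a quick smoke test of the single-instance web API, a client call along these lines may be useful. The endpoint and payload shape mirror the server sketch earlier on this page and are assumptions; check the sample submission for the real interface.

# Hypothetical client call against a running submission container; the
# /predict endpoint and payload shape are assumed, mirroring the server
# sketch above -- the sample submission defines the real interface.
import requests

response = requests.post(
    "http://localhost:5000/predict",
    json={"claims": ["Lorelai Gilmore's father is named Robert."]},
)
print(response.json()["predictions"])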

Scoring

Participants will be scored on their improvement on the final test set of Breakers' adversarial examples, as well as their score on the FEVER 1.0 test set. The leaderboard will display all scores.