FEVER 2.0 Shared Task

Key Dates

  • Challenge Launch: 7 March 2019
  • Round 1
      • Builders Submissions: 30 March 2019
  • Round 2
      • Breakers Sample Submissions: 30 April 2019
      • Breakers Final Submissions: 31 May 2019
  • Round 3
      • Breaking Instances Release: 1 June 2019
      • Fixers Submissions: 30 June 2019
  • Results + Workshop
      • Results Announced: 10 July 2019
      • System Descriptions Due for Workshop: 30 August 2019
      • Winners Announced: 3/4 November 2019 (EMNLP-IJCNLP)

Task Definition

The FEVER 2.0 Shared Task will build upon work from the first shared task in a Build it Break it Fix it setting. The shared task will comprise three phases. In the first phase, Builders build systems for solving the first FEVER shared task. The highest-scoring systems from the first shared task will be used as baselines, and we will also invite new participants to develop new systems.

In the second phase, Breakers are tasked with generating adversarial examples to fool the existing systems. We consider only novel claims (i.e. not contained in the original FEVER dataset) with Supports, Refutes or NotEnoughInfo labels. Supported or refuted claims should be accompanied by evidence from the Wikipedia dump used in the original task (claims labelled NotEnoughInfo do not require evidence). The Breakers will have access to the systems so that they can generate claims that are challenging for the Builders' systems. Alongside the labels and evidence for each claim, Breakers will be asked to provide meta-information regarding the type of attack they are introducing. The Breakers will be invited to submit up to a fixed number of claims as their entry to the shared task. We welcome both manual (through the use of our annotation interface) and automated methods for this phase. Half of the claims generated by the Breakers will be retained as a hold-out blind test set and the remaining half will be released to the participants so they can fix their systems. The blind set will be manually evaluated by the organisers for quality assurance.

In the final phase of the shared task, the original Builders or teams of dedicated Fixers must incorporate the new data generated by the Breakers to improve the systems' classification performance.

Instructions for Builders

Builders will be creating systems that can solve the original FEVER task. Participants in this category are also encouraged to participate as Fixers for their own systems.

Training Data

The definition of the FEVER 1.0 task has examples of the data structures for each of the three labels.
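For quick reference, FEVER 1.0 training instances are JSON records with an id, a label, the claim text and (for verifiable claims) evidence pointers into the pre-processed Wikipedia dump. The record below is only an illustrative sketch of that structure, mirroring the Breakers' example later on this page; the ids are placeholders, and real examples are given in the FEVER 1.0 task definition:

{
    "id": <claim_id>,
    "label": "SUPPORTS",
    "claim": "Lorelai Gilmore's father is named Richard.",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, "Lorelai_Gilmore", 3]
        ]
    ]
}

Claims labelled NotEnoughInfo do not require evidence (see the task definition above).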

Resources

Existing implementations from the FEVER 1.0 shared task, as well as the FEVER dataset, can be found on our resources page.

Submission

Participants must submit their predictions to a leaderboard (TBD) for scoring. We invite participants to make their systems available to the Breakers, either by making a web API available or by submitting a Docker image (details TBD). Throughout the shared task, Builders should be able to provide support to Breakers or Fixers that use their system through the FEVER Slack channel.
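The common API specification is still TBD, but as a rough illustration of the web-API option, a Builder system could be wrapped behind a single prediction endpoint along the lines of the Flask sketch below. The endpoint name, request shape and response fields are assumptions for illustration only, not the official interface:

# Illustrative sketch only: the common API specification for FEVER 2.0 is TBD.
# The endpoint name and request/response shapes below are assumptions.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["POST"])  # hypothetical endpoint
def predict():
    claims = request.get_json()["claims"]  # assumed: a list of claim strings
    predictions = []
    for claim in claims:
        # Replace this stub with the Builder system's retrieval + verification pipeline.
        predictions.append({
            "claim": claim,
            "predicted_label": "NOT ENOUGH INFO",
            "predicted_evidence": [],
        })
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)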

Scoring

Participants will be scored based on their performance on the full set of adversarial attacks submitted by the Breakers. More information about scoring will be released at the launch of the challenge.

Test Data format

The data submitted by the Breakers (see below) will be in the same format as the FEVER 1.0 task.

Instructions for Breakers

Breakers will be generating adversarial claims in an attempt to break as many Builders' systems as possible. The adversarial claims can be generated manually or automatically, and participants are free to choose specific systems to target. Claims may be labelled Supports, Refutes or NotEnoughInfo, and each Supported or Refuted claim has to be accompanied by at least one evidence sentence (from the FEVER 1.0 pre-processed Wikipedia dump).

[optional] Training Data

At the launch of the challenge, we will release additional annotation artefacts to support adversarial attacks. These will include the mutations that were used to generate the FEVER claims.

Data format

Each adversarial claim submitted has to match the format of the FEVER 1.0 claims, with the addition of an attack field containing meta-information on how the attack was generated. We will provide a list of expected values at the challenge launch, e.g.:

{
    "id": 78526,
    "label": "REFUTES",
    "claim": "Lorelai Gilmore's father is named Robert.",
    "attack": "Entity replacement",
    "evidence": [
        [
            [<annotation_id>, <evidence_id>, "Lorelai_Gilmore", 3]
        ]
    ]
}

See the definition of the FEVER 1.0 task for more details.
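As a sanity check before submitting, Breakers may want to verify that every record in their file carries the fields shown above. The sketch below is an illustrative validator, not an official one; it assumes a JSONL submission (one adversarial claim per line) and the FEVER label strings, while the exact submission mechanics remain TBD:

# Illustrative sketch, not an official validator.
# Assumes a JSONL submission (one adversarial claim per line) and the
# FEVER label strings; the exact submission format may differ.
import json
import sys

REQUIRED_FIELDS = {"id", "label", "claim", "attack", "evidence"}
LABELS = {"SUPPORTS", "REFUTES", "NOT ENOUGH INFO"}

def validate(path):
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            instance = json.loads(line)
            missing = REQUIRED_FIELDS - instance.keys()
            if missing:
                print(f"line {line_no}: missing fields {sorted(missing)}")
            if instance.get("label") not in LABELS:
                print(f"line {line_no}: unexpected label {instance.get('label')!r}")
            # Supported/Refuted claims need at least one evidence sentence.
            if instance.get("label") != "NOT ENOUGH INFO" and not any(instance.get("evidence") or []):
                print(f"line {line_no}: no evidence for a verifiable claim")

if __name__ == "__main__":
    validate(sys.argv[1])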

Submission

In order to register as a Breaker for the FEVER 2.0 task, each participant will have to submit, by 30 April 2019, a sample of 50 examples that will be manually evaluated by the organisers of the task. For the final submission, participants will have to submit up to 1000 examples; 50% of these will be given to Builders as development data, and the other 50% will be manually evaluated for accuracy of claim labels and evidence and used as the final test set.

Scoring

Participants will be scored based on the number of systems that each correct claim (i.e. one that passes the organisers' manual evaluation) breaks. More information about scoring will be released at the launch of the challenge.

Instructions for Fixers

Fixers will be working on correcting errors specific to types of (or individual) adversarial attacks. Participants in this category can be Builders that have improved their system based on the development dataset of the Breakers; alternatively, Fixers can collaborate with one or more Builders and submit improved solutions as a new team.

Submission

TBC

Scoring

Participants will be scored based on the percentage improvement on the final test set of Breakers' adversarial examples. More information about scoring will be released at the launch of the challenge.