AVeriTeC Shared Task

Citations

AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web
Michael Schlichtkrull, Zhijiang Guo, Andreas Vlachos
This paper contains the guidelines for the AVeriTeC Shared Task.

@inproceedings{schlichtkrull2023averitec,
  title={AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web},
  author={Michael Sejr Schlichtkrull and Zhijiang Guo and Andreas Vlachos},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023},
  url={https://openreview.net/forum?id=fKzSz0oyaI}
}

Key Dates (TBC)

  • Challenge Launch: April 2024
  • Training/Dev Data Release: April 2024
  • Testing Begins: June 30, 2024
  • Submission Closes: July 15, 2024
  • Results Announced: July 18, 2024
  • System Descriptions Due for Workshop: August 15, 2024
  • Winners Announced: November 15 or 16, 2024 (7th FEVER Workshop)

Task Definition

The AVeriTeC challenge aims to evaluate the ability of systems to verify real-world claims with evidence from the Web.

  • Given a claim and its metadata, systems must retrieve evidence that supports and/or refutes the claim, either from the Web or from the document collection provided by the organisers.
  • Using this evidence, they must label the claim as Supported, Refuted, Not Enough Evidence (if there is insufficient evidence to either support or refute it), or Conflicting Evidence/Cherry-picking (if the claim has both supporting and refuting evidence); the label strings are listed in the sketch after this list.
  • A response will be considered correct only if both the label is correct and the evidence is adequate. As evidence retrieval is non-trivial to evaluate automatically, participants will be asked to help evaluate it manually so that systems can be assessed fairly.
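
For convenience, here is a minimal sketch of the label set, using the names given in the task definition above; verify the exact strings against the released data files and the scoring script.

# The four verdict labels, as named in the task definition above.
# Verify the exact strings against the released data files / scoring script.
AVERITEC_LABELS = (
    "Supported",
    "Refuted",
    "Not Enough Evidence",
    "Conflicting Evidence/Cherry-picking",
)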

To learn more about the task and our baseline implementation, read our paper AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web.

Data

The participants are given access to the training and development datasets of the AVeriTeC paper, available here. In addition, we provide the document collection for each claim, compiled by searching the Web using the Google API, here. This collection is guaranteed to contain the gold evidence, so participants do not need to use a search engine to develop their approaches (but doing so is allowed). The test dataset will be released at the end of June, including both claims and document collections (but not the correct responses). The datasets are distributed in the JSONL format with one example per line (see http://jsonlines.org for more details); a minimal loading sketch follows. The data can be downloaded on the AVeriTeC Dataset page.
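
A minimal loading sketch, assuming the training split has been downloaded as "train.json" from the dataset page (the filename is an assumption; adjust it to whatever the download provides). Each line holds one JSON object, per the JSONL convention described above.

# Load the training split, one JSON object per line (JSONL).
import json

with open("train.json", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

print(len(examples))
print(examples[0]["claim"])  # field names follow the data format described below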

Submission

System predictions should be submitted to our EvalAI challenge page. Before the release of the test data, you can submit predictions on the development split to familiarise yourself with the submission system. When submitting system predictions, you need to specify the system name and, if available, a link to the code. We will use the team name you specified on EvalAI when we compile the final results of the challenge. You can find more details on the submission page itself.

NB: Participants are limited to one submission per system per day; multiple submissions are allowed, but only the final one will be scored.

System Description Paper

You may submit a system description paper describing the system's method, how it was trained, its evaluation, and possibly an error analysis to understand the strengths and weaknesses of the proposed system. The system description paper must be submitted as a PDF, consisting of a maximum of eight pages of content (four to six pages will be sufficient for most description papers) plus unlimited pages for bibliography. Submissions must follow the EMNLP 2024 two-column format, using the LaTeX style files, Word templates, or Overleaf template from the official EMNLP website. Please submit your system description papers here.

NB: System Description papers are reviewed in a single-blind review process. Thus, your manuscript may contain authors' names and information that would reveal your identity (e.g. team name, score, and rank at the shared task). Also note that at least one author of the system description paper will have to register as a reviewer for the FEVER Workshop.

Baseline system

The implementation of the Baseline system can be found in our Hugging Face repository.

For the technical details of the implementation as well as the Baseline performance, please refer to the AVeriTeC paper.

Scoring

The AVeriTeC scoring is built on the FEVER scorer. The scoring script can be found on the AVeriTeC Dataset page.

For the AVeriTeC score, the following changes are made to the FEVER scorer:

  • Claims in fact-checking datasets are typically supported or refuted by evidence, or there is not enough evidence. We add a fourth class: conflicting evidence/cherry-picking. This covers both conflicting evidence and technically true claims that mislead by excluding important context, i.e., cases where the claim has both supporting and refuting evidence.
  • Unlike FEVER, which uses a closed source of evidence (Wikipedia), AVeriTeC is intended for use with evidence retrieved from the open web. Since the same evidence may be found in different sources, we cannot rely on exact matching to score retrieved evidence. Instead, we rely on approximate matching: specifically, we use the Hungarian Algorithm to find an optimal matching of provided evidence to annotated evidence (a minimal sketch follows this list).
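
A minimal sketch of the approximate-matching step described above, using the Hungarian Algorithm via scipy's linear_sum_assignment. The token-overlap similarity below is only a placeholder for the metric used by the official scoring script; refer to that script for the exact computation.

# Approximate matching of predicted evidence to annotated evidence.
import numpy as np
from scipy.optimize import linear_sum_assignment


def similarity(pred: str, gold: str) -> float:
    """Placeholder pairwise similarity: Jaccard overlap of lowercased tokens."""
    p, g = set(pred.lower().split()), set(gold.lower().split())
    return len(p & g) / len(p | g) if p | g else 0.0


def evidence_score(predicted: list[str], annotated: list[str]) -> float:
    """Find the assignment of predicted to annotated evidence that maximises
    total similarity, then average the matched scores over the gold evidence."""
    sim = np.array([[similarity(p, g) for g in annotated] for p in predicted])
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return sim[rows, cols].sum() / len(annotated)


# Hypothetical example strings:
print(evidence_score(
    ["The claim was first made in a 2020 press release."],
    ["The statement originated in a press release published in 2020."],
))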

Data Format used for submission

The data are distributed in the JSONL format, with one example per line (see http://jsonlines.org for more details).

Each example is an object of the following form:

  • id: The ID of the sample.
  • claim: The claim text itself.
  • label: The annotated verdict for the claim.
  • evidence: A list of QA pairs. Each pair is a dictionary with two fields (question, answer).
    • question: The text of the generated question.
    • answer: The text of the answer to the generated question.
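
A minimal sketch of writing one prediction in the format listed above. The filename and the example values are hypothetical; check the EvalAI submission page for the expected file name and label strings.

# Write predictions as JSONL, one object per line, using the fields above.
import json

prediction = {
    "id": 0,
    "claim": "Example claim text.",
    "label": "Refuted",
    "evidence": [
        {"question": "Who made the claim?", "answer": "An illustrative answer."},
    ],
}

with open("predictions.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(prediction, ensure_ascii=False) + "\n")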