AVERITEC Shared Task

Citations

AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web
Michael Schlichtkrull, Zhijiang Guo, Andreas Vlachos
Contains guidelines for the AVERITEC Shared Task

@inproceedings{
  schlichtkrull2023averitec,
  title={AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web},
  author={Michael Sejr Schlichtkrull and Zhijiang Guo and Andreas Vlachos},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023},
  url={https://openreview.net/forum?id=fKzSz0oyaI}
}

The Automated Verification of Textual Claims (AVeriTeC) Shared Task
Michael Schlichtkrull, Yulong Chen, Chenxi Whitehouse, Zhenyun Deng, Mubashara Akhtar, Rami Aly, Zhijiang Guo, Christos Christodoulopoulos, Oana Cocarascu, Arpit Mittal, James Thorne, Andreas Vlachos

@inproceedings{schlichtkrull-etal-2024-automated,
    title = "The Automated Verification of Textual Claims ({AV}eri{T}e{C}) Shared Task",
    author = "Schlichtkrull, Michael  and Chen, Yulong and Whitehouse, Chenxi and Deng, Zhenyun and Akhtar, Mubashara  and Aly, Rami  and Guo, Zhijiang  and Christodoulopoulos, Christos  and Cocarascu, Oana  and  Mittal, Arpit  and Thorne, James  and Vlachos, Andreas",
    booktitle = "Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER)",
    month = nov,
    year = "2024",
    url = "https://aclanthology.org/2024.fever-1.1/",
}

Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking
Mubashara Akhtar, Michael Schlichtkrull, Andreas Vlachos

@article{akhtar2024ev2r,
  title={Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking},
  author={Akhtar, Mubashara and Schlichtkrull, Michael and Vlachos, Andreas},
  journal={arXiv preprint arXiv:2411.05375},
  year={2024}
}

Final Leaderboard

| Rank | Team | Q only (Ev2R recall) | Q + A (Ev2R recall) | new AVeriTeC score (Ev2R recall) | Average runtime per claim (s) |
|------|------|----------------------|---------------------|----------------------------------|-------------------------------|
| 1 | CTU AIC | 0.2003 ± 0.0066 | 0.4774 ± 0.0035 | 0.3317 ± 0.0015 | 53.67 |
| 2 | HUMANE | 0.1933 ± 0.0048 | 0.4299 ± 0.0006 | 0.2707 ± 0.0040 | 29.19 |
| 3 | yellow_flash | 0.1561 ± 0.0057 | 0.4098 ± 0.0077 | 0.2527 ± 0.0051 | 31.71 |
| 4 | FZIGOT | 0.3622 ± 0.0067 | 0.3998 ± 0.0031 | 0.2440 ± 0.0020 | 18.50 |
| 5 | EFC | 0.1254 ± 0.0005 | 0.3520 ± 0.0055 | 0.2047 ± 0.0025 | 7.01 |
| 6 | checkmate | 0.1848 ± 0.0068 | 0.3368 ± 0.0049 | 0.2043 ± 0.0047 | 22.73 |
| 7 | Baseline | 0.2723 ± 0.0006 | 0.3362 ± 0.0036 | 0.2023 ± 0.0068 | 33.88 |
| 8 | OldJoe | 0.1823 ± 0.0049 | 0.3878 ± 0.0014 | 0.1517 ± 0.0025 | 48.57 |

Key Dates

  • Challenge Launch: January 27, 2025
  • Training/Dev Data Release: January 27, 2025
  • Testing Begins: April 28, 2025
  • Shared Task System Submission Closes: May 2, 2025
  • Results Announced: May 9, 2025
  • Shared Task Paper deadline for Workshop: May 19, 2025
  • Notification deadline: June 18, 2025
  • Camera-ready deadline: June 23, 2025

Update (April 28)

The test data (claims and knowledge base, but not the evidence and the labels) have now been released: https://drive.google.com/drive/folders/1DzcJogH3592Ibv19uFWUI84FpbL7NCcC. Time to check that what you have developed runs as it should, and to get ready to submit your systems by Friday, midnight anywhere on earth.

There was a discussion about what the penalty might be for systems that are slower than specified. The test set contains 1,000 claims, and thus systems should run over the test set in 1,000 minutes. If a system exceeds this time limit, we will stop it, and any claims left unlabelled will be considered incorrectly labelled. To repeat: we would much rather have efficient systems going for the maximum possible score than have to penalise less efficient ones.
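
To make the budget concrete, here is a minimal sketch of how such a time limit could be enforced; `verify_claim` is a hypothetical placeholder for a participant's system, and this is not the official evaluation harness.

```python
# Minimal sketch of the 1,000-minute budget (1 minute per claim on average).
# `verify_claim` is a hypothetical placeholder, not part of the shared task code.
import time

TOTAL_BUDGET_S = 1000 * 60  # 1,000 claims x 60 seconds average per claim

def run_with_budget(claims, verify_claim):
    start = time.monotonic()
    predictions = []
    for claim in claims:
        if time.monotonic() - start > TOTAL_BUDGET_S:
            break  # remaining claims stay unlabelled and count as incorrect
        predictions.append(verify_claim(claim))
    return predictions
```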

We have set up a leaderboard for the development set: https://huggingface.co/spaces/fever/AVeriTeCFever8. It uses the same Ev2R approach but with a better LLM, Llama-3.3-70B. The evaluation takes 4 hours for the dev set. We no longer do multiple runs, as we found the variance to be small. However, we cannot evaluate multiple submissions in parallel due to Hugging Face competition specifications.

After the shared task, we set up a leaderboard for the test set: https://huggingface.co/spaces/fever/Fever8. It uses the same Ev2R approach with Llama-3.3-70B. As with the development leaderboard, we do not do multiple runs, as we found the variance to be small.

Task Definition

The AVeriTeC challenge aims to evaluate the ability of systems to verify real-world claims with evidence from the Web.

  • Given a claim and its metadata, the systems must retrieve evidence that supports and/or refutes the claim, either from the Web or from the document collection provided by the organisers.
  • Using this evidence, label the claim as Supported or Refuted given the evidence, Not Enough Evidence (if there isn't sufficient evidence to either support or refute it), or Conflicting Evidence/Cherry-picking (if the claim has both supporting and refuting evidence).
  • A response will be considered correct only if both the label is correct and the evidence is adequate. As evidence retrieval evaluation is non-trivial to perform automatically, participants will be asked to help evaluate it manually to assess the systems fairly.
  • This shared task focuses on reproducible and efficient fact verification systems. To this end, systems will be evaluated by running their predictions on the test set on a dedicated VM.

Data

The participants are given access to the training and development datasets of the AVeriTeC paper available here, same as in the 2024 shared task. In addition, we provide the document collection for each claim, compiled by searching the Web using the Google API, here. It is guaranteed to contain the right evidence, so participants do not need to use a search engine to develop their approaches.

The test dataset will consist of the test set from 2024, but with a revised document collection (fixing the temporal leakage issue; participants from last year should use this revised document collection), and a new portion with more recent claims. The new portion of the test set will be released in April, including both claims and document collections (but not the correct responses). The new part of the test set used in the shared task (i.e. the part not described in the AVeriTeC paper) was annotated using a donation from Google.

The datasets are distributed in the JSONL format with one example per line (see http://jsonlines.org for more details). The data can be downloaded from the AVeriTeC Dataset page.
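
As a quick illustration of the JSONL format, each split can be read line by line as below; the file name is a placeholder and this is only a sketch, not part of the official tooling.

```python
# Minimal sketch of reading a JSONL split (one JSON object per line).
import json

def load_jsonl(path):
    examples = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                examples.append(json.loads(line))
    return examples

dev = load_jsonl("dev.jsonl")  # placeholder file name
print(len(dev), "examples loaded")
```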

Data Format used for submission to the leaderboard during development

Before the release of the testing data, you can submit your predictions on the development split to our Huggingface competition page, to familiarize yourself with the output requirement/format of your system’s predictions.

When submitting system predictions, you need to specify the system name. We will use the team name you specified on Huggingface when we compile the final results of the challenge.

The predictions follow the JSONL format, with one example per line (see http://jsonlines.org for more details).

Each example is an object of the following form:

  • claim_id: The ID of the sample.
  • claim: The claim text itself.
  • pred_label: The predicted label of the claim.
  • evidence: A list of QA pairs. Each pair is a dictionary with four fields (question, answer, url, scraped_text).
    • question: The text of the generated question.
    • answer: The text of the answer to the generated question.
    • url: The source url for the answer.
    • scraped_text: The text scraped from the source url.

An example of the submission can be found here. To submit your prediction to our Huggingface competition page, your prediction MUST be a .CSV file named submission.csv due to Hugging Face specifications. The conversion file (.json to .csv) can be found here, and you can find more details on the submission page itself.
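
For illustration only, the sketch below builds one prediction record with the fields listed above and writes a submission.csv; the CSV column layout here is an assumption, so the provided conversion script remains the authoritative reference.

```python
# Hedged sketch: one prediction record and a submission.csv (column layout assumed).
import csv
import json

record = {
    "claim_id": 0,
    "claim": "Example claim text.",
    "pred_label": "Refuted",
    "evidence": [
        {
            "question": "Who made the statement?",
            "answer": "An example answer.",
            "url": "https://example.com/source",
            "scraped_text": "Text scraped from the source page.",
        }
    ],
}

with open("submission.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "prediction"])  # assumed header
    writer.writerow([record["claim_id"], json.dumps(record)])
```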

On the leaderboard, the metric to optimise for is the new AVeriTeC score (Ev2R recall). The Hungarian METEOR-based scores are kept for reference against the results of the 2024 shared task.

Submission of systems for evaluation on the hidden test data

Every system participating in the shared task is expected to produce reproducible and efficient code. Shared task systems must therefore be submitted to the shared task organisers, who will evaluate them on identical virtual machines. The virtual machine is a g5.2xlarge EC2 instance on AWS. The configuration of the virtual machine is:

  • GPU: Nvidia A10G, with 23GB memory
  • CPU: 8 vCPUs
  • RAM: 32GB
  • Storage: 450GB (including the AVeriTeC knowledge base)

At inference time on the test set, a submitted system must verify a single claim on the VM in at most 1 minute on average. A valid system submission is required to run on the aforementioned VM within the specified time constraints, and will be evaluated with respect to the aforementioned new AVeriTeC score (Ev2R recall).

To help participants ensure that systems will run on the virtual machine, we provide in the shared task repository a Docker image and an exact description of the AWS AMI that will be used. The repository contains further information on system requirements and necessary files to be included in submitted systems. To submit your system, you can either share a ZIP file or a Docker Instance and send a download URL to fever-organisers@googlegroups.com.

System Description Paper

You may submit a system description paper, describing the system's method, how it has been trained, the evaluation, and possibly an error analysis to understand the strengths and weaknesses of the proposed system. The system description paper must be submitted as a PDF, consisting of a maximum of eight pages of content (for most description papers four to six pages will be sufficient) plus unlimited pages for bibliography. Submissions must follow the ACL 2025 two-column format, using the LaTeX style files, Word templates, or the Overleaf template from the official ACL website. Please submit your system description papers here.

NB: System Description papers are reviewed in a single-blind review process. Thus, your manuscript may contain authors' names and information that would reveal your identity (e.g. team name, score, and rank at the shared task). Also note that at least one author of the system description paper will have to register as a reviewer for the FEVER Workshop.

Baseline system

The implementation of the Baseline system can be found on our GitHub repository. The baseline builds upon the HerO system, which was the runner-up in the FEVER 7 Shared Task. HerO was selected as the foundation for our baseline due to its combination of strong performance and reproducibility: it provides fully reproducible code and relies exclusively on open-weight models. Our baseline modifies the original HerO implementation, focusing on computational efficiency. The optimised version largely maintains the performance of the HerO 8B-parameter variant while significantly reducing inference runtime (through improved parallelisation, retrieval cutoffs, and heuristics) to an average of around 50s per claim on the development set using the aforementioned VM.

Scoring

The new AVeriTeC scoring follows a similar approach to FEVER and considers the correctness of the verdict label conditioned on the correctness of the evidence retrieved. If a claim is labelled as Supported, Refuted, or Conflicting Evidence/Cherry-picking, the evidence is additionally checked against the list of annotated evidence. The label will only be considered correct if the Ev2R recall score between the provided evidence and the annotated evidence is at least 0.44. The scoring script can be found on the AVeriTeC Dataset page. A detailed explanation of the scoring metric can be found in this paper.
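
As a rough illustration of that conditioning, here is a minimal sketch, assuming a hypothetical ev2r_recall(pred_evidence, gold_evidence) judge; it is not the official scoring script.

```python
# Hedged sketch of the new AVeriTeC score, not the official scorer.
EVIDENCE_CHECKED_LABELS = {"Supported", "Refuted", "Conflicting Evidence/Cherry-picking"}
RECALL_THRESHOLD = 0.44

def is_correct(pred_label, gold_label, pred_evidence, gold_evidence, ev2r_recall):
    """A prediction counts only if the label matches and, where required,
    the retrieved evidence sufficiently covers the annotated evidence."""
    if pred_label != gold_label:
        return False
    if pred_label in EVIDENCE_CHECKED_LABELS:
        return ev2r_recall(pred_evidence, gold_evidence) >= RECALL_THRESHOLD
    return True

def new_averitec_score(examples, ev2r_recall):
    correct = sum(
        is_correct(ex["pred_label"], ex["gold_label"],
                   ex["pred_evidence"], ex["gold_evidence"], ev2r_recall)
        for ex in examples
    )
    return correct / len(examples)
```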

For the new AVeriTeC score, the following changes are made to the FEVER scorer:

  • Claims in Fact-Checking datasets are typically supported or refuted by evidence, or there is not enough evidence. We add a fourth class: conflicting evidence/cherry-picking. This covers both conflicting evidence, and technically true claims that mislead by excluding important context, i.e., the claim has both supporting and refuting evidence.
  • Unlike FEVER, which uses a closed source of evidence (Wikipedia), AVeriTeC is intended for use with evidence retrieved from the open web. Since the same evidence may be found in different sources, we cannot rely on exact matching to score retrieved evidence. Instead, we rely on approximate matching: specifically, we use Ev2R to find an optimal matching of the provided evidence to the annotated evidence (a rough illustration follows this list).
  • The shared task uses Llama 3.3 70B as the grader.
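
The sketch below is only meant to convey the flavour of such approximate matching with an LLM judge; the judge interface is an assumption and this is not the Ev2R implementation.

```python
# Hedged sketch of LLM-judged evidence recall, not the Ev2R implementation.
def evidence_recall(pred_evidence, gold_evidence, llm_judge):
    """Fraction of annotated (gold) evidence items judged as covered by the
    retrieved evidence. `llm_judge(gold_item, pred_evidence)` is a hypothetical
    callable backed by an LLM such as Llama 3.3 70B."""
    if not gold_evidence:
        return 1.0
    covered = sum(1 for gold_item in gold_evidence if llm_judge(gold_item, pred_evidence))
    return covered / len(gold_evidence)
```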

FAQs

Q: Must the evidence be in the format of QA pairs?

A: If you don't have the evidence as question-answer pairs, you should submit the same JSON format, just omitting the question field.
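
For instance (values are made up for illustration), an evidence list without questions could look like this:

```python
# Illustrative evidence entries without the question field (values are made up).
evidence_without_questions = [
    {
        "answer": "The figure appears in the organisation's 2020 annual report.",
        "url": "https://example.com/report",
        "scraped_text": "Text scraped from the source page.",
    }
]
```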

Q: Is it possible to use an LLM like OpenAI's ChatGPT?

A: You can use LLMs of your choice and mention this in your system description, but LLMs beyond those that can be run on the infrastructure specified in the section on system submission can only be used during training, e.g. for training data augmentation.

Q: Are the metadata (e.g., fact-checking strategies, claim types) provided in Test set?

A: Recognising the claim type and the fact-checking strategies is part of performing the fact-check, thus we won't be providing them for the test data. We hope you find their annotations in the training and development data useful.