AVerImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web
Rui Cao, Zifeng Ding, Zhijiang Guo, Michael Schlichtkrull, Andreas Vlachos
Contains guidelines for the AVerImaTeC Shared Task
@article{cao2025averimatec,
  title={AVerImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web},
  author={Cao, Rui and Ding, Zifeng and Guo, Zhijiang and Schlichtkrull, Michael and Vlachos, Andreas},
  journal={arXiv preprint arXiv:2505.17978},
  year={2025}
}
The AVerImaTeC challenge aims to evaluate the ability of systems to verify real-world image-text claims with evidence from the Web.
To learn more about the task, read the dataset description paper AVerImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web, and visit the shared task webpage. You can find the call for papers on our workshop page.
Participants are given access to the training and development datasets of the AVerImaTeC paper, available here. In addition, we provide a static knowledge store for each claim, compiled by searching the Web using the Google API, here. It is guaranteed to contain the right evidence, so participants do not need to use a search engine to develop their approaches. The datasets are distributed in the JSONL format with one example per line (see http://jsonlines.org for more details). The data can be downloaded on the AVerImaTeC Dataset page.
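As a rough illustration, the snippet below shows one way to load the JSONL files in Python. The file names used here are placeholders, not the official ones; substitute the actual paths from the AVerImaTeC Dataset page after downloading.

```python
import json

def load_jsonl(path):
    """Load a JSONL file: one JSON object per line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Placeholder file names -- replace with the actual paths of the
# downloaded training/development splits and knowledge store files.
train_claims = load_jsonl("train.jsonl")
dev_claims = load_jsonl("dev.jsonl")

print(len(train_claims), "training claims")
print(train_claims[0].keys())  # inspect the available fields
```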
Before the release of the testing data, you can submit your predictions on the development split to our Huggingface competition page, to familiarize yourself with the required output format for your system's predictions.
When submitting system predictions, you need to specify the system name. We will use the team name you specified on Huggingface when we compile the final results of the challenge.
Predictions should be provided in the JSON Lines format, with one example per line (see http://jsonlines.org for more details).
Each JSON object (i.e., prediction) is a dictionary containing three essential keys: id, evidence, and verdict. The fields are:

id: The index of the image-text claim.
questions: The generated essential questions for verifying the image-text claim.
evidence: The retrieved evidence for verifying the claim. Each piece of evidence has:
    text: The textual part of the evidence. If the evidence is multimodal, images in the textual part are represented with placeholders (e.g., [IMG_1], [IMG_2], ...).
    images: A list containing all images in a piece of multimodal evidence. It may be an empty list if the evidence is text-only. Each item in the list is a Base64-encoded image representation.
verdict: The predicted veracity label of the claim.
justification: A textual justification that explains how the verdict can be reached on the basis of the evidence found.

An example of the submission can be found here; a minimal sketch of a single prediction object is also given below. To submit your predictions to our Huggingface competition page, your prediction MUST be a .CSV file named submission.csv due to Hugging Face specifications.
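For illustration only, the sketch below assembles one prediction object with the fields listed above and appends it to a JSON Lines file. The field values, the encode_image helper, the image file name, and the verdict string are all illustrative assumptions, not an official reference implementation; consult the linked example submission for the exact values expected and for how the predictions are packaged into submission.csv.

```python
import base64
import json

def encode_image(path):
    """Base64-encode an image file so it can be embedded in the evidence."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# One prediction object with the keys described above; all values are
# illustrative placeholders.
prediction = {
    "id": 0,
    "questions": ["Where was the photo in the claim taken?"],
    "evidence": [
        {
            "text": "The photo [IMG_1] was originally taken at ...",
            "images": [encode_image("evidence_image_1.jpg")],  # placeholder file
        }
    ],
    "verdict": "Refuted",  # placeholder label
    "justification": "The retrieved evidence shows that ...",
}

# Write one JSON object per line, following the JSONL convention.
with open("predictions.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(prediction) + "\n")
```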
You may submit a system description paper describing the system's method, how it has been trained, the evaluation, and possibly an error analysis to understand the strengths and weaknesses of the proposed system. The system description paper must be submitted as a PDF, with a maximum of eight pages of content (four to six pages will be sufficient for most description papers) plus unlimited pages for the bibliography. Submissions must follow the ACL 2026 two-column format, using the LaTeX style files, Word templates, or the Overleaf template from the official ACL website.
NB: System Description papers are reviewed in a single-blind review process. Thus, your manuscript may contain authors' names and information that would reveal your identity (e.g. team name, score, and rank at the shared task). Also note that at least one author of the system description paper will have to register as a reviewer for the FEVER Workshop.
The implementation of the Baseline system can be found on our Github repository.
The AVerImaTeC scoring follows a similar approach to previous FEVER shared tasks and considers the correctness of the verdict label conditioned on the correctness of the evidence retrieved. The label will only be considered correct if the recall score between the provided evidence and the annotated evidence is at least 0.3. The scoring script can be found on the AVerImaTeC Dataset page. A detailed explanation of the scoring metric can be found in this paper.
In addition to conditional accuracy, which is the main evaluation metric, we will report additional scores for each submission.
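As a hedged sketch of the conditional-scoring idea (not the official scoring script, which is available on the AVerImaTeC Dataset page): a predicted verdict counts toward accuracy only when the evidence-recall check against the annotated evidence clears the 0.3 threshold. The recall_fn argument below is a stand-in for the recall metric defined in the paper.

```python
def conditional_accuracy(predictions, references, recall_fn, threshold=0.3):
    """Toy version of verdict accuracy conditioned on evidence quality.

    recall_fn(pred_evidence, gold_evidence) stands in for the evidence-recall
    metric defined in the AVerImaTeC paper; this is not the official scorer.
    """
    correct = 0
    for pred, gold in zip(predictions, references):
        evidence_ok = recall_fn(pred["evidence"], gold["evidence"]) >= threshold
        if evidence_ok and pred["verdict"] == gold["verdict"]:
            correct += 1
    return correct / len(references) if references else 0.0
```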
Q: Must the evidence be in the format of QA pairs?
A: No. If you don't have the evidence as question-answer pairs, you should submit the same JSON format, just omitting the questions field.
Q: Are the metadata (e.g., fact-checking strategies, claim types) provided in the test set?
A: Recognising the claim type and the fact-checking strategies is part of performing the fact-check, so we will not provide them for the test data. However, we hope you find their annotations in the training and development data useful.