2021 Shared Task

Final Leaderboard

Rank | Team                 | FEVEROUS score | Accuracy | Evidence F1 | Evidence Precision | Evidence Recall
   1 | Bust a move!         | 0.2701         | 0.5607   | 0.1308      | 0.0773             | 0.4258
   2 | Papelo               | 0.2592         | 0.5757   | 0.1187      | 0.0716             | 0.3460
   3 | NCU                  | 0.2514         | 0.5229   | 0.1581      | 0.0991             | 0.3907
   4 | Z team               | 0.2251         | 0.4901   | 0.1312      | 0.0776             | 0.4264
   5 | EURECOM_Fever        | 0.2001         | 0.4779   | 0.1952      | 0.1373             | 0.3373
   6 | FEVEROUS Baseline    | 0.1773         | 0.4548   | 0.1503      | 0.1017             | 0.2878
   7 | Saturday_Night_Fever | 0.1763         | 0.4804   | 0.1618      | 0.1122             | 0.2900
   8 | Martin Funkquist     | 0.1261         | 0.4302   | 0.1045      | 0.0642             | 0.2789
   9 | Albatross            | 0.1159         | 0.4035   | 0.0963      | 0.0644             | 0.1902
  10 | METUIS               | 0.0636         | 0.3897   | 0.0634      | 0.0462             | 0.1011
  11 | ChaCha               | 0.0389         | 0.4194   | 0.0398      | 0.0251             | 0.0969
  12 | seda_kaist           | 0.0362         | 0.4140   | 0.0384      | 0.0242             | 0.0920
  13 | qmul_uou_iiith       | 0.0223         | 0.3999   | 0.0282      | 0.0245             | 0.0330

Key Dates

  • Challenge Launch: 20 May 2021
  • Training Data Release: 7 June 2021
  • Testing Begins: 24 July 2021
  • Submission Closes: 27 July 2021
  • Results Announced: 30 July 2021
  • System Descriptions Due for Workshop: 8 August 2021
  • Winners Announced: 10 November 2021 (4th FEVER Workshop)

Task Definition

The FEVEROUS challenge aims to evaluate the ability of a system to verify information using unstructured and structured evidence from Wikipedia.

  • Given a factual claim involving one or more entities, the system must extract evidence from sentences, table cells, table captions, and/or list items that support or refute the claim.
  • Using this evidence, label the claim as Supported or Refuted given the evidence, or as Not Enough Info (NEI) if there isn't sufficient evidence to either support or refute it.
  • A claim's evidence may consist of multiple sentences, table cells, or list items, as well as a combination of these that provides the stated label only when examined together.
    • For a given piece of evidence, there is associated context that can be used. This includes the article's title, section titles (the section and sub-section(s) the evidence is located in), and, for cells, the closest row and column header (if the element just before the closest row/column header is also a header, it is included in the context as well). The context was automatically selected during the annotation process (see the Data Format section) and can be generated for any Wikipedia element using the context generation snippet located in the README file of the FEVEROUS repository; a minimal illustrative sketch follows this list.
  • Evidence can be located in any section of a Wikipedia article.
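
As an illustration of the cell context rule above, here is a minimal sketch for collecting the closest row header of a cell. This is not the repository's actual snippet; the function name and the cell fields are assumptions based on the table encoding described under Data Format below.

        def closest_row_headers(row, col_index):
            """Collect the closest row header to the left of the cell at col_index.

            row is a list of cell dicts with "id" and "is_header" fields, as in
            the Wikipedia table encoding described in the Data Format section.
            If the element just before the closest header is also a header, it
            is included too (illustrative only, not the official logic).
            """
            headers = []
            for i in range(col_index - 1, -1, -1):
                if row[i]["is_header"]:
                    headers.append(row[i]["id"])
                    if i > 0 and row[i - 1]["is_header"]:
                        headers.append(row[i - 1]["id"])
                    break
            return headers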

To learn more about the task and our baseline implementation, read our paper FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information.

Submission

System predictions should be submitted to our EvalAI challenge page. Before the release of the testing data, you can submit your predictions on the development split to become familiar with the submission system. When submitting system predictions, you need to specify the system name and, if available, a link to the code. We will use the team name you specified on EvalAI when we compile the final results of the challenge. You can find more details on the submission page itself.

NB: Participants are allowed a limited number of submissions per system: multiple submissions are possible, but only the final one will be scored/counted.

System Description Paper

You may submit a system description paper, describing the system's method, how it has been trained, the evaluation, and possibly an error analysis to understand strengths and weaknesses of the proposed system. The system description paper must be submitted as a PDF, consisting of a maximum of eight pages (for most description papers four to six pages will be sufficient) of content plus unlimited pages for bibliography. Submissions must follow the EMNLP 2021 two-column format, using the LaTeX style files or Word templates or the Overleaf template from the official EMNLP website. Please submit your system description papers here.

NB: System description papers are reviewed in a single-blind review process. Thus, your manuscript may contain the authors' names and information that would reveal your identity (e.g. team name, score, and rank at the shared task). Also note that at least one author of the system description paper will have to register as a reviewer for the FEVER Workshop.

Baseline system

The implementation of the baseline system can be found in our GitHub repository.

For the technical details of the implementation as well as the Baseline performance, please refer to the FEVEROUS paper.

Scoring

The FEVEROUS scoring is built on the FEVER scorer. The scoring script can be found on the FEVEROUS Dataset page.

  • We will only award points for accuracy if the correct evidence is found.
  • For a claim, we consider the correct evidence to be found if at least one complete set of annotated evidence (any combination of sentences, table cells, table captions, list items) is returned (the annotated data may contain multiple sets of evidence, each of which is sufficient to support or refute a claim).
  • The scorer will produce other diagnostic scores (F1, macro-precision, macro-recall and accuracy). These will not be considered for the competition other than to rank two submissions with equal FEVEROUS scores.

For the FEVEROUS score, the following changes are made to the FEVER scorer:

  • NEI instances are now treated equally to Supported/Refuted. Thus, accuracy points are only awarded for NEI instances if the correct evidence is found. Correct evidence for NEI instances consists of the most relevant pieces of information found on Wikipedia to either support or refute the claim.
  • Only the first 5 predicted sentences/table captions/list items and the first 25 predicted table cells will be considered for scoring. Additional evidence will be discarded without penalty.
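
To make these rules concrete, here is a minimal sketch of the per-claim scoring logic, assuming gold evidence sets and predictions are given as Python sets of element ids and that predictions have already been truncated to the limits above. The function names are illustrative; this is not the official scoring script.

        def evidence_found(gold_sets, predicted_evidence):
            # Correct evidence is found if at least one complete annotated
            # evidence set is fully contained in the prediction.
            return any(set(gold).issubset(predicted_evidence) for gold in gold_sets)

        def feverous_score(examples):
            # Accuracy points are only awarded when both the label is correct
            # and the correct evidence is found (including NEI instances).
            correct = sum(
                1
                for ex in examples
                if ex["predicted_label"] == ex["gold_label"]
                and evidence_found(ex["gold_evidence_sets"], ex["predicted_evidence"])
            )
            return correct / len(examples)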

Data Format

The data (Annotations and Wikipedia pages) are distributed in the JSONL format with one example per line (see http://jsonlines.org for more details). The data can be downloaded on the FEVEROUS Dataset page.

Training/Development Data format

The training and development data contain the following fields:

  • id: The ID of the sample
  • label: The annotated label for the claim. Can be one of SUPPORTS|REFUTES|NOT ENOUGH INFO.
  • claim: The text of the claim.
  • evidence: A list (at most three) of evidence sets. Each set is a dictionary with two fields (content, context).
    • content: A list of element ids serving as the evidence for the claim. Each element id is in the format "[PAGE ID]_[EVIDENCE TYPE]_[NUMBER ID]". [EVIDENCE TYPE] can be sentence, cell, header_cell, table_caption, item.
    • context: A dictionary that maps each element id in content to a set of Wikipedia elements that are automatically associated with that element id and serve as context. This includes the article's title, relevant section titles (the section and sub-section(s) the element is located in), and for cells the closest row and column header (multiple row/column headers if they follow each other).
  • annotator_operations: A list of operations an annotator used to find the evidence and reach a verdict, given the claim. Each element in the list is a dictionary with the fields (operation, value, time).
    • operation: Any of the following
      • start, finish: Annotation started/finished. The value is the name of the operation.
      • search: Annotator used the Wikipedia search function. The value is the entered search term or the term selected from the automatic suggestions. If the annotator did not select any of the suggestions but instead went into advanced search, the term is prefixed with "contains..."
      • hyperlink: Annotator clicked on a hyperlink in the page. The value is the anchor text of the hyperlink.
      • Now on: The page the annotator landed on after a search or a hyperlink click. The value is the PAGE ID.
      • Page search: Annotator searched within a page. The value is the search term.
      • page-search-reset: Annotator cleared the search box. The value is the name of the operation.
      • Highlighting, Highlighting deleted: Annotator selected/unselected an element on the page. The value is ELEMENT ID.
      • back-button-clicked: Annotator pressed the back button. The value is the name of the operation.
    • value: The value associated with the operation.
    • time: The time in seconds from the start of the annotation.
  • expected_challenge: The challenge the annotator who generated the claim expected a system to face when verifying it, one out of the following: Numerical Reasoning, Multi-hop Reasoning, Entity Disambiguation, Combining Tables and Text, Search terms not in claim, Other.
  • challenge: The main challenge to verify the claim, one out of the following: Numerical Reasoning, Multi-hop Reasoning, Entity Disambiguation, Combining Tables and Text, Search terms not in claim, Other.
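
As a quick illustration of this format, the sketch below reads a FEVEROUS JSONL split and prints the label and first evidence set of each claim. The file name train.jsonl is an assumption; use the actual file from the FEVEROUS Dataset page.

        import json

        def load_examples(path):
            # One annotated example per line, as described above.
            with open(path, encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)

        for example in load_examples("train.jsonl"):  # file name is an assumption
            first_set = example["evidence"][0]
            # Element ids look like "[PAGE ID]_[EVIDENCE TYPE]_[NUMBER ID]".
            print(example["id"], example["label"], first_set["content"])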

Below are two examples of the data structure.

SUPPORTS Example


        {
          "id": 33670,
          "label": "SUPPORTS",
          "claim": "Wolfgang Niedecken is a german rock musician who founded the Kölsch speaking rock group BAP at the end of the 1970s",
          "evidence":
            {
              "content": [""Wolfgang Niedecken_sentence_0", "Wolfgang Niedecken_cell_0_4_1", "Wolfgang Niedecken_sentence_1"]
              "context":
                {
                  "Wolfgang Niedecken_sentence_0": ["Wolfgang Niedecken_title"],
                  "Wolfgang Niedecken_cell_0_4_1":
                    [
                      "Wolfgang Niedecken_title", "Wolfgang Niedecken_header_cell_0_4_0", "Wolfgang Niedecken_header_cell_0_1_0", "Wolfgang Niedecken_header_cell_0_0_0"
                    ],
                  "Wolfgang Niedecken_sentence_1": ["Wolfgang Niedecken_title"]
                }
            }
          "annotator_operations":
            [
              {
                "operation": "start",
                "value": "start",
                "time": 0
              },
              {
                "operation": "search",
                "value": "Wolfgang Niedecken",
                "time": 12.654
              },
              {
                "operation": "Now on",
                "value": "Wolfgang Niedecken",
                "time": 13.547
              },
              {
                "operation": "Highlighting",
                "value": "Wolfgang Niedecken_sentence_0",
                "time": 20.926
              },
              ...
            ],
          "expected_challenge": "Combining Tables and Text",
          "challenge": "Combining Tables and Text"
        }
            

NOT ENOUGH INFO Example


        {
          "id": 35206,
          "label": "NOT ENOUGH INFO",
          "claim": "As of December 2020, the most expensive aircraft of the Korean Air fleet is the Boeing 777-300ER.",
          "evidence":
            {
              "content": ["Korean Air_cell_1_19_0", "Boeing 777_cell_0_11_1"]
              "context":
                {
                  "Korean Air_cell_1_19_0":
                      [
                      "Korean Air_title", "Korean Air_section_10", "Korean Air_section_11", "Korean Air_header_cell_1_0_0"
                      ],
                  "Boeing 777_cell_0_11_1":
                      [
                      "Boeing 777_title", "Boeing 777_header_cell_0_11_0", "Boeing 777_header_cell_0_0_0"
                      ]
                }
            }
          "annotator_operations":
            [
              {
                "operation": "start",
                "value": "start",
                "time": 0
              },
              {
                "operation": "search",
                "value": "Boeing 777-300ER",
                "time": 19.391
              },
              {
                "operation": "Now on",
                "value": "Boeing777",
                "time": 21.531
              },
              {
                "operation": "search",
                "value": "Korean Air fleet",
                "time": 62.33
              },
              ...
            ],
          "expected_challenge": "Numerical Reasoning",
          "challenge": "Multi-hop Reasoning"
        }
            

Wikipedia Data format

Each Wikipedia article contains 2 base fields:

  • title: The title of the Wikipedia article
  • order: A list of elements on the Wikipedia article in order of their appearance. Elements can be: section, table, list, sentence.

Each element specified in order appears as a field of its own. A sentence field contains the text of the sentence.

A section element is a dictionary with the following fields:

  • value: Section text
  • level: The level/depth of the section.

A table element is a dictionary with the following fields:

  • type: Whether the table is an infobox or a normal table
  • table: The content of the table. The table is specified as a list of lists. Each element in a list is a cell with the fields (id, value, is_header, row_span, column_span).
  • caption: Only specified if the table contains a caption.

A list element consists of the following fields:

  • type: Whether the list is an ordered or unordered list
  • list: A list of dictionaries with fields (id, value, level, type). level is the depth of the list item and increments with each nested list. type specifies the type of a nested list that starts directly after the item; the field is only present if the next item belongs to a nested list.

Hyperlinks in text are indicated with double square brackets. If an anchor text is provided, it is the text to the right of the vertical bar inside the square brackets; e.g. in [[Aare|the river Aare]], the link target is "Aare" and the anchor text is "the river Aare".
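
For instance, a small sketch for extracting link targets and anchor texts from such text (the regular expression is an assumption about the exact markup):

        import re

        LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

        def links(text):
            # Returns (target, anchor text) pairs; the anchor text defaults
            # to the target when no vertical bar is present.
            return [(m.group(1), m.group(2) or m.group(1)) for m in LINK.finditer(text)]

        print(links("This article is about a river in [[Switzerland]]."))
        # [('Switzerland', 'Switzerland')]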

Wikipedia Article Example

        {
          "title": "Aare", # Article title
          "order": [ "sentence_0", "table_0", "section_0", "sentence_1", "list_0" ],
          "sentence_0": "This article is about a river in [[Switzerland]].",
          "table_0":
            {
              "type": "infobox",
              "table":
                [ # Contents of the table
                  [ # Each row is encoded in a separate list
                    {
                      "id": "header_cell_0_0_0",
                      "value": "Location",
                      "is_header": true,
                      "row_span": 1,
                      "column_span": 1
                    },
                    {
                    ...
                    }
                  ],
                  [
                    {
                      "id": "cell_0_1_0",
                      "value": "Koblenz",
                      "is_header": false,
                      "row_span": 1,
                      "column_span": 1
                    },
                    ...
                  ]
                ]
            },
          "list_0":
            {
              "type": "unordered_list", # either unordered_list or ordered_list
              "list":
                [ # Contents of the list
                  {
                    "id": "item_0_0", # numbers indicate list and item, respectively
                    "value": ...,
                    "level": 0,
                    "type": "ordered_list"
                  },
                  {
                  ...
                  }
                ]
            },
          "section_0":
            {
              "value": "Course",
              "level": 1 # Level of section
            }
        }
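
Given an article in this format, a minimal sketch for walking a page in reading order might look as follows, assuming the article has already been parsed into a dict named page. The function is illustrative, not part of the FEVEROUS codebase.

        def iter_elements(page):
            # Walk the article in order of appearance, yielding (kind, text) pairs.
            for element_id in page["order"]:
                element = page[element_id]
                if element_id.startswith("sentence_"):
                    yield "sentence", element            # plain sentence text
                elif element_id.startswith("section_"):
                    yield "section", element["value"]    # section heading
                elif element_id.startswith("table_"):
                    for row in element["table"]:         # rows are lists of cell dicts
                        for cell in row:
                            yield "cell", cell["value"]
                elif element_id.startswith("list_"):
                    for item in element["list"]:
                        yield "item", item["value"]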