Schedule

9:00-9:05 Opening
FEVER Organizers
9:05-9:50 Investigating Datasets
Christo Buschek, Der SPIEGEL
9:50-10:10 Shared Task Overview
The 2nd Automated Verification of Textual Claims (AVeriTeC) Shared Task: Open-weights, Reproducible and Efficient Systems
Mubashara Akhtar, Rami Aly, Yulong Chen, Zhenyun Deng, Michael Schlichtkrull, Chenxi Whitehouse and Andreas Vlachos
10:10-10:30 Contributed Shared Task Talks
AIC CTU@FEVER 8: On-premise fact checking through long context RAG
Herbert Ullrich and Jan Drchal
Exploring Semantic Filtering Heuristics For Efficient Claim Verification
Max Upravitelev, Premtim Sahitaj, Arthur Hilbert, Veronika Solopova, Jing Yang, Nils Feldhus, Tatiana Anikina, Simon Ostermann and Vera Schmitt
10:30-11:00 Morning Break
11:00-12:00 Poster Session
Automated Claim–Evidence Extraction for Political Discourse Analysis: A Large Language Model Approach to Rodong Sinmun Editorials
Gyuri Choi and Hansaem Kim
Language Model Re-rankers are Fooled by Lexical Similarities
Lovisa Hagström, Ercong Nie, Ruben Halifa, Helmut Schmid, Richard Johansson and Alexander Junge
Portuguese Automated Fact-checking: Information Retrieval with Claim Extraction
Juliana Gomes, Eduardo Garcia and Arlindo Rodrigues Galvão Filho
Multilingual Symptom Detection on Social Media: Enhancing Health-related Fact-checking with LLMs
Saidah Zahrotul Jannah, Elyanah Aco, Shaowen Peng, Shoko Wakamiya and Eiji Aramaki
When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification
Hanna Shcharbakova, Tatiana Anikina, Natalia Skachkova and Josef Van Genabith
Less Can be More: An Empirical Evaluation of Small and Large Language Models for Sentence-level Claim Detection
Andrew Bell
RAG based Question Answering of Korean Laws and Precedents
Kiho Seo and Takehito Utsuro
FACT5: A Novel Benchmark and Pipeline for Nuanced Fact-Checking of Complex Statements
Shayan Chowdhury, Sunny Fang and Smaranda Muresan
Correcting Hallucinations in News Summaries: Exploration of Self-Correcting LLM Methods with External Knowledge
Juraj Vladika, Ihsan Soydemir and Florian Matthes
The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination
Yuji Zhang, Sha Li, Cheng Qian, Jiateng Liu, Pengfei Yu, Chi Han, Yi R. Fung, Kathleen McKeown, ChengXiang Zhai, Manling Li and Heng Ji
GQC: LLM-Based Grouped QA Consolidation for Open-Domain Fact Verification at AVeriTeC
Dongzhuoran Zhou, Roxana Pop, Yuqicheng Zhu and Evgeny Kharlamov
(Fact) Check Your Bias
Eivind Morris Bakke and Nora Winger Heggelund
EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions (Online Gather Poster Gallery)
Spencer Hong, Meng Luo and Xinyi Wan
SemQA: Evaluating Evidence with Question Embeddings and Answer Entailment for Fact Verification
Kjetil Indrehus, Caroline Vannebo and Roxana Pop
Team HUMANE at AVeriTeC 2025: HerO 2 for Efficient Fact Verification
Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon and Kunwoo Park
Exploring Semantic Filtering Heuristics For Efficient Claim Verification
Max Upravitelev, Premtim Sahitaj, Arthur Hilbert, Veronika Solopova, Jing Yang, Nils Feldhus, Tatiana Anikina, Simon Ostermann and Vera Schmitt
OldJoe at AVeriTeC: In-context learning for fact-checking
Farah Ftouhi, Russel Dsouza, Lance Calvin Lim Gamboa, Asim Abbas, Mubashir Ali, Yue Feng, Mark G. Lee and Venelin Kovatchev
SANCTUARY: An Efficient Evidence-Based Automated Fact Checking System
Arbaaz Dharmavaram and Saqib Hakak
Fathom: A Fast and Modular RAG Pipeline for Fact-Checking
Farrukh Bin Rashid and Saqib Hakak
Graph-of-Thoughts for Fact-Checking with Large Language Models
Sascha Rolinger and Jin Liu
AIC CTU@FEVER 8: On-premise fact checking through long context RAG
Herbert Ullrich and Jan Drchal
Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion
Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon and Kunwoo Park
Assessing the Reasoning Capabilities of LLMs in the context of Evidence-based Claim Verification
John Dougrez-Lewis, Mahmud Elahi Akhter, Federico Ruggeri, Sebastian Löbbers, Yulan He and Maria Liakata
Show Me the Work: Fact-Checkers’ Requirements for Explainable Automated Fact-Checking
Greta Warren, Irina Shklovski and Isabelle Augenstein
Structured Discourse Representation for Factual Consistency Verification
Kun Zhang, Oana Balalau and Ioana Manolescu
Evaluating LLMs’ Assessment of Mixed-Context Hallucination Through the Lens of Summarization
Siya Qi, Rui Cao, Yulan He and Zheng Yuan
When Claims Evolve: Evaluating and Enhancing the Robustness of Embedding Models Against Misinformation Edits
Jabez Magomere, Emanuele La Malfa, Manuel Tonneau, Ashkan Kazemi and Scott Hale
12:00-12:45 Clarity from Complexity: Automated Reasoning to Navigate Conflicting Scientific Evidence and Misleading Claims
Yufang Hou, Interdisciplinary Transformation University Austria
12:45-14:00 Lunch Break
14:00-14:45 Thoughts You Can Trust? Evaluating the Faithfulness of Model-Generated Explanations and Their Effects on Human Performance
Oana-Maria Camburu, Imperial College London
14:45-15:30 Hallucination in LLMs: Current Advances and Future Frontiers
Leyang Cui, Tencent
15:30-16:00 Afternoon Break
16:00-16:30 Contributed Workshop Talks
When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification
Hanna Shcharbakova, Tatiana Anikina, Natalia Skachkova and Josef Van Genabith
FACT5: A Novel Benchmark and Pipeline for Nuanced Fact-Checking of Complex Statements
Shayan Chowdhury, Sunny Fang and Smaranda Muresan
The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination
Yuji Zhang, Sha Li, Cheng Qian, Jiateng Liu, Pengfei Yu, Chi Han, Yi R. Fung, Kathleen McKeown, ChengXiang Zhai, Manling Li and Heng Ji
16:30-17:15 AI-Mediated Systems for Collective Human Decision Making
Michiel Bakker, Massachusetts Institute of Technology
17:15-17:30 Closing Remarks
FEVER Organizers

Invited Talks

Investigating Datasets
Christo Buschek

We have internalized the phrase “AI is a black box.” But we need to do better if we want to hold AI systems and the companies building them accountable. When we examine more closely the datasets used to train these incredibly complex machines, we come to understand the models they power and learn about the emerging effects of algorithmic systems. Datasets show that AI is not a black box, but rather an assemblage of various technical artifacts and processes. As in all algorithmic systems, unexpected behavior emerges. AI is the result of choices and values, and a product of the culture in which it originates. People make AI.



Clarity from Complexity: Automated Reasoning to Navigate Conflicting Scientific Evidence and Misleading Claims
Yufang Hou

The exponential growth of scholarly literature poses a substantial challenge for researchers seeking to stay current with the latest findings and synthesize knowledge effectively. This challenge is further exacerbated by the proliferation of misinformation and the increasing complexity of scientific data. In this talk, I will first present our work on identifying and reconstructing fallacies in misrepresented scientific findings. Next, I will discuss our recent studies on supporting experts in synthesizing biomedical research findings. Finally, I will discuss several open research challenges in modeling and reasoning over scholarly documents and their public communication.



Thoughts You Can Trust? Evaluating the Faithfulness of Model-Generated Explanations and Their Effects on Human Performance
Oana-Maria Camburu

Large Language Models (LLMs) can readily generate natural language explanations, or chain-of-thoughts (CoTs), to justify their outputs. In this talk, I will first introduce methods for evaluating whether such explanations faithfully reflect the decision-making processes of the models that produce them. Second, I will present the results of a user study involving 85 clinicians and medical students diagnosing chest X-rays. The study compares the effectiveness of natural language explanations, saliency maps, and their combination in supporting clinical decision-making.



Hallucination in LLMs: Current Advances and Future Frontiers
Leyang Cui

A critical concern is the propensity of LLMs to generate hallucinations—outputs that deviate from user inputs, contradict prior context, or conflict with established world knowledge. This phenomenon poses significant challenges to the reliability and safe deployment of LLMs in real-world applications. In this talk, we will systematically explore the nature of LLM hallucinations, beginning with a clear definition of the phenomenon. We then discuss methodologies for evaluating hallucinations, identifying key metrics and benchmarks. Additionally, we examine the underlying sources of hallucination. We highlight state-of-the-art mitigation techniques aimed at reducing hallucination while maintaining model performance. Finally, we discuss the relationship between reasoning capabilities and hallucination, offering insights into how improved reasoning may help mitigate hallucination.



AI-Mediated Systems for Collective Human Decision Making
Michiel Bakker

As AI systems become more powerful, there is an urgent need for new systems and institutions that support collective human oversight and decision making. We present a set of AI-mediated systems that combine human input with large language models to help communities deliberate, resolve disagreements, and make accurate collective judgments. These systems scale democratic processes through model-assisted deliberation, collective fact-checking and oversight of AI systems. First, we introduce The Habermas Machine, which refines group statements based on participants’ opinions and critiques. In a preregistered study (N = 5,734), participants preferred AI-mediated statements over those written by human mediators and often updated their own views. These results were replicated in a demographically representative citizens’ assembly. Second, we extend the approach to Community Notes-style fact-checking systems, using LLMs and reward models to synthesize diverse viewpoints into helpful explanatory notes. Finally, we propose methods to align incentives: LLM aggregators use crowd-submitted rationales to form predictions and distribute rewards based on marginal value. Together, these systems demonstrate how LLMs can support scalable, pluralistic oversight in high-stakes domains.



Workshop Organising Committee

Mubashara Akhtar, King's College London
Rami Aly, University of Cambridge
Rui Cao, University of Cambridge
Yulong Chen, University of Cambridge
Christos Christodoulopoulos, Amazon
Oana Cocarascu, King's College London
Zhenyun Deng, University of Cambridge
Zifeng Ding, University of Cambridge
Zhijiang Guo, HKUST (GZ)
Arpit Mittal, Meta
James Thorne, KAIST AI
Chenxi Whitehouse, Meta
Michael Schlichtkrull, Queen Mary University of London
Andreas Vlachos, University of Cambridge