FEVEROUS (Fact Extraction and VERification Over Unstructured and Structured information) is a fact verification dataset which consists of 87,026 verified claims. Each claim is annotated with evidence in the form of sentences and/or cells from tables in Wikipedia, as well as a label indicating whether this evidence supports, refutes, or does not provide enough information to reach a verdict. The dataset also contains annotation metadata such as annotator actions (query keywords, clicks on page, time signatures), and the type of challenge each claim poses.
FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information
Rami Aly, Zhijiang Guo, Michael Sejr Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, Arpit Mittal
@inproceedings{Aly21Feverous,
    author = {Aly, Rami and Guo, Zhijiang and Schlichtkrull, Michael Sejr and Thorne, James and Vlachos, Andreas and Christodoulopoulos, Christos and Cocarascu, Oana and Mittal, Arpit},
    title = {{FEVEROUS}: Fact Extraction and {VERification} Over Unstructured and Structured information},
    booktitle = {Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1)},
    year = {2021},
    url = {https://openreview.net/forum?id=h-flVCIlstW}
}
You can also cite the dataset directly using its DOI: https://doi.org/10.5281/zenodo.4911508
The data (Annotations and Wikipedia pages) are distributed in the JSONL format with one example per line (see https://jsonlines.org for more details).
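As a minimal sketch of how such a file could be read, the Python snippet below parses one JSON object per line and skips blank lines. The helper names `iter_jsonl` and `read_jsonl` and the file name in the usage comment are illustrative assumptions, not part of the dataset release.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)


def read_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Stream the examples of a JSONL file without loading it all into memory."""
    with open(path, encoding="utf-8") as handle:
        yield from iter_jsonl(handle)


# Usage (hypothetical file name):
# for example in read_jsonl("train.jsonl"):
#     print(example["id"], example["label"])
```

Streaming line by line keeps memory use flat, which may matter for the Wikipedia pages file.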
The training and development data contain six fields:
id: The ID of the sample.
label: The annotated label for the claim. Can be one of SUPPORTS|REFUTES|NOT ENOUGH INFO.
claim: The text of the claim.
evidence: A list (at maximum three) of evidence sets. Each set is a dictionary with two fields (content, context).
    content: A list of element ids serving as the evidence for the claim. Each element id is in the format "[PAGE ID]_[EVIDENCE TYPE]_[NUMBER ID]". [EVIDENCE TYPE] can be sentence, cell, header_cell, table_caption, item.
    context: A dictionary that maps each element id in content to a set of Wikipedia elements that are automatically associated with that element id and serve as context. This includes an article's title, relevant sections (the section and sub-section(s) the element is located in), and for cells the closest row and column header (multiple row/column headers if they follow each other).
challenge: The main challenge to verify the claim, one out of the following: Numerical Reasoning, Multi-hop Reasoning, Entity Disambiguation, Combining Tables and Text, Search terms not in claim, Other.
annotator_operations: A list of operations an annotator used to find the evidence and reach a verdict, given the claim. Each element in the list is a dictionary with the fields (operation, value, time).
    operation: Any of the following:
        start, finish: Annotation started/finished. The value is the name of the operation.
        search: Annotator used the Wikipedia search function. The value is the entered search term or the term selected from the automatic suggestions. If the annotator did not select any of the suggestions but instead went into advanced search, the term is prefixed with "contains...".
        hyperlink: Annotator clicked on a hyperlink in the page. The value is the anchor text of the hyperlink.
        Now on: The page the annotator landed on after a search or a hyperlink click. The value is the PAGE ID.
        Page search: Annotator searched on a page. The value is the search term.
        page-search-reset: Annotator cleared the search box. The value is the name of the operation.
        Highlighting, Highlighting deleted: Annotator selected/unselected an element on the page. The value is the ELEMENT ID.
        back-button-clicked: Annotator pressed the back button. The value is the name of the operation.
    value: The value associated with the operation.
    time: The time in seconds from the start of the annotation.
Below are two examples of the data structure.
{
    "id": 33670,
    "label": "SUPPORTS",
    "claim": "Wolfgang Niedecken is a german rock musician who founded the Kölsch speaking rock group BAP at the end of the 1970s",
    "evidence": [
        {
            "content": ["Wolfgang Niedecken_sentence_0", "Wolfgang Niedecken_cell_0_4_1", "Wolfgang Niedecken_sentence_1"],
            "context": {
                "Wolfgang Niedecken_sentence_0": ["Wolfgang Niedecken_title"],
                "Wolfgang Niedecken_cell_0_4_1": [
                    "Wolfgang Niedecken_title", "Wolfgang Niedecken_header_cell_0_4_0", "Wolfgang Niedecken_header_cell_0_1_0", "Wolfgang Niedecken_header_cell_0_0_0"
                ],
                "Wolfgang Niedecken_sentence_1": ["Wolfgang Niedecken_title"]
            }
        }
    ],
    "annotator_operations": [
        {
            "operation": "start",
            "value": "start",
            "time": 0
        },
        {
            "operation": "search",
            "value": "Wolfgang Niedecken",
            "time": 12.654
        },
        {
            "operation": "Now on",
            "value": "Wolfgang Niedecken",
            "time": 13.547
        },
        {
            "operation": "Highlighting",
            "value": "Wolfgang Niedecken_sentence_0",
            "time": 20.926
        },
        ...
    ],
    "challenge": "Combining Tables and Text"
}
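The element ids in the example above follow the "[PAGE ID]_[EVIDENCE TYPE]_[NUMBER ID]" format, so they can be split back into their parts. The sketch below is one possible way to do this in Python, not an official parser. It assumes that `evidence` is a list of sets as the field description states, that `title` and `section` ids (seen in the context lists) follow the same pattern, and that page titles do not themselves contain a substring such as `_cell_`. The names `ElementId`, `parse_element_id`, and `evidence_pages` are hypothetical.

```python
import re
from typing import Any, Dict, NamedTuple, Set, Tuple

# Evidence types from the field description, plus "title" and "section",
# which appear in the context lists of the examples (an assumption).
_TYPES = "header_cell|table_caption|sentence|cell|item|section|title"
_ID_RE = re.compile(rf"^(?P<page>.+?)_(?P<type>{_TYPES})(?:_(?P<nums>\d+(?:_\d+)*))?$")


class ElementId(NamedTuple):
    page: str                 # the PAGE ID, e.g. "Wolfgang Niedecken"
    type: str                 # e.g. "sentence", "cell", "header_cell"
    numbers: Tuple[int, ...]  # the NUMBER ID parts, left uninterpreted


def parse_element_id(element_id: str) -> ElementId:
    """Split an element id into page, element type, and numeric parts."""
    match = _ID_RE.match(element_id)
    if match is None:
        raise ValueError(f"unrecognised element id: {element_id!r}")
    nums = match.group("nums")
    numbers = tuple(int(n) for n in nums.split("_")) if nums else ()
    return ElementId(match.group("page"), match.group("type"), numbers)


def evidence_pages(example: Dict[str, Any]) -> Set[str]:
    """Collect the pages referenced by all evidence sets of one example."""
    return {
        parse_element_id(element_id).page
        for evidence_set in example["evidence"]
        for element_id in evidence_set["content"]
    }
```

The non-greedy page match is a deliberate choice: it stops at the first recognised type, so `header_cell` ids are not misread as `cell` ids on a page whose name ends in `_header`.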
{
    "id": 35206,
    "label": "NOT ENOUGH INFO",
    "claim": "As of December 2020, the most expensive aircraft of the Korean Air fleet is the Boeing 777-300ER.",
    "evidence": [
        {
            "content": ["Korean Air_cell_1_19_0", "Boeing 777_cell_0_11_1"],
            "context": {
                "Korean Air_cell_1_19_0": [
                    "Korean Air_title", "Korean Air_section_10", "Korean Air_section_11", "Korean Air_header_cell_1_0_0"
                ],
                "Boeing 777_cell_0_11_1": [
                    "Boeing 777_title", "Boeing 777_header_cell_0_11_0", "Boeing 777_header_cell_0_0_0"
                ]
            }
        }
    ],
    "annotator_operations": [
        {
            "operation": "start",
            "value": "start",
            "time": 0
        },
        {
            "operation": "search",
            "value": "Boeing 777-300ER",
            "time": 19.391
        },
        {
            "operation": "Now on",
            "value": "Boeing777",
            "time": 21.531
        },
        {
            "operation": "search",
            "value": "Korean Air fleet",
            "time": 62.33
        },
        ...
    ],
    "challenge": "Multi-hop Reasoning"
}
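The annotator_operations lists in the two examples can be condensed into a short summary of what the annotator did. The following Python sketch is only an illustration under the field descriptions given earlier: it counts operations, collects search terms and visited pages, and replays Highlighting and Highlighting deleted to recover the final selection. The function name `summarize_operations` and the keys of the returned dictionary are assumptions made for this example.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_operations(operations: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarise one annotator_operations list."""
    counts = Counter(op["operation"] for op in operations)

    # Replay selections: "Highlighting" adds an element,
    # "Highlighting deleted" removes it again.
    selected: List[str] = []
    for op in operations:
        if op["operation"] == "Highlighting" and op["value"] not in selected:
            selected.append(op["value"])
        elif op["operation"] == "Highlighting deleted" and op["value"] in selected:
            selected.remove(op["value"])

    return {
        "counts": dict(counts),
        "searches": [op["value"] for op in operations if op["operation"] == "search"],
        "pages_visited": [op["value"] for op in operations if op["operation"] == "Now on"],
        "selected_elements": selected,
        # time is measured from the start of the annotation,
        # so the largest value is the elapsed time covered by the log.
        "elapsed_seconds": max((op["time"] for op in operations), default=0),
    }
```

Because each time value is relative to the start of the annotation, the gap between two consecutive operations can be obtained by subtracting their time values.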
Each Wikipedia article contains two base fields:
title: The title of the Wikipedia article.
order: A list of the elements on the Wikipedia article in order of their appearance. Elements can be: section, table, list, sentence.
Each element specified in order is itself a field of the article. A sentence field contains the text of the sentence.
A section element is a dictionary with the following fields:
value: Section text.
level: The level/depth of the section.
A table element is a dictionary with the following fields:
type: Whether the table is an infobox or a normal table.
table: The content of the table. The table is specified as a list of lists, one list per row. Each element in a row is a cell with the fields (id, value, is_header, row_span, column_span).
caption: Only specified if the table contains a caption.
A list element consists of the following fields:
type: Whether the list is an ordered or unordered list.
list: A list of dictionaries, with fields being (id, value, level, type).
    level: The depth of the list item. The level increments with each nested list.
    type: The type of a nested list that starts after the item specifying the type. The field is only specified if the next item is in a nested list.
Hyperlinks in text are indicated with double square brackets. If an anchor text is provided, it is the text on the right-hand side of a vertical bar in the square brackets.
{
    "title": "Aare", # Article title
    "order": ["sentence_0", "table_0", "section_0", "sentence_1", "list_0"],
    "sentence_0": "This article is about a river in [[Switzerland]]",
    "table_0": {
        "type": "infobox",
        "table": [ # Contents of the table
            [ # Each row is encoded in a separate list
                {
                    "id": "header_cell_0_0_0",
                    "value": "Location",
                    "is_header": true,
                    "row_span": 1,
                    "column_span": 1
                },
                {
                    ...
                }
            ],
            [
                {
                    "id": "cell_0_1_0",
                    "value": "Koblenz",
                    "is_header": false,
                    "row_span": 1,
                    "column_span": 1
                },
                ...
            ]
        ]
    },
    "list_0": {
        "type": "unordered_list", # either unordered_list or ordered_list
        "list": [ # Contents of the list
            {
                "id": "item_0_0", # numbers indicate list, and item, respectively
                "value": ...,
                "level": 0,
                "type": "ordered_list"
            },
            {
                ...
            }
        ]
    },
    "section_0": {
        "value": "Course",
        "level": 1 # Level of section
    }
}
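A page in this layout can be walked by following its order list and looking up each named element. The Python sketch below shows one way this might be done, assuming the page has already been parsed into a dictionary and that element names always have the form "<kind>_<number>" as in the example. The helper names `iter_elements`, `table_cells`, and `extract_links` are hypothetical, and the link pattern only covers the double-bracket convention described above.

```python
import re
from typing import Any, Dict, Iterator, List, Tuple

# [[target]] or [[target|anchor text]]; the anchor is the part right of the bar.
_LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def iter_elements(page: Dict[str, Any]) -> Iterator[Tuple[str, str, Any]]:
    """Yield (name, kind, element) in order of appearance on the page."""
    for name in page["order"]:
        kind = name.rsplit("_", 1)[0]  # "table_0" -> "table"
        yield name, kind, page[name]


def table_cells(table_element: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every cell of a table element, row by row."""
    for row in table_element["table"]:
        yield from row


def extract_links(text: str) -> List[Tuple[str, str]]:
    """Return (target, anchor) pairs; the anchor defaults to the target."""
    return [(m.group(1), m.group(2) or m.group(1)) for m in _LINK_RE.finditer(text)]
```

For instance, the header cells of a page's first table could be listed with `[c["value"] for c in table_cells(page["table_0"]) if c["is_header"]]`.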