2021 Shared Task

Final Leaderboard

Rank | Team                 | FEVEROUS score | Accuracy | Evidence F1 | Evidence Precision | Evidence Recall
   1 | Bust a move!         | 0.2701         | 0.5607   | 0.1308      | 0.0773             | 0.4258
   2 | Papelo               | 0.2592         | 0.5757   | 0.1187      | 0.0716             | 0.3460
   3 | NCU                  | 0.2514         | 0.5229   | 0.1581      | 0.0991             | 0.3907
   4 | Z team               | 0.2251         | 0.4901   | 0.1312      | 0.0776             | 0.4264
   5 | EURECOM_Fever        | 0.2001         | 0.4779   | 0.1952      | 0.1373             | 0.3373
   6 | FEVEROUS Baseline    | 0.1773         | 0.4548   | 0.1503      | 0.1017             | 0.2878
   7 | Saturday_Night_Fever | 0.1763         | 0.4804   | 0.1618      | 0.1122             | 0.2900
   8 | Martin Funkquist     | 0.1261         | 0.4302   | 0.1045      | 0.0642             | 0.2789
   9 | Albatross            | 0.1159         | 0.4035   | 0.0963      | 0.0644             | 0.1902
  10 | METUIS               | 0.0636         | 0.3897   | 0.0634      | 0.0462             | 0.1011
  11 | ChaCha               | 0.0389         | 0.4194   | 0.0398      | 0.0251             | 0.0969
  12 | seda_kaist           | 0.0362         | 0.4140   | 0.0384      | 0.0242             | 0.0920
  13 | qmul_uou_iiith       | 0.0223         | 0.3999   | 0.0282      | 0.0245             | 0.0330

Key Dates

  • Challenge Launch: 20 May 2021
  • Training Data Release: 7 June 2021
  • Testing Begins: 24 July 2021
  • Submission Closes: 27 July 2021
  • Results Announced: 30 July 2021
  • System Descriptions Due for Workshop: 8 August 2021
  • Winners Announced: 10 November 2021 (4th FEVER Workshop)

Task Definition

The FEVEROUS challenge aims to evaluate the ability of a system to verify information using unstructured and structured evidence from Wikipedia.

  • Given a factual claim involving one or more entities, the system must extract evidence from sentences, table cells, table captions, and/or list items that support or refute the claim.
  • Using this evidence, label the claim as Supported or Refuted given the evidence, or as Not Enough Info (NEI) if there isn't sufficient evidence to either support or refute it.
  • A claim's evidence may consist of multiple sentences, table cells, or list items, as well as a combination of these that provides the stated label only when examined together.
    • For a given piece of evidence, there is associated context that can be used. This includes the article's title, section titles (the section and sub-section(s) the evidence is located in), and, for cells, the closest row and column header (if the element just before the closest row/column header is also a header, it is included in the context as well). The context was automatically selected during the annotation process (see the Data Format section) and can be generated for any Wikipedia element using the context generation snippet located in the README file of the FEVEROUS repository; a minimal illustrative sketch follows this list.
  • Evidence can be located in any section of a Wikipedia article.
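
As an illustration of the cell context rule above, here is a minimal sketch for collecting the closest row header of a cell. This is not the repository's actual snippet; the function name and the cell fields are assumptions based on the table encoding described under Data Format below.

        def closest_row_headers(row, col_index):
            """Collect the closest row header to the left of the cell at col_index.

            row is a list of cell dicts with "id" and "is_header" fields, as in
            the Wikipedia table encoding described in the Data Format section.
            If the element just before the closest header is also a header, it
            is included too (illustrative only, not the official logic).
            """
            headers = []
            for i in range(col_index - 1, -1, -1):
                if row[i]["is_header"]:
                    headers.append(row[i]["id"])
                    if i > 0 and row[i - 1]["is_header"]:
                        headers.append(row[i - 1]["id"])
                    break
            return headers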

To learn more about the task and our baseline implementation, read our paper FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information.

Submission

System predictions should be submitted to our EvalAI challenge page. Before the release of the testing data, you can submit your predictions on the development split to become familiar with the submission system. When submitting system predictions, you need to specify the system name and, if available, a link to the code. We will use the team name you specified on EvalAI when we compile the final results of the challenge. You can find more details on the submission page itself.

NB: Participants are allowed a limited number of submissions per system: multiple submissions are possible, but only the final one will be scored/counted.

System Description Paper

You may submit a system description paper, describing the system's method, how it has been trained, the evaluation, and possibly an error analysis to understand strengths and weaknesses of the proposed system. The system description paper must be submitted as a PDF, consisting of a maximum of eight pages (for most description papers four to six pages will be sufficient) of content plus unlimited pages for bibliography. Submissions must follow the EMNLP 2021 two-column format, using the LaTeX style files or Word templates or the Overleaf template from the official EMNLP website. Please submit your system description papers here.

NB: System description papers are reviewed in a single-blind review process. Thus, your manuscript may contain the authors' names and information that would reveal your identity (e.g. team name, score, and rank at the shared task). Also note that at least one author of the system description paper will have to register as a reviewer for the FEVER Workshop.

Baseline system

The implementation of the baseline system can be found in our GitHub repository.

For the technical details of the implementation as well as the Baseline performance, please refer to the FEVEROUS paper.

Scoring

The FEVEROUS scoring is built on the FEVER scorer. The scoring script can be found on the FEVEROUS Dataset page.

  • We will only award points for accuracy if the correct evidence is found.
  • For a claim, we consider the correct evidence to be found if at least one complete set of annotated evidence (any combination of sentences, table cells, table captions, list items) is returned (the annotated data may contain multiple sets of evidence, each of which is sufficient to support or refute a claim).
  • The scorer will produce other diagnostic scores (F1, macro-precision, macro-recall and accuracy). These will not be considered for the competition other than to rank two submissions with equal FEVEROUS scores.

For the FEVEROUS score, the following changes are made to the FEVER scorer:

  • NEI instances are now treated equally to Supported/Refuted. Thus, accuracy points are only awarded for NEI instances if the correct evidence is found. Correct evidence for NEI instances consists of the most relevant pieces of information found on Wikipedia to either support or refute the claim.
  • Only the first 5 predicted sentences/table captions/list items and the first 25 predicted table cells will be considered for scoring. Additional evidence will be discarded without penalty.
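
To make these rules concrete, here is a minimal sketch of the per-claim scoring logic, assuming gold evidence sets and predictions are given as Python sets of element ids and that predictions have already been truncated to the limits above. The function names are illustrative; this is not the official scoring script.

        def evidence_found(gold_sets, predicted_evidence):
            # Correct evidence is found if at least one complete annotated
            # evidence set is fully contained in the prediction.
            return any(set(gold).issubset(predicted_evidence) for gold in gold_sets)

        def feverous_score(examples):
            # Accuracy points are only awarded when both the label is correct
            # and the correct evidence is found (including NEI instances).
            correct = sum(
                1
                for ex in examples
                if ex["predicted_label"] == ex["gold_label"]
                and evidence_found(ex["gold_evidence_sets"], ex["predicted_evidence"])
            )
            return correct / len(examples)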

Data Format

The data (Annotations and Wikipedia pages) are distributed in the JSONL format with one example per line (see http://jsonlines.org for more details). The data can be downloaded on the FEVEROUS Dataset page.

Training/Development Data format

The training and development data contain the following fields:

  • id: The ID of the sample
  • label: The annotated label for the claim. Can be one of SUPPORTS|REFUTES|NOT ENOUGH INFO.
  • claim: The text of the claim.
  • evidence: A list (at most three) of evidence sets. Each set is a dictionary with two fields (content, context).
    • content: A list of element ids serving as the evidence for the claim. Each element id is in the format "[PAGE ID]_[EVIDENCE TYPE]_[NUMBER ID]". [EVIDENCE TYPE] can be sentence, cell, header_cell, table_caption, item.
    • context: A dictionary that maps each element id in content to a set of Wikipedia elements that are automatically associated with that element id and serve as context. This includes the article's title, relevant section titles (the section and sub-section(s) the element is located in), and for cells the closest row and column header (multiple row/column headers if they follow each other).
  • annotator_operations: A list of operations an annotator used to find the evidence and reach a verdict, given the claim. Each element in the list is a dictionary with the fields (operation, value, time).
    • operation: Any of the following
      • start, finish: Annotation started/finished. The value is the name of the operation.
      • search: Annotator used the Wikipedia search function. The value is the entered search term or the term selected from the automatic suggestions. If the annotator did not select any of the suggestions but instead went into advanced search, the term is prefixed with "contains..."
      • hyperlink: Annotator clicked on a hyperlink in the page. The value is the anchor text of the hyperlink.
      • Now on: The page the annotator landed on after a search or a hyperlink click. The value is the PAGE ID.
      • Page search: Annotator searched within a page. The value is the search term.
      • page-search-reset: Annotator cleared the search box. The value is the name of the operation.
      • Highlighting, Highlighting deleted: Annotator selected/unselected an element on the page. The value is ELEMENT ID.
      • back-button-clicked: Annotator pressed the back button. The value is the name of the operation.
    • value: The value associated with the operation.
    • time: The time in seconds from the start of the annotation.
  • expected_challenge: The challenge the annotator who generated the claim expected a system to face when verifying it, one out of the following: Numerical Reasoning, Multi-hop Reasoning, Entity Disambiguation, Combining Tables and Text, Search terms not in claim, Other.
  • challenge: The main challenge to verify the claim, one out of the following: Numerical Reasoning, Multi-hop Reasoning, Entity Disambiguation, Combining Tables and Text, Search terms not in claim, Other.
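
As a quick illustration of this format, the sketch below reads a FEVEROUS JSONL split and prints the label and first evidence set of each claim. The file name train.jsonl is an assumption; use the actual file from the FEVEROUS Dataset page.

        import json

        def load_examples(path):
            # One annotated example per line, as described above.
            with open(path, encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)

        for example in load_examples("train.jsonl"):  # file name is an assumption
            first_set = example["evidence"][0]
            # Element ids look like "[PAGE ID]_[EVIDENCE TYPE]_[NUMBER ID]".
            print(example["id"], example["label"], first_set["content"])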

Below are two examples of the data structure.

SUPPORTS Example


        {
          "id": 33670,
          "label": "SUPPORTS",
          "claim": "Wolfgang Niedecken is a german rock musician who founded the Kölsch speaking rock group BAP at the end of the 1970s",
          "evidence":
            {
              "content": [""Wolfgang Niedecken_sentence_0", "Wolfgang Niedecken_cell_0_4_1", "Wolfgang Niedecken_sentence_1"]
              "context":
                {
                  "Wolfgang Niedecken_sentence_0": ["Wolfgang Niedecken_title"],
                  "Wolfgang Niedecken_cell_0_4_1":
                    [
                      "Wolfgang Niedecken_title", "Wolfgang Niedecken_header_cell_0_4_0", "Wolfgang Niedecken_header_cell_0_1_0", "Wolfgang Niedecken_header_cell_0_0_0"
                    ],
                  "Wolfgang Niedecken_sentence_1": ["Wolfgang Niedecken_title"]
                }
            }
          "annotator_operations":
            [
              {
                "operation": "start",
                "value": "start",
                "time": 0
              },
              {
                "operation": "search",
                "value": "Wolfgang Niedecken",
                "time": 12.654
              },
              {
                "operation": "Now on",
                "value": "Wolfgang Niedecken",
                "time": 13.547
              },
              {
                "operation": "Highlighting",
                "value": "Wolfgang Niedecken_sentence_0",
                "time": 20.926
              },
              ...
            ],
          "expected_challenge": "Combining Tables and Text",
          "challenge": "Combining Tables and Text"
        }
            

NOT ENOUGH INFO Example


        {
          "id": 35206,
          "label": "NOT ENOUGH INFO",
          "claim": "As of December 2020, the most expensive aircraft of the Korean Air fleet is the Boeing 777-300ER.",
          "evidence":
            {
              "content": ["Korean Air_cell_1_19_0", "Boeing 777_cell_0_11_1"]
              "context":
                {
                  "Korean Air_cell_1_19_0":
                      [
                      "Korean Air_title", "Korean Air_section_10", "Korean Air_section_11", "Korean Air_header_cell_1_0_0"
                      ],
                  "Boeing 777_cell_0_11_1":
                      [
                      "Boeing 777_title", "Boeing 777_header_cell_0_11_0", "Boeing 777_header_cell_0_0_0"
                      ]
                }
            }
          "annotator_operations":
            [
              {
                "operation": "start",
                "value": "start",
                "time": 0
              },
              {
                "operation": "search",
                "value": "Boeing 777-300ER",
                "time": 19.391
              },
              {
                "operation": "Now on",
                "value": "Boeing777",
                "time": 21.531
              },
              {
                "operation": "search",
                "value": "Korean Air fleet",
                "time": 62.33
              },
              ...
            ],
          "expected_challenge": "Numerical Reasoning",
          "challenge": "Multi-hop Reasoning"
        }
            

Wikipedia Data format

Each Wikipedia article contains 2 base fields:

  • title: The title of the Wikipedia article
  • order: A list of elements on the Wikipedia article in order of their appearance. Elements can be: section, table, list, sentence.

Each element specified in order appears as a field of its own. A sentence field contains the text of the sentence.

A section element is a dictionary with the following fields:

  • value: Section text
  • level: The level/depth of the section.

A table element is a dictionary with the following fields:

  • type: Whether the table is an infobox or a normal table
  • table: The content of the table. The table is specified as a list of lists. Each element in a list is a cell with the fields (id, value, is_header, row_span, column_span).
  • caption: Only specified if the table contains a caption.

A list element consists of the following fields:

  • type: Whether the list is an ordered or unordered list
  • list: A list of dictionaries with fields (id, value, level, type). level is the depth of the list item and increments with each nested list. type specifies the type of a nested list that starts directly after the item; the field is only present if the next item belongs to a nested list.

Hyperlinks in text are indicated with double square brackets. If an anchor text is provided, it is the text to the right of the vertical bar inside the square brackets; e.g. in [[Aare|the river Aare]], the link target is "Aare" and the anchor text is "the river Aare".
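
For instance, a small sketch for extracting link targets and anchor texts from such text (the regular expression is an assumption about the exact markup):

        import re

        LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")

        def links(text):
            # Returns (target, anchor text) pairs; the anchor text defaults
            # to the target when no vertical bar is present.
            return [(m.group(1), m.group(2) or m.group(1)) for m in LINK.finditer(text)]

        print(links("This article is about a river in [[Switzerland]]."))
        # [('Switzerland', 'Switzerland')]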

Wikipedia Article Example

        {
          "title": "Aare", # Article title
          "order": [ "sentence_0", "table_0", "section_0", "sentence_1", "list_0" ],
          "sentence_0": "This article is about a river in [[Switzerland]].",
          "table_0":
            {
              "type": "infobox",
              "table":
                [ # Contents of the table
                  [ # Each row is encoded in a separate list
                    {
                      "id": "header_cell_0_0_0",
                      "value": "Location",
                      "is_header": true,
                      "row_span": 1,
                      "column_span": 1
                    },
                    {
                    ...
                    }
                  ],
                  [
                    {
                      "id": "cell_0_1_0",
                      "value": "Koblenz",
                      "is_header": false,
                      "row_span": 1,
                      "column_span": 1
                    },
                    ...
                  ]
                ]
            },
          "list_0":
            {
              "type": "unordered_list", # either unordered_list or ordered_list
              "list":
                [ # Contents of the list
                  {
                    "id": "item_0_0", # numbers indicate list and item, respectively
                    "value": ...,
                    "level": 0,
                    "type": "ordered_list"
                  },
                  {
                  ...
                  }
                ]
            },
          "section_0":
            {
              "value": "Course",
              "level": 1 # Level of section
            }
        }
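
Given an article in this format, a minimal sketch for walking a page in reading order might look as follows, assuming the article has already been parsed into a dict named page. The function is illustrative, not part of the FEVEROUS codebase.

        def iter_elements(page):
            # Walk the article in order of appearance, yielding (kind, text) pairs.
            for element_id in page["order"]:
                element = page[element_id]
                if element_id.startswith("sentence_"):
                    yield "sentence", element            # plain sentence text
                elif element_id.startswith("section_"):
                    yield "section", element["value"]    # section heading
                elif element_id.startswith("table_"):
                    for row in element["table"]:         # rows are lists of cell dicts
                        for cell in row:
                            yield "cell", cell["value"]
                elif element_id.startswith("list_"):
                    for item in element["list"]:
                        yield "item", item["value"]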