Open Information Extraction (OIE or OpenIE) is a key task in Natural Language Processing (NLP) that aims to extract structured information from natural language, in tuple format, without relying on pre-specified relations. OIE has various applications such as text summarization, question answering, knowledge graph construction, and more.1 OIE datasets play a critical role in advancing research in this area by providing annotated data for training and evaluation. In this literature review, we focus on OIE datasets and their characteristics. Specifically, we examine five datasets: OIE2016, CaRB, WiRe57, BenchIE, and DocOIE. We cover several subtopics, including challenges with evaluation metrics, recent dataset improvements, and advanced applications such as coreference resolution and inference. We aim to illuminate the strengths and weaknesses of OIE datasets and how those factors may influence the field, to further the development of more accurate and effective OIE systems.
Introduction
First, we provide an overview of the task of triple and tuple extraction. A triple consists of a subject, predicate, and object. For example, in the sentence “John works for Google,” the subject is “John,” the predicate is “works for,” and the object is “Google”; this could be written as (John; works for; Google). In OpenIE, the predicate is commonly called a relation and the subject and object are called arguments, and there may be more than two arguments, for example (John; worked for; Google; in 2011). A tuple is the n-ary generalization of a triple (the nomenclature comes from mathematics). OIE is called open information extraction because it uses an open schema, i.e. it does not rely on predefined relations and instead uses phrases drawn from the original text.
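To make the terminology concrete, the sketch below shows one simple way an extraction could be represented in code. The `Extraction` class and its fields are illustrative assumptions, not a structure mandated by any of the surveyed datasets.

```python
# Illustrative representation of an OpenIE extraction; not taken from any
# of the surveyed datasets or their tooling.
from dataclasses import dataclass

@dataclass
class Extraction:
    relation: str     # the predicate, e.g. "works for"
    arguments: list   # subject first, then object and any additional arguments

    def as_tuple(self):
        # Conventional (subject; relation; object; ...) ordering
        return (self.arguments[0], self.relation, *self.arguments[1:])

triple = Extraction("works for", ["John", "Google"])
n_ary = Extraction("worked for", ["John", "Google", "in 2011"])
print(triple.as_tuple())  # ('John', 'works for', 'Google')
print(n_ary.as_tuple())   # ('John', 'worked for', 'Google', 'in 2011')
```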
The first large-scale dataset for Open Information Extraction was the OIE2016 dataset.2 It used automatic rules to convert the QA-SRL (Question Answering-Semantic Role Labeling) dataset to tuples, covering WSJ and Wikipedia sources. Later work3 found many poor annotations in OIE2016 and attributed them to the automated conversion. Its authors introduced CaRB, a new OpenIE dataset created by giving 1,282 sentences from OIE2016, along with annotation guidelines, to crowd-workers on Amazon Mechanical Turk. The authors then annotated 50 sentences themselves and found that the matching score between this expert baseline and the crowd-annotated CaRB data is higher than with the OIE2016 data, likely indicating better quality, which can be verified qualitatively (see Table 1 of Bhardwaj et al.).
The literature has articulated some key principles for OpenIE datasets: extractions should be asserted by the sentence, informative, and minimal/atomic, while collectively being exhaustive/complete, covering all the information in the sentence.4 Others have noted that these preferences may change based on the downstream NLP task.5
Challenges with Evaluation Metrics
The phrase “you get what you measure” is apt for the field of machine learning, where we build systems that mimic the statistical distribution of the data in order to maximize an evaluation metric on a particular dataset, so the metric is almost as important as the dataset itself. Both OIE2016 and CaRB have highly flawed evaluation metrics based on simple token matching. The clearest illustration of the problem was provided by the creators of the WiRe57 dataset6, who built a system dubbed Munchkin that simply outputs permutations of the original sentence. The creators of the BenchIE dataset7 likewise built a baseline that treats each verb as a predicate, with the preceding part of the sentence as the subject argument and the remainder as the object argument. Both systems outperform real OpenIE systems when using the CaRB evaluation scorer, and the problem is likely exacerbated on OIE2016.
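A rough sketch of such a verb-splitting baseline is shown below. The helper name `naive_extractions` and the toy verb list standing in for a real POS tagger are my own assumptions for illustration; this is not the BenchIE authors' implementation.

```python
# Sketch of the naive baseline described above: every verb becomes a
# predicate, everything before it the subject, everything after it the object.
# A toy verb list stands in for proper POS tagging, purely for illustration.
TOY_VERBS = {"is", "are", "was", "works", "worked", "generate", "selected"}

def naive_extractions(sentence):
    tokens = sentence.split()
    extractions = []
    for i, token in enumerate(tokens):
        if token.lower() in TOY_VERBS:  # stand-in for a real POS tagger
            subject = " ".join(tokens[:i])
            obj = " ".join(tokens[i + 1:])
            extractions.append((subject, token, obj))
    return extractions

print(naive_extractions("John works for Google"))
# [('John', 'works', 'for Google')]
```

Because the subject and object together reproduce the whole sentence, such extractions score well under token-overlap metrics despite carrying no real structure.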
The OIE2016 lexical match rewards long extractions very similar to the input sentence and does not take into account the ordered structure of the tuples. It works as follows: for each word in the gold triple, for each equal word in the evaluation triple, increment a counter, then declare a match if the counter exceeds 25% of the length of the gold triple. One can see this is easily exploited. For example, repeating the entire sentence multiple times ensures that every word in the gold triple (since gold words come from the sentence) is matched multiple times, more often than it occurs in the gold triple, pushing the lexical match score above 100%. CaRB changes the scoring algorithm to use multi-match, allowing a gold tuple to match multiple system extractions, and uses token matching at the tuple rather than sentence level. While this better captures the positional/structural information of a tuple, it still encourages long extractions.
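The sketch below implements the matching rule as just described; the function name, threshold handling, and tokenization are assumptions for illustration rather than the exact OIE2016 scorer code.

```python
# Minimal sketch of the lexical-overlap matching described above;
# not the official OIE2016 scorer.

def lexical_match(gold_tuple, predicted_tuple, threshold=0.25):
    """For each gold token, count every equal token in the prediction,
    then declare a match if the count exceeds a fraction of the gold length."""
    gold_tokens = " ".join(gold_tuple).split()
    pred_tokens = " ".join(predicted_tuple).split()

    matches = 0
    for g in gold_tokens:
        for p in pred_tokens:
            if g == p:
                matches += 1  # duplicates in the prediction count again

    return matches > threshold * len(gold_tokens)

# Repeating the sentence inflates `matches` without adding any information:
gold = ("John", "works for", "Google")
spam = ("John works for Google John works for Google", "", "")
print(lexical_match(gold, spam))  # True, despite the junk extraction
```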
Recent Dataset Improvements
Recognizing the same problem, the BenchIE and WiRe57 authors try different remedies. WiRe57 computes precision and recall scores for each part of the tuple, i.e. subject, predicate, and object, then sums them and normalizes by total length. BenchIE enumerates all possible representations of a given fact, groups these into a “fact synset,” and then computes precision and recall based on the number of distinct matched facts (where at least one representation has an exact match with the system’s output). When a system extracts the same fact multiple times, previous metrics would reward it. BenchIE is neutral to such behavior as long as the repeated extractions fall within some synset. In WiRe57, however, it is penalized both by the denominator of the precision metric, which measures the total length of the predicted tuples, and by the 1-1 relation between predicted and reference tuples enforced by greedy matching. Given the robustness of the WiRe57 evaluation metric, it would be interesting to apply it to other datasets such as CaRB, OIE2016, or DocOIE.
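A minimal sketch of per-slot token scoring in the spirit of WiRe57 follows. Tokenization, greedy 1-1 tuple alignment, and optional-token handling are simplified, and the function names are my own, so this should not be read as the official scorer.

```python
# Per-slot token precision/recall in the spirit of WiRe57 (simplified sketch).
from collections import Counter

def part_overlap(gold_part, pred_part):
    """Number of tokens shared by one slot (subject, relation, or object)."""
    gold_counts = Counter(gold_part.split())
    pred_counts = Counter(pred_part.split())
    return sum(min(gold_counts[t], pred_counts[t]) for t in gold_counts)

def tuple_scores(gold, pred):
    """Token precision/recall for one aligned (gold, predicted) tuple pair."""
    matched = sum(part_overlap(g, p) for g, p in zip(gold, pred))
    pred_len = sum(len(p.split()) for p in pred)
    gold_len = sum(len(g.split()) for g in gold)
    precision = matched / pred_len if pred_len else 0.0
    recall = matched / gold_len if gold_len else 0.0
    return precision, recall

gold = ("John", "works for", "Google")
pred = ("John works for Google", "works for", "Google")  # overly long subject
print(tuple_scores(gold, pred))  # precision drops, penalizing the long subject
```

Because the predicted tuple's total length sits in the precision denominator, padding an extraction with extra sentence tokens lowers its score instead of raising it.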
However, the BenchIE authors do identify cases where their approach is better. Where an extraction is factually incorrect or not implied by the sentence but has high lexical overlap with the gold triple, any fuzzy metric, including the WiRe57 approach, will fail; the same holds when the extraction is correct but has low lexical overlap. The first case is the more common one, while in the second case the hope is that the relevant words in the gold triples have been marked as optional (which WiRe57 does but CaRB does not), yielding a high overlap anyway.
Both BenchIE and WiRe57 report F1 scores in the range of 0.2-0.35 for state-of-the-art OpenIE systems, much lower than the range of 0.3-0.6 on CaRB. Since BenchIE is a subset of CaRB sentences with stricter and less flawed evaluation, and because the two benchmarks, built from different source texts, yield F1 scores in the same range, this indicates that performance on the OpenIE task is significantly inflated by previous benchmarks and that there is much room for improvement.
Advanced Applications: Coreference Resolution and Inference
These two benchmarks may also enable more advanced capabilities. Coreference resolution is the task of finding mentions of a given entity in a text, for example resolving “she” to “Marta.” Both benchmarks include optional coreference resolution tuples, which could be used to evaluate future systems that handle that task. WiRe57 also has optional light predicate inference tuples, e.g. inferring (Tokyo; [is]; [a prefecture]) from “Tokyo … is the capital city of Japan and one of its 47 prefectures.” These benchmarks will likely remain relevant as systems develop such capabilities.
DocOIE is the largest “expert-annotated” dataset, containing 800 sentences.8 However, the annotations appear to be of lower quality, although this may be due to the complexity of the source patent documents. For example, given the sentence “Alert icon can be selected by the user at the remote station to generate an alert indicator,” they annotate (alert icon selected by…; is to generate; an alert indicator), whereas (alert icon; generate; an alert indicator) would be better. This illustrates a pattern in which “is” is prepended to predicates containing prepositions, producing extractions that do not read naturally.
A key differentiator of the DocOIE dataset is its coreference resolution data. In section 3.2 the authors write, “To gain an accurate interpretation of a sentence, the annotator needs to read a few surrounding sentences or even the entire document for relevant contexts.” However, the two previously discussed datasets already include basic sentence-level coreference resolution. The authors of DocOIE could track how many resolutions are within the sentence versus elsewhere in the document (and even how far away the antecedent sentence is). This would support their case for document-based annotation beyond the two qualitative examples provided in their introduction. Finally, the evaluation of common OpenIE systems on their dataset is limited by their use of the CaRB scorer, whose flaws have been outlined above.
Conclusion
There are many ambiguities and human biases in creating an evaluation dataset, which perhaps reflects the underlying difficulty or entropy of the OpenIE task. These challenges are best understood through the guidelines given to annotators. OIE2016 and DocOIE do not describe their annotation policies; CaRB provides a brief overview of its principles; BenchIE and WiRe57 publish detailed annotation guidelines with examples.
These datasets have varying purposes. CaRB and DocOIE are the largest datasets, with fair quality but less robust scoring methods. BenchIE and WiRe57 are smaller, expert-annotated datasets that are well suited to careful evaluation and to developing new OpenIE capabilities. None of the datasets surveyed are of sufficient size to train neural-network-based OpenIE systems, which instead train on extractions from previous OpenIE systems, resulting in a performance barrier known as the “bootstrapping problem.”9 However, these datasets distill the best practices of what an OpenIE system should be, and help us clearly evaluate new approaches and move in the right direction.
Footnotes
1. Zhou et al., “A Survey on Neural Open Information Extraction.”
2. Stanovsky and Dagan, “Creating a Large Benchmark for Open Information Extraction.”
3. Bhardwaj, Aggarwal, and Mausam, “CaRB.”
4. Stanovsky and Dagan, “Creating a Large Benchmark for Open Information Extraction”; Lechelle, Gotti, and Langlais, “WiRe57.”
5. Gashteovski et al., “BenchIE.”
6. Lechelle, Gotti, and Langlais, “WiRe57.”
7. Gashteovski et al., “BenchIE.”
8. Dong et al., “DocOIE.”
9. Zhou et al., “A Survey on Neural Open Information Extraction.”