LILLIE: Information extraction and database integration using linguistics and learning-based algorithms. Issue 105 (March 2022)
- Record Type:
- Journal Article
- Title:
- LILLIE: Information extraction and database integration using linguistics and learning-based algorithms. Issue 105 (March 2022)
- Main Title:
- LILLIE: Information extraction and database integration using linguistics and learning-based algorithms
- Authors:
- Smith, Ellery
Papadopoulos, Dimitris
Braschler, Martin
Stockinger, Kurt
- Abstract:
- Querying both structured and unstructured data via a single common query interface such as SQL or natural language has been a long-standing research goal. Moreover, as methods for extracting information from unstructured data become ever more powerful, the desire to integrate the output of such extraction processes with "clean", structured data grows. We are convinced that for successful integration into databases, such extracted information in the form of "triples" needs to be both (1) of high quality and (2) have the necessary generality to link up with varying forms of structured data. It is the combination of both these aspects, which heretofore have been usually treated in isolation, where our approach breaks new ground. The cornerstone of our work is a novel, generic method for extracting open information triples from unstructured text, using a combination of linguistics and learning-based extraction methods, thus uniquely balancing both precision and recall. Our system, called LILLIE (LInked Linguistics and Learning-Based Information Extractor), uses dependency tree modification rules to refine triples from a high-recall learning-based engine, and combines them with syntactic triples from a high-precision engine to increase effectiveness. In addition, our system features several augmentations, which modify the generality and the degree of granularity of the output triples. Even though our focus is on addressing both quality and generality simultaneously, our new method substantially outperforms current state-of-the-art systems on the two widely-used CaRB and Re-OIE16 benchmark sets for information extraction. We have made our code publicly available to facilitate further research.
Highlights: A novel, generic method to extract open information triples from unstructured text. Substantially outperforms state-of-the-art systems on CaRB and Re-OIE16 benchmarks. Combines linguistics and learning-based methods to balance both precision and recall. Refines triples with dependency tree rules from a high-recall learning-based engine. Includes several augmentations to modify the generality and granularity of triples.
- Is Part Of:
- Information systems. Issue 105(2022)
- Journal:
- Information systems
- Issue:
- Issue 105(2022)
- Issue Display:
- Volume 105, Issue 105 (2022)
- Year:
- 2022
- Volume:
- 105
- Issue:
- 105
- Issue Sort Value:
- 2022-0105-0105-0000
- Page Start:
- Page End:
- Publication Date:
- 2022-03
- Subjects:
- Information extraction -- Data integration -- Machine learning for database systems
Database management -- Periodicals
Electronic data processing -- Periodicals
Bases de données -- Gestion -- Périodiques
Informatique -- Périodiques
Database management
Electronic data processing
Periodicals
005.7
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064379
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.is.2021.101938
- Languages:
- English
- ISSNs:
- 0306-4379
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4496.367300
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 20306.xml