Gain more with less: Extracting information from business documents with small data. (1st April 2023)
- Record Type:
- Journal Article
- Title:
- Gain more with less: Extracting information from business documents with small data. (1st April 2023)
- Main Title:
- Gain more with less: Extracting information from business documents with small data
- Authors:
- Nguyen, Minh-Tien
Son, Nguyen Hong
Linh, Le Thai
- Abstract:
- Information extraction (IE) is a vital step of digitization that reduces paperwork in offices. However, the adaptation of common IE systems to actual business cases faces two issues. First, the number of training samples is small (i.e. 100–200 examples). Second, span extraction models based on question answering formulation require a long time for training and inference. To overcome these issues, we introduce a new query-based model for the extraction of information from business documents. For data limitation, the model employs transfer learning which adapts the knowledge of pre-trained language models (i.e. BERT) to specific domains. To do that, we design a new CNN layer for the adaptation of the model to specific domains. For the speed, different from the encoding of normal span extraction methods (BERT-QA), the proposed model encodes short tags and context documents in two channels in parallel, which speeds up training and inference time. Information from short tags is fused with context documents learned from CNN by using attention to predict start and end positions of extracted spans. Promising results on five domain-specific datasets in English and Japanese indicate that the proposed model produces high-quality outputs and can be applied for business scenarios.
Highlights: A practical information extraction model for business cases is proposed. FastQA+CNN achieves the best results in terms of F-scores and speed on five datasets. Deep analysis on several aspects of the model. Separately encoding short tags and the context speeds up the training and inference. …
- Is Part Of:
- Expert systems with applications. Volume 215(2023)
- Journal:
- Expert systems with applications
- Issue:
- Volume 215(2023)
- Issue Display:
- Volume 215, Issue 2023 (2023)
- Year:
- 2023
- Volume:
- 215
- Issue:
- 2023
- Issue Sort Value:
- 2023-0215-2023-0000
- Page Start:
- Page End:
- Publication Date:
- 2023-04-01
- Subjects:
- Information extraction -- Transfer learning -- Transformers
Expert systems (Computer science) -- Periodicals
Systèmes experts (Informatique) -- Périodiques
Electronic journals
006.33
- Journal URLs:
- http://www.sciencedirect.com/science/journal/09574174
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.eswa.2022.119274
- Languages:
- English
- ISSNs:
- 0957-4174
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3842.004220
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 25105.xml