Hybrid Chinese text classification approach using general knowledge from Baidu Baike. Issue 4 (14th May 2016)
- Record Type:
- Journal Article
- Title:
- Hybrid Chinese text classification approach using general knowledge from Baidu Baike. Issue 4 (14th May 2016)
- Main Title:
- Hybrid Chinese text classification approach using general knowledge from Baidu Baike
- Authors:
- Ren, Fuji
Li, Chao
- Abstract:
- Most previous studies focused on enriching text representation to address the text classification (TC) task. However, conventional classification approaches that apply the vector space model (VSM) to Chinese text intensively study only the words and their relationships in a specific corpus/dataset, and ignore both the basic concept of each category and the general knowledge behind the words that people learn and use to recognize entities. This paper focuses on enriching text representation and proposes a novel approach that supplements Chinese TC with information from the online Chinese encyclopedia Baidu Baike. The similarities between every text and each category concept, together with the most related words from Baidu Baike, are added to the feature space. The performance of the proposed approach is measured on the Fudan University TC corpus, which is an imbalanced Chinese dataset. In the experiments, the proposed Baidu Baike-based concept similarity approach obtains promising results when compared with previous research and the conventional method, with macro-precision of 90.31%, recall of 75.45%, and F1 score of 80.32%, which are about 0.02%, 0.15%, and 0.12% higher, respectively, than the conventional method. The approach improves recall for some small categories while keeping precision at a high level and improving the macro F1 score. Moreover, the proposed approach has good expandability, so that many other knowledge bases could be integrated and many other concepts could be referenced to improve effectiveness. © 2016 Institute of Electrical Engineers of Japan. Published by John Wiley & Sons, Inc.
- Is Part Of:
- IEEJ transactions on electrical and electronic engineering. Volume 11:Issue 4(2016:Jul.)
- Journal:
- IEEJ transactions on electrical and electronic engineering
- Issue:
- Volume 11:Issue 4(2016:Jul.)
- Issue Display:
- Volume 11, Issue 4 (2016)
- Year:
- 2016
- Volume:
- 11
- Issue:
- 4
- Issue Sort Value:
- 2016-0011-0004-0000
- Page Start:
- 488
- Page End:
- 498
- Publication Date:
- 2016-05-14
- Subjects:
- Baidu Baike -- Chinese -- general knowledge -- support vector machine -- text classification -- text representation
Electrical engineering -- Periodicals
Electronics -- Periodicals
621.3
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/tee.22266
- Languages:
- English
- ISSNs:
- 1931-4973
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4363.240505
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 1752.xml
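
The abstract describes adding the similarity between each text and each category concept to the feature space. The following is a minimal sketch of that idea only, not the authors' implementation: the cosine measure, the function names, the English tokens, and the two toy concept descriptions are all assumptions made for illustration. Real Chinese text would first need word segmentation, and the concept descriptions would come from a knowledge base such as Baidu Baike.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Bag-of-words term-frequency vector, stored as a Counter."""
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def augment(doc_tokens, vocabulary, concept_tokens):
    """Return the VSM term features followed by one similarity per concept.

    concept_tokens maps a category name to the tokens of its description
    (hypothetical stand-ins for encyclopedia text).
    """
    doc_vec = term_vector(doc_tokens)
    # Conventional VSM part: one term-frequency feature per vocabulary word.
    base = [float(doc_vec[t]) for t in vocabulary]
    # Added part: similarity of the document to each category concept,
    # in sorted category order so the feature layout is stable.
    sims = [
        cosine(doc_vec, term_vector(concept_tokens[name]))
        for name in sorted(concept_tokens)
    ]
    return base + sims


# Toy data, assumed for illustration only.
vocabulary = ["market", "stock", "match", "team"]
concepts = {
    "economy": ["market", "stock", "trade"],
    "sports": ["match", "team", "athlete"],
}
features = augment(["stock", "market", "market"], vocabulary, concepts)
print(features)
```

The resulting vector has four term features followed by two concept-similarity features, and could be passed to any classifier that accepts dense vectors, such as the support vector machine listed among the record's subjects.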