Emerging trends: Subwords, seriously?. (May 2020)
- Record Type:
- Journal Article
- Title:
- Emerging trends: Subwords, seriously?. (May 2020)
- Main Title:
- Emerging trends: Subwords, seriously?
- Authors:
- Church, Kenneth Ward
- Abstract:
- Subwords have become very popular, but the BERT and ERNIE tokenizers often produce surprising results. Byte pair encoding (BPE) trains a dictionary with a simple information-theoretic criterion that sidesteps the need for special treatment of unknown words. BPE is more about training (populating a dictionary of word pieces) than inference (parsing an unknown word into word pieces). The parse at inference time can be ambiguous. Which parse should we use? For example, "electroneutral" can be parsed as electron-eu-tral or electro-neutral, and "bidirectional" can be parsed as bid-ire-ction-al or bi-directional. BERT and ERNIE tend to favor the parse with more word pieces. We propose minimizing the number of word pieces. To justify our proposal, a number of criteria will be considered: sound, meaning, etc. The prefix, bi-, has the desired vowel (unlike bid) and the desired meaning (bi is Latin for two, unlike bid, which is Germanic for offer).
- Is Part Of:
- Natural language engineering. Volume 26:Part 3(2020)
- Journal:
- Natural language engineering
- Issue:
- Volume 26:Part 3(2020)
- Issue Display:
- Volume 26, Issue 3, Part 3 (2020)
- Year:
- 2020
- Volume:
- 26
- Issue:
- 3
- Part:
- 3
- Issue Sort Value:
- 2020-0026-0003-0003
- Page Start:
- 375
- Page End:
- 382
- Publication Date:
- 2020-05
- Subjects:
- Subwords -- Word pieces -- Tokenization -- Morphology -- Etymology
- Natural language processing (Computer science) -- Periodicals
- Software engineering -- Periodicals
- 006.35
- Journal URLs:
- http://journals.cambridge.org/action/displayJournal?jid=NLE
- DOI:
- 10.1017/S1351324920000145
- Languages:
- English
- ISSNs:
- 1351-3249
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library HMNTS - ELD Digital store
- Ingest File:
- 14652.xml
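
The abstract's proposal, choosing the parse with the fewest word pieces, can be illustrated with a short dynamic-programming sketch. This is not the article's code: the function names and the toy vocabulary are hypothetical, built only from the two examples the abstract gives. The greedy longest-match-first function is a simplified stand-in for how a tokenizer can end up with the longer parse; real WordPiece vocabularies also mark continuation pieces, which is ignored here.

```python
from typing import List, Optional, Set

# Hypothetical toy vocabulary, assembled from the abstract's two examples.
TOY_VOCAB: Set[str] = {
    "bid", "ire", "ction", "al", "bi", "directional",
    "electron", "eu", "tral", "electro", "neutral",
}


def fewest_pieces(word: str, vocab: Set[str]) -> Optional[List[str]]:
    """Segment `word` into the fewest vocabulary pieces (None if impossible)."""
    n = len(word)
    # best[i] is a shortest segmentation of word[:i], or None if none exists.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prev = best[start]
            piece = word[start:end]
            if prev is None or piece not in vocab:
                continue
            current = best[end]
            if current is None or len(prev) + 1 < len(current):
                best[end] = prev + [piece]
    return best[n]


def greedy_longest_first(word: str, vocab: Set[str]) -> Optional[List[str]]:
    """Simplified baseline: repeatedly take the longest matching prefix."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            return None  # no vocabulary piece matches at this position
    return pieces


if __name__ == "__main__":
    for w in ("bidirectional", "electroneutral"):
        print(w, greedy_longest_first(w, TOY_VOCAB), fewest_pieces(w, TOY_VOCAB))
```

With this toy vocabulary, the greedy baseline commits to "bid" and "electron" because they are the longest matching prefixes, while the minimum-piece search recovers bi-directional and electro-neutral.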