Evaluation of text representation schemes and distance measures for authorship linking. (2nd June 2018)
- Record Type:
- Journal Article
- Title:
- Evaluation of text representation schemes and distance measures for authorship linking. (2nd June 2018)
- Main Title:
- Evaluation of text representation schemes and distance measures for authorship linking
- Authors:
- Kocher, Mirco
Savoy, Jacques - Abstract:
- Abstract: Based on n text excerpts, the authorship linking task is to determine a way to link pairs of documents written by the same person together. This problem is closely related to authorship attribution questions, and its solution can be used in the author clustering task. However, no training information is provided and the solution must be unsupervised. To achieve this, various text representation strategies can be applied, such as characters, punctuation symbols, or letter n -grams as well as words, lemmas, Part-Of-Speech (POS) tags, and sequences of them. To estimate the stylistic distance (or similarity) between two text excerpts, different measures have been suggested based on the L 1 norm (e.g. Manhattan, Tanimoto), the L 2 norm (e.g. Matusita), the inner product (e.g. Cosine), or the entropy paradigm (e.g. Jeffrey divergence). From those possible implementations, it is not clear which text representation and distance functions produce the best performance, and this study provides an answer to this question. Three corpora, extracted from French and English literature, have been evaluated using standard methodology. Moreover, we suggest an additional performance measure called high precision (HPrec) capable of judging the quality of a ranked list of links to provide only correct answers. No systematic difference can be found between token- or lemma-based text representations. Simple POS tags do not provide an effective solution but short sequences of them form aAbstract: Based on n text excerpts, the authorship linking task is to determine a way to link pairs of documents written by the same person together. This problem is closely related to authorship attribution questions, and its solution can be used in the author clustering task. However, no training information is provided and the solution must be unsupervised. To achieve this, various text representation strategies can be applied, such as characters, punctuation symbols, or letter n -grams as well as words, lemmas, Part-Of-Speech (POS) tags, and sequences of them. To estimate the stylistic distance (or similarity) between two text excerpts, different measures have been suggested based on the L 1 norm (e.g. Manhattan, Tanimoto), the L 2 norm (e.g. Matusita), the inner product (e.g. Cosine), or the entropy paradigm (e.g. Jeffrey divergence). From those possible implementations, it is not clear which text representation and distance functions produce the best performance, and this study provides an answer to this question. Three corpora, extracted from French and English literature, have been evaluated using standard methodology. Moreover, we suggest an additional performance measure called high precision (HPrec) capable of judging the quality of a ranked list of links to provide only correct answers. No systematic difference can be found between token- or lemma-based text representations. Simple POS tags do not provide an effective solution but short sequences of them form a good text representation. Letter n -grams (with n = 4–6) give high HPrec rates. As distance measures, this study found that the Tanimoto, Matusita, and Clark distance measures perform better than the often-used Cosine function. Finally, applying a pruning procedure (e.g. culling terms appearing once or twice or limiting the vocabulary to the 500 most frequent words) reduces the representation complexity and might even improve the effectiveness of the attribution scheme. … (more)
- Is Part Of:
- Digital scholarship in the humanties. Volume 34:Number 1(2019)
- Journal:
- Digital scholarship in the humanties
- Issue:
- Volume 34:Number 1(2019)
- Issue Display:
- Volume 34, Issue 1 (2019)
- Year:
- 2019
- Volume:
- 34
- Issue:
- 1
- Issue Sort Value:
- 2019-0034-0001-0000
- Page Start:
- 189
- Page End:
- 207
- Publication Date:
- 2018-06-02
- Subjects:
- Philology -- Data processing -- Periodicals
Computational linguistics -- Periodicals
410.285 - Journal URLs:
- http://www.oxfordjournals.org/ ↗
http://dsh.oxfordjournals.org/ ↗ - DOI:
- 10.1093/llc/fqy013 ↗
- Languages:
- English
- ISSNs:
- 2055-768X
- Deposit Type:
- Legaldeposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms) ↗
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store - Ingest File:
- 11977.xml