Evaluation of text representation schemes and distance measures for authorship linking. (2nd June 2018)

Record Type:: Journal Article
Title:: Evaluation of text representation schemes and distance measures for authorship linking. (2nd June 2018)
Main Title:: Evaluation of text representation schemes and distance measures for authorship linking
Authors:: Kocher, Mirco
Savoy, Jacques
Abstract:: Based on n text excerpts, the authorship linking task is to determine a way to link pairs of documents written by the same person together. This problem is closely related to authorship attribution questions, and its solution can be used in the author clustering task. However, no training information is provided and the solution must be unsupervised. To achieve this, various text representation strategies can be applied, such as characters, punctuation symbols, or letter n-grams as well as words, lemmas, Part-Of-Speech (POS) tags, and sequences of them. To estimate the stylistic distance (or similarity) between two text excerpts, different measures have been suggested based on the L1 norm (e.g. Manhattan, Tanimoto), the L2 norm (e.g. Matusita), the inner product (e.g. Cosine), or the entropy paradigm (e.g. Jeffrey divergence). From those possible implementations, it is not clear which text representation and distance functions produce the best performance, and this study provides an answer to this question. Three corpora, extracted from French and English literature, have been evaluated using standard methodology. Moreover, we suggest an additional performance measure called high precision (HPrec) capable of judging the quality of a ranked list of links to provide only correct answers. No systematic difference can be found between token- or lemma-based text representations. Simple POS tags do not provide an effective solution but short sequences of them form a …
Is Part Of:: Digital scholarship in the humanities. Volume 34:Number 1(2019)
Journal:: Digital scholarship in the humanities
Issue:: Volume 34:Number 1(2019)
Issue Display:: Volume 34, Issue 1 (2019)
Year:: 2019
Volume:: 34
Issue:: 1
Issue Sort Value:: 2019-0034-0001-0000
Page Start:: 189
Page End:: 207
Publication Date:: 2018-06-02
Subjects:: Philology -- Data processing -- Periodicals
Computational linguistics -- Periodicals
410.285
Journal URLs:: http://www.oxfordjournals.org/
http://dsh.oxfordjournals.org/
DOI:: 10.1093/llc/fqy013
Languages:: English
ISSNs:: 2055-768X
Deposit Type:: Legal deposit
View Content:: Available online (eLD content is only available in our Reading Rooms)
Physical Locations:: British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
Ingest File:: 11977.xml
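
Illustrative note: the abstract names several families of distance measures (Manhattan, Tanimoto, Matusita, Cosine, Jeffrey divergence) applied to text representations such as letter n-grams, and describes ranking candidate links between document pairs. The following is a minimal sketch of that general idea, not the authors' implementation. The formulas are common textbook variants and may differ from the ones evaluated in the article; all function names (`ngram_profile`, `rank_links`, and so on), the lowercasing step, and the default n = 3 are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def ngram_profile(text, n=3):
    """Relative frequencies of overlapping character n-grams (assumed representation)."""
    text = text.lower()
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:  # text shorter than n characters
        return {}
    return {gram: c / total for gram, c in counts.items()}


def _aligned(p, q):
    """Return both profiles as vectors over the union of their n-grams."""
    keys = sorted(set(p) | set(q))
    return [p.get(k, 0.0) for k in keys], [q.get(k, 0.0) for k in keys]


def manhattan(p, q):
    a, b = _aligned(p, q)
    return sum(abs(x - y) for x, y in zip(a, b))


def tanimoto(p, q):
    # Assumed variant: L1 difference normalised by the sum of element-wise maxima.
    a, b = _aligned(p, q)
    denom = sum(max(x, y) for x, y in zip(a, b))
    return sum(abs(x - y) for x, y in zip(a, b)) / denom if denom else 0.0


def matusita(p, q):
    a, b = _aligned(p, q)
    return math.sqrt(sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(a, b)))


def cosine_distance(p, q):
    a, b = _aligned(p, q)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def jeffrey(p, q):
    # Assumed variant: each profile is compared with the mean profile m,
    # with the convention 0 * log(0) = 0.
    a, b = _aligned(p, q)
    total = 0.0
    for x, y in zip(a, b):
        m = (x + y) / 2
        if x > 0:
            total += x * math.log(x / m)
        if y > 0:
            total += y * math.log(y / m)
    return total


def rank_links(docs, distance=manhattan, n=3):
    """Rank all document pairs by increasing distance (smallest = most likely link)."""
    profiles = {name: ngram_profile(text, n) for name, text in docs.items()}
    pairs = combinations(sorted(profiles), 2)
    return sorted((distance(profiles[a], profiles[b]), a, b) for a, b in pairs)
```

A call such as `rank_links({"a": text_a, "b": text_b, "c": text_c}, distance=matusita)` returns a list of `(distance, name, name)` tuples with the closest pair first. Because the setting is unsupervised, a sketch like this only orders the candidate links; deciding how far down the list to go is a separate question.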