Cross-lingual text alignment for fine-grained plagiarism detection. (August 2019)
- Record Type:
- Journal Article
- Title:
- Cross-lingual text alignment for fine-grained plagiarism detection. (August 2019)
- Main Title:
- Cross-lingual text alignment for fine-grained plagiarism detection
- Authors:
- Ehsan, Nava
Shakery, Azadeh
Tompa, Frank Wm
- Abstract:
- Fast and easy access to a wide range of documents in various languages, in conjunction with the wide availability of translation and editing tools, has led to the need to develop effective tools for detecting cross-lingual plagiarism. Given a suspicious document, cross-lingual plagiarism detection comprises two main subtasks: retrieving documents that are candidate sources for that document and analysing those candidates one by one to determine their similarity to the suspicious document. In this article, we examine the second subtask, also called the detailed analysis subtask, where the goal is to align plagiarised fragments from source and suspicious documents in different languages. Our proposed approach has two main steps: the first step tries to find candidate plagiarised fragments and focuses on high recall, followed by a more precise similarity analysis based on dynamic text alignment that will filter the results by finding alignments between the identified fragments. With these two steps, the proximity of the terms will be considered at different levels of granularity. In both steps, our approach uses a dictionary to obtain translations of individual terms instead of using a machine translation system to convert longer passages from one language to another. We used a weighting scheme to distinguish between multiple translations of the terms. Experimental results show that our method outperforms the methods used by the systems that achieved the best results in the PAN-2012 and PAN-2014 competitions.
- Is Part Of:
- Journal of information science. Volume 45:Number 4(2019)
- Journal:
- Journal of information science
- Issue:
- Volume 45:Number 4(2019)
- Issue Display:
- Volume 45, Issue 4 (2019)
- Year:
- 2019
- Volume:
- 45
- Issue:
- 4
- Issue Sort Value:
- 2019-0045-0004-0000
- Page Start:
- 443
- Page End:
- 459
- Publication Date:
- 2019-08
- Subjects:
- Cross-lingual -- dictionary -- document analysis -- plagiarism detection -- proximity -- text alignment
Information science -- Periodicals
Information science
Periodicals
020.5
- Journal URLs:
- http://jis.sagepub.com/archive/
http://www.ingenta.com/journals/browse/bks/jis?mode=direct
http://www.uk.sagepub.com/home.nav
http://firstsearch.oclc.org
http://firstsearch.oclc.org/journal=0165-5515;screen=info;ECOIP
- DOI:
- 10.1177/0165551518787696
- Languages:
- English
- ISSNs:
- 0165-5515
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 11452.xml
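
The two-step approach summarised in the abstract (a recall-oriented search for candidate fragments, followed by a more precise alignment, both using dictionary translations of individual terms with weights) can be illustrated with a small sketch. This is not the authors' implementation: the toy dictionary, its weights, the function names, the gap penalty and the use of a Smith-Waterman-style local alignment for the second step are all assumptions made here for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: source term -> {target term: translation weight}.
# Entries and weights are invented for illustration; they are not from the article.
TOY_DICT: Dict[str, Dict[str, float]] = {
    "haus": {"house": 0.8, "home": 0.2},
    "gross": {"big": 0.6, "large": 0.4},
    "katze": {"cat": 1.0},
}


def term_score(src: str, tgt: str, dictionary: Dict[str, Dict[str, float]]) -> float:
    """Weight of tgt as a translation of src (0.0 if the pair is not listed)."""
    return dictionary.get(src, {}).get(tgt, 0.0)


def candidate_score(src_terms: List[str], tgt_terms: List[str],
                    dictionary: Dict[str, Dict[str, float]]) -> float:
    """Step 1 (assumed form): order-free weighted overlap, averaged over source terms."""
    if not src_terms:
        return 0.0
    total = sum(
        max((term_score(s, t, dictionary) for t in tgt_terms), default=0.0)
        for s in src_terms
    )
    return total / len(src_terms)


def local_alignment(src_terms: List[str], tgt_terms: List[str],
                    dictionary: Dict[str, Dict[str, float]],
                    gap: float = 0.5) -> Tuple[float, List[Tuple[int, int]]]:
    """Step 2 (assumed form): Smith-Waterman-style local alignment of term sequences.

    Returns the best local score and the aligned (source index, target index) pairs.
    """
    n, m = len(src_terms), len(tgt_terms)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = term_score(src_terms[i - 1], tgt_terms[j - 1], dictionary)
            diag = score[i - 1][j - 1] + (w if w > 0 else -gap)
            score[i][j] = max(0.0, diag, score[i - 1][j] - gap, score[i][j - 1] - gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the local alignment score drops to zero.
    pairs: List[Tuple[int, int]] = []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        w = term_score(src_terms[i - 1], tgt_terms[j - 1], dictionary)
        diag = score[i - 1][j - 1] + (w if w > 0 else -gap)
        if score[i][j] == diag:
            if w > 0:
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


if __name__ == "__main__":
    source = ["gross", "haus"]
    suspicious = ["the", "big", "house"]
    # Step 1: keep the pair only if the cheap overlap score passes a low threshold.
    if candidate_score(source, suspicious, TOY_DICT) > 0.3:
        # Step 2: align term positions to confirm and localise the match.
        _, aligned = local_alignment(source, suspicious, TOY_DICT)
        print(aligned)  # [(0, 1), (1, 2)]
```

In this sketch the first function ignores word order, which favours recall, while the alignment step rewards translated terms that appear close together and in the same order, which is one simple way of taking term proximity into account at a finer granularity.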