Comparing techniques for authorship attribution of source code. (29th August 2012)
- Record Type:
- Journal Article
- Title:
- Comparing techniques for authorship attribution of source code. (29th August 2012)
- Main Title:
- Comparing techniques for authorship attribution of source code
- Authors:
- Burrows, Steven
Uitdenbogerd, Alexandra L.
Turpin, Andrew
- Abstract:
- SUMMARY: Attributing authorship of documents with unknown creators has been studied extensively for natural language text such as essays and literature, but less so for non-natural languages such as computer source code. Previous attempts at attributing authorship of source code can be categorised by two attributes: the software features used for the classification, either strings of n tokens/bytes (n-grams) or software metrics; and the classification technique that exploits those features, either information retrieval ranking or machine learning. The results of existing studies, however, are not directly comparable as all use different test beds and evaluation methodologies, making it difficult to assess which approach is superior. This paper summarises all previous techniques to source code authorship attribution, implements feature sets that are motivated by the literature, and applies information retrieval ranking methods or machine classifiers for each approach. Importantly, all approaches are tested on identical collections from varying programming languages and author types. Our conclusions are as follows: (i) ranking and machine classifier approaches are around 90% and 85% accurate, respectively, for a one-in-10 classification problem; (ii) the byte-level n-gram approach is best used with different parameters to those previously published; (iii) neural networks and support vector machines were found to be the most accurate machine classifiers of the eight evaluated; (iv) use of n-gram features in combination with machine classifiers shows promise, but there are scalability problems that still must be overcome; and (v) approaches based on information retrieval techniques are currently more accurate than approaches based on machine learning. Copyright © 2012 John Wiley & Sons, Ltd.
- Is Part Of:
- Software, practice & experience. Volume 44: Number 1 (2014)
- Journal:
- Software, practice & experience
- Issue:
- Volume 44: Number 1 (2014)
- Issue Display:
- Volume 44, Issue 1 (2014)
- Year:
- 2014
- Volume:
- 44
- Issue:
- 1
- Issue Sort Value:
- 2014-0044-0001-0000
- Page Start:
- 1
- Page End:
- 32
- Publication Date:
- 2012-08-29
- Subjects:
- Computer software -- Periodicals
Computer programming -- Periodicals
Computer programs -- Periodicals
005.3
- Journal URLs:
- http://onlinelibrary.wiley.com/
- DOI:
- 10.1002/spe.2146
- Languages:
- English
- ISSNs:
- 0038-0644
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 8321.453000
British Library DSC - BLDSS-3PM
British Library STI - ELD Digital store
- Ingest File:
- 3116.xml