Effective and Fast Near Duplicate Detection via Signature-Based Compression Metrics. (18th October 2016)
- Record Type:
- Journal Article
- Title:
- Effective and Fast Near Duplicate Detection via Signature-Based Compression Metrics. (18th October 2016)
- Main Title:
- Effective and Fast Near Duplicate Detection via Signature-Based Compression Metrics
- Authors:
- Zhang, Xi
Yao, Yuntao
Ji, Yingsheng
Fang, Binxing
- Other Names:
- Wu, Yuqiang (Academic Editor)
- Abstract:
- Detecting near duplicates on the web is challenging because of the web's volume and variety. Most previous studies require input parameters to be set, which makes it difficult for them to be robust across scenarios without careful tuning. Recently, a universal and parameter-free similarity metric, the normalized compression distance (NCD), has been employed effectively in diverse applications. Nevertheless, two problems prevent NCD from being applied to medium-to-large datasets: it lacks efficiency, and it tends to be skewed by large object sizes. To make this parameter-free method feasible on a large corpus of web documents, we propose a new method called SigNCD, which measures NCD on lightweight signatures instead of full documents, leading to improved efficiency and stability. We derive various lower bounds of NCD and propose pruning policies to further reduce computational complexity. We evaluate SigNCD on both English and Chinese datasets and show an increase in F1 score compared with the original NCD method, together with a significant reduction in runtime. Comparisons with other competitive methods also demonstrate the superiority of our method. Moreover, SigNCD requires no parameter tuning except for a similarity threshold.
- Is Part Of:
- Mathematical problems in engineering. Volume 2016(2016)
- Journal:
- Mathematical problems in engineering
- Issue:
- Volume 2016(2016)
- Issue Display:
- Volume 2016, Issue 2016 (2016)
- Year:
- 2016
- Volume:
- 2016
- Issue:
- 2016
- Issue Sort Value:
- 2016-2016-2016-0000
- Page Start:
- Page End:
- Publication Date:
- 2016-10-18
- Subjects:
- Engineering mathematics -- Periodicals
510.2462
- Journal URLs:
- https://www.hindawi.com/journals/mpe/
http://www.gbhap-us.com/journals/238/238-top.htm
- DOI:
- 10.1155/2016/3919043
- Languages:
- English
- ISSNs:
- 1024-123X
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library HMNTS - ELD Digital store
- Ingest File:
- 10307.xml
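
The abstract above refers to the normalized compression distance (NCD). As a rough illustration only, the sketch below computes the commonly cited form NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is a compressed length. The choice of `zlib` as the compressor, the helper names `compressed_size` and `ncd`, and the sample strings are all assumptions made for this example. This record does not describe how SigNCD builds its signatures, its lower bounds, or its pruning policies, so none of that is reproduced here.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Smaller values suggest more shared content. The result is only an
    approximation, because a real compressor is not ideal.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample documents, not taken from the paper's datasets.
doc_a = (b"Detecting near duplicates on the web is challenging due to its "
         b"volume and variety. Most previous studies require the setting "
         b"of input parameters.")
doc_b = doc_a.replace(b"challenging", b"difficult")  # near duplicate of doc_a
doc_c = (b"The quick brown fox jumps over the lazy dog while seventeen "
         b"purple zebras quietly practise calculus beside a frozen fjord.")

print("near duplicate:", round(ncd(doc_a, doc_b), 3))
print("unrelated:     ", round(ncd(doc_a, doc_c), 3))
```

In a near-duplicate detector of this kind, a pair would be flagged when its distance falls below a chosen threshold; the abstract names a similarity threshold as the only setting SigNCD needs, but gives no value for it.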