Continuous outlier mining of streaming data in flink. (November 2020)
- Record Type:
- Journal Article
- Title:
- Continuous outlier mining of streaming data in flink. (November 2020)
- Main Title:
- Continuous outlier mining of streaming data in flink
- Authors:
- Toliopoulos, Theodoros
Gounaris, Anastasios
Tsichlas, Kostas
Papadopoulos, Apostolos
Sampaio, Sandra
- Abstract:
- In this work, we focus on distance-based outliers in a metric space, where the status of an entity as to whether it is an outlier is based on the number of other entities in its neighborhood. In recent years, several solutions have tackled the problem of distance-based outliers in data streams, where outliers must be mined continuously as new elements become available. An interesting research problem is to combine the streaming environment with massively parallel systems to provide scalable stream-based algorithms. However, none of the previously proposed techniques refer to a massively parallel setting. Our proposal fills this gap and investigates the challenges in transferring state-of-the-art techniques to Apache Flink, a modern platform for intensive streaming analytics. We thoroughly present the technical challenges encountered and the alternatives that may be applied, of which a micro-clustering-based one is the most efficient. We show speed-ups of up to 2.27 times over advanced non-parallel solutions, by using just an ordinary four-core machine and a real-world dataset. When moving to a three-machine cluster, due to less contention, we manage to achieve both better scalability in terms of the window slide size and the data dimensionality, and even higher speed-ups, e.g., by a factor of more than 11X. Overall, our results demonstrate that outlier mining can be achieved in an efficient and scalable manner. The resulting techniques have been made publicly available as open-source software.
Highlights: We explore a series of implementation alternatives, differing in the algorithmic features they employ. We provide thorough experimental evaluation results. Overall, we demonstrate how outlier mining can be achieved in an efficient and scalable manner through parallelism. We offer the source code as an open-source library.
- Is Part Of:
- Information systems. Volume 93(2020)
- Journal:
- Information systems
- Issue:
- Volume 93(2020)
- Issue Display:
- Volume 93, Issue 2020 (2020)
- Year:
- 2020
- Volume:
- 93
- Issue:
- 2020
- Issue Sort Value:
- 2020-0093-2020-0000
- Page Start:
- Page End:
- Publication Date:
- 2020-11
- Subjects:
- Distance-based outlier detection -- Flink -- Data streams
Database management -- Periodicals
Electronic data processing -- Periodicals
Bases de données -- Gestion -- Périodiques
Informatique -- Périodiques
Database management
Electronic data processing
Periodicals
005.7
- Journal URLs:
- http://www.sciencedirect.com/science/journal/03064379
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.is.2020.101569
- Languages:
- English
- ISSNs:
- 0306-4379
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 4496.367300
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 13541.xml