Continuous Outlier Mining of Streaming Data in Flink

Research output: Contribution to journal › Article › peer-review

Authors:
  • Theodoros Toliopoulos
  • Anastasios Gounaris
  • Kostas Tsichlas
  • Apostolos Papadopoulos
  • Sandra Sampaio

Abstract

In this work, we focus on distance-based outliers in a metric space, where the status of an entity as to whether it is an outlier is based on the number of other entities in its neighborhood. In recent years, several solutions have tackled the problem of distance-based outliers in data streams, where outliers must be mined continuously as new elements become available. An interesting research problem is to combine the streaming environment with massively parallel systems to provide scalable stream-based algorithms. However, none of the previously proposed techniques refer to a massively parallel setting. Our proposal fills this gap and investigates the challenges in transferring state-of-the-art techniques to Apache Flink, a modern platform for intensive streaming analytics. We thoroughly present the technical challenges encountered and the alternatives that may be applied, of which a micro-clustering-based one is the most efficient. We show speed-ups of up to 2.27 times over advanced non-parallel solutions, by using just an ordinary four-core machine and a real-world dataset. When moving to a three-machine cluster, due to less contention, we manage to achieve both better scalability in terms of the window slide size and the data dimensionality, and even higher speed-ups, e.g., by a factor of more than 11X. Overall, our results demonstrate that outlier mining can be achieved in an efficient and scalable manner. The resulting techniques have been made publicly available as open-source software.
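The distance-based outlier definition used in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' Flink implementation or their micro-clustering technique; it is a naive quadratic baseline over a count-based sliding window. The function names and the parameters `radius`, `k`, `window_size`, and `slide` are hypothetical, and Euclidean distance stands in for a general metric.

```python
from collections import deque
from math import dist  # Euclidean distance; assumed here in place of a general metric


def window_outliers(window, radius, k):
    """Return the points that have fewer than k other points within `radius`."""
    outliers = []
    for i, p in enumerate(window):
        neighbors = sum(
            1 for j, q in enumerate(window) if i != j and dist(p, q) <= radius
        )
        if neighbors < k:
            outliers.append(p)
    return outliers


def stream_outliers(stream, window_size, slide, radius, k):
    """Yield (points_seen, outliers) every `slide` arrivals over a sliding window."""
    window = deque(maxlen=window_size)  # the oldest points expire automatically
    for n, point in enumerate(stream, start=1):
        window.append(point)
        if n % slide == 0:
            yield n, window_outliers(list(window), radius, k)


# Hypothetical toy stream: three nearby points and one isolated point.
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5)]
reports = list(stream_outliers(points, window_size=4, slide=2, radius=1.0, k=1))
```

Every slide recomputes all pairwise distances in the window, so the cost grows quadratically with the window size. That cost is what incremental and parallel approaches such as those described in the abstract aim to reduce.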

Bibliographical metadata

Original language: English
Article number: 101569
Number of pages: 45
Journal: Information Systems
Volume: 93
Early online date: 29 May 2020
Publication status: Published - 1 Nov 2020
