Evolving Controllably Difficult Datasets for Clustering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Synthetic datasets play an important role in evaluating clustering algorithms, as they can help shed light on consistent biases, strengths, and weaknesses of particular techniques, thereby supporting sound conclusions. Despite this, there is a surprisingly small set of established clustering benchmark data, and many of these are currently handcrafted. Even then, their difficulty is typically not quantified or considered, limiting the ability to interpret algorithmic performance on these datasets. Here, we introduce HAWKS, a new data generator that uses an evolutionary algorithm to evolve the cluster structure of a synthetic dataset. We demonstrate how such an approach can be used to produce datasets of a pre-specified difficulty and to trade off different aspects of problem difficulty, and how these interventions translate directly into changes in the clustering performance of established algorithms.

Bibliographical metadata

Original language: English
Title of host publication: Proceedings of the Annual Conference on Genetic and Evolutionary Computation (GECCO '19)
Publication status: Published - 13 Jul 2019
Event: The Genetic and Evolutionary Computation Conference - Prague, Czech Republic
Event duration: 13 Jul 2019 to 17 Jul 2019


Conference: The Genetic and Evolutionary Computation Conference
Abbreviated title: GECCO 2019
Country: Czech Republic