DisClose: Discovering Colossal Closed Itemsets from High Dimensional Datasets via a Compact Row-Tree

UoM administered thesis: Phd

  • Authors:
  • Nurul Zulkurnain


Data mining is an essential part of knowledge discovery, and performs the extraction of useful information from a collection of data, so as to assist human beings in making necessary decisions. This thesis describes research in the field of itemset mining, which performs the extraction of a set of items that occur together in a dataset, based on a user specified threshold. Recent focus of itemset mining has been on the discovery of closed itemsets from high-dimensional datasets, characterised by relatively few rows and a relatively larger number of columns. A closed itemset is the maximal set of items common to a set of rows. By exponentially increasing running time as the average row length increases, mining closed itemsets from such datasets renders most column enumeration-based algorithm impractical. Existing row enumeration-based algorithms also show that they struggle to reach large cardinality closed itemsets. This is due to the implementation of the support constraint, which is based on the frequency of occurrence of the itemset. Frequent closed itemsets are usually smaller in size and larger in numbers, hence taking much of the memory space. Unfortunately, large cardinality closed itemsets are likely to be more informative than small cardinality closed itemsets in this type of dataset. The research investigates the area of large cardinality closed itemset discovery by examining and analysing the literature and identifying both strengths and weaknesses of existing approaches. Based on this synthesis, a new algorithm, termed DisClose, has been designed and developed to discover large cardinality (colossal) closed itemsets from high-dimensional datasets. The algorithm strategy begins by enumerating large cardinality itemsets and from these, builds smaller itemsets. This is done by applying a bottom-up search of the row-enumeration tree. A minimum cardinality threshold has been proposed to identify colossal closed itemsets and to further reduce the search space. A novel closedness-checking method has been proposed which uses a unique generator to immediately discover closed itemsets without the need to check if each new closed itemset has previously been found. These approaches have been combined using a Compact Row-Tree (CR-Tree) data structure designed to assist in the efficient discovery of the colossal closed itemsets. For evaluation purposes four state-of-the-art algorithms have been selected for comparison. Experimental results show that algorithm DisClose is scalable and can efficiently extract colossal closed itemsets in the considered dataset, even for low support thresholds that existing algorithms cannot find.


Original languageEnglish
Awarding Institution
  • John Keane (Supervisor)
Award date31 Dec 2012