Classifying SDSS Data using Active Learning

UoM administered thesis: Master of Science by Research

  • Authors: Nathan Steer

Abstract

This thesis applies active learning to a dataset of spectroscopically labelled sources from the Sloan Digital Sky Survey (SDSS). The sources are selected from the photometric data in the SDSS and the Wide-field Infrared Survey Explorer (WISE). Two machine learning techniques were used: a neural network and a random forest classifier. Four active learning methods were investigated with these data: uncertainty sampling, best-vs-second-best, variance reduction, and learning active learning, with a generic random method as a control. Uncertainty sampling was implemented using a form known as the entropy measure, for which a binary case and a multi-class case were tested separately. These machine learning techniques were also applied to different configurations of Gaussian clouds to help understand their effect on different types of data. Learning active learning received particular focus as the most expandable method. To assist in the selection of active learning methods, the average accuracy scores and feature importances, as well as the class precision, recall, and F1-scores, were all compared. Based on these tests, entropy sampling and learning active learning were selected as the most capable methods, requiring only 25,600 datapoints in the training set, with the latter having the most room for improvement.
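
As an illustration of two of the query strategies named in the abstract, the following is a minimal sketch of entropy-based uncertainty sampling and best-vs-second-best selection. It is not taken from the thesis: the function names (entropy, bvsb_margin, select_queries) and the example probabilities are hypothetical, and the sketch assumes a classifier that outputs one probability per class for each unlabelled source.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def bvsb_margin(probs):
    """Gap between the two highest class probabilities (small = uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def select_queries(pool_probs, n_queries, method="entropy"):
    """Return indices of the n_queries most uncertain points in the pool."""
    if method == "entropy":
        # Higher entropy means the classifier is less certain.
        scores = [entropy(p) for p in pool_probs]
    elif method == "bvsb":
        # Smaller margin means less certain, so negate it for ranking.
        scores = [-bvsb_margin(p) for p in pool_probs]
    else:
        raise ValueError(f"unknown method: {method}")
    ranked = sorted(range(len(pool_probs)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_queries]


# Hypothetical class probabilities for four unlabelled sources, three classes.
pool_probs = [
    [0.98, 0.01, 0.01],  # confidently classified
    [0.40, 0.35, 0.25],  # spread over all three classes
    [0.50, 0.49, 0.01],  # near tie between the top two classes
    [0.70, 0.20, 0.10],
]

print(select_queries(pool_probs, 2, method="entropy"))
print(select_queries(pool_probs, 2, method="bvsb"))
```

In this toy pool the two strategies pick different points: the entropy measure favours the source whose probability is spread over all classes, while best-vs-second-best favours the near tie between the top two classes. In the binary case both reduce to choosing the points closest to a probability of 0.5.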

Details

Original language: English
Awarding Institution
Supervisors/Advisors
Award date: 1 Aug 2022