We study the nature of filter methods for feature selection. In particular, we examine information theoretic approaches to this problem, looking at the literature over the past 20 years. We consider this literature from a different perspective, by viewing feature selection as a process which minimises a loss function. We choose a loss function based on the model likelihood, so that minimising the loss corresponds to maximising the likelihood. The first contribution of this thesis is to show that the problem of information theoretic filter feature selection can be rephrased as maximising the likelihood of a discriminative model. From this novel result we can unify the literature, revealing that many of these selection criteria are approximate maximisers of the joint likelihood. Many of these heuristic criteria were hand-designed to optimise various definitions of feature "relevancy" and "redundancy", but our probabilistic interpretation naturally includes these concepts, plus the "conditional redundancy", which is a measure of positive interactions between features. This perspective allows us to derive the different criteria from the joint likelihood by making different independence assumptions on the underlying probability distributions. We provide an empirical study which reinforces our theoretical conclusions, whilst revealing implementation considerations due to the varying magnitudes of the relevancy and redundancy terms.

We then investigate the benefits our probabilistic perspective provides for the application of these feature selection criteria in new areas. The joint likelihood automatically includes a prior distribution over the selected feature sets and so we investigate how including prior knowledge affects the feature selection process. We can now incorporate domain knowledge into feature selection, allowing the imposition of sparsity on the selected feature set without using heuristic stopping criteria.
We investigate the use of priors mainly in the context of Markov Blanket discovery algorithms, in the process showing that a family of algorithms based upon IAMB are iterative maximisers of our joint likelihood with respect to a particular sparsity prior. We thus extend the IAMB family to include a prior for domain knowledge in addition to the sparsity prior.

Next we investigate what the choice of likelihood function implies about the resulting filter criterion. We do this by applying our derivation to a cost-weighted likelihood, showing that this likelihood implies a particular cost-sensitive filter criterion. This criterion is based on a weighted branch of information theory and we prove several novel results justifying its use as a feature selection criterion, namely the positivity of the measure and the chain rule of mutual information. We show that the feature set produced by this cost-sensitive filter criterion can be used to convert a cost-insensitive classifier into a cost-sensitive one by adjusting the features the classifier sees. This can be seen as an analogous process to that of adjusting the data via over- or undersampling to create a cost-sensitive classifier, but with the crucial difference that it does not artificially alter the data distribution.

Finally, we conclude with a summary of the benefits this loss function view of feature selection has provided. This perspective can be used to analyse feature selection techniques other than those based upon information theory, and new groups of selection criteria can be derived by considering novel loss functions.
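
As a rough illustration of how the relevancy, redundancy and conditional redundancy terms discussed above can combine in a single filter criterion, the following Python sketch performs greedy forward selection on discrete data. The linear form of the score, the weights `beta` and `gamma`, the plug-in entropy estimates, the function names and the toy data are all assumptions made for illustration; they are not taken from the text above and should not be read as the exact criteria analysed in the thesis.

```python
import math
from collections import Counter


def entropy(*columns):
    """Plug-in joint entropy (in nats) of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * math.log(c / n) for c in Counter(joint).values())


def mutual_info(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def cond_mutual_info(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def greedy_filter(features, y, k, beta=None, gamma=None):
    """Greedily pick k feature names from a dict of discrete columns.

    Hypothetical score for a candidate X_k given the selected set S:
        I(X_k;Y) - beta * sum_j I(X_j;X_k) + gamma * sum_j I(X_j;X_k|Y)
    A weight left as None defaults to 1/|S| (an averaging assumption).
    """
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:

        def score(name):
            xk = features[name]
            relevancy = mutual_info(xk, y)
            if not selected:
                return relevancy
            b = 1.0 / len(selected) if beta is None else beta
            g = 1.0 / len(selected) if gamma is None else gamma
            redundancy = sum(mutual_info(features[j], xk) for j in selected)
            cond_redundancy = sum(
                cond_mutual_info(features[j], xk, y) for j in selected
            )
            return relevancy - b * redundancy + g * cond_redundancy

        best = max(remaining, key=score)  # ties go to the earliest candidate
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a_copy" duplicates "a"; "c" is weaker alone but complements "a".
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
features = {
    "a":      [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "c":      [1, 0, 0, 0, 0, 1, 1, 1, 1, 0],
}
print(greedy_filter(features, y, k=2))                       # ['a', 'c']
print(greedy_filter(features, y, k=2, beta=0.0, gamma=0.0))  # ['a', 'a_copy']
```

With both weights set to zero the score ranks on relevancy alone and so selects the duplicate feature, whereas the default 1/|S| weighting (in the style of the Joint Mutual Information criterion) penalises the duplicate and prefers the complementary feature.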