We compare two methods for assessing the performance of groupwise non-rigid registration (NRR) algorithms. The first approach, which has been described previously, uses a measure of overlap between ground-truth anatomical labels. The second, which is new, exploits the fact that, given a set of non-rigidly registered images, a generative statistical model of appearance can be constructed. We observe that the quality of this model depends on the quality of the registration, and define measures of model specificity and generalisation (based on comparing synthetic images sampled from the model with those in the original image set) that can be used to assess model and registration quality. We show that both approaches detect the loss of registration accuracy as the alignment of a set of correctly registered MR images of the brain is progressively perturbed. We compare the sensitivities of the two approaches and show that, as well as requiring no ground truth, specificity provides the most sensitive measure of misregistration. Finally, we use specificity and generalisation to compare three NRR algorithms.
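
As a rough illustration of how measures of this kind might be computed, the sketch below treats each image as a flattened intensity vector and compares a set of synthetic images sampled from a model with the original image set. The nearest-neighbour formulation, the Euclidean image distance, and the function names are assumptions made for illustration only; the abstract does not specify them, and a different image distance could be substituted.

```python
import numpy as np


def nearest_distances(a, b):
    """For each row of `a`, return the Euclidean distance to the closest row of `b`.

    Both arguments are 2-D arrays of shape (n_images, n_pixels),
    i.e. one flattened image per row.
    """
    diffs = a[:, None, :] - b[None, :, :]      # shape (len(a), len(b), n_pixels)
    dists = np.sqrt((diffs ** 2).sum(axis=2))  # pairwise distances
    return dists.min(axis=1)


def specificity(synthetic, training):
    """Assumed form: mean distance from each synthetic image to its nearest
    training image. Lower values mean the model generates images that
    resemble the originals."""
    return float(nearest_distances(synthetic, training).mean())


def generalisation(synthetic, training):
    """Assumed form: mean distance from each training image to its nearest
    synthetic image. Lower values mean the model covers the original set."""
    return float(nearest_distances(training, synthetic).mean())


# Toy example with two-pixel "images" (hypothetical data).
training = np.array([[0.0, 0.0], [10.0, 0.0]])
synthetic = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.0]])

spec = specificity(synthetic, training)
gen = generalisation(synthetic, training)
print(spec, gen)
```

In this toy case every synthetic image lies near the first training image, so specificity is low while generalisation is penalised by the second training image, which no synthetic sample approaches. Under these assumptions, a degraded registration would be expected to raise both values.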