Similarity (and distance metric) learning plays an important role in many artificial intelligence tasks that aim to quantify the relevance between objects. We address the challenge of learning complex relation patterns from data objects exhibiting heterogeneous properties, and develop an effective multi-view multimodal similarity learning model with improved learning performance and model interpretability. The proposed method first computes multi-view convolutional features to obtain an improved object representation, then analyses the similarities between objects by operating over multiple hidden relation types (modalities), and finally fine-tunes all model parameters by back-propagating a ranking loss through to the convolutional layers. We develop a topic-driven initialization scheme so that each learned relation type can be interpreted as a representative of a semantic topic of the objects. To further improve model interpretability and generalization, sparsity is imposed over these hidden relations. The proposed method is evaluated on the image retrieval task using challenging image datasets, and is compared with seven state-of-the-art algorithms in the field. Experimental results demonstrate significant performance improvement of the proposed method over the competing ones.
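To make the high-level pipeline concrete, the following is a minimal NumPy sketch of one plausible form of the multimodal similarity score and ranking loss described above: a gated sum of per-relation bilinear similarities, a triplet hinge ranking loss, and an L1 penalty encouraging sparse hidden relations. All function names, parameter shapes, and the specific gating and penalty forms are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 8, 3                                  # feature dim, number of hidden relation types (assumed)
W = rng.normal(scale=0.1, size=(M, d, d))    # one bilinear similarity map per relation type
V = rng.normal(scale=0.1, size=(M, d, d))    # gating parameters (assumed form)

def relation_gates(x, y, V):
    """Softmax gate deciding how much each hidden relation type contributes."""
    logits = np.array([x @ Vm @ y for Vm in V])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def similarity(x, y):
    """Overall similarity: gate-weighted sum of per-modality bilinear scores."""
    g = relation_gates(x, y, V)              # per-modality weights, sum to 1
    scores = np.array([x @ Wm @ y for Wm in W])
    return float(g @ scores)

def triplet_ranking_loss(anchor, pos, neg, margin=1.0):
    """Hinge ranking loss: the positive pair should outscore the negative by a margin."""
    return max(0.0, margin - similarity(anchor, pos) + similarity(anchor, neg))

def l1_sparsity_penalty(lam=1e-3):
    """L1 penalty over the relation maps, encouraging sparse hidden relations."""
    return lam * np.abs(W).sum()

x, yp, yn = rng.normal(size=(3, d))          # anchor, positive, negative features
loss = triplet_ranking_loss(x, yp, yn) + l1_sparsity_penalty()
```

In the full model, the features `x`, `y` would come from the multi-view convolutional layers, and the gradient of this loss would be back-propagated through them end to end.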