Learning of Multimodal Representations With Random Walks on the Click Graph

Abstract

In multimedia information retrieval, most classic approaches represent different modalities of media in the same feature space. Given click data collected from users’ search behavior, existing approaches take either one-to-one paired data (text-image pairs) or ranking examples (text-query-image and/or image-query-text ranking lists) as training examples. Neither makes full use of the click data, particularly the implicit connections among the data objects. In this paper, we treat the click data as a large click graph, in which vertices are images or text queries and edges indicate clicks between an image and a query. We consider learning a multimodal representation from the perspective of encoding the explicit and implicit relevance relationships between the vertices in the click graph. By minimizing both the truncated random walk loss and the distance between the learned representation of each vertex and its corresponding deep neural network output, the proposed model, named multimodal random walk neural network (MRW-NN), can not only learn a robust representation of the existing multimodal data in the click graph, but also handle unseen queries and images to support cross-modal retrieval. We evaluate the latent representation learned by MRW-NN on Clickture, a public large-scale click log data set, and further show that MRW-NN achieves much better cross-modal retrieval performance on unseen queries and images than other state-of-the-art methods.
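To make the click-graph idea concrete, the following is a minimal sketch, not the MRW-NN implementation. It assumes a click log of hypothetical (query, image, clicks) records, builds an undirected bipartite graph from them, and samples one truncated random walk. The function names, the toy data, and the choice of click-count-weighted transitions are all assumptions made for illustration; the abstract does not specify them.

```python
import random
from collections import defaultdict


def build_click_graph(click_log):
    """Build an undirected bipartite graph from (query, image, clicks) records.

    Vertices are tagged tuples: ("q", text) for queries, ("i", name) for images.
    Edge weights are accumulated click counts (an assumption for this sketch).
    """
    graph = defaultdict(dict)
    for query, image, clicks in click_log:
        q, i = ("q", query), ("i", image)
        graph[q][i] = graph[q].get(i, 0) + clicks
        graph[i][q] = graph[i].get(q, 0) + clicks
    return graph


def truncated_random_walk(graph, start, length, rng):
    """Return a walk of at most `length` vertices starting at `start`.

    The next vertex is drawn in proportion to the click count on each edge.
    """
    walk = [start]
    while len(walk) < length:
        neighbours = graph[walk[-1]]
        if not neighbours:
            break  # isolated vertex: the walk cannot continue
        nodes = list(neighbours)
        weights = [neighbours[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


# Hypothetical toy click log: "red car" and "sports car" never co-occur,
# but both were clicked for img2.jpg, so a walk can connect them implicitly.
click_log = [
    ("red car", "img1.jpg", 5),
    ("red car", "img2.jpg", 1),
    ("sports car", "img2.jpg", 3),
]
graph = build_click_graph(click_log)
rng = random.Random(0)
walk = truncated_random_walk(graph, ("q", "red car"), 5, rng)
print(walk)
```

Because the graph is bipartite, every walk alternates between query and image vertices. Vertices that co-occur within a walk but share no direct edge, such as the two queries above, are the kind of implicit connection the abstract refers to.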

Conclusions

In this work, we have presented a new approach to learning latent representations of multimodal data from a click graph. By minimizing the random walk error and the regularization penalty from the output of the modality-specific neural networks, the learned model is able not only to represent the explicit and implicit connections of the vertices in the click graph with low-dimensional continuous vectors, but also to map unseen queries and images to the latent subspace to support cross-modal retrieval. We have demonstrated the effectiveness of the representation learned by the proposed method, MRW-NN, and shown its superiority over the comparative methods in cross-modal retrieval on a large-scale click log dataset.
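Once queries and images share a latent subspace, cross-modal retrieval reduces to a nearest-neighbour ranking. The sketch below illustrates that step only. It assumes cosine similarity as the ranking score and uses made-up two-dimensional vectors in place of the outputs of the trained networks; neither the similarity measure nor the values come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_images(query_vec, image_vecs):
    """Return image names sorted by decreasing similarity to the query vector."""
    return sorted(
        image_vecs,
        key=lambda name: cosine(query_vec, image_vecs[name]),
        reverse=True,
    )


# Hypothetical latent vectors; in practice these would be produced by the
# text and image networks for previously unseen queries and images.
image_vecs = {
    "img1.jpg": [0.9, 0.1],
    "img2.jpg": [0.1, 0.9],
    "img3.jpg": [0.7, 0.7],
}
query_vec = [1.0, 0.0]
ranking = rank_images(query_vec, image_vecs)
print(ranking)  # ['img1.jpg', 'img3.jpg', 'img2.jpg']
```

The same ranking function works in the other direction (image query, text results) by swapping which modality supplies the query vector.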