Learning visual and textual representations for multimodal matching and classification. (December 2018)