Understanding the semantic shifts of multimodal information is only possible with models that capture cross-modal interactions over time. In this work, we developed a novel embedding that structures visual-textual interactions along the temporal dimension, thus preserving the data’s original temporal organization.
The key contributions of this paper are:
- The first diachronic cross-modal embedding learning approach, in which the evolution of multimodal data correlations is modelled. Time is modelled explicitly, allowing conditioning on time at both training and inference;
- A novel temporally constrained ranking loss formulation that aligns instance embeddings over time and enables the learning of neural projections from timestamped multimodal data;
- A principled approach that offers statistical guarantees and allows for correct joint inferences (image+text+time) that other methods do not support, enabling its use in a wide range of media interpretation tasks.
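To make the idea of a temporally constrained ranking loss more concrete, here is a minimal sketch of a max-margin (hinge) ranking loss in which negatives are drawn only from instances that are close in time to the anchor. The windowing rule, the `margin` and `window` parameters, and the function names are assumptions made for illustration; this is not the formulation used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def temporal_ranking_loss(img_embs, txt_embs, times, margin=0.2, window=1.0):
    """Hinge ranking loss over aligned image-text pairs.

    Pair i is (img_embs[i], txt_embs[i]) observed at times[i].
    Assumption (hypothetical rule): only texts whose timestamp lies
    within `window` of the anchor image are used as negatives.
    """
    total, count = 0.0, 0
    for i, (img, t_i) in enumerate(zip(img_embs, times)):
        positive = cosine(img, txt_embs[i])
        for j, t_j in enumerate(times):
            if j == i or abs(t_i - t_j) > window:
                continue  # skip the positive pair and temporally distant texts
            negative = cosine(img, txt_embs[j])
            # Penalise negatives that score within `margin` of the positive.
            total += max(0.0, margin - positive + negative)
            count += 1
    return total / count if count else 0.0
```

In this sketch, correctly aligned pairs that are well separated from their temporal neighbours contribute zero loss, while pairs outside the window never interact at all.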
Visual (blue) and textual (purple) instances at an instant t^i are mapped to a D-dimensional diachronic embedding space. A shared temporal structuring layer takes the timestamp t^i as input and learns an embedding for it, which is then used to condition each modality projection on time independently. A diachronic ranking loss is responsible for structuring instances over time.
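The architecture described above can be sketched as a forward pass: a shared layer embeds the timestamp, and each modality has its own projection that receives that time embedding alongside its features. The sketch below is only an illustration under stated assumptions: conditioning by concatenation, single linear layers, the `tanh` activation, the random initialisation, and the class and method names (`DiachronicProjector`, `embed_time`, `project_image`, `project_text`) are all hypothetical and not taken from the paper.

```python
import math
import random


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def l2_normalize(v):
    """Scale v to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class DiachronicProjector:
    """Toy two-branch projector with a shared time-embedding layer."""

    def __init__(self, img_dim, txt_dim, time_dim, emb_dim, seed=0):
        rng = random.Random(seed)
        # Shared temporal layer: scalar timestamp -> time embedding.
        self.w_time = random_matrix(time_dim, 1, rng)
        # Independent projections, one per modality (assumed linear).
        self.w_img = random_matrix(emb_dim, img_dim + time_dim, rng)
        self.w_txt = random_matrix(emb_dim, txt_dim + time_dim, rng)

    def embed_time(self, t):
        return [math.tanh(h) for h in linear(self.w_time, [t])]

    def project_image(self, feat, t):
        # Assumption: conditioning is done by concatenating the time embedding.
        return l2_normalize(linear(self.w_img, list(feat) + self.embed_time(t)))

    def project_text(self, feat, t):
        return l2_normalize(linear(self.w_txt, list(feat) + self.embed_time(t)))
```

Both branches call the same `embed_time`, so image and text features observed at the same instant are conditioned on an identical time embedding before landing in the common D-dimensional space.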
Dataset – 20+ Years of Flickr Multimodal Instances
To access the dataset, please fill in the following form:
If you find this work useful or use our dataset, please cite our paper:
Diachronic Cross-modal Embeddings, Semedo D., Magalhães J., ACM Multimedia 2019, Nice, France. [PDF]
This work has been partially funded by the CMU Portugal research project GoLocal Ref. CMUP-ERI/TIC/0046/2014, by the H2020 ICT project COGNITUS with the grant agreement nº 687605 and by the FCT project NOVA LINCS Ref. UID/CEC/04516/2019. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.