** Under Construction **
There are many domains where the temporal dimension is critical to understanding how different modalities, such as images and text, are correlated. Notably, in the social media domain, information constantly evolves over time according to the events that take place in the real world.
In this work, we seek highly expressive loss functions that encode data's temporal traits into cross-modal embedding spaces.
To achieve this goal, we propose to steer the learning procedure of such embeddings through a set of adaptively enforced temporal constraints. In particular, we propose a new formulation of the triplet loss function, in which the traditional static margin is superseded by a novel temporally adaptive maximum margin function. This redesign of the static margin formulation allows the embedding to effectively capture not only the semantic correlations across data modalities, but also the data's fine-grained temporal correlations.
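The core idea above, replacing the static triplet-loss margin with a margin that adapts to temporal distance, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the linear margin function and the `base_margin` and `scale` parameters are assumptions chosen for illustration.

```python
import numpy as np

def adaptive_temporal_triplet_loss(anchor, positive, negative,
                                   t_anchor, t_negative,
                                   base_margin=0.2, scale=0.1):
    """Triplet loss with a temporally adaptive margin (illustrative sketch).

    Instead of a fixed margin, the margin grows with the temporal
    distance between the anchor and the negative, so temporally distant
    negatives are pushed further away in the embedding space. The linear
    margin function used here is an assumption, not the paper's exact
    formulation.
    """
    d_pos = np.linalg.norm(anchor - positive)   # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative)   # anchor-negative distance
    margin = base_margin + scale * abs(t_anchor - t_negative)
    return max(0.0, d_pos - d_neg + margin)
```

With a static margin, a negative that is already `margin` further from the anchor than the positive contributes zero loss; here, the same negative may still incur loss if it is temporally distant, which encourages the embedding to separate content from different time periods.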
The key contributions of this paper are:
- A neural framework for temporal cross-modal embedding learning, supporting a fine-grained structuring of the embedding space to cope with the data's complex temporal correlations;
- An expressive adaptive temporal triplet-loss formulation that enables an effective joint structuring of both data modalities (image+text) and contextualizing information in a common cross-modal space;
- A thorough analysis of the proposed approach, comprising a canonical demonstration and evaluation of what the proposed model accomplishes, evidencing the characteristics of the temporal embedding.
Embedding space conditioned on a continuous variable (time)
NUS-WIDE extension – Timestamped data:
To access the dataset, please fill in the following form:
If you find this work useful and/or if you use the extended dataset, please cite our work:
Adaptive Temporal Triplet-loss for Cross-modal Embedding Learning, Semedo D., Magalhães J., ACM Multimedia 2020, Seattle, USA. [PDF]
Temporal Cross-Media Retrieval with Soft-Smoothing, Semedo D., Magalhães J., ACM Multimedia 2018, Seoul, Korea. [PDF]
This work has been partially funded by the iFetch project, Ref. 45920, co-financed by ERDF, COMPETE 2020, NORTE 2020 and FCT under CMU Portugal, by the CMU Portugal GoLocal project Ref. CMUP-ERI/TIC/0046/2014, by the H2020 ICT COGNITUS project with the grant agreement no. 687605, and by the FCT project NOVA LINCS Ref. UID/CEC/04516/2019. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.