ReINVenTA aims to build and evaluate a computational model for representing the semantics of multimodal objects, such as videos with and without audio description and pictures with captions. ReINVenTA is based on FrameNet, a computational model of linguistic cognition structured around frames, their elements, and the relations between them. The project assumes that, just as lexical items can evoke frames, visual elements may also evoke frames or interact with frames evoked by language. The methodology includes building a multimodal gold-standard dataset for use in tasks such as automatic semantic role labelling of multimodal objects.
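The core idea, that both lexical items and visual elements can evoke the same frame, can be sketched as a small data model. Everything below is a hypothetical illustration, not the project's actual schema: the `Frame` and `Annotation` classes, the frame and element names, and the element identifiers are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """A FrameNet-style frame with its core frame elements."""
    name: str
    core_elements: list

@dataclass
class Annotation:
    """One frame evocation by an element of a multimodal object."""
    evoker: str    # a lexical item or a visual element identifier (hypothetical)
    modality: str  # e.g. "text" or "video"
    frame: Frame
    fillers: dict = field(default_factory=dict)  # frame element -> filler

# Hypothetical frame and annotations for a captioned video clip.
motion = Frame("Self_motion", ["Self_mover", "Goal", "Path"])
anns = [
    Annotation("walks", "text", motion, {"Self_mover": "the woman"}),
    Annotation("obj_12_person", "video", motion, {"Self_mover": "obj_12_person"}),
]

# Group annotations by frame to surface cross-modal evocations of the same frame.
by_frame = {}
for a in anns:
    by_frame.setdefault(a.frame.name, []).append(a.modality)
print(by_frame)  # {'Self_motion': ['text', 'video']}
```

In a gold-standard dataset along these lines, such cross-modal links between annotations would be what a semantic role labelling system is trained and evaluated on.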