Automatic Labeling of Visual Data


Today, we live in a world where breaking news is almost instantly captured by the crowd and shared on social networks such as Twitter, Instagram or TikTok. Yet, media companies are currently unable to effectively extract meaningful information from the deluge of visual data uploaded online every second. The VITA laboratory develops such technologies; they were used, for example, in 2019 to estimate the size of the crowd during the 14 June women’s strike in Geneva, using images and videos captured by demonstrators. The resulting published article even questioned the official estimates.

In this project, the VITA laboratory, in partnership with RTS, aims to develop computationally efficient methods that will go beyond crowd counting, and extract a large set of semantics ranging from a list of objects to human actions and their relationships. The proposed technology will allow complex metadata to be automatically extracted from any image or video. By working in real time on live or crowdsourced content, this information will enable journalists to enhance the quality of their coverage, and to produce new and innovative formats. Content recommendation or content retrieval systems will also greatly benefit from the enriched metadata.
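To make the idea of "enriched metadata" concrete, here is a minimal illustrative sketch (not the project's actual pipeline): it aggregates per-frame detector outputs into searchable tags and counts. The `(label, score)` input format and the confidence threshold are assumptions, standing in for the output of an object detector.

```python
from collections import Counter

def summarize_detections(detections, score_threshold=0.5):
    """Aggregate per-frame detections into searchable metadata.

    `detections` is assumed to be a list of (label, score) pairs,
    e.g. the output of an off-the-shelf object detector.
    """
    counts = Counter(
        label for label, score in detections if score >= score_threshold
    )
    return {
        "tags": sorted(counts),               # unique object labels
        "counts": dict(counts),               # per-label counts (e.g. crowd size)
        "num_objects": sum(counts.values()),  # total confident detections
    }

# Example: detections from one crowdsourced frame
frame = [("person", 0.98), ("person", 0.91), ("banner", 0.76), ("car", 0.32)]
meta = summarize_detections(frame)
# The low-confidence "car" detection is filtered out, leaving 3 objects.
```

Metadata of this shape is what a recommendation or content-retrieval system could index; the project's methods aim to go further, adding actions and visual relationships to these tags.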

Keywords:
image & video classification, recommendation system, content retrieval, deep learning, visual relationships, multi-task learning

Duration: 24 months

People involved

Principal investigator

Prof. Alexandre Alahi (EPFL)



Sven Kreiss (EPFL)



Duncan Zauss (EPFL)


Media partner

Léonard Bouchet (RTS)

Academic institution

Visual Intelligence for Transportation – VITA (EPFL)

Media partner

RTS – Radio Télévision Suisse (SRG) – Data and Archives (D+A)


This project started in February 2021 and is ongoing.

Related call for projects

IMI Research Grant (October 2020)