Imagine the sound of waves. This sound may evoke memories of days at the beach. A single sound serves as a bridge to connect multiple instances of a visual scene. It can group scenes that "go together" and set apart the ones that do not. Co-occurring sensory signals can thus be used as a target to learn powerful representations for visual inputs without relying on costly human annotations. In this thesis, I introduce effective self-supervised learning methods that curb the need for human supervision. I discuss several tasks that benefit from audio-visual learning, including representation learning for action and audio recognition, visually-driven sound source localization, and spatial sound generation. I introduce an effective contrastive learning framework that learns audio-visual models by answering multiple-choice audio-visual association questions. I also discuss critical challenges we face when learning from audio supervision, related to noisy audio-visual associations and the lack of spatial grounding of sound signals in common videos.
Published at: PhD Thesis, University of California San Diego, 2021.
@phdthesis{morgado_phdthesis,
author = {Pedro Morgado},
title = {Learning to see and hear without human supervision},
school = {University of California San Diego},
year = 2021
}