Abstract

Imagine the sound of waves. This sound may evoke the memories of days at the beach. A single sound serves as a bridge to connect multiple instances of a visual scene. It can group scenes that ‘go together’ and set apart the ones that do not. Co-occurring sensory signals can thus be used as a target to learn powerful representations for visual inputs without relying on costly human annotations. In this thesis, I introduce effective self-supervised learning methods that curb the need for human supervision. I discuss several tasks that benefit from audio-visual learning, including representation learning for action and audio recognition, visually-driven sound source localization, and spatial sound generation. I introduce an effective contrastive learning framework that learns audio-visual models by answering multiple-choice audio-visual association questions. I also discuss critical challenges we face when learning from audio supervision related to noisy audio-visual associations, and the lack of spatial grounding of sound signals in common videos.

Published at: PhD Thesis, University of California San Diego, 2021.

Learning to see and hear without human supervision

Pedro Morgado
