Last Fridays Talks: Signals and Decoding

Last Fridays Talks 

On the last Friday of each month we host the Last Fridays Talks, where one of our seven Collaboratories presents insights from its current work. Join us for a discussion of results and recent papers, followed by socializing for everyone who wishes to attend.

 

Title

Alternatives to global self-attention for self-supervised audio representation learning

 

Abstract

Transformers, enabled by global self-attention, have become the deep learning architecture of choice for self-supervised representation learning, spanning multiple modalities and domains, such as vision, language, and audio. In this talk, we discuss:

  1. How explicitly modelling local-global attention with the Multi-Window Multi-Head Attention module enabled us to learn better audio representations within a Masked Autoencoder framework, as evaluated on 10 diverse audio tasks and featured in our recent ICLR 2024 paper (see the windowed-attention sketch after this list).
  2. Our recent (under review) work on Structured State Space Models for audio representation learning in a masked spectrogram modelling framework. We propose Self-Supervised Audio Mamba (SSAM), which consistently yielded ~40% better performance across 10 diverse audio tasks than comparable transformer baselines (see the masked-modelling sketch after this list).
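
To make the local-global idea in item 1 concrete, here is a minimal, hypothetical PyTorch sketch of windowed multi-head self-attention, where each head attends over a different neighbourhood size. The function name, the window sizes, and the shared, unprojected q/k/v are illustrative assumptions, not the actual MW-MHA module from the paper.

```python
import torch

def multi_window_attention(x, window_sizes=(2, 8, 32, None)):
    """Minimal sketch of multi-window multi-head self-attention.

    Each head attends within a different-sized local window around
    every time step; a window size of None means global attention.
    Learned q/k/v projections are omitted for brevity, so this
    illustrates the per-head masking idea only.
    """
    B, T, D = x.shape
    H = len(window_sizes)                        # one window size per head
    d = D // H                                   # per-head dimension
    # Split the feature dimension into heads: (B, H, T, d).
    q = k = v = x.view(B, T, H, d).transpose(1, 2)

    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (B, H, T, T)
    idx = torch.arange(T, device=x.device)
    dist = (idx[:, None] - idx[None, :]).abs()   # |i - j| for all pairs
    for h, w in enumerate(window_sizes):
        if w is not None:
            # Head h only sees keys within w steps of each query.
            scores[:, h].masked_fill_(dist > w, float("-inf"))
    attn = scores.softmax(dim=-1)
    return (attn @ v).transpose(1, 2).reshape(B, T, D)
```

With a toy input such as `multi_window_attention(torch.randn(2, 100, 64))`, a single layer mixes local context (small windows) and global context (the unmasked head); stacking such layers inside a Masked Autoencoder encoder is the spirit of the approach.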
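
Item 2 follows the same masked-prediction recipe. Below is a rough, hypothetical sketch of masked spectrogram modelling: spectrogram patches are randomly masked, a sequence model encodes the corrupted sequence, and only the masked patches are reconstructed. The class name, dimensions, and mask ratio are illustrative assumptions; in SSAM the sequence model would be a stack of Mamba (structured state space) blocks, for which a GRU stands in here purely to keep the sketch self-contained.

```python
import torch
import torch.nn as nn

class MaskedSpectrogramModel(nn.Module):
    """Illustrative masked spectrogram modelling setup (not SSAM itself)."""

    def __init__(self, patch_dim=256, d_model=512, mask_ratio=0.75):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        # Placeholder sequence model; SSAM uses Mamba/SSM blocks instead.
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.decoder = nn.Linear(d_model, patch_dim)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        self.mask_ratio = mask_ratio

    def forward(self, patches):                  # (B, T, patch_dim)
        B, T, _ = patches.shape
        tokens = self.embed(patches)
        # Randomly mask a fraction of the patch tokens.
        mask = torch.rand(B, T, device=patches.device) < self.mask_ratio
        tokens = torch.where(mask[..., None], self.mask_token, tokens)
        hidden, _ = self.encoder(tokens)
        recon = self.decoder(hidden)
        # Reconstruction loss is computed only on the masked patches.
        loss = ((recon - patches) ** 2)[mask].mean()
        return loss
```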

 

Bio

Sarthak Yadav is a PhD Fellow at the Department of Electronic Systems, Aalborg University and the Pioneer Centre for Artificial Intelligence, Copenhagen. His research focuses on self-supervised audio representation learning, with an emphasis on approaches beyond Transformers and self-attention for sequence modelling.

 
Previously, Sarthak worked as a Research Intern at the Idiap Research Institute under the supervision of Dr Mathew Magimai Doss, working on the explainability of speech- and bio-signal-based DNNs for emotion recognition. He completed his MSc(R) in Computing Science at the University of Glasgow under the guidance of Prof. Mary Ellen Foster, examining how self-supervised audio representations differ from supervised ones by using deconvolutions to map hidden representations back to the input signal domain. Sarthak also has extensive industry experience: he worked as the Lead Research Engineer at Staqu Technologies for four years, leading the design and development of several large-scale, mission-critical intelligent systems spanning computer vision (such as violence recognition and multispectral geospatial imaging), biometrics (speaker and face), and language understanding (ASR and NMT).

 

The talk will also be streamed at DTU:

Building 321, room 227

Richard Petersens Plads, 2800 Lyngby

 

Join us online on Zoom (Meeting ID: 648 7986 6417).