AuriStream: Representing speech through autoregressive prediction of cochlear tokens

Greta Tuckute, Klemen Kotar, Evelina Fedorenko, Dan Yamins

AuriStream teaser cochleagram animation

We introduce AuriStream, a biologically inspired model that encodes speech via a two-stage framework modeled on the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete cochlear tokens. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations and achieves state-of-the-art performance on lexical semantics, while showing competitive performance on diverse downstream SUPERB speech tasks. Beyond its strong representational capabilities, AuriStream generates audio continuations that can be visualized in time-frequency space and decoded back into audio, providing insight into the model's predictions. In summary, we present a two-stage framework for speech representation learning that advances the development of more human-like models capable of efficiently handling a range of speech-based tasks.
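To make the two-stage design concrete, here is a minimal PyTorch sketch. All module names, layer choices, and hyperparameters below are illustrative assumptions for exposition, not the actual WavCoch/AuriStream implementation: stage 1 is a stand-in filterbank plus vector quantization, and stage 2 is a generic causal Transformer trained with next-token prediction.

```python
import torch
import torch.nn as nn

class CochlearTokenizer(nn.Module):
    """Stage 1 (illustrative stand-in for WavCoch): map raw audio to a
    cochleagram-like time-frequency representation, then quantize each
    frame to a discrete cochlear token id."""
    def __init__(self, n_filters=64, codebook_size=1024):
        super().__init__()
        # Stand-in for a cochlear filterbank: a strided 1-D convolution.
        self.filterbank = nn.Conv1d(1, n_filters, kernel_size=400, stride=160)
        self.codebook = nn.Embedding(codebook_size, n_filters)

    def forward(self, wav):                       # wav: (batch, samples)
        coch = self.filterbank(wav.unsqueeze(1))  # (batch, filters, frames)
        coch = coch.transpose(1, 2)               # (batch, frames, filters)
        # Nearest-codebook quantization (VQ-style assumption).
        dists = torch.cdist(coch, self.codebook.weight.unsqueeze(0))
        return dists.argmin(dim=-1)               # (batch, frames) token ids

class AuriStreamLM(nn.Module):
    """Stage 2: causal Transformer over cochlear tokens, trained with
    next-token prediction (GPT-style assumption)."""
    def __init__(self, vocab=1024, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                    # tokens: (batch, frames)
        x = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.blocks(x, mask=mask, is_causal=True)
        return self.head(x)                       # next-token logits

wav = torch.randn(2, 16000)                       # two 1 s clips at 16 kHz
tokens = CochlearTokenizer()(wav)
logits = AuriStreamLM()(tokens)
loss = nn.functional.cross_entropy(               # next-token objective
    logits[:, :-1].reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
```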

AuriStream sample generations

We present examples of AuriStream (1B) audio generations. For each example, we prompt AuriStream with the first 3 seconds of an audio clip from the LibriSpeech test set (unseen during training), and it predicts the subsequent 3 seconds. For comparison, we provide the ground-truth (GT) continuations below, along with AuriStream's predictions across several random seeds. The predicted cochleagrams were converted into audio with a vocoder, and the speech continuations are visualized in cochleagram space. We emphasize that the examples are not cherry-picked; they are simply the first generations obtained under a consecutive series of random seeds.
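This prompt-and-continue procedure amounts to standard autoregressive sampling over token frames, as in the hedged sketch below. It reuses the stand-in modules from the previous snippet; the frame rate, sampling temperature, and vocoder interface are assumptions, not the authors' exact setup.

```python
import torch

@torch.no_grad()
def continue_audio(lm, tokenizer, wav_prompt, n_new_frames, temperature=1.0):
    """Extend a cochlear-token sequence derived from an audio prompt."""
    tokens = tokenizer(wav_prompt)                # (1, prompt_frames)
    for _ in range(n_new_frames):
        logits = lm(tokens)[:, -1] / temperature  # logits for the next frame
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, 1)    # stochastic: varies by seed
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens

# Assuming ~100 token frames per second, a 3 s continuation is ~300 frames.
# The new tokens would then be mapped back to a cochleagram and passed
# through a vocoder (not shown) to recover a waveform.
torch.manual_seed(0)                              # one "random seed" sample
prompt = torch.randn(1, 3 * 16000)                # stand-in for a 3 s prompt
generated = continue_audio(AuriStreamLM(), CochlearTokenizer(),
                           prompt, n_new_frames=300)
```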

LibriSpeech – 121-123859-0004

LibriSpeech – 908-31957-0019

LibriSpeech – 672-122797-0001

LibriSpeech – 1089-134686-0006

LibriSpeech – 2300-131720-0015

LibriSpeech – 4970-29095-0033

Common Failure Modes

As the examples above illustrate, AuriStream learns to generate fluent continuations of sentences (despite its minimal assumptions!). However, it also exhibits a few (interesting) failure modes. One is the generation of plausible-sounding nonwords in the middle of sentences. Another (less common) failure mode is the generation of a completely nonsensical sound or slurred word, which causes the entire sentence to degenerate.

Architecture

Schematic of the WavCoch tokenizer (panel A) and the AuriStream model (panel B).

AuriStream learns phoneme identity, word identity, and lexical semantics


AuriStream serves as a strong backbone for downstream audio tasks
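One common recipe for using a pretrained model as a backbone, and the one sketched below, is to attach a lightweight head to frozen hidden states, as SUPERB-style evaluations typically do. The feature-extraction interface and the probe setup here are assumptions built on the stand-in modules from the earlier snippets, not the authors' evaluation code.

```python
import torch
import torch.nn as nn

def extract_features(lm, tokens):
    """Frozen hidden states from the stand-in AuriStreamLM defined above."""
    with torch.no_grad():
        x = lm.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return lm.blocks(x, mask=mask, is_causal=True)   # (B, T, d_model)

class LinearProbe(nn.Module):
    """Lightweight head trained on frozen features, e.g. for frame-level
    phoneme classification (hypothetical downstream task setup)."""
    def __init__(self, d_model=256, n_classes=40):
        super().__init__()
        self.proj = nn.Linear(d_model, n_classes)

    def forward(self, feats):                            # (B, T, d_model)
        return self.proj(feats)                          # per-frame logits
```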