From Sound to Sight: Using Vision Transformers for Audio Classification


Audio is an entire field of machine learning in its own right, with a wide range of applications including speech recognition, music categorization, and sound event detection.
Traditionally, audio classification has been approached with methods such as spectrogram analysis and hidden Markov models, which have proven effective but come with their own limitations.
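
To make the spectrogram idea concrete, here is a minimal sketch of turning a raw waveform into a log-mel spectrogram with torchaudio, the 2D representation that lets image models operate on audio. The file path and the STFT settings (n_fft, hop_length, n_mels) are illustrative choices, not values prescribed by this article.

```python
import torchaudio

# A waveform is 1D; image models need a 2D input. The standard bridge
# is a (log-)mel spectrogram. "clip.wav" is a placeholder path.
waveform, sample_rate = torchaudio.load("clip.wav")

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,      # window size of the underlying STFT
    hop_length=512,  # stride between successive frames
    n_mels=128,      # number of mel frequency bins
)
to_db = torchaudio.transforms.AmplitudeToDB()

spec = to_db(mel(waveform))  # shape: (channels, n_mels, time_frames)
print(spec.shape)
```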

Recently, transformers have emerged as a promising alternative for audio tasks; OpenAI's Whisper is a notable example.

In this article, we'll take the Vision Transformer implementation from the vit-pytorch library and train it on the GTZAN music genre classification dataset. A sketch of the model setup follows.
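
Here is roughly what that looks like in code: a ViT from the vit-pytorch package configured for single-channel spectrogram "images" and GTZAN's 10 genres. The patch size and model dimensions below are assumed hyperparameters chosen for illustration, not the exact settings used in this article.

```python
import torch
from vit_pytorch import ViT

# A ViT sized for 128x128 single-channel spectrograms and 10 genres.
model = ViT(
    image_size=128,   # spectrograms cropped/padded to 128x128
    patch_size=16,    # yields an 8x8 grid of patches
    num_classes=10,   # one logit per GTZAN genre
    dim=512,
    depth=6,
    heads=8,
    mlp_dim=1024,
    channels=1,       # spectrograms have a single channel
    dropout=0.1,
    emb_dropout=0.1,
)

batch = torch.randn(4, 1, 128, 128)  # (batch, channels, mels, frames)
logits = model(batch)                # shape: (4, 10)
```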

What we will be working on

The GTZAN dataset is the most widely used public dataset for evaluation in machine listening research for music genre recognition (MGR). The files were collected in 2000–2001 from a variety of sources, including personal CDs, radio, and microphone recordings, in order to represent a variety of recording conditions.
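
Assuming the usual folder layout of the Kaggle release (one subdirectory per genre containing .wav clips, sampled at 22050 Hz), a Dataset along these lines can serve fixed-size spectrograms to the model. The root path, the crop length, and the preprocessing parameters are assumptions for illustration.

```python
from pathlib import Path

import torch
import torch.nn.functional as F
import torchaudio
from torch.utils.data import Dataset

GENRES = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]

class GTZANSpectrograms(Dataset):
    """Serves fixed-size log-mel spectrograms from a GTZAN-style tree."""

    def __init__(self, root, n_mels=128, num_frames=128):
        # Assumes root/<genre>/<clip>.wav, the layout of the Kaggle release.
        self.files = sorted(Path(root).glob("*/*.wav"))
        self.num_frames = num_frames
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=22050, n_fft=1024, hop_length=512, n_mels=n_mels
        )
        self.to_db = torchaudio.transforms.AmplitudeToDB()

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        path = self.files[idx]
        waveform, _ = torchaudio.load(path)
        spec = self.to_db(self.mel(waveform))  # (1, n_mels, frames)
        # Crop or zero-pad the time axis so every item has the same shape.
        if spec.size(-1) < self.num_frames:
            spec = F.pad(spec, (0, self.num_frames - spec.size(-1)))
        else:
            spec = spec[:, :, : self.num_frames]
        label = GENRES.index(path.parent.name)
        return spec, label
```

Because every item comes out the same shape, a standard DataLoader can batch clips directly into the ViT sketched above.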
