From Sound to Sight: Using Vision Transformers for Audio Classification
Audio is a whole field of its own within machine learning, with a wide range of applications including speech recognition, music categorization, and sound event detection.
Traditionally, audio classification has been approached with methods such as spectrogram analysis and hidden Markov models, which have proven effective but come with limitations.
Recently, transformers have emerged as a promising alternative for audio tasks; Whisper by OpenAI is a very good example.
In this article, we’ll leverage the Vision Transformer implementation from “ViT — Vision Transformer, a PyTorch implementation” and train it on an audio classification dataset, “GTZAN Dataset — Music Genre Classification”.
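If the implementation referenced here is the widely used vit-pytorch package (an assumption on my part), setting up a ViT that classifies single-channel spectrogram “images” into the 10 GTZAN genres might look like the sketch below. The image size, patch size, and model dimensions are illustrative choices, not values prescribed by this article.

```python
import torch
from vit_pytorch import ViT  # assumes: pip install vit-pytorch

# Treat a 224x224 single-channel mel spectrogram as an "image"
# and classify it into the 10 GTZAN genres.
model = ViT(
    image_size=224,   # spectrograms resized/cropped to 224x224 (illustrative)
    patch_size=16,
    num_classes=10,   # GTZAN has 10 genres
    dim=512,
    depth=6,
    heads=8,
    mlp_dim=1024,
    channels=1,       # one channel: the spectrogram magnitude
    dropout=0.1,
    emb_dropout=0.1,
)

dummy_batch = torch.randn(4, 1, 224, 224)  # (batch, channels, freq_bins, time_frames)
logits = model(dummy_batch)                # shape: (4, 10)
```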
What we will be working on
The GTZAN dataset is the most widely used public dataset for evaluation in machine listening research on music genre recognition (MGR). The files were collected in 2000–2001 from a variety of sources, including personal CDs, radio, and microphone recordings, in order to capture a range of recording conditions.
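To make the “sound to sight” idea concrete, each GTZAN clip can be turned into a log-mel spectrogram, i.e. a 2D array of frequency bins over time frames, before being fed to the ViT. The sketch below uses torchaudio; the file path and the spectrogram parameters (n_fft, hop_length, n_mels) are illustrative assumptions, not values fixed by the dataset.

```python
import torchaudio

# Hypothetical path to one 30-second GTZAN clip; adjust to your local copy.
waveform, sample_rate = torchaudio.load("genres/jazz/jazz.00000.wav")

# Convert the raw waveform into a log-mel spectrogram:
# a (mel bins x time frames) "image" a Vision Transformer can consume.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,
    hop_length=512,
    n_mels=128,
)(waveform)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel)

print(log_mel.shape)  # roughly torch.Size([1, 128, 1290]) for a 30 s clip at 22050 Hz
```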