What libraries can I use for Audio Classification?

The speechbrain and transformers libraries are compatible with Audio Classification.

What models can I use for Audio Classification?

The speechbrain/google_speech_command_xvector, ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition, and facebook/mms-lid-126 models can be used for Audio Classification.

What datasets can I use for Audio Classification?

The superb dataset can be used for Audio Classification.

What metrics can I use for Audio Classification?

The accuracy, recall, precision, and f1 metrics can be used for Audio Classification.

Audio Classification

Audio classification is the task of assigning a label or class to a given audio. It can be used for recognizing which command a user is giving or the emotion of a statement, as well as identifying a speaker.

[Figure: an audio input is passed to an Audio Classification Model, which outputs scored labels, for example "Down" with a score of 0.800.]

About Audio Classification

Use Cases

Command Recognition

Command recognition or keyword spotting classifies utterances into a predefined set of commands. This is often done on-device for fast response time.

As an example, using the Google Speech Commands dataset, a model can take an audio input and classify which of the following commands the user is saying:

'yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go', 'unknown', 'silence'

SpeechBrain models can perform this task with just a few lines of code:

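from speechbrain.pretrained import EncoderClassifier
# Download the pretrained command recognition model.
model = EncoderClassifier.from_hparams(
  source="speechbrain/google_speech_command_xvector"
)
# classify_file returns the class log posteriors, the best score,
# the index of the best class, and its text label.
out_prob, score, index, text_lab = model.classify_file("file.wav")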
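The last step of command recognition, turning one score per command into a single predicted label, can be sketched with the standard library alone. This is a minimal illustration, not SpeechBrain's implementation: the `COMMANDS` list is copied from the list above, while the `example_logits` values and the helper names `softmax` and `predict_command` are hypothetical.

```python
import math

# The 12 classes listed above for Google Speech Commands.
COMMANDS = ["yes", "no", "up", "down", "left", "right",
            "on", "off", "stop", "go", "unknown", "silence"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_command(logits, labels=COMMANDS):
    """Return the most likely label and its probability."""
    if len(logits) != len(labels):
        raise ValueError("expected one score per label")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up scores in which "down" (index 3) is the strongest class.
example_logits = [0.1, 0.0, 1.2, 3.5, 0.3, 0.2, 0.0, 0.1, 0.4, 0.2, 0.0, 0.0]
label, prob = predict_command(example_logits)
print(label, round(prob, 3))
```

On-device keyword spotters often add a confidence threshold on top of this step, so that low-probability predictions fall back to 'unknown' instead of triggering a command.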

Language Identification

Datasets such as VoxLingua107 allow anyone to train language identification models for up to 107 languages! This can be extremely useful as a preprocessing step for other systems. Here's an example model trained on VoxLingua107.

Emotion Recognition

Emotion recognition classifies the emotion expressed in a spoken utterance, such as neutral, happy, angry, or sad. In addition to trying the widgets, you can use the Inference API to perform audio classification. Here is a simple example that uses a HuBERT model fine-tuned for this task.

import requests

# Replace the placeholder with your own access token.
API_TOKEN = "YOUR_API_TOKEN"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/superb/hubert-large-superb-er"

def query(filename):
    # Send the raw audio bytes and decode the JSON response.
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.post(API_URL, headers=headers, data=data)
    return response.json()

data = query("sample1.flac")
# [{'label': 'neu', 'score': 0.60},
# {'label': 'hap', 'score': 0.20},
# {'label': 'ang', 'score': 0.13},
# {'label': 'sad', 'score': 0.07}]
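The response is a list of label and score pairs, so picking the predicted emotion means taking the entry with the highest score. A minimal sketch, where `top_prediction` is a hypothetical helper and the sample data is copied from the output above. Treating any response that is not a non-empty list as a failure is an assumption about how you want to handle errors, since the hosted API may return a JSON object rather than a list when a request fails:

```python
def top_prediction(predictions):
    """Return (label, score) for the highest-scoring prediction."""
    # Assumption: anything other than a non-empty list is treated as a failure.
    if not isinstance(predictions, list) or not predictions:
        raise ValueError(f"unexpected response: {predictions!r}")
    best = max(predictions, key=lambda item: item["score"])
    return best["label"], best["score"]


# Sample response copied from the example output above.
sample = [
    {"label": "neu", "score": 0.60},
    {"label": "hap", "score": 0.20},
    {"label": "ang", "score": 0.13},
    {"label": "sad", "score": 0.07},
]

label, score = top_prediction(sample)
print(label, score)  # neu 0.6
```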

You can use huggingface.js to run inference with audio classification models on the Hugging Face Hub.

import { HfInference } from "@huggingface/inference";

// HF_ACCESS_TOKEN is assumed to hold your access token.
const inference = new HfInference(HF_ACCESS_TOKEN);
const result = await inference.audioClassification({
  data: await (await fetch("sample.flac")).blob(),
  model: "facebook/mms-lid-126",
});

Speaker Identification

Speaker identification classifies which person is speaking in a given audio clip. The set of speakers is usually predefined. You can try out this task with this model. A useful dataset for this task is VoxCeleb1.

Solving audio classification for your own data

We have some great news! You can fine-tune a pretrained model (transfer learning) to get a well-performing classifier with far less labeled data than training from scratch would require. Pretrained models such as Wav2Vec2 and HuBERT are available for this. Facebook's Wav2Vec2 XLS-R model, for example, is a large multilingual model trained on 436K hours of speech in 128 languages.
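As a rough sketch of the first step of fine-tuning with the transformers library, you map your own class names to integer ids and load a pretrained checkpoint with a new classification head of that size. The helper names `build_label_maps` and `load_for_finetuning`, the emotion labels, and the `facebook/wav2vec2-base` checkpoint in the commented-out call are illustrative assumptions; the training loop itself is omitted.

```python
def build_label_maps(labels):
    """Map each class name to an integer id and back."""
    label2id = {name: i for i, name in enumerate(labels)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


def load_for_finetuning(checkpoint, labels):
    """Load a pretrained encoder with a new classification head.

    Requires the transformers library and downloads weights on first use.
    """
    # Imported lazily so the label helpers work without transformers installed.
    from transformers import AutoModelForAudioClassification

    label2id, id2label = build_label_maps(labels)
    return AutoModelForAudioClassification.from_pretrained(
        checkpoint,
        num_labels=len(labels),
        label2id=label2id,
        id2label=id2label,
    )


# Hypothetical label set, for illustration only.
label2id, id2label = build_label_maps(["neu", "hap", "ang", "sad"])
print(label2id)

# Uncomment to download the checkpoint and build the model:
# model = load_for_finetuning("facebook/wav2vec2-base", ["neu", "hap", "ang", "sad"])
```

Passing the label maps at load time means the saved model reports readable class names instead of generic ids.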

Useful Resources

Would you like to learn more about the topic? Awesome! Here you can find some curated resources that you may find helpful!

Compatible libraries

speechbrain, Transformers

Models for Audio Classification

speechbrain/google_speech_command_xvector: An easy-to-use model for Command Recognition.

ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition: An Emotion Recognition model.

facebook/mms-lid-126: A language identification model.

Datasets for Audio Classification

superb: A benchmark of 10 different audio tasks.

Spaces using Audio Classification

akhaliq/Speechbrain-audio-classification: An application that can predict the language spoken in a given audio.

Metrics for Audio Classification

accuracy: Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with: Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.

recall: Recall is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation: Recall = TP / (TP + FN), where TP is the number of true positives and FN is the number of false negatives.

precision: Precision is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation: Precision = TP / (TP + FP), where TP is the number of true positives (i.e. the examples correctly labeled as positive) and FP is the number of false positives (i.e. the examples incorrectly labeled as positive).

f1: The F1 score is the harmonic mean of the precision and recall. It can be computed with the equation: F1 = 2 * (precision * recall) / (precision + recall)
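The four formulas above can be checked with a small worked example. This is a minimal sketch in plain Python that treats one class as "positive" and everything else as negative; the function name `binary_metrics`, the choice to return 0.0 when a denominator is zero, and the sample labels are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Assumption: report 0.0 rather than failing when a denominator is zero.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Made-up labels: TP = 2, TN = 1, FP = 1, FN = 1 for the class "down".
y_true = ["down", "down", "up", "up", "down"]
y_pred = ["down", "up", "up", "down", "down"]
metrics = binary_metrics(y_true, y_pred, positive="down")
print(metrics)
```

For multi-class tasks such as command recognition, precision, recall, and F1 are usually computed per class in this way and then averaged.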
