Python - audio classification of equal-length samples / 'vocoder' thingy


Anybody able to supply links, advice, or other forms of help to the following?

Objective - use Python to classify 10-second audio samples so that afterwards I can speak into a microphone and have Python pick out and play snippets (faded together) of the closest matches from the db.

My objective is not to have the closest match and I don't care what the source of the audio samples is. So the result is probably of no use other than speaking in noise (fun).

I would like the Python app to be able to find a specific match (of the FFT, for example) within the 10-second samples in the db. I guess the real-time sampling of the microphone will use a 100-millisecond buffer.

Any ideas? FFT? What db? Other?


There are 3 best solutions below

1
On

In order to do this, you need three things:

  1. Segmentation (decide how to make your audio samples)
  2. Feature Extraction (decide what audio feature (e.g. FFT) you care about)
  3. Distance Metric (decide what the "closest" sample is)

Segmentation: you currently describe using 10-second samples. I think you might get better results with shorter segments (closer to 100-1000 ms), so that the matches follow the changes in the voice more closely.
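To make the segmentation step concrete, here is a minimal sketch that cuts a mono signal into fixed-length segments. The function name `split_into_segments`, the 250 ms default, and the optional overlap (`hop_ms`) are my own assumptions for illustration; the signal is assumed to be a plain list of floats that has already been loaded.

```python
def split_into_segments(samples, sample_rate, segment_ms=250, hop_ms=None):
    """Cut a mono signal into fixed-length segments.

    samples: sequence of floats; sample_rate: samples per second.
    hop_ms defaults to segment_ms (no overlap); a smaller hop overlaps segments.
    Trailing samples that do not fill a whole segment are dropped.
    """
    seg_len = int(sample_rate * segment_ms / 1000)
    hop = int(sample_rate * (hop_ms if hop_ms is not None else segment_ms) / 1000)
    if seg_len <= 0 or hop <= 0:
        raise ValueError("segment and hop must be at least one sample long")
    return [
        samples[start:start + seg_len]
        for start in range(0, len(samples) - seg_len + 1, hop)
    ]


# Example: a silent 10-second clip at 8 kHz cut into 250 ms segments.
clip = [0.0] * (8000 * 10)
segments = split_into_segments(clip, sample_rate=8000, segment_ms=250)
print(len(segments), len(segments[0]))
```

The same function could be applied to the microphone buffer, so that database segments and live input are compared at the same length.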

Feature Extraction: you mention using FFT. The zero crossing rate is surprisingly OK considering how simple it is. If you want to get fancier, using MFCCs or the spectral centroid is probably the way to go.
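As a rough illustration of two of these features, the sketch below computes the zero crossing rate and the spectral centroid of one frame using only the standard library. The function names are my own, and the naive O(N^2) DFT is there only to keep the example dependency-free; in practice an FFT routine such as `numpy.fft.rfft` would be the usual choice. MFCCs are not covered here.

```python
import cmath
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (needs >= 2 samples)."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2, computed naively in O(N^2)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency in Hz (0.0 for a silent frame)."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0:
        return 0.0
    bin_hz = sample_rate / len(frame)
    return sum(k * bin_hz * m for k, m in enumerate(mags)) / total


# Example: a 1 kHz sine sampled at 8 kHz; the centroid should land near 1000 Hz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate + 0.1) for t in range(64)]
print(zero_crossing_rate(frame))
print(spectral_centroid(frame, sample_rate))
```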

Distance Metric: most people use the Euclidean distance, but there are alternatives such as the Manhattan distance, cosine distance, and earth mover's distance.
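For reference, three of these metrics are short enough to write out directly. This is a minimal sketch over plain Python sequences of equal length; the function names are mine, and the earth mover's distance is left out because it needs more machinery.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; ignores vector length, compares direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


print(euclidean([0, 0], [3, 4]))
print(manhattan([0, 0], [3, 4]))
print(cosine_distance([1, 0], [0, 1]))
```

Cosine distance can be a reasonable pick for spectra, since a quiet and a loud version of the same sound point in the same direction.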

For a database, if you have a small enough set of samples, you might try just loading everything into a k-d tree so that you can do fast nearest-neighbour lookups, and just hold it in memory.
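To show what the in-memory lookup could look like, here is a small pure-Python k-d tree with a nearest-neighbour search. It is a sketch rather than a tuned implementation: the dictionary-based node layout, the function names, and the file names in the example are all hypothetical, and the feature vectors are assumed to be scaled to comparable ranges already. A ready-made alternative is `scipy.spatial.KDTree`.

```python
import math


def build_kdtree(points, depth=0):
    """Recursively build a k-d tree from (vector, label) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return (distance, (vector, label)) of the stored point closest to query."""
    if node is None:
        return best
    vector, _ = node["point"]
    dist = math.dist(vector, query)
    if best is None or dist < best[0]:
        best = (dist, node["point"])
    diff = query[node["axis"]] - vector[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Only search the far side if the splitting plane is closer than the best match.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best


# Hypothetical database: scaled 2-D feature vectors labelled with sample names.
database = [
    ((0.1, 0.2), "a.wav"),
    ((0.8, 0.3), "b.wav"),
    ((0.4, 0.9), "c.wav"),
    ((0.6, 0.6), "d.wav"),
    ((0.2, 0.7), "e.wav"),
]
tree = build_kdtree(database)
distance, (vector, label) = nearest(tree, (0.55, 0.65))
print(label, distance)
```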

Good luck! It sounds like a fun project.

2
On

Try searching for algorithms on "music fingerprinting".

0
On

You could try some typical short-term feature extraction (e.g. energy, zero crossing rate, MFCCs, spectral features, chroma, etc.) and then model each segment as a vector of feature statistics. Then you could use a simple distance-based classifier (e.g. kNN) to retrieve the "closest" training samples from a manually labelled set, given an unknown "query".
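A minimal sketch of that pipeline, using only the standard library: each segment is split into short frames, two features (energy and zero crossing rate) are computed per frame, and the segment is summarised by the mean and standard deviation of each feature. A brute-force kNN then votes on the label. All names here (`short_term_features`, `segment_vector`, `knn_classify`, `tone`) and the synthetic "low"/"high" tones are my own assumptions for illustration; this does not use pyAudioAnalysis or reproduce its API.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def short_term_features(frame):
    """Two simple per-frame features: energy and zero crossing rate."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return (energy, crossings / (len(frame) - 1))


def segment_vector(segment, frame_len):
    """Mean and standard deviation of each short-term feature over the segment."""
    frames = [
        segment[i:i + frame_len]
        for i in range(0, len(segment) - frame_len + 1, frame_len)
    ]
    per_frame = [short_term_features(f) for f in frames]
    stats = []
    for values in zip(*per_frame):
        stats += [mean(values), pstdev(values)]
    return tuple(stats)


def knn_classify(query, labelled, k=3):
    """Majority label among the k labelled vectors closest to query."""
    neighbours = sorted(labelled, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


def tone(freq, sample_rate=8000, seconds=0.25):
    """Synthetic sine segment standing in for a recorded sample."""
    n = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq * t / sample_rate + 0.1) for t in range(n)]


# Manually labelled "training" set built from synthetic tones.
training = [
    (segment_vector(tone(f), 200), label)
    for f, label in [(150, "low"), (200, "low"), (250, "low"),
                     (1800, "high"), (2000, "high"), (2200, "high")]
]
print(knn_classify(segment_vector(tone(1900), 200), training))
```

For retrieval rather than classification, the sorted neighbour list inside `knn_classify` is already the set of "closest" samples, so it could be returned directly instead of the majority label.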

Check out my library, pyAudioAnalysis, which covers several Python audio analysis functionalities.