Speaker Diarization
Speaker diarization is the task of answering the question "who spoke when".
The pyAudioAnalysis implementation is a variant based on [1].
There are four main algorithmic steps: feature extraction, FLsD (optional), clustering, and smoothing.
- Feature extraction (short-term and mid-term) step: for each mid-term segment, the averages and standard deviations of the MFCCs are used, along with an estimate of the probability that the segment belongs to a male or a female speaker (model knnSpeakerFemaleMale).
- (optional) FLsD step: the mid-term feature statistic vectors in the original feature space are projected onto the FLsD subspace.
- Clustering: a k-means clustering method is applied in either the original feature space or the FLsD subspace. If the number of speakers is unknown, the clustering process is repeated for a range of speaker counts and the Silhouette width criterion is used to find the optimal number of speakers.
- Smoothing: a combination of a median filtering step on the extracted cluster IDs and a Viterbi smoothing step.
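Two of the steps above, choosing the number of speakers by Silhouette width and median-filtering the cluster IDs, can be illustrated with a minimal sketch. This is not the pyAudioAnalysis code: the function names silhouette_width and median_smooth, the window parameter, and the toy data are assumptions made for illustration. The points are 1-D for brevity, the candidate labellings are hand-written stand-ins for k-means output, and the Viterbi smoothing step is omitted.

```python
from statistics import mean, median_low


def silhouette_width(points, labels):
    """Mean silhouette width of a labelling of 1-D points.

    For each point, a is the mean distance to its own cluster and b is the
    mean distance to the nearest other cluster; its score is
    (b - a) / max(a, b).
    """
    scores = []
    for i, (p, own_label) in enumerate(zip(points, labels)):
        dists = {}  # cluster label -> distances from point i
        for j, (q, label) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(label, []).append(abs(p - q))
        own = dists.pop(own_label, None)
        if not own or not dists:
            # Singleton cluster, or only one cluster overall: score 0
            scores.append(0.0)
            continue
        a = mean(own)
        b = min(mean(d) for d in dists.values())
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


def median_smooth(labels, window=3):
    """Replace each cluster ID by the median of its neighbourhood.

    `window` (hypothetical parameter) is the odd number of mid-term
    segments around each position; the edges use a shorter window.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        neighbourhood = labels[max(0, i - half): i + half + 1]
        # median_low always returns one of the observed IDs
        smoothed.append(median_low(neighbourhood))
    return smoothed


# Toy 1-D "features" with three obvious groups; the candidate labellings
# stand in for k-means output at k = 2 and k = 3.
points = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]
candidates = {2: [0, 0, 1, 1, 1, 1], 3: [0, 0, 1, 1, 2, 2]}
best_k = max(candidates, key=lambda k: silhouette_width(points, candidates[k]))
print("selected number of speakers:", best_k)

# Isolated label flips are removed by the median filter.
raw_ids = [0, 0, 1, 0, 0, 2, 2, 2, 0, 2, 2]
print(median_smooth(raw_ids, window=3))
```

In this toy case the three-cluster labelling has the higher mean silhouette, so it is selected, and the two isolated label flips in raw_ids are smoothed away.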
The function speakerDiarization() in audioSegmentation.py extracts a sequence of audio segments and their respective cluster labels, given an audio file.
Command-line example:
python audioAnalysis.py speakerDiarization -i data/diarizationExample.wav --num 4
This command takes three arguments:
- -i <fileName>: the filename of the input audio recording.
- --num <numOfSpeakers (0 for unknown)>: the number of speakers in the audio recording.
- --flsd: flag to enable the FLsD method.
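As an illustration of how a sub-command with these three arguments could be declared, here is a minimal argparse sketch. It is not the actual parser in audioAnalysis.py: the build_parser helper, the long option name --input, the dest name "task", and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the three documented arguments."""
    parser = argparse.ArgumentParser(prog="audioAnalysis.py")
    tasks = parser.add_subparsers(dest="task", required=True)
    diar = tasks.add_parser("speakerDiarization")
    # "--input" is an assumed long name; only "-i" appears in the text above
    diar.add_argument("-i", "--input", required=True,
                      help="input audio recording")
    diar.add_argument("--num", type=int, required=True,
                      help="number of speakers (0 for unknown)")
    diar.add_argument("--flsd", action="store_true",
                      help="enable the FLsD method")
    return parser


args = build_parser().parse_args(
    ["speakerDiarization", "-i", "data/diarizationExample.wav", "--num", "4"]
)
print(args.task, args.input, args.num, args.flsd)
```

Because --flsd is declared as a store_true flag, it defaults to False when omitted, which matches its description as an optional switch.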
The function evaluateSpeakerDiarization() is used in speakerDiarization() to compare the extracted sequence of speaker labels with the ground-truth labels. It also computes cluster purity and speaker purity.
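Cluster purity and speaker purity can be sketched as follows. This uses one common definition, assumed here rather than taken from the library source: cluster purity is the fraction of windows whose speaker is the dominant speaker of their cluster, and speaker purity is the fraction of windows whose cluster is the dominant cluster of their speaker. The function name purities and the toy label sequences are hypothetical.

```python
from collections import Counter


def purities(cluster_ids, speaker_ids):
    """Return (cluster_purity, speaker_purity) for two aligned label sequences."""
    if len(cluster_ids) != len(speaker_ids):
        raise ValueError("label sequences must have the same length")
    if not cluster_ids:
        raise ValueError("label sequences must not be empty")
    # Contingency counts: how often cluster c co-occurs with speaker s
    counts = Counter(zip(cluster_ids, speaker_ids))
    best_in_cluster = {}  # cluster -> size of its dominant speaker
    best_in_speaker = {}  # speaker -> size of its dominant cluster
    for (c, s), n in counts.items():
        best_in_cluster[c] = max(best_in_cluster.get(c, 0), n)
        best_in_speaker[s] = max(best_in_speaker.get(s, 0), n)
    total = len(cluster_ids)
    return (sum(best_in_cluster.values()) / total,
            sum(best_in_speaker.values()) / total)


# Toy example: 8 windows, 3 extracted clusters, 2 true speakers
clusters = [0, 0, 0, 1, 1, 1, 1, 2]
speakers = ["A", "A", "B", "B", "B", "B", "A", "A"]
print(purities(clusters, speakers))  # (0.75, 0.625)
```

Here speaker A is split across three clusters, which lowers speaker purity more than cluster purity.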
The ground-truth file is the file with the .segments extension.
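To compare against per-window cluster labels, the ground-truth segments need to be converted into a label sequence on the same time grid. The sketch below assumes that each line of a .segments file has the comma-separated layout start,end,label (times in seconds); that layout, the function names, and the sample text are assumptions, not confirmed by the text above.

```python
import csv
import io


def parse_segments(text):
    """Parse 'start,end,label' rows (assumed layout) into tuples."""
    rows = csv.reader(io.StringIO(text))
    return [(float(r[0]), float(r[1]), r[2].strip())
            for r in rows if len(r) >= 3]


def segments_to_flags(segments, step):
    """Label each window of length `step` by the segment containing its centre."""
    if not segments:
        return []
    end_time = max(end for _, end, _ in segments)
    flags = []
    for k in range(int(end_time / step)):
        centre = (k + 0.5) * step
        # None marks a window whose centre is not covered by any segment
        label = next((lab for start, end, lab in segments
                      if start <= centre < end), None)
        flags.append(label)
    return flags


sample = "0.0,2.0,speakerA\n2.0,5.0,speakerB\n"
segments = parse_segments(sample)
print(segments_to_flags(segments, step=1.0))
```

The resulting sequence has one label per window and can be compared element by element with the extracted cluster IDs.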
The function speakerDiarizationEvaluateScript() extracts the overall performance measures for a set of audio recordings and their respective .segments files, stored in a directory.
Figure 3: result of function speakerDiarization() on data/diarizationExample.wav
Figure 3 is the result of running:
python audioAnalysis.py speakerDiarization -i data/diarizationExample.wav --num 4
The red solid line is the ground-truth label and the blue solid line is the speaker diarization classification result. Cluster purity is 86.2% and speaker purity is 86.2%.