Visualization
There are two theories of a human hearing - place (frequency-based) and temporal.
In speech recognition, there are two tendencies: input spectrogram (frequency) and Mel-Frequency Cepstral Coefficients.
1.1 Wave and spectrogram
The file can be read as following
train_audiopath = '../input/train/audio/' filename = '/yes/0a7c2a8d_nohash_0.wav' sample_rate, samples = wavfile.read(str(train_audio_path) + filename)
A function which calculates spectrum can be defined as
def log_specgram(audio, sample_rate, window_size=20,step_size=10, eps=1e-10):
nperseg = int(round(window_size * sample_rate / 1e3))
noverlap = int(round(step_size * sample_rate / 1e3))
freqs, times, spec = signal.spectrogram(audio,fs=sample_rate,window='hann',
nperseg=nperseg,noverlap=noverlap,detrend=False)
return freqs, times, np.log( spec.T.astype(np.float32) + eps)
Note that the spectrum values are in log domain.
Fig.1 Raw wave and spectrogram of "yes".
Fig.1 shows the raw wave and spectrogram of "yes", respectively.
The frequencies are in range (0,8000) according to Nyquist theorem.
The codes for plotting these two figures are
freqs,times,spectrogram=log_specgram(samples,sample_rate)fig=plt.figure(figsize=(14,8))
ax1=fig.add_subplot(211)
ax1.set_title('Raw wave of '+filename)
ax1.set_ylabel('Amplitude')
ax1.plot(np.linspace(0,sample_rate/len(samples),sample_rate),samples)
ax2=fig.add_subplot(212)
ax2.imshow(spectrogram.T,aspect='auto',origin='lower',extent=[times.min(),times.max(),freqs.min(),freqs.max()])
ax2.set_yticks(freqs[::16])
ax2.set_xticks(times[::16])
ax2.set_title('Spectrogram of '+filename)
ax2.set_ylabel('Freqs in Hz')
ax2.set_xlabel('Seconds')
The spectrogram should be normalized to serve as an input features for NN
The normalization process can be realized as following example codemean = np.mean(spectrogram, axis=0) std = np.std(spectrogram, axis=0) spectrogram = (spectrogram - mean) / std
1.2 MFCC
The tutorial for MFCC can be found in
Fig.2 Mel power spectrogram.
The Mel power spectrogram can be calculated using librosa python packages, and shown as Fig.2. The codes are as follows
# From this tutorial
# https://github.com/librosa/librosa/blob/master/examples/LibROSA%20demo.ipynb
S=librosa.feature.melspectrogram(samples,sr=sample_rate,n_mels=128)
# Convert to log scale (dB). We'll use the peak power (max) as reference.
log_S=librosa.power_to_db(S,ref=np.max)plt.figure(figsize=(12,4))librosa.display.specshow(log_S,sr=sample_rate,x_axis='time',y_axis='mel')
plt.title('Mel power spectrogram ')
plt.colorbar(format='%+02.0f dB')
plt.tight_layout()
Fig.3 MFCC
The MFCC can be calculated using librosa python packages, and shown as Fig.3. The codes are as follows
mfcc=librosa.feature.mfcc(S=log_S,n_mfcc=13)
# Let's pad on the first and second deltas while we're at it
delta2_mfcc=librosa.feature.delta(mfcc,order=2)
plt.figure(figsize=(12,4))librosa.display.specshow(delta2_mfcc)
plt.ylabel('MFCC coeffs')
plt.xlabel('Time')
plt.title('MFCC')
plt.colorbar()
plt.tight_layout()
MFCC are taken as the input tot he system instead of spectrograms in most systems.
However, in end-to-end (NN-based) systems, the most common input features are raw spectrograms, or mell power spectrograms.
MFCC decorrelates features, but NNs deal with correlated features well.
1.3 Spectrogram in 3D
The spectrogram can plotted in 3D as
(adding figure)
The codes for plotting spectrogram are as
data=[go.Surface(z=spectrogram.T)]
layout=go.Layout(
title='Specgtrogram of "yes" in 3d',
scene=dict(
yaxis=dict(title='Frequencies',range=freqs),
xaxis=dict(title='Time',range=times),
zaxis=dict(title='Log amplitude'),
),
)
fig=go.Figure(data=data,layout=layout)
py.iplot(fig)
1.4 Silence removal
The file can be listen by the codes as follows
ipd.Audio(samples,rate=sample_rate)
A bit of the file from the begining and from the end can be cut, and we can listen to it again. The following codes take from 4000 to 13000:
samples_cut=samples[4000:13000]
ipd.Audio(samples_cut,rate=sample_rate)
_webrtcvad_package can be used to have a good Voice Activity Detection (VAD). The audio file, together with guessed alignment of 'y' 'e' 's' graphems, are ploted as
Fig. 4 Raw wave and spectrogram together with guessed alignment of 'y' 'e' 's'.
The codes are as follows
freqs,times,spectrogram_cut=log_specgram(samples_cut,sample_rate)
fig=plt.figure(figsize=(14,8))
ax1=fig.add_subplot(211)
ax1.set_title('Raw wave of '+filename)
ax1.set_ylabel('Amplitude')
ax1.plot(samples_cut)
ax2=fig.add_subplot(212)
ax2.set_title('Spectrogram of '+filename)
ax2.set_ylabel('Frequencies * 0.1')
ax2.set_xlabel('Samples')
ax2.imshow(spectrogram_cut.T,aspect='auto',origin='lower',
extent=[times.min(),times.max(),freqs.min(),freqs.max()])
ax2.set_yticks(freqs[::16])
ax2.set_xticks(times[::16])
ax2.text(0.06,1000,'Y',fontsize=18)
ax2.text(0.17,1000,'E',fontsize=18)
ax2.text(0.36,1000,'S',fontsize=18)
xcoords=[0.025,0.11,0.23,0.49]
for xc in xcoords:
ax1.axvline(x=xc*16000,c='r')
ax2.axvline(x=xc,c='r')
1.5 Resampling-dimensionality reduction
Resample recordings is nother way to reduce the dimensionality of data.
The most speech related frequencies are presented in smaller band. The GSM (2G wireless communication) signal is sampled to 8,000 Hz, and you can still understand another person talking to the telephone.
Resampling the dataset to 8k will reduce the size of data.
The Fast Fourier Transform (FFT) is calculated as
def custom_fft ( y , fs ):
T = 1.0 / fs N = y . shape [ 0 ]
yf = fft (y)
xf=np.linspace(0.0,1.0/(2.0*T),N//2)
vals=2.0/N*np.abs(yf[0:N//2])# FFT is simmetrical, so we take just the first half
# FFT is also complex, to we take just the real part (abs)
return xf ,vals
The following codes read the recording, and resample it.
filename='/happy/0b09edd3_nohash_0.wav'
new_sample_rate=8000
sample_rate,samples=wavfile.read(str(train_audio_path)+filename)
resampled=signal.resample(samples,int(new_sample_rate/sample_rate*samples.shape[0]))
ipd.Audio(samples, rate=sample_rate)
ipd.Audio(resampled, rate=new_sample_rate)
Note that the recording and its resampling version sound very similar.
Fig.5 The FFT of the recording data and its resampled version.
Fig.5 shows the FFT of the recording data and its resampled version.
The applying codes are as follows
xf,vals=custom_fft(samples,sample_rate)
plt.figure(figsize=(12,4))
plt.title('FFT of recording sampled with '+str(sample_rate)+'Hz')
plt.plot(xf,vals)
plt.xlabel('Frequency')
plt.grid()
plt.show()
xf, vals = custom_fft(resampled, new_sample_rate)
plt.figure(figsize=(12, 4))
plt.title('FFT of recording sampled with ' + str(new_sample_rate) + ' Hz')
plt.plot(xf, vals)
plt.xlabel('Frequency')
plt.grid()
plt.show()
1.6 Features Extraction steps
The feature extraction algorithm is as follows:
1 resampling, 2 VAD, 3 padding with 0 to make signals be equal length, 4 log spectrogram (or MFCC, or PLP), 5, features normalization with mean and std, 6. stacking of a given number of frames to get temporal information.
These work have not been done in this notebook.
[0]
https://www.kaggle.com/davids1992/speech-representation-and-data-exploration