Implementing Speech Recognition with DSP, Part 1: Audio Data Analysis

2019-03-12-speech-classification

This post draws on the data-analysis portion of the Kaggle kernel "Speech representation and data exploration".

My kaggle study repository : https://github.com/go1217jo/kaggle_study

Before building a learning model, you should always analyze the data first.

Sampling

Let's pick one audio file of someone saying 'yes' and analyze it. First, we load the wav file to get its sample rate and its samples.


from scipy.io import wavfile
train_audio_path = 'data/train/audio/'
filename = 'yes/0a7c2a8d_nohash_0.wav'
sample_rate, samples = wavfile.read(train_audio_path + filename)
print('sample rate : {}, samples.shape : {}'.format(sample_rate, samples.shape))

Result > sample rate : 16000, samples.shape : (16000,)

The sample rate (= sampling frequency) is 16000 Hz and the file holds 16000 samples, so we can infer that this clip is 1 second long.
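The inference above is just the sample count divided by the sample rate. A minimal sketch of that arithmetic follows; the helper name `clip_duration_seconds` is hypothetical and not part of the original kernel.

```python
def clip_duration_seconds(num_samples, sample_rate):
    """Return the clip length in seconds: sample count divided by samples per second."""
    return num_samples / float(sample_rate)


# Values printed above for the 'yes' clip.
duration = clip_duration_seconds(16000, 16000)
print(duration)
```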

Visualization

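Let's visualize this audio file. Speech is characterized by time, frequency, and amplitude. A plain spectrum plot, however, cannot show all three at once. So we compute a spectrogram: a function of two variables, time and frequency, that shows how the signal's spectral content varies over time.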

Defining the spectrogram function


from scipy import signal
import numpy as np
def log_specgram(audio, sample_rate, window_size=20, step_size=10, eps=1e-10):
    # window_size and step_size are in milliseconds
    # nperseg: Length of each segment
    # noverlap: Number of points to overlap between segments (window minus step)
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = nperseg - int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio, fs=sample_rate,
                                            window='hann', nperseg=nperseg,
                                            noverlap=noverlap, detrend=False)
    # Transpose to (time, frequency); eps keeps log() finite where the power is zero
    return freqs, times, np.log(spec.T.astype(np.float32) + eps)
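To get a feel for the size of the array this function returns, the sketch below estimates its shape from the window and overlap lengths. It assumes `scipy.signal.spectrogram` is called as above (no padding, one-sided spectrum, FFT length equal to `nperseg`). The helper name `specgram_shape` and its parameters `window_ms` and `overlap_ms` are hypothetical.

```python
def specgram_shape(num_samples, sample_rate, window_ms=20, overlap_ms=10):
    """Estimate (time frames, frequency bins) of the transposed spectrogram.

    Assumes no padding, a one-sided spectrum, and an FFT length equal to
    the window length in samples.
    """
    nperseg = int(round(window_ms * sample_rate / 1e3))    # samples per window
    noverlap = int(round(overlap_ms * sample_rate / 1e3))  # samples shared by neighbours
    hop = nperseg - noverlap                               # samples between window starts
    n_frames = (num_samples - noverlap) // hop
    n_freqs = nperseg // 2 + 1
    return n_frames, n_freqs


# The 1-second clip at 16000 Hz with a 20 ms window and 10 ms overlap.
shape = specgram_shape(16000, 16000)
print(shape)
```

For the clip above this gives 99 time frames by 161 frequency bins, so one second of audio becomes roughly as many values as there were raw samples.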

Plotting the amplitude and the spectrogram


import matplotlib.pyplot as plt
freqs, times, spectrogram = log_specgram(samples, sample_rate)
fig = plt.figure(figsize=(14, 8))
ax1 = fig.add_subplot(211)
ax1.set_title('Raw wave of ' + filename)
ax1.set_ylabel('Amplitude')
ax1.plot(np.linspace(0, len(samples) / sample_rate, len(samples)), samples)
ax2 = fig.add_subplot(212)
ax2.imshow(spectrogram.T, aspect='auto', origin='lower',
           extent=[times.min(), times.max(), freqs.min(), freqs.max()])
ax2.set_yticks(freqs[::16])
ax2.set_xticks(times[::16])
ax2.set_title('Spectrogram of ' + filename)
ax2.set_ylabel('Freqs in Hz')
ax2.set_xlabel('Seconds')

With the good tools now available, the spectrogram can also be drawn as a 3D graph, as follows.


import IPython.display as ipd
import plotly.graph_objs as go
import plotly.offline as py
py.init_notebook_mode(connected=True)
data = [go.Surface(x=times, y=freqs, z=spectrogram.T)]
layout = go.Layout(
    autosize=False,
    width=800, height=600,
    title = 'Spectrogram of "yes" in 3D',
    scene = dict(
        yaxis = dict(title='Frequencies', range=[freqs.min(), freqs.max()]),
        xaxis = dict(title='Time', range=[times.min(), times.max()]),
        zaxis = dict(title='Log amplitude')
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

Normalization

Now let's look at the range of the spectrogram values.


print('{} ~ {}'.format(spectrogram.min(), spectrogram.max()))

Result > -19.381107330322266 ~ 11.731490135192871

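The values span a wide range, which makes them poorly suited for use as training data as-is. So we standardize them to zero mean and unit variance for each frequency bin.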


mean = np.mean(spectrogram, axis=0)
std = np.std(spectrogram, axis=0)
spectrogram = (spectrogram - mean) / std
spectrogram.shape
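The standardization step can be wrapped in a small function, sketched below on toy data. The function name `standardize` and the `eps` guard against a zero standard deviation are assumptions added for illustration; the original code divides by `std` directly.

```python
import numpy as np


def standardize(spec, eps=1e-8):
    """Scale each column (frequency bin) to zero mean and unit variance."""
    mean = np.mean(spec, axis=0)
    std = np.std(spec, axis=0)
    # eps is an added safeguard so a constant column does not divide by zero
    return (spec - mean) / (std + eps)


# Toy stand-in for a (time frames, frequency bins) log spectrogram.
toy = np.array([[-19.0, 2.0], [0.0, 4.0], [11.0, 6.0]])
normalized = standardize(toy)
print(normalized.mean(axis=0), normalized.std(axis=0))
```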

Dimensionality Reduction

Audio data is about as large as image data, so we need to reduce its size to keep training fast. As a first approach, let's try VAD (Voice Activity Detection).

In a Jupyter environment, the following code plays the audio.


import IPython.display as ipd
ipd.Audio(samples, rate=sample_rate)

Listening to the audio, "yes" is clearly audible. In the waveform plot above, however, the amplitude is concentrated in the middle, with silence before and after. So we can reduce the data size by removing the silence.


# 0.25 ~ 0.8125 * sample_rate(16000)
samples_cut = samples[4000:13000]
ipd.Audio(samples_cut, rate=sample_rate)

Even after trimming both ends of the audio, "yes" is still clearly audible.

Trimming by hand does not scale, though. So let's try cutting automatically with webrtcvad.


import webrtcvad
vad = webrtcvad.Vad()
# Aggressiveness mode: an integer from 0 to 3; higher is more aggressive about filtering out non-speech
vad.set_mode(3)
class Frame(object):
    """Represents a "frame" of audio data."""
    def __init__(self, bytes, timestamp, duration):
        self.bytes = bytes
        self.timestamp = timestamp
        self.duration = duration
def frame_generator(frame_duration_ms, audio, sample_rate):
    frames = []
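    # n is the byte count of one frame of 16-bit PCM. Note that `audio` is sliced
    # by element below, so with an int16 NumPy array each slice spans n samples.
    n = int(sample_rate * (frame_duration_ms / 1000.0) * 2)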
    offset = 0
    timestamp = 0.0
    duration = (float(n) / sample_rate) / 2.0
    while offset + n < len(audio):
        frames.append(Frame(audio[offset:offset + n], timestamp, duration))
        timestamp += duration
        offset += n
    
    return frames
# 10, 20, or 30
frame_duration_ms = 10 # ms
frames = frame_generator(frame_duration_ms, samples, sample_rate)
for i, frame in enumerate(frames):
    if not vad.is_speech(frame.bytes, sample_rate):
        print(i, end=' ')

Result > 0 1 2 3 4 5 6 7 8 9 10 11 12 13 36 37 38 39 40 41 42 43 44 45 46 47 48

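These are the indices of the frames that were not recognized as speech. The large gap in the middle of the index sequence is where the speech is concentrated. Let's cut the clip accordingly and listen to it.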


samples_cut = samples[4480:11755]
ipd.Audio(samples_cut, rate=sample_rate)

The result sounds quite satisfactory. Now let's write a function that automatically trims the silence in an audio file.


def auto_vad(vad, samples, sample_rate, frame_duration_ms = 10):
    not_speech = []
    frames = frame_generator(frame_duration_ms, samples, sample_rate)
    n_frame = len(frames)
    for idx, frame in enumerate(frames):
        if not vad.is_speech(frame.bytes, sample_rate):
            not_speech.append(idx)
    prior = 0
    cutted_samples = []
    for i in not_speech:
        if i - prior > 2:
            # Map frame indices to approximate sample offsets as a fraction of the clip length
            start = int((float(prior) / n_frame) * len(samples))
            end = int((float(i) / n_frame) * len(samples))
            print(start, end)
            if len(cutted_samples) == 0:
                cutted_samples = samples[start:end]
            else:
                cutted_samples = np.append(cutted_samples, samples[start:end])
        prior = i
    return cutted_samples


cutted_samples = auto_vad(vad, samples, sample_rate, 10)
ipd.Audio(cutted_samples, rate=sample_rate)

Result > 4244 12081

We can confirm that "yes" is clearly audible. Now let's check the spectrogram as well.


freqs, times, spectrogram = log_specgram(cutted_samples, sample_rate)
fig,ax = plt.subplots(1)
ax.imshow(spectrogram.T, aspect='auto', origin='lower',
           extent=[times.min(), times.max(), freqs.min(), freqs.max()])
ax.set_yticks(freqs[::16])
ax.set_xticks(times[::16])
ax.set_title('Spectrogram of ' + filename)
ax.set_ylabel('Freqs in Hz')
ax.set_xlabel('Seconds')
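As a side note on the framing used for VAD: webrtcvad works on 16-bit mono PCM bytes split into frames of 10, 20, or 30 ms. The sketch below shows one way to cut such frames from raw bytes and to turn per-frame speech decisions into a sample range. It is a different framing from `frame_generator` above, which slices the NumPy array directly. The function names `pcm16_frames` and `longest_speech_range`, the silent input, and the speech flags are all hypothetical. With a real int16 array the bytes could presumably be obtained with `samples.tobytes()`.

```python
def pcm16_frames(pcm_bytes, sample_rate, frame_ms=10):
    """Split 16-bit mono PCM bytes into fixed-length frames of frame_ms each."""
    bytes_per_frame = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    last_start = len(pcm_bytes) - bytes_per_frame
    return [pcm_bytes[i:i + bytes_per_frame]
            for i in range(0, last_start + 1, bytes_per_frame)]


def longest_speech_range(is_speech_flags, samples_per_frame):
    """Return (start, end) sample indices of the longest run of speech frames."""
    best = (0, 0)
    run_start = None
    # The trailing False is a sentinel that closes a run reaching the last frame.
    for idx, flag in enumerate(list(is_speech_flags) + [False]):
        if flag and run_start is None:
            run_start = idx
        elif not flag and run_start is not None:
            if idx - run_start > best[1] - best[0]:
                best = (run_start, idx)
            run_start = None
    return best[0] * samples_per_frame, best[1] * samples_per_frame


# One second of silence at 16 kHz stands in for real audio (hypothetical data).
frames = pcm16_frames(bytes(32000), sample_rate=16000, frame_ms=10)
print(len(frames), len(frames[0]))

# Hypothetical per-frame VAD decisions: speech in frames 30..69.
flags = [30 <= i < 70 for i in range(100)]
speech_range = longest_speech_range(flags, samples_per_frame=160)
print(speech_range)
```

In real use, each flag would come from calling `vad.is_speech(frame, sample_rate)` on one of the byte frames.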

The second approach is resampling. The current sample rate is 16000 Hz. What happens if we resample to about 8000 Hz?

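In practice this causes little harm. At 8000 Hz the signal can still represent frequencies up to 4000 Hz, and most of the frequencies relevant to speech lie in that lower band.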


filename = 'happy/0b09edd3_nohash_0.wav'
new_sample_rate = 8000
sample_rate, samples = wavfile.read(str(train_audio_path) + filename)
resampled = signal.resample(samples, int(new_sample_rate/sample_rate * samples.shape[0]))
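The effect of the resampling call can be summarized with two numbers: the new sample count and the highest frequency the new rate can represent (half the sample rate). A minimal sketch of that arithmetic follows, assuming a 1-second clip of 16000 samples; the helper name `resample_plan` is hypothetical.

```python
def resample_plan(num_samples, sample_rate, new_sample_rate):
    """Return (output sample count, highest representable frequency in Hz)."""
    # Same length formula as the signal.resample call above
    new_len = int(new_sample_rate / sample_rate * num_samples)
    nyquist_hz = new_sample_rate / 2.0
    return new_len, nyquist_hz


plan = resample_plan(16000, 16000, 8000)
print(plan)
```

Halving the sample rate halves the number of samples per clip, which is the size reduction this section is after.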

The following posts will analyze the dataset and, finally, build a learning model.
