wav2vec: unsupervised pre-training for speech recognition

Paper

ppoqq 2022. 10. 12. 16:32

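Goal: explore unsupervised pre-training for learning representations of raw audio
Method: wav2vec, trained on large amounts of unlabeled audio data
- the resulting representations are used to improve acoustic model training
- a simple multi-layer CNN is trained
  - optimized via a noise contrastive binary classification task
Result
- WSJ dataset: WER reduced by up to 36% relative to a strong character-based log-mel filterbank baseline when only a few hours of transcribed data are available
- nov92 test set: 2.43% WER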

Speech recognition models require large amounts of transcribed audio data
Pre-training is used to overcome this
- when enough labeled or unlabeled data is available, use it to learn general representations
- then reuse the learned representations to improve performance on downstream tasks where data is limited
Pre-training is widely used in computer vision, NLP, and speech processing, and has been shown to work well

The raw audio signal is taken as input and passed through two networks
1. encoder network: embeds the audio signal in a latent space
2. context network: combines multiple time-steps of the encoder output to obtain contextualized representations
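
To make the data flow concrete, here is a rough sketch of how many latent vectors the encoder produces for a waveform. The kernel sizes, strides, unpadded convolutions, and 16 kHz sample rate are assumptions taken from the wav2vec paper, not from these notes, and the function names are made up for illustration.

```python
# Encoder layout assumed from the wav2vec paper (not stated in these notes):
# five unpadded conv layers with these kernel sizes and strides, 16 kHz input.
ENCODER_KERNELS = (10, 8, 4, 4, 4)
ENCODER_STRIDES = (5, 4, 2, 2, 2)
SAMPLE_RATE = 16_000


def encoder_frames(num_samples: int) -> int:
    """Number of latent vectors z_i produced for a waveform of num_samples."""
    length = num_samples
    for kernel, stride in zip(ENCODER_KERNELS, ENCODER_STRIDES):
        # Standard output-length formula for an unpadded strided convolution.
        length = (length - kernel) // stride + 1
    return length


def encoder_hop_samples() -> int:
    """Samples between consecutive z_i: the product of all strides."""
    hop = 1
    for stride in ENCODER_STRIDES:
        hop *= stride
    return hop


hop = encoder_hop_samples()
print(hop, 1000 * hop / SAMPLE_RATE)   # 160 samples, 10.0 ms
print(encoder_frames(SAMPLE_RATE))     # 98 frames for one second of audio
```

If the context network uses left-padded causal convolutions with stride one, it returns one context vector $c_i$ per encoder output $z_i$, so both sequences have the same length.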

Layers of the encoder and context networks
- a causal convolution with 512 channels
- group normalization layer → normalizes across the feature and temporal dimensions for each sample
- ReLU nonlinearity
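
A minimal pure-Python sketch of one such layer, assuming a single input channel and a couple of output channels instead of 512, and omitting the learned scale and shift of group normalization. All function names here are illustrative, not taken from any library.

```python
import math


def causal_conv1d(x, weights, bias=0.0):
    """Single-channel causal convolution: y[t] uses only x[t-K+1 .. t]."""
    k = len(weights)
    padded = [0.0] * (k - 1) + list(x)   # left-pad so no future sample is seen
    return [
        bias + sum(w * padded[t + j] for j, w in enumerate(weights))
        for t in range(len(x))
    ]


def normalize_sample(channels, eps=1e-5):
    """Normalize over all features and time steps of one sample
    (group normalization with a single group, no learned scale/shift)."""
    values = [v for ch in channels for v in ch]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in ch] for ch in channels]


def relu(channels):
    return [[max(0.0, v) for v in ch] for ch in channels]


def block(x, kernels):
    """One layer: causal conv (one output channel per kernel) -> norm -> ReLU."""
    conv = [causal_conv1d(x, w) for w in kernels]
    return relu(normalize_sample(conv))


x = [1.0, 2.0, 3.0, 4.0]
print(causal_conv1d(x, [0.5, 0.5]))          # [0.5, 1.5, 2.5, 3.5]
print(block(x, [[0.5, 0.5], [1.0, -1.0]]))   # 2 channels x 4 time steps
```

Only the convolution is strictly causal in this sketch: the normalization statistics are computed over the whole sequence, as the per-sample normalization described above implies.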
wav2vec large
- two additional linear transformations in the encoder
- considerably larger context network comprised of 12 layers with increasing kernel sizes (2, 3, ..., 13)
- skip connections are used in the aggregator to help convergence
- total receptive field: about 810ms
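
The 810 ms figure can be checked with a short calculation. This sketch assumes, following the paper rather than these notes, that each encoder output covers about 30 ms of audio and that consecutive outputs are 10 ms apart.

```python
def context_frames(kernel_sizes):
    """Encoder frames seen by the last layer of stacked stride-1 convolutions."""
    frames = 1
    for kernel in kernel_sizes:
        frames += kernel - 1     # a stride-1 layer adds (kernel - 1) frames
    return frames


# Assumed from the paper, not from these notes: each encoder output z_i covers
# about 30 ms of audio and consecutive outputs are 10 ms apart.
FRAME_MS = 30
HOP_MS = 10

large_kernels = range(2, 14)               # kernel sizes 2, 3, ..., 13
frames = context_frames(large_kernels)     # 79 frames
field_ms = FRAME_MS + (frames - 1) * HOP_MS
print(frames, field_ms)                    # 79 810
```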

The model is trained to distinguish the sample $k$ steps ahead, $z_{i+k}$ (true), from distractor samples $\tilde{z}$ (negative) drawn from a proposal distribution $p_n$
loss function: minimize the contrastive loss for each step $k=1, ..., K$
- $\mathcal{L}_k = -\sum_{i=1}^{T-k}\left(\log\sigma(z_{i+k}^\top h_k(c_i)) + \lambda\underset{\tilde{z}\sim p_n}{\mathbb{E}}[\log\sigma(-\tilde{z}^\top h_k(c_i))]\right)$
- $\sigma(z_{i+k}^\top h_k(c_i))$: probability of $z_{i+k}$ being the true sample
  - step-specific affine transformation $h_k(c_i) = W_kc_i + b_k$ for each step $k$
- optimize the loss $\mathcal{L} = \sum_{k=1}^{K}\mathcal{L}_k$
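
The loss above can be sketched in plain Python for a single sequence. This is an illustration under simplifying assumptions, not the paper's implementation: vectors are plain lists, the distractors are passed in precomputed, the expectation is replaced by a sample mean, and all names (`step_loss`, `total_loss`, `negatives`) are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def affine(W, b, c):
    """h_k(c_i) = W_k c_i + b_k, with W given as a list of rows."""
    return [dot(row, c) + bias for row, bias in zip(W, b)]


def step_loss(z, c, W, b, k, negatives, lam):
    """L_k for one sequence; negatives[i] holds the distractors for position i."""
    total = 0.0
    for i in range(len(z) - k):              # i = 1 .. T-k in the formula (0-based here)
        h = affine(W, b, c[i])
        positive = log_sigmoid(dot(z[i + k], h))
        # The sample mean over the distractors stands in for the expectation.
        negative = sum(log_sigmoid(-dot(zt, h)) for zt in negatives[i]) / len(negatives[i])
        total += positive + lam * negative
    return -total


def total_loss(z, c, params, negatives, lam):
    """L = sum_k L_k, with params[k-1] = (W_k, b_k)."""
    return sum(
        step_loss(z, c, W, b, k, negatives, lam)
        for k, (W, b) in enumerate(params, start=1)
    )


# Toy input: T = 3 frames of dimension 2, K = 1, two fixed distractors per position.
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])   # hypothetical W_1, b_1
negatives = [[z[0], z[2]] for _ in z]
loss = total_loss(z, c, [identity], negatives, lam=2)
print(loss)
```

Since every $\log\sigma(\cdot)$ term is negative, the loss is always positive; it shrinks as true samples score high and distractors score low.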
In practice, the expectation is approximated with 10 negative examples, obtained by choosing distractors uniformly from each audio sequence
- $p_n(z) = \frac{1}{T}$, $T$: sequence length, $\lambda$: number of negatives
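
A small sketch of the sampling step, assuming distractors are drawn with replacement from all $T$ positions of the same sequence. The helper names are hypothetical. It also shows that with $\lambda$ set to the number of negatives, $\lambda$ times the sample mean is simply the sum over the sampled negatives.

```python
import math
import random


def sample_distractors(z, num_negatives=10, rng=random):
    """Draw distractors uniformly from the same sequence, i.e. p_n(z) = 1/T.

    Sampling is with replacement over all T positions, so the true sample
    can occasionally be drawn as a distractor in this simplified version.
    """
    T = len(z)
    return [z[rng.randrange(T)] for _ in range(num_negatives)]


def negative_term(scores, lam):
    """lambda * (sample mean of the per-distractor log-probabilities)."""
    return lam * sum(scores) / len(scores)


rng = random.Random(0)                         # fixed seed for repeatability
z = [[float(t), float(-t)] for t in range(50)]  # stand-in encoder outputs
distractors = sample_distractors(z, num_negatives=10, rng=rng)

scores = [-0.1 * (j + 1) for j in range(10)]   # stand-in log sigma(-z~^T h_k(c_i)) values
weighted = negative_term(scores, lam=len(scores))
print(len(distractors), math.isclose(weighted, sum(scores)))   # 10 True
```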