wav2vec: unsupervised pre-training for speech recognition

Abstract

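Goal: explore unsupervised pre-training for speech recognition by learning representations of raw audio
Method: wav2vec, trained on large amounts of unlabeled audio data
- the resulting representations are used to improve acoustic model training
- a simple multi-layer CNN is pre-trained
  - optimized via a noise contrastive binary classification task
Result
- WSJ dataset: WER reduced by up to 36% relative to a strong character-based log-mel filterbank baseline, when only a few hours of transcribed data are available
- nov92 test set: 2.43% WER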

Introduction

Pre-training of neural networks

Speech recognition models require large amounts of transcribed audio data
Pre-training is used to get around this
- when a sufficient amount of labeled or unlabeled data is available, use it to learn general representations
- then reuse the learned representations on a downstream task where data is limited, to improve its performance
Pre-training is widely used in computer vision, NLP, and speech processing, and has worked well in each

This paper...

Applies unsupervised pre-training to improve supervised speech recognition
wav2vec
- a CNN takes raw audio as input and computes a general representation
- objective: contrastive loss
  - distinguish the true future audio sample from negatives
- goes beyond frame-wise phoneme classification: the learned representations are used to improve a supervised ASR system
- fully convolutional architecture → easy to parallelize

Pre-training approach

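Given an audio signal as input, the model is optimized to predict future samples from the given signal context
Problem: this normally requires accurately modeling the data distribution p(x), which is difficult
- Solution: first encode the raw samples x into a feature representation z at a lower temporal frequency, then implicitly model a density ratio over z instead of modeling p(x) directly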

Model

The raw audio signal is fed through two networks
1. encoder network: embeds the audio signal in a latent space
2. context network: combines multiple time-steps of the encoder output to obtain contextualized representations

The encoder network $f: \mathit{X} \mapsto \mathit{Z}$ is applied to the raw audio samples $x_i \in \mathit{X}$
- 5 layers
- kernel sizes: (10, 8, 4, 4, 4)
- strides: (5, 4, 2, 2, 2)
- output: a low-frequency feature representation $z_i \in \mathit{Z}$
  - each $z_i$ encodes about 30 ms of 16 kHz audio
  - with the striding, a representation $z_i$ is produced every 10 ms
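The 10 ms and 30 ms figures follow from the kernel sizes and strides listed above. A minimal sketch of that arithmetic, assuming plain unpadded 1-D convolutions; the function name `encoder_geometry` is made up for illustration:

```python
# Encoder hyperparameters as listed above; the 16 kHz sample rate is from the text.
KERNELS = (10, 8, 4, 4, 4)
STRIDES = (5, 4, 2, 2, 2)
SAMPLE_RATE_HZ = 16_000


def encoder_geometry(kernels, strides):
    """Return (hop, receptive_field) in input samples for a stack of 1-D convs."""
    hop = 1              # distance between consecutive outputs, in input samples
    receptive_field = 1  # number of input samples seen by one output
    for kernel, stride in zip(kernels, strides):
        # A kernel of size k reaches (k - 1) further steps at the current hop.
        receptive_field += (kernel - 1) * hop
        hop *= stride
    return hop, receptive_field


hop, rf = encoder_geometry(KERNELS, STRIDES)
hop_ms = 1000 * hop / SAMPLE_RATE_HZ
rf_ms = 1000 * rf / SAMPLE_RATE_HZ
print(f"hop: {hop} samples = {hop_ms:.1f} ms")            # hop: 160 samples = 10.0 ms
print(f"receptive field: {rf} samples = {rf_ms:.1f} ms")  # receptive field: 465 samples = 29.1 ms
```

So the product of the strides (160 samples) gives the 10 ms frame rate, and the 465-sample receptive field is the "about 30 ms" each $z_i$ covers.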
The context network $g: \mathit{Z} \mapsto \mathit{C}$ is applied to the encoder output $z_i$, mixing multiple latent representations $z_i \dots z_{i-v}$ into a single contextualized tensor $c_i = g(z_i \dots z_{i-v})$ (v: receptive field size)
- 9 layers
- kernel size: 3
- stride: 1
- total receptive field of the context network: about 210 ms
  - cf. receptive field: the span of the original input (here, the audio signal) that a single output unit depends on
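The 210 ms figure can be checked the same way. This is a rough sketch that takes the encoder numbers from the text (one $z_i$ every 10 ms, each covering about 30 ms) as given; the helper name `context_receptive_field_ms` is hypothetical:

```python
ENCODER_HOP_MS = 10   # one z_i every 10 ms (from the text)
ENCODER_SPAN_MS = 30  # each z_i covers about 30 ms of audio (from the text)


def context_receptive_field_ms(kernel_sizes):
    """Approximate audio span seen by one c_i, for stride-1 convolutions."""
    # Each stride-1 layer with kernel size k adds (k - 1) more z frames.
    frames = 1 + sum(k - 1 for k in kernel_sizes)
    # Frames sit ENCODER_HOP_MS apart, and each one spans ENCODER_SPAN_MS.
    span_ms = (frames - 1) * ENCODER_HOP_MS + ENCODER_SPAN_MS
    return frames, span_ms


frames, span_ms = context_receptive_field_ms([3] * 9)
print(frames, span_ms)  # 19 210
```

Nine layers of kernel size 3 combine 19 encoder frames, which spans roughly 210 ms of audio.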

Layers of the encoder and context networks
- a causal convolution with 512 channels
- group normalization layer → normalizes across the feature and temporal dimensions for each sample
- ReLU nonlinearity
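A toy-scale sketch of one such layer in pure Python: a causal convolution (left padding only), normalization over channels and time for a single sample, then ReLU. The helper names (`causal_conv1d`, `normalize_sample`, `relu`), the two hand-picked kernels, and the omission of learnable scale and shift parameters are all assumptions for illustration; the real layers use 512 learned channels.

```python
import math


def causal_conv1d(signal, kernel):
    """Single-channel causal convolution: output[t] depends only on signal[:t + 1]."""
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(signal)  # pad on the left only, never on the right
    return [
        sum(w * padded[t + j] for j, w in enumerate(kernel))
        for t in range(len(signal))
    ]


def normalize_sample(channels, eps=1e-5):
    """Normalize one sample over both the channel and time axes (a single group)."""
    values = [v for channel in channels for v in channel]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in channel] for channel in channels]


def relu(values):
    return [max(0.0, v) for v in values]


x = [1.0, 2.0, 3.0, 4.0]
# Two output channels from two toy kernels (the last weight multiplies the current sample).
channels = [causal_conv1d(x, [0.5, 0.5]), causal_conv1d(x, [-1.0, 1.0])]
print(channels[0])  # [0.5, 1.5, 2.5, 3.5]
hidden = [relu(channel) for channel in normalize_sample(channels)]

# Causality check: changing a future sample leaves earlier outputs untouched.
assert causal_conv1d([1.0, 2.0, 3.0, 100.0], [0.5, 0.5])[:3] == channels[0][:3]
```

Padding only on the left is what makes the convolution causal: no output can look at samples that come after its own time-step.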
wav2vec large
- two additional linear transformations in the encoder
- a considerably larger context network: 12 layers with increasing kernel sizes (2, 3, ..., 13)
- skip connections in the aggregator (the context network) help convergence
- total receptive field: about 810 ms
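As a sanity check on the 810 ms figure: assuming every context layer uses stride 1 (an assumption, the notes do not say so for the large model) and taking the encoder figures as one $z_i$ every 10 ms covering about 30 ms each, the kernel sizes 2 through 13 give

$$\text{frames} = 1 + \sum_{k=2}^{13}(k-1) = 1 + 78 = 79, \qquad (79-1)\times 10\,\text{ms} + 30\,\text{ms} = 810\,\text{ms}$$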

Objective

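The model is trained to distinguish the true sample $z_{i+k}$, which lies k steps in the future, from distractor samples $\tilde{z}$ (negatives) drawn from a proposal distribution $p_n$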
loss function: minimize the contrastive loss for each step $k=1, ..., K$
- $\mathcal{L}_k = -\sum_{i=1}^{T-k}\left(\log\sigma(z_{i+k}^\top h_k(c_i)) + \lambda\underset{\tilde{z}\sim p_n}{\mathbb{E}}[\log\sigma(-\tilde{z}^\top h_k(c_i))]\right)$
- $\sigma(z_{i+k}^\top h_k(c_i))$ : probability of $z_{i+k}$ being the true sample
  - step-specific affine transformation $h_k(c_i) = W_kc_i + b_k$ for each step k
- optimize the loss $\mathcal{L} = \sum_{k=1}^{K}\mathcal{L}_k$
In practice, the expectation is approximated with 10 negative examples, choosing the distractors uniformly from each audio sequence
- $p_n(z) = \frac{1}{T}$ , T: sequence length, $\lambda$ : number of negatives
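The objective above can be sketched in pure Python as follows. This is an illustrative toy, not the authors' implementation: the function names, the tiny dimensions, the random inputs, and the identity matrices standing in for $W_k$ are all assumptions. It relies on the fact that with $\lambda$ equal to the number of negatives, $\lambda$ times the sample mean is just the sum over the sampled negatives.

```python
import math
import random


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def affine(W, b, c):
    """h_k(c) = W_k c + b_k, with W given as a list of rows."""
    return [dot(row, c) + bias for row, bias in zip(W, b)]


def step_loss(z, c, W, b, k, num_negatives=10, rng=random):
    """Monte Carlo estimate of L_k for one sequence of T frames."""
    T = len(z)
    total = 0.0
    for i in range(T - k):  # i = 1 .. T-k in the text, 0-based here
        h = affine(W, b, c[i])
        positive = log_sigmoid(dot(z[i + k], h))
        # Distractors drawn uniformly from the same sequence: p_n(z) = 1/T.
        negatives = [z[rng.randrange(T)] for _ in range(num_negatives)]
        # lambda * E[...] with lambda = num_negatives reduces to a plain sum.
        negative = sum(log_sigmoid(-dot(zn, h)) for zn in negatives)
        total += positive + negative
    return -total


def total_loss(z, c, params, rng=random):
    """L = sum over k of L_k; params[k - 1] holds the (W_k, b_k) pair for step k."""
    return sum(
        step_loss(z, c, W, b, k, rng=rng)
        for k, (W, b) in enumerate(params, start=1)
    )


# Toy data: T frames of dimension dim, K prediction steps (all values hypothetical).
rng = random.Random(0)
T, dim, K = 6, 3, 2
z = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(T)]
c = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(T)]
identity = [[1.0 if r == col else 0.0 for col in range(dim)] for r in range(dim)]
params = [(identity, [0.0] * dim) for _ in range(K)]
print(total_loss(z, c, params, rng=rng))
```

Because every `log_sigmoid` term is negative, the loss is always positive; training pushes it toward zero by scoring the true future frame high and the distractors low.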
