Summary of wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

1. Introduction

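Neural networks have conventionally been trained on large amounts of labeled training data.

However, labeled data is harder to collect than unlabeled data.
Current speech recognition systems need thousands of hours of transcribed speech to reach acceptable performance.
Moreover, humans do not learn from labeled examples when they acquire language.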

Self-supervised learning

An approach that learns general representations from unlabeled examples and then fine-tunes on labeled data.

Wav2vec 2.0

A framework that learns representations from raw audio data through self-supervised learning.
Pre-training
- Speech audio is encoded with a multi-layer convolutional neural network
- Part of the resulting latent speech representations is masked
- The masked representations are fed into a Transformer network, which produces contextualized representations
- The model is trained on a contrastive task
  - Contrastive task: a task that maps data into an embedding space so that samples with similar characteristics lie close together and dissimilar samples lie far apart
  - Discrete speech units are learned with a Gumbel softmax and are used to represent the latent representations in the contrastive task
Fine-tuning - downstream speech recognition tasks
- The model is fine-tuned on labeled data with a Connectionist Temporal Classification (CTC) loss
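Experimental results
- Jointly learning discrete speech units and contextualized representations performs better than learning with fixed units
- With only 10 minutes of labeled data (and pre-training on 53k hours of unlabeled data), WER 4.8/8.2 is achieved on the clean/other test sets of Librispeech
- With the 960 hours of labeled Librispeech data, WER 1.8/3.3 is achieved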

2. Model

Feature encoder f : X ⟼ Z
- Several blocks, each containing a temporal convolution + layer normalization + GELU activation function
- The raw waveform given as input is normalized to zero mean and unit variance
- The total stride of the encoder determines the number of time steps T
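
To make the relation between the encoder stride and T concrete, here is a minimal sketch in plain Python. The kernel widths and strides are taken from the paper's experimental setup rather than from this summary, and the helper names `normalize` and `num_time_steps` are hypothetical. It assumes unpadded convolutions, so each layer maps a length L to floor((L - k) / s) + 1.

```python
import math

def normalize(waveform):
    """Scale a raw waveform to zero mean and unit variance."""
    n = len(waveform)
    mean = sum(waveform) / n
    var = sum((x - mean) ** 2 for x in waveform) / n
    std = math.sqrt(var + 1e-7)  # small epsilon guards against silent input
    return [(x - mean) / std for x in waveform]

def num_time_steps(num_samples, kernels, strides):
    """Output length T of a stack of unpadded 1-D convolutions."""
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1
    return length

# Assumed layer shapes (paper's experimental setup, not stated in this summary).
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)

total_stride = math.prod(STRIDES)            # input samples per output frame
T = num_time_steps(16000, KERNELS, STRIDES)  # one second of 16 kHz audio
print(total_stride, T)  # 320 49
```

With these assumed shapes, one second of 16 kHz audio yields 49 frames, i.e. roughly one latent vector every 20 ms.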
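Contextualized representations with Transformers g : Z ⟼ C
Quantization module Z ⟼ Q
- Product quantization: quantized representations are chosen from multiple codebooks and concatenated
- The Gumbel softmax makes it possible to choose discrete codebook entries in a fully differentiable way
  - A straight-through estimator (STE) is used
  - The hard Gumbel softmax operation is set up as follows
  - The feature encoder output z is mapped to logits $l \in \mathbb{R}^{G \times V}$
  - Probability of choosing the v-th codebook entry for group g: $p_{g, v} = \frac{\exp((l_{g, v} + n_{v}) / \tau)}{\sum_{k = 1}^{V} \exp((l_{g, k} + n_{k}) / \tau)}$, where $\tau$ is a non-negative temperature, $n = -\log(-\log(u))$, and $u$ is sampled uniformly from $U(0, 1)$
  - Forward pass: codeword i is chosen by $i = \mathrm{argmax}_{j} \, p_{g, j}$
  - Backward pass: the true gradient of the Gumbel softmax outputs is used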
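
The forward pass of this selection can be sketched in plain Python as follows. This is only an illustration under assumptions: the function names, the toy codebooks, and the temperature value are hypothetical, and no gradients are computed here. In an autograd framework the straight-through trick is commonly written as `hard + soft - stop_gradient(soft)`, so the forward value is the one-hot choice while the gradient flows through the soft probabilities.

```python
import math
import random

def gumbel_softmax_select(logits, tau, rng):
    """Pick one entry per group from logits of shape G x V.

    Returns (indices, probs): the hard argmax per group (forward pass)
    and the soft probabilities p_{g,v} (what the backward pass would use).
    """
    indices, probs = [], []
    for group_logits in logits:
        noisy = []
        for l in group_logits:
            u = max(rng.random(), 1e-12)      # keep u away from 0
            n = -math.log(-math.log(u))       # Gumbel noise
            noisy.append((l + n) / tau)
        m = max(noisy)                        # subtract max for stability
        exps = [math.exp(x - m) for x in noisy]
        z = sum(exps)
        p = [e / z for e in exps]
        probs.append(p)
        indices.append(max(range(len(p)), key=p.__getitem__))
    return indices, probs

def quantize(indices, codebooks):
    """Product quantization: concatenate the chosen entry of each codebook."""
    q = []
    for g, i in enumerate(indices):
        q.extend(codebooks[g][i])
    return q

rng = random.Random(0)
# Toy setup: G = 2 groups, V = 3 entries, each entry a 2-dim vector.
logits = [[1e6, 0.0, 0.0], [0.0, 1e6, 0.0]]
codebooks = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]],
]
indices, probs = gumbel_softmax_select(logits, tau=2.0, rng=rng)
q = quantize(indices, codebooks)
```

The logits are made extremely peaked on purpose so that the noise cannot change the choice; with realistic logits the Gumbel noise makes the selection stochastic.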

3. Training

3.1. Masking

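A proportion of the feature encoder outputs, i.e. time steps, is masked before being fed into the context network, and each masked time step is replaced with a learned feature vector that is shared across all masked time steps.
A certain proportion p of all time steps is randomly sampled (without replacement) to serve as starting indices, and M consecutive time steps starting from each sampled index are masked. Spans may overlap.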
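
The span sampling described above could look roughly like this. It is a sketch, not the reference implementation: `sample_mask` and `apply_mask` are hypothetical names, converting p into a count by rounding `p * T` is an assumption, and p = 0.065 with M = 10 are the values reported in the paper's setup rather than in this summary.

```python
import random

def sample_mask(num_steps, p, span, rng):
    """Return a boolean mask of length num_steps (True = masked)."""
    num_starts = int(round(p * num_steps))  # assumption: round p*T to a count
    starts = rng.sample(range(num_steps), num_starts)  # without replacement
    mask = [False] * num_steps
    for start in starts:
        for t in range(start, min(start + span, num_steps)):
            mask[t] = True  # overlapping spans simply merge
    return mask

def apply_mask(features, mask, mask_vector):
    """Replace every masked frame with the same shared vector."""
    return [mask_vector if m else z for z, m in zip(features, mask)]

rng = random.Random(0)
T = 100
features = [[float(t), float(t)] for t in range(T)]  # toy encoder outputs
mask = sample_mask(T, p=0.065, span=10, rng=rng)
masked_features = apply_mask(features, mask, mask_vector=[-1.0, -1.0])
```

Because spans can overlap or run past the end of the sequence, the fraction of masked steps is usually somewhat below `p * M`.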

3.2. Objective

$L = L_{m} + α L_{d}$

Contrastive Loss: $L_{m}$

Given the context network output $c_{t}$ centered over a masked time step t, the model has to identify the true quantized representation $q_{t}$ among K+1 quantized candidate representations $\tilde{q} \in Q_{t}$, which include $q_{t}$ and K distractors.
$L_{m} = - \log \frac{\exp(\mathrm{sim}(c_{t}, q_{t}) / \kappa)}{\sum_{\tilde{q} \in Q_{t}} \exp(\mathrm{sim}(c_{t}, \tilde{q}) / \kappa)}$
- $\mathrm{sim}(a, b) = a^{T} b / (\| a \| \| b \|)$ : cosine similarity
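
As a worked example of $L_{m}$ for a single masked time step, the following plain-Python sketch evaluates the formula directly. The function names and the toy vectors are hypothetical, and kappa = 0.1 is only an illustrative temperature.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(c_t, q_t, distractors, kappa):
    """-log softmax score of the true q_t among q_t plus K distractors."""
    candidates = [q_t] + list(distractors)
    scores = [cosine_similarity(c_t, q) / kappa for q in candidates]
    m = max(scores)  # log-sum-exp with max subtraction for stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - scores[0]

c_t = [1.0, 0.0]
q_t = [2.0, 0.0]  # same direction as c_t, so sim = 1
easy = contrastive_loss(c_t, q_t, [[0.0, 1.0]] * 3, kappa=0.1)
hard = contrastive_loss(c_t, q_t, [[1.0, 0.0]] * 3, kappa=0.1)
```

When the distractors are orthogonal to $c_{t}$ the loss is close to zero, and when they are indistinguishable from $q_{t}$ it equals log(K+1), the loss of a uniform guess.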

Diversity Loss: $L_{d}$

Introduced to increase the use of the quantized codebook representations.
It encourages equal use of the V entries in each of the G codebooks by maximizing the entropy of the averaged softmax distribution $l$ over the codebook entries for each codebook, $\bar{p}_{g}$, across a batch of utterances.
$L_{d} = \frac{1}{G V} \sum_{g = 1}^{G} - H(\bar{p}_{g}) = \frac{1}{G V} \sum_{g = 1}^{G} \sum_{v = 1}^{V} \bar{p}_{g, v} \log \bar{p}_{g, v}$
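
A small numeric sketch may help show what $L_{d}$ rewards. The function name and the nested-list layout `batch_probs[b][g][v]` are assumptions made for illustration; the code averages the softmax probabilities over the batch first and then computes the negative entropy, as in the formula.

```python
import math

def diversity_loss(batch_probs):
    """batch_probs[b][g][v]: softmax over the V entries of codebook g, frame b."""
    B = len(batch_probs)
    G = len(batch_probs[0])
    V = len(batch_probs[0][0])
    total = 0.0
    for g in range(G):
        for v in range(V):
            p_bar = sum(frame[g][v] for frame in batch_probs) / B
            if p_bar > 0:  # treat 0 * log(0) as 0
                total += p_bar * math.log(p_bar)
    return total / (G * V)

# G = 1 codebook, V = 4 entries, batch of 4 frames.
collapsed = [[[1.0, 0.0, 0.0, 0.0]] for _ in range(4)]   # always entry 0
balanced = [[[1.0 if v == b else 0.0 for v in range(4)]] for b in range(4)]
loss_collapsed = diversity_loss(collapsed)
loss_balanced = diversity_loss(balanced)
```

Each frame in `balanced` is fully confident, but the batch uses all four entries equally, so the loss reaches its minimum of -log(V)/V. The collapsed batch gets the maximum value 0.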

3.3. Fine-tuning

Fine-tuning is done by adding a randomly initialized linear projection on top of the context network
- For speech recognition, the projection maps into C classes representing the vocabulary
The model is optimized by minimizing a CTC loss
A modified version of SpecAugment is applied, which reduces overfitting and lowers the error rate
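
For readers unfamiliar with CTC, the quantity being minimized can be sketched with the standard forward algorithm. This is a toy version under assumptions: `ctc_neg_log_likelihood` is a hypothetical name, `probs[t][c]` stands for the softmax of the linear projection over C classes at frame t, class 0 is assumed to be the blank, and the computation is done in linear space, which would underflow on long sequences (practical implementations work in log space).

```python
import math

def ctc_neg_log_likelihood(probs, target, blank=0):
    """-log P(target | probs), summing over all valid CTC alignments."""
    # Extended target: blank, y1, blank, y2, ..., blank
    ext = [blank]
    for symbol in target:
        ext += [symbol, blank]
    S = len(ext)
    alpha = [0.0] * S
    alpha[0] = probs[0][blank]
    if S > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]             # skip the blank in between
            new[s] = total * probs[t][ext[s]]
        alpha = new
    p = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return -math.log(p)  # raises ValueError if no alignment has probability > 0

# Two frames, classes {0: blank, 1: "a"}, uniform predictions.
probs = [[0.5, 0.5], [0.5, 0.5]]
loss = ctc_neg_log_likelihood(probs, target=[1])
```

Here three of the four possible frame labelings ("aa", "a-", "-a") collapse to "a", so the loss is -log(0.75).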
