A Brief Look at the Differences Between GPT-1, 2, 3, and DialoGPT

How do GPT-1, 2, and 3 differ?

How do GPT-1, 2, 3, and DialoGPT differ? Let's take a brief look, based on the following papers.

GPT-1 : Improving Language Understanding by Generative Pre-Training
GPT-2 : Language Models are Unsupervised Multitask Learners
GPT-3 : Language Models are Few-Shot Learners
DialoGPT : DIALOGPT : Large-Scale Generative Pre-training for Conversational Response Generation

First, let's look at GPT-1.

GPT-1 starts from the observation that labeled data for any specific task is scarce, which makes it hard to train a model that performs well on that task. Unlabeled text corpora, however, are abundant. The paper therefore proposes generative pre-training of a language model on unlabeled text, followed by fine-tuning on each specific task:

Two-stage training:

1) Learning high-capacity language model (Unsupervised pre-training)

Given an unlabeled corpus of tokens $u=\{u_1,...,u_n\}$, the model is trained with the standard LM objective, which maximizes the following likelihood:

$$\mathcal{L}_1(u)=\sum_i \log P(u_i \mid u_{i-k},\dots,u_{i-1};\theta)$$

Here $k$ is the size of the context window and $\theta$ denotes the model parameters. The parameters are trained with SGD, and the network is a multi-layer Transformer decoder. To produce an output distribution over target tokens, the model applies a multi-head self-attention operation over the input context tokens, followed by position-wise feedforward layers.

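$$\begin{aligned} h_0 &= UW_e + W_p \\ h_l &= \mathrm{transformer\_block}(h_{l-1}) \quad \forall l\in[1,n] \\ P(u) &= \mathrm{softmax}(h_nW_e^T) \end{aligned}$$

Here $U$ is the context vector of tokens, $n$ is the number of layers, $W_e$ is the token embedding matrix, and $W_p$ is the position embedding matrix.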
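To make the pre-training equations concrete, here is a minimal pure-Python sketch of the forward pass and of $\mathcal{L}_1$. It is an illustration under stated assumptions, not the paper's implementation: `simplified_block` is a stand-in for the real transformer block (single-head causal self-attention with identity projections and a residual connection, with no feedforward layer or layer normalization), the weights are random, and all function names are made up for this example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def simplified_block(H):
    """Stand-in for transformer_block (assumption): causal single-head
    self-attention with identity projections plus a residual connection."""
    d = len(H[0])
    out = []
    for t, query in enumerate(H):
        # Position t may only attend to positions 0..t (causal mask).
        scores = [sum(a * b for a, b in zip(query, H[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        context = [sum(weights[s] * H[s][j] for s in range(t + 1))
                   for j in range(d)]
        out.append([h + c for h, c in zip(query, context)])
    return out


def next_token_probs(context, W_e, W_p, n_layers):
    """P(u) = softmax(h_n W_e^T), read off at the last context position."""
    # h_0 = U W_e + W_p: token embedding lookup plus position embedding.
    H = [[e + p for e, p in zip(W_e[tok], W_p[pos])]
         for pos, tok in enumerate(context)]
    for _ in range(n_layers):  # h_l = transformer_block(h_{l-1})
        H = simplified_block(H)
    # The output projection reuses the embedding matrix W_e.
    logits = [sum(h * w for h, w in zip(H[-1], row)) for row in W_e]
    return softmax(logits)


def lm_log_likelihood(tokens, k, W_e, W_p, n_layers=2):
    """L_1(u) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1}; theta)."""
    total = 0.0
    for i in range(1, len(tokens)):  # the first token has no context
        context = tokens[max(0, i - k):i]
        probs = next_token_probs(context, W_e, W_p, n_layers)
        total += math.log(probs[tokens[i]])
    return total


# Toy setup: vocabulary of 5 tokens, embedding size 4, context window k = 3.
rng = random.Random(0)
vocab_size, d_model, k = 5, 4, 3
W_e = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(vocab_size)]
W_p = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(k)]
print(lm_log_likelihood([0, 1, 2, 3, 4], k, W_e, W_p))
```

Pre-training amounts to adjusting `W_e`, `W_p`, and the block parameters so that this log-likelihood increases; the sketch only evaluates the objective and does no training.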

2) Fine-tuning stage on a discriminative task with labeled data

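To predict the label $y$, a linear output layer with parameters $W_y$ is added on top of the pre-trained model. Given a labeled dataset $\mathcal{C}$, where each instance is a sequence of input tokens $x^1,...,x^m$ with a label $y$, the final transformer block's activation $h_l^m$ is fed into this layer, and the model is trained to maximize the log probability of $y$: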

$$\begin{aligned} P(y\mid x^1,\dots,x^m) &= \mathrm{softmax}(h_l^m W_y), \\ \mathcal{L}_2(\mathcal{C}) &= \sum_{(x,y)} \log P(y\mid x^1,\dots,x^m). \end{aligned}$$

In addition, to improve the generalization of the supervised model and to speed up convergence, language modeling is added as an auxiliary objective, so the final objective being optimized is:

$$\mathcal{L}_3(\mathcal{C})=\mathcal{L}_2(\mathcal{C})+\lambda\,\mathcal{L}_1(\mathcal{C})$$
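The fine-tuning objectives can be sketched in a few lines of Python. This is only an illustration: the activations, the weights `W_y`, the value of `lam`, and the placeholder value used for $\mathcal{L}_1$ are all hypothetical, and the function names are made up for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def label_log_prob(h_last, W_y, label):
    """log P(y | x^1..x^m) with P = softmax(h_l^m W_y).

    h_last is the final block's activation (length d);
    W_y has shape (d, num_classes).
    """
    num_classes = len(W_y[0])
    logits = [sum(h_last[j] * W_y[j][c] for j in range(len(h_last)))
              for c in range(num_classes)]
    return math.log(softmax(logits)[label])


def supervised_objective(examples, W_y):
    """L_2(C): sum of label log-probabilities over (h_l^m, y) pairs."""
    return sum(label_log_prob(h, W_y, y) for h, y in examples)


def fine_tuning_objective(l2, l1, lam):
    """L_3(C) = L_2(C) + lambda * L_1(C)."""
    return l2 + lam * l1


# Hypothetical final-block activations (d = 3) with binary labels.
examples = [([0.5, -1.0, 0.2], 1), ([1.5, 0.3, -0.7], 0)]
W_y = [[0.1, -0.1], [0.0, 0.4], [-0.3, 0.2]]
l2 = supervised_objective(examples, W_y)
l1 = -12.5  # placeholder for the LM log-likelihood of the same inputs
print(fine_tuning_objective(l2, l1, lam=0.5))
```

Setting `lam` to 0 recovers plain supervised fine-tuning; larger values give the auxiliary language modeling term more weight.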

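GPT-2 is motivated by building a dataset that is as large and diverse as possible, spanning many domains rather than a single one. To that end, the authors built the WebText dataset, using the Dragnet and Newspaper content extractors to extract text from HTML. The paper argues that a language model trained on this large dataset can learn tasks without explicit supervision (zero-shot).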

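Learning a single task can be expressed as estimating $p(output|input)$, but a general system has to perform many different tasks, so it should instead be modeled as $p(output|input,task)$. For example, translation can be conditioned on the task as the sequence (translate to french, english text, french text), and reading comprehension as (answer the question, document, question, answer). This can be described as unsupervised multitask learning.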
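As a small illustration of task conditioning, the tuples above can be flattened into plain text so that a language model sees the task, the input, and the output as one sequence. The newline separator and the function names below are assumptions made for this sketch; the source does not specify a serialization format.

```python
def serialize_task_example(task, *fields, sep="\n"):
    """Join a task description and its fields into one training sequence."""
    return sep.join((task, *fields))


def conditioning_prefix(task, *inputs, sep="\n"):
    """Text the model is conditioned on at inference time.

    The output is whatever the model generates after this prefix,
    which corresponds to p(output | input, task).
    """
    return sep.join((task, *inputs)) + sep


# (translate to french, english text, french text)
translation = serialize_task_example(
    "translate to french", "english text", "french text")

# (answer the question, document, question, answer): the answer is left
# for the model to generate.
prompt = conditioning_prefix("answer the question", "document", "question")

print(translation)
print(prompt)
```

Because the task is part of the text itself, the same model and the same training objective cover every task; no task-specific head is needed.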

In short, the main differences from GPT-1 are training on the large WebText dataset and framing the problem as task-conditioned unsupervised multitask learning. The changes to the model architecture are as follows.


Layer normalization was moved to the input of each sub-block, similar to a pre-activation residual network and an additional layer normalization was added after the final self-attention block. A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of $1/\sqrt{N}$ where $N$ is the number of residual layers. The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens and a larger batch size of 512 is used.
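Two of these changes, moving layer normalization to the sub-block input and scaling residual weights by $1/\sqrt{N}$, can be sketched as follows. This is a simplified illustration rather than the GPT-2 code: `layer_norm` has no learnable gain or bias, `f` stands for any sub-layer (attention or feedforward), and `base_std=0.02` is an assumed value.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_sublayer(x, f):
    """Earlier arrangement: normalize after the residual addition."""
    return layer_norm([a + b for a, b in zip(x, f(x))])


def pre_ln_sublayer(x, f):
    """GPT-2 arrangement: normalize the sub-block input, leaving the
    residual path itself untouched."""
    return [a + b for a, b in zip(x, f(layer_norm(x)))]


def init_residual_weights(rows, cols, n_residual_layers, base_std=0.02, seed=0):
    """Gaussian init whose standard deviation is scaled by 1/sqrt(N)."""
    rng = random.Random(seed)
    std = base_std / math.sqrt(n_residual_layers)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


x = [1.0, 2.0, 4.0]
halve = lambda h: [0.5 * v for v in h]  # toy sub-layer
print(post_ln_sublayer(x, halve))
print(pre_ln_sublayer(x, halve))
print(init_residual_weights(2, 2, n_residual_layers=4))
```

In the pre-LN form, the input `x` reaches the output through an unnormalized identity path, so contributions from every sub-layer accumulate on that path as depth grows. Shrinking the initial residual weights by $1/\sqrt{N}$ is the quoted remedy for that accumulation.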

The clear changes in GPT-3 are the enormous scale of the model parameters and few-shot learning.

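The chart above (from the GPT-3 paper) shows model performance on the task of removing random symbols mixed into a word. I think it illustrates well the two points GPT-3 makes. The first is accuracy as a function of model size: accuracy keeps improving as the number of parameters grows. The second is performance as a function of the few-shot setting. Here, Number of Examples is the number of samples given as demonstrations in the prompt, and the more examples there are, the better the performance. No model update takes place at any point.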

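The figure above shows examples of fine-tuning, zero-shot, and few-shot. As the examples show, few-shot here is like asking, "A is Apple, B is Banana, so what is C?" In practice, of course, the prompt is also conditioned on the task, as in GPT-2's zero-shot setting. For more detail on GPT-3, the following well-organized blog post is a good reference.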
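The few-shot setting can be illustrated by how the prompt is assembled: a task description, K solved examples, and then the unsolved query, all as plain text with no gradient update. The sketch below follows the "source => target" layout used in the GPT-3 paper's translation example; the function name and arguments are made up for this illustration.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Build an in-context prompt: task, K demonstrations, then the query.

    examples is a list of (source, target) pairs. An empty list gives a
    zero-shot prompt, one pair gives one-shot, and so on.
    """
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is expected to continue from here
    return "\n".join(lines)


demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]
print(build_few_shot_prompt("Translate English to French:", demos, "cheese"))
```

The number of pairs in `examples` corresponds to the Number of Examples axis discussed above: the model parameters stay fixed and only the prompt changes.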

DialoGPT uses Mutual Information Maximization so that it can model $P(Target, Source)$.

Mutual Information Maximization

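The authors argue that because a response is generated conditioned only on the source, the model tends to produce bland, generic responses that have little to do with the source. To condition in the reverse direction as well, the following is maximized: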

$$\hat{T}=\arg\max_T\{(1-\lambda)\log P(T\mid S)+\lambda\log P(S\mid T)\}$$

$\lambda$ is a hyperparameter that controls how strongly generic responses are penalized.
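The objective above can be read as a reranking rule over candidate responses. The sketch below assumes that the forward score $\log P(T|S)$ and the backward score $\log P(S|T)$ have already been computed by two models; the candidate texts, the log-probabilities, and the function names are hypothetical values chosen only for illustration.

```python
def mmi_score(log_p_t_given_s, log_p_s_given_t, lam):
    """(1 - lambda) * log P(T|S) + lambda * log P(S|T)."""
    return (1 - lam) * log_p_t_given_s + lam * log_p_s_given_t


def rerank(candidates, lam):
    """Return the candidate response with the highest MMI score.

    candidates is a list of (response, log P(T|S), log P(S|T)) tuples.
    """
    best = max(candidates, key=lambda c: mmi_score(c[1], c[2], lam))
    return best[0]


# Hypothetical scores: the generic reply is likely given any source
# (high forward score) but says little about this source (low backward score).
candidates = [
    ("I don't know.", -2.0, -12.0),
    ("The train to Busan leaves at 9 am.", -4.0, -5.0),
]
print(rerank(candidates, lam=0.0))  # forward score only
print(rerank(candidates, lam=0.5))  # forward and backward scores combined
```

With `lam=0.0` only the forward score counts, so the generic reply wins. With `lam=0.5` the backward term penalizes it, and the source-specific reply is selected instead.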

License: Attribution, NonCommercial, NoDerivatives (CC BY-NC-ND)


Wide and Deep Programming
