[TDS] Understanding MinTL: Minimalist Transfer Learning for Task-Oriented Dialogue Systems

MinTL

This post walks through the methodology of MinTL: Minimalist Transfer Learning for Task-Oriented Dialogue Systems.

Abstract

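Collecting and annotating data to train dialogue systems is time-consuming, and the result transfers poorly across domains. Pre-trained language models are therefore used to reduce human supervision. Earlier task-oriented dialogue systems (TDS) usually consist of several task-specific modules, and those modules rarely go through a pre-training stage. As a result, adapting a pre-trained LM to a different dialogue task requires task-specific architecture changes. MinTL (Minimalist Transfer Learning) simplifies the system design process of TDS and reduces the dependence on annotated data.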

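MinTL introduces Levenshtein belief spans ($Lev$) for dialogue state tracking. Given the dialogue context and the previous dialogue state, the model first decodes $Lev$, which is used to update the previous state. The updated state is then used to query an external knowledge base. Finally, the response decoder decodes the response conditioned on the dialogue context and the knowledge base match result.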

Contributions

1) They propose the MinTL framework that efficiently leverages pre-trained language models for task-oriented dialogue without any ad hoc module.

2) They introduce $Lev$ for efficiently tracking the dialogue state with the minimal length of generation, which greatly reduces the inference latency.

3) They instantiate their framework with two different pre-trained backbones, and both improve the SOTA results by a large margin.

4) They demonstrate the robustness of their approach in the low-resource setting. Using only 20% of the training data, MinTL-based systems achieve results competitive with the SOTA.

Methods

Notation

\begin{aligned} \mbox{dialogue } \mathcal{C}&=\{U_1,R_1,...,U_T,R_T\} \\ \mbox{dialogue context } \mathcal{C}_t&=\{U_{t-w},R_{t-w},...,R_{t-1},U_t\} \\ \mbox{dialogue states } \mathcal{B}&=\{B_1,...,B_T\} \\ \mbox{domains } \mathcal{D}&=\{d_1,...,d_N\} \\ \mbox{slots } \mathcal{S}&=\{s_1,...,s_M\} \end{aligned}

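Here $U$ and $R$ denote the user utterance and the system response, $t$ is the turn index, and $w$ is the context window size. $B_t$, the dialogue state at turn $t$, is a dictionary that maps each (domain $d_i$, slot $s_j$) pair to a value $v$. The value of a pair $(d_i,s_j)$ is written $B_t(d_i,s_j)=v$. When the key $(d_i,s_j)$ is not in $B_t$, $B_t(d_i,s_j)=\varepsilon$, where $\varepsilon$ is the empty string, so $|\varepsilon|=0$.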

Levenshtein Belief Spans ($Lev$)

The idea behind $Lev$ is to generate only a minimal belief span at each turn, just enough to edit the previous dialogue state.

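Prior work introduced the concept of a $Belief\mbox{ }span$. Compared with classification-based DST, generative DST models can predict slot values without full access to a predefined ontology. However, such generative models either generate the belief span from scratch or classify state operations over every combination of domain-slot pairs in order to decode the necessary slot values. This does not scale well when interfacing with a large number of services or APIs across multiple domains.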

During training, $Lev$ is constructed as the DST training target.

$Lev$ Preprocessing

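Comparing $B_{t-1}$ with $B_t$ identifies the pairs $(d_i,s_j)$ that need to be edited. Three slot-level edit operations are defined: insertion $\mbox{(INS)}$, deletion $\mbox{(DEL)}$, and substitution $\mbox{(SUB)}$.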

\begin{aligned} \mbox{INS}&\rightarrow B_t(d_i,s_j)\ne \varepsilon \land B_{t-1}(d_i,s_j)=\varepsilon \\ \mbox{DEL}&\rightarrow B_t(d_i,s_j)= \varepsilon \land B_{t-1}(d_i,s_j)\ne\varepsilon \\ \mbox{SUB}&\rightarrow B_t(d_i,s_j)\ne B_{t-1}(d_i,s_j). \end{aligned}

To update $B_{t-1}(d_i,s_j)$ to $B_t(d_i,s_j)$ for a slot $s_j$ in domain $d_i$, the minimal slot-value pair that has to be generated, $E(d_i,s_j)$, is defined as follows.

E(d_i,s_j)= \begin{cases} \begin{aligned} &s_j\oplus B_t(d_i,s_j) &\mbox{if INS} \\ &s_j\oplus \mbox{NULL} &\mbox{if DEL} \\ &s_j\oplus B_t(d_i,s_j) &\mbox{if SUB} \\ &\varepsilon &\mbox{otherwise} \end{aligned} \end{cases}

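Here $\oplus$ denotes string concatenation, and NULL is the symbol that marks the slot $(d_i,s_j)$ for deletion from $B_{t-1}$. All $E(d_i,s_j)$ for a domain $d_i$ are then aggregated as follows.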

L(d_i)=E(d_i,s_1)\oplus...\oplus E(d_i,s_M)

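If $L(d_i)$ is the empty string, nothing in domain $d_i$ needs to be updated. Otherwise, so that the domain $d_i$ is recorded in $Lev$, the domain token $[d_i]$ is prepended to $L(d_i)$: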

\delta(L,d_i)= \begin{cases} \begin{aligned} &[d_i]\oplus L(d_i) &\mbox{ if }L(d_i)\ne\varepsilon \\ &\varepsilon &\mbox{ otherwise} \end{aligned} \end{cases}

Finally, $Lev$ is formed. $Lev$ is the string obtained by concatenating the slot values to be edited across all domains.

Lev=\delta(L,d_1)\oplus...\oplus\delta(L,d_N)
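The construction of $E$, $L$, $\delta$ and $Lev$ above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' preprocessing code: the dialogue state is a plain dictionary keyed by (domain, slot), $\oplus$ is rendered as space-separated concatenation, and the names `edit_span` and `build_lev` as well as the literal spelling `NULL` are made up for this example.

```python
from typing import Dict, List, Tuple

# (domain, slot) -> value; a missing key plays the role of the empty string.
State = Dict[Tuple[str, str], str]

NULL = "NULL"  # deletion symbol; the exact spelling is an assumption


def edit_span(prev: State, curr: State, d: str, s: str) -> str:
    """E(d_i, s_j): minimal slot-value text needed to update one pair."""
    old, new = prev.get((d, s), ""), curr.get((d, s), "")
    if new == old:
        return ""                # otherwise-case: nothing to generate
    if new == "":
        return f"{s} {NULL}"     # DEL
    return f"{s} {new}"          # INS or SUB


def build_lev(prev: State, curr: State,
              domains: List[str], slots: List[str]) -> str:
    """Lev: concatenation of delta(L, d_i) over all domains."""
    parts = []
    for d in domains:
        spans = [edit_span(prev, curr, d, s) for s in slots]
        l_d = " ".join(x for x in spans if x)   # L(d_i)
        if l_d:                                  # delta(L, d_i)
            parts.append(f"[{d}] {l_d}")
    return " ".join(parts)


prev = {("hotel", "area"): "north", ("hotel", "stay"): "2"}
curr = {("hotel", "stay"): "3", ("hotel", "people"): "5"}
lev = build_lev(prev, curr, ["hotel", "train"], ["area", "people", "stay"])
print(lev)  # [hotel] area NULL people 5 stay 3
```

Only the three changed slots appear in the output, and the untouched `train` domain contributes nothing, which is what keeps the generation length short.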

Inference time

At turn $t$, the model first generates $Lev_t$, and a deterministic function $f$ then edits $B_{t-1}$:

B_t=f(Lev_t,B_{t-1})

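The function updates $B_{t-1}$ whenever a new slot-value pair appears in $Lev_t$, and deletes the corresponding slot when NULL is generated. In the paper's $Lev$ editing example, $Lev_6$ inserts a value into the slot $people$, and the NULL in $Lev_7$ deletes the slot $(hotel,area)$ from $B_6$, which is the same as $B_7(hotel,area)=\varepsilon$.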
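A minimal sketch of such an editing function $f$ follows. It assumes a space-separated $Lev$ string of the form `[domain] slot value ...`, single-token slot names that never occur inside a value, and the literal `NULL` as the deletion symbol. The name `apply_lev` and this parsing scheme are assumptions for illustration, not the paper's implementation.

```python
from typing import Dict, Set, Tuple

State = Dict[Tuple[str, str], str]
NULL = "NULL"  # deletion symbol; the exact spelling is an assumption


def apply_lev(lev: str, prev: State, slots: Set[str]) -> State:
    """B_t = f(Lev_t, B_{t-1}): insert/substitute values, delete on NULL."""
    state = dict(prev)            # leave B_{t-1} untouched
    domain, slot, value = None, None, []

    def flush() -> None:
        # Commit the slot collected so far to the new state.
        if domain is None or slot is None:
            return
        text = " ".join(value)
        if text == NULL:
            state.pop((domain, slot), None)   # DEL
        elif text:
            state[(domain, slot)] = text      # INS or SUB

    for tok in lev.split():
        if tok.startswith("[") and tok.endswith("]"):
            flush()
            domain, slot, value = tok[1:-1], None, []
        elif tok in slots and domain is not None:
            flush()
            slot, value = tok, []
        elif slot is not None:
            value.append(tok)
    flush()
    return state


slots = {"area", "people", "stay"}
b5 = {("hotel", "area"): "north"}
b6 = apply_lev("[hotel] people 5", b5, slots)   # INS
b7 = apply_lev("[hotel] area NULL", b6, slots)  # DEL
print(b6)
print(b7)
```

An empty $Lev$ leaves the state unchanged, so turns without new information cost almost nothing to decode.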

MinTL Framework

The input to the MinTL framework is the dialogue context $\mathcal{C}_t$ and the previous dialogue state $B_{t-1}$. The sub-sequences are joined with special segment tokens:

\mbox{Input: }B_{t-1}\mbox{<EOB>}...R_{t-1}\mbox{<EOR>}U_t\mbox{<EOU>} \\ H=Encoder(\mathcal{C}_t,B_{t-1})

$H\in \mathbb{R}^{I\times d_{model}}$ is the encoder hidden states, where $I$ is the input sequence length. The $Lev$ decoder attends to $H$ and decodes $Lev_t$:

Lev_t=Decoder_L(H)

The training objective is to minimize the negative log-likelihood of $Lev_t$ given $\mathcal{C}_t$ and $B_{t-1}$.

\mathcal{L}_L=-\log p(Lev_t|\mathcal{C}_t,B_{t-1})

The generated $Lev_t$ is then passed to the function $f$ to edit $B_{t-1}$.
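The input layout described in this section (previous state first, then the context turns, each closed by its segment token) can be sketched as below. The token spellings `<EOB>`, `<EOR>` and `<EOU>` come from the input format above. The function name `build_input`, the whitespace handling, and the already-serialized state string are assumptions made for this example.

```python
from typing import List, Tuple


def build_input(prev_state: str, context: List[Tuple[str, str]]) -> str:
    """Serialize B_{t-1} and the context window C_t into one encoder input."""
    end_token = {"user": "<EOU>", "system": "<EOR>"}
    pieces = [prev_state, "<EOB>"]
    for speaker, text in context:
        pieces.append(text)
        pieces.append(end_token[speaker])
    # Skip empty pieces so an empty previous state does not add stray spaces.
    return " ".join(p for p in pieces if p)


encoder_input = build_input(
    "[hotel] area north",
    [
        ("user", "i need a hotel in the north"),
        ("system", "how many people will be staying ?"),
        ("user", "5 people"),
    ],
)
print(encoder_input)
```

The resulting string would be tokenized by the backbone's own tokenizer, with the three segment tokens registered as additional special tokens.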

Knowledge Base Query

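The updated $B_t$ is used to query the knowledge base, and the query result is categorized as $k_t$ using the criteria $T_1$ and $T_2$. Based on $k_t$, one embedding $e_k\in\mathbb{R}^{d_{model}}$ is looked up from the learnable KB state embeddings $E_k\in \mathbb{R}^{K\times d_{model}}$, where $K$ is the number of possible KB states. The embedding $e_k$ is then used as the start token embedding of the response decoder to generate $R_t$.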

R_t=Decoder_R(H,e_k)

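The response objective is to minimize the negative log-likelihood of $R_t$ given $B_{t-1},\mathcal{C}_t,k_t$. The total loss $\mathcal{L}$ adds this to the $Lev$ construction loss, and the two are optimized jointly.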

\mathcal{L}_R=-\log p(R_t|\mathcal{C}_t,B_{t-1},k_t) \\ \mathcal{L}=\mathcal{L}_L+\mathcal{L}_R
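To make the embedding lookup and the joint objective concrete, here is a small numeric sketch in plain Python. Everything in it is hypothetical: the table of $K=3$ embeddings with $d_{model}=4$, the chosen index `k_t`, and the per-token probabilities are made-up numbers. The only thing taken from the text is the structure: pick one row $e_k$ of $E_k$, compute each negative log-likelihood as a sum over tokens, and add the two losses.

```python
import math
from typing import List, Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """-log p(sequence) for an autoregressive decoder: sum of -log p(token | prefix)."""
    return -sum(math.log(p) for p in token_probs)


# Hypothetical KB-state embedding table E_k (K = 3 rows, d_model = 4).
E_k: List[List[float]] = [
    [0.0, 0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6, 0.7],
    [0.8, 0.9, 1.0, 1.1],
]
k_t = 1            # categorized KB query result (made-up value)
e_k = E_k[k_t]     # start-token embedding for the response decoder

# Made-up probabilities the two decoders assign to the gold tokens.
lev_probs = [0.9, 0.8, 0.95]
resp_probs = [0.7, 0.6, 0.9, 0.85]

loss_L = sequence_nll(lev_probs)    # L_L
loss_R = sequence_nll(resp_probs)   # L_R
loss = loss_L + loss_R              # L = L_L + L_R
print(e_k, round(loss, 4))
```

In a real model the probabilities come from the decoders' softmax outputs and $E_k$ is a trainable parameter, so minimizing the summed loss updates the encoder, both decoders, and the embedding table together.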

Backbone Models

The encoder and decoders are initialized with the weights of pre-trained language models. The paper uses BART and the Text-To-Text Transfer Transformer (T5).

BART (Lewis et al., 2019)

BART is an encoder-decoder Transformer with a bidirectional encoder and an autoregressive decoder. It is pre-trained as a denoising autoencoder: documents are corrupted, and the model optimizes the cross-entropy between the decoder output and the original document as a reconstruction loss. BART applies five document corruption methods during pre-training: Token Masking, Token Deletion, Text Infilling, Sentence Permutation, Document Rotation.

T5 (Raffel et al., 2019)

T5 is an encoder-decoder Transformer that uses relative position embeddings. It is pre-trained on the Colossal Clean Crawled Corpus (C4), which contains about 750GB of clean and natural English text. The pre-training objective is span prediction: about 15% of the input tokens are masked out as spans, and the decoder predicts the missing spans.
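The span-prediction objective can be illustrated with a toy function. This sketch takes the spans to mask as an explicit argument instead of sampling them, and it uses sentinel names of the form `<extra_id_N>`, the spelling used by common T5 tokenizers. The function name `span_corrupt` is made up for this example.

```python
from typing import List, Tuple


def span_corrupt(tokens: List[str],
                 spans: List[Tuple[int, int]]) -> Tuple[List[str], List[str]]:
    """Replace each [start, end) span with a sentinel; the target lists the dropped spans."""
    corrupted, target = [], []
    pos = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted.extend(tokens[pos:start])
        corrupted.append(sentinel)          # span is hidden from the encoder
        target.append(sentinel)
        target.extend(tokens[start:end])    # decoder must reproduce it
        pos = end
    corrupted.extend(tokens[pos:])
    target.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return corrupted, target


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(corrupted))  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(target))     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Because the target contains only the dropped spans, it is much shorter than the input, which is the same economy that $Lev$ exploits for state tracking.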

Experiments

Evaluation uses the MultiWOZ 2.0 dataset (restaurant, train, attraction, hotel, taxi, hospital, police).

Implementation Details

T5-small (60M parameters): 6 layers each in the encoder and decoder

T5-base (220M parameters): 12 layers each in the encoder and decoder

BART-large (400M parameters): 12 layers each in the encoder and decoder

For a fair comparison, the DAMD pre-processing is followed. For reference, GPT2-small has 117M parameters and GPT2-medium has 345M.

See the paper for the experimental results in the low-resource setting.
