Chinese Natural Language Understanding (NLU) in Dialogue Systems (3.1): Academic Methods (Joint Training)
3 Academic Methods
In this section, we cover two tasks in detail: intent classification and slot filling. We also review several previous studies to see how they approach these two tasks. You will notice that some of them use the Chinese-specific features introduced in the previous article.
Note that this series of articles does not cover end-to-end methods (e.g. feeding the utterance into a language model that directly generates the intent classification result, the slot-filling result, or both).
The Two Tasks
Intent Classification Task: similar to text classification, the task is to categorise an utterance into one of a set of pre-defined classes, e.g. Play News, Play Music, Turn Up Volume.
Slot Filling Task: once the model has understood the intent, it needs further details in order to carry out the task. For example, when the intent is Play News, the model also wants to know where and on which day the news happened. Here Date and Location are the slots, and the corresponding pieces of information are the values that fill them.
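As a concrete illustration (a hypothetical example, not taken from any of the papers below), an annotated training instance for the two tasks might look like this:

```python
# One utterance, one intent label, and one BIO-style slot tag per token.
example = {
    "utterance": ["play", "today", "'s", "news", "from", "Beijing"],
    "intent": "PlayNews",
    "slots":  ["O", "B-Date", "O", "O", "O", "B-Location"],
}
```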
Methods
There are roughly two popular types of approaches.
Joint training: the two tasks share one model body, with specially designed outputs; typically each task has its own output layer on top of the shared encoder.
Separate training: the two tasks are treated as if they were unrelated, i.e. a separate method is designed for each task.
Since this article focuses on methods for Chinese, let us briefly mention how methods for Chinese and English relate. As the figure illustrates, the two sets of methods overlap: some methods work for both Chinese and English. Note, however, that there are also language-specific methods that cannot be directly applied to the other language.
3.1 Joint Training
BERT for Joint Intent Classification and Slot Filling, 2019, DAMO Academy, Alibaba
Let us start with this paper. The figure above shows how the model handles English text; this framework is quite popular at the moment.
The input consists of two parts: the special token [CLS] and the utterance tokens. The output also consists of two parts: the intent classification result for the whole utterance (Play News in the figure) and a sequence-labelling tag for each token (today → Date, Beijing → Location).
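A minimal sketch of this joint architecture in PyTorch (my own illustration, not the authors' released code; the Hugging Face `transformers` package and the `bert-base-chinese` checkpoint are assumptions):

```python
import torch
import torch.nn as nn
from transformers import BertModel  # assumes the Hugging Face `transformers` package

class JointBert(nn.Module):
    def __init__(self, num_intents, num_slot_labels, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)      # shared encoder
        hidden = self.bert.config.hidden_size
        self.intent_head = nn.Linear(hidden, num_intents)     # uses the [CLS] representation
        self.slot_head = nn.Linear(hidden, num_slot_labels)   # one label per token

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        intent_logits = self.intent_head(out.pooler_output)       # (batch, num_intents)
        slot_logits = self.slot_head(out.last_hidden_state)       # (batch, seq_len, num_slot_labels)
        return intent_logits, slot_logits

# Joint training simply sums the two cross-entropy losses, e.g.
# loss = ce(intent_logits, intent_labels) + ce(slot_logits.transpose(1, 2), slot_labels)
```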
As the table in the top right shows, the model achieves good results on two publicly available datasets. So how can we apply it to Chinese without changing the model structure?
One simple approach is to treat each Chinese character as an individual input token, with each character getting its own embedding vector.
Alternatively, we can first run Chinese word segmentation on the text (i.e. split it into Chinese words) and then feed word vectors as the input.
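For illustration, here is how the two input schemes differ on a short utterance (the `jieba` segmenter and the exact segmentation output are assumptions; any Chinese word segmenter would do):

```python
import jieba  # a widely used Chinese word segmentation package

text = "播放今天北京的新聞"           # "Play today's news from Beijing"

char_tokens = list(text)             # character-level: 播 / 放 / 今 / 天 / ...
word_tokens = jieba.lcut(text)       # word-level, e.g. 播放 / 今天 / 北京 / 的 / 新聞

print(char_tokens)
print(word_tokens)
```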
You may ask: do we have to use BERT here? What if my business scenario or task requires a more lightweight model that still delivers acceptable performance?
In that case, BERT is not mandatory. Even if you replace the shared encoder with another model, you can still reach decent performance as long as the task is not too difficult.
If you are new to NLP and not yet familiar with the CRF layer, you can refer to another series of articles I wrote before: CRF Layer on the Top of BiLSTM (https://createmomo.github.io/2017/09/12/CRF_Layer_on_the_Top_of_BiLSTM_1/). Of course, you can also read other authors' excellent notes; I am sure you will eventually figure out how it works.
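For readers who just want to see where a CRF layer sits in code, here is a minimal sketch assuming the third-party `pytorch-crf` package (import name `torchcrf`); in a real model the emission scores would come from the shared encoder rather than random tensors:

```python
import torch
from torchcrf import CRF  # assumes the `pytorch-crf` package

num_tags = 5                                   # e.g. O, B-Date, I-Date, B-Location, I-Location
crf = CRF(num_tags, batch_first=True)

emissions = torch.randn(2, 6, num_tags)        # (batch, seq_len, num_tags) per-token scores
tags = torch.randint(0, num_tags, (2, 6))      # gold label indices

loss = -crf(emissions, tags)                   # negative log-likelihood of the gold sequence
best_paths = crf.decode(emissions)             # Viterbi decoding: list of tag-index sequences
```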
A joint model of intent determination and slot filling for spoken language understanding, IJCAI 2016, Peking University
The structure shown in this paper is: input vectors → bidirectional GRU → a CRF layer for sequence labelling (i.e. slot filling), while the output at the last token is used to predict the intent.
Note that the input here is carefully designed. Each word has its own vector, but it is not the individual word vectors that enter the model directly. Instead, every three consecutive word vectors are combined into a new vector, and it is this new vector that is fed into the GRU.
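A small sketch of this input trick (my own simplification, with made-up dimensions): each GRU step receives the concatenation of the previous, current, and next word vectors.

```python
import torch
import torch.nn as nn

emb_dim, hidden = 100, 128
embeddings = torch.randn(1, 7, emb_dim)                  # (batch, seq_len, emb_dim) pretrained word vectors

padded = nn.functional.pad(embeddings, (0, 0, 1, 1))     # pad one step at each end of the time axis
windows = torch.cat([padded[:, :-2], padded[:, 1:-1], padded[:, 2:]], dim=-1)  # (1, 7, 3 * emb_dim)

bigru = nn.GRU(3 * emb_dim, hidden, bidirectional=True, batch_first=True)
outputs, _ = bigru(windows)          # per-step outputs feed the CRF slot tagger
last_step = outputs[:, -1]           # the last step's output feeds the intent classifier
```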
The figure below shows the results. Note that the word vectors used in this paper were pre-trained on a large Chinese corpus.
Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling, Interspeech 2016, Carnegie Mellon University
This paper proposes two models.
Model 1: Attention-Based RNN
Shared part: input → BiLSTM → outputs → compute attention weight scores (in this example there are 16 scores in total; at the position "From", for instance, we need to derive 4 weights, and the situation is similar at the other positions, so every step computes 4 attention weight scores) → compute a weighted combination (a context vector per position) from these attention scores.
Some of you may ask: how is this attention score calculated? Nowadays there are many different ways to compute attention scores; it can be as simple as a few matrix and vector operations, or it can be determined by a small neural network. If you are interested, read the paper carefully to see how it is done in this particular work.
Intent classification part: if there are 4 context vectors, as in the figure, stack them into a matrix and apply mean-pooling to obtain a new vector, which is then used for the classification task.
Slot-filling part: use the per-position vectors to predict the label at each position (e.g. From → O, LA → B-From, to → O, Seattle → B-To).
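To make the data flow concrete, here is a rough sketch of Model 1 using plain dot-product attention (the paper's scoring function and output layers differ; dimensions and label counts are made up):

```python
import torch
import torch.nn as nn

hidden = 64
bilstm = nn.LSTM(100, hidden, bidirectional=True, batch_first=True)
slot_head = nn.Linear(2 * hidden, 7)      # e.g. O, B-From, I-From, B-To, I-To, ...
intent_head = nn.Linear(2 * hidden, 5)    # e.g. BookFlight, PlayNews, ...

x = torch.randn(1, 4, 100)                # embeddings for "From LA to Seattle"
h, _ = bilstm(x)                          # (1, 4, 2 * hidden)

scores = torch.bmm(h, h.transpose(1, 2))  # (1, 4, 4): 16 attention scores, 4 per position
weights = scores.softmax(dim=-1)
context = torch.bmm(weights, h)           # one context vector per position

slot_logits = slot_head(h + context)               # per-token slot label scores
intent_logits = intent_head(context.mean(dim=1))   # mean-pooled context -> intent scores
```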
Model 2: Attention-Based Encoder-Decoder
Shared part: identical to Model 1. Input → BiLSTM → outputs → compute attention weight scores (16 in total in this example, 4 at each position) → compute context vectors from these attention scores.
Intent classification part: combine the BiLSTM output at the last step with one context vector (the weighted sum of the 4 attention-weighted encoder outputs) → a simple feed-forward neural network → the classification result.
Slot-filling part: this part works like token-by-token text generation, except that what is generated are the sequence-labelling tags. The input at each step consists of two parts: the BiLSTM-related outputs (the aligned encoder output and the attention context) plus the decoder LSTM's output from the previous step. Each step predicts the tag from this input.
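A sketch of one decoding step (my own simplification; the paper's exact wiring differs): the slot tag at step t is predicted from the aligned encoder output, an attention context, and the decoder's previous state.

```python
import torch
import torch.nn as nn

enc_dim, dec_dim, num_labels = 128, 64, 7
decoder_cell = nn.LSTMCell(2 * enc_dim, dec_dim)       # input: encoder output + context, concatenated
label_head = nn.Linear(dec_dim, num_labels)

enc_outputs = torch.randn(1, 4, enc_dim)               # BiLSTM encoder outputs for 4 tokens
h, c = torch.zeros(1, dec_dim), torch.zeros(1, dec_dim)

for t in range(enc_outputs.size(1)):
    # simple dot-product attention of position t over all encoder positions
    scores = torch.bmm(enc_outputs, enc_outputs[:, t].unsqueeze(-1)).squeeze(-1)   # (1, 4)
    context = (scores.softmax(dim=-1).unsqueeze(-1) * enc_outputs).sum(dim=1)      # (1, enc_dim)
    step_input = torch.cat([enc_outputs[:, t], context], dim=-1)                   # (1, 2 * enc_dim)
    h, c = decoder_cell(step_input, (h, c))
    slot_logits_t = label_head(h)                      # label distribution for token t
```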
CM-Net: A Novel Collaborative Memory Network for Spoken Language Understanding, EMNLP 2019, Beijing Jiaotong University, WeChat AI Tencent
This paper introduces two memory modules: a slot memory and an intent memory. The slot memory contains one vector per slot type in the task; similarly, the intent memory contains one vector per intent. The expectation is that these two modules remember useful information, and that their memories help improve accuracy when the model makes predictions.
那么記憶模塊有了,該怎么讓他們在模型做預(yù)測的時(shí)候提供幫助呢?這篇論文還設(shè)計(jì)了一個(gè)特別的神經(jīng)網(wǎng)絡(luò)塊:CM-block。它的作用就是聰明地綜合輸入的文本以及記憶模塊中的信息,來輔助最終輸出層(Inference Layer)做決策。如果你想增強(qiáng)整個(gè)模型的推理能力,可以考慮疊加更多的這種神經(jīng)網(wǎng)絡(luò)塊。The problem is how they work when the model is making predictions. This paper designed a special neural network block: the CM-block. This block is able to intelligently combine input text and the information from the memory block to improve the Inference Layer in making decisions. One way which may increase the reasoning ability of this model is to stack more of these neural network blocks.
Note that for the slot-filling task (a kind of sequence labelling), this paper does not use the BIO2 tagging scheme but BIOES instead. The figure below shows how the two schemes differ on the same sentence.
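For example (a hypothetical sentence, not taken from the paper), the same location entity would be tagged differently under the two schemes:

```python
tokens = ["上", "海", "浦", "東"]                       # "Shanghai Pudong" as characters

bio2_tags  = ["B-LOC", "I-LOC", "I-LOC", "I-LOC"]      # BIO2: only begin/inside/outside
bioes_tags = ["B-LOC", "I-LOC", "I-LOC", "E-LOC"]      # BIOES: E marks the entity end, S marks single-token entities
```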
Injecting Word Information with Multi-Level Word Adapter for Chinese Spoken Language Understanding, ICASSP 2021, Harbin Institute of Technology
As you will see, this work does not only use basic Chinese character features: it also pre-processes the utterance with word segmentation and then fuses the resulting word information into the model.
The model in the figure below looks somewhat complex, and we will not go into the details of its structure. The point is that it combines information from both characters and words. The authors argue that although word segmentation occasionally produces errors, this does not prevent the model from extracting information from the segmented words that helps it understand Chinese.
As we can see in the figure, the basic input to the model is Chinese characters (我, 洗, 手, 了).
At the same time, the model also takes into account the word segmentation result: 我, 洗手, 了.
The paper also proposes a small module called the Word Adapter. You can think of it as a general-purpose gadget whose job is to combine two different kinds of information cleverly, where "cleverly" means it can judge how much weight each source should get (a weighted sum), and this judgement is learnable during training. This module is used here to fuse the character-level and word-level information more intelligently.
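A minimal sketch of such a gated fusion (my own simplified version, not the authors' multi-level adapter):

```python
import torch
import torch.nn as nn

class WordAdapter(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)   # looks at both inputs to decide the mixing weight

    def forward(self, char_vec, word_vec):
        alpha = torch.sigmoid(self.gate(torch.cat([char_vec, word_vec], dim=-1)))
        return alpha * char_vec + (1 - alpha) * word_vec   # learnable weighted sum

adapter = WordAdapter(dim=128)
fused = adapter(torch.randn(1, 128), torch.randn(1, 128))  # fuse a character and an aligned word vector
```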
The table shows the model's performance. Compared with many strong baselines, the improvement is clear.
Next
The next article will review several studies that address "Intent Classification" or "Slot Filling" individually.