General Understanding of Decoding Strategies Commonly Used in Text Generation

深度学习自然语言处理 · Source: 看個通俗理解吧 · 2023-03-13

Note: this post covers only commonly used decoding strategies and does not include more recent research findings.


This post covers:

  1. Background
  2. Problem
  3. Decoding Strategies
  • Standard Greedy Search
  • Beam Search
  • Sampling
  • Top-k Sampling
  • Sampling with Temperature
  • Top-p (Nucleus) Sampling
  4. Code Tips
  5. Summary

1. Background

"Autoregressive" means that when the model generates text, it does not produce a whole passage at once; it generates the text one word at a time.

For example, as shown in the figure:

1) A user asks the model a question: "Hello! How are you today?"

2) The model needs to generate a reply; the first word it generates is "I".

3) Once "I" has been generated, the model continues to generate the next word based on the information it has so far (the question + "I"): "am".

4) Once "I am" has been generated, the model continues to generate the next word based on the information it has so far (the question + "I am"): "good".

5) These steps repeat until a special token is generated, signalling that this round of text generation can stop. In the example, generation is complete when "[EOS]" (End of Sentence) is produced.
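To make this loop concrete, here is a minimal sketch in Python. The tiny vocabulary, the canned probabilities, and the `next_token_probs` helper are all made up for illustration; a real system would query a trained language model here instead.

```python
import numpy as np

VOCAB = ["I", "am", "good", "fine", "[EOS]"]

def next_token_probs(tokens):
    # Hypothetical stand-in for a trained model: given the text so far,
    # return a probability for every word in the vocabulary.
    canned = {
        (): [0.80, 0.05, 0.05, 0.05, 0.05],
        ("I",): [0.05, 0.80, 0.05, 0.05, 0.05],
        ("I", "am"): [0.05, 0.05, 0.50, 0.35, 0.05],
    }
    return np.array(canned.get(tuple(tokens), [0.05, 0.05, 0.05, 0.05, 0.80]))

tokens = []
while True:
    probs = next_token_probs(tokens)     # the model's "probability table"
    word = VOCAB[int(np.argmax(probs))]  # pick one word (greedily, here)
    if word == "[EOS]":                  # special token: stop generating
        break
    tokens.append(word)

print(" ".join(tokens))  # -> I am good
```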

2. Problem

Since the model outputs text one word at a time, whether it produces good text depends on whether the decoding strategy is smart enough to decide which word should be output at each step.

Note that "good" here does not mean the model is well trained and expresses itself with near-human quality. Here, "good" refers to a good strategy for selecting output words. In detail: whenever the model predicts the next word, whatever state it is in (i.e. whether it is well trained or not), the strategy always has a principled way to pick the word it considers most reasonable as the output.

Whenever the strategy has to choose the next output word, it consults a large table: the probabilities the model currently assigns to each candidate for the next word.
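In practice this table is usually produced by applying a softmax to the model's raw scores (logits). A minimal sketch, with made-up numbers:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for four candidate next words:
logits = np.array([2.0, 1.0, 0.1, -1.0])
table = softmax(logits)    # the probability table the strategy consults
print(table, table.sum())  # probabilities summing to 1
```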


3. Decoding Strategies

3.1 Standard Greedy Search

The simplest approach is to always pick the word with the highest probability. This has a potential problem: once the whole sentence (many words) has been output, there is no guarantee that it is the best sentence overall; a better one may exist. Even though we make the locally best choice at each step, in the big picture this does not mean that the whole sentence built from these words is the best.
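A toy example (with made-up numbers) of how locally best choices can lose globally:

```python
# Step-1 probabilities, and step-2 probabilities conditioned on step 1.
step1 = {"The": 0.6, "A": 0.4}
step2 = {
    "The": {"dog": 0.3, "cat": 0.3},
    "A":   {"dog": 0.9, "cat": 0.1},
}

# Greedy takes "The" first (0.6 > 0.4), then a 0.3 continuation:
greedy_score = step1["The"] * step2["The"]["dog"]  # 0.18

# But "A dog" scores higher as a whole sentence:
better_score = step1["A"] * step2["A"]["dog"]      # 0.36
print(greedy_score, better_score)
```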

3.2 Beam Search

To address this "big picture" problem, we can try beam search. As shown in the figure: with the strategy in 3.1 our view is narrow, since we track only the single output we currently think is best. Beam search tracks more candidates (two in the figure). Candidates that score best early on are not necessarily the best in the end. You can track even more candidates at once if you wish, at the cost of more computing resources.

1) In the figure, the current input is "The" → beam search selects the next word from the output probability table → it keeps the best 2 candidates, "The dog" and "The nice", with scores 0.4 and 0.5 respectively.


2) Now the current inputs are "The dog" and "The nice" → the strategy picks the highest-scoring continuation for each → "The dog has" and "The nice woman".

3) Continuing in this way until the end, you obtain the 2 highest-scoring sentences.

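Here is a minimal beam-search sketch, assuming the same kind of hypothetical `next_token_probs` function as in the background section (given the tokens so far, it returns a probability for every vocabulary word). Scores are summed log-probabilities, so partial sentences compare sensibly:

```python
import numpy as np

def beam_search(next_token_probs, vocab, eos="[EOS]", beam_width=2, max_len=10):
    beams = [([], 0.0)]  # (tokens so far, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:      # finished beam: carry over
                candidates.append((tokens, score))
                continue
            probs = next_token_probs(tokens)
            for i, p in enumerate(probs):
                if p > 0:
                    candidates.append((tokens + [vocab[i]], score + np.log(p)))
        # keep only the beam_width highest-scoring partial sentences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams
```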

Shortcoming 1: text generated by this strategy tends to repeat itself (in the figure, "I'm not sure if I'll..." appears twice). One remedy is to constrain generation with simple rules, for example forbidding the same text fragment (n-gram) from appearing twice; a sketch of such a check follows.
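A minimal sketch of that n-gram rule (Huggingface's generate interface exposes the same idea as the `no_repeat_ngram_size` argument):

```python
def has_repeated_ngram(tokens, n=3):
    # True if any n-gram occurs more than once in the token list.
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in seen:
            return True
        seen.add(ngram)
    return False

print(has_repeated_ngram("I am not sure if I am not sure".split()))  # True
```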

Shortcoming 2: when we want the beam width k to be large (meaning we want to track the k highest-scoring candidates simultaneously), the demand on computing resources grows correspondingly.

Shortcoming 3: the generated text is rather dull and uninteresting. Research has shown that, guided by beam search, a model can generate sentences humans understand, but these sentences do not surprise real humans.

There are also a number of variants of the beam search strategy.

3.3 Sampling

Sampling makes the generated text more diverse. The simplest version samples the next word according to the current probability distribution, so the words the model considers reasonable (high-probability words) have a higher chance of being drawn. The drawback is a certain chance of incoherent output, or of sentences less fluent than human language.
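A minimal sketch of one sampling step, using a made-up distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_step(probs):
    # Draw the next word's index according to the model's distribution,
    # instead of always taking the argmax.
    return int(rng.choice(len(probs), p=probs))

probs = np.array([0.5, 0.3, 0.15, 0.05])  # hypothetical probability table
print(sample_step(probs))
```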

3.4 Top-k Sampling

To alleviate this problem, we can limit the sampling range: for example, sample only from the top k words in the probability table at each step.
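A minimal top-k sketch, reusing the same made-up distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(probs, k=5):
    top = np.argsort(probs)[-k:]        # indices of the k likeliest words
    p = probs[top] / probs[top].sum()   # renormalise over those k words
    return int(rng.choice(top, p=p))

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(top_k_sample(probs, k=2))  # only the two likeliest words can win
```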

3.5 Sampling with Temperature

This method rescales the current probability distribution: it can make large probabilities larger and small ones smaller, or make the gap between large and small probabilities less pronounced. The strength of this rescaling is controlled by the temperature parameter T > 0: in the equation, the model's raw scores (logits) are divided by T before the softmax is applied (a small sketch follows the two cases below).

  • As T becomes larger, the model favours less common words when generating text; the larger T is, the closer the rescaled distribution gets to uniform sampling.
  • As T becomes smaller, the model favours common words; the smaller T is, the closer the rescaled distribution gets to the greedy strategy mentioned at the beginning (always choosing the highest-probability word).
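A minimal sketch of temperature rescaling applied to made-up logits:

```python
import numpy as np

def apply_temperature(logits, T):
    # Divide the logits by T, then softmax. Small T sharpens the
    # distribution (towards greedy); large T flattens it (towards uniform).
    z = logits / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([4.0, 2.0, 1.0])
print(apply_temperature(logits, 1.0))  # the original distribution
print(apply_temperature(logits, 0.5))  # sharper: favours the top word
print(apply_temperature(logits, 2.0))  # flatter: rarer words gain mass
```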

The Meena paper uses this strategy as follows:

  • For the same input, the paper has the model generate 20 different candidate responses using this strategy.
  • Then, from these 20 candidates, the one whose whole sentence has the highest probability is selected as the final output.

Sentences generated with this sample-and-rank method are noticeably more diverse and of higher quality than those produced by beam search.
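A sketch of that sample-and-rank scheme. Both helpers here are assumptions standing in for real components: `sample_response()` draws one response with temperature sampling, and `score(r)` returns the whole response's log-probability under the model.

```python
def sample_and_rank(sample_response, score, n=20):
    # Draw n candidate responses, then return the one the model
    # assigns the highest overall probability.
    responses = [sample_response() for _ in range(n)]
    return max(responses, key=score)
```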

3.6 Top-p (Nucleus) Sampling

Top-k sampling limits the sampling range very rigidly. For example, "top-5" means we may only sample from the 5 highest-ranked words. This can be problematic:

  • Words ranked below 5th place may still have non-trivial probability, yet we rule them out forever, even though they may be very good choices.
  • Words ranked within the top 5 may have quite low probability, yet we still consider them, and they may well lower the quality of the text.

The top-p method makes the sampling range adjust itself dynamically by setting a threshold p: starting from the highest-probability word in the sorted vocabulary, we accumulate probabilities until the running total exceeds the threshold. All words inside that set form the sampling range.

Suppose we set the threshold to p = 0.92 (a sketch follows the two cases below):

  • In the left figure, it takes the first 9 words for the cumulative probability to exceed 0.92.
  • In the right figure, the first 3 words already sum to more than 0.92.
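A minimal nucleus-sampling sketch; the four-word distribution is made up, and with p = 0.92 it keeps exactly the first three words, as in the right-hand case above:

```python
import numpy as np

rng = np.random.default_rng(0)

def top_p_sample(probs, p=0.92):
    order = np.argsort(probs)[::-1]                    # most likely word first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # nucleus size
    keep = order[:cutoff]                              # smallest set exceeding p
    q = probs[keep] / probs[keep].sum()                # renormalise inside it
    return int(rng.choice(keep, p=q))

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(top_p_sample(probs))  # samples among the first 3 words only
```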

4. Code Tips

The figure showed part of the Huggingface generate interface. Although this post presents the methods separately, they are not completely isolated from one another: some can be combined, which is why several of these arguments can be set at the same time in code.
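Since the original screenshot is not available here, the following is a hedged reconstruction of what such a call looks like. The argument names (`do_sample`, `top_k`, `top_p`, `temperature`, `no_repeat_ngram_size`) are real `transformers` generate parameters; the model choice and values are just examples.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello! How are you today?", return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,          # sample instead of greedy/beam search
    top_k=50,                # top-k sampling
    top_p=0.92,              # top-p (nucleus) sampling
    temperature=0.7,         # sampling with temperature
    no_repeat_ngram_size=2,  # block repeated n-grams
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```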

5. Summary

This post described some common decoding strategies, but it is genuinely difficult to say which one is best.

In general, sampling-based approaches are preferable to greedy and beam search in open-domain dialogue systems, because they generate text of higher quality and greater diversity.

This does not mean we abandon greedy and beam search entirely: research has shown that, with good training, these two methods can generate better text than top-p sampling.

There is still a long, long way to go in natural language processing. Keep at it!

Note: please read the instructions before using any original post content (Menu → All posts), or contact me if you have any questions.



Reviewing editor: Li Qian (李倩)



Original title: 通俗理解文本生成的常用解碼策略

Source: WeChat official account 深度学习自然语言处理 (ID: zenRRan). Please credit the source when reposting.
