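Throughout this book, we will often work with text data represented as sequences of words, characters, or word pieces. To get started, we need some basic tools for converting raw text into sequences of the appropriate form. A typical preprocessing pipeline executes the following steps:
- Load the text into memory as strings.
- Split the strings into tokens (e.g., words or characters).
- Build a vocabulary dictionary that associates each vocabulary element with a numerical index.
- Convert the text into sequences of numerical indices.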
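The four steps above can be sketched end to end on a tiny example. This is only an illustration under stated assumptions: the sample sentence is made up, tokens are whitespace-separated words, and index 0 is reserved for unknown tokens purely as a convention of this sketch.

```python
import collections
import re

# Step 1: the raw text is already in memory as a string (a made-up sample).
raw_text = "The Time Traveller smiled. The time had come!"

# Step 2: normalize, then split into tokens (words here; characters also work).
clean_text = re.sub('[^A-Za-z]+', ' ', raw_text).lower().strip()
tokens = clean_text.split()

# Step 3: build a vocabulary that maps each distinct token to an integer index.
# Index 0 is reserved for unknown tokens (an assumption of this sketch);
# the remaining tokens are ordered by descending frequency.
counter = collections.Counter(tokens)
idx_to_token = ['<unk>'] + [tok for tok, _ in counter.most_common()]
token_to_idx = {tok: idx for idx, tok in enumerate(idx_to_token)}

# Step 4: convert the token sequence into a sequence of indices.
indices = [token_to_idx.get(tok, 0) for tok in tokens]
print(tokens)
print(indices)
```

Repeated words such as "the" and "time" map to the same index each time they occur, which is what lets a model treat them as the same symbol.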
import collections
import random
import re
import torch
from d2l import torch as d2l
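9.2.1. Reading the Dataset
Here, we will work with H. G. Wells' The Time Machine, a book of just over 30,000 words. While real applications typically involve much larger datasets, this is sufficient to demonstrate the preprocessing pipeline. The following _download method reads the raw text into a string.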
class TimeMachine(d2l.DataModule): #@save
    """The Time Machine dataset."""
    def _download(self):
        fname = d2l.download(d2l.DATA_URL + 'timemachine.txt', self.root,
                             '090b5e7e70c295757f55df93cb0a180b9691891a')
        with open(fname) as f:
            return f.read()

data = TimeMachine()
raw_text = data._download()
raw_text[:60]
Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...
'The Time Machine, by H. G. Wells [1898]\n\n\n\n\nI\n\n\nThe Time Tra'
For simplicity, we ignore punctuation and capitalization when preprocessing the raw text.
@d2l.add_to_class(TimeMachine) #@save
def _preprocess(self, text):
    return re.sub('[^A-Za-z]+', ' ', text).lower()

text = data._preprocess(raw_text)
text[:60]
'the time machine by h g wells i the time traveller for so it'
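To see what this regular expression does in isolation, here is a small standalone sketch. The free function `preprocess` and the sample strings are illustrative assumptions, not part of the `d2l` library; only the pattern itself is taken from the text.

```python
import re

def preprocess(text):
    """Replace every run of non-letter characters with one space, then lowercase."""
    return re.sub('[^A-Za-z]+', ' ', text).lower()

# Punctuation, digits, brackets, and newlines all collapse into single spaces.
sample = "The Time Machine, by H. G. Wells [1898]\n\n\nI"
print(repr(preprocess(sample)))

# Note that the pattern keeps only unaccented ASCII letters, so digits and
# characters such as 'é' are discarded as well.
print(repr(preprocess("café 42")))
```

Because the character class keeps only `A-Z` and `a-z`, this approach is lossy for text containing numbers or non-English letters; that is acceptable for this book but would need a different pattern elsewhere.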
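9.2.2. Tokenization
Tokens are the atomic (indivisible) units of text. Each time step corresponds to one token, but what exactly constitutes a token is a design choice. For example, we could represent the sentence "Baby needs a new pair of shoes" as a sequence of 7 words, where the set of all words forms a large vocabulary (typically tens or hundreds of thousands of words). Alternatively, we could represent the same sentence as a much longer sequence of 30 characters, using a much smaller vocabulary (ASCII defines only 128 characters, and a single byte can take at most 256 distinct values). Below, we tokenize our preprocessed text into a sequence of characters.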
't,h,e, ,t,i,m,e, ,m,a,c,h,i,n,e, ,b,y, ,h, ,g, ,w,e,l,l,s, '
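The trade-off between the two granularities can be checked with plain Python. The following is a minimal sketch, assuming character tokens are produced with `list` and word tokens with whitespace splitting; the variable names are illustrative only.

```python
sentence = "Baby needs a new pair of shoes"
# Word-level view: few tokens, but drawn from a large vocabulary.
print(len(sentence.split()))
# Character-level view: many tokens (spaces included), tiny vocabulary.
print(len(list(sentence)))

text = 'the time machine by h g wells i the time traveller'
char_tokens = list(text)      # every character, including spaces, is a token
word_tokens = text.split()    # whitespace-separated words

print(','.join(char_tokens[:30]))
print(word_tokens[:7])
```

Joining the first 30 character tokens with commas reproduces the string shown above, since each space in the text is itself a token.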
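9.2.3. Vocabulary
These tokens are still strings. However, the inputs to our models must ultimately consist of numerical values. Next, we introduce a class for constructing vocabularies, i.e., objects that associate each distinct token value with a unique index. First, we determine the set of unique tokens in our training corpus. We then assign a numerical index to each unique token. Rare vocabulary elements are often dropped for convenience. Whenever we encounter a token at training or test time that has not been seen before or was dropped from the vocabulary, we represent it by a special "<unk>" token, signifying that this is an unknown value.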
class Vocab: #@save
    """Vocabulary for text."""
    def __init__(self, tokens=[], min_freq=0, reserved_tokens=[]):
        # Flatten a 2D list if needed
        if tokens and isinstance(tokens[0], list):
            tokens = [token for line in tokens for token in line]
        # Count token frequencies
        counter = collections.Counter(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                  reverse=True)
        # The list of unique tokens
        self.idx_to_token = list(sorted(set(['<unk>'] + reserved_tokens + [
            token for token, freq in self.token_freqs if freq >= min_freq])))
        self.token_to_idx = {token: idx
                             for idx, token in enumerate(self.idx_to_token)}

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens,
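
As a self-contained illustration of the ideas described in this section (frequency filtering, a token-to-index mapping, and a fallback for unknown tokens), here is a minimal sketch. The class `SimpleVocab`, its `unk_token` parameter, its `to_tokens` method, and the choice of placing the unknown token at index 0 are all assumptions made for this example; it is not the `Vocab` class of the `d2l` library.

```python
import collections

class SimpleVocab:
    """Hypothetical, simplified vocabulary; not the d2l implementation."""
    def __init__(self, tokens, min_freq=0, unk_token='<unk>'):
        counter = collections.Counter(tokens)
        # Keep only tokens that occur at least min_freq times.
        kept = sorted(tok for tok, freq in counter.items() if freq >= min_freq)
        # Assumption of this sketch: the unknown token always gets index 0.
        self.idx_to_token = [unk_token] + [t for t in kept if t != unk_token]
        self.token_to_idx = {t: i for i, t in enumerate(self.idx_to_token)}
        self.unk = 0

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # A single token maps to one index; a list or tuple maps element-wise.
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self[token] for token in tokens]

    def to_tokens(self, indices):
        # Inverse lookup: indices back to token strings.
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.idx_to_token[i] for i in indices]

vocab = SimpleVocab(list('the time machine'), min_freq=2)
print(len(vocab))
print(vocab[list('the')])
print(vocab['a'], vocab['z'])   # dropped and unseen tokens share the unknown index
```

Here the characters `a`, `c`, and `n` occur only once and are dropped by `min_freq=2`, so looking them up yields the same index as a character that never appeared at all.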