6.3.7. Tokenizers¶
Getting started¶
Quicktour¶
Build a tokenizer from scratch¶
Training the tokenizer:
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# train our tokenizer on the wikitext files:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
# add whitespace pre-tokenization (important for English etc.: otherwise a frequent sequence like "it is" could be learned as a single token)
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()
# train
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
tokenizer.train(files, trainer)
tokenizer.save("data/tokenizer-wiki.json")
# reload it later:
tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")
Using the tokenizer:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
print(output.ids)
# [27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35]
print(output.offsets[9])
# (26, 27)
sentence = "Hello, y'all! How are you 😁 ?"
sentence[26:27]
# "😁"
Post-processing:
tokenizer.token_to_id("[SEP]")
# 2
from tokenizers.processors import TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
single="[CLS] $A [SEP]",
pair="[CLS] $A [SEP] $B:1 [SEP]:1",
special_tokens=[
("[CLS]", tokenizer.token_to_id("[CLS]")),
("[SEP]", tokenizer.token_to_id("[SEP]")),
],
)
# Notes
1. single sets the template for a single sentence: "[CLS] $A [SEP]", where $A stands for our sentence.
2. pair sets the template for a sentence pair: "[CLS] $A [SEP] $B [SEP]", where $A stands for the first sentence and $B for the second (the :1 suffixes give those pieces type ID 1).
# single-sentence example
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["[CLS]", "Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?", "[SEP]"]
# sentence-pair example
output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?")
print(output.tokens)
# ["[CLS]", "Hello", ",", "y", "'", "all", "!", "[SEP]", "How", "are", "you", "[UNK]", "?", "[SEP]"]
print(output.type_ids)
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
Encoding multiple sentences in a batch:
# process your texts by batches
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
# batch of sentence pairs
output = tokenizer.encode_batch(
[["Hello, y'all!", "How are you 😁 ?"], ["Hello to you too!", "I'm fine, thank you!"]]
)
# automatically pad the outputs to the longest sentence
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")  # pad_id is the integer ID reserved for the [PAD] token: padded positions are filled with ID 3
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
print(output[1].tokens)
# ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]", "[PAD]"]
print(output[1].attention_mask)
# [1, 1, 1, 1, 1, 1, 1, 0]
Pretrained¶
Using a pretrained tokenizer:
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
Importing a pretrained tokenizer from legacy vocabulary files:
from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
The tokenization pipeline¶
When calling Tokenizer.encode or Tokenizer.encode_batch, the input text(s) go through the following pipeline (each stage is sketched in code after the list):
normalization
pre-tokenization
model
post-processing
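Each component can also be invoked on its own. A minimal sketch of the stages, reusing the tokenizer-wiki.json tokenizer saved above (which has a Whitespace pre-tokenizer and no normalizer, so NFD is shown standalone):
from tokenizers import Tokenizer
from tokenizers.normalizers import NFD
from tokenizers.pre_tokenizers import Whitespace
tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")
# 1. normalization: clean up the raw string
NFD().normalize_str("Héllò")
# 2. pre-tokenization: split the string into "words" with their offsets
Whitespace().pre_tokenize_str("Hello, y'all!")
# 3. model and 4. post-processing both run inside encode()
tokenizer.encode("Hello, y'all!").tokens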
Normalization¶
Normalization is a set of operations applied to a raw string to make it less random or "cleaner".
Common operations include stripping whitespace, removing accented characters, and lowercasing all the text.
Each normalization operation is represented in the 🤗 Tokenizers library by a Normalizer, and several of them can be combined with normalizers.Sequence.
A normalizer that applies NFD Unicode normalization and strips accents:
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
normalizer = normalizers.Sequence([NFD(), StripAccents()])
Usage:
normalizer.normalize_str("Héllò hôw are ü?")
# "Hello how are u?"
Customize the Tokenizer's normalizer by changing the corresponding attribute:
tokenizer.normalizer = normalizer
Pre-Tokenization¶
Pre-tokenization is the act of splitting the text into smaller objects, which give an upper bound on what your tokens can be at the end of training.
A good way to think of this is that the pre-tokenizer will split your text into “words” and then, your final tokens will be parts of those words.
A simple way to pre-tokenize the input is to split on whitespace and punctuation:
from tokenizers.pre_tokenizers import Whitespace
pre_tokenizer = Whitespace()
pre_tokenizer.pre_tokenize_str("Hello! How are you? I'm fine, thank you.")
# [("Hello", (0, 5)), ("!", (5, 6)), ("How", (7, 10)), ("are", (11, 14)), ("you", (15, 18)),
# ("?", (18, 19)), ("I", (20, 21)), ("'", (21, 22)), ('m', (22, 23)), ("fine", (24, 28)),
# (",", (28, 29)), ("thank", (30, 35)), ("you", (36, 39)), (".", (39, 40))]
Any PreTokenizers can be combined together:
# split on whitespace, punctuation and digits, separating numbers into their individual digits
from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Digits
pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])
pre_tokenizer.pre_tokenize_str("Call 911!")
# [("Call", (0, 4)), ("9", (5, 6)), ("1", (6, 7)), ("1", (7, 8)), ("!", (8, 9))]
Customize the Tokenizer's pre-tokenizer by changing the corresponding attribute:
tokenizer.pre_tokenizer = pre_tokenizer
Model¶
Once the input text has been normalized and pre-tokenized, the Tokenizer applies the model on the pre-tokens.
The role of the model is to split your "words" into tokens, using the rules it has learned. It is also responsible for mapping those tokens to their corresponding IDs in the model's vocabulary.
This model is passed along when initializing the Tokenizer.
The 🤗 Tokenizers library supports:
models.BPE
models.Unigram
models.WordLevel
models.WordPiece
Example:
# the model is passed when the Tokenizer is initialized
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
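The other model classes are instantiated the same way; a minimal sketch (constructor arguments beyond unk_token are omitted):
from tokenizers import Tokenizer
from tokenizers.models import Unigram, WordLevel, WordPiece
Tokenizer(Unigram())                     # Unigram language-model tokenizer
Tokenizer(WordLevel(unk_token="[UNK]"))  # plain word-to-ID mapping
Tokenizer(WordPiece(unk_token="[UNK]"))  # WordPiece, as used by BERT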
Post-Processing¶
Post-processing is the last step, used to perform any additional transformation on the Encoding before it is returned, such as adding potential special tokens.
Customize the Tokenizer's post-processor by setting the corresponding attribute.
Example: post-processing that makes the inputs suitable for a BERT model:
from tokenizers.processors import TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
single="[CLS] $A [SEP]",
pair="[CLS] $A [SEP] $B:1 [SEP]:1",
special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
Unlike the pre-tokenizer or the normalizer, the post-processor can be changed without retraining the tokenizer.
All together: a BERT tokenizer from scratch¶
BERT relies on WordPiece, so we instantiate a new Tokenizer with this model:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
BERT preprocesses text by stripping accents and lowercasing:
from tokenizers import normalizers
from tokenizers.normalizers import NFD, Lowercase, StripAccents
bert_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
Split on whitespace and punctuation:
from tokenizers.pre_tokenizers import Whitespace
bert_tokenizer.pre_tokenizer = Whitespace()
The post-processor uses the template introduced above:
from tokenizers.processors import TemplateProcessing
bert_tokenizer.post_processor = TemplateProcessing(
single="[CLS] $A [SEP]",
pair="[CLS] $A [SEP] $B:1 [SEP]:1",
special_tokens=[
("[CLS]", 1),
("[SEP]", 2),
],
)
We can now use this tokenizer and train it on the wikitext files:
from tokenizers.trainers import WordPieceTrainer
trainer = WordPieceTrainer(vocab_size=30522, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
bert_tokenizer.train(files, trainer)
bert_tokenizer.save("data/bert-wiki.json")
Decoding¶
In addition to encoding input texts, the Tokenizer also has an API for decoding, i.e. converting the IDs generated by your model back into text:
Tokenizer.decode (for one predicted text) and Tokenizer.decode_batch (for a batch of predictions)
The decoder first converts the IDs back to tokens (using the tokenizer's vocabulary) and removes all special tokens, then joins those tokens with spaces:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.ids)
# [1, 27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35, 2]
tokenizer.decode([1, 27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35, 2])
# "Hello , y ' all ! How are you ?"
If you used a model that adds special characters to represent subtokens of a given "word" (like the "##" in WordPiece):
output = bert_tokenizer.encode("Welcome to the 🤗 Tokenizers library.")
print(output.tokens)
# ["[CLS]", "welcome", "to", "the", "[UNK]", "tok", "##eni", "##zer", "##s", "library", ".", "[SEP]"]
bert_tokenizer.decode(output.ids)
# "welcome to the tok ##eni ##zer ##s library ."
In that case you need to customize the decoder to treat them properly:
from tokenizers import decoders
bert_tokenizer.decoder = decoders.WordPiece()
bert_tokenizer.decode(output.ids)
# "welcome to the tokenizers library."
Components¶
Normalizers¶
Commonly used Normalizers:
NFD
NFKD
NFC
NFKC
Lowercase => Input: "HELLO ὈΔΥΣΣΕΎΣ" Output: "hello ὀδυσσεύς"
Strip => Input: " hi " Output: "hi"
StripAccents => Input: "é" Output: "e"
Replace => Replace("a", "e") will behave like this: Input: "banana" Output: "benene"
BertNormalizer
=> Provides an implementation of the Normalizer used in the original BERT
clean_text
handle_chinese_chars
strip_accents
lowercase
Sequence
=> Composes multiple normalizers that will run in the provided order
Sequence([NFKC(), Lowercase()])
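A minimal sketch combining several of these normalizers (the input string is made up for illustration):
from tokenizers import normalizers
from tokenizers.normalizers import NFKC, Lowercase, Strip, Replace
normalizer = normalizers.Sequence([NFKC(), Lowercase(), Strip(), Replace("a", "e")])
normalizer.normalize_str("  BANANA  ")
# "benene"  <- lowercased, stripped of surrounding spaces, then "a" replaced by "e"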
Pre-tokenizers¶
The PreTokenizer takes care of splitting the input according to a set of rules.
algorithms:
ByteLevel
Splits on whitespaces while remapping all the bytes to a set of visible characters.
introduced by OpenAI with GPT-2
Whitespace => Input: "Hello there!" Output: "Hello", "there!"
Splits on word boundaries (using the following regular expression: `\w+|[^\w\s]+`)
WhitespaceSplit
Splits on any whitespace character
Punctuation => Input: "Hello?" Output: "Hello", "?"
Metaspace => Input: "Hello there" Output: "Hello", "▁there"
Splits on whitespaces and replaces them with a special char “▁” (U+2581)
CharDelimiterSplit => Example with x: Input: "Helloxthere" Output: "Hello", "there"
Digits => Input: "Hello123there" Output: "Hello", "123", "there"
Splits the numbers from any other characters.
Split
It takes the following 3 parameters:
1. `pattern` should be either a custom string or regexp.
2. `behavior` should be one of:
removed
isolated
merged_with_previous
merged_with_next
contiguous
3. `invert` should be a boolean flag.
Example:
pattern = " ", behavior = "isolated", invert = False:
Input: "Hello, how are you?"
Output: "Hello,", " ", "how", " ", "are", " ", "you?"
Sequence
Example:
Sequence([Punctuation(), WhitespaceSplit()])
Models¶
The tokenization models:
models.BPE => Byte-Pair Encoding
models.Unigram => Unigram language model
models.WordLevel => plain word-to-ID mapping
models.WordPiece => WordPiece (used by BERT)
WordLevel¶
Overview: WordLevel is the most basic tokenization approach, the usual "word-level tokenization". It maps each complete word to a unique ID without splitting words any further.
Pros: simple and intuitive, easy to understand and implement; all it needs is a word-to-ID mapping table.
Cons: it needs a very large vocabulary to cover every word that may appear, which makes the model large, and out-of-vocabulary (OOV) words are very likely to end up as "[UNK]" (unknown).
Use cases: tasks with a small vocabulary, or simple applications without strict vocabulary requirements.
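A minimal sketch of WordLevel's OOV behavior, trained from a tiny in-memory corpus (the corpus is made up for illustration):
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace
word_level = Tokenizer(WordLevel(unk_token="[UNK]"))
word_level.pre_tokenizer = Whitespace()
word_level.train_from_iterator(["hello world", "hello tokenizers"], trainer=WordLevelTrainer(special_tokens=["[UNK]"]))
print(word_level.encode("hello unseen").tokens)
# ["hello", "[UNK]"]  <- any word outside the vocabulary becomes [UNK]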
BPE¶
Overview: BPE is a popular subword tokenization algorithm. It starts at the character level and iteratively creates new subword tokens by merging the character pairs that occur most frequently in the corpus, building up longer and longer subwords.
Pros: BPE can handle unseen words, because it can break a new word into subwords and recombine them, so its vocabulary can stay relatively small.
Cons: BPE segmentation is based on frequency statistics, so the result is fixed and neither dynamic nor context-sensitive.
Use cases: widely used in subword models such as GPT-2; a good fit for vocabulary-rich languages and for reducing vocabulary size.
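In contrast, a BPE model trained the same way breaks an unseen word into smaller learned pieces instead of emitting [UNK]; a minimal sketch with the same made-up corpus:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
bpe = Tokenizer(BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = Whitespace()
bpe.train_from_iterator(["hello world", "hello tokenizers"], trainer=BpeTrainer(special_tokens=["[UNK]"]))
print(bpe.encode("hellos").tokens)
# likely ["hello", "s"] -- the unseen word is covered by learned subwords and characters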
WordPiece¶
Overview: WordPiece is a subword tokenization algorithm similar to BPE, used mainly by Google in the BERT model. Its segmentation strategy is greedy: it always tries to produce the longest possible subword first. Words not fully in the vocabulary are split into several subwords, with the "##" prefix marking subwords that continue a word.
Pros: handles unseen words well by splitting them into several subwords that cover more possible combinations; "[UNK]" tokens appear rarely.
Cons: depends on training over a corpus, which takes more time and compute.
Use cases: BERT and its derivatives; suited to longer texts or when a stable tokenization is desired.
Unigram¶
Overview: Unigram is another subword tokenization algorithm. Unlike BPE and WordPiece, it uses a probabilistic model to pick the best segmentation: it considers multiple ways of splitting a sentence and chooses the combination with the highest probability.
Pros: instead of fixed rules, Unigram picks the best segmentation dynamically based on probabilities, giving it some flexibility and context sensitivity; it compresses the vocabulary effectively and reduces OOV issues.
Cons: more complex than the other algorithms and potentially more expensive to compute.
Use cases: used by XLNet and SentencePiece, and by language-model applications that need flexible tokenization.
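A minimal sketch of training a Unigram tokenizer with the library (same made-up corpus; the unk token is configured via the trainer here, since the bare Unigram() model starts with none):
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Whitespace
unigram = Tokenizer(Unigram())
unigram.pre_tokenizer = Whitespace()
trainer = UnigramTrainer(vocab_size=100, special_tokens=["[UNK]"], unk_token="[UNK]")
unigram.train_from_iterator(["hello world", "hello tokenizers"], trainer=trainer)
print(unigram.encode("hello tokenizers").tokens)
# the segmentation with the highest probability under the learned unigram model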
Summary comparison¶
Algorithm | Tokenization approach | Vocabulary size | OOV handling | Typical applications
---|---|---|---|---
WordLevel | maps each word directly to an ID | very large | uses "[UNK]" | simple text-analysis tasks
BPE | merges subwords by frequency statistics | medium | composed from subwords | GPT-2 and similar models
WordPiece | greedy longest match, "##" marks subword continuations | medium | composed from subwords | BERT and its variants
Unigram | probabilistic model, best-scoring segmentation | medium | chosen dynamically | SentencePiece, XLNet
Post-Processors¶
TemplateProcessing
See the examples above for details.
Decoders¶
ByteLevel
Metaspace
WordPiece
BPEDecoder
CTC
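Each decoder can also be called directly on a list of token strings; a minimal sketch with the Metaspace and WordPiece decoders (the token lists are made up to mimic their respective models):
from tokenizers import decoders
decoders.Metaspace().decode(["▁Hello", "▁there"])
# "Hello there"  <- "▁" markers are turned back into spaces
decoders.WordPiece().decode(["welcome", "to", "the", "tok", "##eni", "##zers"])
# "welcome to the tokenizers"  <- "##" continuations are glued back onto the previous token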
API¶
Normalizers¶
BertNormalizer
Lowercase
Trainers¶
BpeTrainer
UnigramTrainer
WordLevelTrainer
WordPieceTrainer
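Most trainers share similar knobs such as vocab_size and special_tokens; a minimal sketch of a configured BpeTrainer, with made-up values:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(
    vocab_size=30000,  # target vocabulary size
    min_frequency=2,   # ignore pairs seen fewer than 2 times
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)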