6.3.7. Tokenizers

Getting started

Quicktour

Build a tokenizer from scratch

Training the tokenizer:

from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# train our tokenizer on the wikitext files:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# For English and similar languages, split on whitespace first (otherwise frequent sequences like "it is" could be learned as a single token)
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

# train
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
tokenizer.train(files, trainer)

tokenizer.save("data/tokenizer-wiki.json")

# Reload the tokenizer from the saved file:
tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")

Using the tokenizer:

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]

print(output.ids)
# [27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35]

print(output.offsets[9])
# (26, 27)

sentence = "Hello, y'all! How are you 😁 ?"
sentence[26:27]
# "😁"

Post-processing:

tokenizer.token_to_id("[SEP]")
# 2

from tokenizers.processors import TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
# Notes:
# 1. single sets the template for single sentences: "[CLS] $A [SEP]", where $A stands for the sentence.
# 2. pair sets the template for sentence pairs: "[CLS] $A [SEP] $B [SEP]", where $A is the first sentence and $B the second; ":1" assigns those tokens to type ID 1.

# Single-sentence example
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["[CLS]", "Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?", "[SEP]"]

# Sentence-pair example
output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?")
print(output.tokens)
# ["[CLS]", "Hello", ",", "y", "'", "all", "!", "[SEP]", "How", "are", "you", "[UNK]", "?", "[SEP]"]
print(output.type_ids)
# [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

Encoding multiple sentences in a batch:

# process your texts by batches
output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])

# batch of sentence pairs
output = tokenizer.encode_batch(
    [["Hello, y'all!", "How are you 😁 ?"], ["Hello to you too!", "I'm fine, thank you!"]]
)

# automatically pad the outputs to the longest sentence
tokenizer.enable_padding(pad_id=3, pad_token="[PAD]") # 3 is the ID the trainer assigned to [PAD]; padded positions are filled with this ID and token

output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
print(output[1].tokens)
# ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]", "[PAD]"]
print(output[1].attention_mask)
# [1, 1, 1, 1, 1, 1, 1, 0]

Pretrained

Using a pretrained tokenizer:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

Importing a pretrained tokenizer from legacy vocabulary files:

from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
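
Either way, the loaded tokenizer is used exactly like the one trained above; a minimal sketch (the variable name pretrained and the printed tokens are only illustrative):

# Encode with the pretrained tokenizer (illustrative output)
pretrained = Tokenizer.from_pretrained("bert-base-uncased")
output = pretrained.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# roughly: ["[CLS]", "hello", ",", "y", "'", "all", "!", "how", "are", "you", "[UNK]", "?", "[SEP]"]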

The tokenization pipeline

When calling Tokenizer.encode or Tokenizer.encode_batch, the input text(s) go through the following pipeline:

normalization
pre-tokenization
model
post-processing

Normalization

  • Normalization is a set of operations applied to a raw string to make it less random or "cleaner".

  • Common operations include stripping whitespace, removing accented characters, and lowercasing all text.

  • Each normalization operation is represented in the 🤗 Tokenizers library by a Normalizer, and several of them can be combined with normalizers.Sequence.

A normalizer that applies NFD Unicode normalization and removes accents:

from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents
normalizer = normalizers.Sequence([NFD(), StripAccents()])

Usage:

normalizer.normalize_str("Héllò hôw are ü?")
# "Hello how are u?"

To customize the Tokenizer's normalizer, set the corresponding attribute:

tokenizer.normalizer = normalizer

Pre-Tokenization

  • Pre-tokenization is the act of splitting the text into smaller objects that give an upper bound on what your tokens will be at the end of training.

  • A good way to think of this is that the pre-tokenizer will split your text into “words” and then, your final tokens will be parts of those words.

A simple way to pre-tokenize inputs is to split on spaces and punctuation:

from tokenizers.pre_tokenizers import Whitespace
pre_tokenizer = Whitespace()
pre_tokenizer.pre_tokenize_str("Hello! How are you? I'm fine, thank you.")
# [("Hello", (0, 5)), ("!", (5, 6)), ("How", (7, 10)), ("are", (11, 14)), ("you", (15, 18)),
#  ("?", (18, 19)), ("I", (20, 21)), ("'", (21, 22)), ('m', (22, 23)), ("fine", (24, 28)),
#  (",", (28, 29)), ("thank", (30, 35)), ("you", (36, 39)), (".", (39, 40))]

You can combine any PreTokenizers together:

# Split on whitespace and punctuation, and additionally split numbers into individual digits
from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Digits
pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])
pre_tokenizer.pre_tokenize_str("Call 911!")
# [("Call", (0, 4)), ("9", (5, 6)), ("1", (6, 7)), ("1", (7, 8)), ("!", (8, 9))]

To customize the Tokenizer's pre-tokenizer, set the corresponding attribute:

tokenizer.pre_tokenizer = pre_tokenizer

Model

  • Once the input text has been normalized and pre-tokenized, the Tokenizer applies the model to the pre-tokens.

  • The role of the model is to split your "words" into tokens, using the rules it has learned. It is also responsible for mapping those tokens to their corresponding IDs in the model's vocabulary.

  • This model is passed along when initializing the Tokenizer.

The 🤗 Tokenizers library supports:

models.BPE
models.Unigram
models.WordLevel
models.WordPiece

Example:

# The model is passed when the Tokenizer is initialized
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

Post-Processing

  • Post-processing is the last step, applying any additional transformation to the Encoding before it is returned, such as adding special tokens.

  • To customize the Tokenizer's post-processor, set the corresponding attribute.

Example: post-process the inputs so they fit a BERT model:

from tokenizers.processors import TemplateProcessing
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
  • Unlike the pre-tokenizer and the normalizer, the tokenizer does not need to be retrained after the post-processor is changed.

All together: a BERT tokenizer from scratch

BERT relies on WordPiece, so we instantiate a new Tokenizer with this model:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

BERT preprocesses text by stripping accents and lowercasing:

from tokenizers import normalizers
from tokenizers.normalizers import NFD, Lowercase, StripAccents
bert_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

Pre-tokenize by splitting on whitespace and punctuation:

from tokenizers.pre_tokenizers import Whitespace
bert_tokenizer.pre_tokenizer = Whitespace()

The post-processor uses the BERT template:

from tokenizers.processors import TemplateProcessing
bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)

Use this tokenizer and train it on wikitext:

from tokenizers.trainers import WordPieceTrainer
trainer = WordPieceTrainer(vocab_size=30522, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
bert_tokenizer.train(files, trainer)
bert_tokenizer.save("data/bert-wiki.json")

Decoding

  • In addition to encoding input texts, a Tokenizer also has an API for decoding, i.e. converting the IDs generated by your model back into text:

    Tokenizer.decode (for one predicted text)
    Tokenizer.decode_batch (for a batch of predictions)
    

The decoder first converts the IDs back to tokens (using the tokenizer's vocabulary) and removes all special tokens, then joins those tokens with spaces:

output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.ids)
# [1, 27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35, 2]
tokenizer.decode([1, 27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35, 2])
# "Hello , y ' all ! How are you ?"

If you used a model that adds special characters to represent subtokens of a given "word" (like the "##" in WordPiece):

output = bert_tokenizer.encode("Welcome to the 🤗 Tokenizers library.")
print(output.tokens)
# ["[CLS]", "welcome", "to", "the", "[UNK]", "tok", "##eni", "##zer", "##s", "library", ".", "[SEP]"]
bert_tokenizer.decode(output.ids)
# "welcome to the tok ##eni ##zer ##s library ."

In that case, a custom decoder is needed to handle them properly:

from tokenizers import decoders
bert_tokenizer.decoder = decoders.WordPiece()
bert_tokenizer.decode(output.ids)
# "welcome to the tokenizers library."

Components

Normalizers

Commonly used Normalizers:

NFD
NFKD
NFC
NFKC
Lowercase       => Input: HELLO ὈΔΥΣΣΕΎΣ  Output: hello ὀδυσσεύς
Strip           => Input: "  hi  " Output: "hi"
StripAccents    => Input: é Output: e
Replace         => Replace("a", "e") will behave like this: Input: "banana" Output: "benene"
BertNormalizer
    => Provides an implementation of the Normalizer used in the original BERT
        clean_text
        handle_chinese_chars
        strip_accents
        lowercase
Sequence
    => Composes multiple normalizers that will run in the provided order
        Sequence([NFKC(), Lowercase()])
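
A minimal sketch of how these compose, assuming we only want the BERT-style cleanup plus one extra Replace rule (the sample sentence and the expected output are illustrative):

from tokenizers.normalizers import BertNormalizer, Replace, Sequence

# BertNormalizer bundles clean_text / handle_chinese_chars / strip_accents / lowercase
normalizer = Sequence([
    BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=True, lowercase=True),
    Replace("``", '"'),  # chain an extra Replace rule after the BERT cleanup
])
print(normalizer.normalize_str("Héllò HOW are ü?"))
# roughly: "hello how are u?"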

Pre-tokenizers

  • The PreTokenizer takes care of splitting the input according to a set of rules.

algorithms:

ByteLevel
    Splits on whitespaces while remapping all the bytes to a set of visible characters.
    introduced by OpenAI with GPT-2
Whitespace          => Input: "Hello there!" Output: "Hello", "there", "!"
    Splits on word boundaries (using the following regular expression: `\w+|[^\w\s]+`)
WhitespaceSplit     => Input: "Hello there!" Output: "Hello", "there!"
    Splits on any whitespace character
Punctuation         => Input: "Hello?" Output: "Hello", "?"
Metaspace           => Input: "Hello there" Output: "Hello", "▁there"
    Splits on whitespaces and replaces them with a special char "▁" (U+2581)
CharDelimiterSplit  => Example with x: Input: "Helloxthere" Output: "Hello", "there"
Digits              => Input: "Hello123there" Output: "Hello", "123", "there"
        Splits the numbers from any other characters.
Split
    Takes the following 3 parameters:
        1. `pattern` should be either a custom string or regexp.
        2. `behavior` should be one of:
            removed
            isolated
            merged_with_previous
            merged_with_next
            contiguous
        3. `invert` should be a boolean flag.
    Example:
        pattern = " ", behavior = "isolated", invert = False:
        Input: "Hello, how are you?"
        Output: "Hello,", " ", "how", " ", "are", " ", "you?"
Sequence
    Example:
        Sequence([Punctuation(), WhitespaceSplit()])
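
A minimal sketch of Split and Sequence; the space pattern and the sample sentence are only for illustration:

from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Split, Punctuation, WhitespaceSplit

# Split on a literal space and keep each separator as its own pre-token
splitter = Split(" ", "isolated")
print(splitter.pre_tokenize_str("Hello, how are you?"))
# roughly: [("Hello,", (0, 6)), (" ", (6, 7)), ("how", (7, 10)), ...]

# Compose several pre-tokenizers; they run in the given order
combined = pre_tokenizers.Sequence([Punctuation(), WhitespaceSplit()])
print(combined.pre_tokenize_str("Hello, how are you?"))
# roughly: [("Hello", (0, 5)), (",", (5, 6)), ("how", (7, 10)), ("are", (11, 14)), ("you", (15, 18)), ("?", (18, 19))]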

Models

The tokenization models:

models.BPE          => Byte-Pair Encoding
models.Unigram      => Unigram language model
models.WordLevel    => whole-word to ID mapping
models.WordPiece    => WordPiece subword algorithm (used by BERT)

WordLevel
  • Overview: WordLevel is the most basic tokenization approach, the usual "word-level tokenization". It maps every complete word to a unique ID without splitting it any further.

  • Pros: simple and intuitive, easy to understand and implement; it only needs a word-to-ID mapping table.

  • Cons: it needs a very large vocabulary to cover every word that may appear, which makes the model large, and unseen (out-of-vocabulary, OOV) words are very likely to end up as "[UNK]".

  • Typical use: tasks with a small vocabulary, or simple applications without strict vocabulary requirements (see the sketch below).
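
A minimal sketch of a WordLevel setup, reusing the wikitext files from above (the trainer settings are illustrative):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

# Every whole word becomes one token; anything unseen maps to [UNK]
wl_tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
wl_tokenizer.pre_tokenizer = Whitespace()
trainer = WordLevelTrainer(special_tokens=["[UNK]", "[PAD]"])

files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
wl_tokenizer.train(files, trainer)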

BPE
  • Overview: BPE is a popular subword tokenization algorithm. It starts at the character level and iteratively creates new subword tokens by merging the character pairs that appear most frequently in the corpus, gradually building longer subwords.

  • Pros: BPE can handle unseen words, because it can break them into subwords and recombine them, so its vocabulary can stay relatively small.

  • Cons: the segmentation is driven by frequency statistics, so the result is fixed and neither dynamic nor context-sensitive.

  • Typical use: widely used in subword models such as GPT-2; well suited to vocabulary-rich languages and to any setting where a smaller vocabulary is desired.

WordPiece
  • Overview: WordPiece is a subword algorithm similar to BPE, used mainly by Google in the BERT model. Its strategy is greedy: it always tries to match the longest possible subword first. A word that is not in the vocabulary as a whole is split into several subwords, and continuation subwords inside a word carry the "##" prefix.

  • Pros: handles unseen words effectively by splitting them into subwords that cover many possible combinations, so the "[UNK]" token appears rarely.

  • Cons: depends on training over a corpus, which requires more time and compute.

  • Typical use: BERT and its derivatives; suitable for longer texts or whenever stable tokenization behaviour is wanted.

Unigram
  • Overview: Unigram is another subword algorithm. Unlike BPE and WordPiece, it relies on a probabilistic model to choose the best segmentation: it considers several ways to segment a sentence and keeps the combination with the highest probability.

  • Pros: instead of fixed merge rules, Unigram selects the best segmentation probabilistically, giving it some flexibility and context sensitivity; it compresses the vocabulary effectively and reduces OOV problems.

  • Cons: more complex than the other algorithms and potentially more expensive to compute.

  • Typical use: models such as XLNet and SentencePiece-based tokenizers, wherever flexible segmentation is needed (see the sketch below).
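
A minimal sketch of a Unigram setup (vocab_size and the other settings are illustrative assumptions):

from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Metaspace

# Unigram scores candidate segmentations and keeps the most probable one
uni_tokenizer = Tokenizer(Unigram())
uni_tokenizer.pre_tokenizer = Metaspace()
trainer = UnigramTrainer(vocab_size=8000, special_tokens=["[UNK]"], unk_token="[UNK]")

files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
uni_tokenizer.train(files, trainer)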

Summary comparison

Algorithm  | Tokenization approach                              | Vocab size | OOV handling        | Main models
WordLevel  | maps whole words directly to IDs                   | very large | uses "[UNK]"        | simple text-analysis tasks
BPE        | subword merging based on frequency statistics      | medium     | subword composition | GPT-2 and similar models
WordPiece  | greedy matching, "##" marks continuation subwords  | medium     | subword composition | BERT and its variants
Unigram    | probabilistic model, best-scoring segmentation     | medium     | dynamic selection   | SentencePiece, XLNet

Post-Processors

  • TemplateProcessing

  • See the TemplateProcessing examples above.

Decoders

ByteLevel
Metaspace
WordPiece

BPEDecoder
CTC
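
Each decoder is set the same way as the WordPiece decoder shown earlier; for example, a ByteLevel (GPT-2 style) tokenizer would pair with the ByteLevel decoder (illustrative):

from tokenizers import decoders

# Undo the byte-to-visible-character remapping done by the ByteLevel pre-tokenizer
tokenizer.decoder = decoders.ByteLevel()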

API

Normalizers

BertNormalizer
Lowercase

Trainers

BpeTrainer
UnigramTrainer
WordLevelTrainer
WordPieceTrainer
