Tokenizers
##########

* From: https://huggingface.co/docs/tokenizers/index

Getting started
===============

Quicktour
---------

Build a tokenizer from scratch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Training the tokenizer::

    from tokenizers import Tokenizer
    from tokenizers.models import BPE

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

    # train our tokenizer on the wikitext files:
    from tokenizers.trainers import BpeTrainer

    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

    # For languages such as English, add splitting on whitespace (otherwise frequent
    # co-occurrences like "it is" could end up being learned as a single token)
    from tokenizers.pre_tokenizers import Whitespace

    tokenizer.pre_tokenizer = Whitespace()

    # train
    files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
    tokenizer.train(files, trainer)

    tokenizer.save("data/tokenizer-wiki.json")

    # reload it later:
    tokenizer = Tokenizer.from_file("data/tokenizer-wiki.json")

Using the tokenizer::

    output = tokenizer.encode("Hello, y'all! How are you 😁 ?")

    print(output.tokens)
    # ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]

    print(output.ids)
    # [27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35]

    print(output.offsets[9])
    # (26, 27)
    sentence = "Hello, y'all! How are you 😁 ?"
    sentence[26:27]
    # "😁"

Post-processing::

    tokenizer.token_to_id("[SEP]")
    # 2

    from tokenizers.processors import TemplateProcessing

    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
        special_tokens=[
            ("[CLS]", tokenizer.token_to_id("[CLS]")),
            ("[SEP]", tokenizer.token_to_id("[SEP]")),
        ],
    )

    # Notes
    # 1. The single-sentence template is "[CLS] $A [SEP]", where $A stands for our sentence.
    # 2. The sentence-pair template is "[CLS] $A [SEP] $B [SEP]", where $A is the first
    #    sentence and $B the second one.

    # single-sentence example
    output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
    print(output.tokens)
    # ["[CLS]", "Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?", "[SEP]"]

    # sentence-pair example
    output = tokenizer.encode("Hello, y'all!", "How are you 😁 ?")
    print(output.tokens)
    # ["[CLS]", "Hello", ",", "y", "'", "all", "!", "[SEP]", "How", "are", "you", "[UNK]", "?", "[SEP]"]
    print(output.type_ids)
    # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

Encoding multiple sentences in a batch::

    # process your texts by batches
    output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])

    # batch of sentence pairs
    output = tokenizer.encode_batch(
        [["Hello, y'all!", "How are you 😁 ?"], ["Hello to you too!", "I'm fine, thank you!"]]
    )

    # automatically pad the outputs to the longest sentence
    tokenizer.enable_padding(pad_id=3, pad_token="[PAD]")
    # 3 is the integer ID reserved for the [PAD] token in this tokenizer's vocabulary;
    # whenever the tokenizer pads with [PAD], that token is automatically mapped to 3
    output = tokenizer.encode_batch(["Hello, y'all!", "How are you 😁 ?"])
    print(output[1].tokens)
    # ["[CLS]", "How", "are", "you", "[UNK]", "?", "[SEP]", "[PAD]"]
    print(output[1].attention_mask)
    # [1, 1, 1, 1, 1, 1, 1, 0]

Pretrained
^^^^^^^^^^

Using a pretrained tokenizer::

    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

Importing a pretrained tokenizer from legacy vocabulary files::

    from tokenizers import BertWordPieceTokenizer

    tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
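A tokenizer loaded with ``Tokenizer.from_pretrained`` exposes the same ``encode`` API as
the one trained from scratch above. A minimal sketch; the exact tokens and IDs depend on
the downloaded ``bert-base-uncased`` files, so the comment is only indicative::

    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
    output = tokenizer.encode("Hello, y'all! How are you?")
    print(output.tokens)
    # lowercased WordPiece tokens wrapped in special tokens,
    # e.g. ["[CLS]", "hello", ",", "y", "'", "all", "!", ..., "[SEP]"]
    print(output.ids)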
The tokenization pipeline
-------------------------

When calling ``Tokenizer.encode`` or ``Tokenizer.encode_batch``, the ``input text(s)``
go through the following pipeline::

    normalization
    pre-tokenization
    model
    post-processing

Normalization
^^^^^^^^^^^^^

* Normalization is a set of operations applied to a raw string to make it less random
  or "cleaner".
* Common operations include stripping whitespace, removing accented characters, or
  lowercasing all the text.
* Each normalization operation is represented in the 🤗 Tokenizers library by a
  ``Normalizer``, and several of them can be combined with ``normalizers.Sequence``.

A normalizer applying **NFD Unicode normalization** and **removing accents**::

    from tokenizers import normalizers
    from tokenizers.normalizers import NFD, StripAccents

    normalizer = normalizers.Sequence([NFD(), StripAccents()])

Using it::

    normalizer.normalize_str("Héllò hôw are ü?")
    # "Hello how are u?"

Customize the ``Tokenizer``'s normalizer by changing the corresponding attribute::

    tokenizer.normalizer = normalizer

Pre-Tokenization
^^^^^^^^^^^^^^^^

* Pre-tokenization is the act of splitting a text into smaller objects that give an
  upper bound to what your tokens will be at the end of training.
* A good way to think of this is that the pre-tokenizer will split your text into
  "words" and then, your final tokens will be parts of those words.

An easy way to pre-tokenize inputs is to split on spaces and punctuation::

    from tokenizers.pre_tokenizers import Whitespace

    pre_tokenizer = Whitespace()
    pre_tokenizer.pre_tokenize_str("Hello! How are you? I'm fine, thank you.")
    # [("Hello", (0, 5)), ("!", (5, 6)), ("How", (7, 10)), ("are", (11, 14)), ("you", (15, 18)),
    #  ("?", (18, 19)), ("I", (20, 21)), ("'", (21, 22)), ('m', (22, 23)), ("fine", (24, 28)),
    #  (",", (28, 29)), ("thank", (30, 35)), ("you", (36, 39)), (".", (39, 40))]

Any ``PreTokenizer`` can be combined with others::

    # split on whitespace, punctuation and digits, separating numbers into individual digits
    from tokenizers import pre_tokenizers
    from tokenizers.pre_tokenizers import Digits

    pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])
    pre_tokenizer.pre_tokenize_str("Call 911!")
    # [("Call", (0, 4)), ("9", (5, 6)), ("1", (6, 7)), ("1", (7, 8)), ("!", (8, 9))]

Customize the ``Tokenizer``'s pre-tokenizer by changing the corresponding attribute::

    tokenizer.pre_tokenizer = pre_tokenizer

Model
^^^^^

* Once the input texts are normalized and pre-tokenized, the ``Tokenizer`` applies the
  model on the pre-tokens.
* The role of the model is to split your "words" into tokens, using the rules it has
  learned. It is also responsible for mapping those tokens to their corresponding IDs
  in the vocabulary of the model.
* This model is passed along when initializing the ``Tokenizer``.

The Tokenizers library supports::

    models.BPE
    models.Unigram
    models.WordLevel
    models.WordPiece

Example::

    # the model is passed when the Tokenizer is initialized
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
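The other models plug in the same way, each paired with the trainer of the same name
from ``tokenizers.trainers``. A minimal sketch with ``WordLevel``; the tiny in-memory
corpus here is made up purely for illustration::

    from tokenizers import Tokenizer
    from tokenizers.models import WordLevel
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import WordLevelTrainer

    tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    trainer = WordLevelTrainer(special_tokens=["[UNK]", "[PAD]"])
    # train_from_iterator accepts any iterator of raw strings
    tokenizer.train_from_iterator(["Hello, y'all!", "How are you?"], trainer)

    print(tokenizer.encode("Hello, how are you?").tokens)
    # lowercase "how" was never seen during training (only "How" was), so it maps to "[UNK]"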
Post-Processing
^^^^^^^^^^^^^^^

* Post-processing is the last step, used to perform any additional transformation on
  the ``Encoding`` before it is returned, such as adding potential special tokens.
* Customize the ``Tokenizer``'s post-processor by setting the corresponding attribute.

Example of post-processing to make the inputs suitable for a BERT model::

    from tokenizers.processors import TemplateProcessing

    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
        special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
    )

* Unlike the pre-tokenizer or the normalizer, there is no need to retrain the tokenizer
  after changing its post-processor.

All together: a BERT tokenizer from scratch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

BERT relies on WordPiece, so we instantiate a new ``Tokenizer`` with this model::

    from tokenizers import Tokenizer
    from tokenizers.models import WordPiece

    bert_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

BERT preprocesses text by removing accents and lowercasing::

    from tokenizers import normalizers
    from tokenizers.normalizers import NFD, Lowercase, StripAccents

    bert_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])

Split on whitespace and punctuation::

    from tokenizers.pre_tokenizers import Whitespace

    bert_tokenizer.pre_tokenizer = Whitespace()

Use the BERT post-processing template::

    from tokenizers.processors import TemplateProcessing

    bert_tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
        special_tokens=[
            ("[CLS]", 1),
            ("[SEP]", 2),
        ],
    )

Use this tokenizer and train it on wikitext::

    from tokenizers.trainers import WordPieceTrainer

    trainer = WordPieceTrainer(vocab_size=30522, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
    bert_tokenizer.train(files, trainer)

    bert_tokenizer.save("data/bert-wiki.json")

Decoding
^^^^^^^^

* In addition to encoding input texts, a ``Tokenizer`` also has an API for decoding,
  i.e. converting the IDs generated by your model back into text::

    Tokenizer.decode        (for one predicted text)
    Tokenizer.decode_batch  (for a batch of predictions)

The decoder first converts the IDs back to tokens (using the tokenizer's vocabulary)
and removes all special tokens, then joins those tokens with spaces::

    output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
    print(output.ids)
    # [1, 27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35, 2]

    tokenizer.decode([1, 27253, 16, 93, 11, 5097, 5, 7961, 5112, 6218, 0, 35, 2])
    # "Hello , y ' all ! How are you ?"

If the model you are using adds special characters to represent subtokens of a given
"word" (like the ``"##"`` in WordPiece)::

    output = bert_tokenizer.encode("Welcome to the 🤗 Tokenizers library.")
    print(output.tokens)
    # ["[CLS]", "welcome", "to", "the", "[UNK]", "tok", "##eni", "##zer", "##s", "library", ".", "[SEP]"]

    bert_tokenizer.decode(output.ids)
    # "welcome to the tok ##eni ##zer ##s library ."

In that case, a custom decoder is needed to handle them properly::

    from tokenizers import decoders

    bert_tokenizer.decoder = decoders.WordPiece()
    bert_tokenizer.decode(output.ids)
    # "welcome to the tokenizers library."

Components
----------

Normalizers
^^^^^^^^^^^

Commonly used ``Normalizer``\ s::

    NFD
    NFKD
    NFC
    NFKC
    Lowercase       => Input: HELLO ὈΔΥΣΣΕΎΣ      Output: hello ὀδυσσεύς
    Strip           => Input: " hi "              Output: "hi"
    StripAccents    => Input: é                   Output: e
    Replace         => Replace("a", "e") will behave like this:
                       Input: "banana"            Output: "benene"
    BertNormalizer  => Provides an implementation of the Normalizer used in the original BERT:
                       clean_text
                       handle_chinese_chars
                       strip_accents
                       lowercase
    Sequence        => Composes multiple normalizers that will run in the provided order
                       Sequence([NFKC(), Lowercase()])

Pre-tokenizers
^^^^^^^^^^^^^^

* The ``PreTokenizer`` takes care of splitting the input according to a set of rules.

Algorithms::

    ByteLevel           Splits on whitespaces while remapping all the bytes to a set of
                        visible characters. Introduced by OpenAI with GPT-2.
    Whitespace          => Input: "Hello there!"    Output: "Hello", "there", "!"
                        Splits on word boundaries (using the regular expression
                        `\w+|[^\w\s]+`)
    WhitespaceSplit     Splits on any whitespace character
    Punctuation         => Input: "Hello?"          Output: "Hello", "?"
    Metaspace           => Input: "Hello there"     Output: "Hello", "▁there"
                        Splits on whitespaces and replaces them with a special
                        char "▁" (U+2581)
    CharDelimiterSplit  => Example with x:
                        Input: "Helloxthere"        Output: "Hello", "there"
    Digits              => Input: "Hello123there"   Output: "Hello", "123", "there"
                        Splits the numbers from any other characters.

``Split`` takes the following 3 parameters:

1. ``pattern`` should be either a custom string or regexp.
2. ``behavior`` should be one of:

   * removed
   * isolated
   * merged_with_previous
   * merged_with_next
   * contiguous

3. ``invert`` should be a boolean flag.

Example with pattern = " " (a space), behavior = "isolated", invert = False::

    Input:  "Hello, how are you?"
    Output: "Hello,", " ", "how", " ", "are", " ", "you?"

``Sequence`` example::

    Sequence([Punctuation(), WhitespaceSplit()])
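As a quick illustration of such a composition, a small sketch running it through
``pre_tokenize_str``; the offsets shown are the ones we would expect for this input::

    from tokenizers import pre_tokenizers
    from tokenizers.pre_tokenizers import Punctuation, WhitespaceSplit

    # isolate punctuation first, then drop the remaining whitespace
    pre_tokenizer = pre_tokenizers.Sequence([Punctuation(), WhitespaceSplit()])
    pre_tokenizer.pre_tokenize_str("Hello, how are you?")
    # [("Hello", (0, 5)), (",", (5, 6)), ("how", (7, 10)), ("are", (11, 14)),
    #  ("you", (15, 18)), ("?", (18, 19))]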
Models
^^^^^^

Tokenization models::

    models.BPE        => Byte-Pair-Encoding
    models.Unigram    => Unigram language model
    models.WordLevel  => word-to-ID mapping
    models.WordPiece  => WordPiece (as used by BERT)

WordLevel
"""""""""

* Overview: WordLevel is the most basic tokenization approach, the usual "word-level
  tokenization". It maps every whole word to a unique ID without splitting it further.
* Pros: simple and intuitive, easy to understand and implement; only a word-to-ID
  mapping table is needed.
* Cons: a very large vocabulary is needed to cover every word that may appear, which
  makes the model large; out-of-vocabulary (OOV) words are very likely to end up as
  "[UNK]" (unknown).
* Use cases: tasks with a small vocabulary, or simple applications without strict
  vocabulary-size requirements.

BPE
"""

* Overview: BPE is a popular subword tokenization algorithm. It starts from the
  character level and iteratively creates new subwords (tokens) by merging the
  character pairs that occur most frequently in the corpus, building longer and
  longer subwords.
* Pros: BPE can handle unseen words, because it can break them into subwords and
  recombine them, so its vocabulary can stay relatively small.
* Cons: BPE is based on frequency statistics, so the segmentation is fixed; it is
  neither dynamic nor context-sensitive.
* Use cases: widely used by subword models such as GPT-2; suitable for vocabulary-rich
  languages and for scenarios where a smaller vocabulary is desired.

WordPiece
"""""""""

* Overview: WordPiece is a subword tokenization algorithm similar to BPE, used mainly
  by Google in the BERT model. Its segmentation strategy is greedy: it always tries to
  produce the longest possible subword first. A word that is not in the vocabulary is
  split into several subwords, with the "##" prefix marking continuation pieces inside
  a word.
* Pros: handles unseen words effectively by splitting them into several subwords to
  cover more combinations; "[UNK]" tokens appear less often.
* Cons: relies on training over a corpus, which takes more time and compute.
* Use cases: BERT and its derivatives; suitable for longer texts or when stable
  tokenization results are desired.

Unigram
"""""""

* Overview: Unigram is another subword tokenization algorithm. Unlike BPE and
  WordPiece, Unigram relies on a probabilistic model to pick the best segmentation:
  it considers several ways to tokenize a sentence and chooses the combination with
  the highest probability.
* Pros: Unigram does not rely on fixed merge rules but dynamically selects the best
  segmentation based on probabilities, which gives it some flexibility and context
  sensitivity; it can effectively compress the vocabulary and reduce OOV issues.
* Cons: more complex than the other algorithms and potentially more expensive to
  compute.
* Use cases: used in models such as XLNet and in SentencePiece; suitable for language
  model applications that need flexible tokenization.

Summary comparison
""""""""""""""""""

+-----------+----------------------------------+-----------------+----------------+----------------------+
| Algorithm | Segmentation                     | Vocabulary size | Unknown words  | Typical models       |
+===========+==================================+=================+================+======================+
| WordLevel | direct word-to-ID mapping        | very large      | "[UNK]"        | simple text analysis |
+-----------+----------------------------------+-----------------+----------------+----------------------+
| BPE       | subword merges, frequency based  | medium          | subword pieces | GPT-2 and similar    |
+-----------+----------------------------------+-----------------+----------------+----------------------+
| WordPiece | greedy matching, "##" prefix     | medium          | subword pieces | BERT and variants    |
+-----------+----------------------------------+-----------------+----------------+----------------------+
| Unigram   | probabilistic, best segmentation | medium          | dynamic choice | SentencePiece, XLNet |
+-----------+----------------------------------+-----------------+----------------+----------------------+

Post-Processors
^^^^^^^^^^^^^^^

* ``TemplateProcessing``
* See the details above.

Decoders
^^^^^^^^

::

    ByteLevel
    Metaspace
    WordPiece
    BPEDecoder
    CTC

API
===

Normalizers
-----------

::

    BertNormalizer
    Lowercase

Trainers
--------

::

    BpeTrainer
    UnigramTrainer
    WordLevelTrainer
    WordPieceTrainer
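As a small closing example, the ``BertNormalizer`` listed above bundles the
clean_text / handle_chinese_chars / strip_accents / lowercase options mentioned under
Components; a minimal sketch of using it on its own::

    from tokenizers.normalizers import BertNormalizer

    # enable accent stripping and lowercasing explicitly
    normalizer = BertNormalizer(strip_accents=True, lowercase=True)
    normalizer.normalize_str("Héllò hôw are ü?")
    # expected: "hello how are u?"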