2.4. bag-of-words

  • bag-of-words model: converts each document into a fixed-length vector of integers

2.4.1. Example

For example, given the two documents:

John likes to watch movies. Mary likes movies too.
John also likes to watch football games. Mary hates football.

=>

[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]
[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]

# Each element counts the number of times a specific word occurs in the document
# The order of the elements is arbitrary
# In this example, the order of the elements corresponds to the words:
["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games", "hates"]

Bag-of-words models are surprisingly effective, but have several weaknesses:

1. they lose all information about word order
    Note: n-gram models do consider word phrases of length n, representing documents as fixed-length vectors that capture local word order, but they suffer from data sparsity and high dimensionality
2. the model makes no attempt to learn the meaning of the underlying words, so the distance between vectors does not always reflect the difference in meaning

Note

The Word2Vec model addresses this second problem.
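
For reference, a minimal training sketch using the gensim library (an assumed choice; the note above does not name an implementation, and the toy corpus is far too small for the learned distances to be meaningful):

    from gensim.models import Word2Vec

    # Toy corpus: each document is a pre-tokenized list of words.
    sentences = [
        ["John", "likes", "to", "watch", "movies"],
        ["Mary", "likes", "movies", "too"],
        ["John", "also", "likes", "to", "watch", "football", "games"],
        ["Mary", "hates", "football"],
    ]

    # Illustrative hyperparameters; real corpora need far more text.
    model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50)

    # Unlike bag-of-words counts, these dense vectors are trained so
    # that distances reflect distributional similarity between words.
    print(model.wv.similarity("movies", "football"))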

2.4.2. N-Grams

  • An n-gram model counts the frequencies of sequences of n adjacent words and uses them to estimate the probability of the next word (a counting sketch follows this list).

  • n can take any value such as 1, 2, or 3; common choices are the unigram (1-gram), bigram (2-gram), and trigram (3-gram) models

  • Example: bag-of-n-grams models (NB, SVM, BiNB):

    NB: Naive Bayes
    SVM: Support Vector Machine
    BiNB: Bigram Naive Bayes
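
As referenced above, a minimal plain-Python sketch of extracting and counting n-grams (the helper name ngrams and the tokenization are illustrative assumptions):

    from collections import Counter

    def ngrams(tokens, n):
        # All runs of n adjacent tokens, as tuples.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "John likes to watch movies".split()

    print(ngrams(tokens, 2))
    # [('John', 'likes'), ('likes', 'to'), ('to', 'watch'), ('watch', 'movies')]

    # A bag-of-bigrams model counts these tuples instead of single words,
    # which captures local word order but blows up the vocabulary size.
    bigram_counts = Counter(ngrams(tokens, 2))

    # An n-gram language model then estimates
    # P(w_n | w_1 .. w_{n-1}) ≈ count(w_1 .. w_n) / count(w_1 .. w_{n-1})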
    
