2.4. bag-of-words
bag-of-words model: converts each document into a fixed-length vector of integer word counts
2.4.1. Example
For example:
John likes to watch movies. Mary likes movies too.
John also likes to watch football games. Mary hates football.
=>
[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]
[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]
# each element counts how many times a particular word appears in the document
# the order of the elements is arbitrary
# in this example, the order of the elements corresponds to the words:
["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games", "hates"]
Bag-of-words models are surprisingly effective, but have several weaknesses:
1. they lose all information about word order
Note: bag-of-n-grams models do take word phrases of length n into account, representing a document as a fixed-length vector that captures local word order, but they suffer from data sparsity and high dimensionality
2. the model makes no attempt to learn the meaning of the underlying words, so the distance between vectors does not always reflect the difference in meaning
Note
The Word2Vec model addresses this second problem.
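As a forward pointer, a minimal sketch using the gensim 4.x Word2Vec API; the toy corpus and parameter values below are illustrative only, and meaningful vectors require a much larger corpus:
from gensim.models import Word2Vec

# toy corpus of pre-tokenized sentences, for illustration only
sentences = [
    ["john", "likes", "to", "watch", "movies"],
    ["mary", "likes", "movies", "too"],
    ["john", "also", "likes", "to", "watch", "football", "games"],
    ["mary", "hates", "football"],
]

# each word is mapped to a dense 50-dimensional vector
model = Word2Vec(sentences=sentences, vector_size=50, window=2, min_count=1, seed=1)

# distances between word vectors (cosine similarity) can reflect meaning,
# unlike raw bag-of-words counts
print(model.wv["movies"])
print(model.wv.similarity("movies", "football"))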
2.4.2. N-Grams
An n-gram model counts how often each sequence of n adjacent words occurs, and uses these counts to estimate the probability of the next word.
n can take any value such as 1, 2, or 3; the common cases are the unigram (1-gram), bigram (2-gram), and trigram (3-gram) models.
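A minimal sketch, with a helper function named ngrams invented here for illustration, of extracting n-grams and estimating a next-word probability from bigram counts:
from collections import Counter

def ngrams(tokens, n):
    # all contiguous subsequences of length n
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "john likes to watch movies mary likes movies too".split()
print(ngrams(tokens, 2))  # bigrams: ('john', 'likes'), ('likes', 'to'), ...

unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))

# maximum-likelihood estimate: P(next | prev) = count(prev, next) / count(prev)
p = bigram_counts[("likes", "movies")] / unigram_counts[("likes",)]
print(p)  # 0.5: "likes" is followed once by "to" and once by "movies"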
Examples: bag-of-n-grams models (NB, SVM, BiNB):
NB: Naive Bayes; SVM: Support Vector Machine; BiNB: Bigram Naive Bayes
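As an illustration of a bag-of-n-grams classifier, a Naive Bayes model over unigram and bigram counts could be sketched with scikit-learn (a library not otherwise used in this document; the tiny labeled corpus is invented purely for the example):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# invented toy corpus with topic labels, for illustration only
texts = [
    "John likes to watch movies",
    "Mary likes movies too",
    "John also likes to watch football games",
    "Mary hates football",
]
labels = ["movies", "movies", "sports", "sports"]

# ngram_range=(1, 2) produces bag-of-unigrams-and-bigrams count features
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

classifier = MultinomialNB()
classifier.fit(X, labels)

print(classifier.predict(vectorizer.transform(["Mary watches football"])))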