2.4. bag-of-words

  • bag-of-words model: converts each document into a fixed-length vector of integers

2.4.1. Example

For example, given the two documents:

John likes to watch movies. Mary likes movies too.
John also likes to watch football games. Mary hates football.

=>

[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]
[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]

# Each element counts the number of times a specific word occurs in the document
# The order of the elements is arbitrary
# In this example, the order of the elements corresponds to the words:
["John", "likes", "to", "watch", "movies", "Mary", "too", "also", "football", "games", "hates"]

Bag-of-words models are surprisingly effective, but have several weaknesses:

1. they lose all information about word order
    Note: n-gram models do consider word phrases of length n, representing documents as fixed-length vectors that capture local word order, but they suffer from data sparsity and high dimensionality
2. the model makes no attempt to learn the meaning of the underlying words, so the distance between vectors does not always reflect the difference in meaning

Note

The Word2Vec model addresses this second problem.
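
For reference, a minimal training sketch using the gensim library (an assumed choice; the note above does not name an implementation, and the toy corpus is far too small for the learned distances to be meaningful):

    from gensim.models import Word2Vec

    # Toy corpus: each document is a pre-tokenized list of words.
    sentences = [
        ["John", "likes", "to", "watch", "movies"],
        ["Mary", "likes", "movies", "too"],
        ["John", "also", "likes", "to", "watch", "football", "games"],
        ["Mary", "hates", "football"],
    ]

    # Illustrative hyperparameters; real corpora need far more text.
    model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50)

    # Unlike bag-of-words counts, these dense vectors are trained so
    # that distances reflect distributional similarity between words.
    print(model.wv.similarity("movies", "football"))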

2.4.2. N-Grams

  • An n-gram model counts the frequencies of sequences of n adjacent words and uses them to estimate the probability of the next word (a counting sketch follows this list).

  • n can take any value such as 1, 2, or 3; common choices are the unigram (1-gram), bigram (2-gram), and trigram (3-gram) models

  • Example: bag-of-n-grams models (NB, SVM, BiNB):

    NB: Naive Bayes
    SVM: Support Vector Machine
    BiNB: Bigram Naive Bayes
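
As referenced above, a minimal plain-Python sketch of extracting and counting n-grams (the helper name ngrams and the tokenization are illustrative assumptions):

    from collections import Counter

    def ngrams(tokens, n):
        # All runs of n adjacent tokens, as tuples.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "John likes to watch movies".split()

    print(ngrams(tokens, 2))
    # [('John', 'likes'), ('likes', 'to'), ('to', 'watch'), ('watch', 'movies')]

    # A bag-of-bigrams model counts these tuples instead of single words,
    # which captures local word order but blows up the vocabulary size.
    bigram_counts = Counter(ngrams(tokens, 2))

    # An n-gram language model then estimates
    # P(w_n | w_1 .. w_{n-1}) ≈ count(w_1 .. w_n) / count(w_1 .. w_{n-1})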
    
