.. _bag-of-words:

bag-of-words
############

* bag-of-words model: converts each document into a fixed-length vector of integer word counts

Example
=======

For example::

    John likes to watch movies. Mary likes movies too.
    John also likes to watch football games. Mary hates football.

    => [1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]
       [1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]

    # each element counts how many times a particular word occurs in the document
    # the order of the elements is arbitrary
    # in this example, the order corresponds to the words:
    #   ["John", "likes", "to", "watch", "movies", "Mary", "too",
    #    "also", "football", "games", "hates"]

A minimal Python sketch of this counting step appears at the end of this page.

Bag-of-words models are surprisingly effective, but have several weaknesses:

1. They lose all information about word order.

   Note: bag-of-n-grams models do consider phrases of up to n words, representing
   each document as a fixed-length vector that captures local word order, but they
   suffer from data sparsity and high dimensionality.

2. The model does not attempt to learn the meaning of the underlying words, so the
   distance between vectors does not always reflect the difference in meaning.

.. note:: The :ref:`Word2Vec <word2vec>` model addresses this second problem.

N-Grams
=======

* An n-gram model counts how often sequences of n adjacent words occur, and uses
  these frequencies to estimate the probability of the next word (see the second
  sketch at the end of this page).
* n can take any value such as 1, 2, or 3; the common cases are the unigram
  (1-gram), bigram (2-gram), and trigram (3-gram) models.
* Examples: bag-of-n-grams models (NB, SVM, BiNB)::

    NB:   Naive Bayes
    SVM:  Support Vector Machine
    BiNB: Bigram Naive Bayes
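
To make the counting step in the example above concrete, here is a minimal sketch
in plain Python using only the standard library; the helper names ``tokenize`` and
``bag_of_words`` are illustrative, not from any particular library::

    from collections import Counter

    documents = [
        "John likes to watch movies. Mary likes movies too.",
        "John also likes to watch football games. Mary hates football.",
    ]

    def tokenize(text):
        # illustrative helper: strip periods, split on whitespace
        return text.replace(".", "").split()

    # build the vocabulary in first-seen order (any fixed order works,
    # since the model ignores word order entirely)
    vocab = []
    for doc in documents:
        for token in tokenize(doc):
            if token not in vocab:
                vocab.append(token)

    def bag_of_words(doc):
        # convert a document into a fixed-length vector of word counts
        counts = Counter(tokenize(doc))
        return [counts[word] for word in vocab]

    for doc in documents:
        print(bag_of_words(doc))
    # [1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]
    # [1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]

First-seen order happens to reproduce the word order shown in the example above.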
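
The N-Grams section above describes estimating next-word probabilities from counts
of adjacent words. Here is a minimal bigram (n = 2) sketch, assuming simple
maximum-likelihood estimation, ``P(next | prev) = count(prev, next) / count(prev)``;
the helper name ``bigram_probabilities`` is illustrative::

    from collections import Counter

    def bigram_probabilities(tokens):
        # count each adjacent pair, then normalise by how often the
        # first word appears with any successor
        unigrams = Counter(tokens[:-1])
        bigrams = Counter(zip(tokens, tokens[1:]))
        return {(prev, nxt): count / unigrams[prev]
                for (prev, nxt), count in bigrams.items()}

    tokens = "John likes to watch movies Mary likes movies too".split()
    probs = bigram_probabilities(tokens)
    print(probs[("likes", "movies")])
    # 0.5 -- "likes" is followed by "to" once and by "movies" once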