.. _bag-of-words:

bag-of-words
############

* bag-of-words model: converts each document into a fixed-length vector of integer word counts

Example
=======

For example::

    John likes to watch movies. Mary likes movies too.
    John also likes to watch football games. Mary hates football.

    => [1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]
       [1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]

    # each element counts how many times a particular word occurs in the document
    # the order of the elements is arbitrary
    # in this example, the order corresponds to the words:
    #   ["John", "likes", "to", "watch", "movies", "Mary", "too",
    #    "also", "football", "games", "hates"]

A minimal Python sketch of this counting step appears at the end of this page.

Bag-of-words models are surprisingly effective, but have several weaknesses:

1. They lose all information about word order.

   Note: bag-of-n-grams models do consider phrases of up to n words, representing
   each document as a fixed-length vector that captures local word order, but they
   suffer from data sparsity and high dimensionality.

2. The model does not attempt to learn the meaning of the underlying words, so the
   distance between vectors does not always reflect the difference in meaning.

.. note:: The :ref:`Word2Vec <word2vec>` model addresses this second problem.

N-Grams
=======

* An n-gram model counts how often sequences of n adjacent words occur, and uses
  these frequencies to estimate the probability of the next word (see the second
  sketch at the end of this page).
* n can take any value such as 1, 2, or 3; the common cases are the unigram
  (1-gram), bigram (2-gram), and trigram (3-gram) models.
* Examples: bag-of-n-grams models (NB, SVM, BiNB)::

    NB:   Naive Bayes
    SVM:  Support Vector Machine
    BiNB: Bigram Naive Bayes
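
To make the counting step in the example above concrete, here is a minimal sketch
in plain Python using only the standard library; the helper names ``tokenize`` and
``bag_of_words`` are illustrative, not from any particular library::

    from collections import Counter

    documents = [
        "John likes to watch movies. Mary likes movies too.",
        "John also likes to watch football games. Mary hates football.",
    ]

    def tokenize(text):
        # illustrative helper: strip periods, split on whitespace
        return text.replace(".", "").split()

    # build the vocabulary in first-seen order (any fixed order works,
    # since the model ignores word order entirely)
    vocab = []
    for doc in documents:
        for token in tokenize(doc):
            if token not in vocab:
                vocab.append(token)

    def bag_of_words(doc):
        # convert a document into a fixed-length vector of word counts
        counts = Counter(tokenize(doc))
        return [counts[word] for word in vocab]

    for doc in documents:
        print(bag_of_words(doc))
    # [1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]
    # [1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]

First-seen order happens to reproduce the word order shown in the example above.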
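
The N-Grams section above describes estimating next-word probabilities from counts
of adjacent words. Here is a minimal bigram (n = 2) sketch, assuming simple
maximum-likelihood estimation, ``P(next | prev) = count(prev, next) / count(prev)``;
the helper name ``bigram_probabilities`` is illustrative::

    from collections import Counter

    def bigram_probabilities(tokens):
        # count each adjacent pair, then normalise by how often the
        # first word appears with any successor
        unigrams = Counter(tokens[:-1])
        bigrams = Counter(zip(tokens, tokens[1:]))
        return {(prev, nxt): count / unigrams[prev]
                for (prev, nxt), count in bigrams.items()}

    tokens = "John likes to watch movies Mary likes movies too".split()
    probs = bigram_probabilities(tokens)
    print(probs[("likes", "movies")])
    # 0.5 -- "likes" is followed by "to" once and by "movies" once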