5.15.2. Core Tutorials
Note
See the demo_python project.
Core Concepts
Key definitions:
1. Document: some text (a string).
2. Corpus: a collection of documents (a list of strings).
3. Vector: a mathematically convenient representation of a document, as a vector of features. The difference between a document and a vector is that the former is text, while the latter is a mathematically convenient representation of that text.
4. Model: an algorithm for transforming vectors from one representation to another.
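A minimal sketch of these four concepts, assuming simple whitespace tokenization (the documents and dictionary here are illustrative, not from the project):
from gensim import corpora

documents = ["human machine interface", "survey of user opinion"]  # a corpus: a list of documents
texts = [doc.split() for doc in documents]                         # tokenize each document
dictionary = corpora.Dictionary(texts)                             # map each token to an integer id
vectors = [dictionary.doc2bow(text) for text in texts]             # each document as a sparse vector
print(vectors)  # e.g. [[(0, 1), (1, 1), (2, 1)], [(3, 1), (4, 1), (5, 1)]]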
Note
Depending on how the representation is obtained, two different documents may have the same vector representation.
When a corpus is too large to load into memory all at once, process it with streaming; see
Corpus Streaming – One Document at a Time
and the related module ``corpus_streaming_tutorial``.
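A minimal sketch of such a streaming corpus, assuming a plain-text file mycorpus.txt (hypothetical) with one document per line:
from gensim import corpora

# build the dictionary with one streamed pass over the file
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))

class MyCorpus:
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # yield one bag-of-words vector at a time; the full corpus never sits in RAM
            yield dictionary.doc2bow(line.lower().split())

corpus = MyCorpus()  # documents are read lazily, on each iteration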
Corpora and Vector Spaces
Demonstrates transforming text into a vector space representation.
Save a corpus in the Matrix Market format:
from gensim import corpora

corpus = [[(1, 0.5)], []]  # make one document empty, for the heck of it
corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)

# save in other formats
# corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
# corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
# corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)
Load a corpus iterator from a Matrix Market file:
corpus = corpora.MmCorpus('/tmp/corpus.mm')
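The loaded object streams documents from disk rather than holding them in memory; one way to inspect the contents is to iterate over it:
for doc in corpus:
    print(doc)  # each document is a list of (feature_id, weight) pairs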
Topics and Transformations
Gensim implements several popular Vector Space Model algorithms; applying a trained model to documents follows one shared pattern, sketched after this list:
1. Tf-Idf can normalize the resulting vectors to (Euclidean) unit length
from gensim import models
model = models.TfidfModel(corpus, normalize=True)
2. Okapi BM25 is a standard ranking function used by search engines to estimate the relevance of documents to a given search query
model = models.OkapiBM25Model(corpus)
3. LSI (or LSA) transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into a latent space of lower dimensionality
LSI: Latent Semantic Indexing
model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)
Parameters:
num_topics=300: the number of topics the LSI model should produce.
LSI is a topic-modeling method based on Singular Value Decomposition (SVD); it uses dimensionality reduction to represent each document as a vector in the topic space.
A unique feature of LSI training is that it can be continued at any time, simply by providing more training documents:
model.add_documents(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model
model.add_documents(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents
lsi_vec = model[tfidf_vec]
4. RP aims to reduce vector space dimensionality. It is a very efficient (memory- and CPU-friendly) approach that approximates TfIdf distances between documents by introducing a little randomness
model = models.RpModel(tfidf_corpus, num_topics=500)
5. LDA is another transformation, from bag-of-words counts into a lower-dimensional topic space
LDA is a probabilistic extension of LSA
model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
6. HDP is a non-parametric Bayesian method; unlike the models above, it infers the number of topics from the data
model = models.HdpModel(corpus, id2word=dictionary)
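All of these transformations share one usage pattern: wrapping a corpus with a trained model yields the transformed vectors. A minimal sketch, assuming model is any of the transformations trained above and corpus matches the space it was trained on:
# the transformation is computed lazily, one document at a time
transformed_corpus = model[corpus]
for vec in transformed_corpus:
    print(vec)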
Term Frequency * Inverse Document Frequency, Tf-Idf: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
Okapi Best Matching, Okapi BM25: https://en.wikipedia.org/wiki/Okapi_BM25
Latent Semantic Indexing, LSI (or sometimes LSA): https://en.wikipedia.org/wiki/Latent_semantic_indexing
Random Projections, RP: http://www.cis.hut.fi/ella/publications/randproj_kdd.pdf
Latent Dirichlet Allocation, LDA: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
Hierarchical Dirichlet Process, HDP: http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf
Similarity Queries
Demonstrates querying a corpus for similar documents.
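A minimal sketch of such a query, assuming lsi_corpus is a corpus already transformed into LSI space (as above) and query_vec is a new document converted into that same space:
from gensim import similarities

index = similarities.MatrixSimilarity(lsi_corpus)  # build a dense similarity index in RAM
sims = index[query_vec]  # cosine similarity of the query against every indexed document
print(sorted(enumerate(sims), key=lambda item: -item[1]))  # most similar documents first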