Core Tutorials
##############

.. note:: See the demo_python project.

Core Concepts
=============

* https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html

Key definitions:

1. Document: some text (a string).
2. Corpus: a collection of documents (a list of strings).
3. Vector: a mathematically convenient representation of a document, expressing
   each document as a vector of features. The difference between a document and
   a vector is that the former is text, while the latter is a mathematically
   convenient representation of that text.
4. Model: an algorithm for transforming vectors from one representation to
   another.

.. note:: Depending on how the representation is obtained, two different
   documents may have the same vector representation.

* When there is too much content, the corpus cannot be loaded into memory all
  at once.
* Gensim therefore processes it via *streaming*; see ``Corpus Streaming – One
  Document at a Time`` and the related module ``corpus_streaming_tutorial``
  (a minimal streaming sketch appears at the end of this page).

Corpora and Vector Spaces
=========================

* Demonstrates transforming text into a vector space representation.

Save a corpus in the Matrix Market format::

    corpus = [[(1, 0.5)], []]  # make one document empty, for the heck of it
    corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)
    # save in other formats:
    # corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
    # corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
    # corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)

Load a corpus iterator from a Matrix Market file::

    corpus = corpora.MmCorpus('/tmp/corpus.mm')

Topics and Transformations
==========================

Gensim implements several popular Vector Space Model algorithms:

1. Tf-Idf weights bag-of-words counts and can normalize the resulting vectors
   to (Euclidean) unit length::

       model = models.TfidfModel(corpus, normalize=True)

2. Okapi BM25 is the standard ranking function used by search engines to
   estimate the relevance of a document to a given search query::

       model = models.OkapiBM25Model(corpus)

3. LSI (Latent Semantic Indexing, or LSA) transforms documents from a
   bag-of-words or (preferably) TfIdf-weighted space into a latent space of
   lower dimensionality::

       model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)

   The parameter ``num_topics=300`` is the number of topics the LSI model will
   produce. LSI is a topic-modeling method based on Singular Value
   Decomposition (SVD): it reduces dimensionality so that each document is
   represented as a vector in topic space.

   LSI training is unique in that training can be continued at any time,
   simply by providing more training documents::

       model.add_documents(another_tfidf_corpus)  # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
       lsi_vec = model[tfidf_vec]  # convert some new document into the LSI space, without affecting the model

       model.add_documents(more_documents)  # tfidf_corpus + another_tfidf_corpus + more_documents
       lsi_vec = model[tfidf_vec]

4. RP (Random Projections) aims to reduce vector space dimensionality. It is a
   very efficient (memory- and CPU-friendly) approach that approximates TfIdf
   distances between documents by introducing a little randomness::

       model = models.RpModel(tfidf_corpus, num_topics=500)

5. LDA is another transformation from bag-of-words counts into a
   lower-dimensional topic space; it is a probabilistic extension of LSA::

       model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)

6. HDP is a non-parametric Bayesian method::

       model = models.HdpModel(corpus, id2word=dictionary)

* Term Frequency * Inverse Document Frequency, Tf-Idf: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
* Okapi Best Matching, Okapi BM25: https://en.wikipedia.org/wiki/Okapi_BM25
* Latent Semantic Indexing, LSI (or sometimes LSA): https://en.wikipedia.org/wiki/Latent_semantic_indexing
* Random Projections, RP: http://www.cis.hut.fi/ella/publications/randproj_kdd.pdf
* Latent Dirichlet Allocation, LDA: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
* Hierarchical Dirichlet Process, HDP: http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf

Similarity Queries
==================

Demonstrates querying a corpus for similar documents.
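A minimal sketch of such a query, not taken from the official example: the toy
documents, the query string, and ``num_topics=2`` below are all illustrative.
The corpus is indexed in LSI space, and the query is converted into the same
space before comparison::

    from gensim import corpora, models, similarities

    # a toy tokenized corpus (purely illustrative)
    texts = [
        ["human", "computer", "interaction"],
        ["graph", "minors", "survey"],
        ["computer", "system", "survey"],
    ]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # index the whole corpus in LSI space
    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
    index = similarities.MatrixSimilarity(lsi[corpus])

    # convert the query to bag-of-words, then into LSI space
    query_bow = dictionary.doc2bow("human computer survey".lower().split())
    query_lsi = lsi[query_bow]

    # cosine similarity of the query against every indexed document,
    # sorted from most to least similar
    sims = index[query_lsi]
    print(sorted(enumerate(sims), key=lambda item: -item[1]))

Note that ``MatrixSimilarity`` holds the entire index in RAM; for corpora that
do not fit in memory, ``similarities.Similarity`` keeps the index in shards on
disk instead.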
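Finally, the streaming pattern mentioned under Core Concepts: a corpus does not
have to be a list held in RAM; any iterable that yields one bag-of-words vector
at a time will do. A minimal sketch, assuming a plain-text file
``mycorpus.txt`` with one document per line (the filename and the whitespace
tokenization are illustrative)::

    from gensim import corpora

    class MyCorpus:
        """Stream one vector at a time, so the full corpus never sits in RAM."""

        def __init__(self, path, dictionary):
            self.path = path
            self.dictionary = dictionary

        def __iter__(self):
            with open(self.path) as fin:
                for line in fin:  # one document per line
                    yield self.dictionary.doc2bow(line.lower().split())

    # the dictionary itself can be built in a single streaming pass
    dictionary = corpora.Dictionary(
        line.lower().split() for line in open('mycorpus.txt')
    )
    corpus = MyCorpus('mycorpus.txt', dictionary)
    for vector in corpus:  # each document is loaded, vectorized, and discarded in turn
        print(vector)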