5.15.2. Core Tutorials
Note
See the demo_python project.
Core Concepts
Key definitions:
1. Document: some text (a string).
2. Corpus: a collection of documents (a list of strings).
3. Vector: a mathematically convenient representation of a document, as a vector of features. The difference between a document and a vector is that the former is text, while the latter is a mathematically convenient representation of that text.
4. Model: an algorithm for transforming vectors from one representation to another.
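A minimal sketch of these four concepts, assuming simple whitespace tokenization (the documents and dictionary here are illustrative, not from the project):
from gensim import corpora

documents = ["human machine interface", "survey of user opinion"]  # a corpus: a list of documents
texts = [doc.split() for doc in documents]                         # tokenize each document
dictionary = corpora.Dictionary(texts)                             # map each token to an integer id
vectors = [dictionary.doc2bow(text) for text in texts]             # each document as a sparse vector
print(vectors)  # e.g. [[(0, 1), (1, 1), (2, 1)], [(3, 1), (4, 1), (5, 1)]]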
Note
Depending on how the representation is obtained, two different documents may have the same vector representation.
When a corpus is too large to load into memory all at once, process it with streaming; see
Corpus Streaming – One Document at a Time
and the related module ``corpus_streaming_tutorial``.
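A minimal sketch of such a streaming corpus, assuming a plain-text file mycorpus.txt (hypothetical) with one document per line:
from gensim import corpora

# build the dictionary with one streamed pass over the file
dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))

class MyCorpus:
    def __iter__(self):
        for line in open('mycorpus.txt'):
            # yield one bag-of-words vector at a time; the full corpus never sits in RAM
            yield dictionary.doc2bow(line.lower().split())

corpus = MyCorpus()  # documents are read lazily, on each iteration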
Corpora and Vector Spaces
Demonstrates transforming text into a vector space representation.
Save a corpus in the Matrix Market format:
from gensim import corpora

corpus = [[(1, 0.5)], []]  # make one document empty, for the heck of it
corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)

# save in other formats
# corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
# corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
# corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)
Load a corpus iterator from a Matrix Market file:
corpus = corpora.MmCorpus('/tmp/corpus.mm')
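The loaded object streams documents from disk rather than holding them in memory; one way to inspect the contents is to iterate over it:
for doc in corpus:
    print(doc)  # each document is a list of (feature_id, weight) pairs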
Topics and Transformations
Gensim implements several popular Vector Space Model algorithms; applying a trained model to documents follows one shared pattern, sketched after this list:
1. Tf-Idf can normalize the resulting vectors to (Euclidean) unit length
from gensim import models
model = models.TfidfModel(corpus, normalize=True)
2. Okapi BM25 is a standard ranking function used by search engines to estimate the relevance of documents to a given search query
model = models.OkapiBM25Model(corpus)
3. LSI (or LSA) transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into a latent space of lower dimensionality
LSI: Latent Semantic Indexing
model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)
Parameters:
num_topics=300: the number of topics the LSI model should produce.
LSI is a topic-modeling method based on Singular Value Decomposition (SVD); it uses dimensionality reduction to represent each document as a vector in the topic space.
A unique feature of LSI training is that it can be continued at any time, simply by providing more training documents:
model.add_documents(another_tfidf_corpus) # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
lsi_vec = model[tfidf_vec] # convert some new document into the LSI space, without affecting the model
model.add_documents(more_documents) # tfidf_corpus + another_tfidf_corpus + more_documents
lsi_vec = model[tfidf_vec]
4. RP aims to reduce vector space dimensionality. It is a very efficient (memory- and CPU-friendly) approach that approximates TfIdf distances between documents by introducing a little randomness
model = models.RpModel(tfidf_corpus, num_topics=500)
5. LDA is another transformation, from bag-of-words counts into a lower-dimensional topic space
LDA is a probabilistic extension of LSA
model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
6. HDP is a non-parametric Bayesian method; unlike the models above, it infers the number of topics from the data
model = models.HdpModel(corpus, id2word=dictionary)
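All of these transformations share one usage pattern: wrapping a corpus with a trained model yields the transformed vectors. A minimal sketch, assuming model is any of the transformations trained above and corpus matches the space it was trained on:
# the transformation is computed lazily, one document at a time
transformed_corpus = model[corpus]
for vec in transformed_corpus:
    print(vec)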
Term Frequency * Inverse Document Frequency, Tf-Idf: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
Okapi Best Matching, Okapi BM25: https://en.wikipedia.org/wiki/Okapi_BM25
Latent Semantic Indexing, LSI (or sometimes LSA): https://en.wikipedia.org/wiki/Latent_semantic_indexing
Random Projections, RP: http://www.cis.hut.fi/ella/publications/randproj_kdd.pdf
Latent Dirichlet Allocation, LDA: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
Hierarchical Dirichlet Process, HDP: http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf
Similarity Queries
Demonstrates querying a corpus for similar documents.
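A minimal sketch of such a query, assuming lsi_corpus is a corpus already transformed into LSI space (as above) and query_vec is a new document converted into that same space:
from gensim import similarities

index = similarities.MatrixSimilarity(lsi_corpus)  # build a dense similarity index in RAM
sims = index[query_vec]  # cosine similarity of the query against every indexed document
print(sorted(enumerate(sims), key=lambda item: -item[1]))  # most similar documents first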