How-to Guides: Solve a Problem
##############################

How to download pre-trained models and corpora
================================================

* Corpora and pre-trained models are stored as release attachments of the ``gensim-data`` project.
* There's no need for you to use this repository directly. Instead, simply install Gensim and use its download API (see the Quickstart below). It will "talk" to this repository automagically.
* When you use the Gensim download API, all data is stored in your ``~/gensim-data`` home folder.

Example: load a pre-trained model (GloVe word vectors)::

    import gensim.downloader as api

    info = api.info()  # show info about available models/datasets
    model = api.load("glove-twitter-25")  # download the model and return as object ready for use
    model.most_similar("cat")

Output (truncated)::

    [(u'dog', 0.9590819478034973), (u'monkey', 0.9203578233718872), ...]

Example: load a corpus and use it to train a Word2Vec model::

    from gensim.models.word2vec import Word2Vec
    import gensim.downloader as api

    corpus = api.load('text8')  # download the corpus and return it opened as an iterable
    model = Word2Vec(corpus)  # train a model from the corpus
    model.most_similar("car")

Output (truncated)::

    [(u'driver', 0.8273754119873047), (u'motorcycle', 0.769528865814209), ...]

Example: only download a dataset and return the local file path (no opening)::

    import gensim.downloader as api

    print(api.load("20-newsgroups", return_path=True))
    # output: /home/user/gensim-data/20-newsgroups/20-newsgroups.gz

    print(api.load("glove-twitter-25", return_path=True))
    # output: /home/user/gensim-data/glove-twitter-25/glove-twitter-25.gz

The same operations are also available from the command line. Each CLI command below is shown with its Python equivalent::

    # show info about a single dataset, same as api.info('text8')
    python -m gensim.downloader --info text8

    # show info about all available models/datasets, same as print(api.info())
    python -m gensim.downloader --info

    # download the text8 corpus to ~/gensim-data/text8/, same as text8_corpus = api.load('text8')
    python -m gensim.downloader --download text8

    # download a pre-trained model to ~/gensim-data/glove-twitter-25/, same as glove_model = api.load('glove-twitter-25')
    python -m gensim.downloader --download glove-twitter-25

How to reproduce the doc2vec ‘Paragraph Vector’ paper
=======================================================

* This guide shows how to use ``Gensim`` to reproduce Section 3.2 of the original ‘Paragraph Vector’ paper (Le & Mikolov, 2014).

Steps (a minimal end-to-end sketch of these steps appears at the end of this page):

1. Load the IMDB dataset.
2. Train a variety of Doc2Vec models on the dataset.
3. Evaluate the performance of each model using a logistic regression.
4. Examine some of the results directly.

The IMDB dataset in brief::

    Size: 84 MB
    Each review is a single line of text containing multiple sentences.

There are 100 thousand reviews in total::

    25k reviews for training (12.5k positive, 12.5k negative)
    25k reviews for testing (12.5k positive, 12.5k negative)
    50k unlabeled reviews

Define a convenient datatype for holding the data of a single document::

    words:     the text of the document, as a list of words
    tags:      used to keep the index of the document in the entire dataset
    split:     one of 'train', 'test' or 'extra'; determines how the document
               will be used (for training, testing, etc.)
    sentiment: either 1 (positive), 0 (negative) or None (unlabeled document)

How to Compare LDA Models
=========================
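One convenient way to compare two trained LDA models in Gensim is the ``diff()`` method of ``LdaModel``, which returns a topic-by-topic distance matrix (plus optional word-level annotations) between two models. The sketch below is only a minimal illustration on a tiny toy corpus; the corpus itself, the ``num_topics``/``passes`` values and the ``jaccard`` distance are illustrative assumptions, not a prescribed setup::

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # tiny toy corpus; in practice use a real corpus such as text8 above
    texts = [
        "human machine interface computer".split(),
        "survey of user response time".split(),
        "graph of trees and minors".split(),
        "graph minors survey".split(),
    ]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # train two models that differ only in their random seed
    lda_a = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=1)
    lda_b = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10, random_state=2)

    # diff() returns (distance_matrix, annotation); entries close to zero mean similar topics
    mdiff, annotation = lda_a.diff(lda_b, distance='jaccard', num_words=10)
    print(mdiff)  # shape: (lda_a.num_topics, lda_b.num_topics)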
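Returning to the ‘Paragraph Vector’ recipe above, here is the minimal end-to-end sketch referenced in its list of steps. The hand-made ``raw`` documents stand in for the real IMDB download and parsing of step 1, and the ``SentimentDocument`` name, the PV-DBOW settings and the gensim 4 ``model.dv`` attribute are illustrative assumptions rather than the exact code of Gensim's own reproduction::

    from collections import namedtuple

    from gensim.models.doc2vec import Doc2Vec
    from sklearn.linear_model import LogisticRegression

    # the datatype described above: words, tags, split, sentiment
    SentimentDocument = namedtuple('SentimentDocument', 'words tags split sentiment')

    # stand-in for step 1 (loading the IMDB dataset), which is omitted here
    raw = [
        ("a gripping and moving film".split(), 'train', 1),
        ("dull plot and wooden acting".split(), 'train', 0),
        ("i loved every minute of it".split(), 'test', 1),
        ("a complete waste of time".split(), 'test', 0),
    ]
    alldocs = [SentimentDocument(words, [i], split, sentiment)
               for i, (words, split, sentiment) in enumerate(raw)]
    train_docs = [d for d in alldocs if d.split == 'train']
    test_docs = [d for d in alldocs if d.split == 'test']

    # step 2: train one Doc2Vec variant (PV-DBOW); the paper compares several variants
    model = Doc2Vec(alldocs, dm=0, vector_size=100, min_count=1, epochs=20)

    # step 3: evaluate with a logistic regression on learned/inferred document vectors
    train_X = [model.dv[doc.tags[0]] for doc in train_docs]
    train_y = [doc.sentiment for doc in train_docs]
    test_X = [model.infer_vector(doc.words) for doc in test_docs]
    test_y = [doc.sentiment for doc in test_docs]

    classifier = LogisticRegression().fit(train_X, train_y)
    print("test accuracy: %.2f" % classifier.score(test_X, test_y))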