Getting started with svd2vec

I - Installation

svd2vec can be installed using pip:

pip install svd2vec

II - Usage

svd2vec exposes an interface similar to Gensim's word2vec implementation. The full documentation can be seen here.

A/ Corpus creation

The corpus parameter (documents) of svd2vec must be a list of documents, where each document is itself a list of words.
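
As a minimal sketch of that shape, the snippet below turns a few raw strings into a list of word lists. The sample texts and the tokenize helper are hypothetical and not part of svd2vec; a real corpus will likely need more careful tokenization.

```python
# Hypothetical raw texts; any iterable of strings would do.
raw_texts = [
    "Anarchism originated as a term of abuse",
    "The quick brown fox jumps over the lazy dog",
]


def tokenize(text):
    """Lowercase a string and split it on whitespace (deliberately naive)."""
    return text.lower().split()


# One list of words per document: the list-of-lists shape described above.
toy_documents = [tokenize(text) for text in raw_texts]
print(toy_documents[0])
```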

In [18]:
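# downloading the word2vec demo corpus (text8) and extracting it locally
import io
import zipfile

import requests

url = "http://mattmahoney.net/dc/text8.zip"
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()  # writes the "text8" file to the current directory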
In [19]:
# loading the word2vec demo corpus as a single document
with open("text8", "r") as f:
    documents = [f.read().split(" ")]

B/ Creation of the vectors

In [1]:
from svd2vec import svd2vec
In [3]:
# showing the first fifteen words of each document
[d[:15] + ['...'] for d in documents]
Out[3]:
[['',
  'anarchism',
  'originated',
  'as',
  'a',
  'term',
  'of',
  'abuse',
  'first',
  'used',
  'against',
  'early',
  'working',
  'class',
  'radicals',
  '...']]
In [46]:
# creating the word representations (can take a while)
svd = svd2vec(documents, window=5, min_count=100, verbose=False)

C/ Similarity and distance

In [47]:
svd.similarity("bad", "good")
Out[47]:
0.5595044997663727
In [48]:
svd.similarity("monday", "friday")
Out[48]:
0.8000593208690482
In [56]:
svd.distance("apollo", "moon")
Out[56]:
0.51619968887672
In [57]:
svd.most_similar(positive=["january"], topn=2)
Out[57]:
[('december', 0.7869627196261781), ('march', 0.7782765534824396)]
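
To make these scores easier to interpret, here is a minimal sketch of how such measures are commonly computed. It assumes, as in Gensim's word2vec, that similarity is the cosine of the angle between two word vectors and that distance is one minus that value; the source does not confirm that svd2vec does exactly this. The function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(u, v)


# Toy 2-dimensional vectors, not real svd2vec output.
vec_a = [3.0, 4.0]
vec_b = [4.0, 3.0]

print(cosine_similarity(vec_a, vec_b))
print(cosine_distance(vec_a, vec_b))
```

Under this convention, identical directions give a similarity of 1 and a distance of 0, while orthogonal vectors give a similarity of 0.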

D/ Analogy

In [51]:
svd.analogy("paris", "france", "berlin")
Out[51]:
[('germany', 0.7240066875926087),
 ('weimar', 0.6371445233683818),
 ('reich', 0.631414594126022),
 ('munich', 0.5917068813628168),
 ('sch', 0.5591401823289636),
 ('brandenburg', 0.5468138153874815),
 ('und', 0.541566598856033),
 ('hermann', 0.5411562914966189),
 ('adolf', 0.5394922186458038),
 ('otto', 0.5391901427839293)]
In [55]:
svd.analogy("road", "cars", "rail", topn=5)
Out[55]:
[('locomotives', 0.7626203484386807),
 ('locomotive', 0.7587259422633467),
 ('trucks', 0.7255470578340787),
 ('trains', 0.717637832883044),
 ('automobiles', 0.6737808582283374)]
In [53]:
svd.analogy("cow", "cows", "pig")
Out[53]:
[('sheep', 0.5829199353965691),
 ('pigs', 0.5629631047865382),
 ('goat', 0.5611478942276642),
 ('eat', 0.5592920869267609),
 ('cats', 0.523851442525088),
 ('goats', 0.5230269418385303),
 ('meat', 0.5202435333205421),
 ('animal', 0.5194570523705068),
 ('fish', 0.5131523388198542),
 ('dogs', 0.5125122379464395)]
In [54]:
svd.analogy("man", "men", "woman")
Out[54]:
[('women', 0.7754647153730071),
 ('couples', 0.6097503266776299),
 ('male', 0.5914266186445117),
 ('sex', 0.5782558939194317),
 ('female', 0.570068551351722),
 ('intercourse', 0.5302306678128059),
 ('heterosexual', 0.5222203608894108),
 ('children', 0.5139059481091136),
 ('lesbian', 0.5132646381911999),
 ('feminism', 0.5027363468750581)]
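
The calls above read as "a is to b as c is to ?". One common way to answer such a query is vector arithmetic: build the target b - a + c and rank the remaining words by cosine similarity to it. The sketch below illustrates that technique on hand-made toy vectors; it is an assumption for illustration, and svd2vec may use a different scoring method internally. The analogy function and toy_vectors dictionary are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, a, b, c, topn=10):
    """Rank words by similarity to (b - a + c), excluding the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    scored = [
        (word, cosine(vec, target))
        for word, vec in vectors.items()
        if word not in (a, b, c)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors with axes (france-ness, germany-ness, capital-ness); not real data.
toy_vectors = {
    "paris": [1.0, 0.0, 1.0],
    "france": [1.0, 0.0, 0.0],
    "berlin": [0.0, 1.0, 1.0],
    "germany": [0.0, 1.0, 0.0],
    "munich": [0.0, 1.0, 0.5],
    "london": [0.0, 0.0, 1.0],
}

toy_result = analogy(toy_vectors, "paris", "france", "berlin", topn=2)
print(toy_result)
```

With these toy vectors the target is exactly the "germany" vector, so "germany" ranks first and "munich" second, mirroring the shape of the real output above.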

E/ Saving and loading vectors

In [12]:
# saving to a binary format
svd.save("svd.binary")
In [13]:
# loading from binary file
loaded = svd2vec.load("svd.binary")
loaded.similarity("bad", "good")
Out[13]:
0.5259838000029272
In [15]:
# saving to a word2vec-like representation
svd.save_word2vec_format("svd.word2vec")
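
For readers who want to inspect such a file without extra dependencies, here is a minimal sketch of a reader for the standard word2vec text format: a header line with the vocabulary size and the vector dimension, followed by one word and its values per line. It assumes that save_word2vec_format writes this text layout, which the source does not state. The read_word2vec_text helper and the sample content are hypothetical.

```python
import io


def read_word2vec_text(handle):
    """Parse word2vec text format: 'vocab_size dim' header, then 'word v1 ... vdim' lines."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header announces %d words, found %d" % (vocab_size, len(vectors)))
    return vectors


# Hypothetical file content, held in memory instead of read from "svd.word2vec".
sample = io.StringIO("2 3\ngood 0.1 0.2 0.3\nbad 0.3 0.2 0.1\n")
sample_vectors = read_word2vec_text(sample)
print(sorted(sample_vectors))
```

If the file does follow this layout, Gensim's KeyedVectors.load_word2vec_format should also be able to read it.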