Getting started with `svd2vec`

I - Installation

svd2vec can be installed using pip:

pip install svd2vec

II - Usage

svd2vec can be used much like Gensim's word2vec implementation. The full documentation can be seen here.

A/ Corpus creation

The corpus (documents) parameter of svd2vec should be a list of documents. Each document should be a list of words representing that document.

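# saving the word2vec demo corpus (text8) locally
import io
import zipfile

import requests

url = "http://mattmahoney.net/dc/text8.zip"
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()  # extracts the archive into the current directory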

# loading the word2vec demo corpus as a single document
with open("text8", "r") as f:
    documents = [f.read().split(" ")]
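When the corpus comes from several separate texts rather than one file, the same list-of-lists shape applies. The following is a minimal sketch, assuming simple lowercase whitespace tokenization; the `raw_texts` sample, the `tokenize` helper and the `my_documents` name are hypothetical and not part of svd2vec.

```python
# Hypothetical sample texts; replace them with your own data.
raw_texts = [
    "Anarchism originated as a term of abuse",
    "The moon landing was part of the Apollo program",
]


def tokenize(text):
    """Lowercase the text and split on any whitespace (no empty tokens)."""
    return text.lower().split()


# One list of words per document: the shape described for the corpus above.
my_documents = [tokenize(text) for text in raw_texts]
print(my_documents[0][:3])
```

Unlike `split(" ")`, calling `split()` without an argument never produces empty strings, so the first word of each document is a real token.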

B/ Creation of the vectors

from svd2vec import svd2vec

# showing the first fifteen words of each document
[d[:15] + ['...'] for d in documents]

[['',
  'anarchism',
  'originated',
  'as',
  'a',
  'term',
  'of',
  'abuse',
  'first',
  'used',
  'against',
  'early',
  'working',
  'class',
  'radicals',
  '...']]

# creating the word representations (can take a while)
svd = svd2vec(documents, window=5, min_count=100, verbose=False)
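In word-embedding tools, `window` conventionally sets how many neighbouring words on each side count as context, and `min_count` drops words that occur fewer times than the threshold. The sketch below illustrates that conventional meaning on a toy document. It is an assumption about the general technique rather than svd2vec's internal code, and `cooccurrences` is a hypothetical helper.

```python
from collections import Counter


def cooccurrences(document, window=2, min_count=1):
    """Count (word, context) pairs that lie within `window` positions of
    each other, after removing words seen fewer than `min_count` times."""
    counts = Counter(document)
    # Rare words are removed first, so the window spans the remaining words.
    kept = [w for w in document if counts[w] >= min_count]
    pairs = Counter()
    for i, word in enumerate(kept):
        start = max(0, i - window)
        stop = min(len(kept), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs[(word, kept[j])] += 1
    return pairs


toy = "the cat sat on the mat the cat ran".split()
pairs = cooccurrences(toy, window=1, min_count=2)
print(pairs.most_common(3))
```

With `min_count=2` only "the" and "cat" survive in the toy document, which is why a high threshold such as 100 is only practical on a large corpus.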

C/ Similarity and distance

svd.similarity("bad", "good")

0.5595044997663727

svd.similarity("monday", "friday")

0.8000593208690482

svd.distance("apollo", "moon")

0.51619968887672

svd.most_similar(positive=["january"], topn=2)

[('december', 0.7869627196261781), ('march', 0.7782765534824396)]
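The page does not state which measure `similarity` and `distance` use. Cosine similarity is the usual choice for word vectors, so the sketch below assumes it, and further assumes that distance is one minus similarity. The three vectors are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Assumed definition: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Made-up 3-dimensional vectors, for illustration only.
monday = [0.9, 0.1, 0.3]
friday = [0.8, 0.2, 0.4]
apollo = [0.1, 0.9, 0.2]

print(cosine_similarity(monday, friday))
print(cosine_distance(monday, apollo))
```

Under this measure, vectors pointing in nearly the same direction score close to 1 regardless of their length.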

D/ Analogy

svd.analogy("paris", "france", "berlin")

[('germany', 0.7240066875926087),
 ('weimar', 0.6371445233683818),
 ('reich', 0.631414594126022),
 ('munich', 0.5917068813628168),
 ('sch', 0.5591401823289636),
 ('brandenburg', 0.5468138153874815),
 ('und', 0.541566598856033),
 ('hermann', 0.5411562914966189),
 ('adolf', 0.5394922186458038),
 ('otto', 0.5391901427839293)]

svd.analogy("road", "cars", "rail", topn=5)

[('locomotives', 0.7626203484386807),
 ('locomotive', 0.7587259422633467),
 ('trucks', 0.7255470578340787),
 ('trains', 0.717637832883044),
 ('automobiles', 0.6737808582283374)]

svd.analogy("cow", "cows", "pig")

[('sheep', 0.5829199353965691),
 ('pigs', 0.5629631047865382),
 ('goat', 0.5611478942276642),
 ('eat', 0.5592920869267609),
 ('cats', 0.523851442525088),
 ('goats', 0.5230269418385303),
 ('meat', 0.5202435333205421),
 ('animal', 0.5194570523705068),
 ('fish', 0.5131523388198542),
 ('dogs', 0.5125122379464395)]

svd.analogy("man", "men", "woman")

[('women', 0.7754647153730071),
 ('couples', 0.6097503266776299),
 ('male', 0.5914266186445117),
 ('sex', 0.5782558939194317),
 ('female', 0.570068551351722),
 ('intercourse', 0.5302306678128059),
 ('heterosexual', 0.5222203608894108),
 ('children', 0.5139059481091136),
 ('lesbian', 0.5132646381911999),
 ('feminism', 0.5027363468750581)]
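Analogy queries of the form "paris is to france as berlin is to ?" are commonly answered with the vector-offset method: rank the remaining words by cosine similarity to `b - a + c`. The sketch below shows that method on made-up vectors. It is an assumption about the general technique, since the page does not say how svd2vec scores analogies, and the `toy_vectors` table and the `analogy` function here are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, a, b, c, topn=2):
    """Answer 'a is to b as c is to ?' by ranking words against b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    scores = [
        (word, cosine(target, vec))
        for word, vec in vectors.items()
        if word not in (a, b, c)  # the query words themselves are excluded
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]


# Made-up vectors: (french-ness, german-ness, capital-ness).
toy_vectors = {
    "paris": [1.0, 0.0, 1.0],
    "france": [1.0, 0.0, 0.0],
    "berlin": [0.0, 1.0, 1.0],
    "germany": [0.0, 1.0, 0.0],
    "munich": [0.0, 0.8, 0.6],
}

result = analogy(toy_vectors, "paris", "france", "berlin")
print(result)
```

Excluding the three query words matters in practice, because the nearest neighbour of `b - a + c` is often one of the inputs.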

E/ Saving and loading vectors

# saving to a binary format
svd.save("svd.binary")

# loading from binary file
loaded = svd2vec.load("svd.binary")
loaded.similarity("bad", "good")

0.5259838000029272

# saving to a word2vec-like representation
svd.save_word2vec_format("svd.word2vec")
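The page does not say whether the exported file is text or binary. Assuming it follows the plain-text word2vec layout (a header line with the vocabulary size and the number of dimensions, then one word followed by its components per line), it could be read back with a few lines of Python. The `read_word2vec_text` helper is hypothetical, and the sample below is made up rather than taken from a real "svd.word2vec" file.

```python
import io


def read_word2vec_text(handle):
    """Parse the plain-text word2vec layout into a {word: [floats]} dict."""
    vocab_size, dimensions = (int(x) for x in handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dimensions + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header and body disagree on the vocabulary size")
    return vectors


# Made-up sample standing in for the contents of an exported file.
sample = io.StringIO("2 3\ngood 0.1 0.2 0.3\nbad 0.3 0.2 0.1\n")
vectors = read_word2vec_text(sample)
print(vectors["good"])
```

If the file does follow that layout, Gensim's `KeyedVectors.load_word2vec_format` should be able to read it as well.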
