Using Gensim with svd2vec output

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.

Gensim can use word2vec to compute similarity (and more!) between words. svd2vec can save its vectors in the word2vec format, which Gensim can then load.

This notebook shows how to use Gensim with vectors learned by svd2vec. We also compare our results with those of the pure word2vec model.


I - Preparation

In [1]:
from svd2vec import svd2vec, FilesIO
from gensim.models import Word2Vec
from gensim.models.keyedvectors import Word2VecKeyedVectors
In [2]:
# Gensim does not expose a direct analogy(a, b, c) method, so we add one here (3CosAdd)
def analogy_keyed(self, a, b, c, topn=10):
    return self.most_similar(positive=[b, c], negative=[a], topn=topn)
Word2VecKeyedVectors.analogy = analogy_keyed
def analogy_w2v(self, a, b, c, topn=10):
    return self.wv.most_similar(positive=[b, c], negative=[a], topn=topn)
Word2Vec.analogy = analogy_w2v
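For reference (this snippet is not part of the original notebook), 3CosAdd answers "a is to b as c is to ?" with the word d maximising cos(d, b) - cos(d, a) + cos(d, c); on unit-normalised vectors, Gensim's most_similar(positive=[b, c], negative=[a]) produces essentially this ranking. A minimal, model-agnostic sketch of the scoring over a plain dict of word vectors:

import numpy as np

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def three_cos_add(vectors, a, b, c):
    # score every other word d by cos(d, b) - cos(d, a) + cos(d, c) and keep the best one
    candidates = (d for d in vectors if d not in (a, b, c))
    return max(candidates, key=lambda d: cos(vectors[d], vectors[b])
                                       - cos(vectors[d], vectors[a])
                                       + cos(vectors[d], vectors[c]))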
In [3]:
documents = FilesIO.load_corpus("text8")
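Here documents is the corpus as a list of tokenised documents (lists of words), the input format both svd2vec and Gensim's Word2Vec accept. A quick sanity check (not part of the original notebook) could be:

print(len(documents), documents[0][:10])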

II - Models construction

SVD with svd2vec

In [4]:
svd2vec_svd = svd2vec(documents, size=300, window=5, min_count=100, verbose=False)

SVD with Gensim from svd2vec

In [5]:
# we first need to export svd2vec_svd to the word2vec format
svd2vec_svd.save_word2vec_format("svd.word2vec")

# we then load the model using Gensim
gensim_svd = Word2VecKeyedVectors.load_word2vec_format("svd.word2vec")
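As a quick check (not part of the original notebook) that the export/import round trip preserved the vectors, one can compare a similarity from both models; the tiny difference comes from the precision of the text format. This sketch assumes "king" and "queen" survived the min_count filter:

import numpy as np
assert np.isclose(svd2vec_svd.similarity("king", "queen"),
                  gensim_svd.similarity("king", "queen"), atol=1e-4)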

word2vec

In [6]:
import os
if not os.path.isfile("w2v.word2vec") or True:  # the `or True` forces retraining even if the file already exists
    # we train the model with the original word2vec tool (needs to be installed);
    # the flags mirror the settings used above (min_count=100, size=300, window=5) so the models stay comparable
    !word2vec -min-count 100 -size 300 -window 5 -train text8 -output w2v.word2vec

# we load it
word2vec_w2v = Word2VecKeyedVectors.load_word2vec_format("w2v.word2vec")
Starting training using file text8
Vocab size: 11816
Words in train file: 15471434
Alpha: 0.000005  Progress: 100.04%  Words/thread/sec: 206.92k  

word2vec with Gensim

In [7]:
gensim_w2v = Word2Vec(documents, size=300, window=5, min_count=100, workers=16)
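Note that this call targets the Gensim 3.x API used throughout this notebook; with Gensim 4.x the parameter is named vector_size instead of size, so the equivalent call would be:

gensim_w2v = Word2Vec(documents, vector_size=300, window=5, min_count=100, workers=16)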

III - Cosine similarity comparison

In [8]:
def compare_similarity(w1, w2):
    print("cosine similarity between", w1, "and", w2, ":")
    print("\tsvd2vec_svd ", svd2vec_svd.similarity(w1, w2))
    print("\tgensim_svd  ", gensim_svd.similarity(w1, w2))
    print("\tgensim_w2v  ", gensim_w2v.wv.similarity(w1, w2))
    print("\tword2vec_w2v", word2vec_w2v.similarity(w1, w2))

def compare_analogy(w1, w2, w3, topn=3):
    
    def analogy_str(model):
        a = model.analogy(w1, w2, w3, topn=topn)
        s = "\n\t\t".join(["{: <20}".format(w) + str(c) for w, c in a])
        return "\n\t\t" + s
    
    print("analogy similaties :", w1, "is to", w2, "as", w3, "is to?")
    print("\tsvd2vec_svd", analogy_str(svd2vec_svd))
    print("\tgensim_svd", analogy_str(gensim_svd))
    print("\tgensim_w2v", analogy_str(gensim_w2v))
    print("\tword2vec_w2v", analogy_str(word2vec_w2v))
In [9]:
compare_similarity("good", "bad")
cosine similarity between good and bad :
	svd2vec_svd  0.16960606494176927
	gensim_svd   0.16960636
	gensim_w2v   0.709178
	word2vec_w2v 0.5649511
In [10]:
compare_similarity("truck", "car")
cosine similarity between truck and car :
	svd2vec_svd  0.15923853102586527
	gensim_svd   0.15923877
	gensim_w2v   0.6416824
	word2vec_w2v 0.54270566
In [11]:
compare_analogy("january", "month", "monday")
analogy similarities : january is to month as monday is to?
	svd2vec_svd 
		friday              0.37471685333642046
		calendar            0.3676012349365715
		calendars           0.35337028413753047
	gensim_svd 
		friday              0.3747172951698303
		calendar            0.36760085821151733
		calendars           0.3533702492713928
	gensim_w2v 
		week                0.6988834142684937
		evening             0.595579981803894
		weekend             0.5807653665542603
	word2vec_w2v 
		week                0.5819252729415894
		weekend             0.45014166831970215
		meal                0.44025975465774536
In [12]:
compare_analogy("paris", "france", "berlin")
analogy similarities : paris is to france as berlin is to?
	svd2vec_svd 
		germany             0.332750582318881
		reich               0.24643302456973284
		himmler             0.24013156257244123
	gensim_svd 
		germany             0.33275020122528076
		reich               0.24643322825431824
		himmler             0.24013197422027588
	gensim_w2v 
		germany             0.7466158866882324
		austria             0.6257748007774353
		hungary             0.6240533590316772
	word2vec_w2v 
		germany             0.5713405609130859
		austria             0.4441128671169281
		poland              0.4427028298377991
In [13]:
compare_analogy("man", "king", "woman")
analogy similarities : man is to king as woman is to?
	svd2vec_svd 
		composite           0.22709743018737916
		ruling              0.22502265780447406
		marry               0.21227657323674393
	gensim_svd 
		composite           0.2270975261926651
		ruling              0.2250228077173233
		marry               0.2122761458158493
	gensim_w2v 
		queen               0.6124744415283203
		throne              0.5374159812927246
		isabella            0.5357133150100708
	word2vec_w2v 
		queen               0.4962505102157593
		isabella            0.4501015841960907
		consort             0.4426434636116028
In [14]:
compare_analogy("road", "cars", "rail")
analogy similarities : road is to cars as rail is to?
	svd2vec_svd 
		locomotives         0.41394615263127865
		diesel              0.3844179279358335
		vehicles            0.3656490174820006
	gensim_svd 
		locomotives         0.4139465391635895
		diesel              0.3844173550605774
		vehicles            0.3656482696533203
	gensim_w2v 
		locomotives         0.7243439555168152
		trucks              0.6970822215080261
		vehicles            0.6947606801986694
	word2vec_w2v 
		locomotives         0.5803698301315308
		trucks              0.5537331104278564
		diesel              0.5369356870651245

IV - Evaluations

In [15]:
def compare_similarity(path, d='\t'):
    # note: this shadows the compare_similarity helper from section III;
    # the indexing differs below because svd2vec and Gensim return differently shaped results
    print("pearson correlation of", os.path.basename(path))
    print("\tsvd2vec_svd   ", svd2vec_svd.evaluate_word_pairs(path,   delimiter=d)[0])
    print("\tgensim_svd    ", gensim_svd.evaluate_word_pairs(path,    delimiter=d)[0][0])
    print("\tgensim_w2v    ", gensim_w2v.wv.evaluate_word_pairs(path, delimiter=d)[0][0])
    print("\tword2vec_w2v  ", word2vec_w2v.evaluate_word_pairs(path,  delimiter=d)[0][0])
    print("")
In [16]:
compare_similarity(FilesIO.path('similarities/wordsim353.txt'))
compare_similarity(FilesIO.path('similarities/men_dataset.txt'))
compare_similarity(FilesIO.path('similarities/mturk.txt'))
compare_similarity(FilesIO.path('similarities/simlex999.txt'))
compare_similarity(FilesIO.path('similarities/rarewords.txt'))
pearson correlation of wordsim353.txt
	svd2vec_svd    0.5323995150227655
	gensim_svd     0.5515552747176535
	gensim_w2v     0.6464943909892069
	word2vec_w2v   0.670949787953047

pearson correlation of men_dataset.txt
	svd2vec_svd    0.616499909184789
	gensim_svd     0.6164999243162618
	gensim_w2v     0.6185734102715774
	word2vec_w2v   0.6550970460650091

pearson correlation of mturk.txt
	svd2vec_svd    0.5294242885931566
	gensim_svd     0.5294241677182177
	gensim_w2v     0.6538684259695825
	word2vec_w2v   0.6797273248466004

pearson correlation of simlex999.txt
	svd2vec_svd    0.1641437066750331
	gensim_svd     0.16414389519128816
	gensim_w2v     0.2712333067742968
	word2vec_w2v   0.3015538925618266

pearson correlation of rarewords.txt
	svd2vec_svd    0.3206575342617015
	gensim_svd     0.3206569780059028
	gensim_w2v     0.4074306702489544
	word2vec_w2v   0.44227213735273435

In [17]:
def compare_analogy(path):
    # note: this shadows the compare_analogy helper from section III;
    # as above, svd2vec returns the score directly while Gensim returns it as the first element
    print("analogies success rate of", os.path.basename(path))
    print("\tsvd2vec_svd   ", svd2vec_svd.evaluate_word_analogies(path))
    print("\tgensim_svd    ", gensim_svd.evaluate_word_analogies(path)[0])
    print("\tgensim_w2v    ", gensim_w2v.wv.evaluate_word_analogies(path)[0])
    print("\tword2vec_w2v  ", word2vec_w2v.evaluate_word_analogies(path)[0])
In [18]:
compare_analogy(FilesIO.path('analogies/questions-words.txt'))
compare_analogy(FilesIO.path('analogies/msr.txt'))
analogies success rate of questions-words.txt
	svd2vec_svd    0.18744727518137339
	gensim_svd     0.18744727518137339
	gensim_w2v     0.5026151510038805
	word2vec_w2v   0.558798717732411
analogies success rate of msr.txt
	svd2vec_svd    0.04246344206974128
	gensim_svd     0.04246344206974128
	gensim_w2v     0.4873453318335208
	word2vec_w2v   0.5444319460067492