svd2vec documentation
- 
class svd2vec.svd2vec(documents, size=150, min_count=2, window=10, dyn_window_weight=1, cds_alpha=0.75, neg_k_shift=5, eig_p_weight=0, nrm_type='row', sub_threshold=1e-05, verbose=False, workers=-1)
- The representation of the documents' words in a vector format. - Parameters
- documents (list of list of string) – The list of documents, each document being a list of words 
- size (int) – Maximum number of extracted features for each word 
- min_count (int) – Minimum number of occurrences a word needs to be included in the model 
- window (int or tuple of ints) – Size of the context window, in words. An int n is equivalent to the symmetric tuple (n, n). 
- dyn_window_weight (WINDOW_WEIGHT_HARMONIC or WINDOW_WEIGHT_WORD2VEC) – The window weighting scheme. 
- cds_alpha (float) – The context distribution smoothing constant that smooths the context frequency 
- neg_k_shift (int) – The shift k of the shifted PMI: log(k) is subtracted from each PMI value 
- eig_p_weight (float) – The eigenvalue weighting applied to the eigenvalue matrix 
- nrm_type (string) – The normalization scheme to use with the L2 normalization; one of NRM_SCHEMES 
- sub_threshold (float) – A threshold for subsampling (diluting very frequent words). Higher values mean fewer words are removed. 
- verbose (bool) – If True, displays progress during the init step 
- workers (int) – The number of workers to use in parallel (should not exceed the number of cores available on the computer) 
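How cds_alpha and neg_k_shift usually enter a count-based model can be sketched as follows. This is a minimal illustration of shifted positive PMI with context distribution smoothing, not the library's own code; the function name shifted_ppmi and the toy counts are assumptions made for the example.

```python
import math
from collections import Counter


def shifted_ppmi(pair_counts, cds_alpha=0.75, neg_k_shift=5):
    """Return {(word, context): shifted positive PMI} from raw co-occurrence counts."""
    word_counts = Counter()
    context_counts = Counter()
    for (word, context), n in pair_counts.items():
        word_counts[word] += n
        context_counts[context] += n
    total = sum(pair_counts.values())

    # Context distribution smoothing: raise each context count to the power alpha
    smoothed = {c: n ** cds_alpha for c, n in context_counts.items()}
    smoothed_total = sum(smoothed.values())

    result = {}
    for (word, context), n in pair_counts.items():
        p_pair = n / total
        p_word = word_counts[word] / total
        p_context = smoothed[context] / smoothed_total
        pmi = math.log(p_pair / (p_word * p_context))
        # Shift by log(k), then clip negative values to zero
        result[(word, context)] = max(pmi - math.log(neg_k_shift), 0.0)
    return result


# Hypothetical toy counts: (word, context) -> number of co-occurrences
toy_counts = {("a", "x"): 8, ("a", "y"): 2, ("b", "x"): 2, ("b", "y"): 8}
print(shifted_ppmi(toy_counts, cds_alpha=0.75, neg_k_shift=1))
```

A larger neg_k_shift pushes more cells to zero, so only strongly associated pairs keep a non-zero value.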
 
 - 
WINDOW_WEIGHT_HARMONIC = 0
- The harmonic weighting scheme for context words (1/5, 1/4, 1/3, 1/2, …) 
 - 
WINDOW_WEIGHT_WORD2VEC = 1
- The word2vec weighting scheme for context words (1/5, 2/5, 3/5, 4/5, …) 
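The two sequences above can be reproduced for a window of 5 with the short sketch below. The function names are hypothetical, and the formulas (1/d for harmonic, (window - d + 1)/window for word2vec, with d the distance to the target word) are inferred from the listed values rather than taken from the library's code.

```python
from fractions import Fraction


def harmonic_weights(window):
    """Weight 1/d for a context word at distance d, listed from farthest to nearest."""
    return [Fraction(1, d) for d in range(window, 0, -1)]


def word2vec_weights(window):
    """Weight (window - d + 1)/window at distance d, listed from farthest to nearest."""
    return [Fraction(window - d + 1, window) for d in range(window, 0, -1)]


print(", ".join(str(w) for w in harmonic_weights(5)))
print(", ".join(str(w) for w in word2vec_weights(5)))
```

Both schemes give the adjacent word a weight of 1; they differ in how quickly the weight decays with distance.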
 - 
NRM_SCHEMES = ['none', 'row', 'column', 'both']
- Available normalization schemes 
 - 
save(path)
- Saves the svd2vec object to the given path. - Parameters
- path (string) – The file path to write the object to. The directories should exist. 
 
 - 
load(path)
- Loads a previously saved svd2vec object from a path. - Parameters
- path (string) – The file path to load the object from. 
- Returns
- A new svd2vec object 
- Return type
- svd2vec 
 - 
save_word2vec_format(path)
- Saves the word vectors to a path using the same format as word2vec. The file can then be used by other modules or libraries able to load word2vec vectors. - Parameters
- path (string) – The file path to write the vectors to. The directories should exist. 
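For orientation, the word2vec text layout consists of a header line with the vocabulary size and the vector dimension, followed by one line per word. The sketch below writes that layout to an in-memory buffer; write_word2vec_text is a hypothetical helper, and whether the library writes the text or the binary variant is not stated here.

```python
import io


def write_word2vec_text(vectors, stream):
    """Write a {word: [floats]} mapping in the word2vec text layout."""
    dims = {len(v) for v in vectors.values()}
    if len(dims) != 1:
        raise ValueError("all vectors must have the same dimension")
    # Header line: vocabulary size, then vector dimension
    stream.write(f"{len(vectors)} {dims.pop()}\n")
    for word, vector in vectors.items():
        values = " ".join(f"{x:.6f}" for x in vector)
        stream.write(f"{word} {values}\n")


buffer = io.StringIO()
write_word2vec_text({"king": [0.1, 0.2], "queen": [0.3, 0.4]}, buffer)
print(buffer.getvalue())
```

Files in this text layout can typically be read by other tools, for example gensim's KeyedVectors.load_word2vec_format with binary=False.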
 
 - 
similarity(x, y)
- Computes and returns the cosine similarity of the two given words. - Parameters
- x (string) – The first word of the pair 
- y (string) – The second word of the pair 
 
- Returns
- The cosine similarity between the two words 
- Return type
- float 
 - Warning - The two words x and y should have been trained during the initialization step.
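As a reminder of what the returned value measures, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below works on plain lists; the function name and the sample vectors are illustrative assumptions, not part of the library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
```

The value lies between -1 and 1 and depends only on the direction of the vectors, not on their length.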
 - 
distance(x, y)
- Computes and returns the cosine distance of the two given words. - Parameters
- x (string) – The first word of the pair 
- y (string) – The second word of the pair 
 
- Returns
- The cosine distance between the two words 
- Return type
- float 
- Raises
- ValueError – If either x or y has not been trained during the initialization step. 
 - Warning - The two words x and y should have been trained during the initialization step.
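The relationship between distance and similarity, and the documented ValueError for unknown words, can be sketched with a toy stand-in. ToyModel is hypothetical, and defining the cosine distance as 1 minus the cosine similarity is the usual convention, assumed here rather than confirmed for this library.

```python
import math


class ToyModel:
    """Hypothetical stand-in for a trained model: a plain word -> vector mapping."""

    def __init__(self, vectors):
        self.vectors = vectors

    def _lookup(self, word):
        if word not in self.vectors:
            raise ValueError(f"word not in the model: {word!r}")
        return self.vectors[word]

    def similarity(self, x, y):
        u, v = self._lookup(x), self._lookup(y)
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norms

    def distance(self, x, y):
        # Assumed convention: cosine distance = 1 - cosine similarity
        return 1.0 - self.similarity(x, y)


model = ToyModel({"cat": [1.0, 0.0], "car": [0.0, 1.0]})
print(model.distance("cat", "car"))
try:
    model.distance("cat", "bus")
except ValueError as error:
    print("rejected:", error)
```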
 - 
most_similar(positive=[], negative=[], topn=10)
- Computes and returns the most similar words from those given in positive and negative. - Parameters
- positive (list of string or string) – Each word in positive will contribute positively to the output words. A single word can also be passed to compute its most similar words. 
- negative (list of string) – Each word in negative will contribute negatively to the output words 
- topn (int) – Number of similar words to output 
 
- Returns
- Each tuple is a similar word with its similarity to the given input. 
- Return type
- list of (word, similarity)
- Raises
- ValueError – If both positive and negative are empty 
- ValueError – If some words have not been trained during the initialization step. 
 
 - Warning - The input words should have been trained during the initialization step. 
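One common way to combine positive and negative words is to add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining vocabulary by cosine similarity to the result. The sketch below follows that approach on a plain dictionary; it is an assumption about the general technique, not a description of the library's internals, and it does not guard against a query that sums to the zero vector.

```python
import math


def _unit(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def most_similar(vectors, positive=(), negative=(), topn=10):
    """Rank words of a {word: vector} mapping against a combined query."""
    if isinstance(positive, str):
        positive = [positive]          # a single word is accepted as well
    if not positive and not negative:
        raise ValueError("positive and negative are both empty")
    inputs = list(positive) + list(negative)
    missing = [w for w in inputs if w not in vectors]
    if missing:
        raise ValueError(f"words not in the model: {missing}")

    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for sign, words in ((1.0, positive), (-1.0, negative)):
        for word in words:
            for i, x in enumerate(_unit(vectors[word])):
                query[i] += sign * x
    query = _unit(query)

    # Score every word that is not part of the input
    scores = [
        (word, sum(a * b for a, b in zip(query, _unit(vector))))
        for word, vector in vectors.items()
        if word not in inputs
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


toy = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.1, 1.0]}
print(most_similar(toy, positive="cat", topn=2))
```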
 - 
analogy(exampleA, answerA, exampleB, topn=10)
- Returns the topn most probable answers to the analogy question “exampleA is to answerA as exampleB is to ?” - Parameters
- exampleA (string) – The first word to “train” the analogy on 
- answerA (string) – The second word to “train” the analogy on 
- exampleB (string) – The word for which the analogous answer is sought 
 
- Returns
- Each (word, similarity) tuple is a probable answer to the analogy 
- Return type
- list of (word, similarity) 
- Raises
- ValueError – If some words have not been trained during the initialization step. 
 - Warning - The three input words should have been trained during the initialization step. 
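The analogy question is often answered with vector arithmetic: take answerA - exampleA + exampleB and look for the nearest remaining word. The sketch below uses that additive formulation on hand-made two-dimensional vectors; both the formulation and the toy vectors are assumptions for illustration, since the page does not state how the library solves analogies.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, exampleA, answerA, exampleB, topn=10):
    """Answer 'exampleA is to answerA as exampleB is to ?' on a {word: vector} mapping."""
    inputs = (exampleA, answerA, exampleB)
    for word in inputs:
        if word not in vectors:
            raise ValueError(f"word not in the model: {word!r}")
    # Additive target: answerA - exampleA + exampleB
    target = [
        a - ea + eb
        for ea, a, eb in zip(vectors[exampleA], vectors[answerA], vectors[exampleB])
    ]
    candidates = [
        (word, cosine(target, vector))
        for word, vector in vectors.items()
        if word not in inputs
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:topn]


# Hand-made vectors: first axis loosely "royalty", second axis loosely "gender"
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.2, 1.0],
    "woman": [0.2, -1.0],
    "apple": [-1.0, 0.1],
}
print(analogy(toy, "man", "king", "woman", topn=1))
```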
 - 
evaluate_word_pairs(pairs, delimiter='\t')
- Evaluates the model similarity using a pairs file of human judgments of similarities. - Parameters
- pairs (string) – The path of a CSV file. Lines starting with ‘#’ are ignored. The first and second columns are the words. The third column is the human-made similarity. 
- delimiter (string) – The delimiter of the csv file 
 
- Returns
- The first value is the Pearson correlation coefficient between the model similarities and the human judgments (1.0 means the model agrees perfectly with the human judgments, 0.0 means there is no linear correlation). The second value is the two-tailed p-value. 
- Return type
- tuple
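The file layout and the first returned value can be illustrated with the sketch below, which parses pairs lines and computes a Pearson coefficient by hand. parse_pairs and pearson are hypothetical helpers, the sample lines and scores are made up for the example, and the p-value is left out because it needs a statistics library.

```python
import csv
import math


def parse_pairs(lines, delimiter="\t"):
    """Return (word1, word2, human_score) tuples, skipping lines starting with '#'."""
    kept = (line for line in lines if not line.startswith("#"))
    rows = []
    for row in csv.reader(kept, delimiter=delimiter):
        if len(row) >= 3:
            rows.append((row[0], row[1], float(row[2])))
    return rows


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


sample = ["# word1\tword2\tscore\n", "cat\tdog\t7.5\n", "cat\tcar\t1.0\n", "dog\tcar\t1.5\n"]
human = [score for _, _, score in parse_pairs(sample)]
model_scores = [0.9, 0.2, 0.3]   # made-up model similarities for the three pairs
print(pearson(model_scores, human))
```

With a real file, an open file object can be passed as lines. When SciPy is available, scipy.stats.pearsonr returns the coefficient together with a two-tailed p-value.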