svd2vec documentation

class svd2vec.svd2vec(documents, size=150, min_count=2, window=10, dyn_window_weight=1, cds_alpha=0.75, neg_k_shift=5, eig_p_weight=0, nrm_type='row', sub_threshold=1e-05, verbose=False, workers=-1)

A vector representation of the words found in the documents.

Parameters

documents (list of list of string) – The list of documents, each document being a list of words
size (int) – Maximum number of features extracted for each word
min_count (int) – Minimum number of occurrences a word needs to be included in the model
window (int or tuple of ints) – Number of words on each side of a word that count as its context. A single int is equivalent to the symmetric tuple (int, int).
dyn_window_weight (WINDOW_WEIGHT_HARMONIC or WINDOW_WEIGHT_WORD2VEC) – The window weighting scheme.
cds_alpha (float) – The context distribution smoothing constant that smooths the context frequency
neg_k_shift (int) – The constant whose negative log is used to shift the PMI values
eig_p_weight (float) – The eigenvalue weighting applied to the eigenvalue matrix
nrm_type (string) – The scheme used for the L2 normalization (one of NRM_SCHEMES)
sub_threshold (float) – A threshold for subsampling (diluting very frequent words). A higher value means fewer words are removed.
verbose (bool) – If True, displays progress during the init step
workers (int) – The number of workers to use in parallel (should not exceed the number of cores available on the computer)
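
To illustrate what cds_alpha controls, here is a minimal sketch of context distribution smoothing: each context count is raised to the power cds_alpha before the counts are normalized into probabilities. This is not svd2vec's internal code; the function name and the toy counts are hypothetical.

```python
from collections import Counter


def smoothed_context_distribution(context_counts, cds_alpha=0.75):
    """Return count(c)**alpha / sum(count(c')**alpha) for each context c."""
    powered = {c: n ** cds_alpha for c, n in context_counts.items()}
    total = sum(powered.values())
    return {c: value / total for c, value in powered.items()}


# Toy counts (hypothetical): one very frequent and one rare context word.
counts = Counter({"the": 16, "cat": 1})
total_count = sum(counts.values())

# Unsmoothed distribution, for comparison.
raw = {c: n / total_count for c, n in counts.items()}
smoothed = smoothed_context_distribution(counts, cds_alpha=0.75)
```

With an exponent below 1, the rare context "cat" receives a larger share of the probability mass than it does in the raw distribution, which is the point of the smoothing.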

WINDOW_WEIGHT_HARMONIC = 0: The harmonic weighting scheme for context words (1/5, 1/4, 1/3, 1/2, …)

WINDOW_WEIGHT_WORD2VEC = 1: The word2vec weighting scheme for context words (1/5, 2/5, 3/5, 4/5, …)
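
The two sequences above can be reproduced with a short sketch. It assumes that the weights are listed from the farthest context word to the closest one and that the window size is 5; the window_weights helper is hypothetical and is not part of svd2vec. The two constants are redefined locally with the values documented above.

```python
from fractions import Fraction

# Local copies of the documented constants.
WINDOW_WEIGHT_HARMONIC = 0
WINDOW_WEIGHT_WORD2VEC = 1


def window_weights(window, scheme):
    """Weights of the context words, ordered from farthest to closest."""
    weights = []
    for distance in range(window, 0, -1):
        if scheme == WINDOW_WEIGHT_HARMONIC:
            # Harmonic: the weight is the inverse of the distance.
            weights.append(Fraction(1, distance))
        elif scheme == WINDOW_WEIGHT_WORD2VEC:
            # word2vec: the weight decreases linearly with the distance.
            weights.append(Fraction(window - distance + 1, window))
        else:
            raise ValueError(f"unknown weighting scheme: {scheme}")
    return weights


harmonic = window_weights(5, WINDOW_WEIGHT_HARMONIC)
word2vec = window_weights(5, WINDOW_WEIGHT_WORD2VEC)
```

Both schemes give the closest word a weight of 1. They differ in how quickly the weight falls off as the distance grows.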

NRM_SCHEMES = ['none', 'row', 'column', 'both']: Available normalization schemes

save(path)

Saves the svd2vec object to the given path.

Parameters: path (string) – The file path to write the object to. The parent directories must already exist.

load(path)

Loads a previously saved svd2vec object from a path.

Parameters: path (string) – The file path to load the object from.
Returns: A new svd2vec object
Return type: svd2vec

save_word2vec_format(path)

Saves the word vectors to a path using the same format as word2vec. The file can then be used by other modules or libraries able to load word2vec vectors.

Parameters: path (string) – The file path to write the vectors to. The parent directories must already exist.
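
For orientation, the sketch below writes toy vectors in the plain-text variant of the word2vec format: a header line with the vocabulary size and the vector dimension, then one line per word. Whether svd2vec writes exactly this variant, and with what number formatting, is an assumption. The write_word2vec_text helper and the toy vectors are hypothetical.

```python
import io


def write_word2vec_text(vectors, stream):
    """Write a {word: [floats]} mapping in the word2vec text layout."""
    dimension = len(next(iter(vectors.values()))) if vectors else 0
    # Header line: "<vocabulary size> <vector dimension>".
    stream.write(f"{len(vectors)} {dimension}\n")
    for word, vector in vectors.items():
        values = " ".join(f"{value:.6f}" for value in vector)
        stream.write(f"{word} {values}\n")


toy_vectors = {"cat": [0.1, 0.2, 0.3], "dog": [0.4, 0.5, 0.6]}

# An in-memory buffer stands in for a real file opened in text mode.
buffer = io.StringIO()
write_word2vec_text(toy_vectors, buffer)
lines = buffer.getvalue().splitlines()
```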

similarity(x, y)

Computes and returns the cosine similarity of the two given words.

Parameters

x (string) – The first word of the pair to compare
y (string) – The second word of the pair to compare

Returns

The cosine similarity between the two words

Return type

float

Warning

The two words x and y should have been trained during the initialization step.
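
As a reminder of what the returned value measures, here is a minimal sketch of cosine similarity on plain Python lists. It operates on raw vectors rather than on words, and the cosine_similarity helper is hypothetical rather than part of svd2vec.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Vectors pointing in the same direction have a similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Orthogonal vectors have a similarity of 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```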

distance(x, y)

Computes and returns the cosine distance of the two given words.

Parameters

x (string) – The first word of the pair to compare
y (string) – The second word of the pair to compare

Returns

The cosine distance between the two words

Return type

float

Raises

ValueError – If either x or y has not been trained during the initialization step.

Warning

The two words x and y should have been trained during the initialization step.
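
The sketch below assumes the usual definition of cosine distance, one minus the cosine similarity, and mimics the documented ValueError for words outside the vocabulary. The toy vocabulary and the cosine_distance helper are hypothetical and only stand in for a trained model.

```python
import math

# Hypothetical toy vocabulary standing in for a trained model.
toy_vectors = {"cat": [1.0, 0.0], "dog": [1.0, 1.0]}


def cosine_distance(x, y, vectors=toy_vectors):
    """Return 1 - cosine similarity of the vectors of words x and y."""
    for word in (x, y):
        if word not in vectors:
            raise ValueError(f"word not in vocabulary: {word!r}")
    u, v = vectors[x], vectors[y]
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


cat_dog = cosine_distance("cat", "dog")

# An unknown word raises ValueError instead of returning a value.
try:
    cosine_distance("cat", "bird")
    unknown_word_raised = False
except ValueError:
    unknown_word_raised = True
```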

most_similar(positive=[], negative=[], topn=10)

Computes and returns the words most similar to those given in positive and negative.

Parameters

positive (list of string or string) – Each word in positive will contribute positively to the output words. A single word can also be passed to compute its most similar words.
negative (list of string) – Each word in negative will contribute negatively to the output words
topn (int) – Number of similar words to output

Returns

Each tuple is a similar word with its similarity to the given words.

Return type

list of (word, similarity)

Raises

ValueError – If no input is given in either positive or negative
ValueError – If some words have not been trained during the initialization step.

Warning

The input words should have been trained during the initialization step.
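
One common way to combine positive and negative words is to add the positive vectors, subtract the negative ones, and rank the remaining vocabulary by cosine similarity to the result. The sketch below follows that approach on hypothetical toy vectors. It is an assumption about the general technique, not a description of svd2vec's internals, and it does not handle a combined vector of zero length.

```python
import math

# Hypothetical toy vectors; the dimensions loosely mean (royal, male, female).
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def most_similar(positive=(), negative=(), topn=10, vectors=toy_vectors):
    # Accept a single word as well as a list of words.
    positive = [positive] if isinstance(positive, str) else list(positive)
    negative = [negative] if isinstance(negative, str) else list(negative)
    if not positive and not negative:
        raise ValueError("no input words given")
    unknown = [w for w in positive + negative if w not in vectors]
    if unknown:
        raise ValueError(f"words not in vocabulary: {unknown}")

    # Add the positive vectors and subtract the negative ones.
    dimension = len(next(iter(vectors.values())))
    target = [0.0] * dimension
    for sign, words in ((1.0, positive), (-1.0, negative)):
        for word in words:
            for i, value in enumerate(vectors[word]):
                target[i] += sign * value

    # Rank every other word by its cosine similarity to the combined vector.
    inputs = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in inputs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


neighbours = most_similar(positive="king", topn=2)
```

The input words themselves are excluded from the result, so a query word never appears as its own nearest neighbour.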

analogy(exampleA, answerA, exampleB, topn=10)

Returns the topn most probable answers to the analogy question “exampleA is to answerA as exampleB is to ?”

Parameters

exampleA (string) – The first word of the example pair that defines the analogy
answerA (string) – The second word of the example pair that defines the analogy
exampleB (string) – The word for which an answer is requested

Returns

Each (word, similarity) pair is a probable answer to the analogy

Return type

list of (word, similarity)

Raises

ValueError – If some words have not been trained during the initialization step.

Warning

The three input words should have been trained during the initialization step.
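
A classic way to answer such a question is vector arithmetic: compute answerA - exampleA + exampleB and rank the other words by cosine similarity to that vector. The sketch below applies this to hypothetical toy vectors. Whether svd2vec uses exactly this formulation is an assumption, and the local analogy function only mirrors the documented signature.

```python
import math

# Hypothetical toy vectors; the dimensions loosely mean (royal, male, female, fruit).
toy_vectors = {
    "king": [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "princess": [0.8, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(exampleA, answerA, exampleB, topn=10, vectors=toy_vectors):
    for word in (exampleA, answerA, exampleB):
        if word not in vectors:
            raise ValueError(f"word not in vocabulary: {word!r}")
    # Offset between the example pair, applied to exampleB.
    target = [
        a - e + b
        for e, a, b in zip(vectors[exampleA], vectors[answerA], vectors[exampleB])
    ]
    inputs = {exampleA, answerA, exampleB}
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in inputs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# "man is to king as woman is to ?"
answers = analogy("man", "king", "woman")
```

On these toy vectors the top answer is "queen", followed by "princess".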

evaluate_word_pairs(pairs, delimiter='\t')

Evaluates the model similarity using a pairs file of human judgments of similarities.

Parameters

pairs (string) – The path of a CSV file. Lines starting with ‘#’ are ignored. The first and second columns are the words. The third column is the human-assigned similarity.
delimiter (string) – The delimiter of the csv file

Returns

The first value is the Pearson correlation coefficient (1.0 means the model agrees perfectly with the human judgments, 0.0 means there is no linear correlation). The second value is the two-tailed p-value.

Return type

tuple
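
The sketch below shows the file layout described above and how a Pearson coefficient can be computed from it. It reads a tab-delimited pairs file, skips the lines starting with ‘#’, and correlates the human scores with model scores. The file content and the model scores are made up for illustration, the helper names are hypothetical, and the p-value is not computed here (a statistics library such as SciPy, via scipy.stats.pearsonr, returns both values).

```python
import csv
import io
import math


def read_pairs(stream, delimiter="\t"):
    """Return (word1, word2, human_score) tuples, skipping '#' lines."""
    rows = []
    for row in csv.reader(stream, delimiter=delimiter):
        if not row or row[0].startswith("#"):
            continue
        rows.append((row[0], row[1], float(row[2])))
    return rows


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# In-memory stand-in for a pairs file (made-up content).
pairs_file = io.StringIO(
    "# word1\tword2\thuman score\n"
    "cat\tdog\t8.0\n"
    "cat\tcar\t2.0\n"
    "car\ttruck\t9.0\n"
)
pairs = read_pairs(pairs_file)
human_scores = [score for _, _, score in pairs]

# Placeholder model similarities, one per pair (made up for illustration).
model_scores = [0.8, 0.2, 0.9]
coefficient = pearson(human_scores, model_scores)
```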