svd2vec documentation
-
class svd2vec.svd2vec(documents, size=150, min_count=2, window=10, dyn_window_weight=1, cds_alpha=0.75, neg_k_shift=5, eig_p_weight=0, nrm_type='row', sub_threshold=1e-05, verbose=False, workers=-1)
The representation of the documents' words in a vector format.
- Parameters
documents (list of list of string) – The list of documents, each document being a list of words
size (int) – Maximum number of features extracted for each word
min_count (int) – Minimum number of occurrences a word needs to be included in the model
window (int or tuple of ints) – Window size, in words, used to collect the context of each word. An int n is equivalent to the symmetric tuple (n, n).
dyn_window_weight (WINDOW_WEIGHT_HARMONIC or WINDOW_WEIGHT_WORD2VEC) – The window weighting scheme.
cds_alpha (float) – The context distribution smoothing constant that smooths the context frequency
neg_k_shift (int) – The negative PMI log shifting
eig_p_weight (float) – The eigenvalue weighting applied to the eigenvalue matrix
nrm_type (string) – The normalization scheme to use with the L2 normalization; one of NRM_SCHEMES
sub_threshold (float) – A threshold for subsampling (diluting very frequent words). A higher value means fewer words are removed.
verbose (bool) – If True, displays progress during the init step
workers (int) – The number of workers to use in parallel (should not exceed the number of cores available on the computer)
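The roles of cds_alpha and neg_k_shift can be illustrated with a shifted positive PMI computation over (word, context) pairs. This is a minimal sketch of the commonly used formulation (context counts raised to the power alpha, then log(k) subtracted and negative values clipped to zero). It is not taken from the svd2vec source, and the function name `shifted_ppmi` is hypothetical.

```python
import math
from collections import Counter


def shifted_ppmi(pairs, cds_alpha=0.75, neg_k_shift=5):
    """Return {(word, context): score} for a list of (word, context) pairs.

    Illustrative formulation only; svd2vec's internals may differ.
    """
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    context_counts = Counter(c for _, c in pairs)
    total = len(pairs)

    # Context distribution smoothing: raise each context count to the power alpha.
    smoothed = {c: n ** cds_alpha for c, n in context_counts.items()}
    smoothed_total = sum(smoothed.values())

    scores = {}
    for (w, c), n in pair_counts.items():
        p_wc = n / total
        p_w = word_counts[w] / total
        p_c = smoothed[c] / smoothed_total
        pmi = math.log(p_wc / (p_w * p_c))
        # Shift by log(k), then clip negative values to zero.
        scores[(w, c)] = max(pmi - math.log(neg_k_shift), 0.0)
    return scores


toy_pairs = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
plain = shifted_ppmi(toy_pairs, cds_alpha=1.0, neg_k_shift=1)
shifted = shifted_ppmi(toy_pairs, cds_alpha=0.75, neg_k_shift=5)
```

With neg_k_shift=1 and cds_alpha=1.0 the sketch reduces to plain positive PMI. A larger shift pushes weak associations down to zero.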
-
WINDOW_WEIGHT_HARMONIC = 0
The harmonic weighting scheme for context words (1/5, 1/4, 1/3, 1/2, …)
-
WINDOW_WEIGHT_WORD2VEC = 1
The word2vec weighting scheme for context words (1/5, 2/5, 3/5, 4/5, …)
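The two sequences above can be reproduced with a short sketch. It assumes a window of 5 and that the weights are listed from the farthest context word to the nearest; both are readings of the descriptions, not confirmed behaviour. The helper names are hypothetical.

```python
from fractions import Fraction


def harmonic_weight(distance, window):
    """Weight 1/distance; `window` is unused and kept only for a uniform signature."""
    return Fraction(1, distance)


def word2vec_weight(distance, window):
    """Weight (window - distance + 1) / window: linear decay from 1 down to 1/window."""
    return Fraction(window - distance + 1, window)


window = 5
# Distances run from the farthest context word (5) to the nearest (1).
harmonic = [harmonic_weight(d, window) for d in range(window, 0, -1)]
word2vec = [word2vec_weight(d, window) for d in range(window, 0, -1)]
```

Under these assumptions both schemes give the adjacent word a weight of 1 and the farthest word a weight of 1/5. They differ in how quickly the weight falls off in between.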
-
NRM_SCHEMES = ['none', 'row', 'column', 'both']
Available normalization schemes
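As an illustration of what the 'row' and 'column' schemes could mean, the sketch below applies L2 normalization to the rows or to the columns of a small matrix stored as a list of lists. The function names are hypothetical, and how svd2vec combines the two for 'both' is not specified here, so that case is left out.

```python
import math


def l2_normalize_rows(matrix):
    """Scale each row to unit L2 norm (all-zero rows are left as zeros)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / norm if norm else 0.0 for v in row])
    return out


def l2_normalize_columns(matrix):
    """Scale each column to unit L2 norm by normalizing the rows of the transpose."""
    transposed = [list(col) for col in zip(*matrix)]
    normalized = l2_normalize_rows(transposed)
    return [list(row) for row in zip(*normalized)]


m = [[3.0, 4.0], [0.0, 2.0]]
by_row = l2_normalize_rows(m)
by_column = l2_normalize_columns(m)
```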
-
save(path)
Saves the svd2vec object to the given path.
- Parameters
path (string) – The file path to write the object to. The directories must already exist.
-
load(path)
Loads a previously saved svd2vec object from a path.
- Parameters
path (string) – The file path to load the object from.
- Returns
A new svd2vec object
- Return type
svd2vec
-
save_word2vec_format(path)
Saves the word vectors to a path using the same format as word2vec. The file can then be used by other modules or libraries able to load word2vec vectors.
- Parameters
path (string) – The file path to write the object to. The directories must already exist.
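For readers unfamiliar with the word2vec layout, the sketch below writes and reads the plain-text variant: a header line with the vocabulary size and the vector dimension, then one line per word. Whether save_word2vec_format writes the text or the binary variant is not stated here, so treat this as an assumption. Both helper names are hypothetical.

```python
import io


def write_word2vec_text(vectors, stream):
    """Write {word: [floats]} to a text stream in the word2vec text layout."""
    dim = len(next(iter(vectors.values()))) if vectors else 0
    # Header line: vocabulary size and vector dimensionality.
    stream.write(f"{len(vectors)} {dim}\n")
    for word, vec in vectors.items():
        stream.write(word + " " + " ".join(repr(float(v)) for v in vec) + "\n")


def read_word2vec_text(stream):
    """Read the layout produced by write_word2vec_text back into a dict."""
    count, dim = map(int, stream.readline().split())
    vectors = {}
    for _ in range(count):
        parts = stream.readline().rstrip("\n").split(" ")
        values = [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {parts[0]!r}")
        vectors[parts[0]] = values
    return vectors


buffer = io.StringIO()
write_word2vec_text({"cat": [0.1, 0.2], "dog": [0.3, 0.4]}, buffer)
buffer.seek(0)
round_trip = read_word2vec_text(buffer)
```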
-
similarity(x, y)
Computes and returns the cosine similarity of the two given words.
- Parameters
x (string) – The first word to compute the similarity
y (string) – The second word to compute the similarity
- Returns
The cosine similarity between the two words
- Return type
float
Warning
The two words x and y should have been trained during the initialization step.
-
distance(x, y)
Computes and returns the cosine distance of the two given words.
- Parameters
x (string) – The first word to compute the distance
y (string) – The second word to compute the distance
- Returns
The cosine distance between the two words
- Return type
float
- Raises
ValueError – If either x or y has not been trained during the initialization step.
Warning
The two words x and y should have been trained during the initialization step.
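The relationship between the similarity and distance methods can be sketched on raw vectors. The code below uses the common convention that cosine distance is one minus cosine similarity. That svd2vec follows this exact convention is an assumption, and the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Common convention: 0 for identical directions, 2 for opposite ones."""
    return 1.0 - cosine_similarity(u, v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```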
-
most_similar(positive=[], negative=[], topn=10)
Computes and returns the most similar words from those given in positive and negative.
- Parameters
positive (list of string or string) – Each word in positive will contribute positively to the output words. A single word can also be passed to compute its most similar words.
negative (list of string) – Each word in negative will contribute negatively to the output words
topn (int) – Number of similar words to output
- Returns
A list of tuples, each pairing a similar word with its similarity to the input.
- Return type
list of (word, similarity)
- Raises
ValueError – If no input is given in either positive or negative
ValueError – If some words have not been trained during the initialization step.
Warning
The input words should have been trained during the initialization step.
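One common way to combine positive and negative words is to add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining vocabulary by cosine similarity to the result. The sketch below implements that approach on a plain dict of toy vectors. It is an assumption about the technique, not the library's code, and it takes the vectors as an explicit first argument.

```python
import math


def _unit(v):
    """Return v scaled to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def most_similar(vectors, positive=(), negative=(), topn=10):
    if isinstance(positive, str):
        positive = [positive]  # a single word is accepted for convenience
    if not positive and not negative:
        raise ValueError("at least one positive or negative word is required")
    unknown = [w for w in list(positive) + list(negative) if w not in vectors]
    if unknown:
        raise ValueError(f"words not in the model: {unknown}")

    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for sign, words in ((1.0, positive), (-1.0, negative)):
        for w in words:
            for i, x in enumerate(_unit(vectors[w])):
                query[i] += sign * x
    query = _unit(query)

    inputs = set(positive) | set(negative)
    scored = [
        (w, sum(a * b for a, b in zip(query, _unit(v))))
        for w, v in vectors.items()
        if w not in inputs  # never return the query words themselves
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]


toy = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.1, 1.0], "bus": [0.2, 0.9]}
nearest = most_similar(toy, "cat", topn=2)
```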
-
analogy(exampleA, answerA, exampleB, topn=10)
Returns the topn most probable answers to the analogy question “exampleA is to answerA as exampleB is to ?”
- Parameters
exampleA (string) – The first word to “train” the analogy on
answerA (string) – The second word to “train” the analogy on
exampleB (string) – The word for which the answer is asked
- Returns
A list of tuples, each pairing a probable answer with its similarity
- Return type
list of (word, similarity)
- Raises
ValueError – If some words have not been trained during the initialization step.
Warning
The three input words should have been trained during the initialization step.
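Analogy questions of this kind are often answered with the vector offset method: compute answerA - exampleA + exampleB and return the words closest to that point. The sketch below applies it to hand-made toy vectors. Whether svd2vec uses exactly this formula is an assumption, and this standalone function takes the vectors as an explicit argument.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, exampleA, answerA, exampleB, topn=10):
    inputs = (exampleA, answerA, exampleB)
    missing = [w for w in inputs if w not in vectors]
    if missing:
        raise ValueError(f"words not in the model: {missing}")
    # Vector offset: move from exampleA to answerA, starting at exampleB.
    target = [
        a - e + b
        for e, a, b in zip(vectors[exampleA], vectors[answerA], vectors[exampleB])
    ]
    candidates = [
        (w, cosine(target, v)) for w, v in vectors.items() if w not in inputs
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:topn]


# Toy dimensions: [male, female, royal]
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.1, 0.1, -1.0],
}
answers = analogy(toy, "man", "king", "woman", topn=1)
```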
-
evaluate_word_pairs(pairs, delimiter='\t')
Evaluates the model similarity using a pairs file of human judgments of similarities.
- Parameters
pairs (string) – The path to a CSV file. Lines starting with ‘#’ are ignored. The first and second columns are the words. The third column is the human-assigned similarity.
delimiter (string) – The delimiter of the csv file
- Returns
The first value is the Pearson coefficient (1.0 means the model agrees very well with the human judgments, 0.0 means it agrees very poorly). The second value is the two-tailed p-value.
- Return type
tuple
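The evaluation described above can be sketched in two steps: parse the pairs file while skipping comment lines, then correlate the human scores with the model's similarities. The example reads from an in-memory string instead of a file, uses made-up model similarities, and computes only the Pearson coefficient. The two-tailed p-value is omitted because it needs a statistics library. All names are hypothetical.

```python
import csv
import io
import math


def read_pairs(stream, delimiter="\t"):
    """Yield-ready list of (word1, word2, human_score); '#' lines are skipped."""
    rows = []
    lines = (line for line in stream if not line.startswith("#"))
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) >= 3:  # ignore blank or incomplete lines
            rows.append((row[0], row[1], float(row[2])))
    return rows


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


sample = "# word1\tword2\tscore\ncat\tdog\t8.0\ncat\tcar\t2.0\ndog\tbus\t3.0\n"
pairs = read_pairs(io.StringIO(sample))
human = [score for _, _, score in pairs]
model = [0.9, 0.2, 0.3]  # hypothetical model similarities, one per pair
coefficient = pearson(model, human)
```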