svd2vec documentation

class svd2vec.svd2vec(documents, size=150, min_count=2, window=10, dyn_window_weight=1, cds_alpha=0.75, neg_k_shift=5, eig_p_weight=0, nrm_type='row', sub_threshold=1e-05, verbose=False, workers=-1)

A vector representation of the words found in the given documents.

Parameters
  • documents (list of list of string) – The list of documents, each document being a list of words

  • size (int) – Maximum number of extracted features for each word

  • min_count (int) – Minimum number of occurrences a word needs to be included in the model

  • window (int or tuple of ints) – Size of the context window, in words. If an int is given, it is equivalent to a symmetric tuple (int, int).

  • dyn_window_weight (WINDOW_WEIGHT_HARMONIC or WINDOW_WEIGHT_WORD2VEC) – The window weighting scheme.

  • cds_alpha (float) – The context distribution smoothing constant that smooths the context frequency

  • neg_k_shift (int) – The shift k applied to the PMI values (log k is subtracted from each value)

  • eig_p_weight (float) – The eigenvalue weighting exponent applied to the eigenvalue matrix

  • nrm_type (string) – The normalization scheme to use for the L2 normalization (one of NRM_SCHEMES)

  • sub_threshold (float) – A threshold for subsampling (diluting very frequent words). A higher value means fewer words are removed.

  • verbose (bool) – If True, displays progress during the init step

  • workers (int) – The number of workers to use in parallel (should not exceed the number of cores available on the computer)
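
As an illustration of how cds_alpha and neg_k_shift interact, the following is a minimal sketch of shifted positive PMI with context distribution smoothing. It is not taken from the svd2vec source: the function name shifted_ppmi and the pair_counts input are hypothetical, and the formula assumed here is the usual one (context counts raised to the power alpha, log k subtracted from the PMI, negative values clipped to zero).

```python
import math
from collections import Counter


def shifted_ppmi(pair_counts, cds_alpha=0.75, neg_k_shift=5):
    """Return {(word, context): score} for every observed pair.

    pair_counts maps (word, context) to a co-occurrence count.
    Hypothetical helper: illustrates the formula, not svd2vec internals.
    """
    word_counts = Counter()
    context_counts = Counter()
    for (word, context), count in pair_counts.items():
        word_counts[word] += count
        context_counts[context] += count

    total = sum(pair_counts.values())
    # Context distribution smoothing: raise context counts to the power alpha.
    smoothed = {c: n ** cds_alpha for c, n in context_counts.items()}
    smoothed_total = sum(smoothed.values())

    scores = {}
    for (word, context), count in pair_counts.items():
        p_pair = count / total
        p_word = word_counts[word] / total
        p_context = smoothed[context] / smoothed_total
        pmi = math.log(p_pair / (p_word * p_context))
        # Shift by log(k), then clip negative values to zero.
        scores[(word, context)] = max(pmi - math.log(neg_k_shift), 0.0)
    return scores


toy_counts = {("a", "x"): 2, ("b", "y"): 2}
print(shifted_ppmi(toy_counts, cds_alpha=1.0, neg_k_shift=1))
```

With neg_k_shift=1 no shift is applied (log 1 = 0), so the scores reduce to plain positive PMI. Larger values of k push more pairs down to zero.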

WINDOW_WEIGHT_HARMONIC = 0

The harmonic weighting scheme for context words (1/5, 1/4, 1/3, 1/2, …)

WINDOW_WEIGHT_WORD2VEC = 1

The word2vec weighting scheme for context words (1/5, 2/5, 3/5, 4/5, …)
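
The two sequences above can be reproduced with a short sketch. The helper context_weight is hypothetical (it is not part of the svd2vec API) and assumes that the weight depends only on the distance between the context word and the center word, for a window of 5 words.

```python
from fractions import Fraction

WINDOW_WEIGHT_HARMONIC = 0
WINDOW_WEIGHT_WORD2VEC = 1


def context_weight(distance, window, scheme):
    """Weight of a context word `distance` positions away from the center word."""
    if not 1 <= distance <= window:
        raise ValueError("distance must be between 1 and window")
    if scheme == WINDOW_WEIGHT_HARMONIC:
        # 1/distance: 1/5, 1/4, 1/3, 1/2, 1
        return Fraction(1, distance)
    if scheme == WINDOW_WEIGHT_WORD2VEC:
        # (window - distance + 1)/window: 1/5, 2/5, 3/5, 4/5, 1
        return Fraction(window - distance + 1, window)
    raise ValueError("unknown weighting scheme")


# Weights for a window of 5, from the farthest context word to the nearest.
for scheme in (WINDOW_WEIGHT_HARMONIC, WINDOW_WEIGHT_WORD2VEC):
    print([str(context_weight(d, 5, scheme)) for d in range(5, 0, -1)])
```

Both schemes give the nearest context word a weight of 1. They differ in how quickly the weight decays with distance.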

NRM_SCHEMES = ['none', 'row', 'column', 'both']

Available normalization schemes
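
For readers unfamiliar with the terms, here is a minimal sketch of what 'row' and 'column' L2 normalization mean for a small matrix stored as a list of lists. The helper names are hypothetical and the code is not taken from svd2vec. The 'both' scheme is left out because the order in which it combines the two is not specified here.

```python
import math


def l2_normalize_rows(matrix):
    """Scale each row to unit L2 norm; all-zero rows are left unchanged."""
    result = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm for v in row] if norm else list(row))
    return result


def l2_normalize_columns(matrix):
    """Scale each column to unit L2 norm by normalizing the rows of the transpose."""
    transposed = [list(col) for col in zip(*matrix)]
    normalized = l2_normalize_rows(transposed)
    return [list(row) for row in zip(*normalized)]


toy_matrix = [[3.0, 4.0], [0.0, 2.0]]
print(l2_normalize_rows(toy_matrix))
print(l2_normalize_columns(toy_matrix))
```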

save(path)

Saves the svd2vec object to the given path.

Parameters

path (string) – The file path to write the object to. The parent directories must already exist.

load(path)

Loads a previously saved svd2vec object from a path.

Parameters

path (string) – The file path to load the object from.

Returns

A new svd2vec object

Return type

svd2vec

save_word2vec_format(path)

Saves the word vectors to a path using the same format as word2vec. The file can then be used by other modules or libraries able to load word2vec vectors.

Parameters

path (string) – The file path to write the vectors to. The parent directories must already exist.
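
To make the file layout concrete, the sketch below writes a tiny vocabulary in the plain-text word2vec layout: a header line with the vocabulary size and the vector dimension, then one line per word. It assumes the text variant of the format (word2vec also has a binary variant), and write_word2vec_text is a hypothetical helper, not the library's implementation.

```python
import io


def write_word2vec_text(vectors, stream):
    """Write {word: [floats]} in the plain-text word2vec layout."""
    dimension = len(next(iter(vectors.values()))) if vectors else 0
    # Header line: vocabulary size followed by vector dimension.
    stream.write(f"{len(vectors)} {dimension}\n")
    for word, vector in vectors.items():
        # One line per word: the word, then its components, space separated.
        stream.write(word + " " + " ".join(f"{v:.6f}" for v in vector) + "\n")


buffer = io.StringIO()
write_word2vec_text({"king": [0.5, -0.25], "queen": [0.125, 1.0]}, buffer)
print(buffer.getvalue())
```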

similarity(x, y)

Computes and returns the cosine similarity of the two given words.

Parameters
  • x (string) – The first word to compute the similarity

  • y (string) – The second word to compute the similarity

Returns

The cosine similarity between the two words

Return type

float

Warning

The two words x and y should have been trained during the initialization step.
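
For reference, cosine similarity is the dot product of two vectors divided by the product of their L2 norms. The sketch below computes it on plain Python lists; cosine_similarity is a hypothetical standalone helper that takes vectors, whereas the method above takes words.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The result lies between -1.0 and 1.0 and does not depend on the lengths of the vectors, only on their directions.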

distance(x, y)

Computes and returns the cosine distance of the two given words.

Parameters
  • x (string) – The first word to compute the distance

  • y (string) – The second word to compute the distance

Returns

The cosine distance between the two words

Return type

float

Raises

ValueError – If either x or y has not been trained during the initialization step.

Warning

The two words x and y should have been trained during the initialization step.
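
The following sketch shows one way the lookup and the error case could fit together. It assumes that cosine distance is defined as 1 minus the cosine similarity, which is the common convention but is not stated above. The function cosine_distance and the toy_vectors dictionary are hypothetical.

```python
import math


def cosine_distance(vectors, x, y):
    """Return 1 - cosine similarity for two words of a {word: vector} mapping."""
    for word in (x, y):
        if word not in vectors:
            # Mirrors the documented behavior: unknown words raise ValueError.
            raise ValueError(f"word {word!r} was not seen during training")
    u, v = vectors[x], vectors[y]
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


toy_vectors = {"north": [0.0, 1.0], "east": [1.0, 0.0], "up": [0.0, 2.0]}
print(cosine_distance(toy_vectors, "north", "east"))
print(cosine_distance(toy_vectors, "north", "up"))
```

Under this convention, a distance of 0.0 means the two vectors point in the same direction and a distance of 2.0 means they point in opposite directions.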

most_similar(positive=[], negative=[], topn=10)

Computes and returns the words most similar to those given in positive and least similar to those given in negative.

Parameters
  • positive (list of string or string) – Each word in positive will contribute positively to the output words. A single word can also be passed to compute its most similar words.

  • negative (list of string) – Each word in negative will contribute negatively to the output words

  • topn (int) – Number of similar words to output

Returns

A list of tuples, each holding a similar word and its similarity to the given input.

Return type

list of (word, similarity)

Raises
  • ValueError – If no input is given in either positive or negative

  • ValueError – If some words have not been trained during the initialization step.

Warning

The input words should have been trained during the initialization step.
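
A common way to implement this kind of query is to add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining vocabulary by cosine similarity to the result. The sketch below follows that approach as an assumption; it is not confirmed to be what svd2vec does internally, and most_similar_sketch and toy_vectors are hypothetical names.

```python
import math


def _unit(vector):
    """Return the vector scaled to unit L2 norm (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(a * a for a in vector))
    return [a / norm for a in vector] if norm else list(vector)


def most_similar_sketch(vectors, positive=(), negative=(), topn=10):
    """Rank words of a {word: vector} mapping against positive/negative inputs."""
    if isinstance(positive, str):
        positive = [positive]
    if isinstance(negative, str):
        negative = [negative]
    if not positive and not negative:
        raise ValueError("at least one positive or negative word is required")
    inputs = list(positive) + list(negative)
    missing = [w for w in inputs if w not in vectors]
    if missing:
        raise ValueError(f"words not seen during training: {missing}")

    dimension = len(next(iter(vectors.values())))
    query = [0.0] * dimension
    for sign, words in ((1.0, positive), (-1.0, negative)):
        for word in words:
            for i, a in enumerate(_unit(vectors[word])):
                query[i] += sign * a
    query = _unit(query)

    # Score every word that is not part of the query, best match first.
    scored = [
        (word, sum(a * b for a, b in zip(query, _unit(vector))))
        for word, vector in vectors.items()
        if word not in inputs
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


toy_vectors = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "car": [0.1, 1.0],
    "bus": [0.2, 0.9],
}
print(most_similar_sketch(toy_vectors, "cat", topn=2))
```

In this sketch the input words themselves are excluded from the output, so a word is never returned as its own nearest neighbor.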

analogy(exampleA, answerA, exampleB, topn=10)

Returns the topn most probable answers to the analogy question “exampleA is to answerA as exampleB is to ?”

Parameters
  • exampleA (string) – The first word to “train” the analogy on

  • answerA (string) – The second word to “train” the analogy on

  • exampleB (string) – The word for which the analogous answer is requested

  • topn (int) – Number of answers to output

Returns

A list of tuples, each holding a probable answer to the analogy and its similarity.

Return type

list of (word, similarity)

Raises

ValueError – If some words have not been trained during the initialization step.

Warning

The three input words should have been trained during the initialization step.
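
Analogy queries are often answered with the vector offset method: compute answerA - exampleA + exampleB and return the words closest to that point. The sketch below assumes this method and works on raw (not normalized) toy vectors; whether svd2vec uses exactly this formulation is not stated above, and analogy_sketch and toy_vectors are hypothetical names.

```python
import math


def analogy_sketch(vectors, exampleA, answerA, exampleB, topn=10):
    """Answer "exampleA is to answerA as exampleB is to ?" by vector offset."""
    inputs = (exampleA, answerA, exampleB)
    missing = [w for w in inputs if w not in vectors]
    if missing:
        raise ValueError(f"words not seen during training: {missing}")

    # Target point: answerA - exampleA + exampleB.
    target = [
        b - a + c
        for a, b, c in zip(vectors[exampleA], vectors[answerA], vectors[exampleB])
    ]
    target_norm = math.sqrt(sum(t * t for t in target))

    scored = []
    for word, vector in vectors.items():
        if word in inputs:
            continue  # the question words are not valid answers
        dot = sum(t * v for t, v in zip(target, vector))
        norm = math.sqrt(sum(v * v for v in vector))
        scored.append((word, dot / (target_norm * norm)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}
print(analogy_sketch(toy_vectors, "man", "woman", "king", topn=1))
```

In the toy data the offset from "man" to "woman" is the same as the offset from "king" to "queen", so "queen" is returned with a similarity close to 1.0.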

evaluate_word_pairs(pairs, delimiter='\t')

Evaluates the similarities computed by the model against a file of word pairs scored by human judges.

Parameters
  • pairs (string) – The path of a CSV file. Lines starting with ‘#’ are ignored. The first and second columns are the words. The third column is the human-assigned similarity.

  • delimiter (string) – The delimiter of the csv file

Returns

The first value is the Pearson correlation coefficient (1.0 means the model agrees perfectly with the human judgments, 0.0 means there is no linear correlation). The second value is the two-tailed p-value.

Return type

tuple
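
The sketch below illustrates the expected file layout and how a Pearson coefficient is obtained from it. The helpers read_pairs and pearson are hypothetical, the model scores are made-up placeholder values rather than real results, and the p-value is left out for brevity.

```python
import io
import math


def read_pairs(stream, delimiter="\t"):
    """Parse (word1, word2, human_score) rows, skipping '#' comments and blank lines."""
    rows = []
    for line in stream:
        if line.startswith("#") or not line.strip():
            continue
        first, second, score = line.rstrip("\r\n").split(delimiter)[:3]
        rows.append((first, second, float(score)))
    return rows


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


sample_file = io.StringIO(
    "# word1\tword2\thuman score\n"
    "cat\tdog\t8.5\n"
    "cat\tcar\t1.0\n"
    "car\tbus\t7.0\n"
)
pairs = read_pairs(sample_file)
human_scores = [score for _, _, score in pairs]
# Placeholder model similarities for the three pairs (illustrative values only).
model_scores = [0.9, 0.2, 0.7]
print(pearson(human_scores, model_scores))
```

A coefficient near 1.0 means the model ranks the pairs in much the same way as the human judges did.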