Vocab¶

Vocab is a Python package that provides vocabulary objects for natural language processing.

Installation¶

pip install vocab

Alternatively, install directly from the GitHub repository:

pip install git+https://github.com/vzhong/vocab.git

Usage¶

>>> from vocab import Vocab, UnkVocab
>>> v = Vocab()
>>> v.word2index('hello', train=True)
0
>>> v.word2index(['hello', 'world'], train=True)
[0, 1]
>>> v.index2word([1, 0])
['world', 'hello']
>>> v.index2word(1)
'world'
>>> small = v.prune_by_count(2)
>>> small.to_dict()
{'counts': {'hello': 2}, 'index2word': ['hello']}
>>> u = UnkVocab()
>>> u.word2index(['hello', 'world'], train=True)
[1, 2]
>>> u.word2index('hello friend !'.split())
[1, 0, 0]
>>> u.index2word(0)
'<unk>'

vocab package¶

Submodules¶

vocab.unk_vocab module¶

class vocab.unk_vocab.UnkVocab(words=())[source]¶: Bases: vocab.vocab.Vocab
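
The Usage session above suggests that UnkVocab reserves index 0 for '<unk>' and returns that index for unknown words when train is False. The following is a minimal sketch of that fallback, not the package's implementation; the class name `SketchUnkVocab` and its attributes `words` and `index` are hypothetical.

```python
class SketchUnkVocab:
    """Hypothetical stand-in that mimics the behaviour shown in Usage."""

    def __init__(self, unk='<unk>'):
        # Assumption: the unknown token always occupies index 0.
        self.words = [unk]
        self.index = {unk: 0}

    def word2index(self, word, train=False):
        # Lists are converted element by element.
        if isinstance(word, (list, tuple)):
            return [self.word2index(w, train=train) for w in word]
        if word not in self.index:
            if not train:
                return 0  # fall back to the unknown token
            self.index[word] = len(self.words)
            self.words.append(word)
        return self.index[word]


u = SketchUnkVocab()
trained = u.word2index(['hello', 'world'], train=True)
looked_up = u.word2index('hello friend !'.split())
```

Here `trained` and `looked_up` reproduce the two lists printed in the Usage session for the same calls.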

vocab.vocab module¶

exception vocab.vocab.OutOfVocabularyException[source]¶: Bases: Exception

class vocab.vocab.Vocab(words=())[source]¶

Bases: object

A vocabulary object for converting between words and numerical indices.

_index2word¶

an ordered list of words in the vocabulary.

Type:	list

_word2index¶

maps words to their respective indices.

Type:	dict

counts¶

the number of times each word has been added to the vocabulary.

Type:	dict
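
To make the relationship between these three attributes concrete, here is a minimal stand-in. It is not the package's own code: the class name `TinyVocab` and its `add` helper are hypothetical, and only the attribute names are taken from the documentation above. Whether the real constructor counts its initial words is also an assumption.

```python
class TinyVocab:
    """Hypothetical stand-in illustrating the three documented attributes."""

    def __init__(self, words=()):
        self._index2word = []  # ordered list of words
        self._word2index = {}  # word -> position in _index2word
        self.counts = {}       # word -> number of times it was added
        for w in words:
            self.add(w)

    def add(self, word):
        # Hypothetical helper: register the word if new, then bump its count.
        if word not in self._word2index:
            self._word2index[word] = len(self._index2word)
            self._index2word.append(word)
        self.counts[word] = self.counts.get(word, 0) + 1
        return self._word2index[word]

    def __len__(self):
        return len(self._index2word)


v = TinyVocab(['hello', 'world', 'hello'])
```

The list and the dict are inverses of each other, so `v._index2word[v._word2index[w]] == w` holds for every word in the sketch.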

__init__(words=())[source]¶

Parameters:	words (`list` of `str`, optional) – words to build vocab from.

Example

>>> Vocab(['initial', 'words', 'for', 'the', 'vocabulary'])

__len__()[source]¶

Returns:	number of words in the vocabulary.
Return type:	int

contains_same_content(another, same_counts=True)[source]¶

Parameters:	another (Vocab) – another vocab to compare against.
	same_counts (`bool`, optional) – whether to also check the counts.
Returns:	whether this vocab and another contain the same content.
Return type:	bool

copy(keep_words=True)[source]¶

Parameters:	keep_words (bool) – whether to copy words in the vocab. Defaults to True.
Returns:	a copy of this vocab.
Return type:	Vocab

classmethod from_dict(d)[source]¶

Parameters:	d (dict) – dictionary of the vocab object.
Returns:	vocab object from the given dictionary.
Return type:	Vocab

index2word(index)[source]¶

Parameters:	index (int) – index to look up word for.
Returns:	word corresponding to index. If index is a `list` of `int`, this function is applied to each index and the corresponding list of words is returned.
Return type:	str, or `list` of `str` if index is a list
Raises:	`OutOfVocabularyException` – if index is not a valid index to the vocabulary.
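
The scalar-or-list behaviour described above can be sketched as follows. This is an illustration only: `lookup_word` and `OutOfVocabularyError` are hypothetical names, and treating negative indices as invalid is an assumption rather than documented behaviour.

```python
class OutOfVocabularyError(Exception):
    """Hypothetical stand-in for vocab.vocab.OutOfVocabularyException."""


def lookup_word(words, index):
    # A list of indices is handled element by element.
    if isinstance(index, (list, tuple)):
        return [lookup_word(words, i) for i in index]
    # Assumption: only 0 <= index < len(words) counts as valid.
    if not 0 <= index < len(words):
        raise OutOfVocabularyError(index)
    return words[index]


words = ['hello', 'world']
single = lookup_word(words, 1)
several = lookup_word(words, [1, 0])
```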

padded_index2word(padded_indices, pad='<pad>')[source]¶

Parameters:	padded_indices (list) – list of lists of word indices to depad.
	pad (`str`, optional) – word to use for padding. Defaults to '<pad>'.
Returns:	list of lists of words that correspond to the depadded padded_indices.
	list: list of lengths for each valid sequence. Note that if enforce_end_pad=True (see word2padded_index), then the valid sequence includes the additional pad at the end.
Return type:	list
Raises:	`OutOfVocabularyException` – if padded_indices contains indices not in the vocabulary or if pad is a word not in the vocabulary.
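
A rough sketch of depadding follows, assuming it means dropping the trailing pad indices from each row before converting the rest to words. The documentation does not spell out the exact rule, so this interpretation and the function name `depad` are both assumptions.

```python
def depad(padded_indices, index2word, pad='<pad>'):
    # list.index raises ValueError if the pad word is unknown; the real
    # package is documented to raise OutOfVocabularyException instead.
    pad_index = index2word.index(pad)
    result = []
    for row in padded_indices:
        end = len(row)
        # Assumption: only trailing pads are stripped.
        while end > 0 and row[end - 1] == pad_index:
            end -= 1
        result.append([index2word[i] for i in row[:end]])
    return result


index2word = ['<pad>', 'hello', 'world']
sentences = depad([[1, 2, 0], [1, 0, 0]], index2word)
```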

prune_by_count(cutoff)[source]¶

Parameters:	cutoff (int) – words occurring fewer than this number of times are removed from the new vocab.
Returns:	a copy of this vocab object with words occurring fewer than cutoff times removed.
Return type:	Vocab
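
The count-based pruning described above can be sketched on plain data structures. The helper name `prune_by_count_sketch` is hypothetical, and the returned dictionary simply borrows the shape of the to_dict() output shown in the Usage section.

```python
def prune_by_count_sketch(index2word, counts, cutoff):
    # Keep words seen at least `cutoff` times, preserving their order.
    kept = [w for w in index2word if counts.get(w, 0) >= cutoff]
    return {'counts': {w: counts[w] for w in kept}, 'index2word': kept}


pruned = prune_by_count_sketch(
    ['hello', 'world'], {'hello': 2, 'world': 1}, cutoff=2)
```

With these inputs `pruned` equals the dictionary printed in the Usage session after `v.prune_by_count(2)`.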

prune_by_total(total)[source]¶

Parameters:	total (int) – maximum vocab size
Returns:	a copy of this vocab with only the top total words kept.
Return type:	Vocab
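
One plausible reading of "top total words" is the total most frequent words. The sketch below assumes that reading, breaks ties by original position, and keeps the surviving words in their original order; none of these details are confirmed by the documentation, and `prune_by_total_sketch` is a hypothetical name.

```python
def prune_by_total_sketch(index2word, counts, total):
    # sorted() is stable, so equally frequent words keep their original order.
    ranked = sorted(index2word, key=lambda w: -counts.get(w, 0))
    keep = set(ranked[:total])
    # Assumption: surviving words stay in their original relative order.
    kept = [w for w in index2word if w in keep]
    return {'counts': {w: counts[w] for w in kept}, 'index2word': kept}


top_two = prune_by_total_sketch(
    ['a', 'b', 'c'], {'a': 1, 'b': 3, 'c': 2}, total=2)
```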

to_dict()[source]¶

Returns:	dictionary of the vocab object.
Return type:	dict
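
Because to_dict() and from_dict() exchange a plain dictionary, a vocabulary can in principle be saved as JSON. The sketch below works directly on a dictionary in the shape printed in the Usage section ('counts' and 'index2word' keys) rather than calling the package. Rebuilding the reverse map this way is an assumption about what from_dict needs to do, not a description of its code.

```python
import json

# Dictionary in the shape shown by the to_dict() output in the Usage section.
d = {'counts': {'hello': 2, 'world': 1}, 'index2word': ['hello', 'world']}

# The dictionary holds only strings, ints and lists, so it survives JSON.
restored = json.loads(json.dumps(d))

# The word -> index map can be rebuilt from the ordered word list.
word2index = {w: i for i, w in enumerate(restored['index2word'])}
```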

word2index(word, train=False)[source]¶

Parameters:	word (str) – word to look up index for.
	train (`bool`, optional) – if True, then this word will be added to the vocabulary. Defaults to False.
Returns:	index corresponding to word. If word is a `list` of `str`, this function is applied to each word and the corresponding list of indices is returned.
Return type:	int, or `list` of `int` if word is a list
Raises:	`OutOfVocabularyException` – if train is False and word is not in the vocabulary.
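
The effect of the train flag can be illustrated with a small stand-in. `SketchVocab`, its attributes, and `OutOfVocabularyError` are hypothetical names used only for this sketch; the package's own class may be organised differently.

```python
class OutOfVocabularyError(Exception):
    """Hypothetical stand-in for vocab.vocab.OutOfVocabularyException."""


class SketchVocab:
    def __init__(self):
        self.words = []  # index -> word
        self.index = {}  # word -> index

    def word2index(self, word, train=False):
        # A list of words is converted element by element.
        if isinstance(word, (list, tuple)):
            return [self.word2index(w, train=train) for w in word]
        if word not in self.index:
            if not train:
                raise OutOfVocabularyError(word)
            # With train=True an unseen word gets the next free index.
            self.index[word] = len(self.words)
            self.words.append(word)
        return self.index[word]


v = SketchVocab()
first = v.word2index('hello', train=True)
both = v.word2index(['hello', 'world'], train=True)
```

Looking up a word that was never added, with train left at its default, raises the exception instead of growing the vocabulary.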

word2padded_index(lists_of_words, pad='<pad>', train=False, enforce_end_pad=True)[source]¶

Parameters:	lists_of_words (list) – list of lists of words to pad.
	pad (`str`, optional) – word to use for padding. Defaults to '<pad>'.
	train (`bool`, optional) – whether to add unknown words to the vocabulary. Defaults to False.
	enforce_end_pad (`bool`, optional) – whether to always append a pad word to the end of each sentence.
Returns:	list of lists of word indices, padded to form a matrix.
	list: list of lengths for each valid sequence. Note that if enforce_end_pad=True, then the valid sequence includes the additional pad at the end.
Return type:	list
Raises:	`OutOfVocabularyException` – if lists_of_words contains words not in the vocabulary and train=False.
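
The padding step and the reported lengths can be sketched as below. This is an illustration under assumptions, not the package's code: `pad_sequences_sketch` is a hypothetical helper, it takes a ready-made word-to-index dict instead of a Vocab, and it omits the train option.

```python
def pad_sequences_sketch(lists_of_words, word2index, pad='<pad>',
                         enforce_end_pad=True):
    pad_index = word2index[pad]
    rows = [[word2index[w] for w in words] for words in lists_of_words]
    if enforce_end_pad:
        # Every sentence gets one pad appended, which counts as valid.
        rows = [row + [pad_index] for row in rows]
    lengths = [len(row) for row in rows]
    width = max(lengths, default=0)
    # Right-pad shorter rows so that all rows share the same width.
    padded = [row + [pad_index] * (width - len(row)) for row in rows]
    return padded, lengths


word2index = {'<pad>': 0, 'hello': 1, 'world': 2}
padded, lengths = pad_sequences_sketch(
    [['hello', 'world'], ['hello']], word2index)
```

In this sketch the lengths include the appended pad, matching the note in the Returns field for enforce_end_pad=True.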