Vocab


Vocab is a Python package that provides vocabulary objects for natural language processing.

Installation

pip install vocab

To install directly from the GitHub repository instead:

pip install git+https://github.com/vzhong/vocab.git

Usage

>>> from vocab import Vocab, UnkVocab
>>> v = Vocab()
>>> v.word2index('hello', train=True)
0
>>> v.word2index(['hello', 'world'], train=True)
[0, 1]
>>> v.index2word([1, 0])
['world', 'hello']
>>> v.index2word(1)
'world'
>>> small = v.prune_by_count(2)
>>> small.to_dict()
{'counts': {'hello': 2}, 'index2word': ['hello']}
>>> u = UnkVocab()
>>> u.word2index(['hello', 'world'], train=True)
[1, 2]
>>> u.word2index('hello friend !'.split())
[1, 0, 0]
>>> u.index2word(0)
'<unk>'

vocab package

Submodules

vocab.unk_vocab module

class vocab.unk_vocab.UnkVocab(words=())

Bases: vocab.vocab.Vocab
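
The Usage session above suggests that UnkVocab reserves index 0 for '<unk>' and maps unseen words to it when train is False. The following is a minimal standalone sketch of that behaviour, not the package's code; the helper name `lookup` is hypothetical.

```python
# Standalone sketch of the unknown-word fallback; not the package's code.
UNK = '<unk>'
index2word = [UNK]       # assumption: index 0 is reserved for the unknown token
word2index = {UNK: 0}


def lookup(word, train=False):
    """Return the index of `word`; unseen words map to 0 unless train=True."""
    if word not in word2index:
        if not train:
            return word2index[UNK]
        word2index[word] = len(index2word)
        index2word.append(word)
    return word2index[word]


trained = [lookup(w, train=True) for w in ['hello', 'world']]
looked_up = [lookup(w) for w in 'hello friend !'.split()]
print(trained)    # [1, 2]
print(looked_up)  # [1, 0, 0]
```

The printed values mirror the UnkVocab lines of the Usage session.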

vocab.vocab module

exception vocab.vocab.OutOfVocabularyException

Bases: Exception

class vocab.vocab.Vocab(words=())

Bases: object

A vocabulary object for converting between words and numerical indices.

_index2word

list – an ordered list of words in the vocabulary.

_word2index

dict – maps words to their respective indices.

counts

dict – the number of times each word has been added to the vocabulary.
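
How the three attributes relate can be sketched with plain Python containers. This is an illustration only, not the package's implementation; the helper name `add_word` is hypothetical.

```python
# Standalone sketch of the three attributes; `add_word` is a hypothetical
# helper and not part of the vocab package.
index2word = []   # ordered list of words
word2index = {}   # word -> position in index2word
counts = {}       # word -> number of times it has been added


def add_word(word):
    """Register one occurrence of `word` and return its index."""
    if word not in word2index:
        word2index[word] = len(index2word)
        index2word.append(word)
    counts[word] = counts.get(word, 0) + 1
    return word2index[word]


for w in ['hello', 'hello', 'world']:
    add_word(w)

print(index2word)  # ['hello', 'world']
print(word2index)  # {'hello': 0, 'world': 1}
print(counts)      # {'hello': 2, 'world': 1}
```

Keeping the list and the dict in sync gives constant-time lookup in both directions.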

__init__(words=())
Parameters: words (list of str, optional) – words to build vocab from.

Example

>>> Vocab(['initial', 'words', 'for', 'the', 'vocabulary'])

__len__()
Returns: number of words in the vocabulary.
Return type: int

contains_same_content(another, same_counts=True)
Parameters:
  • another (Vocab) – another vocab to compare against.
  • same_counts (bool, optional) – whether to also check the counts.
Returns: whether this vocab and another contain the same content.
Return type: bool

copy(keep_words=True)
Parameters: keep_words (bool) – whether to copy words in the vocab. Defaults to True.
Returns: a copy of this vocab.
Return type: Vocab

classmethod from_dict(d)
Parameters: d (dict) – dictionary of the vocab object.
Returns: vocab object from the given dictionary.
Return type: Vocab

index2word(index)
Parameters: index (int or list of int) – index to look up the word for.
Returns: word corresponding to index. If index is a list of int, the lookup is applied to each index and the corresponding list of words is returned.
Return type: str or list of str
Raises: OutOfVocabularyException – if index is not a valid index into the vocabulary.
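
The scalar-or-list behaviour and the error case can be sketched as follows. This is a standalone illustration, not the package's code: the exception class is a local stand-in, the function takes the word list as an explicit argument, and treating negative indices as invalid is an assumption the documentation does not confirm.

```python
class OutOfVocabularyException(Exception):
    """Local stand-in for vocab.vocab.OutOfVocabularyException."""


def index2word(words, index):
    """Map one index, or a list of indices, to the corresponding words."""
    if isinstance(index, list):
        return [index2word(words, i) for i in index]
    # Assumption: negative or too-large indices count as invalid.
    if not 0 <= index < len(words):
        raise OutOfVocabularyException('invalid index: {}'.format(index))
    return words[index]


words = ['hello', 'world']
single = index2word(words, 1)        # 'world'
several = index2word(words, [1, 0])  # ['world', 'hello']

try:
    index2word(words, 5)
    raised = False
except OutOfVocabularyException:
    raised = True
print(single, several, raised)
```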

padded_index2word(padded_indices, pad='<pad>')
Parameters:
  • padded_indices (list) – list of lists of word indices to depad.
  • pad (str, optional) – word to use for padding. Defaults to '<pad>'.
Returns:
  • list – list of lists of words that correspond to the depadded padded_indices.
  • list – list of lengths for each valid sequence. Note that if enforce_end_pad=True, then the valid sequence includes the additional pad at the end.
Return type: list
Raises: OutOfVocabularyException – if padded_indices contains indices not in the vocabulary or if pad is a word not in the vocabulary.
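
One plausible reading of "depadding" is to map each index back to its word and then drop trailing pad words. The sketch below illustrates that reading only; it is not the package's code, the helper name `depad` is hypothetical, and placing '<pad>' at index 0 is an assumption.

```python
def depad(rows, pad='<pad>'):
    """Strip trailing `pad` entries from each row of words."""
    result = []
    for row in rows:
        end = len(row)
        while end > 0 and row[end - 1] == pad:
            end -= 1
        result.append(row[:end])
    return result


words = ['<pad>', 'hello', 'world']   # assumption: pad sits at index 0
padded_indices = [[1, 2, 0], [1, 0, 0]]

# Map indices to words first, then remove the padding.
as_words = [[words[i] for i in row] for row in padded_indices]
depadded = depad(as_words)
print(depadded)  # [['hello', 'world'], ['hello']]
```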

prune_by_count(cutoff)
Parameters: cutoff (int) – words occurring fewer than this number of times are removed from the new vocab.
Returns: a copy of this vocab object with words occurring fewer than cutoff times removed.
Return type: Vocab
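
The cutoff rule can be sketched on plain containers. This is an illustration of the documented rule, not the package's implementation; the function below is a hypothetical standalone helper that returns a word list and a counts dict rather than a Vocab.

```python
def prune_by_count(index2word, counts, cutoff):
    """Keep words whose count is at least `cutoff`, preserving order."""
    kept = [w for w in index2word if counts.get(w, 0) >= cutoff]
    return kept, {w: counts.get(w, 0) for w in kept}


index2word = ['hello', 'world']
counts = {'hello': 2, 'world': 1}

small_words, small_counts = prune_by_count(index2word, counts, 2)
print(small_words)   # ['hello']
print(small_counts)  # {'hello': 2}
```

The result matches the `small.to_dict()` output in the Usage session, where 'world' (seen once) is dropped at cutoff 2.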

prune_by_total(total)
Parameters: total (int) – maximum vocab size.
Returns: a copy of this vocab with only the top total words kept.
Return type: Vocab
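
Assuming "top" means most frequent, the selection can be sketched as below. This is not the package's code: the helper is a hypothetical standalone function, and both the tie-breaking rule (earlier words win) and the ordering of the result (by descending count) are assumptions.

```python
def prune_by_total(counts, total):
    """Return the `total` most frequent words; ties keep insertion order."""
    # sorted() is stable, so equally frequent words stay in their original order.
    ranked = sorted(counts, key=lambda w: -counts[w])
    return ranked[:total]


counts = {'the': 5, 'cat': 2, 'sat': 2, 'mat': 1}
top = prune_by_total(counts, 2)
print(top)  # ['the', 'cat']
```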

to_dict()
Returns: dictionary of the vocab object.
Return type: dict
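
In the Usage session, `to_dict()` returns only dicts, lists, strings and ints, so the result can presumably be stored with the standard `json` module. The sketch below uses the dictionary shown there as literal data and does not call the package; rebuilding the word-to-index mapping from 'index2word' is an assumption about what `from_dict` would need to do.

```python
import json

# Dictionary copied from the Usage session's `small.to_dict()` output.
d = {'counts': {'hello': 2}, 'index2word': ['hello']}

text = json.dumps(d)          # serialize to a JSON string
restored = json.loads(text)   # parse it back into Python containers

# Assumption: the word -> index mapping can be rebuilt from the ordered list.
word2index = {w: i for i, w in enumerate(restored['index2word'])}
print(restored == d)  # True
print(word2index)     # {'hello': 0}
```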

word2index(word, train=False)
Parameters:
  • word (str or list of str) – word to look up the index for.
  • train (bool, optional) – if True, the word is added to the vocabulary. Defaults to False.
Returns: index corresponding to word. If word is a list of str, the lookup is applied to each word and the corresponding list of indices is returned.
Return type: int or list of int
Raises: OutOfVocabularyException – if train is False and word is not in the vocabulary.

word2padded_index(lists_of_words, pad='<pad>', train=False, enforce_end_pad=True)
Parameters:
  • lists_of_words (list) – list of lists of words to pad.
  • pad (str, optional) – word to use for padding. Defaults to '<pad>'.
  • train (bool, optional) – whether to add unknown words to the vocabulary. Defaults to False.
  • enforce_end_pad (bool, optional) – whether to always append a pad word to the end of each sentence. Defaults to True.
Returns:
  • list – list of lists of word indices, padded to form a matrix.
  • list – list of lengths for each valid sequence. Note that if enforce_end_pad=True, then the valid sequence includes the additional pad at the end.
Return type: list
Raises: OutOfVocabularyException – if lists_of_words contains words not in the vocabulary and train=False.
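
The padding step described above can be sketched on sequences that are already converted to indices. This is a standalone illustration, not the package's code; the helper name `pad_sequences` is hypothetical, and using 0 as the pad index is an assumption.

```python
def pad_sequences(seqs, pad_index, enforce_end_pad=True):
    """Pad index sequences to equal length; also return the valid lengths."""
    if enforce_end_pad:
        # Append one pad to every sequence; it counts toward the valid length.
        seqs = [s + [pad_index] for s in seqs]
    lengths = [len(s) for s in seqs]
    max_len = max(lengths) if lengths else 0
    padded = [s + [pad_index] * (max_len - len(s)) for s in seqs]
    return padded, lengths


padded, lengths = pad_sequences([[1, 2, 3], [4]], pad_index=0)
print(padded)   # [[1, 2, 3, 0], [4, 0, 0, 0]]
print(lengths)  # [4, 2]
```

With enforce_end_pad=True, each reported length is one more than the number of real words, which is the point of the note in the Returns entry.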
