Vocab
Vocab is a Python package that provides vocabulary objects for natural language processing.
Installation
Install from PyPI:
pip install vocab
Or install directly from the GitHub repository:
pip install git+https://github.com/vzhong/vocab.git
Usage
>>> from vocab import Vocab, UnkVocab
>>> v = Vocab()
>>> v.word2index('hello', train=True)
0
>>> v.word2index(['hello', 'world'], train=True)
[0, 1]
>>> v.index2word([1, 0])
['world', 'hello']
>>> v.index2word(1)
'world'
>>> small = v.prune_by_count(2)
>>> small.to_dict()
{'counts': {'hello': 2}, 'index2word': ['hello']}
>>> u = UnkVocab()
>>> u.word2index(['hello', 'world'], train=True)
[1, 2]
>>> u.word2index('hello friend !'.split())
[1, 0, 0]
>>> u.index2word(0)
'<unk>'
vocab package
Submodules
vocab.unk_vocab module
class vocab.unk_vocab.UnkVocab(words=())
Bases: vocab.vocab.Vocab
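
The usage example suggests that UnkVocab reserves index 0 for '<unk>' and returns it for unseen words when train is False. A minimal sketch of that fallback behaviour follows. TinyUnkVocab and its attributes are hypothetical names for illustration, not the package's implementation.

```python
class TinyUnkVocab:
    """Hypothetical stand-in, not the vocab package's UnkVocab."""

    def __init__(self, unk='<unk>'):
        # Assumption: index 0 is reserved for the unknown token.
        self.index2word = [unk]
        self.word2idx = {unk: 0}

    def word2index(self, word, train=False):
        # Lists are handled by applying the lookup to each element.
        if isinstance(word, (list, tuple)):
            return [self.word2index(w, train=train) for w in word]
        if word not in self.word2idx:
            if not train:
                return 0  # fall back to the unknown token
            self.word2idx[word] = len(self.index2word)
            self.index2word.append(word)
        return self.word2idx[word]


u = TinyUnkVocab()
trained = u.word2index(['hello', 'world'], train=True)
looked_up = u.word2index('hello friend !'.split())
print(trained, looked_up)
```
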
vocab.vocab module
class vocab.vocab.Vocab(words=())
Bases: object
A vocabulary object for converting between words and numerical indices.
__init__(words=())
Parameters: words (list of str, optional) – words to build vocab from.
Example
>>> Vocab(['initial', 'words', 'for', 'the', 'vocabulary'])
contains_same_content(another, same_counts=True)
Returns: whether this vocab and another contain the same content.
copy(keep_words=True)
Parameters: keep_words (bool) – whether to copy words in the vocab. Defaults to True.
Returns: a copy of this vocab.
Return type: Vocab
classmethod from_dict(d)
Parameters: d (dict) – dictionary of the vocab object.
Returns: vocab object from the given dictionary.
Return type: Vocab
index2word(index)
Parameters: index (int) – index to look up word for.
Returns: word corresponding to index. If index is a list of int, this function is applied to each index and the corresponding list of words is returned.
Return type: str
Raises: OutOfVocabularyException – if index is not a valid index to the vocabulary.
padded_index2word(padded_indices, pad='<pad>')
Returns:
- list of lists of words that correspond to the depadded padded_indices.
- list: list of lengths for each valid sequence. Note that if enforce_end_pad=True, then the valid sequence includes the additional pad at the end.
Raises: OutOfVocabularyException – if padded_indices contains indices not in the vocabulary or if pad is a word not in the vocabulary.
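
To illustrate what "depadding" might involve, here is a minimal sketch that strips trailing pad indices from each row and records the remaining lengths. It works on plain integer lists. The function name depad and the choice to strip every trailing pad are assumptions, and the package may count lengths differently.

```python
def depad(padded_indices, pad_index):
    """Hypothetical helper: drop trailing pad indices from each row."""
    sequences, lengths = [], []
    for row in padded_indices:
        end = len(row)
        # Walk back from the end while the row still ends in padding.
        while end > 0 and row[end - 1] == pad_index:
            end -= 1
        sequences.append(list(row[:end]))
        lengths.append(end)
    return sequences, lengths


seqs, lengths = depad([[1, 2, 3, 0], [4, 0, 0, 0]], pad_index=0)
print(seqs, lengths)
```
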
prune_by_count(cutoff)
Parameters: cutoff (int) – words occurring fewer than this number of times are removed from the new vocab.
Returns: a copy of this vocab object with words occurring fewer than cutoff times removed.
Return type: Vocab
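
The cutoff rule can be sketched on a plain word-count mapping: keep a word only if its count is at least cutoff. The standalone function below uses collections.Counter and is an illustration of the rule, not the package's code.

```python
from collections import Counter


def prune_counts_by_cutoff(counts, cutoff):
    """Hypothetical helper: keep words seen at least `cutoff` times."""
    return Counter({w: c for w, c in counts.items() if c >= cutoff})


counts = Counter({'hello': 2, 'world': 1})
small = prune_counts_by_cutoff(counts, 2)
print(dict(small))
```
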
prune_by_total(total)
Parameters: total (int) – maximum vocab size.
Returns: a copy of this vocab with only the top total words kept.
Return type: Vocab
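
Assuming "top" means most frequent, which the documentation does not state explicitly, keeping the top total words can be sketched with Counter.most_common. The function name is hypothetical.

```python
from collections import Counter


def prune_counts_by_total(counts, total):
    """Hypothetical helper: keep the `total` most frequent words."""
    # Assumption: "top" is read as "highest count".
    return Counter(dict(counts.most_common(total)))


counts = Counter({'the': 5, 'cat': 2, 'sat': 1})
top = prune_counts_by_total(counts, 2)
print(dict(top))
```
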
word2index(word, train=False)
Returns: index corresponding to word. If word is a list of str, this function is applied to each word and the corresponding list of indices is returned.
Raises: OutOfVocabularyException – if train is False and word is not in the vocabulary.
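
The role of the train flag can be sketched with a small stand-in class: with train=True unseen words are added, and with train=False they raise an error. TinyVocab and OOVError are hypothetical names. Counting a word on every training lookup is an assumption inferred from the usage example, where 'hello' ends up with a count of 2.

```python
from collections import Counter


class OOVError(Exception):
    """Hypothetical stand-in for the package's OutOfVocabularyException."""


class TinyVocab:
    """Hypothetical stand-in, not the vocab package's Vocab."""

    def __init__(self):
        self.index2word = []
        self.word2idx = {}
        self.counts = Counter()

    def word2index(self, word, train=False):
        # Lists are handled by applying the lookup to each element.
        if isinstance(word, (list, tuple)):
            return [self.word2index(w, train=train) for w in word]
        if train:
            if word not in self.word2idx:
                self.word2idx[word] = len(self.index2word)
                self.index2word.append(word)
            # Assumption: every training lookup increments the count.
            self.counts[word] += 1
        elif word not in self.word2idx:
            raise OOVError(word)
        return self.word2idx[word]


v = TinyVocab()
first = v.word2index('hello', train=True)
both = v.word2index(['hello', 'world'], train=True)
try:
    v.word2index('friend')
    raised = False
except OOVError:
    raised = True
print(first, both, raised)
```
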
word2padded_index(lists_of_words, pad='<pad>', train=False, enforce_end_pad=True)
Parameters:
- lists_of_words (list) – list of lists of words to pad.
- pad (str, optional) – word to use for padding. Defaults to '<pad>'.
- train (bool, optional) – whether to add unknown words to the vocabulary. Defaults to False.
- enforce_end_pad (bool, optional) – whether to always append a pad word to the end of each sentence.
Returns:
- list of lists of word indices that are padded to be a matrix.
- list: list of lengths for each valid sequence. Note that if enforce_end_pad=True, then the valid sequence includes the additional pad at the end.
Raises: OutOfVocabularyException – if lists_of_words contains words not in the vocabulary and train=False.
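
The padding step described above can be sketched on lists of indices: optionally append one pad to every sequence, record the lengths, then right-pad each row to the longest length. The function name pad_sequences and the use of an integer pad_index are assumptions for illustration. The package itself takes words and a pad word.

```python
def pad_sequences(lists_of_indices, pad_index, enforce_end_pad=True):
    """Hypothetical helper: right-pad index lists into a rectangular matrix."""
    # With enforce_end_pad, every sequence gets one pad appended first,
    # so the reported length includes that extra pad.
    tail = [pad_index] if enforce_end_pad else []
    seqs = [list(s) + tail for s in lists_of_indices]
    lengths = [len(s) for s in seqs]
    width = max(lengths, default=0)
    padded = [s + [pad_index] * (width - len(s)) for s in seqs]
    return padded, lengths


padded, lengths = pad_sequences([[1, 2, 3], [4]], pad_index=0)
print(padded, lengths)
```
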