Requirement
Some machine learning models require tokens generated from SMILES strings or from molecules based on a vocabulary. This proposal outlines a utility for building vocabularies from datasets in DeepChem. The problems the utility should solve are:
- Build a vocabulary from a dataset
- Save and load the vocabulary
- Tokenize data points using the vocabulary
A tokenizer is a special form of featurizer. The vocabulary is what distinguishes a tokenizer from other featurizers: the vocabulary must be built before featurization for use cases like model training. Once the vocabulary has been built, the tokenizer can be used like any other featurizer.
There are two approaches to the design:
- Model the vocabulary builder and featurizer components as separate entities
- Model the vocabulary builder and featurizer as a single entity
This design takes the latter approach, because the featurizer depends on the vocabulary generated by the vocabulary builder. When a featurizer is agnostic of the vocabulary, the vocabulary builder and featurizer can be separate classes, but in some cases the featurizer will depend on the vocabulary builder, and a single entity keeps that dependency explicit.
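For contrast, a minimal sketch of the first approach, with the builder and featurizer as separate classes (the class names below are hypothetical, not part of the proposal):

# Hypothetical sketch of approach 1: separate entities.
class VocabularyBuilder:

    def build(self, corpus):
        # learn the vocabulary from a corpus of SMILES strings
        self.vocab = {}


class VocabularyFeaturizer:

    def __init__(self, vocab_builder: VocabularyBuilder):
        # the featurizer can only be constructed after the vocabulary
        # has been built, so the two-phase dependency is explicit
        self.vocab = vocab_builder.vocab

The single-entity design avoids threading the vocabulary through a second constructor, at the cost of one class carrying both responsibilities.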
End User API
# Generating Vocabulary
from rdkit import Chem

from deepchem.featurizer.vocab_builder import NewVocabularyBuilder

files = ['dataset1.txt', 'dataset2.txt']  # paths to the raw datasets

vocab_builder = NewVocabularyBuilder()
vocab_builder.build_vocab_from_file(files)
# can be a json or a pickle depending on the vocabulary
vocab_builder.save_vocab('file_name')

# tokenizing a new data point
mol = Chem.MolFromSmiles('CC(=O)C')  # or other suitable args for the vocabulary
tokens = vocab_builder(mol)
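Loading a previously saved vocabulary would mirror the save step; assuming load_vocab is a static constructor that returns a ready-to-use builder, as in the pseudocode below:

# Loading a saved vocabulary
vocab_builder = NewVocabularyBuilder.load_vocab('file_name')
tokens = vocab_builder(mol)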
Pseudocode
from typing import List

import pandas as pd


class NewVocabularyBuilder:

    def __init__(self):
        pass

    def build_vocab_from_file(self, filenames: List[str]):
        sentences = []
        for filename in filenames:
            with open(filename, 'r') as f:
                sentences.extend(f.readlines())
        self._build_vocab(sentences)

    def build_vocab_from_csv(self, filename: str, column: str):
        df = pd.read_csv(filename)
        sentences = df[column]
        self._build_vocab(sentences)

    def _build_vocab(self, smiles):
        # take SMILES strings and implement the algorithm for building
        # the vocabulary, tokens or embeddings
        pass

    def save_vocab(self, filename):
        # serialize the vocabulary (json or pickle, depending on the vocabulary)
        pass

    def extend_vocab(self, filename):
        # grow an existing vocabulary with new data
        pass

    @staticmethod
    def load_vocab(path):
        # reconstruct a builder from a previously saved vocabulary
        pass

    def _featurize(self, *args):
        # tokenize a single data point using the built vocabulary
        pass

    def __call__(self, *args):
        return self._featurize(*args)
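As a rough illustration of how a subclass could fill in these hooks, here is a minimal character-level SMILES vocabulary. The CharVocabularyBuilder name and the JSON serialization are assumptions made for this sketch, not part of the proposal:

import json

class CharVocabularyBuilder(NewVocabularyBuilder):
    # Hypothetical sketch: one token per distinct character.

    def _build_vocab(self, smiles):
        # map each distinct character to an integer id
        chars = sorted({c for s in smiles for c in s})
        self.vocab = {c: i for i, c in enumerate(chars)}

    def save_vocab(self, filename):
        # a character vocabulary serializes naturally to json
        with open(filename, 'w') as f:
            json.dump(self.vocab, f)

    @staticmethod
    def load_vocab(path):
        builder = CharVocabularyBuilder()
        with open(path, 'r') as f:
            builder.vocab = json.load(f)
        return builder

    def _featurize(self, smiles_string):
        # tokenize one SMILES string into a list of integer ids
        return [self.vocab[c] for c in smiles_string]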