Proposal to add a vocabulary builder

Requirement

Some machine learning models require tokens generated from SMILES strings or from molecules based on a vocabulary. This proposal outlines a utility for building vocabularies from datasets in DeepChem. The problems the utility should solve are:

  • Build vocabulary from a dataset
  • Save and load the vocabulary
  • Tokenize data points using the vocabulary

A tokenizer is a special kind of featurizer. The vocabulary is what sets a tokenizer apart from an ordinary featurizer: the vocabulary must be built before featurization can happen, for use cases like model training. Once the vocabulary has been built, a tokenizer can be used like any other featurizer.

There are two approaches to the design:

  1. Model the vocabulary builder and featurizer components as separate entities
  2. Model the vocabulary builder and featurizer as a single entity

I consider the latter approach in this design because the featurizer depends on the vocabulary generated by the vocabulary builder. If a featurizer is agnostic of the vocabulary, then the vocabulary builder and featurizer can be separate classes, but in some cases a featurizer will depend on the vocabulary builder.

End User API

# Generating Vocabulary
from deepchem.feat.vocab_builder import NewVocabularyBuilder
vocab_builder = NewVocabularyBuilder()
vocab_builder.build_vocab_from_file(files)

# can be a json or a pickle depending on the vocabulary
vocab_builder.save_vocab('file_name')

# tokenizing a new data point
from rdkit import Chem
mol = Chem.MolFromSmiles('CC(=O)O') # or other suitable args for the vocabulary
tokens = vocab_builder(mol)
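
To cover the load half of the save/load requirement, usage could look like this (a sketch, assuming load_vocab reconstructs the builder from the saved file; the exact return type is part of the design):

# loading a previously saved vocabulary
vocab_builder = NewVocabularyBuilder.load_vocab('file_name')
tokens = vocab_builder(mol)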

Pseudocode

from typing import List

import pandas as pd


class NewVocabularyBuilder:

    def __init__(self):
        pass

    def build_vocab_from_file(self, filenames: List[str]):
        sentences = []
        for filename in filenames:
            with open(filename, 'r') as f:
                sentences.extend(f.readlines())
        self._build_vocab(sentences)

    def build_vocab_from_csv(self, filename: str, column: str):
        df = pd.read_csv(filename)
        sentences = df[column]
        self._build_vocab(sentences)

    def _build_vocab(self, sentences):
        # consume the strings (e.g. SMILES) and implement the algorithm
        # for building the vocabulary, tokens or embeddings
        pass

    def save_vocab(self, filename):
        pass

    def extend_vocab(self, filename):
        pass

    @staticmethod
    def load_vocab(path):
        pass

    def _featurize(self, *args):
        pass

    def __call__(self, *args):
        return self._featurize(*args)

Is the idea that this would be an abstract class, and subclasses would implement particular methods of tokenizing strings? What methods do you have in mind to implement?

It doesn’t need to be limited to just SMILES. Any string input could be processed in the same way (SELFIES, protein sequences, DNA sequences, etc.).

Yes, the idea is an abstract class whose subclasses implement the particular tokenization methods, for any kind of string.
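
For instance, a hypothetical character-level subclass of the NewVocabularyBuilder sketched above (the class name and method bodies are illustrative only) would work identically for SMILES, SELFIES, or protein sequences:

class CharVocabularyBuilder(NewVocabularyBuilder):

    def _build_vocab(self, sentences):
        # one token per unique character seen in the dataset
        chars = sorted({c for sentence in sentences for c in sentence.strip()})
        self.token_to_idx = {c: i for i, c in enumerate(chars)}

    def _featurize(self, sequence):
        # map each character of the input string to its vocabulary index
        return [self.token_to_idx[c] for c in sequence]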

Currently, there is one particular method in mind, which is used in the paper Self-Supervised Graph Transformer on Large-Scale Molecular Data (GROVER). The vocabulary there is used for a self-supervision task, and it is built by representing the contextual properties of an atom’s neighborhood as a string. For the tokenization part, given the vocabulary, a molecule, and an atom, the tokenizer should return the id (an index) of the atom’s contextual properties in the vocabulary.
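
A rough sketch of that idea (the string format below is simplified for illustration and does not match the paper’s exact context encoding):

from collections import Counter
from rdkit import Chem

def atom_context(mol, atom_idx):
    # encode an atom's neighborhood as a string of
    # (neighbor symbol, bond type) counts
    atom = mol.GetAtomWithIdx(atom_idx)
    counts = Counter()
    for bond in atom.GetBonds():
        nbr = bond.GetOtherAtom(atom)
        counts[f'{nbr.GetSymbol()}-{bond.GetBondType()}'] += 1
    context = '_'.join(f'{key}{n}' for key, n in sorted(counts.items()))
    return f'{atom.GetSymbol()}_{context}'

# building the vocabulary: assign an index to every unique context string
contexts = set()
for s in ['CC(=O)O', 'CCO']:
    mol = Chem.MolFromSmiles(s)
    for i in range(mol.GetNumAtoms()):
        contexts.add(atom_context(mol, i))
token_to_idx = {c: i for i, c in enumerate(sorted(contexts))}

# tokenization: given a molecule and an atom, return the index of the
# atom's contextual property in the vocabulary; for the carbonyl carbon
# of acetic acid the key is 'C_C-SINGLE1_O-DOUBLE1_O-SINGLE1'
mol = Chem.MolFromSmiles('CC(=O)O')
token = token_to_idx[atom_context(mol, 1)]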

We would like to have interoperability with HuggingFace tokenizers.
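
For example, a has-a wrapper around a pretrained HuggingFace tokenizer could look like this (a sketch only; the wrapper class name and the checkpoint are illustrative, not the final API):

from transformers import AutoTokenizer
from deepchem.feat.base_classes import Featurizer

class HuggingFaceFeaturizer(Featurizer):

    def __init__(self, tokenizer):
        # has-a relationship: wrap an existing HuggingFace tokenizer
        self.tokenizer = tokenizer

    def _featurize(self, datapoint):
        return self.tokenizer(datapoint)['input_ids']

tokenizer = AutoTokenizer.from_pretrained('seyonec/ChemBERTa-zinc-base-v1')
featurizer = HuggingFaceFeaturizer(tokenizer)
ids = featurizer.featurize(['CC(=O)O'])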

Here is another architecture (based on idea 1, with the vocabulary builder and featurizer as separate entities):

class BaseVocabularyBuilder:

    def __init__(self):
         # all subclasses should store their token mappings in these variables
         self.token_to_idx = {}
         self.idx_to_token = {}

    def build_vocab(self, filename: str, column: str):
        # Or other suitable input format, instead of filename and column
        pass
    
    def save_vocab(self, filename: str):
        import json
        vocab = {'token_to_idx': self.token_to_idx, 'idx_to_token': self.idx_to_token}
        with open(filename, 'w') as f:
            json.dump(vocab, f)
        
    @classmethod
    def load_vocab(cls, filename: str):
        import json
        with open(filename, 'r') as f:
            data = json.load(f)
        # the file must contain `token_to_idx` and `idx_to_token` keys
        return data['token_to_idx'], data['idx_to_token']

class NewVocabularyBuilder(BaseVocabularyBuilder):
     # implements the abstract methods of BaseVocabularyBuilder
     pass

from deepchem.feat.base_classes import Featurizer

class NewTokenizer(Featurizer):
    
    def __init__(self, filename: str):
        self.token_to_idx, _ = NewVocabularyBuilder.load_vocab(filename)

    def _featurize(self, datapoint):
        # returns an encoding
        return self.token_to_idx[datapoint]

Let’s call this one design 1 (the separate approach) and the earlier posted one design 2 (the combined approach). Design 1 looks simpler and cleaner to me, conforming to existing DeepChem standards for featurizers, but it involves two steps: creating a vocabulary builder and a separate class to perform the tokenization. Design 2 combines them, but I don’t think it is very clean.
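
For comparison, the two-step flow of design 1 would look like this (a sketch reusing the names from the snippets above; the file and column names are placeholders):

# step 1: build and persist the vocabulary
vocab_builder = NewVocabularyBuilder()
vocab_builder.build_vocab('dataset.csv', column='smiles')
vocab_builder.save_vocab('vocab.json')

# step 2: tokenize with a featurizer that loads the vocabulary
tokenizer = NewTokenizer('vocab.json')
tokens = tokenizer.featurize(['CC(=O)O'])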

A draft pull request with the above features can be found here: https://github.com/deepchem/deepchem/pull/3259

The changes in the pull request are:

  • A base class for vocabulary builders (VocabularyBuilder)
  • A HuggingFace vocabulary builder - a utility to use vocabulary builders from HuggingFace
  • A featurizer method to use existing HuggingFace tokenizers via a has-a relationship
  • A GROVER vocabulary builder and featurizer