I have an issue with ChemBERTa's tokenizer: either I'm not using it correctly, or it isn't working as expected.
The example below demonstrates that although '[O+]' is in the vocabulary, it is encoded as 'O'. (The input is a fragment of a SMILES string and may not be a valid molecule, but the same issue occurs with the full SMILES.)
from transformers import AutoTokenizer

model_checkpoint = 'DeepChem/ChemBERTa-77M-MTR'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

test_smiles = 'CCC1=[O+]'
print(tokenizer.vocab['[O+]'])
print(tokenizer.tokenize(test_smiles))
The second print gives:

['C', 'C', 'C', '1', '=', 'O']