I have an issue with ChemBERTa's tokenizer: either I'm not using it properly or it's not working correctly.
The example below shows that, although '[O+]' is in the vocabulary, it is tokenized as plain 'O'. (The test string is a fragment of a SMILES and may not be a valid molecule, but the same issue occurs with the full SMILES.)
from transformers import AutoTokenizer

model_checkpoint = 'DeepChem/ChemBERTa-77M-MTR'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

test_smiles = 'CCC1=[O+]'
# '[O+]' has its own entry in the vocabulary...
print(tokenizer.vocab['[O+]'])
# ...yet it never appears when tokenizing the fragment
print(tokenizer.tokenize(test_smiles))
This outputs:
73
['C', 'C', 'C', '1', '=', 'O']
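In case it helps narrow things down, here is a small diagnostic sketch I would run next. It only uses standard transformers methods (nothing ChemBERTa-specific), and the idea that the normalizer/pre-tokenizer might be stripping the brackets is just my guess, not something I have confirmed:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('DeepChem/ChemBERTa-77M-MTR')

# confirm the bracket atom really is a single vocabulary entry (id 73 above)
print('[O+]' in tokenizer.get_vocab())

# tokenize the bracket atom on its own to see whether it survives by itself
print(tokenizer.tokenize('[O+]'))

# round-trip through ids to rule out a display/decoding issue
ids = tokenizer.encode('CCC1=[O+]', add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(ids))

# if this loads as a fast tokenizer, the normalizer/pre-tokenizer may be
# splitting or dropping '[', '+' and ']' before the vocab lookup happens
if tokenizer.is_fast:
    print(tokenizer.backend_tokenizer.normalizer)
    print(tokenizer.backend_tokenizer.pre_tokenizer)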