ChemBERTa tokenizer

I have an issue with ChemBERTa’s tokenizer. Either I’m not using it properly or it’s not working correctly.
The example below shows that although '[O+]' is in the vocabulary, it is tokenized as 'O'. (This is a fragment of a SMILES string; it might not be a valid molecule, but the same issue happens with the whole SMILES.)

from transformers import AutoTokenizer
model_checkpoint = 'DeepChem/ChemBERTa-77M-MTR'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
test_smiles = 'CCC1=[O+]'
print(tokenizer.tokenize(test_smiles))  # list the tokens produced for the fragment

This outputs:

['C', 'C', 'C', '1', '=', 'O']

I believe the ChemBERTa tokenizer may have some collisions in the encoding (it’s not 1-1).
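
One way to test the collision hypothesis is to invert the token-to-id mapping and look for ids shared by more than one token. The sketch below is only an illustration: `find_id_collisions` is a hypothetical helper, and the two toy dictionaries use made-up ids in place of the real ChemBERTa vocabulary.

```python
from collections import defaultdict


def find_id_collisions(vocab):
    """Return {token_id: [tokens]} for every id that more than one token maps to."""
    by_id = defaultdict(list)
    for token, token_id in vocab.items():
        by_id[token_id].append(token)
    return {i: sorted(toks) for i, toks in by_id.items() if len(toks) > 1}


# Toy vocabularies with made-up ids (assumptions, not the real ChemBERTa values).
clean_vocab = {"C": 0, "O": 1, "[O+]": 2, "=": 3}
colliding_vocab = {"C": 0, "O": 1, "[O+]": 1, "=": 3}

print(find_id_collisions(clean_vocab))      # {}
print(find_id_collisions(colliding_vocab))  # {1: ['O', '[O+]']}
```

With the real tokenizer, the same helper could be called on `tokenizer.get_vocab()`; an empty result would mean the vocabulary itself is one-to-one and the problem lies elsewhere.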

@seyonec Would you be able to confirm?

If you mean that 'O' and '[O+]' are encoded to the same token index in this example, that’s not the case. Going by tokenizer.vocab (I don’t have it open right now), both strings are keys in that dictionary and point to different values.
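
The distinction here is between a token being present in the vocabulary and the tokenizer actually emitting it for a given input. A rough sketch of how to check both at once is below. `compare_vocab_and_output` is a hypothetical helper, and `toy_tokenize` and `toy_vocab` are made-up stand-ins that merely reproduce the reported symptom; they say nothing about how the real tokenizer works internally.

```python
def compare_vocab_and_output(tokenize, vocab, text, expected_token):
    """Report whether expected_token is in the vocab and whether tokenize(text) emits it."""
    tokens = tokenize(text)
    return {
        "in_vocab": expected_token in vocab,
        "emitted": expected_token in tokens,
        "tokens": tokens,
    }


def toy_tokenize(text):
    # Made-up stand-in: splits into characters and drops brackets and charge signs.
    return [ch for ch in text if ch not in "[]+-"]


# Made-up ids; both 'O' and '[O+]' are present, with different values.
toy_vocab = {"C": 0, "1": 1, "=": 2, "O": 3, "[O+]": 4}

report = compare_vocab_and_output(toy_tokenize, toy_vocab, "CCC1=[O+]", "[O+]")
print(report)
```

With the real tokenizer loaded as in the first post, passing `tokenizer.tokenize` and `tokenizer.get_vocab()` instead of the toy stand-ins should show whether '[O+]' is in the vocabulary yet never produced for this input.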

I have replied to this question on the HF Forum :hugs:.