Data Imbalance in Cheminformatics problems

Raman1121 · June 21, 2019, 7:11am

Hello everyone
I am working on a project exploring the application of Capsule Networks to virtual high-throughput screening. The datasets I have used so far are CDK2, CHK1, and Urokinase. From each dataset, we obtained RECON and MOE descriptors, which we feed to the neural network separately.

It is a binary classification problem. However, the two classes, actives and inactives, differ greatly in size. For instance, the Urokinase dataset has just 72 actives compared to more than 200,000 inactives.

I have tried oversampling techniques such as SMOTE. SMOTE generates a large number of synthetic minority-class samples that closely resemble the existing ones, until both classes have the same number of instances. However, this is not very desirable in cheminformatics, since these synthetic samples don't correspond to actual molecules and have no real chemical meaning. I have also tried undersampling the majority class, but so few samples remain after undersampling that it becomes really difficult for the neural network to learn. Finally, I have tried SMOTE-TOMEK, a combination of oversampling and undersampling, but the results did not improve.

I would like to ask the community which approach I should try next, keeping the points above in mind. I would be really grateful for any kind of help.

Thank You

peastman · June 21, 2019, 4:49pm

What about doing this with weights? When computing the loss function, weight the actives much more highly than the inactives.
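
For readers following along, the weighting idea can be sketched roughly as below. This is a minimal pure-Python illustration, not code from the thread: the function name `weighted_bce`, the toy labels and probabilities, and the weight of 3.0 are all assumptions chosen for the example, and in practice the deep learning framework's own weighted loss would be used instead.

```python
import math

def weighted_bce(y_true, p_pred, w_active, w_inactive, eps=1e-12):
    """Weighted binary cross-entropy, normalized by the total weight.

    y_true: 0/1 labels (1 = active), p_pred: predicted P(active).
    w_active / w_inactive are hypothetical per-class weights.
    """
    total_loss = 0.0
    total_weight = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        w = w_active if y == 1 else w_inactive
        total_loss += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total_weight += w
    return total_loss / total_weight

# Toy batch: one poorly predicted active among three well-predicted inactives.
labels = [1, 0, 0, 0]
probs = [0.2, 0.1, 0.1, 0.1]

unweighted = weighted_bce(labels, probs, w_active=1.0, w_inactive=1.0)
weighted = weighted_bce(labels, probs, w_active=3.0, w_inactive=1.0)
```

With the active weighted more heavily, the misclassified active dominates the loss, so the optimizer is pushed harder to get the rare class right.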

Raman1121 · June 21, 2019, 6:25pm

Thank you for replying. I have used the class-weights approach, where each class's weight was set to 1/(number of samples in that class). Even this didn't give me good results.
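
To make the 1/(number of samples) scheme concrete, here is a small sketch of the arithmetic. The helper name `inverse_frequency_weights` is hypothetical, and the inactive count of 200,000 is a round stand-in for the "200,000+" figure quoted earlier in the thread.

```python
def inverse_frequency_weights(class_counts):
    """Return weight = 1 / n for each class, given a {label: count} dict."""
    return {label: 1.0 / n for label, n in class_counts.items()}

# 72 actives is from the thread; 200_000 inactives is an assumed round number.
counts = {"active": 72, "inactive": 200_000}
weights = inverse_frequency_weights(counts)

# How much more a single active counts than a single inactive.
ratio = weights["active"] / weights["inactive"]

# Total weight carried by each class: count * weight.
class_totals = {label: counts[label] * weights[label] for label in counts}
```

Under this scheme each class carries the same total weight, so a single active counts a few thousand times more than a single inactive. The weights themselves are also very small in absolute terms, which scales the whole loss down unless they are renormalized.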