Issues with precision score, and model fitting

Dear all,

I am new to deepchem, and I am eager to learn and use it. I am running deepchem 2.3.0. I had some issues getting deepchem 2.5.0 installed, due to incompatibilities while pip installing tensorflow 2.4.0. Anyway, I am trying to build and run binary classification models using different model/featurizer/transformer combinations. The models take SMILES strings and predict the “TARGET” output (annotated as either 1 or 0). Here are some questions/issues that I hope you’ll help me address. The code and dataset are pasted below.

  1. I noticed that the prediction with gc_model_1 returns a 14 x 2 x 1 array, with two predicted probabilities for each record:
    a) What is the order of the probabilities? How is it determined? Can I assume that the first one is for the negative label (0), and the second is for the positive one (1)? Do they correspond to the ordered list of labels?
    b) How can I get the class label instead of the probabilities? The predictions array would then be of shape 14 x 1.

  2. The computation of the precision score fails with the following error, while that of the recall score works fine:
    “UserWarning: Error calculating metric precision_score: Classification metrics can’t handle a mix of binary and continuous-multioutput targets”

    a) This is very likely due to the fact that the prediction returns the probabilities instead of classes. But it’s curious that the recall score can be computed. What is the difference in their respective implementations?

  3. When training the message passing neural network gc_model_2 (whether with transformers or not), I get the following message: “ValueError: cannot reshape array of size 6776 into shape (484,8)”
    a) I suppose the 6776-size array represents the features. What does the 484x8 shape represent?
    b) How can I solve this issue?

P.S.: I have also noticed that many examples do not work with older versions. It’s not always clear which features were implemented when. I wish this information were available in the documentation.

Your help would be greatly appreciated.

Best,
Yannick

### Code

import deepchem as dc
from deepchem.feat import ConvMolFeaturizer, CircularFingerprint, WeaveFeaturizer
from deepchem.trans import NormalizationTransformer, BalancingTransformer
from deepchem.models import GraphConvModel, MPNNModel
import pandas as pd
import numpy as np

## Data set
# ID,SMILES,TARGET
# MOL1,CC=C(O)CCC1C=CC1,1
# MOL2,COCCC=C(O)C=CC1C=CC1,1
# MOL3,C1=CC=C2C(=C1)C(=O)NS2(=O)=O,0
# MOL4,CC=C(O)C=CC1C=CC1,0
# MOL5,CC(C(O)=O)CCCC#C(O)CCC1C=CC1,1
# MOL6,CC=C(O)CCCCCCC1C=CC1,0
# MOL7,C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2O)CCC4=C3C=CC(=C4)OC(=O)C5=CC=CC=C5,0
# MOL8,C1CN(CC=C1N2C3=CC=CC=C3NC2=O)CCCC(=O)C4=CC=C(C=C4)F,1
# MOL9,C1=CC(=CC=C1C2=CC(=O)C3=C(C=C(C=C3O2)O)O)O,1
# MOL10,CN(CC1=CN=C2C(=N1)C(=NC(=N2)N)N)C3=CC=C(C=C3)C(=O)N[C@@H](CCC(=O)O)C(=O)O,1
# MOL11,C1=CC(=C(C(=C1)F)C(=O)NC(=O)NC2=CC(=C(C=C2Cl)OC(C(C(F)(F)F)F)(F)F)Cl)F,1
# MOL12,C1=CC(=C(C(=C1)F)C(=O)NC(=O)NC2=CC=C(C=C2)Cl)F,1
# MOL13,C1=CC=C(C=C1)C(=O)OOC(=O)C2=CC=CC=C2,0
# MOL14,COC(=O)NC1=NC2=C(N1)C=C(C=C2)C(=O)C3=CC=CC=C3,0
# MOL15,COC1CN(CCC1NC(=O)C2=CC(=C(C=C2OC)N)Cl)CCCOC3=CC=C(C=C3)F,1

if __name__ == '__main__':
    inputfile       = "test-data.csv"
    smiles_field    = "SMILES"
    id_field        = "ID"
    metrics         = [dc.metrics.Metric(dc.metrics.accuracy_score),
                       dc.metrics.Metric(dc.metrics.precision_score),
                       dc.metrics.Metric(dc.metrics.recall_score)]
    # Not used: dc.metrics.Metric(dc.metrics.precision_recall_fscore_support), dc.metrics.Metric(dc.metrics.balanced_accuracy_score)
    dc_feat         = ConvMolFeaturizer()
    dc_loader       = dc.data.CSVLoader(tasks=["TARGET"], smiles_field=smiles_field, id_field=id_field, featurizer=dc_feat)
    dc_transformer  = NormalizationTransformer(transform_w=True)
    dc_dataset      = dc_loader.featurize(inputfile)
    # Apply the transformer to the dataset (the method is `transform`, not `transformer`)
    dc_dataset      = dc_transformer.transform(dc_dataset)
    # print(dir(dc_dataset.y))
    
    gc_model_1      = GraphConvModel(n_tasks=1)
    gc_model_1.fit(dc_dataset, nb_epoch=10)
    scores          = gc_model_1.evaluate(dc_dataset, metrics, [dc_transformer])
    predictions     = gc_model_1.predict(dc_dataset, transformers=[dc_transformer])
    print("scores: {}".format(scores))
    print("Predictions ({}):\n=========================\n".format(predictions.shape))
    
    for molecule, prediction in zip(dc_dataset.ids, predictions):
        print(molecule, prediction)
        
    weave_feat          = WeaveFeaturizer()
    dc_loader_2         = dc.data.CSVLoader(tasks=["TARGET"], smiles_field=smiles_field, id_field=id_field, featurizer=weave_feat)
    dc_dataset_2        = dc_loader_2.featurize(inputfile)
    # dc_transformer_2    = BalancingTransformer(dataset=dc_dataset_2, transform_w=True)
    # dc_dataset_2        = dc_transformer_2.transform(dc_dataset_2)
    gc_model_2          = MPNNModel(n_tasks=1)
    gc_model_2.fit(dc_dataset_2, nb_epoch=10)
    scores_2            = gc_model_2.evaluate(dc_dataset_2, metrics, transformers=[])
    predictions_2       = gc_model_2.predict(dc_dataset_2, transformers=[])
    print("scores (ECFP 2): {}".format(scores_2))
    print("Predictions (ECFP 2) ({}):\n=========================\n".format(predictions_2.shape))
    
    for molecule, prediction in zip(dc_dataset_2.ids, predictions_2):
        print(molecule, prediction)

Sorry for the delayed response! Will try to answer some of your questions below

  1. a) I believe the order is class 0, then class 1. That is, your assumption that the first is negative and the second is positive is correct. For your second question, the predictions are returned in the same order as the input.
     b) To get the class label, you should threshold manually. A simple threshold function that assigns the class with the maximum probability should suffice.
  2. a) Both of these use the scikit-learn precision/recall implementation. Unfortunately we’re beholden to the same metrics conventions as scikit-learn, which aren’t entirely consistent.
  3. a) Your `gc_model_2` is using an MPNN. You need to make sure that you’re applying the correct featurizer for the MPNN model (check https://deepchem.readthedocs.io/en/latest/api_reference/models.html for the model cheatsheet). I think you’re using an incorrect featurizer here.
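
The numbers in the reshape error from question 3 can be checked with a little arithmetic. The sketch below is only one possible reading, not a confirmed diagnosis: it assumes the 484 rows are atom pairs (22 x 22) and that the second dimension is the number of pair features the model expects, so the 6776-element array would carry 14 features per pair rather than 8. It uses plain NumPy with a placeholder array of zeros; no DeepChem calls are involved.

```python
import math

import numpy as np

flat_size = 6776          # array size reported in the error message
n_rows, n_cols = 484, 8   # target shape reported in the error message

# 484 is a perfect square, which would fit "one row per ordered atom pair"
# for a 22-atom molecule (an assumption, the error does not say this).
n_atoms = math.isqrt(n_rows)
print(n_atoms, n_atoms * n_atoms == n_rows)

# The flat array divides evenly into 484 rows only with 14 columns.
cols_per_row = flat_size // n_rows
print(cols_per_row, flat_size % n_rows)

# Placeholder data with the same number of elements as in the error.
pair_features = np.zeros(flat_size)

# Reshaping to the width the data actually has works...
ok = pair_features.reshape(n_rows, cols_per_row)
print(ok.shape)

# ...while the width from the error message reproduces the failure.
try:
    pair_features.reshape(n_rows, n_cols)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print(err)
```

If this reading is right, the featurizer output and the pair-feature width the model expects disagree, and one of the two would need to change so that they match. As far as I know, `MPNNModel` exposes `n_atom_feat` and `n_pair_feat` arguments for this, but treat that as an assumption and check it against the documentation for your installed version.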

Answer for the P.S.: As a small development team, we can unfortunately only maintain one version at a time! So while we will fix errors for 2.5.0, we can’t do so for 2.3.0. Can you try updating your version to 2.5.0 locally, or try using Google Colab instead?

accuracy_score is a classification metric; you cannot use it for a regression problem. You get the error because regression models do not produce binary outcomes but continuous (float) numbers (as all regression models do). So when scikit-learn attempts to calculate the accuracy by comparing a binary number (true label) with a float (predicted value), it not unexpectedly gives an error. The error message itself clearly hints at this cause.

The sklearn.metrics.accuracy_score(y_true, y_pred) method defines y_pred as:

y_pred : 1d array-like, or label indicator array / sparse matrix. Predicted labels, as returned by a classifier.

Which means y_pred has to be an array of 1’s or 0’s (predicted labels). They should not be probabilities.
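
To make that concrete, here is a minimal sketch of turning per-class probabilities into hard labels before scoring. The probability array is invented for illustration, and the position of the class axis is an assumption: check `predictions.shape` on your own output, since the original post reports 14 x 2 x 1 and the class dimension may sit on a different axis. `probs_to_labels` and `precision_recall` are hypothetical helper names, not DeepChem or scikit-learn functions.

```python
import numpy as np


def probs_to_labels(probs, class_axis=-1):
    """Pick the most probable class along `class_axis` and flatten to 1-D."""
    return np.argmax(np.asarray(probs), axis=class_axis).ravel()


def precision_recall(y_true, y_pred):
    """Binary precision and recall for 0/1 label arrays (positive class = 1)."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up probabilities with shape (n_samples, n_tasks, n_classes) = (4, 1, 2).
# Column 0 is assumed to be P(class 0) and column 1 P(class 1).
probs = np.array([
    [[0.9, 0.1]],
    [[0.2, 0.8]],
    [[0.4, 0.6]],
    [[0.7, 0.3]],
])
y_true = np.array([0, 1, 0, 1])

y_pred = probs_to_labels(probs)  # 1-D array of 0s and 1s
precision, recall = precision_recall(y_true, y_pred)
print(y_pred, precision, recall)
```

The resulting `y_pred` is the kind of 0/1 label array that `sklearn.metrics.precision_score(y_true, y_pred)` and `accuracy_score` expect. For a 14 x 2 x 1 array, `probs_to_labels(predictions, class_axis=1)` would be the equivalent call, assuming the second axis holds the two classes.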