Dear all,
I am new to DeepChem and eager to learn and use it. I am running deepchem 2.3.0; I had some issues getting deepchem 2.5.0 installed due to incompatibilities while pip-installing tensorflow 2.4.0. Anyway, I am trying to build and run binary classification models using different model/featurizer/transformer combinations. The model takes SMILES strings and predicts the "TARGET" output (annotated as either 1 or 0). Here are some questions/issues that I hope you can help me address. The code and dataset are pasted below.
- I noticed that the prediction with gc_model_1 returns a 14 x 2 x 1 array, with two predicted probabilities for each record:
  a) What is the order of the probabilities? How is it determined? Can I assume that the first one is for the negative label (0) and the second is for the positive one (1)? Do they correspond to the ordered list of labels?
  b) How can I get the class labels instead of the probabilities? The predictions array would then be of shape 14 x 1.
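In case it clarifies what I am after, here is how I would reduce the probabilities to hard labels with plain numpy, under the assumption (which I would like confirmed) that the last axis holds [P(label 0), P(label 1)] in that order; the `probs` values below are made up:

```python
import numpy as np

# Made-up predictions for 3 molecules, shaped (n_samples, n_tasks, n_classes),
# ASSUMING column 0 = P(label 0) and column 1 = P(label 1).
probs = np.array([[[0.8, 0.2]],
                  [[0.1, 0.9]],
                  [[0.6, 0.4]]])

labels = np.argmax(probs, axis=-1)  # collapse the class axis -> shape (3, 1)
print(labels.ravel().tolist())      # [0, 1, 0]
```

Is this the recommended way, or is there a built-in option on predict() that I am missing?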
- The computation of the precision score fails with the following error, while that of the recall score works fine:
  "UserWarning: Error calculating metric precision_score: Classification metrics can't handle a mix of binary and continuous-multioutput targets"
  a) This is very likely because the prediction returns probabilities instead of classes. But it is curious that the recall score can still be computed. What is the difference in their respective implementations?
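To illustrate my guess about the cause: if I binarize the probabilities myself before scoring, sklearn's precision_score runs without complaint (all values below are made up):

```python
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1]                    # made-up ground-truth labels
probs = [0.9, 0.4, 0.2, 0.7]             # made-up positive-class probabilities
y_pred = [int(p >= 0.5) for p in probs]  # threshold to hard labels first

# No "mix of binary and continuous" error once both inputs are binary:
print(precision_score(y_true, y_pred))   # 1.0 here (2 TP, 0 FP)
```

So is the difference simply that DeepChem pre-processes the predictions for recall_score but not for precision_score?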
- When training the message passing neural network gc_model_2 (whether with transformers or not), I get the following message: "ValueError: cannot reshape array of size 6776 into shape (484,8)"
  a) I suppose the 6776-element array represents the features. What does the 484 x 8 shape represent?
  b) How can I solve this issue?
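One observation, in case it helps with the diagnosis: 6776 divides evenly by 484, giving 14, which I suspect is the pair-feature size produced by WeaveFeaturizer, while 8 might be MPNNModel's default n_pair_feat. Both values are my guesses, not something I have verified in the source:

```python
# The error is: cannot reshape array of size 6776 into shape (484, 8).
# 6776 / 484 = 14, so the data looks like 484 rows of 14 values, not 8.
rows, expected_width = 484, 8
print(6776 // rows)            # 14 values per row in the actual features
print(rows * expected_width)   # 3872 values the model expected to consume
```

If that reading is right, would constructing the model as `MPNNModel(n_tasks=1, n_atom_feat=75, n_pair_feat=14)` be the intended fix? (The 75/14 values are assumptions on my part.)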
P.S.: I have also noticed that many examples do not work with older versions. It is not always clear which features were implemented in which release; I wish this information were available in the documentation.
Any help would be greatly appreciated.
Best,
Yannick
### Code
import deepchem as dc
from deepchem.feat import ConvMolFeaturizer, CircularFingerprint, WeaveFeaturizer
from deepchem.trans import NormalizationTransformer, BalancingTransformer
from deepchem.models import GraphConvModel, MPNNModel
import pandas as pd
import numpy as np
## Data set
# ID,SMILES,TARGET
# MOL1,CC=C(O)CCC1C=CC1,1
# MOL2,COCCC=C(O)C=CC1C=CC1,1
# MOL3,C1=CC=C2C(=C1)C(=O)NS2(=O)=O,0
# MOL4,CC=C(O)C=CC1C=CC1,0
# MOL5,CC(C(O)=O)CCCC#C(O)CCC1C=CC1,1
# MOL6,CC=C(O)CCCCCCC1C=CC1,0
# MOL7,C[C@]12CC[C@H]3[C@H]([C@@H]1CC[C@@H]2O)CCC4=C3C=CC(=C4)OC(=O)C5=CC=CC=C5,0
# MOL8,C1CN(CC=C1N2C3=CC=CC=C3NC2=O)CCCC(=O)C4=CC=C(C=C4)F,1
# MOL9,C1=CC(=CC=C1C2=CC(=O)C3=C(C=C(C=C3O2)O)O)O,1
# MOL10,CN(CC1=CN=C2C(=N1)C(=NC(=N2)N)N)C3=CC=C(C=C3)C(=O)N[C@@H](CCC(=O)O)C(=O)O,1
# MOL11,C1=CC(=C(C(=C1)F)C(=O)NC(=O)NC2=CC(=C(C=C2Cl)OC(C(C(F)(F)F)F)(F)F)Cl)F,1
# MOL12,C1=CC(=C(C(=C1)F)C(=O)NC(=O)NC2=CC=C(C=C2)Cl)F,1
# MOL13,C1=CC=C(C=C1)C(=O)OOC(=O)C2=CC=CC=C2,0
# MOL14,COC(=O)NC1=NC2=C(N1)C=C(C=C2)C(=O)C3=CC=CC=C3,0
# MOL15,COC1CN(CCC1NC(=O)C2=CC(=C(C=C2OC)N)Cl)CCCOC3=CC=C(C=C3)F,1
if __name__ == '__main__':
    inputfile = "test-data.csv"
    smiles_field = "SMILES"
    id_field = "ID"
    metrics = [dc.metrics.Metric(dc.metrics.accuracy_score),
               dc.metrics.Metric(dc.metrics.precision_score),
               dc.metrics.Metric(dc.metrics.recall_score)]
    # Also tried: dc.metrics.Metric(dc.metrics.precision_recall_fscore_support),
    #             dc.metrics.Metric(dc.metrics.balanced_accuracy_score)

    # Model 1: graph convolutions on ConvMol features
    dc_feat = ConvMolFeaturizer()
    dc_loader = dc.data.CSVLoader(tasks=["TARGET"], smiles_field=smiles_field,
                                  id_field=id_field, featurizer=dc_feat)
    dc_dataset = dc_loader.featurize(inputfile)
    # Fit the normalization statistics on the featurized dataset, then apply them
    dc_transformer = NormalizationTransformer(transform_w=True, dataset=dc_dataset)
    dc_dataset = dc_transformer.transform(dc_dataset)
    # print(dir(dc_dataset.y))
    gc_model_1 = GraphConvModel(n_tasks=1)
    gc_model_1.fit(dc_dataset, nb_epoch=10)
    scores = gc_model_1.evaluate(dc_dataset, metrics, [dc_transformer])
    predictions = gc_model_1.predict(dc_dataset, transformers=[dc_transformer])
    print("scores: {}".format(scores))
    print("Predictions ({}):\n=========================\n".format(predictions.shape))
    for molecule, prediction in zip(dc_dataset.ids, predictions):
        print(molecule, prediction)

    # Model 2: message passing neural network on Weave features
    weave_feat = WeaveFeaturizer()
    dc_loader_2 = dc.data.CSVLoader(tasks=["TARGET"], smiles_field=smiles_field,
                                    id_field=id_field, featurizer=weave_feat)
    dc_dataset_2 = dc_loader_2.featurize(inputfile)
    # dc_transformer_2 = BalancingTransformer(dataset=dc_dataset_2, transform_w=True)
    # dc_dataset_2 = dc_transformer_2.transform(dc_dataset_2)
    gc_model_2 = MPNNModel(n_tasks=1)
    gc_model_2.fit(dc_dataset_2, nb_epoch=10)  # this is the call that raises the ValueError
    scores_2 = gc_model_2.evaluate(dc_dataset_2, metrics, transformers=[])
    predictions_2 = gc_model_2.predict(dc_dataset_2, transformers=[])
    print("scores (MPNN): {}".format(scores_2))
    print("Predictions (MPNN) ({}):\n=========================\n".format(predictions_2.shape))
    for molecule, prediction in zip(dc_dataset_2.ids, predictions_2):
        print(molecule, prediction)