Hi all, and huge thanks to the authors of deepchem!
I can't understand why the outputs of the variants of the code below are different. Here is a basic example of a GCN fit on the Delaney (ESOL) dataset:
!pip install deepchem
import deepchem as dc
import tensorflow as tf
import pandas as pd
from deepchem.feat import ConvMolFeaturizer
from sklearn.metrics import mean_squared_error
# Delaney (ESOL) loader from dc.molnet
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv', split='random')
train_dataset, valid_dataset, test_dataset = datasets
model = dc.models.GraphConvModel(n_tasks=1, batch_size=128, mode='regression', dropout=0.2)
model.fit(train_dataset, nb_epoch=200)
metric = dc.metrics.Metric(dc.metrics.rms_score)
train_score = model.evaluate(train_dataset, [metric], transformers)
valid_score = model.evaluate(valid_dataset, [metric], transformers)
test_score = model.evaluate(test_dataset, [metric], transformers)
print('Training set score:', train_score)
print('Validation set score:', valid_score)
print('Test set score:', test_score)
The results are close to the original paper, except for the training set (maybe more epochs are needed?):
Training set score: {'rms_score': 0.9388395539670715}
Validation set score: {'rms_score': 1.049190558227201}
Test set score: {'rms_score': 1.19915950432068}
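If more epochs are indeed the issue, one way to check would be to monitor the validation score during training with a ValidationCallback, roughly like this (just a sketch; I'm assuming ValidationCallback(dataset, interval, metrics, transformers=...) and the callbacks argument of fit work this way in the current DeepChem version):
# sketch: report the validation RMS every 100 training steps
vc = dc.models.ValidationCallback(valid_dataset, 100, [metric], transformers=transformers)
model.fit(train_dataset, nb_epoch=200, callbacks=[vc])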
Now I use code that should do exactly the same thing:
!mkdir data
!wget https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/delaney-processed.csv -O data/delaney.csv
target_df = pd.read_csv('data/delaney.csv')
smiles = target_df['smiles'].values
smiles = smiles.astype(str)
target_values = target_df['measured log solubility in mols per litre'].values
target_values=target_values.astype('float64')
featurizer=ConvMolFeaturizer()
X = featurizer.featurize(smiles)
dataset=dc.data.DiskDataset.from_numpy(X=X, y=target_values, ids=smiles, tasks=['log solubility (mol/L)'])
splitter = dc.splits.RandomSplitter()
train_dataset, valid_dataset, test_dataset = splitter.train_valid_test_split(dataset, frac_train=0.8, frac_valid = 0.1, seed=0)
transformers = [
    dc.trans.NormalizationTransformer(transform_y=True, dataset=train_dataset)
]
for dataset in [train_dataset, valid_dataset, test_dataset]:
    for transformer in transformers:
        dataset = transformer.transform(dataset)
model = dc.models.GraphConvModel(n_tasks=1, batch_size=128, mode='regression', dropout=0.2)
model.fit(train_dataset, nb_epoch=200)
metric = dc.metrics.Metric(dc.metrics.rms_score)
train_score = model.evaluate(train_dataset, [metric], transformers)
valid_score = model.evaluate(valid_dataset, [metric], transformers)
test_score = model.evaluate(test_dataset, [metric], transformers)
print('Training set score:', train_score)
print('Validation set score:', valid_score)
print('Test set score:', test_score)
And the results are different:
Training set score: {'rms_score': 1.973609741740122}
Validation set score: {'rms_score': 2.1925210282279806}
Test set score: {'rms_score': 2.137958534792295}
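One sanity check that might help narrow this down: after a NormalizationTransformer with transform_y=True, the labels should have mean ~0 and std ~1, so printing the label statistics of the datasets actually passed to fit/evaluate should show whether the transform really reached them (a quick sketch, nothing assumed beyond dataset.y):
import numpy as np
# labels should be roughly N(0, 1) if the normalization was applied to these datasets
for name, ds in [('train', train_dataset), ('valid', valid_dataset), ('test', test_dataset)]:
    print(name, 'y mean:', np.mean(ds.y), 'y std:', np.std(ds.y))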
But when I evaluated the metrics another way, the results look roughly the same as in the first variant:
pred_train=model.predict(train_dataset)
pred_test=model.predict(test_dataset)
pred_val=model.predict(valid_dataset)
model_test_mse = mean_squared_error(test_dataset.y, pred_test, squared=True)
model_train_mse = mean_squared_error(train_dataset.y, pred_train, squared=True)
model_val_mse = mean_squared_error(valid_dataset.y, pred_val, squared=True)
print('Training set score:', model_train_mse ** 0.5)
print('Validation set score:', model_val_mse ** 0.5)
print('Test set score:', model_test_mse ** 0.5)
gives:
Training set score: 0.9395447458414238
Validation set score: 1.0437583246097768
Test set score: 1.0177836506555855
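The only difference I can see between the two checks is that evaluate is given the transformers (and, as far as I understand, un-does the y-normalization), while the sklearn comparison above uses the raw dataset.y and predictions without any transformers. To compare in the same units that evaluate reports, I guess something like this should work (a sketch; I'm assuming predict(dataset, transformers) and dc.trans.undo_transforms behave as I expect):
# sketch: un-transform both predictions and labels before computing the RMSE
pred_test_unnorm = model.predict(test_dataset, transformers)
y_test_unnorm = dc.trans.undo_transforms(test_dataset.y, transformers)
print('Test set score (un-transformed):', mean_squared_error(y_test_unnorm, pred_test_unnorm) ** 0.5)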
So, the question is: why is the output of the second code different?
I ran it all in Colab; here is a link to the notebook:
https://colab.research.google.com/drive/11C3lk9QPZ5pNtmIx1NNnhWCe4CnNdxcR?usp=sharing