QM8: Error in dataset

There seems to be an error in the QM8 dataset: the features E1-PBE0, E2-PBE0, f1-PBE0, and f2-PBE0 appear twice. See the data copied below. This must surely be an error.

|smiles|E1-CC2|E2-CC2|f1-CC2|f2-CC2|E1-PBE0|E2-PBE0|f1-PBE0|f2-PBE0|E1-PBE0|E2-PBE0|f1-PBE0|f2-PBE0|E1-CAM|E2-CAM|f1-CAM|f2-CAM|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|[H]C([H])([H])[H]|0.43295186|0.43295958|0.24972825|0.24973648|0.43021753|0.43023558|0.181436|0.18150153|0.43021753|0.43023558|0.181436|0.18150153|0.40993137|0.40993872|0.1832|0.1832|
|[H]N([H])[H]|0.26521952|0.35008064|0.06701544|0.03004918|0.26838581|0.3491057|0.04076087|0.03164115|0.26838581|0.3491057|0.04076087|0.03164115|0.25385331|0.33448133|0.0575|0.0238|
|[H]O[H]|0.28653735|0.363579|0.03775532|0|0.29137731|0.36209064|0.01950336|0.00000001|0.29137731|0.36209064|0.01950336|0.00000001|0.27851946|0.35007407|0.0333|0|
|[H]C#C[H]|0.35862867|0.35862867|0|0|0.25632133|0.2684693|0|0|0.25632133|0.2684693|0|0|0.24487913|0.25505134|0|0|
|[H]C#N|0.31995762|0.33607406|0|0|0.29513891|0.3116573|0|0|0.29513891|0.3116573|0|0|0.2834255|0.29699335|0|0|

Take a look at https://deepchem.readthedocs.io/en/latest/api_reference/moleculenet.html#qm8-datasets. The first set of PBE0 columns is computed with the def2SVP basis set and the second with def2TZVP. The DFT functional is the same in both cases, but the basis sets differ.

Yes, that had confused me. When I first posted, I thought the error was that four columns had been copied and that the repeated column headings were the mistake.

As you pointed out, it is not.

However, please compare the data in your version of the QM8 dataset with the original. The column headings are supposed to be the same, but the values are not, and in your version of the QM8 dataset the values are identical.

See this example:

To be clear: in the Quantum machines data, for the compound at index 1 ([H]C([H])([H])[H]), the values in the two E1-PBE0 columns (6 and 10) are 0.43021753 and 0.40985825, respectively.

In the CSV file referenced in deepchem/molnet/load_function/qm8_datasets.py (specifically https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/qm8.csv), however, both values are 0.43021753. The same holds for the column pairs E2-PBE0, f1-PBE0, and f2-PBE0. To put it another way, the code has included the LR-TDPBE0/def2SVP data twice, in place of the LR-TDPBE0/def2TZVP data. I believe this holds for every row in those columns.
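For anyone who wants to reproduce the check, here is a minimal sketch using only the Python standard library. It runs on two rows copied from the table at the top of this thread, not on the full file. The helper name `identical_repeated_columns` is made up for illustration, and the comma-separated layout with literally repeated header names is an assumption that may not match the real qm8.csv exactly.

```python
import csv
import io
from collections import defaultdict

# Two rows copied from the table earlier in this thread (methane and ammonia).
# Assumption: a comma-separated layout with the header names repeated verbatim.
SAMPLE = """\
smiles,E1-CC2,E2-CC2,f1-CC2,f2-CC2,E1-PBE0,E2-PBE0,f1-PBE0,f2-PBE0,E1-PBE0,E2-PBE0,f1-PBE0,f2-PBE0,E1-CAM,E2-CAM,f1-CAM,f2-CAM
[H]C([H])([H])[H],0.43295186,0.43295958,0.24972825,0.24973648,0.43021753,0.43023558,0.181436,0.18150153,0.43021753,0.43023558,0.181436,0.18150153,0.40993137,0.40993872,0.1832,0.1832
[H]N([H])[H],0.26521952,0.35008064,0.06701544,0.03004918,0.26838581,0.3491057,0.04076087,0.03164115,0.26838581,0.3491057,0.04076087,0.03164115,0.25385331,0.33448133,0.0575,0.0238
"""


def identical_repeated_columns(rows):
    """Find columns that share a header name and hold identical values.

    `rows` is a list of lists whose first entry is the header row.
    Returns (header, first_column, second_column) tuples with 1-based
    column numbers, matching the numbering used in this thread.
    """
    header, body = rows[0], rows[1:]

    # Group column positions by header name. csv.DictReader is avoided on
    # purpose: it would collapse the repeated names into a single key.
    positions = defaultdict(list)
    for index, name in enumerate(header):
        positions[name].append(index)

    hits = []
    for name, indexes in positions.items():
        for i in range(len(indexes)):
            for j in range(i + 1, len(indexes)):
                a, b = indexes[i], indexes[j]
                # Values are compared as the raw strings found in the file.
                if all(row[a] == row[b] for row in body):
                    hits.append((name, a + 1, b + 1))
    return hits


rows = list(csv.reader(io.StringIO(SAMPLE)))
duplicates = identical_repeated_columns(rows)
for name, first, second in duplicates:
    print(f"{name}: columns {first} and {second} are identical")
```

To run the same check on the full dataset, the rows from a downloaded copy of the CSV could be passed to the helper in place of `SAMPLE`. If the second set of columns carries a different header in the file, the grouping by name would need adjusting.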

I have raised an issue on the MoleculeNet repo to document this (https://github.com/deepchem/moleculenet/issues/41) and have made an announcement on Twitter to raise awareness.

We are looking into the error and will make fixes to benchmarking once we are able to confirm the error.