Hi everyone,
We wanted to highlight two important recent bug fixes in DeepChem that affect how QM7/QM8/QM9 datasets are loaded and interpreted.
These issues could lead to silently incorrect labels or chemically inconsistent structures, so we strongly recommend updating to the latest versions if you rely on these datasets.
1) Label Mismatch When Loading SDF + CSV Datasets
Issue: https://github.com/deepchem/deepchem/issues/4485
Fix: https://github.com/deepchem/deepchem/pull/4489
Problem
When using SDFLoader (and derived loaders like load_qm7(), load_qm8(), and load_qm9()), DeepChem loads:
- molecular geometries from an SDF file, and
- task labels from a CSV file.
These files are sharded and should be aligned shard-by-shard. However, a bug caused the loader to always read labels from the first CSV shard, regardless of which SDF shard was being processed.
As a result:
- only the first shard loaded correctly,
- every subsequent shard was paired with labels from the first CSV shard rather than its own.
This systematically corrupted the “true labels” across most of the dataset.
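To make the failure mode concrete, here is a toy sketch in plain Python. It is not the DeepChem code: the generator names and the data are made up for illustration, and the exact mechanism inside load_sdf_files() may differ. It only shows how re-reading the first label shard for every structure shard produces pairs that look valid but are wrong.

```python
from typing import Iterator, List, Tuple


def sdf_shards() -> Iterator[List[str]]:
    """Toy stand-in for the SDF shard generator: three shards of molecule ids."""
    yield ["mol0", "mol1"]
    yield ["mol2", "mol3"]
    yield ["mol4", "mol5"]


def csv_shards() -> Iterator[List[float]]:
    """Toy stand-in for the CSV shard generator: labels in the same order."""
    yield [0.0, 1.0]
    yield [2.0, 3.0]
    yield [4.0, 5.0]


def pair_buggy() -> List[Tuple[str, float]]:
    """Restart the label generator for every SDF shard (hypothetical bug shape)."""
    pairs = []
    for mols in sdf_shards():
        # A fresh generator each time means only label shard 0 is ever read.
        labels = next(csv_shards())
        pairs.extend(zip(mols, labels))
    return pairs


def pair_aligned() -> List[Tuple[str, float]]:
    """Advance both generators together so shard i is paired with shard i."""
    pairs = []
    for mols, labels in zip(sdf_shards(), csv_shards()):
        pairs.extend(zip(mols, labels))
    return pairs


buggy = pair_buggy()
aligned = pair_aligned()
```

In this toy run the first two pairs agree between `buggy` and `aligned`, while every later molecule in `buggy` carries a label that belongs to `mol0` or `mol1`. No exception is raised, which is why the problem was easy to miss.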
What Was Fixed
PR #4489:
- corrects the CSV shard generator call inside load_sdf_files(),
- handles invalid molecules in SDF files so that alignment with CSV shards is preserved,
- adds regression tests.
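The invalid-molecule point can be sketched as follows. This is a minimal illustration of the idea, not the implementation in PR #4489: `align_valid` is a hypothetical helper, and it assumes the SDF reader returns None for records it cannot parse (as RDKit's SDMolSupplier does). The key is that a failed molecule and its label row are dropped together, so the remaining rows stay aligned.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

Mol = TypeVar("Mol")


def align_valid(
    mols: Sequence[Optional[Mol]], labels: Sequence[float]
) -> Tuple[List[Mol], List[float]]:
    """Drop each unparsable molecule (None) together with its label row."""
    if len(mols) != len(labels):
        raise ValueError(f"{len(mols)} molecules but {len(labels)} label rows")
    kept = [(m, y) for m, y in zip(mols, labels) if m is not None]
    return [m for m, _ in kept], [y for _, y in kept]


# Toy shard: the second record failed to parse.
parsed = ["CH4", None, "H2O"]
shard_labels = [1.0, 2.0, 3.0]
valid_mols, valid_labels = align_valid(parsed, shard_labels)
```

Filtering the molecules alone would leave three labels for two molecules and shift every label after the failed record by one position.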
QM7 Dataset Update
This PR also updates the QM7 dataset URL to point to a revised gdb7_v2 tarball.
The original SDF file contained 4 extra molecules (containing 1 or 2 hydrogen atoms) that were not part of the canonical 7165-molecule QM7 dataset. The new archive:
- removes those extras,
- includes gdb7_v2.sdf.csv with unrounded u0_atom values from the Quantum Machine dataset (https://quantum-machine.org/datasets/).
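If you want to sanity-check a local copy, one rough approach is to compare the number of SDF records with the number of CSV rows. The sketch below uses only the standard library and relies on the fact that each SDF record ends with a `$$$$` line. The demo strings and the `id` column name are made up; on the real archive you would open the extracted files instead (the SDF file name is presumably gdb7_v2.sdf, but that is an assumption on our part).

```python
import csv
import io
from typing import TextIO


def count_sdf_records(handle: TextIO) -> int:
    """Count SDF records by their '$$$$' terminator lines."""
    return sum(1 for line in handle if line.strip() == "$$$$")


def count_csv_rows(handle: TextIO) -> int:
    """Count non-empty data rows in a CSV, excluding the header row."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header
    return sum(1 for row in reader if row)


# In-memory demo data; replace with open("path/to/file") on real files.
demo_sdf = io.StringIO("mol A\n...\n$$$$\nmol B\n...\n$$$$\n")
demo_csv = io.StringIO("id,u0_atom\nA,-1.0\nB,-2.0\n")
n_mols = count_sdf_records(demo_sdf)
n_rows = count_csv_rows(demo_csv)
```

For the revised QM7 archive, both counts should come out to 7165 if the files match the description above.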
2) Incorrect Formal Charges in QM9 SDF Files
Issue: https://github.com/deepchem/deepchem/issues/4413
Fix: https://github.com/deepchem/deepchem/pull/4667
Problem
Users noticed that the QM9 SDF files previously distributed by DeepChem contained chemically invalid structures, for example nitrogen atoms carrying +1 formal charges, despite QM9 molecules being strictly neutral.
This was confirmed in recent literature, including the Nature paper:
PropMolFlow: property-guided molecule generation with geometry-complete flow matching
which explicitly mentions that DeepChem’s original QM9 SDF had bond/charge inconsistencies.
What Was Fixed
The corrected QM9 archive was produced by:
- reprocessing the original QM9 XYZ files,
- converting them to SDF using Open Babel to preserve charge neutrality,
- uploading the corrected SDF archive to the DeepChem S3 bucket.
Updated dataset:
https://deepchemdata.s3.us-west-1.amazonaws.com/datasets/qm9.tar.gz
Additional Notes
- Molecules like gdb_24 that previously showed incorrect nitrogen charges are now fixed using the script deepchem/examples/qm9/qm9_data_preprocessing.py.
- Some molecules (e.g., gdb_21968) still show differences depending on RDKit sanitization:
  - sanitize=False → no formal charges assigned
  - sanitize=True → formal charges on N/O atoms, though the molecule is still neutral overall
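For a quick look at which charges are written in an SDF file itself, independent of any toolkit, the sketch below scans V2000 `M  CHG` property lines and sums them per record. The helper names and demo records are hypothetical, the demo records have no atom block, and the sketch ignores the legacy charge column in the atom block. It also does not reproduce what RDKit does with sanitize=True; for that, reading the file with RDKit and calling GetFormalCharge() on each atom is the more robust route.

```python
from typing import Dict, Iterator, List


def split_sdf_records(text: str) -> Iterator[List[str]]:
    """Yield the lines of each record, split on '$$$$' terminator lines."""
    record: List[str] = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)


def formal_charges(record: List[str]) -> Dict[int, int]:
    """Map 1-based atom index to formal charge from V2000 'M  CHG' lines."""
    charges: Dict[int, int] = {}
    for line in record:
        if line.startswith("M  CHG"):
            fields = line[6:].split()
            count = int(fields[0])  # number of (atom, charge) pairs on this line
            for i in range(count):
                atom_idx = int(fields[1 + 2 * i])
                charges[atom_idx] = int(fields[2 + 2 * i])
    return charges


# Two stripped-down demo records: one zwitterion-like, one with no charges.
demo = "\n".join([
    "zwitterion", "", "", "M  CHG  2   1   1   3  -1", "M  END", "$$$$",
    "neutral", "", "", "M  END", "$$$$",
])
report = [
    (rec[0], formal_charges(rec), sum(formal_charges(rec).values()))
    for rec in split_sdf_records(demo)
]
```

A record whose per-atom charges are non-zero but whose sum is zero corresponds to the "charged atoms, neutral molecule" case described above; a non-zero sum would indicate a structure that is not neutral overall.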
Original QM9 XYZ source:
https://doi.org/10.6084/m9.figshare.978904_D12
Note on Prior Research Using These Datasets
QM7/QM8/QM9 from MoleculeNet and DeepChem’s load_qm*() APIs are widely used as standard benchmarks in molecular machine-learning research. Many published models and tutorials rely on the default DeepChem loaders and may not explicitly document custom preprocessing.
Because of this, the earlier SDF/CSV shard-alignment bug and the incorrect QM9 SDF charges could have affected training data used in some previously published work, potentially influencing reported performance or learned representations.
We can’t enumerate all impacted papers, but this context is important when interpreting historical benchmarks and comparisons against newer results obtained with the corrected datasets.
Recommendation
If you trained models on the QM7/QM8/QM9 datasets using SDFLoader with CSV labels, or relied on QM9 SDF structures for generative or quantum ML workflows, please update to the latest DeepChem pre-release (pip install --pre deepchem) and re-download the datasets. This ensures proper label alignment and chemically consistent structures.
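When re-downloading, keep in mind that previously downloaded files may be reused if they are still on disk. The sketch below is one hedged way to locate them. It assumes that DeepChem stores downloads under the directory named by the DEEPCHEM_DATA_DIR environment variable and falls back to the system temp directory (our understanding of the default; please verify for your version). `find_cached` is a hypothetical helper, and the `gdb9` prefix and file names in the demo are placeholders rather than a definitive list.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, List


def default_data_dir() -> Path:
    """Assumed lookup: $DEEPCHEM_DATA_DIR if set, else the system temp dir."""
    return Path(os.environ.get("DEEPCHEM_DATA_DIR", tempfile.gettempdir()))


def find_cached(data_dir: Path, prefixes: Iterable[str]) -> List[Path]:
    """List top-level entries in data_dir whose names start with any prefix."""
    wanted = tuple(prefixes)
    return sorted(p for p in data_dir.iterdir() if p.name.startswith(wanted))


# Demo on a throwaway directory with placeholder file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("gdb9.sdf", "gdb9.sdf.csv", "unrelated.txt"):
        (root / name).touch()
    stale_names = [p.name for p in find_cached(root, ["gdb9"])]
```

On a real setup you would call `find_cached(default_data_dir(), [...])` with the prefixes relevant to your datasets, review the list, and remove or move those entries before loading again. Featurized copies saved by the loaders may live in a separate save directory and would need the same treatment.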
We’re actively preparing the upcoming 2.9 release as well.
Many thanks to everyone who reported and helped debug these issues, and to the DeepChem developers who contributed fixes: @ARY2260, @riya-singh28, @JoseAntonioSiguenza, and @rbharath. These efforts substantially improve the reliability of MolNet benchmarks going forward.
Please let us know if you have any follow-up questions or notice anything else unusual in the datasets.