Google Summer of Code 2022: D-MPNN Model Implementation for DeepChem

Hey everyone!

I am Aryan Amit Barsainyan. I will be working with DeepChem this summer as a GSoC contributor.

About me

I am a second-year undergraduate student at the National Institute of Technology Karnataka, India, pursuing Mechanical Engineering (major) and Information Technology (minor). I am a deep learning enthusiast and enjoy working on technology that uses ML and deep learning to solve complex scientific problems.

I joined the DeepChem community in January 2022. I am happy to have found an organization where I can learn a lot about how ML models, especially graph-based neural networks, work in practice. I plan to keep contributing to DeepChem after GSoC ends by improving the project implementation, fixing issues, and helping the community wherever and whenever possible!

Contact Details

About the Project

This project adds a new tool to the DeepChem suite for message-passing problems, building on recent advances in GCN research. It aims to implement a Directed Message Passing Neural Network (D-MPNN), a graph convolutional network (GCN) that builds on the existing Message Passing Neural Network (MPNN) model, using the implementation in Chemprop as the reference.

I shall be updating my progress here over the summer on a weekly basis. Stay tuned!

Project description is available here.


Extended Week 1 progress (1 June 2022 - 10 June 2022)

PR 1: https://github.com/deepchem/deepchem/pull/2929

  • Files modified:
    • deepchem/feat/molecule_featurizers/__init__.py
      
    • deepchem/feat/molecule_featurizers/dmpnn_featurizer.py
      
    • deepchem/feat/tests/test_atom_feature_generator_dmpnn.py
      
  • Created:
    • atom_features()
      
    • get_atomic_num_one_hot()
      
    • get_atom_chiral_tag_one_hot()
      
    • get_atom_mass()
      
    • Suitable unit tests for all the added functions
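
The one-hot helpers listed above share a common pattern, which can be sketched in plain Python. This is a minimal illustration and not the code from the PR: the helper name `one_hot_with_unknown`, the choice lists, and the extra trailing "unknown" slot are all assumptions made for the example.

```python
from typing import List, Sequence


def one_hot_with_unknown(value: int, choices: Sequence[int]) -> List[float]:
    """One-hot encode `value`; the extra last slot marks 'not in choices'."""
    encoding = [0.0] * (len(choices) + 1)
    # Unseen values fall back to the trailing "unknown" slot.
    index = choices.index(value) if value in choices else len(choices)
    encoding[index] = 1.0
    return encoding


# Hypothetical choice list, for illustration only (H .. Ne).
ATOMIC_NUMBERS = list(range(1, 11))

carbon = one_hot_with_unknown(6, ATOMIC_NUMBERS)   # known element
gold = one_hot_with_unknown(79, ATOMIC_NUMBERS)    # outside the choice list
```

Reserving a slot for unknown values keeps the feature length fixed even when a molecule contains an element or tag outside the chosen list.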

PR 2 (Draft): https://github.com/deepchem/deepchem/pull/2939

  • Files modified:
    • deepchem/feat/graph_features.py
      
    • deepchem/feat/molecule_featurizers/dmpnn_featurizer.py
      
    • deepchem/feat/tests/test_atom_feature_generator_dmpnn.py
      
    • deepchem/feat/tests/test_features_generator_dmpnn.py
      
  • Modified:
    • bond_features() in graph_features.py
      
    • atom_features()
      
  • Created:
    • bond_features() in dmpnn_featurizer.py
      
    • map_reac_to_prod()
      
    • Suitable unit tests for all the added/modified functions

Issue created: https://github.com/deepchem/deepchem/issues/2936

  • To address doctest warning for ndarray()

PR created to rectify the problem: https://github.com/deepchem/deepchem/pull/2937

  • Files modified:
    • deepchem/feat/graph_features.py
      

Content of PR 3 in progress

  • Goal: to implement DMPNNFeaturizer class
  • Open topics to discuss:
    • Normalization of features
    • Phase features generation
    • Usage of BatchGraphData class
  • Files modified so far:
    • deepchem/feat/molecule_featurizers/dmpnn_featurizer.py
      

Highlights of the office-hour calls:

  • Discussion on creating PRs through a feature branch in the remote repo.
  • Discussion on using GraphData class instead of implementing new DmpnnMol and DMPNNEncoding classes.
  • Volunteering to review PRs from fellow contributors.
  • Addressed errors in CI.
  • Why switch from Keras to PyTorch?
  • Discussion to initially make DMPNN model only for non-reaction type datapoints.
  • A future scope: new base class - Reaction featurizer
  • Discussion on atomic mass normalization and the lack of support in DeepChem for handling inequality targets.

New learnings this week:

  • Handling multiple branches in a repository.
  • Why rebasing is needed and how to do it.
  • Procedure for error-free type annotation (invariance and covariance).
  • Improved my knowledge of writing proper unit tests.
  • Better practice is to lint the code before formatting.
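
The invariance and covariance point can be illustrated with a small typing example. The function names are made up for this sketch; the behaviour described in the comments is what a static checker such as mypy reports, while at runtime both functions accept the list.

```python
from typing import List, Sequence


def mean_invariant(values: List[float]) -> float:
    # List is invariant: a type checker rejects a List[int] argument here.
    return sum(values) / len(values)


def mean_covariant(values: Sequence[float]) -> float:
    # Sequence is covariant: Sequence[int] is accepted where
    # Sequence[float] is expected.
    return sum(values) / len(values)


counts: List[int] = [1, 2, 3]
result = mean_covariant(counts)   # passes a static type check
# mean_invariant(counts)          # mypy flags an incompatible argument type
```

Annotating read-only parameters with `Sequence` instead of `List` is one common way to avoid this class of annotation errors.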

Week 2 Progress Report

11 June 2022 - 17 June 2022

  • Bond features PR got merged.
  • Re-investigated the DMPNN paper for better understanding of the algorithm.
  • Created the DMPNN Featurizer class and the _featurize function.
  • Wrote unit tests for the featurizer class.
  • Cleared up a misunderstanding about the CanonicalRankAtoms() function.
  • Modified the base class MolecularFeaturizer.

Upcoming tasks for weekend:

  • Make changes in the DeepChem docs and submit the PR for the DMPNN featurizer.
  • Create a PR to add global features to the featurizer.

Week 3 Progress Report

20 June 2022 - 24 June 2022

  • Earlier this week, submitted a PR (6 files changed) to add the DMPNN featurizer class to DeepChem, with changes to the base class MolecularFeaturizer.
  • Created function to generate global features.
  • Working on adding a global featurizer from the ‘descriptastorus’ library to DeepChem.
  • Working on splitting the DMPNN featurizer PR and applying the changes suggested in the office-hours meeting.

Upcoming tasks for weekend:

  • Push the PR for only the base class modification and required changes.
  • Run dmpnn featurizer through existing datasets and create a helper function for the featurizer class.

Week 4 Progress Report

25 June 2022 - 1 July 2022

  • Got PR merged: (2 files changed) modify molecular featurizer base class and suitable tests #2960

    • (Modifies the base class MolecularFeaturizer to give the users access to the original order of the atoms instead of the canonical order of atoms (default)).
  • Got PR merged: (2 files changed) added _MapperDMPNN class and suitable tests #2962

    • The _MapperDMPNN class will act as a helper class to generate concatenated features and required bond-to-incoming bonds mapping, for the upcoming DMPNNFeaturizer class.
  • Submitted PR: (2 files changed) add global feature generator and suitable unit tests #2971

  • Working on PR for: DMPNN featurizer class (complete) and unit tests (pending)

  • Reported Issue #2969 regarding yapf version mistake in PR template (resolved)
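
The bond-to-incoming-bonds mapping mentioned for _MapperDMPNN can be sketched in plain Python. This is an illustration of the idea only, not the merged class: the function name `incoming_bond_mapping` and the use of -1 as padding are assumptions made for the example.

```python
from typing import List, Tuple


def incoming_bond_mapping(
    edges: List[Tuple[int, int]]
) -> Tuple[List[Tuple[int, int]], List[List[int]]]:
    """Return the directed bonds and, for each one, the bonds feeding into it.

    For a directed bond u -> v, the incoming bonds are all w -> u with w != v,
    so a message never flows straight back along the bond it arrived on.
    """
    directed: List[Tuple[int, int]] = []
    for u, v in edges:
        directed.append((u, v))
        directed.append((v, u))

    mapping: List[List[int]] = []
    for u, v in directed:
        incoming = [
            i for i, (w, x) in enumerate(directed) if x == u and w != v
        ]
        mapping.append(incoming)

    # Pad every row with -1 so the mapping is rectangular (assumed convention).
    width = max((len(row) for row in mapping), default=0)
    padded = [row + [-1] * (width - len(row)) for row in mapping]
    return directed, padded


# Three-atom chain: 0 - 1 - 2
directed, mapping = incoming_bond_mapping([(0, 1), (1, 2)])
# directed == [(0, 1), (1, 0), (1, 2), (2, 1)]
# mapping  == [[-1], [3], [0], [-1]]
```

In the chain, the bond 1 -> 2 receives only from 0 -> 1, and the terminal bonds 0 -> 1 and 2 -> 1 have no incoming bonds at all, which is why their rows contain only padding.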

Upcoming tasks for weekend:

  • Push the PR for DMPNN featurizer class.
  • Work on an additional global featurizer.
  • Add intuitive explanation about DMPNN algo in docstring of DMPNN featurizer and in forum.

Week 5 Progress Report

03 July 2022 - 08 July 2022

  • Created Draft PR #2974: dmpnn featurizer class and unit tests
  • Got PR merged: (2 files changed) add global feature generator and suitable unit tests #2971
  • Tested the DMPNN featurizer class with a general dataset (FreeSolv).
  • Got PR merged: (2 files changed) fix bug in GraphData class and add suitable unit test #2979 (to solve issue #2978)
  • Created new PR: add count-based morgan fingerprint featurizer and suitable unit tests #2980
  • Found an explanation of how the DataStructs module in RDKit works and posted it in Slack.
  • Created new PR: modify RDKitDescriptors class for normalized features #2983
  • Added intuitive explanation about DMPNN algo in docstring of DMPNN featurizer class.
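
The difference between a bit-based and a count-based fingerprint can be shown with a toy folding routine. This is a conceptual sketch only: real Morgan fingerprints derive their identifiers from circular atom environments via RDKit, whereas the identifiers and the helper `fold_fingerprint` below are made up for illustration.

```python
from typing import Iterable, List


def fold_fingerprint(substructure_ids: Iterable[int], size: int,
                     use_counts: bool) -> List[int]:
    """Fold substructure identifiers into a fixed-length vector."""
    vector = [0] * size
    for identifier in substructure_ids:
        position = identifier % size
        if use_counts:
            vector[position] += 1   # count every occurrence
        else:
            vector[position] = 1    # only record presence
    return vector


ids = [3, 11, 3, 3, 6]  # made-up identifiers; 11 collides with 3 when size=8
binary = fold_fingerprint(ids, size=8, use_counts=False)
counts = fold_fingerprint(ids, size=8, use_counts=True)
# binary == [0, 0, 0, 1, 0, 0, 1, 0]
# counts == [0, 0, 0, 4, 0, 0, 1, 0]
```

The count vector keeps information about how often a substructure occurs, which the binary vector discards.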

Upcoming tasks for weekend:

  • Work on encoder class for DMPNN
  • Improve unit tests of DMPNN featurizer class

Week 6 Progress Report

11 July 2022 - 13 July 2022

  • Got PR merged: (4 files changed) modify RDKitDescriptors class for normalized features #2983
  • Created new PR: add DMPNN featurizer class and suitable unit tests #2995 (generalised)
  • Working on parameterizing the unit tests for DMPNN featurizer class.
  • Created Mapper class in dmpnn.py (extract required features from GraphData object)
  • Discussed graph batching and PyTorch Geometric.
  • Fixed the doctest error caused by PR #2980 (add count-based morgan fingerprint featurizer and suitable unit tests).

Upcoming tasks for weekend:

  • Push PR for Mapper class.
  • Work on encoder layer for DMPNN.
  • Work on the FFN layer for DMPNN.
  • Improve unit tests of DMPNN featurizer class.

Week 7 Progress Report

18 July 2022 - 22 July 2022

  • Got PR merged: (5 files changed) add DMPNN featurizer class and suitable unit tests #2995
  • Created new PR: add mapper class for dmpnn model and suitable unit tests #3001
    • Contains elaborated example for the working of the class.
  • Created new PR: add dmpnn ffn layer and suitable unit test #3004
    • Converted unit tests to pytest format along with @pytest.mark.pytorch
  • Created new PR: add new global feature generators and unit tests for DMPNN featurizer #3005
  • Created DMPNNEncoderLayer class in layers.py (message passing for a single mol)
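
Message passing for a single molecule, as handled by the encoder layer, can be sketched in a stripped-down form. This is not the DMPNNEncoderLayer code: to keep the example readable, hidden states are scalars and every learned weight matrix is replaced by the identity, so only the flow of messages along directed bonds is shown. The function name and inputs are assumptions made for the example.

```python
from typing import List, Tuple


def relu(value: float) -> float:
    return max(0.0, value)


def directed_message_passing(atom_feats: List[float],
                             bond_feats: List[float],
                             directed: List[Tuple[int, int]],
                             incoming: List[List[int]],
                             steps: int) -> List[float]:
    """Toy directed message passing with scalar states and identity weights."""
    n_bonds = len(directed)
    # Initial hidden state of bond u -> v: source-atom plus bond features.
    h0 = [relu(atom_feats[u] + bond_feats[b])
          for b, (u, _) in enumerate(directed)]
    h = list(h0)
    for _ in range(steps):
        # Each bond sums the hidden states of its incoming bonds ...
        messages = [sum(h[k] for k in incoming[b]) for b in range(n_bonds)]
        # ... and is updated with a skip connection to its initial state.
        h = [relu(h0[b] + messages[b]) for b in range(n_bonds)]
    # Each atom aggregates the hidden states of the bonds pointing at it.
    atom_hidden = []
    for v in range(len(atom_feats)):
        message = sum(h[b] for b, (_, target) in enumerate(directed)
                      if target == v)
        atom_hidden.append(relu(atom_feats[v] + message))
    return atom_hidden


# Three-atom chain 0 - 1 - 2 with made-up features.
directed = [(0, 1), (1, 0), (1, 2), (2, 1)]
incoming = [[], [3], [0], []]   # bonds feeding each directed bond
atom_hidden = directed_message_passing(
    atom_feats=[1.0, 2.0, 3.0], bond_feats=[0.5, 0.5, 0.5, 0.5],
    directed=directed, incoming=incoming, steps=1)
molecule_embedding = sum(atom_hidden)   # sum readout over atoms
```

Because the hidden states live on directed bonds, a message sent from atom 0 to atom 1 is never immediately returned to atom 0 in the next step.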

Upcoming tasks for weekend:

  • Write unit tests for DMPNNEncoderLayer class.
  • Push PR for DMPNNEncoderLayer class.
  • Work on DMPNN model class.

Week 8 Progress Report

25 July 2022 - 28 July 2022

  • Got PR merged: (2 files changed) modify PositionwiseFeedForward class and add unit tests #3009
  • Got PR merged: (2 files changed) add mapper class for dmpnn model and suitable unit tests #3001
  • Created new draft PR: Encoder layer and model prototype - dmpnn #3014
    • Contains script for encoder layer and its unit tests
    • Contains script for model class and its unit tests

Upcoming tasks for weekend:

  • Write docstrings for DMPNNEncoderLayer class.
  • Work on unit tests for DMPNN model class.
  • Write docstrings for DMPNN class.

Week 9 Progress Report

03 August 2022 - 05 August 2022

  • Created new PR: add dmpnn encoder layer and suitable unit test #3023
  • Tested the encoder layer and model class alongside chemprop implementation.
  • Working on issues to represent docstrings in the documentation.

Upcoming tasks for weekend:

  • Work on unit tests for DMPNN model class.
  • Write docstrings for DMPNN class.
  • Work on torch model wrapper for DMPNN model.

Week 10 Progress Report

06 August 2022 - 12 August 2022

Upcoming tasks for weekend:

  • Work on modifications in encoder layer and _mapper class for batching.

Week 11 Progress Report

16 August 2022 - 19 August 2022

  • Created new PR: add torch model wrapper for DMPNN model class #3034 (ready).

  • Analysed pytorch_geometric custom batching based on modified __inc__().

  • Updated prototype for Batching in torch model wrapper. PR #3014.

  • Problems in batching:

    • repeated zero vector!
    • tackle -1 (increment issue)
    • mapping back to individual molecules
    • Irregular sizes of mappings
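
The batching problems listed above can be illustrated with a small sketch. It assumes per-molecule bond mappings padded with -1 and a single zero row at the end of the stacked features, so that a -1 lookup returns zeros; the function names are hypothetical and this is not the code from PR #3014.

```python
from typing import List, Tuple


def batch_mappings(
    mappings: List[List[List[int]]]
) -> Tuple[List[List[int]], List[int]]:
    """Merge per-molecule bond mappings into one batch-level mapping."""
    # Irregular sizes: pad all rows to the widest row in the batch.
    width = max(len(row) for mapping in mappings for row in mapping)
    batched: List[List[int]] = []
    bond_to_molecule: List[int] = []
    offset = 0
    for molecule_index, mapping in enumerate(mappings):
        for row in mapping:
            # Shift real indices by the offset but leave -1 padding untouched.
            shifted = [i + offset if i != -1 else -1 for i in row]
            shifted += [-1] * (width - len(shifted))
            batched.append(shifted)
            # Remember the owner so results can be split per molecule again.
            bond_to_molecule.append(molecule_index)
        offset += len(mapping)
    return batched, bond_to_molecule


def batch_bond_features(per_molecule: List[List[List[float]]]) -> List[List[float]]:
    """Stack bond features and append ONE shared zero row for -1 lookups."""
    stacked = [row for feats in per_molecule for row in feats]
    stacked.append([0.0] * len(stacked[0]))
    return stacked


chain = [[-1], [3], [0], [-1]]                         # 3-atom chain
star = [[3, 5], [-1, -1], [1, 5], [-1, -1], [1, 3], [-1, -1]]  # 3 bonds at one atom
batched, bond_to_molecule = batch_mappings([chain, star])
features = batch_bond_features([[[1.0]] * 4, [[2.0]] * 6])
```

Here the star molecule's indices are shifted by 4 (the number of directed bonds in the chain), while every -1 still points at the single zero row at the end of `features`.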

Upcoming tasks for weekend:

  • Work on modifications in encoder layer and _mapper class for batching.

Week 12 and 13 Progress Report

22 August 2022 - 2 September 2022

Upcoming tasks for weekend:

  • Work on hyperparameter optimisation
  • Work on final report blog
  • Benchmark DMPNN model with datasets

Week 14 Progress Report

3 September 2022 - 9 September 2022

  • PR: implementation of batching for DMPNN model #3040
    • Import issue solved and PR is ready for final review and possibly merge.
  • Tested hyperparameter optimisation scripts from DeepChem (ran into a potential bug)
  • Merged PR: Fix issue #3057 (update _Mapper class for dmpnn) #3058
  • Used the MolNet example script to create a DMPNN benchmark script.
  • Used compute resources from college to benchmark the DMPNN model on the Tox21 dataset in 4 variations:
    • DMPNN-no-global, splitter = random
    • DMPNN-no-global, splitter = scaffold
    • DMPNN-global-rdkit-norm, splitter = random
    • DMPNN-global-rdkit-norm, splitter = scaffold
  • GSoC Final submission report is ready for review.

Upcoming tasks for weekend:

  • Start working on tutorial
  • Benchmark other Datasets

Project final report link: