Google Summer of Code 2022: DeepChem - PyTorch Lightning

Hi everyone, I am Princy Chahal, currently studying Machine Learning at George Brown College in Toronto, Canada. I have been working with the DeepChem community since February 2022. During this time I have trained models using DeepChem and contributed pull requests related to using PyTorch Lightning with DeepChem.

I am really excited to work on building out DeepChem’s integration with PyTorch Lightning this summer.

Project: DeepChem - PyTorch Lightning

PyTorch Lightning provides a framework for building and training PyTorch models.
Integrating it with DeepChem would reduce the work of implementing model-training
functionality in the DeepChem library and would enable useful features such as
distributed training and easier model configuration and experimentation.

Below are three specific aspects we would like to showcase through the PyTorch Lightning
integration for DeepChem:

  • Multi-GPU training for the current set of DeepChem models compatible with the lightning
    integration. The multi-GPU training would be built on top of lightning's
    distributed training functionality.
  • Leveraging the lightning integration to showcase improvements in training. Two models
    in which improvement can be showcased are the Protein Transformer model and the
    MPNNModel (Message Passing Neural Network).
  • Hydra config management to track model configuration and experiments easily with
    the lightning integration.
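
For the multi-GPU item in the list above, the Trainer configuration might look roughly like the sketch below. It is only an illustration, not the project's implementation: the helper names `multi_gpu_trainer_kwargs` and `build_trainer` are hypothetical, and the `accelerator`, `devices`, and `strategy` arguments assume a PyTorch Lightning release (1.6 or later) that accepts them. The keyword arguments are built separately from the Trainer so they can be inspected on a machine without GPUs.

```python
def multi_gpu_trainer_kwargs(num_gpus, max_epochs=10):
    """Build keyword arguments for a Lightning Trainer (hypothetical helper).

    Uses the distributed data parallel ("ddp") strategy only when more
    than one GPU is requested.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    kwargs = {
        "accelerator": "gpu",
        "devices": num_gpus,
        "max_epochs": max_epochs,
    }
    if num_gpus > 1:
        kwargs["strategy"] = "ddp"
    return kwargs


def build_trainer(num_gpus, max_epochs=10):
    """Create the Trainer itself; needs pytorch_lightning and GPUs."""
    # Imported lazily so the kwargs helper works without lightning installed.
    import pytorch_lightning as pl

    return pl.Trainer(**multi_gpu_trainer_kwargs(num_gpus, max_epochs))
```

Calling `build_trainer(2).fit(module, dataloader)` would then train a lightning module on two GPUs, assuming the module and dataloader are set up for distributed training.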

Contact Details

Email: princychahal68@gmail.com

Week-0 [June 5-11]

Next steps:

  • Opening a PR with the DCLightningModule.
  • Writing tests which interface with DCLightningModule for the MultiTaskClassifier model.

Week-1 [June 12-18]

  • Opened a pull request implementing DCLightningModule in DeepChem.
    • Implemented the DCLightningModule class, keeping it generic so that any TorchModel and dataset can use it.
    • Added unit tests for DCLightningModule using the MultiTaskClassifier model.
  • Questions, feedback:
    • Folder where the new implementation should go: “deepchem/models/torch_models/lightning”
    • The test takes a long time to run. The hypothesis is that DeepChem’s DiskDataset does not work well with PyTorch dataloaders; the test runs well if all the data is loaded into memory beforehand.
    • Where should the dependency for the lightning installation be added?
    • Should we use DeepChem’s logging method or PyTorch Lightning’s logging method?
    • Any suggestions for other models to test with?

Next steps:

  • Follow up on review comments and add docstrings.
  • Ensure that the tests pass; right now they crash when installing lightning.
  • Work on DCLightningDatasetModule.

Week-2 [June 19-25]

  • Opened a pull request for a GCN (Graph Convolutional Network) model that uses the newly implemented DCLightningModule for training.
    • The GCN model will be used later to scale up the training on multiple GPUs.
  • Deepchem-lightning dataset meeting:
    • Met to discuss dataloader integration for DeepChem and settled on a path forward that uses the make_pytorch_dataset function.
    • This functionality will be implemented in PRs next week.
  • Addressed review comments on PR:
    • Some tests are still failing on last week's PR; a number of issues were fixed yesterday.
  • Next steps:
    • Merge the first PR after ensuring all tests pass.
    • Implement the DCLightningDatasetModule.
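
As a rough illustration of the make_pytorch_dataset path from the dataset meeting notes above, the sketch below wraps a DeepChem dataset in a PyTorch DataLoader. It is a minimal sketch under stated assumptions, not the code from the planned PRs: `unpack_batch` and `build_loader` are hypothetical helper names, and `dataset` is assumed to be a DeepChem dataset whose `make_pytorch_dataset` method accepts `epochs`, `deterministic`, and `batch_size`.

```python
def unpack_batch(batch):
    """Split a DeepChem-style (X, y, w, ids) batch into named fields."""
    X, y, w, ids = batch
    return {"inputs": X, "labels": y, "weights": w, "ids": ids}


def build_loader(dataset, batch_size):
    """Wrap a DeepChem dataset in a PyTorch DataLoader (hypothetical helper).

    `dataset` is assumed to be a deepchem.data.Dataset, for example a
    NumpyDataset or DiskDataset.
    """
    # Imported lazily so unpack_batch stays usable without torch installed.
    from torch.utils.data import DataLoader

    # make_pytorch_dataset returns an iterable-style torch dataset. With
    # batch_size set, it yields whole (X, y, w, ids) batches itself.
    torch_dataset = dataset.make_pytorch_dataset(
        epochs=1, deterministic=True, batch_size=batch_size)

    # batch_size=None turns off the DataLoader's own batching, because the
    # dataset already yields batches.
    return DataLoader(torch_dataset, batch_size=None)
```

Iterating over `build_loader(dataset, 32)` and passing each item through `unpack_batch` would give the inputs, labels, and weights that a lightning training step needs.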