Google Summer of Code 2022: Deepchem - Pytorch Lightning

Hi everyone, I am Princy Chahal, currently studying Machine Learning at George Brown College, Toronto, Canada. I have been working with the Deepchem community since February 2022. During this time I have trained models using Deepchem and have also contributed pull requests related to using pytorch lightning with deepchem.

I am really excited to work on building out deepchem’s integration with pytorch lightning this summer.

Project: Deepchem - Pytorch Lightning

Pytorch Lightning provides a framework for building and training pytorch models. Integrated with deepchem, pytorch-lightning will reduce the work of implementing model-training functionality inside the deepchem library, and will also enable useful features like distributed training, easy model configuration, and experimentation.

Below are three specific aspects we would like to showcase through the pytorch-lightning
integration for deepchem:

  • Multi-GPU training for the current set of deepchem models compatible with the lightning integration. The multi-GPU training will build on top of lightning’s distributed training functionality (a rough end-to-end sketch follows this list).
  • Leveraging the lightning integration to showcase improvements in training. Two models where the improvement can be showcased are the Protein Transformer model and the MPNNModel (Message Passing Neural Network).
  • Hydra config management to track model configurations and experiments easily with the lightning integration.
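
As a rough end-to-end sketch of the intended workflow (the DCLightningModule and DCLightningDatasetModule wrappers and their import path are planned deliverables, so every name and signature here is an assumption, not an existing deepchem API):

```python
import deepchem as dc
import pytorch_lightning as pl

# Hypothetical wrappers this project plans to build; names and signatures
# are assumptions, not an existing deepchem API.
from deepchem.models.lightning import DCLightningModule, DCLightningDatasetModule

# Load a small MoleculeNet dataset with a graph featurizer.
tasks, (train, valid, test), _ = dc.molnet.load_bbbp(
    featurizer=dc.feat.MolGraphConvFeaturizer())

# A standard deepchem torch model, wrapped so Lightning can drive training.
dc_model = dc.models.GCNModel(n_tasks=len(tasks), mode="classification")
lit_model = DCLightningModule(dc_model)
lit_data = DCLightningDatasetModule(train, batch_size=32)

# devices > 1 would switch on Lightning's distributed training.
trainer = pl.Trainer(max_epochs=10, accelerator="auto", devices=1)
trainer.fit(lit_model, datamodule=lit_data)
```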

Contact Details

Email: princychahal68@gmail.com


Week-0 [June 5-11]

Next steps:

  • Opening a PR with the DCLightningModule.
  • Writing tests which interface with DCLightningModule for the MultiTaskClassifier model.

Week-1 [June 12-18]

  • Opened a pull request implementing DCLightningModule in deepchem.
    • Implemented the DCLightningModule class, aiming to keep it generic so that any TorchModel and dataset can use it (a sketch of the idea follows this list).
    • Added unit tests for DCLightningModule using the MultiTaskClassifier model.
  • Questions, feedback:
    • Folder where the new implementation should go: “deepchem/models/torch_models/lightning”
    • The tests took significantly long to run; the hypothesis is that deepchem’s DiskDataset does not work well with pytorch dataloaders. The tests run quickly if we load all the data into memory beforehand.
    • Where to add the dependency for the lightning installation?
    • Should we use deepchem’s logging method or pytorch-lightning’s logging method?
    • Any suggestions for other models to test with?
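
For context, a minimal sketch of the idea behind DCLightningModule. Attribute names such as `model.model`, `_loss_fn`, and `optimizer._create_pytorch_optimizer` are my assumptions about DeepChem internals, not necessarily the code in the PR:

```python
import pytorch_lightning as pl

class DCLightningModule(pl.LightningModule):
    """Sketch: wrap a generic DeepChem TorchModel so Lightning owns the loop."""

    def __init__(self, dc_model):
        super().__init__()
        self.dc_model = dc_model
        self.pt_model = dc_model.model    # assumption: underlying torch.nn.Module
        self.loss_fn = dc_model._loss_fn  # assumption: (outputs, labels, weights) -> loss

    def configure_optimizers(self):
        # Assumption: reuse the optimizer the DeepChem model was configured with.
        return self.dc_model.optimizer._create_pytorch_optimizer(
            self.pt_model.parameters())

    def training_step(self, batch, batch_idx):
        inputs, labels, weights = batch   # DeepChem batches are (X, y, w) triples
        outputs = self.pt_model(inputs)
        loss = self.loss_fn(outputs, labels, weights)
        self.log("train_loss", loss)
        return loss
```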

Next steps:

  • Follow up on review comments and add docstrings.
  • Ensure that tests pass; right now the tests crash when installing lightning.
  • Work on the DCLightningDatasetModule.

Week-2 [June 19-25]

  • Opened a pull request for GCN (Graph Convolutional Network), which uses the newly implemented DCLightningModule for training.
    • The GCN model will be used later to scale up the training on multiple GPUs.
  • Deepchem-lightning dataset meeting:
    • Met to discuss dataloader integration for deepchem; figured out a path forward which uses the make_pytorch_dataset function (sketched after this list).
    • This functionality will be implemented in PRs over the next week.
  • Addressed review comments on PR:
    • Some tests are still failing on the PR from last week; fixed a number of things yesterday.
  • Next steps:
    • Merge the first PR after ensuring all tests pass.
    • Implementation of the DCLightningDatasetModule
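
A minimal sketch of how make_pytorch_dataset could back a LightningDataModule; the class shape here is my guess at where next week's PRs are headed, not the final implementation:

```python
import pytorch_lightning as pl
from torch.utils.data import DataLoader

class DCLightningDatasetModule(pl.LightningDataModule):
    """Sketch: expose a DeepChem dataset to Lightning via make_pytorch_dataset."""

    def __init__(self, dataset, batch_size, collate_fn=None):
        super().__init__()
        self._dataset = dataset
        self._batch_size = batch_size
        self._collate_fn = collate_fn  # model-specific batch preparation, if any

    def train_dataloader(self):
        # make_pytorch_dataset returns an IterableDataset that already yields
        # batched (X, y, w) triples when batch_size is given, so the DataLoader
        # is created with batch_size=None to avoid re-batching.
        pt_dataset = self._dataset.make_pytorch_dataset(
            batch_size=self._batch_size)
        return DataLoader(pt_dataset, batch_size=None,
                          collate_fn=self._collate_fn)
```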

Week-3 [June 26 - July 3]

  • Worked on ensuring the tests pass for PR #2945 and getting it merged into Deepchem’s master:
    • Added pytorch-lightning as a dependency for deepchem.
    • Lightning was running into issues with an old version of huggingface Transformers; updated the huggingface Transformers version to 4.10.
    • The Transformers version update required changes in the SmilesTokenizer, which inherits from the Transformers tokenizer; made the changes and ensured that the CI tests pass.
    • PR was merged on Thursday night.
  • Next steps:
    • Addressing comments and getting the GCN PR #2958 merged.
    • Continue the DCLightningDatasetModule implementation, which I have also been working on; will open its PR by this week’s end.

Week-4 [July 4-10]

  • Merged the GCN PR #2958:
    • Addressed review comments.
    • A number of CI- and doc-related issues introduced in previous PRs were also resolved in this PR.
    • The DCLightningModule is now checked into deepchem and is generic enough that users can train deepchem pytorch models with it directly; this is demonstrated by training a MultiTaskClassifier and a GCN model with the setup (see the sketch after this list).
  • DCLightningDatasetModule PR
    • Working on the DCLightningDatasetModule PR; currently we have to define custom dataset modules for each model, and this PR will remove that requirement.
    • Tested the PR in a multi-GPU setup on AWS; there were a few issues with distributing data across GPUs, most of which are now resolved.
    • Running behind schedule with this PR, as I had to spend time merging the previous PRs.
  • Next steps:
    • Prioritise opening PRs for the current DCLightningDatasetModule implementation, aiming to have one open today and another in the next 2 days.
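
To illustrate the genericity point above, the same wrapper can drive two unrelated model types. A sketch, assuming the wrapper classes and import path from the merged PRs:

```python
import deepchem as dc
import pytorch_lightning as pl

# Assumed import path for the wrapper classes.
from deepchem.models.lightning import DCLightningModule, DCLightningDatasetModule

# A fingerprint-based multitask classifier...
tasks, (train_fp, _, _), _ = dc.molnet.load_clintox(
    featurizer=dc.feat.CircularFingerprint(size=1024))
mtc = dc.models.MultitaskClassifier(n_tasks=len(tasks), n_features=1024,
                                    layer_sizes=[512])

# ...and a graph convolutional model go through the same Lightning wrapper.
tasks_g, (train_g, _, _), _ = dc.molnet.load_clintox(
    featurizer=dc.feat.MolGraphConvFeaturizer())
gcn = dc.models.GCNModel(n_tasks=len(tasks_g), mode="classification")

for dc_model, dataset in ((mtc, train_fp), (gcn, train_g)):
    trainer = pl.Trainer(max_epochs=1, accelerator="auto", devices=1)
    trainer.fit(DCLightningModule(dc_model),
                datamodule=DCLightningDatasetModule(dataset, batch_size=32))
```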

Week-5 [July 11-17]

  • DCLightningDatasetModule PRs
    • #2993: PR for the wrapper DCLightningDatasetModule.
    • #2994: PR for updating the unit tests with the new module.
  • Worked on training a larger GCN model with lightning classes.
  • Next steps:
    • Sync up with Stanley and Bharath on training larger models with lightning model implementations.
    • Check in current set of PRs.
    • PR for the scaled up training implementation.

Week-6

  • Using the DCLightning modules, I was able to launch training on AWS EC2 p2.xlarge and p2.8xlarge instances with small datasets.
  • I am setting up training of GraphConv models on the large Zinc15 dataset on AWS GPUs.
  • Issues:
    • Trying to use the load_zinc15 function with the RDKit featurizer, but running into issues: the Zinc15 dataset is large and featurization runs endlessly (a loading sketch follows this list).
    • I could not find an example in deepchem’s documentation of training on the Zinc15 dataset with the RDKit featurizer and GraphConv models. Are there any references I could look at?
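
For reference, roughly how I am invoking the loader. This is a sketch: I am assuming load_zinc15 accepts a featurizer object and a dataset_size argument selecting a smaller subset, and the featurizer shown is an illustrative choice:

```python
import deepchem as dc

# Assumption: dataset_size selects a subset (e.g. "250K"); starting from the
# smallest subset keeps featurization time manageable.
tasks, (train, valid, test), transformers = dc.molnet.load_zinc15(
    featurizer=dc.feat.MolGraphConvFeaturizer(),  # illustrative choice
    splitter="random",
    dataset_size="250K",
)
print(len(train), "training samples")
```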

Week-7

  • Successfully benchmarked training on the Zinc15 dataset using GCNModel on Google Cloud.
    • Benchmarking can also be configured to run on a different number of GPUs via the devices parameter (see the Trainer sketch after this list).
  • PR for running benchmarking: https://github.com/deepchem/deepchem/pull/3016.
    • Is the “examples” folder the correct place to check this in?
  • When training a 6M-parameter model on the zinc15 dataset, training on 1 GPU gives approximately an 11x reduction in training time.
  • Next steps:
    • Use a similar pipeline to train the protein transformer model with lightning and measure improvements.
    • Merge the benchmarking code.
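
The devices knob maps directly onto Lightning's Trainer. A sketch, reusing the lit_model and lit_data wrapper objects from the earlier sketches:

```python
import pytorch_lightning as pl

# Benchmarking sketch: the same fit call, scaled across GPUs.
# strategy="ddp" launches one process per device.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,        # vary this (1, 2, 4, ...) to benchmark scaling
    strategy="ddp",
    max_epochs=10,
)
# lit_model and lit_data come from the earlier
# DCLightningModule / DCLightningDatasetModule sketches.
trainer.fit(lit_model, datamodule=lit_data)
```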

Week-8

  • Exploring the huggingface transformer implementation of ChemBERTa with Deepchem.
    • Paulina and Stanley shared the models and codebase.
    • Setting up the models and figuring out how to plug in the DCLightningModule and benchmark the comparison.
  • Starting out with the hydra config integration for lightning to better manage configurations for experimentation; this is the next deliverable of my GSOC project.
  • Finishing up benchmarking of the lightning integration against native deepchem models. Current experiments show minimal overhead from using lightning.
  • Next steps:
    • Finishing up the benchmarking numbers and landing the PR.
    • Hydra config integration with DCLightningModule.
    • ChemBERTa experiments.

Week-9

  • Benchmarking:
    • Compared the running time for GCNModel between native Deepchem and Pytorch Lightning on a single GPU.
    • Training a 6M-parameter GCN, the lightning implementation on a single GPU is ~30% slower.
    • The model forward component takes a similar time across both versions; the slowdown is likely coming from lightning’s data-loading overhead.
    • On smaller datasets (get_dataset), the lightning and deepchem implementations are on par on a single GPU.
  • Multi-GPU benchmarking:
    • Pending approvals from GCloud to be able to launch multi-GPU instances.
  • Lightning-Hydra example usage:
    • Opened a PR which adds an example of how to configure deepchem and lightning with hydra.
    • The PR also shows the usage of configs for maintaining experiment hyperparameters (a config sketch follows this list).
  • Next steps:
    • Understand the data loading benchmarking delays.
    • HuggingFace Deepchem usage.
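
A minimal sketch of the Hydra pattern the example PR demonstrates; the config keys are illustrative, not the PR's actual schema:

```python
import hydra
from omegaconf import DictConfig

# Illustrative conf/config.yaml (keys are assumptions):
#   model:
#     learning_rate: 0.001
#     batch_size: 32
#   trainer:
#     max_epochs: 10
#     devices: 1

@hydra.main(config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # All hyperparameters live in the config, so an experiment is reproduced
    # or varied via config edits or command-line overrides, e.g.:
    #   python train.py trainer.devices=4 model.learning_rate=0.004
    print(cfg.model.learning_rate, cfg.trainer.max_epochs)

if __name__ == "__main__":
    main()
```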

Week-10

  • Benchmarking:
    • Worked on thorough GPU benchmarking with the zinc15 dataset.
      • The dataloader num_workers parameter was made identical across implementations.
      • The number of gradient steps was made equal.
    • Results:
      • Original Deepchem 10 epochs: 178.4s
      • Pytorch lightning 10 epochs: 194.5s
    • Lightning has a time overhead of ~10%.
    • The slowdown is coming from outside the code I implemented; I measured the fit function across both implementations and the time taken is identical.
  • Multigpu:
    • Trained a few models on 2-GPU and 4-GPU GCloud instances.
    • Opened a pull request for the changes needed for the implementation: #3035.
    • For processes to be forked, the loss returned by the _make_pytorch_loss function has to be non-local, i.e. picklable (see the sketch after this list). Other similar functions would require the same change.
  • Next steps:
    • Landing all the pull requests.
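
A generic Python sketch of the picklability constraint behind the _make_pytorch_loss change; it illustrates the language issue, not DeepChem's actual loss code:

```python
import torch

def make_loss_closure():
    # A loss defined as a nested function is a closure. Closures cannot be
    # pickled, so they break when the loss must be shipped to spawned/forked
    # GPU worker processes.
    def loss(outputs, labels, weights):
        return torch.mean((outputs - labels) ** 2 * weights)
    return loss

class WeightedMSELoss:
    # A module-level ("non-local") callable pickles cleanly and can be sent
    # to each GPU process.
    def __call__(self, outputs, labels, weights):
        return torch.mean((outputs - labels) ** 2 * weights)

def make_loss_picklable():
    return WeightedMSELoss()
```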

Week-11

  • Multi-GPU:
    • Ran benchmarking with multiple GPUs on GCloud.
    • Results:
      • 1 GPU Original Deepchem 10 epochs: 178s
      • 1 GPU Pytorch lightning 10 epochs: 194s
      • 2 GPUs Pytorch lightning 10 epochs: 104s
      • 4 GPUs Pytorch lightning 10 epochs: 57s
      • 8 GPUs Pytorch lightning 10 epochs: 34s
    • The speed improvement is approximately linear in the number of GPUs.
  • Merged:
    • #3016: Benchmarking script for running pytorch lightning models.
    • #3042: Updated the pytorch-lightning version, fixing the alert raised by #3041.
  • Next steps:
    • Merge #3030 and #3035. These PRs finish up the modifications needed for multi-GPU training.

Week-12

  • Implemented the dataset and dataloader changes required for multi-GPU training.
  • Splitting the dataset across processes is required, as each GPU runs in a separate process (see the sketch after this list).
  • Tested loss variation with a larger batch size on the Zinc15 dataset:
    • 1 GPU, LR: 0.001, Epochs: 100
    • 4 GPUs, LR: 0.004, Epochs: 100
    • Both experiments end at almost the same training loss; the learning rate is scaled linearly with the number of GPUs (i.e. with the effective batch size). The 4-GPU implementation runs 3.6x faster than 1 GPU.
  • Working on a post describing the GSOC work and documenting how to use pytorch-lightning with Deepchem.
  • Next steps:
    • Discussions and follow up on #3035.
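
For context, a generic PyTorch sketch of the per-process split; Lightning injects a DistributedSampler like this automatically under DDP, and it is shown here only to illustrate the requirement:

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Each DDP process sees a disjoint shard: DistributedSampler partitions the
# indices by the process's rank out of world_size total processes.
# (Assumes torch.distributed is initialized, which Lightning does under DDP.)
dataset = TensorDataset(torch.randn(1000, 16), torch.randn(1000, 1))
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(100):
    sampler.set_epoch(epoch)  # ensures a different shuffle each epoch
    for X, y in loader:
        pass  # training step runs here
```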

Week-13

  • Prepared the GSOC 2022 final work submission post.
  • Only one PR is still pending merge, #3035:
    • Followed up on review comments from Peter and Abhishek.
    • Need to benchmark the functionality for multiple num_workers values.
    • Will try to finish and merge this PR in the next few days.