Summary of 2020 GSoC

nd-02110114 · September 1, 2020, 5:21am

I spent three months working with DeepChem as a GSoC student. The GSoC program finishes at the end of the month, so I want to write up a summary.

If you want to know who I am before reading about my tasks, please check the following post!

What I did in 2020 GSoC

In this GSoC, I mainly worked on three tasks.

Build the experimental JAXChem project
Support integrations with DGL or PyG
Update the documents and infrastructures

My initial plan (see the details here) covered only the JAXChem project, an experimental JAX-based deep learning library for chemical and physical modeling, and I was going to implement some GNN or Transformer models using JAX. However, I ran into some problems, so I changed the plan. The updated plan was for DeepChem to support integrations with Deep Graph Library (DGL) or PyTorch Geometric (PyG), which are often used when building custom GNNs for molecular or inorganic crystal properties. In addition, I kept updating the documentation and infrastructure of DeepChem throughout the GSoC period. I explain the details of these tasks below.

Build the experimental JAXChem project

Repository : https://github.com/deepchem/jaxchem
Docs : https://jaxchem.readthedocs.io/en/latest/index.html

From June to mid-July, I built the experimental JAXChem project. As a starting point, I implemented two variants of the GCN model (a sparse pattern and a pad pattern) using Haiku. The details of this implementation are in the report linked below, where I wrote up the pull requests I submitted and the performance comparisons between the JAX models, DeepChem, and DGL.

GSoC Brief Report during 1st evaluation period
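
To make the two patterns concrete, here is a minimal pure-Python sketch of the neighbor-aggregation step at the heart of a GCN layer. It is not the JAXChem implementation. The pad pattern stores a graph as a dense adjacency matrix padded to a fixed number of nodes, while the sparse pattern stores an edge list of (source, destination) index pairs. The function names, the toy graph, and the use of plain sum aggregation without normalization or learned weights are all assumptions made for illustration.

```python
from typing import List, Tuple

Features = List[List[float]]


def aggregate_pad(adj: List[List[int]], feats: Features) -> Features:
    """Pad pattern: dense (padded) adjacency matrix times node features."""
    n, dim = len(adj), len(feats[0])
    out = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):  # visits every node pair, including padding
            if adj[i][j]:
                for k in range(dim):
                    out[i][k] += feats[j][k]
    return out


def aggregate_sparse(edges: List[Tuple[int, int]], feats: Features, n: int) -> Features:
    """Sparse pattern: scatter-add source features onto destination nodes."""
    dim = len(feats[0])
    out = [[0.0] * dim for _ in range(n)]
    for src, dst in edges:  # visits only the edges that exist
        for k in range(dim):
            out[dst][k] += feats[src][k]
    return out


# Toy path graph 0 - 1 - 2, padded to 4 nodes (node 3 is padding).
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.0, 0.0]]
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
# The same graph as directed (src, dst) edges, both directions listed.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]

pad_out = aggregate_pad(adj, feats)
sparse_out = aggregate_sparse(edges, feats, n=4)
print(pad_out == sparse_out)
```

Both functions compute the same neighbor sums. The difference is cost: the pad pattern does work proportional to the square of the padded node count, while the sparse pattern does work proportional to the number of edges.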

However, I ran into performance issues with the sparse pattern GCN model, which is the scalable approach used in DGL and PyG. I also felt that JAX and Haiku were not yet mature enough to build a high-level API on, like DeepChem's KerasModel or TorchModel. This is because JAX sometimes introduces breaking changes even in a small version bump such as 0.1.69 -> 0.1.70, and it is hard to keep the JAX version required by Haiku in sync with the one available in Google Colab. Many users, including me, don't have a GPU environment, so being able to set up the Google Colab environment easily is really important. Therefore, I stopped the initial plan and decided to support integrations with DGL and PyG in DeepChem.

Support integrations with DGL or PyG

From mid-July to the end of August, I worked on supporting integrations with DGL or PyG. I chose this for two reasons: DGL and PyG are the de facto standard tools for building new graph convolutional networks for molecular graphs or inorganic crystal graphs, and the way DeepChem handled graph data was really complex compared with these tools. (See the issue for details.) The details of my implementations are below.

Implement the new general graph class based on PyG
- PRs : https://github.com/deepchem/deepchem/pull/2012,
  https://github.com/deepchem/deepchem/pull/2045
- This resolves the redundant graph classes (like ConvMol or WeaveMol)
- Previously, graph classes were designed for each specific GCN model
Implement the CGCNN model as a sample using DeepChem with DGL
- PRs : https://github.com/deepchem/deepchem/pull/2045,
  https://github.com/deepchem/deepchem/pull/2089
- This is the first support for using DeepChem with DGL
- This is the first support for deep learning models for inorganic crystals
Implement the GAT model as a sample using DeepChem with PyG
- PRs : https://github.com/deepchem/deepchem/pull/2109
- This is the first support for using DeepChem with PyG
- This is an overhaul of the featurizer for GCN models
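
To illustrate what a general graph class buys over model-specific ones, here is a rough sketch of a model-agnostic container that follows the PyG convention of node features plus an edge index of shape (2, num_edges). The class name `SimpleGraph` and its fields are hypothetical and written in plain Python for illustration. This is not the class added in the PRs above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimpleGraph:
    """Hypothetical model-agnostic graph container (illustration only)."""

    # One feature row per node: (num_nodes, num_node_features)
    node_features: List[List[float]]
    # Two rows, [sources, destinations]: (2, num_edges)
    edge_index: List[List[int]]
    # Optional, one feature row per edge: (num_edges, num_edge_features)
    edge_features: Optional[List[List[float]]] = None

    def __post_init__(self) -> None:
        # Validate shapes once, so every downstream model can rely on them.
        if len(self.edge_index) != 2 or len(self.edge_index[0]) != len(self.edge_index[1]):
            raise ValueError("edge_index must have shape (2, num_edges)")
        for idx in self.edge_index[0] + self.edge_index[1]:
            if not 0 <= idx < self.num_nodes:
                raise ValueError(f"edge_index refers to unknown node {idx}")
        if self.edge_features is not None and len(self.edge_features) != self.num_edges:
            raise ValueError("edge_features needs one row per edge")

    @property
    def num_nodes(self) -> int:
        return len(self.node_features)

    @property
    def num_edges(self) -> int:
        return len(self.edge_index[0])


# Toy three-atom graph with one-hot node features; each bond is stored in both directions.
graph = SimpleGraph(
    node_features=[[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    edge_index=[[0, 1, 0, 2], [1, 0, 2, 0]],
)
print(graph.num_nodes, graph.num_edges)
```

Because a container like this is not tied to any one model, a featurizer can produce it once and it can then be converted to whatever a DGL or PyG model expects, instead of each GCN model needing its own graph class.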

I believe these new features will lead to more maintainable code and attract more users. Other members' work, like Peter's TorchModel and Nathan's materials dataset utilities, also boosted my work. I appreciate their work and I'm happy to collaborate with the other members.

Update the documents and infrastructures

When I started contributing to DeepChem, I recognized that its main problems were the difficult installation and the outdated documentation. (See this issue thread.) These problems are barriers for many new users, so I wanted to improve the situation. The details of my work are below.

Introduce the automatic docker build system
- PRs: https://github.com/deepchem/deepchem/pull/1917
- Now, we push the new image to DockerHub automatically
Set up the build configuration for PyPI packages
- PRs: https://github.com/deepchem/deepchem/pull/1986
- Now, DeepChem can be installed via PyPI
Add type annotations and docstrings
- Utils: https://github.com/deepchem/deepchem/pull/2031
- Docking and HyperOpt: https://github.com/deepchem/deepchem/pull/2027
- Metrics: https://github.com/deepchem/deepchem/pull/2098
- Dataset: https://github.com/deepchem/deepchem/pull/2105
- DataLoader: https://github.com/deepchem/deepchem/pull/2103
Manage the scripts for setting up the DeepChem environment in Google Colab
- PRs: https://github.com/deepchem/deepchem/pull/1870, https://github.com/deepchem/deepchem/pull/2066

Remaining tasks

I have some remaining tasks, so I will keep contributing to DeepChem.

Clean up the previous GCN models, such as MPNN and Weave, using the new graph class
Update the documentation, especially for the integrations with DGL or PyG

Acknowledgements

Special thanks to Bharath, Peter, and Nathan for their attentive reviews and advice throughout the summer. It was great fun to collaborate with you guys! I hope to keep contributing to DeepChem.