DeepChem Minutes 1/31/2020

bharath · February 5, 2020, 5:11pm

Date: 1/31/2020
Attendees: Bharath, Peter
Summary: Bharath started off the meeting by summarizing the discussions he had over the last week with experienced chemoinformatics and ML engineers working in the biotech/pharma industry, whom he asked about their experiences with DeepChem.

There were two major bits of feedback. The first emphasized the need to make DeepChem maximally interoperable with existing machine learning tools and infrastructure. The easier it is to use DeepChem with existing tools like pandas, scikit-learn and others, the more value the library will add to the community. The second was that while DeepChem built up significant traction a few years ago, users drifted away because they felt that the library wasn’t well supported. Bharath asked one team what it would take for them to be comfortable using DeepChem in their company, and they said they’d have to feel confident the project wouldn’t vanish or lose support.

We had a good discussion about how to improve interoperability. One idea that emerged was that the Dataset package should likely be split out into a small standalone package in the DeepChem 3 architecture. There was also a good discussion on the forums about whether it would be feasible to remove Dataset altogether. Unfortunately, there are a number of competing standards for dataset storage in the ML community, including:

PyTorch Datasets
TensorFlow Datasets
Apache Arrow
Apache Parquet
CSV
HDF5

The best solution for interoperability for us might be to keep Dataset as a small standalone package which provides easy interconversions between different input formats. This adds some small conversion overhead, but the new package should be a generally useful tool which might find some use outside DeepChem.
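
As a rough illustration of the "small standalone package" idea, here is a minimal sketch of a hub format with converters to and from other representations. It uses only the Python standard library and covers just CSV text and row-oriented records. The `TabularDataset` class and all of its method names are hypothetical. They are not part of the DeepChem API and were not specified in the meeting.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TabularDataset:
    """Hypothetical hub format: named columns of equal length.

    This is an illustrative stand-in, not a DeepChem class.
    """
    columns: Dict[str, List[str]]

    def __post_init__(self) -> None:
        # Reject ragged input early so every converter can assume a rectangle.
        lengths = {len(values) for values in self.columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")

    def __len__(self) -> int:
        # Number of rows; zero when there are no columns at all.
        return len(next(iter(self.columns.values()), []))

    @classmethod
    def from_records(cls, records: List[Dict[str, str]]) -> "TabularDataset":
        """Build from row-oriented records (dicts sharing the same keys)."""
        names = list(records[0]) if records else []
        return cls({name: [row[name] for row in records] for name in names})

    def to_records(self) -> List[Dict[str, str]]:
        """Convert back to row-oriented records."""
        names = list(self.columns)
        return [{name: self.columns[name][i] for name in names}
                for i in range(len(self))]

    @classmethod
    def from_csv(cls, text: str) -> "TabularDataset":
        """Parse CSV text with a header row; values stay as strings."""
        return cls.from_records(list(csv.DictReader(io.StringIO(text))))

    def to_csv(self) -> str:
        """Serialize to CSV text with a header row."""
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=list(self.columns),
                                lineterminator="\n")
        writer.writeheader()
        writer.writerows(self.to_records())
        return buffer.getvalue()


# Round trip: CSV text -> hub format -> CSV text.
csv_text = "smiles,label\nCCO,1\nc1ccccc1,0\n"
dataset = TabularDataset.from_csv(csv_text)
round_tripped = dataset.to_csv()
```

With one hub format, each supported format needs only a `from_*` and a `to_*` converter rather than a converter for every pair of formats. The conversion overhead mentioned above is the cost of that design.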

We also discussed the question of long-term sustainability for DeepChem. Peter suggested that the best path to long-term sustainability would be to build up enough active contributors that the loss of any one contributor wouldn’t kill the project. This will just take time and effort as DeepChem becomes a more useful tool for the community. Bharath suggested perhaps applying for grants, but we need to do more groundwork to build up a case for funding.

Peter then gave an update on ongoing work converting to TensorFlow 2.X. Most of the tests are now passing! There are a few more systems left to be converted, such as the RL codebase (which was never converted to eager mode), and there are some mysterious numerical failures in the graph convolutional code which will take some doing to figure out. Bharath will also start to chip in on the interconversion work over the coming week.