Rethinking the use of NLP methods in the life sciences

A longstanding dream in computational biology has been to use natural language processing (“NLP”) techniques to “read” the vast biological literature and algorithmically generate new ideas for potential therapeutics. There have been a number of past attempts to make this scheme work (for example). I’ve mostly been quite skeptical of these projects for a number of reasons. For one, the engineering lift required to build such a system was considerable: until the last few years, making a pipeline to handle large numbers of documents would have required substantial expertise. In addition, it wasn’t clear how the various components of a biomedical NLP pipeline could be combined to do interesting things. If, for example, I wanted to find a potential druggable target for pancreatic cancer, I’d likely have to build a custom plugin on top of the core engine to pull out the understanding needed. This seemed like work I could likely do better with a few Google searches.

That said, the last few years have seen considerable progress in the NLP domain that’s making me rethink my usual skepticism. The recent surge seems to have been catalyzed by the invention of the transformer, a powerful new deep learning primitive. The transformer is a “feed-forward” bit of circuitry that nevertheless learns the context of each word through a “self-attention” mechanism. The advent of the transformer has made it feasible to build larger, more sophisticated deep networks for NLP. In particular, BERT (Bidirectional Encoder Representations from Transformers) seems to have achieved a major step forward by using transformers in conjunction with a novel pre-training mechanism based on filling in masked-out words in the input. These new techniques have allowed deep NLP models to make considerable progress on challenging benchmarks in the space such as GLUE and SuperGLUE. More qualitatively, deep language models are now capable of some surprising feats, such as generating coherent paragraphs de novo from a prompt.
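To make the “self-attention” idea a little more concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. It is only an illustration under simplifying assumptions: the projection matrices `w_q`, `w_k`, `w_v` are random rather than learned, the sizes are arbitrary, and there is no multi-head structure, masking, or positional encoding as in a full transformer.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x has shape (seq_len, d_model); each projection has shape (d_model, d_head).
    Returns the attended values and the (seq_len, seq_len) attention weights.
    """
    q = x @ w_q  # queries
    k = x @ w_k  # keys
    v = x @ w_v  # values
    # Every position scores every other position; no recurrence is involved.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Toy input: 5 "word" vectors of dimension 8 (illustrative sizes only).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` sums to one and says how much that word draws on every other word in the sequence, which is how a purely feed-forward layer can still pick up context.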

Some early research has already started considering how to extend this work to the scientific domain. SciBERT and BioBERT use variants of BERT to tackle various tasks in the scientific and biological NLP domains, and achieve considerable performance boosts over baseline techniques with a unified architecture. It’s intriguing to read these papers and consider whether the time might be right to attempt building NLP pipelines for use in life science research more broadly.

I suspect there are still a number of roadblocks though. While datasets like PubMed RCT now exist, there isn’t yet an open source searchable repository of biomedical research papers. Due to the life sciences’ long history of publishing in closed journals like Cell, Nature, and Science, it’s hard to find a good repository of research. Thankfully, this is beginning to change with the rapid growth of bioRxiv. Many newer research papers are posted as preprints to bioRxiv, which means it might be possible to gather a sizeable dataset of newer biological research papers. Unfortunately, no comparable resource yet exists for open source textbooks, though Wikipedia might suffice. It might then be useful to construct techniques to extract meaningful knowledge graphs from this repository, perhaps using techniques from recent papers. Performing open-ended hypothesis generation seems trickier, but might be feasible by training a suitably powerful language model such as GPT-2, which can generate answers when given queries.

More far-fetched would be to create a multi-modal deep architecture capable of using protein embeddings and molecular embeddings to add semantic meaning to the word representations of molecules and proteins mentioned in the literature. However, building a suitably sophisticated deep model for multimodal reasoning still seems tricky with the current state of the art. In the meantime though, there’s still plenty that can be done with existing tools. New dataset contributions to DeepChem for NLP tasks would be very welcome additions. The same goes for implementations of good deep NLP architectures that can be used as library building blocks.


Thanks for sharing your thoughts. In addition to the direct application of NLP techniques to life science literature, I’ve noticed a growing interest in applying NLP techniques to structured prediction/generation on objects like string- or graph-based representations of molecules. For example, Pre-training Graph Neural Networks explores unsupervised/self-supervised techniques for learning molecule representations from large-scale unlabeled data. I think ultimately this could become something like BERT for learning-based molecule fingerprints.

Do you also follow this line of work, and what do you think of it?
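As a rough illustration of what “BERT for molecules” could look like on string representations, here is a minimal sketch that tokenizes a SMILES string and masks random tokens, in the spirit of masked-word pre-training. Everything here is an assumption for illustration: the regex covers only a simplified subset of SMILES, `tokenize` and `mask_tokens` are hypothetical helper names, and real BERT also sometimes swaps in random tokens or keeps the original instead of always writing `[MASK]`. The graph-based pre-training mentioned above works on atoms and bonds rather than strings, so this is an analogue, not that paper's method.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption, not a complete grammar):
# bracket atoms, two-letter halogens, two-digit ring closures, then single characters.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@*$]")


def tokenize(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    # Refuse inputs that the simplified pattern cannot fully cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize: {smiles!r}")
    return tokens


def mask_tokens(tokens, mask_prob=0.15, seed=None, mask_token="[MASK]"):
    """Return (masked_tokens, labels); labels hold the hidden token or None."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)  # the model would be trained to recover this
        else:
            masked.append(tok)
            labels.append(None)  # not part of the training loss
    return masked, labels


tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
```

A model trained to fill in the masked positions has to learn something about valence and ring structure, which is the hope behind using its hidden states as a learned fingerprint.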


Should we think about adding a transformer implementation to DeepChem?

@peastman Yes, I definitely think this would be a high-value addition to DeepChem. In particular, it might make it easier to carry some of these NLP tricks over to the molecular domain.

@mufeili Glad to see your first post on the forums! Yes, I definitely agree that there’s a big overlap between NLP methods and molecular methods. I think this is worth drilling down on. It ought to be possible to adapt some of the advances in NLP to molecular machine learning.


I was planning on working on the Molecular-Transformer as I find time during the GSoC project.

A few other papers I do want to replicate:

  1. Learning Multimodal Graph-to-Graph Translation for Molecular Optimization
  2. A graph-convolutional neural network model for the prediction of chemical reactivity
  3. End-to-End Differentiable Learning of Protein Structure

I think it depends on the structure of the data you’re dealing with.

I’ve had some success applying ULMFiT, an NLP transfer learning technique, to genomic sequence classification problems (repo).

Self-supervised, language-model-style pre-training for sequence data seems to be a very powerful technique for learning an initial vector representation of your data.
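For readers who haven’t seen this style of pre-training on sequences, here is a minimal sketch of how unlabeled DNA could be turned into next-token prediction examples. The k-mer tokenization, the helper names (`kmer_tokenize`, `build_vocab`, `next_token_pairs`), and the toy sequence are all assumptions for illustration, not a description of the repo mentioned above.

```python
def kmer_tokenize(seq, k=3, stride=1):
    """Split a nucleotide string into k-mers taken every `stride` characters."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens):
    """Map each distinct token to an integer id (sorted for reproducibility)."""
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}


def next_token_pairs(token_ids, context=4):
    """Build (context window, next token) examples for language-model training."""
    return [
        (token_ids[i:i + context], token_ids[i + context])
        for i in range(len(token_ids) - context)
    ]


# Toy sequence; no labels are needed, the sequence supervises itself.
tokens = kmer_tokenize("ACGTACGTTAGC", k=3, stride=1)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
pairs = next_token_pairs(ids, context=4)
```

Note that with `stride=1` neighbouring k-mers overlap, so part of the next token is already visible in the context; a larger stride makes the prediction task harder at the cost of fewer training examples.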