A longstanding dream in computational biology has been to use natural language processing (“NLP”) techniques to “read” the vast biological literature and algorithmically generate new ideas for potential therapeutics. There have been a number of past attempts to make this scheme work (for example). I’ve mostly been quite skeptical of these projects for a number of reasons. For one, the engineering lift required to build such a system was considerable; until the last few years, making a pipeline that could handle large numbers of documents would have required significant expertise. In addition, it wasn’t clear how the various components of a biomedical NLP pipeline could be leveraged in unison to do interesting things. If, for example, I wanted to find a potential druggable target for pancreatic cancer, I’d likely have to build a custom plugin on top of the core engine to pull out the understanding needed. This seemed like work I could likely do better with a few Google searches.
That said, the last few years have seen considerable progress in the NLP domain that’s making me rethink my usual skepticism. The recent surge seems to have been catalyzed by the invention of the transformer, a powerful new deep learning primitive. The transformer is a “feed-forward” bit of circuitry that nevertheless allows the model to learn context for each word through the use of a “self-attention” mechanism. The advent of the transformer has made it feasible to build larger, more sophisticated deep networks for NLP. In particular, the creation of BERT (Bidirectional Encoder Representations from Transformers) seems to have been a major step forward, pairing transformers with a novel pre-training objective based on filling in masked-out words in the input. These advances have allowed deep NLP models to make considerable progress on challenging benchmarks in the space such as GLUE and SuperGLUE. More qualitatively, deep language models are now capable of some surprising feats, such as generating coherent paragraphs de novo from a prompt.
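To make the self-attention idea a little more concrete, here’s a minimal numpy sketch of single-head scaled dot-product attention. The toy dimensions and random weights are purely for illustration; a real transformer stacks many such heads with learned weights, residual connections, and feed-forward layers on top.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d_model) token embeddings
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise token affinities
    weights = softmax(scores, axis=-1)   # each token attends over every token
    return weights @ V                   # context-aware token representations

# Toy example: 4 tokens, model dimension 8, random (untrained) weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```

The key point is that the whole computation is a few matrix multiplies with no recurrence, which is what makes it easy to scale up and parallelize.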
Some early research has already started considering how to extend this work to the scientific domain. SciBERT and BioBERT use variants of BERT to tackle various tasks in the scientific and biomedical NLP domains, achieving considerable performance boosts over baseline techniques with a single, unified architecture. It’s intriguing to read these papers and consider whether the time might be right to attempt building NLP pipelines for use in life science research more broadly.
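Part of what makes this feel more plausible now is how little code it takes to get contextual embeddings out of one of these pretrained scientific models. Here’s a rough sketch using the Hugging Face transformers library; the checkpoint name below (`allenai/scibert_scivocab_uncased`) is one publicly hosted SciBERT variant at the time of writing, and a BioBERT checkpoint could be swapped in the same way.

```python
from transformers import AutoTokenizer, AutoModel

# Load a pretrained scientific BERT variant and its matching tokenizer.
model_name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentence = "KRAS mutations drive a large fraction of pancreatic cancers."
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model(**inputs)

# One contextual embedding per token in the input sentence.
print(outputs.last_hidden_state.shape)
```

These per-token vectors are the raw material downstream tasks (named entity recognition, relation extraction, sentence classification) build on.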
I suspect there are still a number of roadblocks though. While datasets like PubMed RCT now exist, there isn’t yet an open source, searchable repository of biomedical research papers. Due to the life sciences’ long history of publishing in closed journals like Cell, Nature, and Science, it’s hard to find a good repository of research. Thankfully, this is beginning to change with the rapid growth of bioRxiv. Many newer research papers are posted as preprints to bioRxiv, which means it might be possible to gather a sizeable dataset of newer biological research papers together. Unfortunately, no comparable resource yet exists for open source textbooks, though Wikipedia might suffice as a substitute. It might then be useful to construct techniques to extract meaningful knowledge graphs from this repository, perhaps using techniques from recent papers. Performing open-ended hypothesis generation seems trickier, but might be feasible by training a suitably powerful language model such as GPT-2 that can answer questions given natural language queries.
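To give a flavor of what prompt-driven generation looks like, here’s a sketch using the stock GPT-2 checkpoint through the transformers pipeline API. The prompt is purely illustrative; a model fine-tuned on bioRxiv abstracts or similar biomedical text (not shown here) would be needed before the completions carried any scientific weight.

```python
from transformers import pipeline

# Off-the-shelf GPT-2; fine-tuning on biomedical text is left as an exercise.
generator = pipeline("text-generation", model="gpt2")

prompt = "A promising druggable target for pancreatic cancer is"
completions = generator(prompt, max_length=60, do_sample=True,
                        num_return_sequences=3)
for c in completions:
    print(c["generated_text"])
```

The output of the untuned model will read as fluent nonsense, which is exactly the gap that domain-specific pre-training and careful evaluation would have to close.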
More far-fetched would be to create a multi-modal deep architecture capable of using protein embeddings and molecular embeddings to add semantic meaning to the word representations of molecules and proteins mentioned in the literature. However, building a suitably sophisticated deep model for multimodal reasoning still seems tricky with the current state of the art. In the meantime, though, there’s still plenty that can be done with existing tools. New dataset contributions to DeepChem for NLP tasks would be very welcome additions, as would implementations of good deep NLP architectures that can be used as library building blocks.
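As a very speculative sketch of what “multi-modal” might look like at its crudest, the snippet below pairs a DeepChem circular-fingerprint embedding of a molecule with a mean-pooled BERT embedding of a sentence mentioning it, and simply concatenates the two. A real system would learn a joint projection into a shared space rather than concatenating raw vectors, so treat this only as an illustration of the plumbing involved.

```python
import numpy as np
import torch
import deepchem as dc
from transformers import AutoTokenizer, AutoModel

# Molecular embedding: 2048-bit circular (ECFP) fingerprint for aspirin.
featurizer = dc.feat.CircularFingerprint(size=2048)
mol_vec = featurizer.featurize(["CC(=O)OC1=CC=CC=C1C(=O)O"])[0]

# Text embedding: mean-pooled BERT token vectors for a sentence about aspirin.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Aspirin irreversibly inhibits COX-1.", return_tensors="pt")
with torch.no_grad():
    text_vec = model(**inputs).last_hidden_state.mean(dim=1).squeeze().numpy()

# Naive fused representation: molecule features alongside text features.
joint = np.concatenate([mol_vec, text_vec])
print(joint.shape)  # (2048 + 768,)
```

Even this crude pairing hints at the kind of building blocks DeepChem could expose: featurizers on one side, language model wrappers on the other, and dataset loaders that tie literature mentions to molecular structures.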