Available != Usable. How public data lakes can accelerate drug discovery

Submitted May 21, 2020

Developing a drug takes about a decade, and the current crisis makes that delay harder to accept than ever. It forces us to ask how better drugs could be made faster and at lower cost. Research data management is a mess right now: it is supported only by electronic lab notebooks (ELNs), which can lead to false discoveries, and researchers spend more time finding and cleaning data than asking critical research questions. To demonstrate this, we created a data lake of 40,000 datasets with millions of samples from publicly available data.

The Gene Expression Omnibus (GEO), maintained by NCBI, is a public repository for the free distribution of next-generation sequencing and other forms of high-throughput functional genomics data submitted by researchers all across the world. GEO provides keyword-based search. The results it returns are diverse and extensive, and nearly impossible for a researcher to go through manually. They are also matched only on keywords present in the experiment design or the title of a study, which does not convey the full complexity of that study. In addition, the data is not cleaned or curated, so it is not ready to analyse. There is a need to centralize publicly available data and learn from it, so that the next experiments can be designed from existing data rather than from yet another experiment.

Key Takeaways:

Biomedical data is distributed, uncurated and very messy to navigate.
There is a lot of value in making the data searchable and curated: it can help bring down the cost and time spent in finding new drugs.
There is no need to base your first experiment on anecdotal information. Use data!
How to use Elasticsearch to build a search engine over a million samples and a trillion data points.

Outline

The Gene Expression Omnibus (GEO), maintained by NCBI, is a public repository for the free distribution of next-generation sequencing and other forms of high-throughput functional genomics data submitted by researchers all across the world. There are around 60,000 microarray and high-throughput sequencing studies available on GEO; however, there is no effective way to find datasets of interest. The platform provides keyword-based search. The results returned by GEO are diverse and extensive, and nearly impossible for a researcher to go through manually. Moreover, the results are matched on keywords present in the experiment design or the title of a study, which does not convey the full complexity of that study.

Researchers may have a gene signature of interest, either obtained through an experiment or heavily cited in the literature. We created a system that takes this gene signature and finds studies across the complete GEO database in which a similar set of genes is co-expressed. In this way, the search for relevant datasets takes into account the data inside each dataset rather than relying on the external information provided by GEO. The user can further refine the recommendations by giving keywords, which are looked up in the publications linked to these datasets. By combining the actual gene expression values from a dataset with the textual information in the publication linked to it, we get a powerful tool that generates relevant suggestions. We show this with the results obtained for two signatures, representative of two different biological conditions, which were used to validate the tool.
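The signature-based search described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: it assumes co-expression is measured as the mean pairwise Pearson correlation of the signature genes within a dataset, and that keyword refinement is a plain substring match against the linked publication text. The function names, the record layout ("expression", "text") and the toy dataset IDs are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def coexpression_score(expression, signature):
    """Mean pairwise correlation of the signature genes found in one dataset.

    expression: dict mapping gene symbol -> list of per-sample values.
    Returns None when no pair of signature genes can be correlated.
    (Assumed scoring rule, chosen for illustration only.)
    """
    present = [gene for gene in signature if gene in expression]
    correlations = []
    for gene_a, gene_b in combinations(present, 2):
        r = pearson(expression[gene_a], expression[gene_b])
        if r is not None:
            correlations.append(r)
    if not correlations:
        return None
    return sum(correlations) / len(correlations)


def rank_datasets(datasets, signature, keywords=()):
    """Rank datasets by co-expression score, keeping only keyword matches.

    datasets: dict mapping dataset id -> {"expression": ..., "text": ...},
    where "text" stands in for the linked publication (hypothetical layout).
    """
    ranked = []
    for dataset_id, record in datasets.items():
        text = record["text"].lower()
        if not all(keyword.lower() in text for keyword in keywords):
            continue
        score = coexpression_score(record["expression"], signature)
        if score is not None:
            ranked.append((dataset_id, score))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Toy data: the signature genes move together in toy_A but not in toy_B.
toy_datasets = {
    "toy_A": {
        "expression": {"g1": [1, 2, 3, 4, 5], "g2": [3, 5, 7, 9, 11], "g3": [3, 6, 9, 12, 15]},
        "text": "Hypoxia response in tumor cells",
    },
    "toy_B": {
        "expression": {"g1": [1, 2, 3, 4, 5], "g2": [5, 4, 3, 2, 1], "g3": [2, 4, 6, 8, 10]},
        "text": "Insulin signalling in liver",
    },
}

for dataset_id, score in rank_datasets(toy_datasets, ["g1", "g2", "g3"]):
    print(dataset_id, round(score, 3))
```

Passing keywords, for example rank_datasets(toy_datasets, ["g1", "g2", "g3"], keywords=["insulin"]), restricts the ranking to datasets whose text mentions every keyword. A real signature may mix up- and down-regulated genes, in which case a signed mean correlation would need to be replaced by something direction-aware.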

Polly provides a query and search engine that lets users find the right datasets for their analysis in its data lakes and helps them run analyses on top of those datasets. The methodology we developed facilitates the systematic curation and processing of publicly available gene expression datasets from GEO. Here we present a specific engine, AskGEO, that runs signature-based and keyword-based searches to help users identify studies related to a biological phenomenon across the entire GEO repository.
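To give a feel for what a keyword-based search over curated dataset metadata might look like, here is a small sketch that builds an Elasticsearch Query DSL request body. It only constructs the JSON; it does not talk to a cluster. The field names "abstract" and "dataset_id" are assumptions made for illustration and are not taken from Polly or AskGEO.

```python
import json


def build_dataset_query(keywords, dataset_ids=None, size=20):
    """Build a search body: all keywords must match, optionally within given datasets.

    The "abstract" and "dataset_id" fields are hypothetical index fields.
    """
    if keywords:
        # "operator": "and" requires every keyword to appear in the field.
        text_clause = {
            "match": {"abstract": {"query": " ".join(keywords), "operator": "and"}}
        }
    else:
        text_clause = {"match_all": {}}

    body = {"size": size, "query": {"bool": {"must": [text_clause]}}}

    if dataset_ids:
        # A filter clause narrows results without affecting relevance scores,
        # e.g. to the datasets a signature search has already shortlisted.
        body["query"]["bool"]["filter"] = [{"terms": {"dataset_id": list(dataset_ids)}}]
    return body


print(json.dumps(build_dataset_query(["hypoxia", "tumor"], ["toy_A", "toy_B"]), indent=2))
```

Putting the shortlist in a filter clause rather than a must clause is a design choice: the keyword match alone then drives the relevance ordering.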

Sequence of ideas:

Standardizing the processing of 40,000 datasets
Building a gene co-expression database using a Kubernetes architecture
Creating a biomedical keyword database for the GEO datasets using state-of-the-art NLP
Using gene signatures to recommend datasets using AWS Lambda
Validating the recommendations for two gene signatures and a random gene signature
Using Elasticsearch to make all of this queryable for further processing
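The validation step in the sequence above compares real signatures against a random one. One way to generalize that idea, offered here only as a sketch and not as the method used in the talk, is to score many random signatures of the same size and report how often they do at least as well as the real one. The function name, the score_fn callback and the toy scoring rule are hypothetical.

```python
import random


def empirical_p_value(score_fn, signature, gene_universe, n_random=200, seed=0):
    """Fraction of random same-size signatures scoring >= the real signature.

    score_fn: callable taking a list of genes and returning a number
              (for example, a dataset's co-expression score).
    Uses the common add-one correction so the result is never exactly 0.
    """
    rng = random.Random(seed)  # fixed seed keeps the comparison reproducible
    observed = score_fn(signature)
    universe = sorted(gene_universe)
    at_least_as_high = 0
    for _ in range(n_random):
        random_signature = rng.sample(universe, len(signature))
        if score_fn(random_signature) >= observed:
            at_least_as_high += 1
    return (at_least_as_high + 1) / (n_random + 1)


# Toy scoring rule: count how many genes fall in a small "responsive" set.
universe = [f"g{i}" for i in range(100)]
responsive = set(universe[:5])


def toy_score(genes):
    return sum(gene in responsive for gene in genes)


print(empirical_p_value(toy_score, universe[:5], universe))
```

A small value suggests the recommendation is driven by the signature itself rather than by properties any random gene set would share.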

Requirements

Laptop

Speaker bio

Jayakrishnan (JK) is a Technical Lead at Elucidata and leads a team of young engineers involved in building Polly™. JK is passionate about building scalable technology systems that simplify people’s day-to-day lives. With over a decade of experience working in various early-stage tech startups, he has built software systems for intellectual property (IP) analytics and for digitizing business processes in the insurance domain. He holds a Bachelor’s degree in Computer Science & Engineering from NIT Bhopal.

The Fifth Elephant 2020 edition