Your Genome on the Cloud: Big Data Challenges in Personalized Medicine
Submitted by Ramesh Hariharan on Tuesday, 12 June 2012
Section: Data Analytics
Technical level: Beginner
Session type: Lecture
We are at the threshold of a major revolution in health care. Thanks to two decades of explosive research into tools and techniques that interrogate living cells at the molecular level, doctors will soon have an invaluable addition to their arsenal for diagnosing and curing disease: the genome of the patient. Several success stories have already emerged. In one, a little boy underwent several futile operations before sequencing his genome revealed a defect in his immune system, which was then treated with a bone-marrow transplant.
The genome and its associated paraphernalia are quite large, which naturally calls for Big Data techniques to manage and deliver genomic information to clinicians, consumers, and researchers. To give a sense of scale, sequencing machines generate upwards of 150 GB of compressed data for a single individual, and analysing this data is equivalent to sifting through 30 finely shredded copies of a 200,000-page telephone directory!
The next few years will see all of this move from the research lab to the hospital and eventually affect all our lives. The goal of this session is to introduce attendees to this area and share the excitement that the next few years hold in store.
The session will have two parts.
The first part will describe the evolution of genomic measurement over the last two decades, survey the current state of the art, explain how ever-falling costs and growing understanding are leading to significant impact on disease diagnosis and cure, and discuss how the world will look in the next 5 years, when large numbers of people have had their genomes sequenced.
This will lead to the second part, where we will describe the Big Data techniques and challenges in handling large volumes of genomic data: what computations need to be run, what queries need to be handled, how data needs to flow from the site of generation to the site of consumption, and so on. The techniques covered will include clever approaches to text indexing, fast string-matching algorithms, the use of special hardware paradigms (SIMD/GPUs), Hadoop-based pipelines for large-volume processing, and visualization methods.
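To give a flavour of what text indexing and string matching mean in this setting, here is a minimal sketch, assuming a toy reference sequence and exact matching only. The function names (build_kmer_index, find_read), the sample sequence, and the choice of k are all hypothetical, picked purely for illustration; real pipelines work on billions of characters, tolerate mismatches, and typically rely on far more compact index structures.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every length-k substring (k-mer) of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def find_read(reference, index, read, k):
    """Return start positions where the read matches the reference exactly.

    The first k characters of the read act as a seed: the index gives the
    candidate positions, and each candidate is then verified in full.
    """
    if len(read) < k:
        raise ValueError("read must be at least k characters long")
    seed = read[:k]
    hits = []
    for pos in index.get(seed, []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


# Toy data, invented for this illustration only.
reference = "ACGTACGTTAGCACGTTAGG"
index = build_kmer_index(reference, k=4)
print(find_read(reference, index, "ACGTTAG", k=4))  # prints [4, 12]
```

The index is built once and then reused for every read, so each lookup touches only the handful of positions that share the seed rather than scanning the whole reference.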
I am a Computer Scientist/Entrepreneur, a founder of Strand Life Sciences and an Adjunct Faculty member at the Indian Institute of Science. My work spans research on algorithmic problems, building handheld devices (the Simputer) and speech synthesis systems (Dhvani), and developing various software platforms for biological data processing (GeneSpring, Avadis NGS) as part of Strand Life Sciences.