To Students, Professionals, and Researchers interested in Data Mining and Knowledge Discovery in Biology
This 733-page book examines the concepts, problems, progress, and trends in developing and applying data mining techniques in genome biology, a rapidly growing field of study. By studying the concepts and case studies presented in the book, readers can gain significant insight and develop practical solutions in future biological data mining projects.
Modern biology has become an information science. Since Sanger’s invention of a DNA sequencing method in the late seventies, public repositories of genomic sequences have been growing exponentially, doubling in size every sixteen months, a rate often compared to Moore’s Law, the growth of semiconductor transistor densities in CPUs. In the nineties, the public-private race to sequence the human genome further intensified the fervor to generate high-throughput biomolecular data from highly parallel and miniaturized instruments. Today, sequencing data from thousands of genomes, including plant, mammalian, and microbial genomes, are accumulating at an unprecedented rate. The advent of second-generation DNA sequencing instruments, high-density cDNA microarrays, tandem mass spectrometers, and high-power NMRs has fueled the growth of molecular biology into a wide spectrum of disciplines such as personalized genomics, functional genomics, proteomics, metabolomics, and structural genomics. Few experiments in molecular biology and genetics performed today can afford to ignore the vast amount of publicly accessible biological information. Suddenly, molecular biology and genetics have become data rich.
Biological data mining is a data-guzzling turbo engine for post-genomic biology, driving the competitive race towards unprecedented biological discovery opportunities in the 21st century. Classical bioinformatics emerged from the study of macromolecules in molecular biology, biochemistry, and biophysics. Analysis, comparison, and classification of DNA and protein sequences were the dominant themes of bioinformatics in the early nineties. Machine learning mainly focused on predicting gene and protein functions from their sequences and structures. The understanding of cellular functions and processes underlying complex diseases was out of reach. Bioinformatics scientists were a rare breed, and their contribution to molecular biology and genetics was considered marginal, because the computational tools then available for biomolecular data analysis were far more primitive than the array of experimental techniques and assays available to life scientists. Today, we are witnessing the reversal of these past trends. Diverse sets of data types that cover a broad spectrum of genotypes and phenotypes, particularly those related to human health and diseases, have become available. Many interdisciplinary researchers, including applied computer scientists, applied mathematicians, biostatisticians, biomedical researchers, clinical scientists, and biopharmaceutical professionals, have discovered in biology a gold mine of knowledge leading to many exciting possibilities: unraveling the tree of life, harnessing the power of microbial organisms for renewable energy, finding new ways to diagnose disease early, and developing new therapeutic compounds that save lives. Much of the experimental high-throughput biology data are generated and analyzed “in haste”, leaving plenty of opportunities for knowledge discovery even after the original data are released.
Most of the bets on the race to separate the wheat from the chaff have been placed on biological data mining techniques. After all, when easy, straightforward, first-pass data analysis hasn’t yielded novel biological insights, data mining techniques must be able to help, or so many presumed.
In reality, biological data mining is still very much an ‘art’, successfully practiced by a few bioinformatics research groups that occupy themselves with solving real-world biological problems. Unlike data mining in business, where the major concerns are often related to the bottom line (profit), the goals of biological data mining can be as diverse as the spectrum of biological questions that exist. In the business domain, association rules discovered between sales items are immediately actionable; in biology, any unorthodox hypothesis produced by computational models has to be first red-flagged and is lucky to be validated experimentally. In the internet business domain, classification, clustering, and visualization of blogs, network traffic patterns, and news feeds add significant value for regular internet users who are unaware of high-level patterns that may exist in the data set; in molecular biology and genetics, any clustering or classification of the data presented to biologists may promptly elicit questions like “great, but how and why did it happen?” or “how can you explain these results in the context of the biology I know?” The majority of general-purpose data mining techniques do not take into consideration prior domain knowledge of the biological problem, which often leads them to underperform hypothesis-driven biological investigative techniques. The high level of variability of measurements inherent in many types of biological experiments or samples, the general unavailability of experimental replicates, the large number of hidden variables in the data, and the high correlation of biomolecular expression measurements also constitute significant challenges in the application of classical data mining methods in biology. Many biological data mining projects are attempted and then abandoned, even by experienced data mining scientists.
In extreme cases, large-scale biological data mining efforts are jokingly labeled as fishing expeditions and dismissed in national grant proposal review panels.
This book represents a culmination of our past research efforts in biological data mining. Through this book, we wanted to showcase a small but noteworthy sample of successful projects involving data mining and molecular biology. Each chapter of the book is authored by a distinguished team of bioinformatics scientists whom we invited to offer the readers the widest possible range of application domains. To ensure high quality standards, each contributed chapter went through standard peer review and a round of revisions. Contributed chapters have been grouped into five major sections. The first section, entitled Sequence, Structure, and Function, collects contributions on data mining techniques designed to analyze biological sequences and structures with the objective of discovering novel functional knowledge. The second section, on Genomics, Transcriptomics, and Proteomics, contains studies addressing emerging large-scale data mining challenges in analyzing high-throughput ‘omics’ data. The chapters in the third section, entitled Functional and Molecular Interaction Networks, address emerging system-scale molecular properties and their relevance to cellular functions. The fourth section is about Literature, Ontology, and Knowledge Integrations, and it collects chapters related to knowledge representation, information retrieval, and data integration for structured and unstructured biological data. The contributed works in the fifth and last section, entitled Genome Medicine Applications, address emerging biological data mining applications in medicine.
We believe this book can serve as a valuable guide to the field for graduate students, researchers, and practitioners. We hope that the wide range of topics covered will allow readers to appreciate the extent of the impact of data mining in molecular biology and genetics. For us, research in data mining and its applications to biology and genetics is fascinating and rewarding. It may even help save human lives one day. This field offers great opportunities and rewards if one is prepared to learn molecular biology and genetics, design user-friendly software tools under the proper biological assumptions, and validate all discovered hypotheses rigorously using appropriate models.
In closing, we would like to thank all the authors who contributed a chapter to the book. We are also indebted to Randi Cohen, our outstanding publishing editor. Randi efficiently managed timelines and deadlines, gracefully handled the communication with the authors and the reviewers, and took care of every little detail associated with this project. This book would not have been possible without her.
Our thanks also go to our families for their support throughout the book project.
Jake Y. Chen and Stefano Lonardi
"Biological Database Modeling" (2007) Ed. by Jake Y Chen and Amandeep Sidhu, Artech House.