With the launch and completion of several genome projects, biomedical
researchers have accumulated a large amount of “Omics” data on top of a large
conventional body of biomedical textual information. The wealth of
such data has given rise to new discovery opportunities in biology (data-driven
knowledge discovery) and new data mining challenges in computer science
(multi-dimensional yet sparse and noisy data). Drawing from my R&D experience
over the past decade (1995-2005), I have identified three areas that are
poised for significant growth in the next decade (2005-2015):
· Semantics-level bio-computing systems: the development of
semantic-level data integration, data/object modeling, semantic knowledge
interoperation, complex querying, and data mining techniques and software
systems for biological studies.
· Networks and systems biology: the study of
high-level complex relationships of proteins, DNAs, RNAs, metabolites,
compounds, and environmental perturbations with integrated functional genomics,
proteomics, and pharmacogenomics data/information.
· Discovery informatics: the application of computational
techniques with data-driven approaches to bridge the gaps left by conventional
"hypothesis-driven" approaches to knowledge discovery problems.
What connects these three long-term research interests of our lab is the
complex computer systems and tools necessary to solve complex biology problems
from a new global perspective, all with significant drug discovery or molecular
diagnostic implications. In the short to mid term, we have organized our
research activities in these three key areas into three themes, which have
started to produce some promising results.
The first theme, declarative bio-computing, is focused on the development of
semantically interoperable biological data management methods and software
tools that are friendly to scientific experts. In particular,
our current research covers biological data integration using semantic web
technology, biological data mining using extended query languages, and
biological explorative data analysis using complex query models. Examples of our
completed research work in these directions include the development of regular
expression cartridges and sequence matching capabilities such as ODM_BLAST in
Oracle database software [1], the implementation of the Similar_Join operator to
perform general sequence matching in databases [2], the handling of complex
biological queries through query modeling [3], and the integration of
biological data using Resource Description Framework (RDF) Schema [4]. We plan
to continue this research focus, primarily to extend “biological database
operator” concepts into the relational database engine, and to develop
user-friendly data integration and querying strategies for large
high-throughput biological data sets.
The second theme, computational networks and pathways, is focused on the
emerging study of biological networks from large high-throughput Omics data
sets. In this area, we have developed computational techniques to collect
large-scale biology data sets [5], assess the quality of high-throughput protein
interaction data sets [6], integrate them with extensive public biological
databases [7], interpret statistically significant patterns of newly
characterized proteins, and build expanded models of biological pathways for
subsequent validation [8]. The core algorithm for disease-specific network
analysis, SPINNER, has received increasing attention from peer bioinformatics
groups, which subsequently led to the filing of US and worldwide patent
applications by Indiana University. Licensing and commercialization of the
technology are being negotiated with several companies.
The third theme, systems biology applications, is focused on the
disease-specific study of complex disease mechanisms, drug target discovery,
and molecular biomarkers. We have worked on several disease areas, including
the identification of essential proteins involved in Alzheimer’s disease [8],
the identification of proteins involved in the complex formation of Fancc
protein in Fanconi anemia [9], and the determination of drug resistance in
ovarian cancer treatment from proteomics profiles [10]. In these applications,
various techniques, including data management, explorative data analysis,
algorithm development, and biology knowledge curation, are all needed to arrive
at a solution. This research area not only has the highest potential impact on
future translational medicine, but also provides a rich source of research
collaboration with biologists in the nearby Indiana University School of
Medicine.
Our research has been generously supported since 2004 by the IUPUI NIH Roadmap
fund, the RSFG fund, the Purdue Summer Research fund, the National Cancer
Institute, and the Indiana Lung Cancer Working Group.
1. Stephens, S.M., et al., Oracle Database 10g: a platform for BLAST search and
Regular Expression pattern matching in life sciences. Nucleic Acids Res,
2005. 33 Database Issue: p. D675-9.
2. Chen, J.Y. and J.V. Carlis, Similar_Join: extending DBMS with a bio-specific
operator. Proceedings of the 2003 ACM symposium on Applied computing, 2003:
p. 109-114.
3. Chen, J.Y., J.V. Carlis, and N. Gao, A complex biological database querying
method. Proceedings of the 2005 ACM symposium on Applied computing, 2005:
p. 110-114.
4. Dhanapalan, L. and J.Y. Chen, A Case Study of Integrating Protein
Interaction Data using Semantic Web Technology. International Journal of
Bioinformatics Research and Applications, 2007. 3(3): p. (accepted).
5. Chen, J.Y., et al., Initial Large-scale Exploration of Protein-protein
Interactions in Human Brain, in IEEE Computer Society Computational
Systems Bioinformatics '03. 2003: Stanford, California. IEEE Computer
Society Press.
6. Shen, C., L. Li, and J.Y. Chen, A statistical framework to discover true
associations from multiprotein complex pull-down proteomics data sets.
Proteins, 2006. 64(2): p. 436-43.
7. Chen, J.Y., S. Mamidipalli, and B. George, HAPPI: A Database of Human
Annotated Protein-Protein Interactions. 2007 [cited 2006];
Available from: http://bio.informatics.iupui.edu/HAPPI/.
8. Chen, J.Y., C. Shen, and A. Sivachenko, Mining Alzheimer Disease Relevant
Proteins from Integrated Protein Interactome Data, in Pacific Symposium
on Biocomputing '06. 2006: Maui, HI. p. 367-378.
9. Chen, J.Y., et al., An Integrated Computational Proteomics Method to Extract
Protein Targets for Fanconi Anemia Studies, in 21st Annual ACM Symposium
on Applied Computing. 2006: Dijon, France. p. 173-179.
10. Chen, J.Y., et al., A Systems Biology Case Study of Ovarian Cancer Drug
Resistance, in Computational Systems Bioinformatics '06. 2006:
Stanford, CA.