With the launch and completion of several genome projects, biomedical researchers have accumulated a large amount of “Omics” data on top of a large conventional body of biomedical textual information. The wealth of such data has given rise to new discovery opportunities in biology (data-driven knowledge discovery) and to data mining challenges in computer science (multi-dimensional yet sparse and noisy data). Drawing on my R&D experience over the past decade (1995-2005), I have identified three areas that are poised for significant growth in the next decade (2005-2015):

·         Semantics-level bio-computing systems: the development of semantics-level data integration, data/object modeling, semantic knowledge interoperation, complex querying, and data mining techniques and software systems for biological studies.

·         Networks and systems biology: the study of high-level complex relationships among proteins, DNA, RNA, metabolites, compounds, and environmental perturbations using integrated functional genomics, proteomics, and pharmacogenomics data and information.

·         Discovery informatics: the application of computational techniques with data-driven approaches to bridge the gaps left by conventional "hypothesis-driven" approaches to knowledge discovery problems.

What connects these three long-term research interests of our lab is the complex computer systems and tools necessary to solve complex biology problems from a new global perspective, all with significant drug discovery or molecular diagnostic implications. In the short to medium term, we have organized our research activities in these three key areas into three themes, which have started to produce some promising results.

The first theme, declarative bio-computing, focuses on the development of semantically interoperable, scientific-expert-friendly biological data management methods and software tools. In particular, our current research covers biological data integration using the Semantic Web, biological data mining using extended query languages, and biological exploratory data analysis using complex query models. Examples of our completed research work in these directions include: the development of regular expression cartridges and sequence matching capabilities such as ODM_BLAST in Oracle database software [1], the implementation of the Similar_Join operator to perform general sequence matching in databases [2], the handling of complex biological queries through query modeling [3], and the integration of biological data using Resource Description Framework Schema [4]. We plan to continue this research focus, primarily to extend “biological database operator” concepts into the relational database engine, and to develop user-friendly data integration and querying strategies for large high-throughput biological data sets.

The second theme, computational networks and pathways, focuses on the emerging study of biological networks from large high-throughput Omics data sets. In this area, we have developed computational techniques to collect large-scale biology data sets [5], assess the quality of high-throughput protein interaction data sets [6], integrate them with extensive public biological databases [7], interpret statistically significant patterns of newly characterized proteins, and build expanded models of biological pathways for subsequent validation [8]. The core algorithm for disease-specific network analysis, SPINNER, has received increasing attention from peer bioinformatics groups, which subsequently led Indiana University to file U.S. and worldwide patent applications. Licensing and commercialization of the technology are being negotiated with several companies.

The third theme, systems biology applications, focuses on the disease-specific study of complex disease mechanisms, drug target discovery, and molecular biomarkers. We have worked on several disease areas, including the identification of essential proteins involved in Alzheimer’s disease [8], the identification of proteins involved in complex formation with the FANCC protein in Fanconi anemia [9], and the determination of drug resistance in ovarian cancer treatment from proteomics profiles [10]. In these applications, a range of techniques, including data management, exploratory data analysis, algorithm development, and biology knowledge curation, are all needed to arrive at a solution. This research area not only bears the highest potential impact on future translational medicine, but also provides a rich source of research collaboration with biologists at the nearby Indiana University School of Medicine.

Since 2004, our research has been generously supported by the IUPUI NIH Roadmap fund, the RSFG fund, the Purdue Summer Research fund, the National Cancer Institute, and the Indiana Lung Cancer Working Group.

1.             Stephens, S.M., et al., Oracle Database 10g: a platform for BLAST search and Regular Expression pattern matching in life sciences. Nucleic Acids Res, 2005. 33(Database issue): p. D675-9.

2.             Chen, J.Y. and J.V. Carlis, Similar_Join: extending DBMS with a bio-specific operator. Proceedings of the 2003 ACM symposium on Applied computing, 2003: p. 109-114.

3.             Chen, J.Y., J.V. Carlis, and N. Gao, A complex biological database querying method. Proceedings of the 2005 ACM symposium on Applied computing, 2005: p. 110-114.

4.             Dhanapalan, L. and J.Y. Chen, A Case Study of Integrating Protein Interaction Data using Semantic Web Technology. International Journal of Bioinformatics Research and Applications, 2007. 3(3): p. (accepted).

5.             Chen, J.Y., et al. Initial Large-scale Exploration of Protein-protein Interactions in Human Brain. in IEEE Computer Society Computational Systems Bioinformatics '03. 2003. Stanford, California: IEEE Computer Society Press.

6.             Shen, C., L. Li, and J.Y. Chen, A statistical framework to discover true associations from multiprotein complex pull-down proteomics data sets. Proteins, 2006. 64(2): p. 436-43.

7.             Chen, J.Y., S. Mamidipalli, and B. George. HAPPI: A Database of Human Annotated Protein-Protein Interactions. 2007 [cited 2006]; Available from: http://bio.informatics.iupui.edu/HAPPI/.

8.             Chen, J.Y., C. Shen, and A. Sivachenko, Mining Alzheimer Disease Relevant Proteins from Integrated Protein Interactome Data, in Pacific Symposium on Biocomputing '06. 2006: Maui, HI. p. 367-378.

9.             Chen, J.Y., et al., An Integrated Computational Proteomics Method to Extract Protein Targets for Fanconi Anemia Studies, in 21st Annual ACM Symposium on Applied Computing. 2006: Dijon, France. p. 173-179.

10.          Chen, J.Y., et al. A Systems Biology Case Study of Ovarian Cancer Drug Resistance. in Computational Systems Bioinformatics '06. 2006. Stanford, CA.