Pattern Recognition in Bioinformatics

Alioune Ngom; Enrico Formenti; Jin-Kao Hao; Zhao Xing-Ming; Twan van Laarhoven

doi:10.1007/978-3-642-39159-0

Abstract

In the post-genomic era, a holistic understanding of biological systems and processes, in all their complexity, is critical in comprehending nature’s choreography of life. As a result, bioinformatics, which draws on two main disciplines, namely the life sciences and the computational sciences, is fast becoming a very promising multidisciplinary research field. With the ever-increasing application of large-scale high-throughput technologies, such as gene or protein microarrays and mass spectrometry methods, the enormous body of information is growing rapidly. Bioinformaticians face a large number of difficult problems, arising not only from the complexities of acquiring the molecular information but also from the size and nature of the generated data sets and the limitations of the algorithms required for analyzing these data. Recent advancements in computational and information-theoretic techniques are enabling us to conduct in silico testing and screening of many lab-based experiments before these are actually performed in vitro or in vivo. These in silico investigations are providing new insights for interpreting results and establishing new directions for a deeper understanding. Among the various advanced computational methods currently being applied to such studies, pattern recognition techniques are most often found at the core of the whole discovery process for apprehending the underlying biological knowledge. Thus, we can safely surmise that the ongoing bioinformatics revolution is likely, in the future, to play a major role in many aspects of medical practice and of the life sciences.
The aim of this conference on Pattern Recognition in Bioinformatics (PRIB) is to provide an opportunity for academics, researchers, scientists, and industry professionals to present their latest research in pattern recognition and computational intelligence-based techniques applied to problems in bioinformatics and computational biology. It also provides them with an excellent forum to interact with each other and share experiences. The conference is organized jointly by the Nice Sophia Antipolis University, France, and the IAPR (International Association for Pattern Recognition) Bioinformatics Technical Committee (TC-20).
This volume presents the proceedings of the 8th IAPR International Conference on Pattern Recognition in Bioinformatics (PRIB 2013), held in Nice, June 17–19, 2013. It includes 25 technical contributions that were selected by the International Program Committee from 43 submissions. Each of these rigorously reviewed papers was presented orally at PRIB 2013. The proceedings consist of five parts:
Part I Bio-Molecular Networks and Pathway Analysis
Part II Learning, Classification, and Clustering
Part III Data Mining and Knowledge Discovery
Part IV Protein: Structure, Function, and Interaction
Part V Motifs, Sites, and Sequences Analysis
Part I of the proceedings contains six chapters on “Bio-Molecular Networks and Pathway Analysis.” Rahman et al. propose a fast agglomerative clustering method for protein complex discovery. A new criterion is introduced that combines an edge clustering coefficient and an edge clustering value to decide when a node can be added to the current cluster. Maduranga et al. use the well-known random forest method to predict gene regulatory networks (GRNs). The problem of inferring GRNs from (limited) time-series data is recast as a number of regression problems, and the random forest approach is used to fit a model to each of them. Winterbach et al.
evaluate how well topological signatures in protein interaction networks predict protein function. They compare several complex signatures with their own simple signature. They find that network topology is only a weak predictor of function and that the simple signature performs on par with the more sophisticated ones. De Ridder et al. propose an approach for identifying putative cancer pathways. This approach relies on expression profiling of tumors that are induced by retroviral insertional mutagenesis. This provides the opportunity to search for associations between tumor-initiating events (the viral insertion sites) and the consequent transcription changes, thus revealing putative regulatory interactions. An important advantage is that the selective pressure exerted by the tumor growth is exploited to yield a relatively small number of loci that are likely to be causal for tumor formation. Ochs et al. apply outlier statistics, gene set analysis, and top scoring pair methods to identify deregulated pathways in cancer. Analysis of the results on pediatric acute myeloid leukemia data indicates the effectiveness of the proposed methodology. Pizzuti et al. present some variants of RNSC (restricted neighborhood search clustering) for the prediction of protein complexes that are based on new score functions and evolutionary computation. Computational experiments show that the proposed methods achieve better prediction accuracy (in F-measure) than the basic RNSC algorithm.
Part II of the proceedings contains three chapters on “Learning, Classification, and Clustering.” Marchiori addresses a limitation of the RELIEF feature weighting algorithm, which maximizes the sample margin over the entire training set, or the sum of the possibly competing feature weights. Her work proposes, instead, a conditional weighting algorithm (CCFW) and classifier (CCWNN) to improve feature weighting and classification. Mundra et al.
propose a sample selection criterion using a modified logistic regression loss function and a backward elimination-based gene ranking algorithm. Because sample points on or within the classifier margin are more important than those outside it, a sample selection criterion based on the T-score is proposed. Li et al. describe a generalization of sparse matrix factorization (SMF) algorithms and showcase a few very concisely described applications in bioinformatics. The main merit of the work is that it proposes a unified representation for SMF algorithms, as well as an optimization algorithm to solve the resulting problem.
Part III of the proceedings contains six chapters on “Data Mining and Knowledge Discovery.” Hsu et al. consider the prediction of RNA secondary structure in the “triple helix” setting, for which they argue existing methods are inadequate. Their approach uses a Simple Tree Adjoining Grammar (STAG) coupled with maximum likelihood estimation (MLE), implemented via an efficient dynamic programming formulation. Higgs et al. present an algorithm for generating near-native protein models. It combines a fragment feature-based resampling algorithm with the local optimization method that performed best for protein structure prediction (PSP) among a set of five optimization techniques. Computational experiments show that the use of local optimization is beneficial in terms of both RMSD and TM score. Spirov et al. discuss a method for transformation of variables in order to normalize Drosophila oocyte images acquired via confocal microscopy. The paper describes an interesting problem, namely, the experimental determination of intrinsic Drosophila embryo coordinates, and proposes an approach using evolutionary computation by genetic algorithms. Rezaeian et al. propose a novel and flexible hierarchical framework to select discriminative genes and predict breast tumor subtypes simultaneously. Dai et al.
tackle an important problem in drug-target interaction research and present an interesting application of machine learning methods to the analysis of drugs. Gritsenko et al. adapt their previously developed protocol for building and evaluating predictors in order to introduce a framework that enables forward engineering in biology. An experimental test is performed in the biological field of codon optimization, and the results obtained are comparable with those produced by the reference tool JCat.
Part IV of the proceedings contains six chapters on “Protein: Structure, Function, and Interaction.” Xiong et al. propose an active learning-based approach for protein function prediction. The novelty of the proposal is a pre-processing phase that applies spectral clustering before candidates for labeling are selected with graph centrality metrics. Experimental results show that clustering is a valid pre-processing step for the active learning method. Gehrmann et al. address the problem of integrating multiple sources of evidence to predict protein functions. The paper proposes to use a conditional random field (CRF) in which protein functions are the random variables to be predicted and the different sources of evidence are the conditioning variables. Inference and learning algorithms based on MCMC are described, and the proposed method is applied to a yeast dataset. Dehzangi et al. describe a new approach to protein fold recognition, a problem that has been widely studied over the past decade. The main contribution is the proposal of a new set of global protein features based on evolutionary consensus sequences and predicted secondary structure, and of local features based on distributions and auto-covariances of these features over segments. An RBF SVM using these features is applied to two benchmark datasets in an extensive comparison with a number of existing methods and is demonstrated to work well. Dehzangi et al.
present a novel approach to using features extracted from the position-specific scoring matrix (PSSM) to predict the structural class of a protein. The authors propose two new sets of features: a global one based on the consensus sequence of a PSSM and a local one that takes the auto-covariance in sequence segments into account. The extracted features are used to train an RBF SVM and are shown to lead to good results (better than other state-of-the-art algorithms) on two benchmarks. Chiu et al. discuss a new method for detecting associated sites in aligned sequence ensembles. The main idea is derived from the concept of granular computing, where information is extracted at different levels of granularity or resolution. The experiments focused on p53 and demonstrated that the extracted association patterns are useful in discovering sites with some structural and functional properties of a protein molecule. Tung presents a new method for predicting the potential hepatocarcinogenicity of non-genotoxic chemicals. The proposed method, based on chemical–protein interactions and an interpretable decision tree, is compared with other data-mining approaches and performs very well in terms of both accuracy and simplicity of the resulting model.
Part V of the proceedings contains four chapters on “Motifs, Sites, and Sequences Analysis.” Pathak et al. present an algorithm that exploits structural information to reduce false positives in motif prediction. They tested the validity of the algorithm using the minimotifs stored in the MnM database. Lacroix et al. present a workflow for predicting the effects of residue substitution on protein stability. The workflow integrates eight algorithms that use delta-delta-G as a measure of stability. The workflow is designed to populate the online resource SPROUTS. A use case of the workflow is presented using the PDB entry 1enh. Malhotra et al.
present an algorithm for inferring haplotypes of virus populations from k-mer counts obtained from next-generation sequencing (NGS) data. The algorithm takes as input read counts for a set of k-mers and produces as output a predicted number of haplotypes and their relative frequencies; for reads covering SNPs, it can also assign reads to a haplotype. The novel feature of the algorithm is that it does not rely on having a reference genome. The authors report that it performs well on synthetic data compared with the existing algorithm ShoRAH, which relies on a reference genome. Comin et al. discuss and improve the Entropic Profile method introduced in the literature for detecting conservation in genome sequences. The authors propose a linear-time, linear-space algorithm that captures the importance of given regions with respect to the whole genome and is suitable for large genomes and for the discovery of motifs with unbounded length.
Many have contributed directly or indirectly toward the organization and success of the PRIB 2013 conference. We would like to thank all the individuals and institutions involved, especially the authors for submitting their papers and the sponsors for generously providing financial support for the conference. We are very grateful to IAPR for the sponsorship. Our gratitude goes to the Nice Sophia Antipolis University, Nice, France, and the IAPR (International Association for Pattern Recognition) Bioinformatics Technical Committee (TC-20) for supporting the conference in many ways.
We would like to express our gratitude to all PRIB 2013 International Program Committee members for their objective and thorough reviews of the submitted papers. We greatly appreciate the time, effort, and excellent work of the PRIB 2013 Organizing Committee. We would also like to thank the Nice Sophia Antipolis University for hosting the symposium and providing technical support.
We sincerely thank the EDSTIC doctoral school for providing grants to a number of students attending the conference. We also thank “Region PACA” and the University of Salerno (Italy) for partially funding the invited speakers. Last, but not least, we wish to convey our sincere thanks to Springer for providing excellent professional support in preparing this volume.
