Extreme Learning Machine for Protein Subcellular Localization from Primary Sequence
Abstract: Predicting protein subcellular localization from primary sequence is crucial to genome annotation, protein function prediction, and drug discovery. The extreme learning machine (ELM) has attracted considerable attention as a learning method in recent years. This paper explores the potential of ELM for protein subcellular localization prediction. To this end, a new feature selection strategy is first established, under which each primary sequence is represented as a 25-dimensional numerical vector. Three methods are then compared numerically on a data set of 852 mycobacterial proteins: a support vector machine (SVM) with the new features, ELM with the new features, and an existing SVM method with pseudo amino acid composition features. The data are drawn from the Swiss-Prot 48 database and belong to four different classes. Five-fold cross-validation on the 852 protein sequences shows that ELM with the new features achieves the best accuracy, 97.2%, compared with 96.4% for SVM with the new features and 95.2% for SVM with pseudo amino acid composition features.
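The abstract does not spell out the ELM formulation or how the 25 features are built, so the following is only a minimal sketch of the general technique, assuming the basic ELM: a single hidden layer with random, fixed weights and output weights obtained by regularized least squares. The synthetic data stands in for an 852 x 25 feature matrix with four classes; it is not the Swiss-Prot data. The function names, the tanh activation, `n_hidden`, and `reg` are illustrative assumptions rather than settings taken from the paper.

```python
import numpy as np

def train_elm(X, y, n_hidden=100, n_classes=4, reg=1e-3, seed=None):
    """Fit a basic ELM and return (W, b, beta). Hyperparameters are assumptions."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Hidden-layer weights and biases are drawn at random and never trained.
    W = rng.normal(size=(d, n_hidden)) / np.sqrt(d)
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)            # hidden-layer output matrix
    T = np.eye(n_classes)[y]          # one-hot class targets
    # Output weights: ridge-regularized least-squares solution.
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ T)
    return W, b, beta

def predict_elm(model, X):
    """Return the predicted class index for each row of X."""
    W, b, beta = model
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)

def cross_validate(X, y, k=5, seed=0):
    """Mean accuracy over k folds of a shuffled split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_elm(X[train], y[train], seed=seed)
        accuracies.append(np.mean(predict_elm(model, X[test]) == y[test]))
    return float(np.mean(accuracies))

def make_toy_data(n=852, d=25, n_classes=4, seed=0):
    """Synthetic Gaussian clusters; a placeholder for real sequence features."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, n_classes, size=n)
    centers = rng.normal(scale=3.0, size=(n_classes, d))
    X = centers[y] + rng.normal(size=(n, d))
    return X, y

X, y = make_toy_data()
acc = cross_validate(X, y, k=5)
print(f"mean 5-fold accuracy on synthetic data: {acc:.3f}")
```

Because the hidden layer is fixed, training reduces to solving a single linear system. The accuracy printed here reflects only the toy data and says nothing about the figures reported in the abstract.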
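Cite this article: Shi Feng, Chen Hong, Xiong Huijuan (2013) Extreme Learning Machine for Protein Subcellular Localization from Primary Sequence. Data Mining (数据挖掘), 3, 6-11. doi: 10.12677/HJDM.2013.31002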