An Efficient Algorithm for Mining DNA Sequences Based on the Association Matrix
Abstract: The DNA analysis is the core of bioinformatics research, and as an important technology to support bioinformatics, the data mining has been widely applied to the analysis of DNA sequences. Compared to the transaction sequences in traditional business areas, DNA sequences have the characteristics that are item-less but length-longer, so the classic sequence mining algorithms are not perfectly suitable for the DNA sequence pattern mining. Based on the analysis of DNA sequence mining demands, we propose an efficient data structure, called Association Matrix. Such a structure can compress a long DNA sequence into a matrix form which can be effectively analyzed. Therefore, by making use of the space compactness of this structure, we can deal with DNA sequences with a super-long length in a limited memory. Based on the Association Matrix, we design an efficient mining algorithm to find the key segments from DNA Sequence. Experiments show that the proposed algorithm performs well in DNA sequence mining.
文章引用: 毛国君 , 杨静欣 (2015) 一种基于关联矩阵的高效DNA序列挖掘算法。 计算机科学与应用， 5， 271-277. doi: 10.12677/CSA.2015.58035
 朱扬勇, 熊赟 (2007) DNA序列数据挖掘技术. 软件学报, 11, 2767-2781.
Hirosawa, M., Totoki, Y., Hoshida, M. and Ishikawa, M. (1995) Comprehensive study on iterative algorithms of multiple sequence alignment. Computer Applications in the Biosciences, 11, 13-18.
Bell, D.A. and Guan, J.W. (2003) Data mining for motifs in DNA sequences. Rough Sets, Fuzzy Sets. Data Mining and Granular Computing: Lecture Notes in Computer Science, 2639, 507-514.
Papapetrou, P. and Benson, G. and Kollios, G. (2012) Mining poly-regions in DNA. International Journal of Data Mining and Bioinformatics, 6, 406.
Agrawal, R. and Srikant, R. (1995) Mining sequential patterns. Proceeding of the 6th International Conference on Data Engineering, Taipei, 6-10 March 1995, 3-14.
Srikant, R. and Agrawal, R. (1996) Mining sequential patterns: Generalizations and performance improvements. Proceedings of the 5th International conference on Extending Database Technology: Advances in Database Technology, Avognon, 25-29 March 1996, 3-17.
 Han, J. and Pei, J. (2000) Free-span: Frequent pattern-projected se-quential pattern mining. Proceedings of 6th International Conference of Knowledge Discovery and Data Mining, Boston, 20-23 August 2000, 355-359.
 Pei, J., Han, J., Asl, B., Chen, Q., Dayal, U. and Hsu, M. (2001) PrefixSpan: Mining sequential patterns efficiently by prefix-projected patterns growth. In Proceeding of the 17th International Data Engineering, Heidelberg, 2-6 April 2001, 215-224.
Pei, J., Han, J., Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U. and Hsu, M. (2004) Mining sequential patterns by pattern-growth: The Preﬁx-Span approach. IEEE Transactions on Knowledge and Data Engineering, 16, 1424-1440.
Mohammed, J. (2001) SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42, 31-60.
 Liao, Z.X., Peng, W.C. and Hu, X.Y. (2006) Mining mul-ti-domain sequential patterns. Workshop on Software Engineering, Databases, and Knowledge Discovery, Taipei, 12-15 December 2006, 334-339.
Liu, Y.C., Chen, L.C., Liu, C.W. and Tseng, V.S. (2014) Effective peak alignment for mass spectrometry data analysis using two-phase clustering approach. International Journal of Data Mining and Bioinformatics, 9, 52-66.
 毛国君, 段丽娟, 王实, 石云 (2007) 数据挖掘原理与算法(第二版). 清华大学出版社, 北京, 217-218.
Arjas, E., Mannila, H., Salmenkivi, M., Suramo, R. and Toivonen, H. (1996) BASS: Bayesian analyzer of event sequences. Proceedings of the 12th COMPSTAT, Barcelona, 28-31 July 1996, 199-204.
Mannila, H., Toivonen, H. and Verkamo, I. (1997) Dis-covery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1, 259-289.
Mannilaa, H. and Salmenkivi, M. (2001) Finding simple intensity descriptions from event sequence data. Proceedings of the 7th ACMSIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, 26 August 2001, 341-346.
Keogh, E., Chu, S., Hart, D. and Pazzani, M. (2001) An online al-gorithm for segmenting time series. Proceedings of IEEE International Conference on Data Mining, San Jose, 29 No-vember-2 December 2001, 289-296.
 Lee, G.L., Chen, Y.C. and Hung, K.C. (2012) Path tree: Mining sequential patterns efficiently in multiple data streams environment. Proceedings of the International Computer Sym-posium ICS, Hualien, 12-14 November 2012, 261-268.
Wu, Y.X., Wang, L.L., Ren, J.D., Ding, W. and Wu, X.D. (2014) Mining sequential patterns with periodic wildcard gaps. Applied Intelligence, 41, 99-116.
Wang, K., Xu, Y.B. and Yu, J.X. (2004) Scalable sequential pattern mining for biological sequences. Proceedings of the 13th ACM International Conference on Information and Knowledge Management, Shanghai, 3-7 November 2014, 178-187.