一种基于关联矩阵的高效DNA序列挖掘算法
An Efficient Algorithm for Mining DNA Sequences Based on the Association Matrix

作者: 毛国君 , 杨静欣 :中央财经大学信息学院,北京 ;

关键词: DNA序列数据挖掘关联矩阵关键序列挖掘DNA Sequence Data Mining Association Matrix Key Sequence Mining

摘要: DNA分析是生物信息学研究中基础而核心的工作,而数据挖掘作为支撑生物信息学的重要技术,已经被广泛应用到DNA序列的分析中。与传统的商业领域的事务序列相比,DNA序列具有项目符号少但序列长度长的特点,因此经典的序列挖掘算法很难适应DNA序列的模式挖掘需要。本文在分析DNA序列的挖掘需求基础上,提出了一种称为关联矩阵的数据结构。关联矩阵能够将序列数据压缩成可分析的矩阵形式,所以它的空间紧凑性能够使得超长的DNA序列能够在有限的内存中加以处理。基于关联矩阵结构,设计了高效的DNA序列的关键序列挖掘算法。实验说明了本文算法在DNA序列分析中的高效性。

Abstract: The DNA analysis is the core of bioinformatics research, and as an important technology to support bioinformatics, the data mining has been widely applied to the analysis of DNA sequences. Compared to the transaction sequences in traditional business areas, DNA sequences have the characteristics that are item-less but length-longer, so the classic sequence mining algorithms are not perfectly suitable for the DNA sequence pattern mining. Based on the analysis of DNA sequence mining demands, we propose an efficient data structure, called Association Matrix. Such a structure can compress a long DNA sequence into a matrix form which can be effectively analyzed. Therefore, by making use of the space compactness of this structure, we can deal with DNA sequences with a super-long length in a limited memory. Based on the Association Matrix, we design an efficient mining algorithm to find the key segments from DNA Sequence. Experiments show that the proposed algorithm performs well in DNA sequence mining.

Abstract:

文章引用: 毛国君 , 杨静欣 (2015) 一种基于关联矩阵的高效DNA序列挖掘算法。 计算机科学与应用, 5, 271-277. doi: 10.12677/CSA.2015.58035

参考文献

[1] 朱扬勇, 熊赟 (2007) DNA序列数据挖掘技术. 软件学报, 11, 2767-2781.

[2] Hirosawa, M., Totoki, Y., Hoshida, M. and Ishikawa, M. (1995) Comprehensive study on iterative algorithms of multiple sequence alignment. Computer Applications in the Biosciences, 11, 13-18.
http://dx.doi.org/10.1093/bioinformatics/11.1.13

[3] Bell, D.A. and Guan, J.W. (2003) Data mining for motifs in DNA sequences. Rough Sets, Fuzzy Sets. Data Mining and Granular Computing: Lecture Notes in Computer Science, 2639, 507-514.
http://dx.doi.org/10.1007/3-540-39205-X_85

[4] Papapetrou, P. and Benson, G. and Kollios, G. (2012) Mining poly-regions in DNA. International Journal of Data Mining and Bioinformatics, 6, 406.
http://dx.doi.org/10.1504/IJDMB.2012.049278

[5] Agrawal, R. and Srikant, R. (1995) Mining sequential patterns. Proceeding of the 6th International Conference on Data Engineering, Taipei, 6-10 March 1995, 3-14.
http://dx.doi.org/10.1109/icde.1995.380415

[6] Srikant, R. and Agrawal, R. (1996) Mining sequential patterns: Generalizations and performance improvements. Proceedings of the 5th International conference on Extending Database Technology: Advances in Database Technology, Avognon, 25-29 March 1996, 3-17.
http://dx.doi.org/10.1007/bfb0014140

[7] Han, J. and Pei, J. (2000) Free-span: Frequent pattern-projected se-quential pattern mining. Proceedings of 6th International Conference of Knowledge Discovery and Data Mining, Boston, 20-23 August 2000, 355-359.

[8] Pei, J., Han, J., Asl, B., Chen, Q., Dayal, U. and Hsu, M. (2001) PrefixSpan: Mining sequential patterns efficiently by prefix-projected patterns growth. In Proceeding of the 17th International Data Engineering, Heidelberg, 2-6 April 2001, 215-224.

[9] Pei, J., Han, J., Asl, B., Wang, J., Pinto, H., Chen, Q., Dayal, U. and Hsu, M. (2004) Mining sequential patterns by pattern-growth: The Prefix-Span approach. IEEE Transactions on Knowledge and Data Engineering, 16, 1424-1440.
http://dx.doi.org/10.1109/TKDE.2004.77

[10] Mohammed, J. (2001) SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42, 31-60.
http://dx.doi.org/10.1023/A:1007652502315

[11] Liao, Z.X., Peng, W.C. and Hu, X.Y. (2006) Mining mul-ti-domain sequential patterns. Workshop on Software Engineering, Databases, and Knowledge Discovery, Taipei, 12-15 December 2006, 334-339.

[12] Liu, Y.C., Chen, L.C., Liu, C.W. and Tseng, V.S. (2014) Effective peak alignment for mass spectrometry data analysis using two-phase clustering approach. International Journal of Data Mining and Bioinformatics, 9, 52-66.
http://dx.doi.org/10.1504/IJDMB.2014.057780

[13] 毛国君, 段丽娟, 王实, 石云 (2007) 数据挖掘原理与算法(第二版). 清华大学出版社, 北京, 217-218.

[14] Arjas, E., Mannila, H., Salmenkivi, M., Suramo, R. and Toivonen, H. (1996) BASS: Bayesian analyzer of event sequences. Proceedings of the 12th COMPSTAT, Barcelona, 28-31 July 1996, 199-204.
http://dx.doi.org/10.1007/978-3-642-46992-3_20

[15] Mannila, H., Toivonen, H. and Verkamo, I. (1997) Dis-covery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1, 259-289.
http://dx.doi.org/10.1023/A:1009748302351

[16] Mannilaa, H. and Salmenkivi, M. (2001) Finding simple intensity descriptions from event sequence data. Proceedings of the 7th ACMSIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, 26 August 2001, 341-346.
http://dx.doi.org/10.1145/502512.502562

[17] Keogh, E., Chu, S., Hart, D. and Pazzani, M. (2001) An online al-gorithm for segmenting time series. Proceedings of IEEE International Conference on Data Mining, San Jose, 29 No-vember-2 December 2001, 289-296.
http://dx.doi.org/10.1109/icdm.2001.989531

[18] Lee, G.L., Chen, Y.C. and Hung, K.C. (2012) Path tree: Mining sequential patterns efficiently in multiple data streams environment. Proceedings of the International Computer Sym-posium ICS, Hualien, 12-14 November 2012, 261-268.

[19] Wu, Y.X., Wang, L.L., Ren, J.D., Ding, W. and Wu, X.D. (2014) Mining sequential patterns with periodic wildcard gaps. Applied Intelligence, 41, 99-116.
http://dx.doi.org/10.1007/s10489-013-0499-4

[20] Wang, K., Xu, Y.B. and Yu, J.X. (2004) Scalable sequential pattern mining for biological sequences. Proceedings of the 13th ACM International Conference on Information and Knowledge Management, Shanghai, 3-7 November 2014, 178-187.
http://dx.doi.org/10.1145/1031171.1031209

分享
Top