Article ID: 16726987(2018)01010608; DOI: 10.16351/j.16726987.2018.01.017
CLC number: Q 811.4; Document code: A
Citation: YU Bin, LI Shan, CHEN Cheng, et al. Prediction of human LncRNA big data genes based on ensemble learning[J]. Journal of Qingdao University of Science and Technology (Natural Science Edition), 2018, 39(1): 106-113.
Prediction of Human LncRNA Big Data Genes Based on Ensemble Learning
YU Bin, LI Shan, CHEN Cheng, CHEN Ruixin, TIAN Baoguang
(College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China)
Abstract: Long noncoding RNA (LncRNA) plays an important role in epigenetic regulation, post-transcriptional regulation, and human diseases, so identifying LncRNA in large volumes of RNA data with machine learning methods is essential. This paper presents a new method for predicting LncRNA genes from big data based on ensemble learning. First, 86 features based on the occurrence frequencies of bases in the sequences are extracted as the original feature set. Second, GA-SVM selects the optimal features, using the 5-fold cross-validation accuracy of an SVM as the fitness. Finally, a gene prediction model that combines the AdaBoost algorithm with SVM (AdaBoost-SVM) is constructed. The experimental results show that the AdaBoost-SVM model achieves a prediction accuracy of 89.26% for LncRNA on the test set, which is better than that of the RF, SVM, and DWT-SVM models.
Key words: long noncoding RNA; gene prediction; ensemble learning; AdaBoost algorithm; support vector machine
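The first step in the abstract, turning a sequence into base-frequency features, can be illustrated with a minimal sketch. The abstract does not list the 86 features, so this example assumes plain k-mer frequencies for k = 1, 2, 3 (84 values) and is not the authors' exact feature set; the function names `kmer_frequencies` and `frequency_features` are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def kmer_frequencies(seq, k):
    """Return the relative frequency of every length-k word over ACGT in seq."""
    # Assumption: RNA input (U) is mapped to T so RNA and cDNA are treated alike.
    seq = seq.upper().replace("U", "T")
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Skip windows containing ambiguous symbols such as N.
    counts = Counter(w for w in windows if set(w) <= set(BASES))
    total = sum(counts.values())
    # Fixed lexicographic order (A, C, G, T) keeps feature positions stable.
    return [counts["".join(p)] / total if total else 0.0
            for p in product(BASES, repeat=k)]


def frequency_features(seq, max_k=3):
    """Concatenate k-mer frequency vectors for k = 1..max_k."""
    features = []
    for k in range(1, max_k + 1):
        features.extend(kmer_frequencies(seq, k))
    return features


features = frequency_features("AUGGCUAGCUAGGCUA")
print(len(features))  # 4 + 16 + 64 = 84
```

A vector like this would then be passed to the feature selection and classification stages; those stages are not sketched here because the abstract gives no parameter details for them.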
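Received: 2017-05-02
Funding: National Natural Science Foundation of China (51572136); Natural Science Foundation of Shandong Province (ZR2014FL021); Shandong Province Higher Educational Science and Technology Program (J17KA159).
About the author: YU Bin (born 1977), male, associate professor.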