Research on Frequent Itemsets Mining in Large Data Environment
LI Huijian
(Institute of Information Technology Application, Ministry of Transport Management Cadre Institute, Beijing 101601, China)
Abstract: Combining several frequent itemset mining (FIM) methods to mine big data exposes many problems. To address them, two FIM algorithms are studied on the MapReduce platform. Two new methods for mining large data sets are adopted: DistEclat and BigFIM. DistEclat focuses on speed and relies on a simple load-balancing scheme based on kFIs (frequent itemsets of size k). BigFIM first mines the kFIs with an Apriori variant, then distributes the frequent itemsets it finds to the mappers, and is optimized to run on truly large data sets. Experiments show that the method has low time complexity, and that its advantage and scalability become more pronounced as the data volume grows.
Key words: distributed data mining; frequent itemset mining (FIM); MapReduce; Hadoop; Eclat algorithm
CLC number: TP 301.6 Document code: A
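Received: 2014-04-12
Foundation item: Applied Basic Research (Main Discipline) Project of the Ministry of Transport (2012319226320).
Biography: LI Huijian (born 1976), male, senior engineer.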
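The two-phase idea summarized in the abstract (find short frequent prefixes level by level, then hand each prefix to a separate worker that completes the search Eclat-style by intersecting transaction-id lists) can be illustrated with a minimal single-process sketch. It is not the paper's Hadoop implementation: the function names (`vertical`, `short_frequent_itemsets`, `eclat_subtree`, `mine`), the round-robin assignment of prefixes, and the sequential simulation of workers are all assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def vertical(transactions):
    """Build the vertical layout: item -> set of ids of transactions containing it."""
    tidlists = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists[item].add(tid)
    return dict(tidlists)


def short_frequent_itemsets(tidlists, minsup, k):
    """Phase 1 (Apriori-like, level-wise): return the supports of frequent
    itemsets shorter than k, plus the tid-lists of the frequent k-itemsets."""
    level = {(i,): t for i, t in tidlists.items() if len(t) >= minsup}
    shorter = {}
    for _ in range(k - 1):
        shorter.update({s: len(t) for s, t in level.items()})
        nxt = {}
        # Join two sorted itemsets that differ only in their last item.
        for a, b in combinations(sorted(level), 2):
            if a[:-1] == b[:-1]:
                tids = level[a] & level[b]
                if len(tids) >= minsup:
                    nxt[a + (b[-1],)] = tids
        level = nxt
    return shorter, level


def eclat_subtree(prefix, tids, items, minsup, out):
    """Phase 2 (Eclat-like, depth-first): extend one prefix with larger items
    by intersecting tid-lists, recording every frequent itemset found."""
    out[prefix] = len(tids)
    for idx, (item, item_tids) in enumerate(items):
        if item > prefix[-1]:
            new_tids = tids & item_tids
            if len(new_tids) >= minsup:
                eclat_subtree(prefix + (item,), new_tids,
                              items[idx + 1:], minsup, out)


def mine(transactions, minsup, k=2, workers=2):
    """Return {itemset tuple: support}; workers are simulated one after another."""
    tidlists = vertical(transactions)
    singles = sorted(((i, t) for i, t in tidlists.items() if len(t) >= minsup),
                     key=lambda pair: pair[0])
    result, prefixes = short_frequent_itemsets(tidlists, minsup, k)
    work = sorted(prefixes.items(), key=lambda pair: pair[0])
    for w in range(workers):
        # Hypothetical load balancing: round-robin assignment of prefixes.
        for prefix, tids in work[w::workers]:
            eclat_subtree(prefix, tids, singles, minsup, result)
    return result
```

Calling `mine(transactions, minsup=2, k=2, workers=2)` on a list of item sets returns the support of every frequent itemset. In this sketch a larger `k` shifts more of the work into the level-wise first phase and produces more, smaller units of work to spread across workers.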