Multiple Variable Discretization Algorithm of Continuous Attributes in Rough Set Theory Based on Information Entropy
WANG Ju-fan, CHEN Zhuo
(College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266042, China)
Abstract: Attribute discretization reduces problem complexity and yields rules that are shorter, more accurate, and easier to understand. Existing discretization methods, when selecting breakpoints, do not consider the mutual exclusion among breakpoints across and within attributes, and cannot guarantee that the indiscernibility relation of the decision table is preserved. This paper proposes a new information-entropy-based multivariable discretization algorithm for continuous attributes in rough set theory (PAD). The algorithm uses information entropy as the measure for choosing breakpoints, takes the indiscernibility relation as the stopping criterion, and introduces five strategies for breakpoint pre-selection and final selection. Experimental results show that, compared with five discretization algorithms in the Rosetta software, PAD achieves higher prediction accuracy with fewer breakpoints.
Key words: rough sets; indiscernibility relation; discretization; information entropy
CLC number: P 208    Document code: A
Received: 2012-09-26
Foundation item: National Natural Science Foundation of China (61273180).
Author biography: WANG Ju-fan (born 1986), male, master's student.
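The two ingredients named in the abstract, information entropy as the breakpoint measure and the indiscernibility relation as the stopping criterion, can be illustrated with a minimal greedy sketch. This is not the PAD algorithm: the five pre-selection and final-selection strategies are not described in this section and are omitted. All function names (`entropy`, `candidate_cuts`, `is_consistent`, `greedy_discretize`) and the choice of midpoints as candidate breakpoints are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of decision labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def candidate_cuts(values):
    """Assumed candidates: midpoints between adjacent distinct values."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]


def discretize_row(row, cuts):
    """Map each attribute value to the index of the interval it falls in."""
    return tuple(sum(x > c for c in cuts[j]) for j, x in enumerate(row))


def is_consistent(rows, labels, cuts):
    """True if objects that are indiscernible after discretization
    (identical interval indices) all share the same decision."""
    seen = {}
    for row, y in zip(rows, labels):
        if seen.setdefault(discretize_row(row, cuts), y) != y:
            return False
    return True


def partition_entropy(rows, labels, cuts):
    """Weighted decision entropy over the blocks induced by the cuts."""
    blocks = {}
    for row, y in zip(rows, labels):
        blocks.setdefault(discretize_row(row, cuts), []).append(y)
    n = len(labels)
    return sum(len(b) / n * entropy(b) for b in blocks.values())


def greedy_discretize(rows, labels):
    """Add the cut that minimizes partition entropy, over all attributes
    at once, until the discretized table is consistent or no cuts remain."""
    n_attrs = len(rows[0])
    cuts = [[] for _ in range(n_attrs)]
    pool = [(j, c) for j in range(n_attrs)
            for c in candidate_cuts([r[j] for r in rows])]

    def score(jc):
        j, c = jc
        trial = [cs + [c] if k == j else cs for k, cs in enumerate(cuts)]
        return partition_entropy(rows, labels, trial)

    while pool and not is_consistent(rows, labels, cuts):
        best = min(pool, key=score)
        pool.remove(best)
        cuts[best[0]].append(best[1])
    return [sorted(cs) for cs in cuts]
```

For a toy table with rows `(1.0, 5.0)`, `(2.0, 5.5)`, `(3.0, 1.0)`, `(4.0, 1.5)` and decisions `0, 0, 1, 1`, the sketch stops after a single cut at 2.5 on the first attribute, since that cut alone already separates the two decision classes.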