Details
一种基于节点密度分割和标签传播的Web页面挖掘方法 (EI indexed)
A Method Based on Node Density Segmentation and Label Propagation for Mining Web Page
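Document type: Journal article
Chinese title: 一种基于节点密度分割和标签传播的Web页面挖掘方法
English title: A Method Based on Node Density Segmentation and Label Propagation for Mining Web Page
Authors: Zhang Naizhou (张乃洲)[1]; Cao Wei (曹薇)[1]; Li Shijun (李石君)[2]
First author: Zhang Naizhou (张乃洲)
Corresponding author: Zhang, Nai-Zhou
Affiliations: [1] School of Computer and Information Engineering, Henan University of Economics and Law; [2] School of Computer Science, Wuhan University
First affiliation: School of Computer and Information Engineering, Henan University of Economics and Law
Year: 2015
Volume: 38
Issue: 2
Pages: 349-364
Journal (Chinese): 计算机学报
Journal (English): Chinese Journal of Computers
Indexed in: CSTPCD; EI (accession number: 20151200650596); Scopus (accession number: 2-s2.0-84924804462); PKU Core Journals (北大核心 2014); CSCD (2015-2016)
Funding: National Natural Science Foundation of China (61272109; 61202285); National Spark Program of China (2012GA750007); Basic and Frontier Technology Research Project of the Henan Provincial Department of Science and Technology (122300410378); Key Science and Technology Research Project of the Henan Provincial Department of Education (13A520032)
Language: Chinese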
Keywords: Web page segmentation; node density; label propagation; DOM tree; block classification; social computing; social networks
Abstract: Extracting the important content of a Web page, such as text and links, has valuable applications in many areas of Web mining research. The prevailing approach combines Web page segmentation with informative block recognition. However, existing methods treat the identification of important text and of important links as two independent problems. This ignores the intrinsic semantic relationships between text and links in a Web page and also reduces processing efficiency. In this paper, we propose a unified framework for mining important content from Web pages. The framework consists of three components. First, a Web page is converted into a DOM tree, which is then segmented into page blocks using node density entropy as the metric. Second, a semi-supervised method based on K-nearest-neighbor label propagation automatically expands the training set of page blocks. Third, an SVM classifier is trained on the expanded training set and used to classify page blocks. The framework can distinguish multiple types of Web page blocks and is independent of the type and layout of Web pages. We conducted extensive experiments in a real Web environment, and the results demonstrate the effectiveness of the proposed method.
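To make the second component of the abstract more concrete, the following is a minimal sketch of K-nearest-neighbor label propagation in plain Python. It is not the authors' implementation: the function name `knn_label_propagation`, the two-dimensional block features (imagined here as text density and link density), the class names, and the simple neighbor-averaging update are all assumptions chosen for illustration, since the record does not give the paper's exact formulation.

```python
import math


def knn_label_propagation(points, labels, k=2, iterations=20):
    """Spread labels from seed points to unlabeled ones over a kNN graph.

    points: list of feature vectors (tuples of floats), one per page block.
    labels: list of the same length; a class name for seeds, None otherwise.
    Returns a list with one predicted class name per point.
    """
    classes = sorted({lab for lab in labels if lab is not None})
    n = len(points)

    # k nearest neighbours of each point by Euclidean distance (self excluded).
    neighbours = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        neighbours.append(order[:k])

    # Seeds start as one-hot distributions, unlabeled points as uniform.
    uniform = [1.0 / len(classes)] * len(classes)
    dist = [[1.0 if c == lab else 0.0 for c in classes]
            if lab is not None else list(uniform)
            for lab in labels]

    for _ in range(iterations):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(dist[i])  # seed labels stay clamped
                continue
            # An unlabeled point takes the mean distribution of its neighbours.
            updated.append([
                sum(dist[j][c] for j in neighbours[i]) / len(neighbours[i])
                for c in range(len(classes))
            ])
        dist = updated

    # Pick the most probable class for every point.
    return [classes[max(range(len(classes)), key=lambda c: d[c])]
            for d in dist]


# Hypothetical block features: (text density, link density).
blocks = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15),
          (0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
seeds = ["text", None, None, "link", None, None]
expanded = knn_label_propagation(blocks, seeds, k=2)
print(expanded)
```

In a pipeline like the one the abstract describes, the expanded labels would then serve as the training set for an SVM classifier. Any SVM implementation could fill that role, and the choice is left open here.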