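赵明明 (ZHAO Mingming) 1, 陶华 (TAO Hua) 2, 伏虎 (FU Hu) 2, 李昕 (LI Xin) 3,*
(
1. State Key Laboratory of Networking and Switching, Beijing University of Posts and Telecommunications; 2. Chaoyang Power Supply Company, Henan Electric Power Company; 3. State Key Laboratory of Networking and Switching, Beijing University of Posts and Telecommunications, Beijing 100876)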
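Abstract:
The Web has become an important channel through which people obtain information. Besides the main topical content, however, a web page also contains material unrelated to the topic, such as advertisements, copyright notices, and welcome messages. How to extract the main text from a web page has therefore become an active research topic in the machine learning and data mining communities. This paper briefly reviews the current state of research on web page main-text extraction methods and outlines directions for future work.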
Keywords:
web page main-text extraction; DOM tree; VIPS algorithm
ZHAO Mingming 1, TAO Hua 2, FU Hu 2, LI Xin 3,*
(
1. State Key Laboratory of Networking and Switching, Beijing University of Posts and Telecommunications; 2. Henan Electric Power Company, Xinyang Power Supply Company; 3. State Key Laboratory of Networking and Switching, Beijing University of Posts and Telecommunications, Beijing 100876)
Abstract:
The Web has become an important way for people to obtain information. In addition to their subject matter, web pages also include advertising, copyright information, welcome messages, and other content unrelated to the topic. How to extract the body text from a web page has become a research focus in the machine learning and data mining communities. This article gives a brief introduction to research on algorithms for extracting the body text of a web page and discusses prospects for future research.