Please use this identifier to cite or link to this item: http://tainguyenso.vnu.edu.vn/jspui/handle/123456789/7317

Title: A fast template-based approach to automatically identify primary text content of a web page
Authors: Nguyen, D.Q.
Nguyen, D.Q.
Pham, S.B.
Bui, T.D.
Keywords: Web mining
Template detection
Data mining
Issue Date: 2009
Publisher: KSE 2009 - The 1st International Conference on Knowledge and Systems Engineering
Citation: Page : 232-236
Abstract: Search engines have become an indispensable tool for browsing information on the Internet. The user, however, is often annoyed by redundant results from irrelevant web pages. One reason is that search engines also index non-informative blocks of web pages, such as advertisements and navigation links. In this paper, we propose a fast algorithm called FastContentExtractor that automatically detects the main content blocks in a web page by improving the ContentExtractor algorithm. By automatically identifying and storing templates that represent the structure of content blocks in a website, the content blocks of a new web page from that website can be extracted quickly. The hierarchical order of the output blocks is also maintained, which guarantees that the extracted content blocks are in the same order as the original ones. © 2009 IEEE.
URI: http://tainguyenso.vnu.edu.vn/jspui/handle/123456789/7317
ISBN: 9.78E+12
Appears in Collections: 2009-2010 VNU-DOI-Publications
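
The template idea summarized in the abstract (learn which blocks of a site carry content, store that as a template, then reuse it on new pages while keeping block order) can be illustrated with a minimal sketch. This is not the FastContentExtractor or ContentExtractor algorithm from the paper. It assumes, purely for illustration, that each page has already been split into (block_path, text) pairs, and it treats a block as content if its text differs between sample pages. The function names and the block paths are hypothetical.

```python
from collections import defaultdict


def learn_template(pages):
    """Return the set of block paths whose text varies across sample pages.

    Each page is a list of (block_path, text) pairs in document order.
    Assumption for this sketch: blocks whose text is identical on every
    sample page (menus, footers, adverts) are non-informative. A path seen
    with only one distinct text is therefore excluded from the template.
    """
    texts_by_path = defaultdict(set)
    for page in pages:
        for path, text in page:
            texts_by_path[path].add(text)
    return {path for path, texts in texts_by_path.items() if len(texts) > 1}


def extract_content(page, template):
    """Keep only blocks named in the template, in their original order."""
    return [text for path, text in page if path in template]


# Two hypothetical sample pages from the same site.
page_a = [
    ("body/div#nav", "Home | News"),
    ("body/div#main/h1", "Title A"),
    ("body/div#main/p", "Body A"),
    ("body/div#footer", "Copyright"),
]
page_b = [
    ("body/div#nav", "Home | News"),
    ("body/div#main/h1", "Title B"),
    ("body/div#main/p", "Body B"),
    ("body/div#footer", "Copyright"),
]

template = learn_template([page_a, page_b])

# A new page from the same site is processed by lookup against the template.
new_page = [
    ("body/div#nav", "Home | News"),
    ("body/div#main/h1", "Title C"),
    ("body/div#main/p", "Body C"),
    ("body/div#footer", "Copyright"),
]
content = extract_content(new_page, template)
print(content)
```

Because the template is learned once and a new page only needs a set lookup per block, extraction cost grows with the number of blocks on that page, and iterating the page in document order keeps the output in the original order.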

Files in This Item:

File: 242.pdf
Size: 47.81 kB
Format: Adobe PDF


 
