A fast template-based approach to automatically identify primary text content of a web page

DSpace/Manakin Repository



dc.contributor.author Nguyen, D.Q.
dc.contributor.author Nguyen, D.Q.
dc.contributor.author Pham, S.B.
dc.contributor.author Bui, T.D.
dc.date.accessioned 2011-05-09T08:56:34Z
dc.date.available 2011-05-09T08:56:34Z
dc.date.issued 2009
dc.identifier.citation Page : 232-236 vi
dc.identifier.isbn 9.78E+12
dc.identifier.uri http://tainguyenso.vnu.edu.vn/jspui/handle/123456789/7317
dc.description.abstract Search engines have become an indispensable tool for browsing information on the Internet. The user, however, is often annoyed by redundant results from irrelevant web pages. One reason is that search engines also look at non-informative blocks of web pages such as advertisements, navigation links, etc. In this paper, we propose a fast algorithm called FastContentExtractor to automatically detect main content blocks in a web page by improving the ContentExtractor algorithm. By automatically identifying and storing templates representing the structure of content blocks in a website, content blocks of a new web page from the website can be extracted quickly. The hierarchical order of the output blocks is also maintained, which guarantees that the extracted content blocks are in the same order as the original ones. © 2009 IEEE. vi
dc.language.iso en vi
dc.publisher KSE 2009 - The 1st International Conference on Knowledge and Systems Engineering vi
dc.subject Web mining vi
dc.subject Template detection vi
dc.subject Data mining vi
dc.title A fast template-based approach to automatically identify primary text content of a web page vi
dc.type Article vi
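
The abstract describes storing per-site templates and reusing them to extract content blocks from new pages while keeping their original order. The Python sketch below illustrates that general idea only; it is not the FastContentExtractor algorithm from the paper. It assumes, purely for illustration, that a block is identified by its tag path (with any id attribute), and that a path counts as content when its text differs across sample pages of the same site. The names BlockParser, parse_blocks, learn_template and extract_content are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BlockParser(HTMLParser):
    """Collect (structural path, text) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        elem_id = dict(attrs).get("id")
        self.stack.append(f"{tag}#{elem_id}" if elem_id else tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Pop back to the nearest matching open tag, tolerating unclosed tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split("#")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.blocks.append(("/".join(self.stack), text))


def parse_blocks(html):
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def learn_template(sample_pages):
    """Assumed heuristic: a path is content if its text varies between pages."""
    variants = defaultdict(set)
    for html in sample_pages:
        per_page = defaultdict(list)
        for path, text in parse_blocks(html):
            per_page[path].append(text)
        for path, texts in per_page.items():
            variants[path].add(tuple(texts))
    return {path for path, seen in variants.items() if len(seen) > 1}


def extract_content(html, template):
    """Keep only blocks whose path is in the template, in original order."""
    return [text for path, text in parse_blocks(html) if path in template]


def make_page(first, second):
    # Hypothetical site layout: fixed navigation and advert, variable main text.
    return (
        '<html><body><div id="nav"><a>Home</a><a>About</a></div>'
        f'<div id="main"><p>{first}</p><p>{second}</p></div>'
        '<div id="ads">Buy now</div></body></html>'
    )


template = learn_template([make_page("First story.", "More detail."),
                           make_page("Second story.", "Other detail.")])
print(extract_content(make_page("Third story.", "Final detail."), template))
```

Once the template is stored, a new page needs only a single parse and a set lookup per block, which is the kind of saving a template-based approach aims for. Because blocks are filtered in parse order, the extracted text keeps the order of the original page.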

Files in this item

Files Size Format View
242.pdf 47.80Kb PDF View/Open
