A fast template-based approach to automatically identify primary text content of a web page

DSpace/Manakin Repository

dc.contributor.author	Nguyen, D.Q.
dc.contributor.author	Pham, S.B.
dc.contributor.author	Bui, T.D.
dc.date.accessioned	2011-05-09T08:56:34Z
dc.date.available	2011-05-09T08:56:34Z
dc.date.issued	2009
dc.identifier.citation	Pages: 232-236	vi
dc.identifier.isbn	9.78E+12
dc.identifier.uri	http://tainguyenso.vnu.edu.vn/jspui/handle/123456789/7317
dc.description.abstract	Search engines have become an indispensable tool for browsing information on the Internet. Users, however, are often annoyed by redundant results from irrelevant web pages. One reason is that search engines also look at non-informative blocks of web pages, such as advertisements and navigation links. In this paper, we propose a fast algorithm called FastContentExtractor, which improves on the ContentExtractor algorithm to automatically detect the main content blocks in a web page. By automatically identifying and storing templates that represent the structure of content blocks in a website, the content blocks of a new web page from that website can be extracted quickly. The hierarchical order of the output blocks is also maintained, which guarantees that the extracted content blocks appear in the same order as in the original page. © 2009 IEEE.	vi
dc.language.iso	en	vi
dc.publisher	KSE 2009 - The 1st International Conference on Knowledge and Systems Engineering	vi
dc.subject	Web mining	vi
dc.subject	Template detection	vi
dc.subject	Data mining	vi
dc.title	A fast template-based approach to automatically identify primary text content of a web page	vi
dc.type	Article	vi
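
The template idea summarized in the abstract can be illustrated with a small sketch. This is not the authors' FastContentExtractor, whose details are not given in this record. It is a simplified illustration under stated assumptions: each page is already segmented into (path, text) blocks, a block whose text is identical on every sampled page of a site is treated as non-informative, and the stored "template" is just the set of paths whose text varies. The names `learn_template` and `extract_content` and the path strings are hypothetical.

```python
from collections import defaultdict


def learn_template(pages):
    """Return the set of block paths whose text differs across sample pages.

    Assumption: `pages` is a list of pages from one website, each page a
    list of (path, text) tuples in document order.
    """
    texts_by_path = defaultdict(set)
    for page in pages:
        for path, text in page:
            texts_by_path[path].add(text)
    # A path with a single distinct text (repeated everywhere, or seen only
    # once) is treated as boilerplate and left out of the template.
    return {path for path, texts in texts_by_path.items() if len(texts) > 1}


def extract_content(page, template):
    """Keep only blocks whose path is in the template, in original order."""
    return [text for path, text in page if path in template]


# Hypothetical sample pages from the same site.
page_a = [
    ("body/div[1]", "Home | News | Contact"),
    ("body/div[2]/h1", "Title A"),
    ("body/div[2]/p", "Story A"),
    ("body/div[3]", "Advertisement"),
]
page_b = [
    ("body/div[1]", "Home | News | Contact"),
    ("body/div[2]/h1", "Title B"),
    ("body/div[2]/p", "Story B"),
    ("body/div[3]", "Advertisement"),
]
new_page = [
    ("body/div[1]", "Home | News | Contact"),
    ("body/div[2]/h1", "Title C"),
    ("body/div[2]/p", "Story C"),
    ("body/div[3]", "Advertisement"),
]

template = learn_template([page_a, page_b])
content = extract_content(new_page, template)
```

Because the template is computed once per website and reused, extracting from a new page is a single pass over its blocks, and filtering in document order keeps the output blocks in the same order as the original page.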

Files in this item

Files	Size	Format	View
242.pdf	47.80Kb	PDF	View/Open
