Abstract:
|
Search engines have become an indispensable tool for browsing information on the Internet. The
user, however, is often annoyed by redundant results from irrelevant web pages. One reason is that
search engines also look at non-informative blocks of web pages, such as advertisements and navigation links.
In this paper, we propose a fast algorithm called FastContentExtractor to automatically detect main
content blocks in a web page by improving the ContentExtractor algorithm. By automatically identifying
and storing templates representing the structure of content blocks in a website, content blocks of a new web
page from the website can be extracted quickly. The hierarchical order of the output blocks is also
maintained, which guarantees that the extracted content blocks are in the same order as the original ones.
© 2009 IEEE. |