Please use this identifier to cite or link to this item: http://tainguyenso.vnu.edu.vn/jspui/handle/123456789/3406

Title: Extracting parallel texts from the web
Authors: Le, Cuong Anh
Issue Date: 2010
Abstract: A parallel corpus is a valuable resource for several important applications of natural language processing, such as statistical machine translation, dictionary construction, and cross-language information retrieval. The Web is a huge resource of knowledge, part of which is bilingual information spread across various kinds of web pages. Building parallel corpora from Internet resources currently attracts many studies. However, obtaining a parallel corpus with high accuracy is still a challenge. This paper focuses on extracting parallel texts from bilingual websites for the English-Vietnamese language pair. We first propose a new way of designing content-based features, and then combine them with structural features within a machine learning framework. In the experiment, we obtain a precision of 88.2% for the extracted parallel texts. © 2010 IEEE.
URI: http://hdl.handle.net/123456789/3406
Appears in Collections: Bài báo khoa học (Scientific papers)

Files in This Item:

File: 2.99.pdf
Size: 47.29 kB
Format: Adobe PDF
