A hybrid approach to word segmentation of Vietnamese texts

Please use this identifier to cite or link to this item: http://tainguyenso.vnu.edu.vn/jspui/handle/123456789/6636

Title:	A hybrid approach to word segmentation of Vietnamese texts
Authors:	Phuong, L.H.; Huyen, N.T.M.; Roussanaly, A.; Vinh, H.T.
Keywords:	Applications; Computational linguistics; Finite automata; Graph theory; Linguistics; Natural language processing systems; Robots; Semantics; Statistical methods; Translation (languages); Bigram language models; Hybrid approaches; Linear graphs; Regular expressions
Issue Date:	2008
Publisher:	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Citation:	Volume 5196 LNCS, pp. 240-249
Abstract:	We present in this article a hybrid approach to automatically tokenizing Vietnamese text. The approach combines finite-state automata techniques, regular expression parsing, and a maximal-matching strategy that is augmented by statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then used to build linear graphs corresponding to the phrases to be segmented. Applying the maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented in vnTokenizer, a highly accurate tokenizer for Vietnamese texts. © 2008 Springer-Verlag Berlin Heidelberg.
URI:	http://tainguyenso.vnu.edu.vn/jspui/handle/123456789/6636
ISBN:	3540882812; 9783540882817
ISSN:	0302-9743
Appears in Collections:	2006-2008 VNU-DOI-Publications

Files in This Item:

File	Description	Size	Format
428.pdf		44.5 kB	Adobe PDF
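
The graph-and-resolver steps described in the abstract can be illustrated with a small Python sketch. It is not the vnTokenizer implementation, and it rests on several assumptions: the lexicon is a plain Python set rather than a minimal finite-state automaton, "maximal matching" is read as keeping the paths that use the fewest words, the bigram model uses add-one smoothing (the abstract only says "smoothed"), and the toy lexicon, corpus, and all function names are hypothetical.

```python
import math
from collections import Counter


def build_edges(syllables, lexicon):
    """Linear graph: node i sits before syllable i; edge (i, j) spans one word."""
    n = len(syllables)
    edges = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n + 1):
            # Single syllables are always allowed so the graph stays connected.
            if j == i + 1 or " ".join(syllables[i:j]) in lexicon:
                edges[i].append(j)
    return edges


def candidate_segmentations(syllables, lexicon):
    """Return every path from node 0 to node n that uses the fewest words."""
    n = len(syllables)
    edges = build_edges(syllables, lexicon)
    # dist[i] = minimum number of words needed to cover syllables[i:]
    dist = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        dist[i] = 1 + min(dist[j] for j in edges[i])

    def expand(i):
        if i == n:
            yield []
            return
        for j in edges[i]:
            if dist[j] == dist[i] - 1:  # edge lies on a shortest path
                for rest in expand(j):
                    yield [" ".join(syllables[i:j])] + rest

    return list(expand(0))


class BigramModel:
    """Bigram model with add-one smoothing (an assumed smoothing scheme)."""

    def __init__(self, corpus):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in corpus:
            words = ["<s>"] + sentence
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, words):
        words = ["<s>"] + words
        total = 0.0
        for prev, cur in zip(words, words[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.unigrams[prev] + self.vocab_size
            total += math.log(num / den)
        return total


def resolve(candidates, model):
    """Pick the candidate segmentation with the highest bigram log-probability."""
    return max(candidates, key=model.log_prob)


# Toy data (hypothetical): an ambiguous phrase and a tiny pre-segmented corpus.
LEXICON = {"học", "sinh", "học sinh", "sinh học"}
CORPUS = [["học sinh", "học", "sinh học"], ["học sinh", "học", "toán"]]

candidates = candidate_segmentations("học sinh học sinh học".split(), LEXICON)
best = resolve(candidates, BigramModel(CORPUS))
print(candidates)
print(best)
```

With this toy data the phrase has three candidate segmentations of three words each, and the bigram scores are what separate them. A real system would estimate the model from a large segmented corpus and would look words up in a compiled automaton rather than a set.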