A hybrid approach to word segmentation of Vietnamese texts

Title: A hybrid approach to word segmentation of Vietnamese texts
Author: Phuong, L.H.; Huyen, N.T.M.; Roussanaly, A.; Vinh, H.T.
Abstract: We present in this article a hybrid approach to the automatic tokenization of Vietnamese text. The approach combines a finite-state automata technique, regular expression parsing, and a maximal-matching strategy augmented by statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. Applying the maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented to create vnTokenizer, a highly accurate tokenizer for Vietnamese texts. © 2008 Springer-Verlag Berlin Heidelberg.
URI: http://tainguyenso.vnu.edu.vn/jspui/handle/123456789/6636
Date: 2008
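
To make the pipeline described in the abstract concrete, below is a minimal Python sketch of the maximal-matching and bigram-resolution steps. The lexicon, bigram counts, and vocabulary size are invented for illustration only; this is not the vnTokenizer implementation, which additionally uses a minimal finite-state lexicon automaton and regular-expression pre-parsing as described above.

    # Illustrative sketch (not the authors' vnTokenizer code): candidate
    # segmentations of a syllable sequence are enumerated over a toy lexicon,
    # maximal matching keeps the segmentations with the fewest words, and a
    # smoothed bigram model picks the most probable one. All data are invented.
    import math
    from collections import defaultdict

    # Toy lexicon: words as tuples of syllables (hypothetical sample entries).
    LEXICON = {
        ("học",), ("sinh",), ("học", "sinh"), ("sinh", "học"),
        ("giỏi",), ("một",), ("ngành",),
    }
    MAX_WORD_LEN = max(len(w) for w in LEXICON)

    def candidate_segmentations(syllables):
        """Enumerate segmentations whose words are in the lexicon
        (single unknown syllables are allowed as fallback words)."""
        if not syllables:
            return [[]]
        results = []
        for length in range(min(MAX_WORD_LEN, len(syllables)), 0, -1):
            word = tuple(syllables[:length])
            if word in LEXICON or length == 1:
                for rest in candidate_segmentations(syllables[length:]):
                    results.append(["_".join(word)] + rest)
        return results

    # Toy bigram and unigram counts over segmented words (purely illustrative).
    BIGRAMS = defaultdict(int, {("học_sinh", "giỏi"): 3, ("sinh", "học"): 1})
    UNIGRAMS = defaultdict(int, {"học_sinh": 5, "giỏi": 4, "sinh": 2, "học": 2})
    VOCAB = 1000  # assumed vocabulary size for add-one smoothing

    def bigram_logprob(segmentation):
        """Score a segmentation with an add-one-smoothed bigram model."""
        score = 0.0
        for prev, curr in zip(segmentation, segmentation[1:]):
            score += math.log((BIGRAMS[(prev, curr)] + 1) /
                              (UNIGRAMS[prev] + VOCAB))
        return score

    def segment(syllables):
        """Pick the most probable maximal-matching segmentation of a phrase."""
        candidates = candidate_segmentations(syllables)
        fewest = min(len(c) for c in candidates)  # maximal matching: fewest words
        best = [c for c in candidates if len(c) == fewest]
        return max(best, key=bigram_logprob)

    if __name__ == "__main__":
        print(segment(["học", "sinh", "học", "sinh", "học", "giỏi"]))

In the sketch, maximal matching narrows the candidates to those with the fewest words, and the bigram scorer only breaks ties among them, mirroring the division of labour between the segmentation graph and the ambiguity resolver described in the abstract.
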

Files in this item

File Size Format
428.pdf 44.49 KB PDF
