A hybrid approach to word segmentation of Vietnamese texts


dc.contributor.author Phuong, L.H.
dc.contributor.author Huyen, N.T.M.
dc.contributor.author Roussanaly, A.
dc.contributor.author Vinh, H.T.
dc.date.accessioned 2011-05-05T06:57:09Z
dc.date.available 2011-05-05T06:57:09Z
dc.date.issued 2008
dc.identifier.citation Volume: 5196 LNCS, Pages: 240-249 vi
dc.identifier.isbn 3540882812; 9783540882817
dc.identifier.issn 0302-9743
dc.identifier.uri http://tainguyenso.vnu.edu.vn/jspui/handle/123456789/6636
dc.description.abstract We present in this article a hybrid approach to automatically tokenizing Vietnamese text. The approach combines finite-state automata techniques, regular expression parsing, and a maximal-matching strategy, which is augmented by statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. Applying the maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented in vnTokenizer, a highly accurate tokenizer for Vietnamese texts. © 2008 Springer-Verlag Berlin Heidelberg. vi
dc.language.iso en vi
dc.publisher Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) vi
dc.subject Applications vi
dc.subject Computational linguistics vi
dc.subject Finite automata vi
dc.subject Graph theory vi
dc.subject Linguistics vi
dc.subject Natural language processing systems vi
dc.subject Robots vi
dc.subject Semantics vi
dc.subject Statistical methods vi
dc.subject Translation (languages) vi
dc.subject Bigram language models vi
dc.subject Hybrid approaches vi
dc.subject Linear graphs vi
dc.subject Regular expressions vi
dc.title A hybrid approach to word segmentation of Vietnamese texts vi
dc.type Article vi
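
The segmentation and disambiguation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the vnTokenizer implementation. The toy lexicon, the two-sentence training corpus, the function names, and the use of add-one smoothing are all assumptions made for illustration; the abstract says only that the bigram model is smoothed, and that the real lexicon is stored in a minimal finite-state automaton rather than a set. Here "maximal matching" is read as keeping the segmentations with the fewest words, which is also an assumption.

```python
import math
from collections import Counter

# Hypothetical toy lexicon; a plain set stands in for the minimal automaton.
LEXICON = {"học", "sinh", "học sinh", "sinh học", "toán"}

def candidate_segmentations(syllables, lexicon, max_len=3):
    """Return every segmentation that uses the fewest words.

    Positions between syllables are the nodes of a linear graph, and each
    lexicon word spanning syllables[i:j] is an edge from i to j. The
    candidates are the shortest paths from node 0 to node n.
    """
    n = len(syllables)
    best = [None] * (n + 1)   # best[i]: shortest segmentations of syllables[i:]
    best[n] = [[]]
    for i in range(n - 1, -1, -1):
        options = []
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = " ".join(syllables[i:j])
            # A single syllable is always accepted so the graph stays connected.
            if j == i + 1 or word in lexicon:
                options.extend([word] + tail for tail in best[j])
        shortest = min(len(o) for o in options)
        best[i] = [o for o in options if len(o) == shortest]
    return best[0]

def train_bigram(corpus):
    """Count unigrams and bigrams over already segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(segmentation, unigrams, bigrams):
    """Add-one smoothed bigram log-probability (smoothing choice is assumed)."""
    vocab = len(unigrams) + 1  # one extra slot for unseen words
    tokens = ["<s>"] + segmentation
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(tokens, tokens[1:])
    )

def segment(phrase, lexicon, unigrams, bigrams):
    """Pick the most probable candidate segmentation of a phrase."""
    candidates = candidate_segmentations(phrase.split(), lexicon)
    return max(candidates, key=lambda s: log_prob(s, unigrams, bigrams))

# Hypothetical hand-segmented training data.
CORPUS = [["học sinh", "học", "sinh học"], ["học sinh", "học", "toán"]]
UNIGRAMS, BIGRAMS = train_bigram(CORPUS)

PHRASE = "học sinh học sinh học"
CANDIDATES = candidate_segmentations(PHRASE.split(), LEXICON)
RESULT = segment(PHRASE, LEXICON, UNIGRAMS, BIGRAMS)
print(len(CANDIDATES))  # → 3
print(RESULT)           # → ['học sinh', 'học', 'sinh học']
```

The phrase has three candidate segmentations of equal length, so word count alone cannot decide between them; the bigram scores break the tie in favour of the reading seen in the training data.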

Files in this item

File Size Format
428.pdf 44.49 KB PDF
