A hybrid approach to word segmentation of Vietnamese texts

dc.contributor.author	Phuong, L.H.
dc.contributor.author	Huyen, N.T.M.
dc.contributor.author	Roussanaly, A.
dc.contributor.author	Vinh, H.T.
dc.date.accessioned	2011-05-05T06:57:09Z
dc.date.available	2011-05-05T06:57:09Z
dc.date.issued	2008
dc.identifier.citation	Volume: 5196 LNCS, Pages: 240-249	vi
dc.identifier.isbn	3540882812; 9783540882817
dc.identifier.issn	0302-9743
dc.identifier.uri	http://tainguyenso.vnu.edu.vn/jspui/handle/123456789/6636
dc.description.abstract	We present in this article a hybrid approach to automatically tokenize Vietnamese text. The approach combines finite-state automata, regular expression parsing, and a maximal-matching strategy augmented by statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using predefined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. Applying a maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented in vnTokenizer, a highly accurate tokenizer for Vietnamese texts. © 2008 Springer-Verlag Berlin Heidelberg.	vi
dc.language.iso	en	vi
dc.publisher	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)	vi
dc.subject	Applications	vi
dc.subject	Computational linguistics	vi
dc.subject	Finite automata	vi
dc.subject	Graph theory	vi
dc.subject	Linguistics	vi
dc.subject	Natural language processing systems	vi
dc.subject	Robots	vi
dc.subject	Semantics	vi
dc.subject	Statistical methods	vi
dc.subject	Translation (languages)	vi
dc.subject	Bigram language models	vi
dc.subject	Hybrid approaches	vi
dc.subject	Linear graphs	vi
dc.subject	Regular expressions	vi
dc.title	A hybrid approach to word segmentation of Vietnamese texts	vi
dc.type	Article	vi

Files in this item

Files	Size	Format	View
428.pdf	44.49Kb	PDF	View/Open
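
The pipeline summarized in the abstract can be illustrated with a small sketch. This is not the vnTokenizer implementation: the lexicon is a plain Python set rather than a minimal finite-state automaton, maximal matching is interpreted here as keeping the segmentations with the fewest words, and add-one smoothing stands in for whatever smoothing the authors apply to their bigram model. The function names, the toy lexicon, and the toy training corpus are all hypothetical.

```python
import math
from collections import Counter


def all_segmentations(syllables, lexicon, max_len=3):
    """Enumerate every path through the linear graph over the syllables.

    An edge spans syllables[start:end] when the joined string is in the
    lexicon; single syllables are always allowed so a path always exists.
    """
    n = len(syllables)
    paths = []

    def walk(start, path):
        if start == n:
            paths.append(list(path))
            return
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = " ".join(syllables[start:end])
            if end == start + 1 or word in lexicon:
                path.append(word)
                walk(end, path)
                path.pop()

    walk(0, [])
    return paths


def shortest_segmentations(syllables, lexicon):
    """Keep only the segmentations with the fewest words (assumed reading
    of the maximal-matching strategy); ties are the ambiguous candidates."""
    paths = all_segmentations(syllables, lexicon)
    fewest = min(len(p) for p in paths)
    return [p for p in paths if len(p) == fewest]


def train_bigrams(corpus):
    """Count bigrams and their left contexts over pre-segmented sentences."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        vocab.update(sentence)
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def log_prob(words, model):
    """Log-probability of a segmentation under an add-one smoothed bigram model."""
    bigrams, contexts, vocab = model
    total = 0.0
    for prev, word in zip(["<s>"] + words, words):
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab)))
    return total


# Toy data, invented for illustration only.
LEXICON = {"học", "sinh", "học sinh", "sinh học", "toán", "tôi", "thích"}
CORPUS = [
    ["học sinh", "học", "sinh học"],
    ["học sinh", "học", "toán"],
    ["tôi", "thích", "sinh học"],
]
PHRASE = "học sinh học sinh học"

model = train_bigrams(CORPUS)
candidates = shortest_segmentations(PHRASE.split(), LEXICON)
best = max(candidates, key=lambda c: log_prob(c, model))
print(" | ".join(best))
```

Exhaustive enumeration is used only because the example phrase is short; on longer phrases a shortest-path search over the graph would avoid listing every segmentation.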
