dc.contributor.author |
Phuong, L.H. |
|
dc.contributor.author |
Huyen, N.T.M. |
|
dc.contributor.author |
Roussanaly, A. |
|
dc.contributor.author |
Vinh, H.T. |
|
dc.date.accessioned |
2011-05-05T06:57:09Z |
|
dc.date.available |
2011-05-05T06:57:09Z |
|
dc.date.issued |
2008 |
|
dc.identifier.citation |
Volume: 5196 LNCS, Pages: 240-249 |
vi |
dc.identifier.isbn |
3540882812; 9783540882817 |
|
dc.identifier.issn |
0302-9743 |
|
dc.identifier.uri |
http://tainguyenso.vnu.edu.vn/jspui/handle/123456789/6636 |
|
dc.description.abstract |
We present in this article a hybrid approach to automatically tokenizing Vietnamese text. The
approach combines finite-state automata techniques, regular expression parsing, and a maximal-matching
strategy augmented by statistical methods to resolve segmentation ambiguities. The
Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be
tokenized is first parsed into lexical phrases and other patterns using pre-defined regular expressions. The
automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. The
application of a maximal-matching strategy on a graph results in all candidate segmentations of a phrase. It
is the responsibility of an ambiguity resolver, which uses a smoothed bigram language model, to choose the
most probable segmentation of the phrase. The hybrid approach is implemented to create vnTokenizer, a
highly accurate tokenizer for Vietnamese texts. © 2008 Springer-Verlag Berlin Heidelberg. |
vi |
dc.language.iso |
en |
vi |
dc.publisher |
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
vi |
dc.subject |
Applications |
vi |
dc.subject |
Computational linguistics |
vi |
dc.subject |
Finite automata |
vi |
dc.subject |
Graph theory |
vi |
dc.subject |
Linguistics |
vi |
dc.subject |
Natural language processing systems |
vi |
dc.subject |
Robots |
vi |
dc.subject |
Semantics |
vi |
dc.subject |
Statistical methods |
vi |
dc.subject |
Translation (languages) |
vi |
dc.subject |
Bigram language models |
vi |
dc.subject |
Hybrid approaches |
vi |
dc.subject |
Linear graphs |
vi |
dc.subject |
Regular expressions |
vi |
dc.title |
A hybrid approach to word segmentation of Vietnamese texts |
vi |
dc.type |
Article |
vi |