Applications; Computational linguistics; Finite automata; Graph theory; Linguistics; Natural language processing systems; Robots; Semantics; Statistical methods; Translation (languages); Bigram language models; Hybrid approaches; Linear graphs; Regular expressions
Issue Date:
2008
Publisher:
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Citation:
Volume: 5196 LNCS, Pages: 240-249
Abstract:
We present in this article a hybrid approach to automatically tokenizing Vietnamese text. The
approach combines finite-state automata techniques, regular expression parsing, and a maximal-matching
strategy that is augmented by statistical methods to resolve segmentation ambiguities. The
Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be
tokenized is first parsed into lexical phrases and other patterns using pre-defined regular expressions. The
automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. The
application of a maximal-matching strategy to a graph yields all candidate segmentations of a phrase. It
is the responsibility of an ambiguity resolver, which uses a smoothed bigram language model, to choose the
most probable segmentation of the phrase. The hybrid approach is implemented to create vnTokenizer, a
highly accurate tokenizer for Vietnamese texts. © 2008 Springer-Verlag Berlin Heidelberg.
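
The graph and ambiguity-resolution steps described in the abstract can be illustrated with a minimal Python sketch. It is not vnTokenizer's code and rests on several assumptions: a plain Python set stands in for the minimal finite-state automaton, "maximal matching" is read as keeping every segmentation with the fewest words (the shortest paths through the linear graph), add-one smoothing stands in for whatever smoothing the paper uses, and the lexicon and counts are invented toy data. The regular-expression pre-pass is omitted, and all function names are hypothetical.

```python
import math
from collections import Counter


def build_edges(syllables, lexicon, max_len=4):
    """Linear graph: node i sits before syllable i; an edge i -> j is a word."""
    n = len(syllables)
    edges = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = " ".join(syllables[i:j])
            # A single syllable is always an edge, so the graph stays connected.
            if j == i + 1 or word in lexicon:
                edges[i].append((j, word))
    return edges


def shortest_segmentations(syllables, lexicon):
    """Return every segmentation that uses the fewest words (assumed reading)."""
    n = len(syllables)
    edges = build_edges(syllables, lexicon)
    # fewest[i] = minimum number of words needed to cover syllables[i:]
    fewest = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        fewest[i] = 1 + min(fewest[j] for j, _ in edges[i])

    def expand(i):
        if i == n:
            yield []
            return
        for j, word in edges[i]:
            if fewest[j] + 1 == fewest[i]:  # edge lies on a shortest path
                for rest in expand(j):
                    yield [word] + rest

    return list(expand(0))


def bigram_log_prob(words, unigrams, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability (a stand-in smoothing scheme)."""
    padded = ["<s>"] + words
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


# Toy data, invented for illustration only.
LEXICON = {"học", "học sinh", "sinh học"}
UNIGRAMS = Counter({"<s>": 3, "học sinh": 3, "học": 3, "sinh học": 3})
BIGRAMS = Counter({("<s>", "học sinh"): 3,
                   ("học sinh", "học"): 3,
                   ("học", "sinh học"): 3})

syllables = "học sinh học sinh học".split()
candidates = shortest_segmentations(syllables, LEXICON)
chosen = max(candidates,
             key=lambda seg: bigram_log_prob(seg, UNIGRAMS, BIGRAMS, vocab_size=4))
print(candidates)
print(chosen)
```

With these toy counts, the phrase has three equally short candidate segmentations, and the bigram score is what separates them. A real lexicon automaton would replace the set lookup in `build_edges` without changing the rest of the sketch.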