Chinese Text Segmenter

Unlike English, German, or Spanish, written Chinese has no word delimiters other than punctuation marks, and a Chinese word often consists of more than one character. This makes applications that use the word as their basic unit, such as searching a digital library, word-based compression, and speech recognition, very difficult to implement. Word segmentation is therefore an important task for these applications: a segmenter predicts where the word boundaries fall and inserts spaces at those positions.

We use a character-based PPM (Prediction by Partial Matching) language model, trained on a text file of 1,000,000 words, to do the segmentation. PPM has been found to be very effective for this task, and it is also known as one of the best methods for text compression.
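The idea of segmenting with a character-based model can be sketched as follows. This is only an illustration, not the segmenter's actual code: it replaces PPM with a much simpler order-1 (bigram) character model using add-k smoothing, and the function names, the toy training text, and the smoothing constant are all assumptions made for the example. The space is treated as an ordinary symbol, and a space is inserted between two characters whenever the model finds that more probable than leaving them joined.

```python
import math
from collections import Counter, defaultdict

K = 0.1  # add-k smoothing constant (an assumption for this sketch)

def train(segmented_text):
    """Count character bigrams in pre-segmented text; space is a normal symbol."""
    text = " " + " ".join(segmented_text.split()) + " "
    counts = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        counts[prev][cur] += 1
    return counts, set(text)

def log_prob(model, prev, cur):
    """Smoothed log P(cur | prev); one extra slot is reserved for unseen symbols."""
    counts, vocab = model
    total = sum(counts[prev].values())
    return math.log((counts[prev][cur] + K) / (total + K * (len(vocab) + 1)))

def segment(model, text):
    """Insert a space at each gap where 'prev, space, cur' beats 'prev, cur'."""
    out = [text[0]] if text else []
    for prev, cur in zip(text, text[1:]):
        joined = log_prob(model, prev, cur)
        split = log_prob(model, prev, " ") + log_prob(model, " ", cur)
        if split > joined:
            out.append(" ")
        out.append(cur)
    return "".join(out)

# Toy training data, far smaller than the 1,000,000-word file mentioned above.
model = train("我们 喜欢 中文 我们 学习 中文 他们 喜欢 学习")
print(segment(model, "我们喜欢中文"))
```

With an order-1 model each gap can be decided independently, because the context after every step is just the last character. A higher-order model such as PPM looks further back, so the choice at one gap affects later ones and a search over the possible space insertions is needed instead.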

This text segmenter takes GB-encoded input text. If you are using Windows 95/98/NT, you may download this plugin to input and view GB-encoded text.
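For readers preparing input programmatically, the short sketch below shows one way to convert between Unicode strings and GB-encoded bytes. It assumes that "GB" here means GB2312 and uses Python's standard `gb2312` codec; the helper names are hypothetical and are not part of the segmenter.

```python
def write_gb_text(text: str, encoding: str = "gb2312") -> bytes:
    """Encode a Unicode string as GB bytes (GB2312 assumed)."""
    return text.encode(encoding)

def read_gb_text(raw: bytes, encoding: str = "gb2312") -> str:
    """Decode GB bytes to Unicode; undecodable bytes become U+FFFD."""
    return raw.decode(encoding, errors="replace")

sample = write_gb_text("中文分词")   # two bytes per Chinese character
print(len(sample), read_gb_text(sample))
```

If the text contains characters outside GB2312, the `gbk` or `gb18030` codecs, which are supersets of it, may be a better choice for the `encoding` argument.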