Provides a language-independent way to break Unicode
text into meaningful semantic units (e.g. words).
start()
Starts up the semantic unit scanner with an optional
character set, which acts as a hint to optimize the heuristics
used to determine the language(s) of the processed text.
characterSet | the character set the text was originally encoded in (can be NULL) |
next()
Gets the begin and end offsets of the next unit in the current text.
text | the text to be scanned |
length | the number of characters in the text to be processed |
pos | the current position |
isLastBuffer | whether the buffer is the last one |
begin | the begin offset of the next unit |
end | the end offset of the next unit |
return value | whether there are more units in the current text |
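
The call sequence described above (one start(), then repeated next() calls that advance pos) can be sketched as follows. This is only an illustration under stated assumptions, not the actual scanner: `WhitespaceUnitScanner` and `scan_units` are hypothetical names, the toy scanner splits on whitespace instead of applying language heuristics, and the out parameters begin and end are modelled as a returned tuple. It also assumes that a unit touching the end of a non-final buffer is held back, since it may continue in the next buffer.

```python
from typing import List, Optional, Tuple


class WhitespaceUnitScanner:
    """Hypothetical stand-in for the scanner; it splits on whitespace only."""

    def __init__(self) -> None:
        self.character_set: Optional[str] = None

    def start(self, character_set: Optional[str] = None) -> None:
        # The character set is only a hint; this toy scanner stores and ignores it.
        self.character_set = character_set

    def next(self, text: str, length: int, pos: int,
             is_last_buffer: bool) -> Tuple[bool, int, int]:
        """Return (has_more, begin, end) for the next unit at or after pos."""
        # Skip any whitespace in front of the unit.
        begin = pos
        while begin < length and text[begin].isspace():
            begin += 1
        if begin >= length:
            return False, length, length
        # Extend the unit up to the next whitespace character.
        end = begin
        while end < length and not text[end].isspace():
            end += 1
        # Assumption: a unit that reaches the end of a non-final buffer
        # may be incomplete, so it is not reported yet.
        if end == length and not is_last_buffer:
            return False, begin, begin
        return True, begin, end


def scan_units(text: str, character_set: Optional[str] = None) -> List[str]:
    """Collect every unit of a single, final buffer."""
    scanner = WhitespaceUnitScanner()
    scanner.start(character_set)
    units: List[str] = []
    pos = 0
    while True:
        has_more, begin, end = scanner.next(text, len(text), pos, True)
        if not has_more:
            break
        units.append(text[begin:end])
        pos = end  # continue scanning after the unit just returned
    return units
```

In this sketch the caller owns pos and feeds the returned end offset back into the next call, which is one plausible reading of the pos parameter; the real interface may define the hand-off differently.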