public class RussianLetterTokenizer extends CharTokenizer
A Tokenizer that extends the behavior of LetterTokenizer by additionally looking up letters in a given "russian charset".
The problem with LetterTokenizer is that it uses the Character.isLetter(char) method, which doesn't know how to detect letters in encodings like CP1252 and KOI8 (well-known problems with the 0xD7 and 0xF7 chars).
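One way to read the note about 0xD7 and 0xF7 is sketched below. It assumes the bytes are decoded as ISO-8859-1, which agrees with CP1252 at these two positions and maps them to the multiplication and division signs; in KOI8-R the same byte values are assigned to Cyrillic letters. The class and method names are hypothetical and are not part of Lucene.

```java
import java.nio.charset.StandardCharsets;

public class IsLetterDemo {
    // Decodes a single byte as ISO-8859-1 (byte 0xNN becomes U+00NN) and
    // reports whether Character.isLetter accepts the resulting char.
    public static boolean isLetterWhenReadAsLatin1(int byteValue) {
        byte[] raw = {(byte) byteValue};
        String decoded = new String(raw, StandardCharsets.ISO_8859_1);
        return Character.isLetter(decoded.charAt(0));
    }

    public static void main(String[] args) {
        // U+00D7 MULTIPLICATION SIGN and U+00F7 DIVISION SIGN are math symbols.
        System.out.println(isLetterWhenReadAsLatin1(0xD7));
        System.out.println(isLetterWhenReadAsLatin1(0xF7));
        // U+00D6 LATIN CAPITAL LETTER O WITH DIAERESIS is a letter.
        System.out.println(isLetterWhenReadAsLatin1(0xD6));
        // U+0432 CYRILLIC SMALL LETTER VE, the char a KOI8-R decoder yields for 0xD7.
        System.out.println(Character.isLetter('\u0432'));
    }
}
```

If text in a Cyrillic encoding is read with the wrong decoder, a letter-only predicate would therefore split tokens at these two positions.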
Nested classes/interfaces inherited from class AttributeSource: AttributeSource.AttributeFactory, AttributeSource.State
Constructor and Description |
---|
RussianLetterTokenizer(AttributeSource.AttributeFactory factory, Reader in) |
RussianLetterTokenizer(AttributeSource source, Reader in) |
RussianLetterTokenizer(Reader in) |
RussianLetterTokenizer(Reader in, char[] charset) Deprecated. Use RussianLetterTokenizer(Reader) instead. |
Modifier and Type | Method and Description |
---|---|
protected boolean | isTokenChar(char c) Collects only characters which satisfy Character.isLetter(char). |
Methods inherited from class CharTokenizer: end, incrementToken, next, next, normalize, reset
Methods inherited from class Tokenizer: close, correctOffset
Methods inherited from class TokenStream: getOnlyUseNewAPI, reset, setOnlyUseNewAPI
Methods inherited from class AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, restoreState, toString
public RussianLetterTokenizer(Reader in, char[] charset)
Deprecated. Use RussianLetterTokenizer(Reader) instead.
public RussianLetterTokenizer(Reader in)
public RussianLetterTokenizer(AttributeSource source, Reader in)
public RussianLetterTokenizer(AttributeSource.AttributeFactory factory, Reader in)
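The class description mentions looking up letters in a given charset, and the deprecated two-argument constructor takes a char[] charset. The sketch below shows what such a lookup could look like: a char is accepted if it is a Unicode letter or if it appears in the supplied array. This is an assumption about the idea, not the library's actual code; CharsetLookupSketch and its two-argument isTokenChar are hypothetical names.

```java
public class CharsetLookupSketch {
    // Hypothetical helper, not part of Lucene: accepts c if it is a Unicode
    // letter or if it is listed in the caller-supplied charset array.
    public static boolean isTokenChar(char c, char[] charset) {
        if (Character.isLetter(c)) {
            return true;
        }
        for (char allowed : charset) {
            if (c == allowed) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative charset only; a real one would list the Russian letters.
        char[] extra = {'\u00D7', '\u00F7'};
        System.out.println(isTokenChar('\u00D7', extra));       // listed in the charset
        System.out.println(isTokenChar('\u00D7', new char[0])); // not a letter, not listed
        System.out.println(isTokenChar('\u0436', new char[0])); // CYRILLIC SMALL LETTER ZHE
    }
}
```

With a correctly decoded Reader, Character.isLetter already accepts Cyrillic letters, which may be why the charset argument was deprecated in favor of RussianLetterTokenizer(Reader).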
protected boolean isTokenChar(char c)
Collects only characters which satisfy Character.isLetter(char).
isTokenChar in class CharTokenizer
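To illustrate how a predicate such as isTokenChar can drive tokenization, here is a simplified stand-in written in plain Java. It groups maximal runs of accepted chars into tokens and treats everything else as a separator. It is a sketch under those assumptions, not CharTokenizer's implementation: it ignores offsets, attributes, normalization, and buffering, and LetterRunSketch and tokenize are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterRunSketch {
    // Same rule as documented for isTokenChar: keep only Unicode letters.
    protected static boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }

    // Splits text into maximal runs of chars accepted by isTokenChar.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-token char closes the run collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two Russian words separated by punctuation, a space, and digits.
        String text = "\u041F\u0440\u0438\u0432\u0435\u0442, \u043C\u0438\u0440 42!";
        System.out.println(tokenize(text).size());
    }
}
```

Digits and punctuation fail the letter test, so they end a token without becoming part of one.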
Copyright © 2000-2012 Apache Software Foundation. All Rights Reserved.