public class AnalyzingSuggester extends Lookup implements Accountable
This can result in powerful suggester functionality. For
example, if you use an analyzer removing stop words,
then the partial text "ghost chr..." could produce the
suggestion "The Ghost of Christmas Past". Note that
position increments MUST NOT be preserved for this example
to work, so you should call the constructor with the
preservePositionIncrements parameter set to false.
If SynonymFilter is used to map wifi and wireless network to hotspot then the partial text "wirele..." could suggest "wifi router". Token normalization like stemmers, accent removal, etc., would allow suggestions to ignore such variations.
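The stop-word example above can be illustrated with a small stand-alone sketch. It does not use Lucene: the class StopWordPrefixSketch, its stop-word list, and the whitespace tokenizer are all assumptions made for illustration. Dropping stop words without leaving a gap loosely mirrors what happens when preservePositionIncrements is false.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordPrefixSketch {
    // Hypothetical stop-word list, for this sketch only.
    private static final Set<String> STOP_WORDS = Set.of("the", "of", "a", "an");

    /** Lowercases, splits on whitespace, drops stop words, and joins the rest with single spaces. */
    static String analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    /** True if the analyzed query is a prefix of the analyzed surface form. */
    static boolean matches(String query, String surfaceForm) {
        return analyze(surfaceForm).startsWith(analyze(query));
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Ghost of Christmas Past"));
        System.out.println(matches("ghost chr", "The Ghost of Christmas Past"));
    }
}
```

Because both the surface form and the query pass through the same analysis, the leading "The" and the inner "of" never take part in the prefix comparison.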
When two matching suggestions have the same weight, they are tie-broken by the analyzed form. If their analyzed form is the same then the order is undefined.
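The ordering rule above can be sketched as a comparator. This is only an illustration: the Suggestion holder is hypothetical, and comparing analyzed forms with String.compareTo is an assumption (the suggester itself works on the analyzed bytes).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TieBreakSketch {
    /** Hypothetical holder for one suggestion; not a Lucene type. */
    static final class Suggestion {
        final String analyzedForm;
        final String surfaceForm;
        final long weight;

        Suggestion(String analyzedForm, String surfaceForm, long weight) {
            this.analyzedForm = analyzedForm;
            this.surfaceForm = surfaceForm;
            this.weight = weight;
        }
    }

    // Higher weight first; equal weights fall back to the analyzed form.
    static final Comparator<Suggestion> ORDER = (a, b) -> {
        int byWeight = Long.compare(b.weight, a.weight);
        return byWeight != 0 ? byWeight : a.analyzedForm.compareTo(b.analyzedForm);
    };

    /** Returns the surface forms in ranked order without modifying the input list. */
    static List<String> rank(List<Suggestion> suggestions) {
        List<Suggestion> sorted = new ArrayList<>(suggestions);
        sorted.sort(ORDER);
        List<String> surfaces = new ArrayList<>();
        for (Suggestion s : sorted) {
            surfaces.add(s.surfaceForm);
        }
        return surfaces;
    }

    public static void main(String[] args) {
        List<Suggestion> in = List.of(
            new Suggestion("wifi router", "WiFi Router", 5),
            new Suggestion("wifi adapter", "WiFi Adapter", 5),
            new Suggestion("wifi", "WiFi", 9));
        System.out.println(rank(in)); // prints [WiFi, WiFi Adapter, WiFi Router]
    }
}
```

If two entries share both weight and analyzed form, this comparator returns 0 and their relative order depends on the input order, which is one way to read "undefined".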
There are some limitations:
If you're using StopFilter and the user intends to type
"fast apple" but has so far typed only "fast a", then,
because the analyzer doesn't convey whether it has seen a
token separator after the "a", StopFilter will remove that
"a", causing far more matches than you'd expect.
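A toy reproduction of the StopFilter limitation, assuming a whitespace tokenizer and a two-word stop list. None of the names below are Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TrailingStopWordSketch {
    // Hypothetical stop-word list, for this sketch only.
    private static final Set<String> STOP_WORDS = Set.of("a", "the");

    /** Lowercases, splits on whitespace, and drops stop words, including a trailing partial word that happens to be one. */
    static String analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    /** Returns every surface form whose analyzed form starts with the analyzed query. */
    static List<String> lookup(String query, List<String> surfaceForms) {
        String prefix = analyze(query);
        List<String> hits = new ArrayList<>();
        for (String surface : surfaceForms) {
            if (analyze(surface).startsWith(prefix)) {
                hits.add(surface);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> corpus = List.of("fast apple", "fast car", "fast train");
        System.out.println(lookup("fast a", corpus));  // the "a" is dropped, so all three match
        System.out.println(lookup("fast ap", corpus)); // only "fast apple" matches
    }
}
```

The analyzer cannot tell that "a" is the start of an unfinished word, so the query collapses to "fast" until the user types one more character.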
Modifier and Type | Class and Description |
---|---|
private static class | AnalyzingSuggester.AnalyzingComparator |

Nested classes/interfaces inherited from class Lookup: Lookup.LookupPriorityQueue, Lookup.LookupResult

Modifier and Type | Field and Description |
---|---|
private long | count: Number of entries the lookup was built with. |
private static int | END_BYTE: Marks end of the analyzed input and start of dedup byte. |
static int | EXACT_FIRST: Include this flag in the options parameter to AnalyzingSuggester(Directory,String,Analyzer,Analyzer,int,int,int,boolean) to always return the exact match first, regardless of score. |
private boolean | exactFirst: True if exact match suggestions should always be returned first. |
private FST<PairOutputs.Pair<java.lang.Long,BytesRef>> | fst: FST<Weight,Surface>: input is the analyzed form, with a null byte between terms; weights are encoded as costs: (Integer.MAX_VALUE-weight); surface is the original, unanalyzed form. |
private boolean | hasPayloads |
private Analyzer | indexAnalyzer: Analyzer that will be used for analyzing suggestions at index time. |
private int | maxAnalyzedPathsForOneInput: Highest number of analyzed paths we saw for any single input surface form. |
private int | maxGraphExpansions: Maximum graph paths to index for a single analyzed surface form. |
private int | maxSurfaceFormsPerAnalyzedForm: Maximum number of dup surface forms (different surface forms for the same analyzed form). |
private static int | PAYLOAD_SEP |
static int | PRESERVE_SEP: Include this flag in the options parameter to AnalyzingSuggester(Directory,String,Analyzer,Analyzer,int,int,int,boolean) to preserve token separators when matching. |
private boolean | preservePositionIncrements: Whether position holes should appear in the automaton. |
private boolean | preserveSep: True if separator between tokens should be preserved. |
private Analyzer | queryAnalyzer: Analyzer that will be used for analyzing suggestions at query time. |
private static int | SEP_LABEL: Represents the separation between tokens, if PRESERVE_SEP was specified. |
private Directory | tempDir |
private java.lang.String | tempFileNamePrefix |
(package private) static java.util.Comparator<PairOutputs.Pair<java.lang.Long,BytesRef>> | weightComparator |

Fields inherited from class Lookup: CHARSEQUENCE_COMPARATOR

Constructor and Description |
---|
AnalyzingSuggester(Directory tempDir, java.lang.String tempFileNamePrefix, Analyzer analyzer) |
AnalyzingSuggester(Directory tempDir, java.lang.String tempFileNamePrefix, Analyzer indexAnalyzer, Analyzer queryAnalyzer) |
AnalyzingSuggester(Directory tempDir, java.lang.String tempFileNamePrefix, Analyzer indexAnalyzer, Analyzer queryAnalyzer, int options, int maxSurfaceFormsPerAnalyzedForm, int maxGraphExpansions, boolean preservePositionIncrements): Creates a new suggester. |

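The options parameter of the full constructor combines EXACT_FIRST and PRESERVE_SEP as flags. The sketch below shows how such a bit mask might be combined and decoded. The numeric values 1 and 2 and the validate helper are assumptions for illustration; real code should reference the constants on AnalyzingSuggester rather than literal numbers.

```java
final class SuggesterOptionsSketch {
    // Assumed values, for illustration only.
    static final int EXACT_FIRST = 1;
    static final int PRESERVE_SEP = 2;

    /** Rejects any bits other than the two known flags (hypothetical helper). */
    static void validate(int options) {
        if ((options & ~(EXACT_FIRST | PRESERVE_SEP)) != 0) {
            throw new IllegalArgumentException("unknown option bits: " + options);
        }
    }

    static boolean exactFirst(int options) {
        return (options & EXACT_FIRST) != 0;
    }

    static boolean preserveSep(int options) {
        return (options & PRESERVE_SEP) != 0;
    }

    public static void main(String[] args) {
        int options = EXACT_FIRST | PRESERVE_SEP; // request both behaviours
        validate(options);
        System.out.println(exactFirst(options) + " " + preserveSep(options)); // prints "true true"
    }
}
```

Passing 0 selects neither behaviour; the two flags are independent and can be set separately.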
Modifier and Type | Method and Description |
---|---|
void | build(InputIterator iterator): Builds up a new internal Lookup representation based on the given InputIterator. |
protected Automaton | convertAutomaton(Automaton a): Used by subclass to change the lookup automaton, if necessary. |
private static int | decodeWeight(long encoded): cost -> weight |
private static int | encodeWeight(long value): weight -> cost |
java.lang.Object | get(java.lang.CharSequence key): Returns the weight associated with an input string, or null if it does not exist. |
java.util.Collection<Accountable> | getChildResources(): Returns nested resources of this class. |
long | getCount(): Get the number of entries the lookup was built with. |
protected java.util.List<FSTUtil.Path<PairOutputs.Pair<java.lang.Long,BytesRef>>> | getFullPrefixPaths(java.util.List<FSTUtil.Path<PairOutputs.Pair<java.lang.Long,BytesRef>>> prefixPaths, Automaton lookupAutomaton, FST<PairOutputs.Pair<java.lang.Long,BytesRef>> fst): Returns all prefix paths to initialize the search. |
private Lookup.LookupResult | getLookupResult(java.lang.Long output1, BytesRef output2, CharsRefBuilder spare) |
(package private) TokenStreamToAutomaton | getTokenStreamToAutomaton() |
boolean | load(DataInput input): Discard current lookup data and load it from a previously saved copy. |
java.util.List<Lookup.LookupResult> | lookup(java.lang.CharSequence key, java.util.Set<BytesRef> contexts, boolean onlyMorePopular, int num): Look up a key and return possible completion for this key. |
long | ramBytesUsed(): Returns byte size of the underlying FST. |
private Automaton | replaceSep(Automaton a) |
private boolean | sameSurfaceForm(BytesRef key, BytesRef output2) |
boolean | store(DataOutput output): Persist the constructed lookup data to a directory. |
(package private) Automaton | toAutomaton(BytesRef surfaceForm, TokenStreamToAutomaton ts2a) |
(package private) Automaton | toLookupAutomaton(java.lang.CharSequence key) |

private FST<PairOutputs.Pair<java.lang.Long,BytesRef>> fst
private final Analyzer indexAnalyzer
private final Analyzer queryAnalyzer
private final boolean exactFirst
private final boolean preserveSep
public static final int EXACT_FIRST
Include this flag in the options parameter to
AnalyzingSuggester(Directory,String,Analyzer,Analyzer,int,int,int,boolean)
to always return the exact match first, regardless of score. This
has no performance impact but could result in
low-quality suggestions.
public static final int PRESERVE_SEP
Include this flag in the options parameter to
AnalyzingSuggester(Directory,String,Analyzer,Analyzer,int,int,int,boolean)
to preserve token separators when matching.
private static final int SEP_LABEL
private static final int END_BYTE
private final int maxSurfaceFormsPerAnalyzedForm
private final int maxGraphExpansions
private final Directory tempDir
private final java.lang.String tempFileNamePrefix
private int maxAnalyzedPathsForOneInput
private boolean hasPayloads
private static final int PAYLOAD_SEP
private boolean preservePositionIncrements
private long count
static final java.util.Comparator<PairOutputs.Pair<java.lang.Long,BytesRef>> weightComparator
public AnalyzingSuggester(Directory tempDir, java.lang.String tempFileNamePrefix, Analyzer analyzer)
public AnalyzingSuggester(Directory tempDir, java.lang.String tempFileNamePrefix, Analyzer indexAnalyzer, Analyzer queryAnalyzer)
public AnalyzingSuggester(Directory tempDir, java.lang.String tempFileNamePrefix, Analyzer indexAnalyzer, Analyzer queryAnalyzer, int options, int maxSurfaceFormsPerAnalyzedForm, int maxGraphExpansions, boolean preservePositionIncrements)
indexAnalyzer - Analyzer that will be used for analyzing suggestions while building the index.
queryAnalyzer - Analyzer that will be used for analyzing query text during lookup.
options - see EXACT_FIRST, PRESERVE_SEP
maxSurfaceFormsPerAnalyzedForm - Maximum number of surface forms to keep for a single analyzed form. When there are too many surface forms we discard the lowest weighted ones.
maxGraphExpansions - Maximum number of graph paths to expand from the analyzed form. Set this to -1 for no limit.
preservePositionIncrements - Whether position holes should appear in the automata.
public long ramBytesUsed()
Specified by: ramBytesUsed in interface Accountable
public java.util.Collection<Accountable> getChildResources()
Returns nested resources of this class.
Specified by: getChildResources in interface Accountable
See Also: Accountables
protected Automaton convertAutomaton(Automaton a)
Used by subclass to change the lookup automaton, if necessary.
TokenStreamToAutomaton getTokenStreamToAutomaton()
public void build(InputIterator iterator) throws java.io.IOException
Builds up a new internal Lookup representation based on the given InputIterator.
The implementation might re-sort the data internally.
public boolean store(DataOutput output) throws java.io.IOException
Persist the constructed lookup data to a directory.
Specified by: store in class Lookup
output - DataOutput to write the data to.
Throws: java.io.IOException - when fatal IO error occurs.
public boolean load(DataInput input) throws java.io.IOException
Discard current lookup data and load it from a previously saved copy.
private Lookup.LookupResult getLookupResult(java.lang.Long output1, BytesRef output2, CharsRefBuilder spare)
public java.util.List<Lookup.LookupResult> lookup(java.lang.CharSequence key, java.util.Set<BytesRef> contexts, boolean onlyMorePopular, int num)
Look up a key and return possible completion for this key.
Specified by: lookup in class Lookup
key - lookup key. Depending on the implementation this may be a prefix, misspelling, or even infix.
contexts - contexts to filter the lookup by, or null if all contexts are allowed; if the suggestion contains any of the contexts, it's a match.
onlyMorePopular - return only more popular results.
num - maximum number of results to return.
public long getCount()
Get the number of entries the lookup was built with.
protected java.util.List<FSTUtil.Path<PairOutputs.Pair<java.lang.Long,BytesRef>>> getFullPrefixPaths(java.util.List<FSTUtil.Path<PairOutputs.Pair<java.lang.Long,BytesRef>>> prefixPaths, Automaton lookupAutomaton, FST<PairOutputs.Pair<java.lang.Long,BytesRef>> fst) throws java.io.IOException
Returns all prefix paths to initialize the search.
Throws: java.io.IOException
final Automaton toAutomaton(BytesRef surfaceForm, TokenStreamToAutomaton ts2a) throws java.io.IOException
Throws: java.io.IOException
final Automaton toLookupAutomaton(java.lang.CharSequence key) throws java.io.IOException
Throws: java.io.IOException
public java.lang.Object get(java.lang.CharSequence key)
private static int decodeWeight(long encoded)
private static int encodeWeight(long value)
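The field summary states that weights are encoded as costs, (Integer.MAX_VALUE-weight), and the two private helpers above convert in each direction. The sketch below shows one way such a pair could look. The class name and the range check are assumptions; the actual private implementation may differ.

```java
final class WeightCodecSketch {
    /** weight -> cost: a higher weight yields a lower cost, so the best suggestions sort first. */
    static int encodeWeight(long value) {
        // Assumed guard: only weights in [0, Integer.MAX_VALUE] fit this encoding.
        if (value < 0 || value > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("cannot encode value: " + value);
        }
        return Integer.MAX_VALUE - (int) value;
    }

    /** cost -> weight: the inverse of encodeWeight. */
    static int decodeWeight(long encoded) {
        return (int) (Integer.MAX_VALUE - encoded);
    }

    public static void main(String[] args) {
        int cost = encodeWeight(42);
        System.out.println(decodeWeight(cost)); // prints 42
    }
}
```

Storing costs rather than weights lets a shortest-path search over the FST return the highest-weighted suggestions first.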