Full Text Tokenizer

Introduction

The Zorba XQuery processor implements the XQuery and XPath Full Text 1.0 specification that, among other things, tokenizes a string into a sequence of tokens. See Tokenization.

The initial implementation of the tokenizer uses the one provided by the ICU library. However, you can provide your own tokenizer instead.

The Tokenizer Class

The Tokenizer class is:

class Tokenizer {
public:
  typedef /* implementation-defined */ ptr;
  typedef /* implementation-defined */ size_type;

  struct Numbers {
    typedef Tokenizer::size_type value_type;

    value_type token;   // Token number.
    value_type sent;    // Sentence number.
    value_type para;    // Paragraph number.

    Numbers();
  };

  class Callback {
  public:
    typedef Tokenizer::size_type size_type;

    virtual ~Callback();

    virtual void operator()( char const *utf8_s, size_type utf8_len,
                             size_type token_no, size_type sent_no, size_type para_no,
                             void *payload = 0 ) = 0;
  };

  enum ElementTraceOptions {
    trace_none  = 0x0,  // Trace no elements.
    trace_begin = 0x1,  // Trace the beginning of elements.
    trace_end   = 0x2   // Trace the ending of elements.
  };

  virtual void destroy() const = 0;
  virtual void element( Item const &qname, int trace_options );
  Numbers& numbers();
  Numbers const& numbers() const;
  int trace_options() const;

  virtual void tokenize( char const *utf8_s, size_type utf8_len, locale::iso639_1::type lang,
                         bool wildcards, Callback &callback, void *payload = 0 ) = 0;

protected:
  Tokenizer( Numbers&, int trace_options = trace_none );
  virtual ~Tokenizer();
};

For details about the ptr type, the destroy() function, and why the destructor is protected, see the Memory Management document.

The Numbers struct is created by Zorba and passed to your constructor. It simply keeps track of the current token, sentence, and paragraph numbers.

To implement the Tokenizer, you need to implement the tokenize() function where:

utf8_s: A pointer to the UTF-8 byte sequence comprising the string to be tokenized.
utf8_len: The number of bytes in the string to be tokenized.
lang: The language of the string.
wildcards: If true, allows XQuery wildcard syntax characters to be part of tokens.
callback: The Callback to call once per token.
payload: Optional implementation-defined data.
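As a rough illustration of the calling convention only (not of real tokenization), the sketch below splits a byte string on ASCII whitespace and invokes a callback once per token. Everything in it is an assumption made so the code compiles on its own: size_type, Numbers, and Callback are simplified stand-ins for the nested Tokenizer types, tokenize_ascii_ws and CollectingCallback are hypothetical names, the lang and wildcards parameters are omitted, and starting token numbers at zero is a guess rather than documented Zorba behavior.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-ins for Tokenizer::size_type, Tokenizer::Numbers, and
// Tokenizer::Callback so that this sketch is self-contained.
typedef std::size_t size_type;

struct Numbers {
  size_type token, sent, para;
  Numbers() : token( 0 ), sent( 0 ), para( 0 ) { }  // zero start is an assumption
};

struct Callback {
  virtual ~Callback() { }
  virtual void operator()( char const *utf8_s, size_type utf8_len,
                           size_type token_no, size_type sent_no,
                           size_type para_no, void *payload = 0 ) = 0;
};

// Treats only space, tab, newline, and carriage return as separators.
inline bool is_ascii_space( char c ) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Hypothetical free function with the same general shape as tokenize():
// it reports each whitespace-delimited run of bytes through the callback
// and advances the token number after each one.
void tokenize_ascii_ws( char const *utf8_s, size_type utf8_len,
                        Numbers &numbers, Callback &callback,
                        void *payload = 0 ) {
  size_type i = 0;
  while ( i < utf8_len ) {
    while ( i < utf8_len && is_ascii_space( utf8_s[i] ) )
      ++i;                                  // skip separators
    size_type const start = i;
    while ( i < utf8_len && !is_ascii_space( utf8_s[i] ) )
      ++i;                                  // consume one token
    if ( i > start ) {
      callback( utf8_s + start, i - start,
                numbers.token, numbers.sent, numbers.para, payload );
      ++numbers.token;
    }
  }
}

// Hypothetical callback that simply stores every token it receives.
struct CollectingCallback : Callback {
  std::vector<std::string> tokens;
  void operator()( char const *utf8_s, size_type utf8_len,
                   size_type, size_type, size_type, void* ) {
    tokens.push_back( std::string( utf8_s, utf8_len ) );
  }
};
```

A real tokenizer would also need to handle multi-byte UTF-8 sequences, punctuation, sentence boundaries, the language, and wildcard characters, none of which this sketch attempts.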

A complete implementation of tokenize() is non-trivial and therefore an example is beyond the scope of this API documentation. However, the things a tokenizer should take into consideration include:

Paragraphs

By default, Zorba increments the current paragraph number once for each XML element encountered. However, this doesn't work well for mixed content. For example, in the XHTML:

<p>The <em>best</em> thing ever!</p>

all the tokens are in the same sentence and the same paragraph, but by default Zorba will count 3 paragraphs.

Your tokenizer can take control over when the paragraph number is incremented by passing the bitwise-or of the ElementTraceOptions values to the constructor and overriding the element() function. The element() function is passed the QName of the current XML element and (depending on the initial value passed to the constructor) one of trace_begin or trace_end. Note that this function is called only if the trace options value passed to the constructor was non-zero.

For example, the element() function for tokenizing XHTML would be along the lines of:

void MyTokenizer::element( Item const &qname, int trace_options ) {
  if ( trace_options & trace_end )
    return;
  String const name( qname.getLocalName() );
  if ( /* qname is an XHTML block-level element */ )
    ++numbers().para;
}
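
One way the block-level test might be filled in is sketched below, using stand-in types so that it compiles on its own. The names is_xhtml_block_element and ParagraphCounter are hypothetical, the enum merely mirrors ElementTraceOptions, and the list of element names is deliberately incomplete. In a real tokenizer the same check would sit inside element() and operate on the element's local name.

```cpp
#include <cstddef>
#include <cstring>

// Stand-in for Tokenizer::ElementTraceOptions so the sketch compiles alone.
enum ElementTraceOptions {
  trace_none  = 0x0,
  trace_begin = 0x1,
  trace_end   = 0x2
};

// Hypothetical helper: true if local_name is in a small, deliberately
// incomplete list of XHTML block-level element names.
bool is_xhtml_block_element( char const *local_name ) {
  static char const *const block_names[] = {
    "address", "blockquote", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6",
    "hr", "ol", "p", "pre", "table", "ul"
  };
  std::size_t const n = sizeof block_names / sizeof block_names[0];
  for ( std::size_t i = 0; i < n; ++i )
    if ( std::strcmp( local_name, block_names[i] ) == 0 )
      return true;
  return false;
}

// Hypothetical stand-in for a tokenizer's paragraph bookkeeping.
struct ParagraphCounter {
  std::size_t para;
  ParagraphCounter() : para( 0 ) { }

  // Mirrors the shape of element(): count a paragraph only when a
  // block-level element begins; ignore inline elements and all end events.
  void element( char const *local_name, int trace_options ) {
    if ( trace_options & trace_end )
      return;
    if ( is_xhtml_block_element( local_name ) )
      ++para;
  }
};
```

Feeding this counter the begin and end events for the p and em elements of the XHTML fragment above increments the paragraph number once, because em is not in the block-level list.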

The TokenizerProvider Class

In addition to a Tokenizer, you must also implement a TokenizerProvider that, given a language, provides a Tokenizer for that language:

class TokenizerProvider {
public:
  virtual ~TokenizerProvider();
  virtual Tokenizer::ptr getTokenizer( locale::iso639_1::type lang, Tokenizer::Numbers &numbers ) const = 0;
};

A simple TokenizerProvider for our tokenizer can be implemented as:

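class MyTokenizerProvider : public TokenizerProvider {
public:
  Tokenizer::ptr getTokenizer( locale::iso639_1::type lang, Tokenizer::Numbers &numbers ) const;
};

Tokenizer::ptr MyTokenizerProvider::getTokenizer( locale::iso639_1::type lang, Tokenizer::Numbers &numbers ) const {
  // MyTokenizer is assumed to forward numbers to the Tokenizer base-class constructor.
  return Tokenizer::ptr( new MyTokenizer( numbers ) );
}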

Using Your Tokenizer

To enable your tokenizer to be used, you need to register it with the XmlDataManager:

void *const store = StoreManager::getStore();
Zorba *const zorba = Zorba::getInstance( store );

MyTokenizerProvider provider;
zorba->getXmlDataManager()->registerTokenizerProvider( &provider );