public final class StandardTokenizer extends Tokenizer

java.lang.Object
   ↳ org.apache.lucene.util.AttributeSource
     ↳ org.apache.lucene.analysis.TokenStream
       ↳ org.apache.lucene.analysis.Tokenizer
         ↳ org.apache.lucene.analysis.standard.StandardTokenizer

Class Overview

A grammar-based tokenizer constructed with JFlex.

This should be a good tokenizer for most European-language documents:

  • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
  • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and internet hostnames as one token.
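
The hyphen rule in the second bullet can be illustrated with a small, self-contained sketch. This is a toy approximation, not the JFlex grammar: the class name HyphenRuleSketch and its split method are hypothetical, and it assumes tokens are whitespace-separated and ignores the punctuation, email and hostname rules.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mimics only the hyphen rule described above.
final class HyphenRuleSketch {

    /** Splits each whitespace-separated chunk at hyphens, unless the chunk contains a digit. */
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // only happens for empty or all-whitespace input
            }
            boolean hasDigit = chunk.chars().anyMatch(Character::isDigit);
            if (hasDigit) {
                // Treated as a product number: keep the whole chunk as one token.
                tokens.add(chunk);
            } else {
                for (String part : chunk.split("-")) {
                    if (!part.isEmpty()) {
                        tokens.add(part);
                    }
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("wi-fi model KX-5000"));
    }
}
```

Under these assumptions, "wi-fi" is split into two tokens while "KX-5000" stays whole because it contains digits.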

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

You must specify the required Version compatibility when creating StandardTokenizer:

  • As of 2.4, tokens incorrectly identified as acronyms are corrected (see LUCENE-1068).

Summary

Constants
int ACRONYM
int ACRONYM_DEP This constant is deprecated. This solves a bug where HOSTs that end with '.' are identified as ACRONYMs.
int ALPHANUM
int APOSTROPHE
int CJ
int COMPANY
int EMAIL
int HOST
int NUM
Fields
public static final String[] TOKEN_TYPES String token types that correspond to token type int constants
Inherited Fields
From class org.apache.lucene.analysis.Tokenizer
Public Constructors
StandardTokenizer(Version matchVersion, Reader input)
Creates a new instance of the StandardTokenizer.
StandardTokenizer(Version matchVersion, AttributeSource source, Reader input)
Creates a new StandardTokenizer with a given AttributeSource.
StandardTokenizer(Version matchVersion, AttributeSource.AttributeFactory factory, Reader input)
Creates a new StandardTokenizer with a given AttributeSource.AttributeFactory
Public Methods
final void end()
This method is called by the consumer after the last token has been consumed, after incrementToken() returned false (using the new TokenStream API).
int getMaxTokenLength()
final boolean incrementToken()
Consumers (e.g., IndexWriter) use this method to advance the stream to the next token.
boolean isReplaceInvalidAcronym()
This method is deprecated. Remove in 3.X and make true the only valid value.
void reset(Reader reader)
Expert: Reset the tokenizer to a new reader.
void setMaxTokenLength(int length)
Set the max allowed token length.
void setReplaceInvalidAcronym(boolean replaceInvalidAcronym)
This method is deprecated. Remove in 3.X and make true the only valid value. See https://issues.apache.org/jira/browse/LUCENE-1068
Inherited Methods
From class org.apache.lucene.analysis.Tokenizer
From class org.apache.lucene.analysis.TokenStream
From class org.apache.lucene.util.AttributeSource
From class java.lang.Object
From interface java.io.Closeable

Constants

public static final int ACRONYM

Constant Value: 2 (0x00000002)

public static final int ACRONYM_DEP

This constant is deprecated.
This solves a bug where HOSTs that end with '.' are identified as ACRONYMs.

Constant Value: 8 (0x00000008)

public static final int ALPHANUM

Constant Value: 0 (0x00000000)

public static final int APOSTROPHE

Constant Value: 1 (0x00000001)

public static final int CJ

Constant Value: 7 (0x00000007)

public static final int COMPANY

Constant Value: 3 (0x00000003)

public static final int EMAIL

Constant Value: 4 (0x00000004)

public static final int HOST

Constant Value: 5 (0x00000005)

public static final int NUM

Constant Value: 6 (0x00000006)

Fields

public static final String[] TOKEN_TYPES

String token types that correspond to token type int constants
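
As a minimal sketch of how such a lookup array relates to the integer constants: the constant values below are the ones listed in the Constants section, while the label strings and the typeName helper are assumptions made for illustration and are not confirmed by this page.

```java
// Hypothetical stand-in, not the Lucene class itself.
final class TokenTypesSketch {
    // Integer constants with the values listed in the Constants section.
    static final int ALPHANUM = 0;
    static final int APOSTROPHE = 1;
    static final int ACRONYM = 2;
    static final int COMPANY = 3;
    static final int EMAIL = 4;
    static final int HOST = 5;
    static final int NUM = 6;
    static final int CJ = 7;
    static final int ACRONYM_DEP = 8;

    // Assumed labels: one string per constant, indexed by the constant's value.
    static final String[] TOKEN_TYPES = {
        "<ALPHANUM>", "<APOSTROPHE>", "<ACRONYM>", "<COMPANY>", "<EMAIL>",
        "<HOST>", "<NUM>", "<CJ>", "<ACRONYM_DEP>"
    };

    /** Returns the label for a token type constant by direct array indexing. */
    static String typeName(int tokenType) {
        return TOKEN_TYPES[tokenType];
    }

    public static void main(String[] args) {
        System.out.println(typeName(HOST));
    }
}
```

Indexing the array by the constant keeps the int and String representations in step, provided the array order matches the constant values.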

Public Constructors

public StandardTokenizer (Version matchVersion, Reader input)

Creates a new instance of the StandardTokenizer. Attaches the input to the newly created JFlex scanner.

Parameters
input The input reader. See http://issues.apache.org/jira/browse/LUCENE-1068

public StandardTokenizer (Version matchVersion, AttributeSource source, Reader input)

Creates a new StandardTokenizer with a given AttributeSource.

public StandardTokenizer (Version matchVersion, AttributeSource.AttributeFactory factory, Reader input)

Creates a new StandardTokenizer with a given AttributeSource.AttributeFactory

Public Methods

public final void end ()

This method is called by the consumer after the last token has been consumed, after incrementToken() returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature.

This method can be used to perform any end-of-stream operations, such as setting the final offset of a stream. The final offset of a stream might differ from the offset of the last token, e.g. when a WhitespaceTokenizer was used and one or more whitespace characters followed the last token.
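
The consume-then-end() contract described above can be sketched with a small stand-in stream. The class EndOffsetSketch and its public fields are hypothetical and do not use Lucene's attribute classes; the sketch only assumes a whitespace-splitting stream whose end() sets the final offset to the length of the input.

```java
// Hypothetical stand-in for a token stream; not a Lucene class.
final class EndOffsetSketch {
    private final String text;
    private int pos = 0;

    // Stand-ins for the term and offset attributes of the current token.
    String term;
    int startOffset;
    int endOffset;

    EndOffsetSketch(String text) {
        this.text = text;
    }

    /** Advances to the next whitespace-delimited token; returns false at end of stream. */
    boolean incrementToken() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        if (pos == text.length()) {
            return false;
        }
        startOffset = pos;
        while (pos < text.length() && !Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        endOffset = pos;
        term = text.substring(startOffset, endOffset);
        return true;
    }

    /** Called once after incrementToken() has returned false; records the final offset. */
    void end() {
        startOffset = text.length();
        endOffset = text.length();
    }

    public static void main(String[] args) {
        EndOffsetSketch stream = new EndOffsetSketch("hello world   ");
        while (stream.incrementToken()) {
            System.out.println(stream.term + " ends at " + stream.endOffset);
        }
        stream.end();
        System.out.println("final offset: " + stream.endOffset);
    }
}
```

With the input "hello world   ", the last token ends at offset 11, while the final offset set by end() is 14 because of the three trailing spaces.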

public int getMaxTokenLength ()

public final boolean incrementToken ()

Consumers (e.g., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate AttributeImpls with the attributes of the next token.

The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use captureState() to create a copy of the current attribute state.

This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to addAttribute(Class) and getAttribute(Class), references to all AttributeImpls that this stream uses should be retrieved during instantiation.

To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in incrementToken().

Returns
  • false for end of stream; true otherwise
Throws
IOException

public boolean isReplaceInvalidAcronym ()

This method is deprecated.
Remove in 3.X and make true the only valid value.

Prior to https://issues.apache.org/jira/browse/LUCENE-1068, StandardTokenizer mislabeled tokens like www.abc.com as acronyms when they should have been labeled as hosts.

Returns
  • true if StandardTokenizer now returns these tokens as HOST, otherwise false

public void reset (Reader reader)

Expert: Reset the tokenizer to a new reader. Typically, an analyzer (in its reusableTokenStream method) will use this to re-use a previously created tokenizer.

Throws
IOException
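
The reuse pattern can be sketched with a stand-in class that keeps one tokenizer object and points it at a new reader for each document. ReusableTokenizerSketch and nextToken are hypothetical names; the sketch splits on whitespace only, and unlike the real reset(Reader) it wraps IOException in an unchecked exception to keep the example short.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in illustrating reset(Reader)-style reuse; not a Lucene class.
final class ReusableTokenizerSketch {
    private Reader input;

    ReusableTokenizerSketch(Reader input) {
        this.input = input;
    }

    /** Points the existing tokenizer at a new reader instead of allocating a new tokenizer. */
    void reset(Reader reader) {
        this.input = reader;
    }

    /** Returns the next whitespace-delimited token, or null when the reader is exhausted. */
    String nextToken() {
        StringBuilder sb = new StringBuilder();
        try {
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) {
                        break; // whitespace after a token ends it
                    }
                } else {
                    sb.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.length() == 0 ? null : sb.toString();
    }

    public static void main(String[] args) {
        ReusableTokenizerSketch tokenizer = new ReusableTokenizerSketch(new StringReader("first doc"));
        for (String t = tokenizer.nextToken(); t != null; t = tokenizer.nextToken()) {
            System.out.println(t);
        }
        tokenizer.reset(new StringReader("second"));
        System.out.println(tokenizer.nextToken());
    }
}
```

The point of the pattern is that per-document work is limited to swapping the reader, so any state set up at construction time is kept.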

public void setMaxTokenLength (int length)

Set the max allowed token length. Any token longer than this is skipped.
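
A minimal sketch of the skipping behaviour, assuming a plain whitespace split in place of the real grammar. MaxTokenLengthSketch and its tokenize method are hypothetical, and the initial limit of 255 is an assumed default used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in showing only the length limit; not a Lucene class.
final class MaxTokenLengthSketch {
    // Assumed default for this sketch.
    private int maxTokenLength = 255;

    void setMaxTokenLength(int length) {
        this.maxTokenLength = length;
    }

    int getMaxTokenLength() {
        return maxTokenLength;
    }

    /** Returns whitespace-delimited tokens, skipping any token longer than the limit. */
    List<String> tokenize(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && token.length() <= maxTokenLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        MaxTokenLengthSketch sketch = new MaxTokenLengthSketch();
        sketch.setMaxTokenLength(5);
        System.out.println(sketch.tokenize("short supercalifragilistic words"));
    }
}
```

Note that over-long tokens are dropped entirely in this sketch rather than truncated, matching the "skipped" wording above.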

public void setReplaceInvalidAcronym (boolean replaceInvalidAcronym)

This method is deprecated.
Remove in 3.X and make true the only valid value. See https://issues.apache.org/jira/browse/LUCENE-1068

Parameters
replaceInvalidAcronym Set to true to relabel mischaracterized acronyms as HOST.