public final class StandardTokenizer extends Tokenizer

java.lang.Object
↳	org.apache.lucene.util.AttributeSource
	↳	org.apache.lucene.analysis.TokenStream
		↳	org.apache.lucene.analysis.Tokenizer
			↳	org.apache.lucene.analysis.standard.StandardTokenizer

Class Overview

A grammar-based tokenizer constructed with JFlex.

This should be a good tokenizer for most European-language documents:

Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
Recognizes email addresses and internet hostnames as one token.

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

You must specify the required Version compatibility when creating StandardTokenizer:

As of 2.4, tokens incorrectly identified as acronyms are corrected (see LUCENE-1068).

Summary

Constants
int	ACRONYM
int	ACRONYM_DEP	This constant is deprecated. It works around a bug where HOSTs that end with '.' are identified as ACRONYMs.
int	ALPHANUM
int	APOSTROPHE
int	CJ
int	COMPANY
int	EMAIL
int	HOST
int	NUM

Fields
public static final String[]	TOKEN_TYPES	String token types that correspond to token type int constants


Inherited Fields

From class org.apache.lucene.analysis.Tokenizer

Public Constructors
	StandardTokenizer(Version matchVersion, Reader input) Creates a new instance of the `StandardTokenizer`.
	StandardTokenizer(Version matchVersion, AttributeSource source, Reader input) Creates a new StandardTokenizer with a given `AttributeSource`.
	StandardTokenizer(Version matchVersion, AttributeSource.AttributeFactory factory, Reader input) Creates a new StandardTokenizer with a given `AttributeSource.AttributeFactory`

Public Methods
final void	end() This method is called by the consumer after the last token has been consumed, after `incrementToken()` returned `false` (using the new `TokenStream` API).
int	getMaxTokenLength()
final boolean	incrementToken() Consumers (e.g., `IndexWriter`) use this method to advance the stream to the next token.
boolean	isReplaceInvalidAcronym() This method is deprecated. It will be removed in 3.X, when true becomes the only valid value.
void	reset(Reader reader) Expert: Reset the tokenizer to a new reader.
void	setMaxTokenLength(int length) Set the max allowed token length.
void	setReplaceInvalidAcronym(boolean replaceInvalidAcronym) This method is deprecated. It will be removed in 3.X, when true becomes the only valid value. See https://issues.apache.org/jira/browse/LUCENE-1068


Inherited Methods

From class org.apache.lucene.analysis.Tokenizer

From class org.apache.lucene.analysis.TokenStream

From class org.apache.lucene.util.AttributeSource

From class java.lang.Object

From interface java.io.Closeable

Constants

public static final int ACRONYM

Constant Value: 2 (0x00000002)

public static final int ACRONYM_DEP

This constant is deprecated.
It works around a bug where HOSTs that end with '.' are identified as ACRONYMs.

Constant Value: 8 (0x00000008)

public static final int ALPHANUM

Constant Value: 0 (0x00000000)

public static final int APOSTROPHE

Constant Value: 1 (0x00000001)

public static final int CJ

Constant Value: 7 (0x00000007)

public static final int COMPANY

Constant Value: 3 (0x00000003)

public static final int EMAIL

Constant Value: 4 (0x00000004)

public static final int HOST

Constant Value: 5 (0x00000005)

public static final int NUM

Constant Value: 6 (0x00000006)

Fields

public static final String[] TOKEN_TYPES

String token types that correspond to token type int constants
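The relationship between the int constants and TOKEN_TYPES can be pictured as a plain array lookup. The sketch below is a stand-alone mirror that does not import Lucene: the int values are the ones listed on this page, while the bracketed label strings and the helper name typeName are assumptions made for illustration and should be checked against the actual field.

```java
// Stand-alone illustration; TokenTypeLookup and typeName are hypothetical names.
final class TokenTypeLookup {
    // Values copied from the constant table on this page.
    static final int ALPHANUM = 0;
    static final int APOSTROPHE = 1;
    static final int ACRONYM = 2;
    static final int COMPANY = 3;
    static final int EMAIL = 4;
    static final int HOST = 5;
    static final int NUM = 6;
    static final int CJ = 7;
    static final int ACRONYM_DEP = 8;

    // Assumption: label strings are indexed by the int constant above.
    // The exact strings held by the real TOKEN_TYPES field may differ.
    static final String[] TOKEN_TYPES = {
        "<ALPHANUM>", "<APOSTROPHE>", "<ACRONYM>", "<COMPANY>", "<EMAIL>",
        "<HOST>", "<NUM>", "<CJ>", "<ACRONYM_DEP>"
    };

    /** Returns the label for a type constant, rejecting out-of-range values. */
    static String typeName(int type) {
        if (type < 0 || type >= TOKEN_TYPES.length) {
            throw new IllegalArgumentException("Unknown token type: " + type);
        }
        return TOKEN_TYPES[type];
    }

    public static void main(String[] args) {
        System.out.println(typeName(HOST));
        System.out.println(typeName(NUM));
    }
}
```

Indexing by constant keeps the int form cheap to compare while the string form stays available for display or for a type attribute.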

Public Constructors

public StandardTokenizer (Version matchVersion, Reader input)

Creates a new instance of the StandardTokenizer. Attaches the input to the newly created JFlex scanner.

Parameters

input	The input reader. See http://issues.apache.org/jira/browse/LUCENE-1068

public StandardTokenizer (Version matchVersion, AttributeSource source, Reader input)

Creates a new StandardTokenizer with a given AttributeSource.

public StandardTokenizer (Version matchVersion, AttributeSource.AttributeFactory factory, Reader input)

Creates a new StandardTokenizer with a given AttributeSource.AttributeFactory.

Public Methods

public final void end ()

This method is called by the consumer after the last token has been consumed, after incrementToken() returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature.

This method can be used to perform any end-of-stream operations, such as setting the final offset of a stream. The final offset of a stream might differ from the end offset of the last token, e.g. when one or more whitespace characters follow the last token and a WhitespaceTokenizer is used.
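The difference between the last token's end offset and the final offset is easiest to see with trailing whitespace. The following is a minimal stand-alone sketch, not Lucene code: FinalOffsetSketch, tokenOffsets, and finalOffset are hypothetical names, and a simple whitespace splitter stands in for a real tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone illustration; none of these names exist in Lucene.
final class FinalOffsetSketch {
    /** Returns {start, end} offsets (end exclusive) of whitespace-separated tokens. */
    static List<int[]> tokenOffsets(String text) {
        List<int[]> offsets = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip any whitespace before the next token.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                offsets.add(new int[] {start, i});
            }
        }
        return offsets;
    }

    /** The value an end-of-stream step would record: the full input length. */
    static int finalOffset(String text) {
        return text.length();
    }

    public static void main(String[] args) {
        String text = "hello world  "; // two trailing spaces
        List<int[]> offsets = tokenOffsets(text);
        int lastEnd = offsets.get(offsets.size() - 1)[1];
        System.out.println("last token ends at " + lastEnd);            // 11
        System.out.println("final offset is " + finalOffset(text));     // 13
    }
}
```

A consumer that relied only on the last token's end offset would lose the two trailing characters, which is why the final offset is set in a separate end-of-stream step.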

public int getMaxTokenLength ()

public final boolean incrementToken ()

Consumers (e.g., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate AttributeImpls with the attributes of the next token.

The producer must make no assumptions about the attributes after the method has returned: the caller may arbitrarily change them. If the producer needs to preserve the state for subsequent calls, it can use captureState() to create a copy of the current attribute state.

This method is called for every token of a document, so an efficient implementation is crucial for good performance. To avoid calls to addAttribute(Class) and getAttribute(Class), references to all AttributeImpls that this stream uses should be retrieved during instantiation.

To ensure that filters and consumers know which attributes are available, the attributes must be added during instantiation. Filters and consumers are not required to check for availability of attributes in incrementToken().

Returns

false for end of stream; true otherwise

Throws

IOException
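The consumer side of this contract (call incrementToken() until it returns false, then call end()) can be sketched without Lucene. In the sketch below, MiniStream is a hypothetical stand-in for a token stream, and its single reusable term buffer plays the role of an attribute that is overwritten on every call; the names are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone illustration of the consumer loop; not a Lucene class.
final class ConsumerLoopSketch {
    /** Hypothetical stream that exposes one reusable buffer, like an attribute. */
    static final class MiniStream {
        private final String[] words;
        private int next = 0;
        private boolean ended = false;
        // Overwritten by every successful incrementToken() call.
        final StringBuilder term = new StringBuilder();

        MiniStream(String text) {
            String trimmed = text.trim();
            words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        /** Advances to the next token; returns false at end of stream. */
        boolean incrementToken() {
            if (next >= words.length) {
                return false;
            }
            term.setLength(0);
            term.append(words[next++]);
            return true;
        }

        /** End-of-stream hook, called once after incrementToken() returned false. */
        void end() {
            ended = true;
        }

        boolean isEnded() {
            return ended;
        }
    }

    /** Drains the stream, copying each term because the buffer is reused. */
    static List<String> consume(MiniStream stream) {
        List<String> out = new ArrayList<>();
        while (stream.incrementToken()) {
            out.add(stream.term.toString());
        }
        stream.end();
        return out;
    }

    public static void main(String[] args) {
        System.out.println(consume(new MiniStream("a quick  test")));
    }
}
```

Copying the term inside the loop matters: because the buffer is shared between calls, holding a reference to it instead would leave every collected entry pointing at the last token.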

public boolean isReplaceInvalidAcronym ()

This method is deprecated.
It will be removed in 3.X, when true becomes the only valid value.

Prior to https://issues.apache.org/jira/browse/LUCENE-1068, StandardTokenizer mischaracterized tokens like www.abc.com as acronyms when they should have been labeled as hosts.

Returns

true if StandardTokenizer now returns these tokens as Hosts, otherwise false

public void reset (Reader reader)

Expert: Reset the tokenizer to a new reader. Typically, an analyzer (in its reusableTokenStream method) will use this to re-use a previously created tokenizer.

Throws

IOException

public void setMaxTokenLength (int length)

Set the max allowed token length. Any token longer than this is skipped.
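The skip rule can be modeled as a simple length check applied to each candidate token. This is a stand-alone sketch under the assumption that "longer than" means strictly greater than the limit, so a token whose length equals the limit is kept; MaxTokenLengthSketch and skipLongTokens are hypothetical names, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-alone illustration of the skip rule; not a Lucene class.
final class MaxTokenLengthSketch {
    /** Keeps tokens whose length is at most maxTokenLength; longer ones are dropped. */
    static List<String> skipLongTokens(List<String> tokens, int maxTokenLength) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxTokenLength) {
                kept.add(token);
            }
            // Tokens over the limit are skipped entirely, not truncated.
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("ab", "abcde", "abcdefgh");
        System.out.println(skipLongTokens(tokens, 5)); // [ab, abcde]
    }
}
```

Note that skipping differs from truncating: an over-long token contributes nothing to the output rather than a shortened form of itself.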

public void setReplaceInvalidAcronym (boolean replaceInvalidAcronym)

This method is deprecated.
It will be removed in 3.X, when true becomes the only valid value. See https://issues.apache.org/jira/browse/LUCENE-1068

Parameters

replaceInvalidAcronym	Set to true to relabel mischaracterized acronyms as HOST.
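The effect of this flag can be sketched as a small decision on the scanned token type. The sketch is stand-alone and rests on two assumptions that this page does not state outright: that the scanner reports the affected tokens as ACRONYM_DEP, and that with the flag set to false they are reported as ACRONYM for backward compatibility. AcronymRelabelSketch and resolveType are hypothetical names.

```java
// Stand-alone illustration; the branch behavior is an assumption, not Lucene's source.
final class AcronymRelabelSketch {
    // Values copied from the constant table on this page.
    static final int ACRONYM = 2;
    static final int HOST = 5;
    static final int ACRONYM_DEP = 8;

    /**
     * Maps the type reported by the scanner to the type exposed to consumers.
     * Only ACRONYM_DEP is affected; every other type passes through unchanged.
     */
    static int resolveType(int scannedType, boolean replaceInvalidAcronym) {
        if (scannedType != ACRONYM_DEP) {
            return scannedType;
        }
        // Assumed behavior: true yields the corrected HOST label,
        // false keeps the legacy ACRONYM label.
        return replaceInvalidAcronym ? HOST : ACRONYM;
    }

    public static void main(String[] args) {
        System.out.println(resolveType(ACRONYM_DEP, true));  // 5
        System.out.println(resolveType(ACRONYM_DEP, false)); // 2
    }
}
```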
