public final class UCharacter extends Object

java.lang.Object
   ↳ sun.text.normalizer.UCharacter

Class Overview

The UCharacter class provides extensions to the java.lang.Character class. These extensions provide support for Unicode 3.2 properties and, together with the UTF16 class, provide support for supplementary characters (those with code points above U+FFFF).

Code points are represented in this API as ints. While it would be more convenient in Java to have a separate primitive datatype for them, ints suffice in the meantime.
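To illustrate why code points are carried in ints, the following minimal sketch uses only java.lang.Character and String from the standard library rather than UCharacter itself. The class name CodePointIntDemo and the helper utf16Length are hypothetical names chosen for this example.

```java
final class CodePointIntDemo {
    /** Number of UTF-16 chars needed to store one code point (1 or 2). */
    static int utf16Length(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        // A supplementary code point (above U+FFFF) does not fit in one char,
        // so it is held in an int and stored in a String as a surrogate pair.
        int supplementary = 0x1D11E; // MUSICAL SYMBOL G CLEF
        String s = new String(Character.toChars(supplementary));

        System.out.println("chars used:  " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.println("round trip:  U+"
                + Integer.toHexString(s.codePointAt(0)).toUpperCase());
    }
}
```

A char-based API would see two separate surrogate values here, whereas an int-based API sees the single code point.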

To use this class, add the jar file icu4j.jar to the class path, since it contains the data files that supply the information used by this class.
For example, on Windows:
set CLASSPATH=%CLASSPATH%;$JAR_FILE_PATH/ucharacter.jar
Alternatively, copy the files uprops.dat and unames.icu from the icu4j source subdirectory $ICU4J_SRC/src/com.ibm.icu.impl.data to your class directory $ICU4J_CLASS/com.ibm.icu.impl.data.

Aside from the additions for UTF-16 support and the updated Unicode 3.2 properties, the main differences between UCharacter and Character are:

  • UCharacter is not designed to be a char wrapper and does not have the APIs that manage a single wrapped char.
    These include:
    • char charValue(),
    • int compareTo(java.lang.Character, java.lang.Character), etc.
  • UCharacter does not include Character APIs that are deprecated, nor does it include the Java-specific character information, such as boolean isJavaIdentifierPart(char ch).
  • Character maps characters 'A' - 'Z' and 'a' - 'z' to the numeric values '10' - '35'. UCharacter also does this in digit and getNumericValue, to adhere to the Java semantics of these methods. The new methods unicodeDigit and getUnicodeNumericValue do not treat the above code points as having numeric values. This is a semantic change from ICU4J 1.3.1.

Further detailed differences can be determined from the program com.ibm.icu.dev.test.lang.UCharacterCompare.

This class is not subclassable.

See Also
  • com.ibm.icu.lang.UCharacterEnums

Summary

Nested Classes
interface UCharacter.ECharacterCategory This interface is deprecated. This is a draft API and might change in a future release of ICU.  
interface UCharacter.HangulSyllableType Hangul Syllable Type constants. 
interface UCharacter.NumericType Numeric Type constants. 
Constants
int MAX_VALUE The highest Unicode code point value (scalar value) according to the Unicode Standard.
int MIN_VALUE The lowest Unicode code point value.
double NO_NUMERIC_VALUE Special value that is returned by getUnicodeNumericValue(int) when no numeric value is defined for a code point.
int SUPPLEMENTARY_MIN_VALUE The minimum value for supplementary code points.
Public Methods
static int digit(int ch, int radix)
Retrieves the numeric value of a decimal digit code point.
static String foldCase(String str, boolean defaultmapping)
The given string is mapped to its case folding equivalent according to UnicodeData.txt and CaseFolding.txt; if any character has no case folding equivalent, the character itself is returned.
static VersionInfo getAge(int ch)
Get the "age" of the code point.
static int getCodePoint(char lead, char trail)
Returns a code point corresponding to the two UTF16 characters.
static int getDirection(int ch)
Returns the Bidirection property of a code point.
static int getIntPropertyValue(int ch, int type)
Gets the property value for a Unicode property type of a code point.
static int getType(int ch)
Returns a value indicating a code point's Unicode category.
static double getUnicodeNumericValue(int ch)
Get the numeric value for a Unicode code point as defined in the Unicode Character Database.

Inherited Methods
From class java.lang.Object

Constants

public static final int MAX_VALUE

The highest Unicode code point value (scalar value) according to the Unicode Standard. This is a 21-bit value.
Up-to-date Unicode implementation of java.lang.Character.MAX_VALUE.

Constant Value: 1114111 (0x0010ffff)

public static final int MIN_VALUE

The lowest Unicode code point value.

Constant Value: 0 (0x00000000)

public static final double NO_NUMERIC_VALUE

Special value that is returned by getUnicodeNumericValue(int) when no numeric value is defined for a code point.

Constant Value: -1.23456789E8

public static final int SUPPLEMENTARY_MIN_VALUE

The minimum value for supplementary code points.

Constant Value: 65536 (0x00010000)
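The three int constants above partition the code point space. The following sketch shows one way to use them for range checks. It declares local copies of the documented values so that it runs without ICU on the class path; the class CodePointRangeDemo and its helper methods are hypothetical and not part of UCharacter.

```java
final class CodePointRangeDemo {
    // Local copies of the documented constant values (assumption: kept in
    // sync by hand with UCharacter for the purposes of this example).
    static final int MIN_VALUE = 0x000000;
    static final int SUPPLEMENTARY_MIN_VALUE = 0x010000;
    static final int MAX_VALUE = 0x10FFFF;

    /** True if cp lies anywhere in the Unicode code point range. */
    static boolean isValidCodePoint(int cp) {
        return cp >= MIN_VALUE && cp <= MAX_VALUE;
    }

    /** True if cp is a supplementary code point (needs a surrogate pair). */
    static boolean isSupplementary(int cp) {
        return cp >= SUPPLEMENTARY_MIN_VALUE && cp <= MAX_VALUE;
    }

    public static void main(String[] args) {
        int[] samples = {0x0041, 0xFFFF, 0x10000, 0x10FFFF, 0x110000};
        for (int cp : samples) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " valid=" + isValidCodePoint(cp)
                    + " supplementary=" + isSupplementary(cp));
        }
    }
}
```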

Public Methods

public static int digit (int ch, int radix)

Retrieves the numeric value of a decimal digit code point.
This method observes the semantics of java.lang.Character.digit(). Note that this will return positive values for code points for which isDigit returns false, just like java.lang.Character.
Semantic Change: In release 1.3.1 and prior, this did not treat the European letters as having a digit value, and also treated numeric letters and other numbers as digits. This has been changed to conform to the Java semantics.
A code point is a valid digit if and only if:

  • ch is a decimal digit or one of the European letters, and
  • the value of ch is less than the specified radix.

Parameters
ch the code point to query
radix the radix
Returns
  • the numeric value represented by the code point in the specified radix, or -1 if the code point is not a decimal digit or if its value is too large for the radix
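Because this method is documented as following the semantics of java.lang.Character.digit(), the rules above can be illustrated with the standard library alone. The sketch below calls Character.digit(int, int) as a stand-in; DigitSemanticsDemo and digitOf are hypothetical names, and the behavior of UCharacter.digit itself is assumed to match only to the extent the text above states.

```java
final class DigitSemanticsDemo {
    /** Stand-in for the documented semantics, delegating to java.lang.Character. */
    static int digitOf(int codePoint, int radix) {
        return Character.digit(codePoint, radix);
    }

    public static void main(String[] args) {
        System.out.println(digitOf('7', 10));    // decimal digit, value below the radix
        System.out.println(digitOf('f', 16));    // letter used as a hexadecimal digit
        System.out.println(digitOf('z', 36));    // largest letter value, radix 36
        System.out.println(digitOf('f', 10));    // value 15 is not below radix 10
        System.out.println(digitOf(0x0663, 10)); // ARABIC-INDIC DIGIT THREE
        // 'f' has a digit value in radix 16 even though it is not a digit.
        System.out.println(Character.isDigit('f'));
    }
}
```

The last two lines of main show the point made in the description: a code point can yield a non-negative digit value while isDigit reports false for it.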

public static String foldCase (String str, boolean defaultmapping)

The given string is mapped to its case folding equivalent according to UnicodeData.txt and CaseFolding.txt; if any character has no case folding equivalent, the character itself is returned. "Full", multiple-code point case folding mappings are returned here. For "simple" single-code point mappings use the API foldCase(int ch, boolean defaultmapping).

Parameters
str the String to be converted
defaultmapping Indicates whether all mappings defined in CaseFolding.txt are to be used; if false, the mappings for dotted I and dotless i marked with 'I' in CaseFolding.txt are skipped.
Returns
  • the case folding equivalent of the string; any character with no case folding equivalent is returned unchanged.
See Also
  • #foldCase(int, boolean)

public static VersionInfo getAge (int ch)

Get the "age" of the code point.

The "age" is the Unicode version when the code point was first designated (as a non-character or for Private Use) or assigned a character.

This can be useful to avoid emitting code points to receiving processes that do not accept newer characters.

The data is from the UCD file DerivedAge.txt.

Parameters
ch The code point.
Returns
  • the Unicode version number

public static int getCodePoint (char lead, char trail)

Returns a code point corresponding to the two UTF16 characters.

Parameters
lead the lead char
trail the trail char
Returns
  • code point if surrogate characters are valid.
Throws
IllegalArgumentException thrown when the argument characters do not form a valid code point
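The following is a minimal sketch of how a lead and trail surrogate can be combined into a code point, using the standard UTF-16 arithmetic and java.lang.Character for validation. It is not the UCharacter implementation; SurrogatePairDemo and codePointOf are hypothetical names, and the exception message is invented for the example.

```java
final class SurrogatePairDemo {
    /**
     * Combines a UTF-16 surrogate pair into one code point, rejecting
     * arguments that are not a lead (high) followed by a trail (low) surrogate.
     */
    static int codePointOf(char lead, char trail) {
        if (!Character.isHighSurrogate(lead) || !Character.isLowSurrogate(trail)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        // Each surrogate contributes 10 bits; the result is offset by 0x10000.
        return ((lead - 0xD800) << 10) + (trail - 0xDC00) + 0x10000;
    }

    public static void main(String[] args) {
        char lead = (char) 0xD834;
        char trail = (char) 0xDD1E;
        int cp = codePointOf(lead, trail);
        System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        // The standard library computes the same value without validating.
        System.out.println(cp == Character.toCodePoint(lead, trail));
    }
}
```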

public static int getDirection (int ch)

Returns the Bidirection property of a code point. For example, 0x0041 (letter A) has the LEFT_TO_RIGHT directional property.
The result is a constant from the interface UCharacterDirection.

Parameters
ch the code point whose direction is to be determined
Returns
  • direction constant from UCharacterDirection.
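As a rough illustration of querying a directional property, the sketch below uses java.lang.Character.getDirectionality from the standard library as a stand-in. It does not call UCharacter.getDirection, and the constants it compares against are those of java.lang.Character rather than UCharacterDirection. DirectionDemo and its methods are hypothetical names.

```java
final class DirectionDemo {
    /** True if the standard library classifies cp as strong left-to-right. */
    static boolean isLeftToRight(int cp) {
        return Character.getDirectionality(cp)
                == Character.DIRECTIONALITY_LEFT_TO_RIGHT;
    }

    /** True if the standard library classifies cp as strong right-to-left. */
    static boolean isRightToLeft(int cp) {
        return Character.getDirectionality(cp)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT;
    }

    public static void main(String[] args) {
        System.out.println(isLeftToRight(0x0041)); // LATIN CAPITAL LETTER A
        System.out.println(isRightToLeft(0x05D0)); // HEBREW LETTER ALEF
    }
}
```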

public static int getIntPropertyValue (int ch, int type)

Gets the property value for a Unicode property type of a code point. Also returns binary and mask property values.

Unicode, especially in version 3.2, defines many more properties than the original set in UnicodeData.txt.

The properties APIs are intended to reflect Unicode properties as defined in the Unicode Character Database (UCD) and Unicode Technical Reports (UTR). For details about the properties see http://www.unicode.org/.

For names of Unicode properties see the UCD file PropertyAliases.txt.

 Sample usage:
 // c is the code point to query
 int ea = UCharacter.getIntPropertyValue(c, UProperty.EAST_ASIAN_WIDTH);
 int ideo = UCharacter.getIntPropertyValue(c, UProperty.IDEOGRAPHIC);
 boolean b = (ideo == 1); // binary properties return 0 or 1
 

Parameters
ch code point to test.
type UProperty selector constant, identifies which binary property to check. Must be UProperty.BINARY_START <= type < UProperty.BINARY_LIMIT or UProperty.INT_START <= type < UProperty.INT_LIMIT or UProperty.MASK_START <= type < UProperty.MASK_LIMIT.
Returns
  • numeric value that is directly the property value or, for enumerated properties, corresponds to the numeric value of the enumerated constant of the respective property value enumeration type (cast to enum type if necessary). Returns 0 or 1 (for false / true) for binary Unicode properties. Returns a bit-mask for mask properties. Returns 0 if 'type' is out of bounds or if the Unicode version does not have data for the property at all, or not for this code point.
See Also
  • UProperty
  • #hasBinaryProperty
  • #getIntPropertyMinValue
  • #getIntPropertyMaxValue
  • #getUnicodeVersion

public static int getType (int ch)

Returns a value indicating a code point's Unicode category. Up-to-date Unicode implementation of java.lang.Character.getType() except for the above-mentioned code points that had their category changed.
The results are constants from the interface UCharacterCategory.
NOTE: the UCharacterCategory values are not compatible with those returned by java.lang.Character.getType. UCharacterCategory values match the ones used in ICU4C, while java.lang.Character type values, though similar, skip the value 17.

Parameters
ch code point whose type is to be determined
Returns
  • category which is a value of UCharacterCategory
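The note about incompatible values can be made concrete with a small sketch. java.lang.Character defines no general category constant equal to 17, so its values from PRIVATE_USE (18) upward sit one higher than a contiguous numbering would place them. The helper below closes that gap. Treating the result as a UCharacterCategory value is an assumption: it presumes both numberings list the categories in the same order, which the text above does not state. CategoryValueDemo and toContiguous are hypothetical names.

```java
final class CategoryValueDemo {
    /**
     * Hypothetical helper: renumbers a java.lang.Character type value so that
     * the unused value 17 is closed up. Values below 17 are unchanged.
     */
    static int toContiguous(int javaType) {
        return javaType > 17 ? javaType - 1 : javaType;
    }

    public static void main(String[] args) {
        int upper = Character.getType('A');        // UPPERCASE_LETTER
        int privateUse = Character.getType(0xE000); // PRIVATE_USE
        System.out.println(upper + " -> " + toContiguous(upper));
        System.out.println(privateUse + " -> " + toContiguous(privateUse));
    }
}
```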

public static double getUnicodeNumericValue (int ch)

Get the numeric value for a Unicode code point as defined in the Unicode Character Database.

A "double" return type is necessary because some numeric values are fractions, negative, or too large for int.

For characters without any numeric values in the Unicode Character Database, this function will return NO_NUMERIC_VALUE.

API Change: In release 2.2 and prior, this API had a return type of int and returned -1 when the argument ch did not have a corresponding numeric value. This has been changed to synchronize with ICU4C.

This corresponds to the ICU4C function u_getNumericValue.

Parameters
ch Code point to get the numeric value for.
Returns
  • numeric value of ch, or NO_NUMERIC_VALUE if none is defined.
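The sentinel convention described above can be sketched without ICU. The example below defines a local copy of the documented NO_NUMERIC_VALUE constant and a tiny hand-written table standing in for the Unicode Character Database. The table contents, the class NumericValueSentinelDemo, and the method numericValue are illustrative assumptions, not the real lookup.

```java
import java.util.Map;

final class NumericValueSentinelDemo {
    // Local copy of the documented constant value.
    static final double NO_NUMERIC_VALUE = -1.23456789E8;

    // Illustrative sample only; the real data comes from the UCD.
    private static final Map<Integer, Double> SAMPLE = Map.of(
            0x0035, 5.0,  // DIGIT FIVE
            0x00BD, 0.5,  // VULGAR FRACTION ONE HALF
            0x2167, 8.0); // ROMAN NUMERAL EIGHT

    /** Returns the sample numeric value, or the sentinel if none is defined. */
    static double numericValue(int cp) {
        return SAMPLE.getOrDefault(cp, NO_NUMERIC_VALUE);
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0x0035, 0x00BD, 0x0041}) {
            double v = numericValue(cp);
            // Callers compare against the sentinel rather than against -1,
            // since a double result can legitimately be negative or fractional.
            String shown = (v == NO_NUMERIC_VALUE) ? "none" : String.valueOf(v);
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase() + ": " + shown);
        }
    }
}
```

A double sentinel far outside the range of real numeric values lets the method return fractions such as 0.5 without ambiguity.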