public final class

UTF16

extends Object
java.lang.Object
   ↳ sun.text.normalizer.UTF16

Class Overview

Standalone utility class providing UTF16 character conversions and indexing conversions.

Code that uses strings alone rarely needs modification. By design, UTF-16 does not allow overlap, so searching for strings is a safe operation. Similarly, concatenation is always safe. Substringing is safe if the start and end are both on UTF-32 boundaries. In normal code, the values for start and end are on those boundaries, since they arose from operations like searching. If not, the nearest UTF-32 boundaries can be determined using bounds().

Examples:

The following examples illustrate use of some of these methods.

 // iteration forwards: Original
 for (int i = 0; i < s.length(); ++i) {
     char ch = s.charAt(i);
     doSomethingWith(ch);
 }

 // iteration forwards: Changes for UTF-32
 int ch;
 for (int i = 0; i < s.length(); i+=UTF16.getCharCount(ch)) {
     ch = UTF16.charAt(s,i);
     doSomethingWith(ch);
 }

 // iteration backwards: Original
 for (int i = s.length() -1; i >= 0; --i) {
     char ch = s.charAt(i);
     doSomethingWith(ch);
 }

 // iteration backwards: Changes for UTF-32
 int ch;
 for (int i = s.length() -1; i >= 0; i-=UTF16.getCharCount(ch)) {
     ch = UTF16.charAt(s,i);
     doSomethingWith(ch);
 }
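
The same two loops can also be sketched with the public java.lang.String and java.lang.Character methods, which may help readers who cannot call this internal class directly. This is an illustration under that substitution, not the UTF16 API itself; the class name CodePointIterationSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: code point iteration using only public JDK methods.
class CodePointIterationSketch {

    // Forward: read the code point starting at i, then step over its 1 or 2 chars.
    static List<Integer> forward(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(cp);
            i += Character.charCount(cp);
        }
        return out;
    }

    // Backward: read the code point ending just before i, then step back over it.
    static List<Integer> backward(String s) {
        List<Integer> out = new ArrayList<>();
        for (int i = s.length(); i > 0; ) {
            int cp = s.codePointBefore(i);
            out.add(cp);
            i -= Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println(forward(s));
        System.out.println(backward(s));
    }
}
```

Here the backward loop starts at s.length() and reads the code point before the index, so the first character of the string is visited without a special case.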
 
Notes:
  • Naming: For clarity, High and Low surrogates are called Lead and Trail in the API, which gives a better sense of their ordering in a string. offset16 and offset32 are used to distinguish offsets to UTF-16 boundaries vs offsets to UTF-32 boundaries. int char32 is used to contain UTF-32 characters, as opposed to char16, which is a UTF-16 code unit.
  • Roundtripping Offsets: You can always roundtrip from a UTF-32 offset to a UTF-16 offset and back. Because of the difference in structure, you can roundtrip from a UTF-16 offset to a UTF-32 offset and back if and only if bounds(string, offset16) != TRAIL.
  • Exceptions: The error checking will throw an exception if indices are out of bounds. Other than that, all methods will behave reasonably, even if unmatched surrogates or out-of-bounds UTF-32 values are present. UCharacter.isLegal() can be used to check for validity if desired.
  • Unmatched Surrogates: If the string contains unmatched surrogates, then these are counted as one UTF-32 value. This matches their iteration behavior, which is vital. It also matches common display practice as missing glyphs (see the Unicode Standard Section 5.4, 5.5).
  • Optimization: The method implementations may need optimization if the compiler doesn't fold static final methods. Since surrogate pairs will form an exceedingly small percentage of all the text in the world, the singleton case should always be optimized for.
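
The notes on roundtripping offsets and unmatched surrogates can be illustrated with a small sketch. It assumes java.lang.String's codePointCount and offsetByCodePoints as stand-ins for the offset conversions (this class does not list such methods in its summary), so the intermediate UTF-32 offset for a trail position may differ from what a UTF16 implementation would report. The class name OffsetRoundTripSketch is hypothetical.

```java
// Hypothetical helper; java.lang.String stands in for the offset conversions.
class OffsetRoundTripSketch {

    // UTF-16 offset -> code point offset: count the code points before offset16.
    static int toOffset32(String s, int offset16) {
        return s.codePointCount(0, offset16);
    }

    // Code point offset -> UTF-16 offset of the first char of that code point.
    static int toOffset16(String s, int offset32) {
        return s.offsetByCodePoints(0, offset32);
    }

    // True when offset16 survives the trip to a code point offset and back.
    static boolean roundTrips(String s, int offset16) {
        return toOffset16(s, toOffset32(s, offset16)) == offset16;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', lead, trail, 'b'
        for (int i = 0; i <= s.length(); i++) {
            System.out.println(i + " round-trips: " + roundTrips(s, i));
        }
        // An unmatched surrogate is counted as one code point.
        System.out.println("x\uD800y".codePointCount(0, 3));
    }
}
```

Only offset 2, which points at the trail surrogate of the pair, fails to roundtrip in this example.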

Summary

Constants
int CODEPOINT_MAX_VALUE The highest Unicode code point value (scalar value) according to the Unicode Standard.
int CODEPOINT_MIN_VALUE The lowest Unicode code point value.
int LEAD_SURROGATE_MAX_VALUE Lead surrogate maximum value
int LEAD_SURROGATE_MIN_VALUE Lead surrogate minimum value
int SUPPLEMENTARY_MIN_VALUE The minimum value for Supplementary code points
int SURROGATE_MIN_VALUE Surrogate minimum value
int TRAIL_SURROGATE_MAX_VALUE Trail surrogate maximum value
int TRAIL_SURROGATE_MIN_VALUE Trail surrogate minimum value
Public Constructors
UTF16()
Public Methods
static StringBuffer append(StringBuffer target, int char32)
Append a single UTF-32 value to the end of a StringBuffer.
static int charAt(String source, int offset16)
Extract a single UTF-32 value from a string.
static int charAt(char[] source, int start, int limit, int offset16)
Extract a single UTF-32 value from a substring.
static int getCharCount(int char32)
Determines how many chars this char32 requires.
static char getLeadSurrogate(int char32)
Returns the lead surrogate.
static char getTrailSurrogate(int char32)
Returns the trail surrogate.
static boolean isLeadSurrogate(char char16)
Determines whether the character is a lead surrogate.
static boolean isSurrogate(char char16)
Determines whether the code value is a surrogate.
static boolean isTrailSurrogate(char char16)
Determines whether the character is a trail surrogate.
static int moveCodePointOffset(char[] source, int start, int limit, int offset16, int shift32)
Shifts offset16 by the argument number of codepoints within a subarray.
static String valueOf(int char32)
Convenience method corresponding to String.valueOf(char).
Inherited Methods
From class java.lang.Object

Constants

public static final int CODEPOINT_MAX_VALUE

The highest Unicode code point value (scalar value) according to the Unicode Standard.

Constant Value: 1114111 (0x0010ffff)

public static final int CODEPOINT_MIN_VALUE

The lowest Unicode code point value.

Constant Value: 0 (0x00000000)

public static final int LEAD_SURROGATE_MAX_VALUE

Lead surrogate maximum value

Constant Value: 56319 (0x0000dbff)

public static final int LEAD_SURROGATE_MIN_VALUE

Lead surrogate minimum value

Constant Value: 55296 (0x0000d800)

public static final int SUPPLEMENTARY_MIN_VALUE

The minimum value for Supplementary code points

Constant Value: 65536 (0x00010000)

public static final int SURROGATE_MIN_VALUE

Surrogate minimum value

Constant Value: 55296 (0x0000d800)

public static final int TRAIL_SURROGATE_MAX_VALUE

Trail surrogate maximum value

Constant Value: 57343 (0x0000dfff)

public static final int TRAIL_SURROGATE_MIN_VALUE

Trail surrogate minimum value

Constant Value: 56320 (0x0000dc00)

Public Constructors

public UTF16 ()

Public Methods

public static StringBuffer append (StringBuffer target, int char32)

Append a single UTF-32 value to the end of a StringBuffer. If a validity check is required, use isLegal() on char32 before calling.

Parameters
target the buffer to append to
char32 value to append.
Returns
  • the updated StringBuffer
Throws
IllegalArgumentException thrown when char32 does not lie within the range of the Unicode codepoints
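
As a rough illustration of the behaviour described above, the following sketch re-implements append with public JDK calls. It is not the class's actual source; the class name AppendSketch and the exception message are assumptions, and the two constants are copied from the values listed in the Constants section.

```java
// Hypothetical re-implementation of the documented append behaviour.
class AppendSketch {
    static final int CODEPOINT_MAX_VALUE = 0x10FFFF;
    static final int SUPPLEMENTARY_MIN_VALUE = 0x10000;

    static StringBuffer append(StringBuffer target, int char32) {
        // Reject values outside the Unicode code point range.
        if (char32 < 0 || char32 > CODEPOINT_MAX_VALUE) {
            throw new IllegalArgumentException(
                    "Illegal codepoint: " + Integer.toHexString(char32));
        }
        if (char32 >= SUPPLEMENTARY_MIN_VALUE) {
            // Supplementary code points need a lead and a trail surrogate.
            target.append(Character.highSurrogate(char32));
            target.append(Character.lowSurrogate(char32));
        } else {
            target.append((char) char32);
        }
        return target;
    }

    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer("x");
        append(sb, 'y');
        append(sb, 0x1F600);
        System.out.println(sb.length()); // prints 4
    }
}
```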

public static int charAt (String source, int offset16)

Extract a single UTF-32 value from a string. Used when iterating forwards or backwards (with UTF16.getCharCount()), as well as for random access. If a validity check is required, use UCharacter.isLegal() on the return value. If the char retrieved is part of a surrogate pair, its supplementary character will be returned. If a complete supplementary character is not found, the incomplete character will be returned.

Parameters
source string of UTF-16 chars
offset16 UTF-16 offset to the start of the character.
Returns
  • UTF-32 value of the code point that contains the char at offset16. The boundaries of that codepoint are the same as in bounds32().
Throws
IndexOutOfBoundsException thrown if offset16 is out of bounds.
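
The lookup described above differs from String.codePointAt in that an offset pointing at a trail surrogate also yields the whole supplementary character. A minimal sketch of that behaviour, assuming nothing beyond the description here (the class name CharAtSketch is hypothetical and this is not the class's actual source):

```java
// Hypothetical re-implementation of the documented charAt(String, int) behaviour.
class CharAtSketch {

    static int charAt(String source, int offset16) {
        // String.charAt throws an IndexOutOfBoundsException subclass when out of range.
        char single = source.charAt(offset16);
        if (Character.isHighSurrogate(single)) {
            // Lead surrogate: combine with a following trail, if there is one.
            if (offset16 + 1 < source.length()) {
                char trail = source.charAt(offset16 + 1);
                if (Character.isLowSurrogate(trail)) {
                    return Character.toCodePoint(single, trail);
                }
            }
        } else if (Character.isLowSurrogate(single)) {
            // Trail surrogate: combine with a preceding lead, if there is one.
            if (offset16 > 0) {
                char lead = source.charAt(offset16 - 1);
                if (Character.isHighSurrogate(lead)) {
                    return Character.toCodePoint(lead, single);
                }
            }
        }
        // Ordinary char or unmatched surrogate: return it unchanged.
        return single;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00";
        System.out.println(Integer.toHexString(charAt(s, 1))); // prints 1f600
        System.out.println(Integer.toHexString(charAt(s, 2))); // prints 1f600
    }
}
```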

public static int charAt (char[] source, int start, int limit, int offset16)

Extract a single UTF-32 value from a substring. Used when iterating forwards or backwards (with UTF16.getCharCount()), as well as for random access. If a validity check is required, use UCharacter.isLegal() on the return value. If the char retrieved is part of a surrogate pair, its supplementary character will be returned. If a complete supplementary character is not found, the incomplete character will be returned.

Parameters
source array of UTF-16 chars
start offset in the source array at which the substring to analyze begins
limit offset in the source array at which the substring to analyze ends
offset16 UTF-16 offset relative to start
Returns
  • UTF-32 value of the code point that contains the char at offset16. The boundaries of that codepoint are the same as in bounds32().
Throws
IndexOutOfBoundsException thrown if offset16 is not within the range of start and limit.
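
A sketch of the subarray variant follows, built on Character.codePointAt(char[], int, int) and Character.codePointBefore(char[], int, int). It assumes that limit is exclusive and that a surrogate pair straddling start or limit is treated as incomplete; neither detail is stated above, so treat both as assumptions. The class name SubarrayCharAtSketch is hypothetical.

```java
// Hypothetical re-implementation of the documented charAt(char[], ...) behaviour.
class SubarrayCharAtSketch {

    static int charAt(char[] source, int start, int limit, int offset16) {
        int index = start + offset16; // offset16 is relative to start
        if (index < start || index >= limit) {
            throw new ArrayIndexOutOfBoundsException(offset16);
        }
        if (Character.isLowSurrogate(source[index])) {
            // Looks back for a lead surrogate, but never before start.
            return Character.codePointBefore(source, index + 1, start);
        }
        // Looks ahead for a trail surrogate, but never at or past limit.
        return Character.codePointAt(source, index, limit);
    }

    public static void main(String[] args) {
        char[] src = "\uD83D\uDE00a\uD83D\uDE00".toCharArray();
        // Whole array: offset 1 is a trail whose lead is inside the range.
        System.out.println(Integer.toHexString(charAt(src, 0, 5, 1))); // prints 1f600
        // Range [1, 4): the same trail has no lead inside the range.
        System.out.println(Integer.toHexString(charAt(src, 1, 4, 0))); // prints de00
    }
}
```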

public static int getCharCount (int char32)

Determines how many chars this char32 requires. If a validity check is required, use isLegal() on char32 before calling.

Parameters
char32 the input codepoint.
Returns
  • 2 if char32 is in supplementary space, otherwise 1.

public static char getLeadSurrogate (int char32)

Returns the lead surrogate. If a validity check is required, use isLegal() on char32 before calling.

Parameters
char32 the input character.
Returns
  • lead surrogate if getCharCount(char32) is 2;
    and 0 otherwise (note: 0 is not a valid lead surrogate).

public static char getTrailSurrogate (int char32)

Returns the trail surrogate. If a validity check is required, use isLegal() on char32 before calling.

Parameters
char32 the input character.
Returns
  • the trail surrogate if getCharCount(char32) is 2;
    otherwise the character itself
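
The arithmetic behind getCharCount, getLeadSurrogate and getTrailSurrogate can be sketched as follows. The formulas are the standard UTF-16 encoding of a supplementary code point, and the fallback return values follow the descriptions above. The class name SurrogateMathSketch and the two helper constants LEAD_OFFSET and TRAIL_MASK are hypothetical; this is not the class's actual source.

```java
// Hypothetical re-implementation of the documented surrogate arithmetic.
class SurrogateMathSketch {
    static final int SUPPLEMENTARY_MIN_VALUE = 0x10000;
    static final int TRAIL_SURROGATE_MIN_VALUE = 0xDC00;
    // 0xD800 - (0x10000 >> 10): folds the 0x10000 subtraction into the lead base.
    static final int LEAD_OFFSET = 0xD7C0;
    // The low 10 bits of the code point go into the trail surrogate.
    static final int TRAIL_MASK = 0x3FF;

    static int getCharCount(int char32) {
        return char32 < SUPPLEMENTARY_MIN_VALUE ? 1 : 2;
    }

    static char getLeadSurrogate(int char32) {
        if (char32 >= SUPPLEMENTARY_MIN_VALUE) {
            return (char) (LEAD_OFFSET + (char32 >> 10));
        }
        return 0; // not a valid lead surrogate, signals "no lead needed"
    }

    static char getTrailSurrogate(int char32) {
        if (char32 >= SUPPLEMENTARY_MIN_VALUE) {
            return (char) (TRAIL_SURROGATE_MIN_VALUE + (char32 & TRAIL_MASK));
        }
        return (char) char32; // BMP value: the character itself
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        System.out.println(Integer.toHexString(getLeadSurrogate(cp)));  // prints d83d
        System.out.println(Integer.toHexString(getTrailSurrogate(cp))); // prints de00
    }
}
```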

public static boolean isLeadSurrogate (char char16)

Determines whether the character is a lead surrogate.

Parameters
char16 the input character.
Returns
  • true iff the input character is a lead surrogate

public static boolean isSurrogate (char char16)

Determines whether the code value is a surrogate.

Parameters
char16 the input character.
Returns
  • true iff the input character is a surrogate.

public static boolean isTrailSurrogate (char char16)

Determines whether the character is a trail surrogate.

Parameters
char16 the input character.
Returns
  • true iff the input character is a trail surrogate.
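
The three surrogate predicates reduce to range checks against the constants listed earlier on this page. A minimal sketch, assuming plain comparisons (the real implementation may use bit masks instead); the class name SurrogatePredicateSketch is hypothetical:

```java
// Hypothetical re-implementation of the surrogate predicates as range checks.
class SurrogatePredicateSketch {
    static final int LEAD_SURROGATE_MIN_VALUE = 0xD800;
    static final int LEAD_SURROGATE_MAX_VALUE = 0xDBFF;
    static final int TRAIL_SURROGATE_MIN_VALUE = 0xDC00;
    static final int TRAIL_SURROGATE_MAX_VALUE = 0xDFFF;

    static boolean isLeadSurrogate(char char16) {
        return LEAD_SURROGATE_MIN_VALUE <= char16 && char16 <= LEAD_SURROGATE_MAX_VALUE;
    }

    static boolean isTrailSurrogate(char char16) {
        return TRAIL_SURROGATE_MIN_VALUE <= char16 && char16 <= TRAIL_SURROGATE_MAX_VALUE;
    }

    // The lead and trail ranges are adjacent, so one range check covers both.
    static boolean isSurrogate(char char16) {
        return LEAD_SURROGATE_MIN_VALUE <= char16 && char16 <= TRAIL_SURROGATE_MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(isLeadSurrogate('\uD83D'));  // prints true
        System.out.println(isTrailSurrogate('\uD83D')); // prints false
        System.out.println(isSurrogate('a'));           // prints false
    }
}
```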

public static int moveCodePointOffset (char[] source, int start, int limit, int offset16, int shift32)

Shifts offset16 by the argument number of codepoints within a subarray.

Parameters
source char array
start offset in the source array at which the subarray begins
limit offset in the source array at which the subarray ends
offset16 UTF16 position to shift relative to start
shift32 number of codepoints to shift
Returns
  • new shifted offset16 relative to start
Throws
IndexOutOfBoundsException if the new offset16 is out of bounds with respect to the subarray or the subarray bounds are out of range.
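
A comparable operation exists in the public API as Character.offsetByCodePoints(char[], int, int, int, int), which takes a subarray length and an absolute index rather than a limit and a relative offset. The sketch below adapts one calling convention to the other. It assumes limit is exclusive, and edge cases (for example an offset16 that points into the middle of a surrogate pair) may not match this method exactly. The class name MoveOffsetSketch is hypothetical.

```java
// Hypothetical adapter from the documented signature to Character.offsetByCodePoints.
class MoveOffsetSketch {

    static int moveCodePointOffset(char[] source, int start, int limit,
                                   int offset16, int shift32) {
        // The JDK method takes (array, subarray start, subarray length,
        // absolute index, code point shift) and returns an absolute index.
        int absolute = Character.offsetByCodePoints(
                source, start, limit - start, start + offset16, shift32);
        return absolute - start; // convert back to an offset relative to start
    }

    public static void main(String[] args) {
        char[] src = "xa\uD83D\uDE00bx".toCharArray();
        // Subarray [1, 5) holds 'a', U+1F600, 'b'.
        System.out.println(moveCodePointOffset(src, 1, 5, 0, 2));  // prints 3
        System.out.println(moveCodePointOffset(src, 1, 5, 3, -1)); // prints 1
    }
}
```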

public static String valueOf (int char32)

Convenience method corresponding to String.valueOf(char). Returns a one or two char string containing the UTF-32 value in UTF16 format. If a validity check is required, use isLegal() on char32 before calling.

Parameters
char32 the input character.
Returns
  • string value of char32 in UTF16 format
Throws
IllegalArgumentException thrown if char32 is an invalid codepoint.
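
For comparison, a one or two char string can be produced from a code point with the public Character.toChars method. The sketch below wraps it to mirror the behaviour described above; the class name ValueOfSketch and the exception message are assumptions, and this is not the class's actual source.

```java
// Hypothetical equivalent of valueOf(int) built on Character.toChars.
class ValueOfSketch {

    static String valueOf(int char32) {
        if (!Character.isValidCodePoint(char32)) {
            throw new IllegalArgumentException(
                    "Illegal codepoint: " + Integer.toHexString(char32));
        }
        // toChars returns one char for BMP values and a surrogate pair otherwise.
        return new String(Character.toChars(char32));
    }

    public static void main(String[] args) {
        System.out.println(valueOf('A').length());     // prints 1
        System.out.println(valueOf(0x1F600).length()); // prints 2
    }
}
```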