java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html
The Java platform will track the Unicode specification as it evolves. The precise version of Unicode used by a given release is specified in the documentation of the class Character.

The first 128 characters of the Unicode character encoding are the ASCII characters.

A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the Unicode character whose encoding is xxxx. This translation step allows any program to be expressed using only ASCII characters.

The longest possible translation is used at each step, even if the result does not ultimately make a correct program while another lexical translation would.

This translation step results in a sequence of Unicode input characters:

UnicodeInputCharacter:
        UnicodeEscape
        RawInputCharacter
UnicodeEscape:
        \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit
UnicodeMarker:
        u
        UnicodeMarker u
RawInputCharacter:
        any Unicode character
HexDigit: one of
        0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F

The \, u, and hexadecimal digits here are all ASCII characters.

In addition to the processing implied by the grammar, for each raw input character that is a backslash \, input processing must consider how many other \ characters contiguously precede it, separating it from a non-\ character or the start of the input stream. If this number is even, then the \ is eligible to begin a Unicode escape; if this number is odd, then the \ is not eligible to begin a Unicode escape.

If an eligible \ is not followed by u, then it is treated as a RawInputCharacter and remains part of the escaped Unicode stream. If an eligible \ is followed by u, or more than one u, and the last u is not followed by four hexadecimal digits, then a compile-time error occurs.

The character produced by a Unicode escape does not participate in further Unicode escapes. For example, the raw input \u005cu005a results in the six characters \ u 0 0 5 a, because 005c is the Unicode value for \.
It does not result in the character Z, which is Unicode character 005a, because the \ that resulted from the \u005c is not interpreted as the start of a further Unicode escape.

The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u (for example, \uxxxx becomes \uuxxxx) while simultaneously converting each non-ASCII character in the source text to a \uxxxx escape containing a single u. This transformed version is equally acceptable to a compiler for the Java programming language ("Java compiler") and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.

Implementations should use the \uxxxx notation as an output format to display Unicode characters when a suitable font is not available.

The following definition of lines determines the line numbers produced by a Java compiler or other system component.

LineTerminator:
        the ASCII LF character, also known as "newline"
        the ASCII CR character, also known as "return"
        the ASCII CR character followed by the ASCII LF character
InputCharacter:
        UnicodeInputCharacter but not CR or LF

Lines are terminated by the ASCII characters CR, or LF, or CR LF. The two characters CR immediately followed by LF are counted as one line terminator, not two.

The result is a sequence of line terminators and input characters, which are the terminal symbols for the third step in the tokenization process.
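
The first two translation steps described above (Unicode escape translation and line-terminator recognition) can be sketched in a few lines of Java. This is an illustrative sketch only, not how any particular compiler implements them: the class name LexicalSketch and its method names are hypothetical, the input is assumed to be a String of raw characters, and an IllegalArgumentException stands in for the compile-time error the text calls for on a malformed escape.

```java
final class LexicalSketch {

    // Value of an ASCII hexadecimal digit, or -1 if c is not one.
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Step 1: replace each eligible Unicode escape with the character it denotes.
    static String translateUnicodeEscapes(String raw) {
        StringBuilder out = new StringBuilder();
        int backslashRun = 0; // raw backslashes contiguously preceding position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            // A backslash is eligible only if an even number of backslashes precede it.
            boolean eligible = c == '\\' && backslashRun % 2 == 0;
            if (eligible && i + 1 < raw.length() && raw.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') {
                    j++; // one or more u markers are allowed
                }
                if (j + 4 > raw.length()) {
                    throw new IllegalArgumentException("malformed Unicode escape at index " + i);
                }
                int value = 0;
                for (int k = j; k < j + 4; k++) {
                    int digit = hexValue(raw.charAt(k));
                    if (digit < 0) {
                        throw new IllegalArgumentException("malformed Unicode escape at index " + i);
                    }
                    value = value * 16 + digit;
                }
                // The produced character is emitted as-is and never starts another escape.
                out.append((char) value);
                backslashRun = 0;
                i = j + 4;
            } else {
                out.append(c);
                backslashRun = (c == '\\') ? backslashRun + 1 : 0;
                i++;
            }
        }
        return out.toString();
    }

    // Step 2: count line terminators, treating CR immediately followed by LF as one.
    static int countLineTerminators(String input) {
        int count = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\n') {
                count++;
            } else if (c == '\r') {
                count++;
                if (i + 1 < input.length() && input.charAt(i + 1) == '\n') {
                    i++; // skip the LF of a CR LF pair
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Raw input: backslash, u, 0, 0, 5, c, u, 0, 0, 5, a
        System.out.println(translateUnicodeEscapes("\\u005cu005a").length()); // prints 6
        System.out.println(translateUnicodeEscapes("\\uu0041"));              // prints A
        System.out.println(countLineTerminators("a\r\nb\rc\n"));              // prints 3
    }
}
```

Tracking only the length of the current run of raw backslashes is enough to decide eligibility, and resetting that count after each escape is what keeps a backslash produced by an escape from starting a further one.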
As a special concession for compatibility with certain operating systems, the ASCII SUB character (\u001a, or control-Z) is ignored if it is the last character in the escaped input stream.

Consider two tokens x and y in the resulting input stream. If x precedes y, then we say that x is to the left of y and that y is to the right of x. For example, in this simple piece of code:

        class Empty {
        }

we say that the } token is to the right of the { token, even though it appears, in this two-dimensional representation on paper, downward and to the left of the { token. This convention about the use of the words left and right allows us to speak, for example, of the right-hand operand of a binary operator or of the left-hand side of an assignment.

Comments are formally specified by the following productions:

Comment:
        TraditionalComment
        EndOfLineComment
TraditionalComment:
        / * NotStar CommentTail
EndOfLineComment:
        / / CharactersInLine opt LineTerminator
CommentTail:
        * CommentTailStar
        NotStar CommentTail
CommentTailStar:
        /
        * CommentTailStar
        NotStarNotSlash CommentTail
NotStar:
        InputCharacter but not *
        LineTerminator
NotStarNotSlash:
        InputCharacter but not * or /
        LineTerminator
CharactersInLine:
        InputCharacter
        CharactersInLine InputCharacter

These productions imply all of the following properties:

* Comments do not nest.
* /* and */ have no special meaning in comments that begin with //.
* // has no special meaning in comments that begin with /* or /**.

As a result, the text:

        /* this comment /* // /** ends here: */

is a single complete comment.

Identifier:
        IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral
IdentifierChars:
        JavaLetter
        IdentifierChars JavaLetterOrDigit
JavaLetter:
        any Unicode character that is a Java letter (see below)
JavaLetterOrDigit:
        any Unicode character that is a Java letter-or-digit (see below)

Letters and digits may be drawn from the entire Unicode character set, which supports most writing scripts in use in the world today, including the large sets for Chinese, Japanese, and Korean.
This allows programmers to use identifiers in their programs that are written in their native languages.

The Java letters include uppercase and lowercase ASCII Latin letters A-Z (\u0041-\u005a), and a-z (\u0061-\u007a), and, for historical reasons, the ASCII underscore (_, or \u005f) and dollar sign ($, or \u0024). The $ character should be used only in mechanically generated source code or, rarely, to access preexisting names on legacy systems.

The "Java digits" include the ASCII digits 0-9 (\u0030-\u0039).

Two identifiers are the same only if they are identical, that is, have the same Unicode character for each letter or digit.

Identifiers that have the same external appearance may yet be different. For example, the identifiers consisting of the single letters LATIN CAPITAL LETTER A (A, \u0041), LATIN SMALL LETTER A (a, \u0061), GREEK CAPITAL LETTER ALPHA (Α, \u0391), and CYRILLIC SMALL LETTER A (а, \u0430) are all different.

Unicode composite characters are different from the decomposed characters. For example, a LATIN CAPITAL LETTER A ACUTE (Á, \u00c1) could be considered to be the same as a LATIN CAPITAL LETTER A (A, \u0041) immediately followed by a NON-SPACING ACUTE (´, \u0301) when sorting, but these are different in identifiers. See The Unicode Standard, Volume 1, pages 412ff for details about decomposition, and see pages 626-627 of that work for details about sorting.

The keywords const and goto are reserved, even though they are not currently used. This may allow a Java compiler to produce better error messages if these C++ keywords incorrectly appear in programs.
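
The identifier rules above can be explored with the standard library. The following is a minimal sketch, assuming that Character.isJavaIdentifierStart and Character.isJavaIdentifierPart are an acceptable approximation of "Java letter" and "Java letter-or-digit". The class name IdentifierSketch and the method hasIdentifierShape are hypothetical. The check looks only at character shape: it does not exclude keywords or the boolean and null literals, and it inspects one char at a time, so supplementary characters are not handled. java.text.Normalizer is used only to illustrate that composed and decomposed forms can be treated as equivalent elsewhere while remaining different in identifiers.

```java
import java.text.Normalizer;

final class IdentifierSketch {

    // True if s starts with a Java letter and continues with Java letters or digits.
    // Keyword, BooleanLiteral and NullLiteral exclusions are deliberately not checked.
    static boolean hasIdentifierShape(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasIdentifierShape("_total$1")); // prints true
        System.out.println(hasIdentifierShape("1abc"));     // prints false

        // Look-alike letters from different scripts are different characters.
        String latinA = "\u0041";
        String greekAlpha = "\u0391";
        System.out.println(latinA.equals(greekAlpha));      // prints false

        // Composed and decomposed forms differ as character sequences...
        String composed = "\u00c1";
        String decomposed = "A\u0301";
        System.out.println(composed.equals(decomposed));    // prints false

        // ...even though normalization to NFC maps one onto the other.
        String normalized = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.equals(normalized));    // prints true
    }
}
```

Because identifier identity is defined character by character, no normalization is applied when identifiers are compared; the Normalizer call here only shows why two spellings that look alike on screen can still name different entities.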
An integer literal may be expressed in decimal (base 10), hexadecimal (base 16), or octal (base 8):

IntegerLiteral:
        DecimalIntegerLiteral
        HexIntegerLiteral
        OctalIntegerLiteral
DecimalIntegerLiteral:
        DecimalNumeral IntegerTypeSuffix opt
HexIntegerLiteral:
        HexNumeral IntegerTypeSuffix opt
OctalIntegerLiteral:
        OctalNumeral IntegerTypeSuffix opt
IntegerTypeSuffix: one of
        l L

An integer literal is of type long if it is suffixed with an ASCII letter L or l (ell); otherwise it is of type int. The suffix L is preferred, because the letter l (ell) is often hard to distinguish from the digit 1 (one).

A decimal numeral is either the single ASCII character 0, representing the integer zero, or consists of an ASCII digit from 1 to 9, optionally followed by one or more ASCII digits from 0 to 9, representing a positive integer:

DecimalNumeral:
        0
        NonZeroDigit Digits opt
Digits:
        Digit
        Digits Digit
Digit:
        0
        NonZeroDigit
NonZeroDigit: one of
        1 2 3 4 5 6 7 8 9

A hexadecimal numeral consists of the leading ASCII characters 0x or 0X followed by one or more ASCI...