The Java™ Tutorials

Unicode Support

As of the JDK 7 release, regular expression pattern matching has expanded functionality to support Unicode 6.0.

Matching a Specific Code Point

You can match a specific Unicode code point using an escape sequence of the form \uFFFF, where FFFF is the hexadecimal value of the code point you want to match. For example, \u6771 matches the Han character for east.

Alternatively, you can specify a code point using Perl-style hex notation, \x{...}. In Java source the backslash must be doubled inside a string literal so that the regular expression engine receives it. For example:

String hexPattern = "\\x{" + Integer.toHexString(codePoint) + "}";
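As a minimal sketch of both notations, the class below compiles a \uFFFF-style pattern and a \x{...}-style pattern and checks each against a one-character input. The class name CodePointMatch and the helper forCodePoint are hypothetical names chosen for illustration; they are not part of the JDK.

```java
import java.util.regex.Pattern;

class CodePointMatch {
    // Hypothetical helper: builds a pattern matching exactly one code point
    // via the \x{...} notation. The backslash is doubled so that the regex
    // engine receives a literal backslash followed by x{.
    static Pattern forCodePoint(int codePoint) {
        return Pattern.compile("\\x{" + Integer.toHexString(codePoint) + "}");
    }

    public static void main(String[] args) {
        // The pattern string is handed to the regex engine as an escape;
        // the input string contains the Han character itself.
        Pattern east = Pattern.compile("\\u6771");
        System.out.println(east.matcher("\u6771").matches()); // true

        // U+1F600 lies outside the Basic Multilingual Plane, so a single
        // four-digit escape cannot express it; the braces form can.
        int codePoint = 0x1F600;
        String input = new String(Character.toChars(codePoint));
        System.out.println(forCodePoint(codePoint).matcher(input).matches()); // true
    }
}
```

The helper accepts any valid code point, so the same call works for characters inside and outside the Basic Multilingual Plane.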

Unicode Character Properties

Each Unicode character, in addition to its value, has certain attributes, or properties. You can match a single character belonging to a particular category with the expression \p{prop}. You can match a single character not belonging to a particular category with the expression \P{prop}.

The three supported property types are scripts, blocks, and a "general" category.
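The difference between \p{prop} and \P{prop} can be sketched with a pair of string filters. This example assumes the property L (Unicode letters, covered under General Category); the class and method names are hypothetical.

```java
class PropertyNegation {
    // Removes every character that is NOT a letter, using the negated form \P{L}.
    static String keepLetters(String s) {
        return s.replaceAll("\\P{L}", "");
    }

    // Removes every character that IS a letter, using the positive form \p{L}.
    static String dropLetters(String s) {
        return s.replaceAll("\\p{L}", "");
    }

    public static void main(String[] args) {
        // The sample mixes Latin letters, two Han characters, digits, and punctuation.
        String sample = "Hello, \u4e16\u754c 42!";
        System.out.println(keepLetters(sample));
        System.out.println(dropLetters(sample));
    }
}
```

The two calls partition the input: every character of the sample survives in exactly one of the two results.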

Scripts

To determine if a code point belongs to a specific script, you can use either the script keyword or its short form sc, for example, \p{script=Hiragana}. Alternatively, you can prefix the script name with the string Is, such as \p{IsHiragana}.

Valid script names supported by Pattern are those accepted by UnicodeScript.forName.
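The three spellings of a script property could be exercised as follows. This is a sketch only: ScriptMatch and allInScript are hypothetical names, and the helper validates the script name up front through Character.UnicodeScript.forName, which throws IllegalArgumentException for a name it does not recognize.

```java
import java.util.regex.Pattern;

class ScriptMatch {
    // Hypothetical helper: true if every code point of text belongs to the named script.
    static boolean allInScript(String text, String scriptName) {
        // Fails fast with IllegalArgumentException on an unknown script name.
        Character.UnicodeScript.forName(scriptName);
        return Pattern.matches("\\p{script=" + scriptName + "}+", text);
    }

    public static void main(String[] args) {
        // Four Hiragana letters, written as escapes to avoid source-encoding issues.
        String hiragana = "\u3072\u3089\u304c\u306a";

        // Keyword form, short form, and Is-prefixed form of the same property.
        System.out.println(allInScript(hiragana, "Hiragana"));              // true
        System.out.println(Pattern.matches("\\p{sc=Hiragana}+", hiragana)); // true
        System.out.println(Pattern.matches("\\p{IsHiragana}+", hiragana));  // true

        // A Katakana letter is not in the Hiragana script.
        System.out.println(allInScript("\u30ab", "Hiragana"));              // false
    }
}
```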

Blocks

A block can be specified using the block keyword or its short form blk, for example, \p{block=Mongolian}. Alternatively, you can prefix the block name with the string In, such as \p{InMongolian}.

Valid block names supported by Pattern are those accepted by UnicodeBlock.forName.
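A comparable sketch for blocks, assuming U+1820 (MONGOLIAN LETTER A) as the sample code point. BlockMatch and inBlock are hypothetical names; the last line cross-checks the regex result against Character.UnicodeBlock directly.

```java
import java.util.regex.Pattern;

class BlockMatch {
    // Hypothetical helper: true if the code point falls in the named Unicode block.
    static boolean inBlock(int codePoint, String blockName) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{block=" + blockName + "}", s);
    }

    public static void main(String[] args) {
        int mongolianA = 0x1820;

        // Keyword form, short form, and In-prefixed form of the same property.
        System.out.println(inBlock(mongolianA, "Mongolian"));                // true
        System.out.println(Pattern.matches("\\p{blk=Mongolian}", "\u1820")); // true
        System.out.println(Pattern.matches("\\p{InMongolian}", "\u1820"));   // true

        // The same lookup without a regular expression.
        System.out.println(Character.UnicodeBlock.of(mongolianA)
                == Character.UnicodeBlock.forName("Mongolian"));             // true
    }
}
```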

General Category

Categories can be specified with the optional prefix Is. For example, \p{IsL} matches the category of Unicode letters. Categories can also be specified by using the general_category keyword, or the short form gc. For example, an uppercase letter can be matched using \p{general_category=Lu} or \p{gc=Lu}.

Supported categories are those of The Unicode Standard in the version specified by the Character class.
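As an illustration of the category syntax, the sketch below counts uppercase letters with \p{gc=Lu} and checks a lowercase letter against \p{IsL}. CategoryMatch and countUppercase are hypothetical names used only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CategoryMatch {
    // Hypothetical helper: counts characters in the Lu (uppercase letter) category.
    static int countUppercase(String s) {
        Matcher m = Pattern.compile("\\p{gc=Lu}").matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The first character is an accented capital E, which is also in category Lu.
        System.out.println(countUppercase("\u00c9cole Polytechnique")); // 2

        // The Is-prefixed form covers all letters, regardless of case.
        System.out.println(Pattern.matches("\\p{IsL}", "a"));           // true
        System.out.println(Pattern.matches("\\p{IsL}", "7"));           // false
    }
}
```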

