Unicode {ISOcodes} | R Documentation |
Basic Unicode data, including the Universal Character Set (UCS) code points as defined by the ISO/IEC 10646 International Standard.
data("Unicode")
A data frame with the following variables:
Code:
Name:
General_Category:
Canonical_Combining_Class:
Bidi_Class:
Decomposition:
Numeric_Value_Decimal_Digit:
Numeric_Value_Digit:
Numeric_Value:
Bidi_Mirrored: "Y" or "N", indicating whether the character has been identified as a “mirrored” character in bidirectional text or not.
Unicode_1_Name:
ISO_Comment:
Simple_Uppercase_Mapping:
Simple_Lowercase_Mapping:
Simple_Titlecase_Mapping:
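As a rough illustration of working with the Code variable, the sketch below turns hexadecimal code points into the characters they denote. It assumes Code holds hexadecimal strings such as "0041", as in UnicodeData.txt; the storage type in the data frame is not stated here, so treat that as an assumption. The helper name `code_to_char` is hypothetical and not part of ISOcodes.

```r
# Hypothetical helper (not part of ISOcodes): convert hexadecimal code
# point strings to the corresponding characters.
# Assumption: codes look like "0041", as in UnicodeData.txt.
code_to_char <- function(code) {
  cp <- strtoi(as.character(code), base = 16L)  # hex string -> integer
  vapply(cp,
         function(i) if (is.na(i)) NA_character_ else intToUtf8(i),
         character(1))
}

# Toy input so the sketch runs without the data set.
code_to_char(c("0041", "0061", "0030"))
```

With the data set loaded, something like `code_to_char(head(Unicode$Code))` could be tried, provided the assumption about the Code format holds.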
Variable General_Category has the following property values (levels).
Lu | Letter, Uppercase |
Ll | Letter, Lowercase |
Lt | Letter, Titlecase |
Lm | Letter, Modifier |
Lo | Letter, Other |
Mn | Mark, Nonspacing |
Mc | Mark, Spacing Combining |
Me | Mark, Enclosing |
Nd | Number, Decimal Digit |
Nl | Number, Letter |
No | Number, Other |
Pc | Punctuation, Connector |
Pd | Punctuation, Dash |
Ps | Punctuation, Open |
Pe | Punctuation, Close |
Pi | Punctuation, Initial quote (may behave like Ps or Pe depending on usage) |
Pf | Punctuation, Final quote (may behave like Ps or Pe depending on usage) |
Po | Punctuation, Other |
Sm | Symbol, Math |
Sc | Symbol, Currency |
Sk | Symbol, Modifier |
So | Symbol, Other |
Zs | Separator, Space |
Zl | Separator, Line |
Zp | Separator, Paragraph |
Cc | Other, Control |
Cf | Other, Format |
Cs | Other, Surrogate |
Co | Other, Private Use |
Cn | Other, Not Assigned (no characters in the file have this property) |
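The first letter of each General_Category value identifies its major class in the table above (Letter, Mark, Number, and so on). A minimal sketch of grouping by that letter follows; `general_category_major` is a hypothetical helper, not a function from ISOcodes, and the toy vector stands in for the real column.

```r
# Major classes keyed by the first letter of the two-letter code,
# taken from the General_Category table above.
major_class <- c(L = "Letter", M = "Mark", N = "Number",
                 P = "Punctuation", S = "Symbol",
                 Z = "Separator", C = "Other")

# Hypothetical helper (not part of ISOcodes): map codes such as "Lu"
# or "Nd" to their major class; unknown or missing codes give NA.
general_category_major <- function(gc) {
  unname(major_class[substr(as.character(gc), 1, 1)])
}

# Toy input so the sketch runs without the data set.
toy <- c("Lu", "Nd", "Zs", "Cc", "Pd")
table(general_category_major(toy))
```

With the data set loaded, `table(general_category_major(Unicode$General_Category))` would be the analogous call.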
Variable Canonical_Combining_Class has the following property values (levels).
0 | Spacing, split, enclosing, reordrant, and Tibetan subjoined |
1 | Overlays and interior |
7 | Nuktas |
8 | Hiragana/Katakana voicing marks |
9 | Viramas |
10 | Start of fixed position classes |
199 | End of fixed position classes |
200 | Below left attached |
202 | Below attached |
204 | Below right attached |
208 | Left attached (reordrant around single base character) |
210 | Right attached |
212 | Above left attached |
214 | Above attached |
216 | Above right attached |
218 | Below left |
220 | Below |
222 | Below right |
224 | Left (reordrant around single base character) |
226 | Right |
228 | Above left |
230 | Above |
232 | Above right |
233 | Double below |
234 | Double above |
240 | Below (iota subscript) |
Variable Bidi_Class has the following property values (levels).
L | Left-to-Right |
LRE | Left-to-Right Embedding |
LRO | Left-to-Right Override |
R | Right-to-Left |
AL | Right-to-Left Arabic |
RLE | Right-to-Left Embedding |
RLO | Right-to-Left Override |
PDF | Pop Directional Format |
EN | European Number |
ES | European Number Separator |
ET | European Number Terminator |
AN | Arabic Number |
CS | Common Number Separator |
NSM | Non-Spacing Mark |
BN | Boundary Neutral |
B | Paragraph Separator |
S | Segment Separator |
WS | Whitespace |
ON | Other Neutrals |
The decomposition types in variable Decomposition are as follows.
<font> | A font variant (e.g., a blackletter form). |
<noBreak> | A no-break version of a space or hyphen. |
<initial> | An initial presentation form (Arabic). |
<medial> | A medial presentation form (Arabic). |
<final> | A final presentation form (Arabic). |
<isolated> | An isolated presentation form (Arabic). |
<circle> | An encircled form. |
<super> | A superscript form. |
<sub> | A subscript form. |
<vertical> | A vertical layout presentation form. |
<wide> | A wide (or zenkaku) compatibility character. |
<narrow> | A narrow (or hankaku) compatibility character. |
<small> | A small variant form (CNS compatibility). |
<square> | A CJK squared font variant. |
<fraction> | A vulgar fraction form. |
<compat> | Otherwise unspecified compatibility character. |
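The sketch below shows one way to split a decomposition entry into its type tag and its code points. It assumes the Decomposition variable keeps the raw UnicodeData.txt form, that is, an optional tag in angle brackets followed by space-separated hexadecimal code points, with untagged entries being canonical decompositions. Both helper names are hypothetical and not part of ISOcodes.

```r
# Hypothetical helper: extract the decomposition type.
# Tagged entries return the tag without brackets, untagged non-empty
# entries return "canonical", empty or missing entries return NA.
decomposition_type <- function(x) {
  x <- as.character(x)
  has_tag <- grepl("^<[A-Za-z]+>", x)
  tag <- sub("^<([A-Za-z]+)>.*$", "\\1", x)
  ifelse(has_tag, tag,
         ifelse(is.na(x) | x == "", NA_character_, "canonical"))
}

# Hypothetical helper: extract the code points as integers.
decomposition_codepoints <- function(x) {
  body <- trimws(sub("^<[A-Za-z]+>", "", as.character(x)))
  lapply(strsplit(body, " +"), strtoi, base = 16L)
}

# Toy input in the UnicodeData.txt style.
toy <- c("<compat> 0020 0308", "0041 0300", "", "<noBreak> 0020")
decomposition_type(toy)
decomposition_codepoints(toy[1])
```

With the data set loaded, `table(decomposition_type(Unicode$Decomposition))` would give a count per decomposition type, if the assumption about the field layout holds.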
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
http://en.wikipedia.org/wiki/Unicode and http://en.wikipedia.org/wiki/ISO_10646; see http://www.unicode.org/Public/UNIDATA/UCD.html for details on the Unicode data sets.