perlunicode - Unicode support in Perl |
perlunicode - Unicode support in Perl (EXPERIMENTAL, subject to change)
WARNING: As of the 5.6.1 release, the implementation of Unicode support in Perl is incomplete, and continues to be highly experimental.
The following areas need further work. They are being rapidly addressed in the 5.7.x development branch.
use utf8 still needed to enable a few features
The utf8 pragma implements the tables used for Unicode support. These tables are automatically loaded on demand, so the utf8 pragma need not normally be used.
However, as a compatibility measure, this pragma must be explicitly used to enable recognition of UTF-8 encoded literals and identifiers in the source text.
Beginning with version 5.6, Perl uses logically wide characters to represent strings internally. This internal representation of strings uses the UTF-8 encoding.
In future, Perl-level operations can be expected to work with characters rather than bytes, in general.
However, as strictly an interim compatibility measure, Perl v5.6 aims to provide a safe migration path from byte semantics to character semantics for programs. For operations where Perl can unambiguously decide that the input data is characters, Perl now switches to character semantics. For operations where this determination cannot be made without additional information from the user, Perl decides in favor of compatibility, and chooses to use byte semantics.
This behavior preserves compatibility with earlier versions of Perl, which allowed byte semantics in Perl operations, but only as long as none of the program's inputs are marked as being a source of Unicode character data. Such data may come from filehandles, from calls to external programs, from information provided by the system (such as %ENV), or from literals and constants in the source text.
If the -C command line switch is used (or the ${^WIDE_SYSTEM_CALLS} global flag is set to 1), all system calls will use the corresponding wide character APIs. This is currently only implemented on Windows.
Regardless of the above, the bytes pragma can always be used to force byte semantics in a particular lexical scope. See the bytes manpage.
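As a rough illustration of lexical scoping with the bytes pragma, the sketch below measures the same string inside and outside a block that enables it. It assumes a current Perl 5 interpreter, so details may differ on 5.6.1; the variable names are made up for the example.

```perl
use strict;
use warnings;

# A one-character string: U+263A WHITE SMILING FACE.
my $smiley = "\x{263A}";

# Outside the scope of the bytes pragma, length() counts characters.
my $char_len = length($smiley);

my $byte_len;
{
    # Inside this block only, string operations use byte semantics,
    # so length() reports the size of the internal UTF-8 encoding.
    use bytes;
    $byte_len = length($smiley);
}

print "characters: $char_len, bytes: $byte_len\n";
```

The pragma stops applying at the closing brace, so code after the block sees character semantics again.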
The utf8 pragma is primarily a compatibility device that enables recognition of UTF-8 in literals encountered by the parser. It may also be used for enabling some of the more experimental Unicode support features. Note that this pragma is only required until a future version of Perl in which character semantics will become the default. This pragma may then become a no-op. See the utf8 manpage.
Unless mentioned otherwise, Perl operators will use character semantics
when they are dealing with Unicode data, and byte semantics otherwise.
Thus, character semantics for these operations apply transparently; if
the input data came from a Unicode source (for example, by adding a
character encoding discipline to the filehandle whence it came, or a
literal UTF-8 string constant in the program), character semantics
apply; otherwise, byte semantics are in effect. To force byte semantics
on Unicode data, the bytes pragma should be used.
Under character semantics, many operations that formerly operated on
bytes change to operating on characters. For ASCII data this makes
no difference, because UTF-8 stores ASCII in single bytes, but for
any character greater than chr(127), the character may be stored in
a sequence of two or more bytes, all of which have the high bit set.
But by and large, the user need not worry about this, because Perl
hides it from the user. A character in Perl is logically just a number
ranging from 0 to 2**32 or so. Larger characters encode to longer
sequences of bytes internally, but again, this is just an internal
detail which is hidden at the Perl level.
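To make the distinction between a character and its internal byte sequence concrete, here is a small sketch that compares length() with bytes::length() for a few code points. It assumes a current Perl 5 interpreter, where loading the bytes module with an empty import list makes bytes::length() available without switching on byte semantics; the code points were picked only for illustration.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without enabling byte semantics

# Map each code point to the number of bytes Perl uses to store it.
my %byte_len;
for my $code (0x41, 0x3B1, 0x263A, 0x10000) {
    my $char = chr($code);
    $byte_len{$code} = bytes::length($char);
    printf "U+%04X: length %d, internal bytes %d\n",
        $code, length($char), $byte_len{$code};
}
```

Each string has a length of one character, while the internal byte count grows with the code point.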
Character semantics have the following effects:
Strings and patterns may contain characters that have an ordinal value larger than 255. Presuming you use a Unicode editor to edit your program, such characters will typically occur directly within the literal strings as UTF-8 characters, but you can also specify a particular character with an extension of the \x notation. UTF-8 characters are specified by putting the hexadecimal code within curlies after the \x. For instance, a Unicode smiley face is \x{263A}.
Regular expressions match characters instead of bytes. (However, the \C pattern is provided to force a match of a single byte (``char'' in C, hence \C).)
Character classes in regular expressions match characters instead of bytes, so \w can be used to match an ideograph, for instance.
Named Unicode properties may be used as character classes via the \p{} (matches property) and \P{} (doesn't match property) constructs. For instance, \p{Lu} matches any character with the Unicode uppercase property, while \p{M} matches any mark character. Single letter properties may omit the braces, so \p{M} can also be written \pM. Many predefined character classes are available, such as \p{IsMirrored} and \p{InTibetan}.
The special pattern \X matches any extended Unicode sequence (a ``combining character sequence'' in Standardese), where the first character is a base character and subsequent characters are mark characters that apply to the base character. It is equivalent to (?:\PM\pM*).
The tr/// operator translates characters instead of bytes. Note that the tr///CU functionality has been removed, as the interface was a mistake. For similar functionality see pack('U0', ...) and pack('C0', ...).
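Case translation operators use the Unicode case translation tables when given character input. Note that uc() translates to uppercase, while ucfirst translates to titlecase (for languages that make the distinction). Naturally the corresponding backslash sequences have the same semantics.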
Most operators that deal with positions or lengths in a string switch to character positions, including chop(), substr(), pos(), index(), rindex(), sprintf(), write(), and length(). Operators that specifically don't switch include vec(), pack(), and unpack(). Operators that really don't care include chomp(), as well as any other operator that treats a string as a bucket of bits, such as sort(), and the operators dealing with filenames.
The pack()/unpack() letters ``c'' and ``C'' do not change, since they're often used for byte-oriented formats. (Again, think ``char'' in the C language.) However, there is a new ``U'' specifier that will convert between UTF-8 characters and integers. (It works outside of the utf8 pragma too.)
The chr() and ord() functions work on characters. This is like pack("U") and unpack("U"), not like pack("C") and unpack("C"). In fact, the latter are how you now emulate byte-oriented chr() and ord() under utf8.
The bit string operators & | ^ ~ can operate on character data. However, for backward compatibility reasons (bit string operations when the characters all are less than 256 in ordinal value) one cannot mix ~ (the bit complement) and characters both less than 256 and equal to or greater than 256. Most importantly, DeMorgan's laws (~($x|$y) eq ~$x&~$y, ~($x&$y) eq ~$x|~$y) won't hold. Another way to look at this is that the complement cannot return both the 8-bit (byte) wide bit complement and the full character wide bit complement.
Finally, scalar reverse() reverses by character rather than by byte.
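Several of the effects listed above can be tried out in one short script. This is only a sketch, assuming a current Perl 5 interpreter (behavior on 5.6.1 may differ in places); the variable names and the code_points() helper are invented for the example.

```perl
use strict;
use warnings;

# Strings are built from code points with \x{...}, so the example does
# not depend on the encoding of this source file.
my $s = "\x{3B1}\x{3B2}\x{3B3}";    # Greek small alpha, beta, gamma

# length() and substr() work in characters.
my $len   = length($s);
my $first = substr($s, 0, 1);

# ord() converts a character to its code point; pack("U") and
# unpack("U") convert between integers and characters.
my $code     = ord($first);
my $smiley   = pack("U", 0x263A);
my ($number) = unpack("U", $smiley);

# uc() consults the Unicode case tables.
my $upper = uc($s);

# tr/// translates characters, including ranges above 255.
(my $translated = $s) =~ tr/\x{3B1}-\x{3B3}/\x{391}-\x{393}/;

# Regular expressions see characters: \w matches an ideograph,
# \p{Lu} matches uppercase letters, and \X matches a base character
# together with its combining marks.
my $ideograph_is_word = "\x{4E2D}" =~ /^\w$/ ? 1 : 0;
my $uppercase_count   = () = $upper =~ /\p{Lu}/g;
my @sequences         = "e\x{301}x" =~ /(\X)/g;

# scalar reverse() reverses by character.
my $reversed = scalar reverse($s);

# Hypothetical helper: render a string as a list of code points, so no
# output encoding has to be configured for printing.
sub code_points {
    my ($str) = @_;
    return join " ", map { sprintf "U+%04X", ord($_) } split //, $str;
}

print "length:     $len\n";
print "first:      ", code_points($first), "\n";
print "uppercase:  ", code_points($upper), "\n";
print "translated: ", code_points($translated), "\n";
print "reversed:   ", code_points($reversed), "\n";
print "sequences:  ", scalar(@sequences), "\n";
```

Printing code points instead of the characters themselves sidesteps the question of output encoding, which this document treats separately.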
[XXX: This feature is not yet implemented.]
As of yet, there is no method for automatically coercing input and output to some encoding other than UTF-8. This is planned in the near future, however.
Whether an arbitrary piece of data will be treated as ``characters'' or ``bytes'' by internal operations cannot be divined at the current time.
Use of locales with utf8 may lead to odd results. Currently there is some attempt to apply 8-bit locale info to characters in the range 0..255, but this is demonstrably incorrect for locales that use characters above that range (when mapped into Unicode). It will also tend to run slower. Avoidance of locales is strongly encouraged.
the bytes manpage, the utf8 manpage, ${^WIDE_SYSTEM_CALLS} in the perlvar manpage