Unicode::String - String of Unicode characters
Unicode::String - String of Unicode characters (UCS2/UTF16)
    use Unicode::String qw(utf8 latin1 utf16);
    $u = utf8("The Unicode Standard is a fixed-width, uniform ");
    $u .= utf8("encoding scheme for written characters and text");

    # convert to various external formats
    print $u->ucs4;      # 4 byte characters
    print $u->utf16;     # 2 byte characters + surrogates
    print $u->utf8;      # 1-4 byte characters
    print $u->utf7;      # 7-bit clean format
    print $u->latin1;    # lossy
    print $u->hex;       # a hexadecimal string

    # all these can be used to set string value or as constructor
    $u->latin1("Å være eller å ikke være");
    $u = utf16("\0Å\0 \0v\0æ\0r\0e");

    # string operations
    $u2 = $u->copy;
    $u->append($u2);
    $u->repeat(2);
    $u->chop;

    $u->length;
    $u->index($other);
    $u->index($other, $pos);

    $u->substr($offset);
    $u->substr($offset, $length);
    $u->substr($offset, $length, $substitute);

    # overloading
    $u .= "more";
    $u = $u x 100;
    print "$u\n";

    # string <--> array of numbers
    @array = $u->unpack;
    $u->pack(@array);

    # misc
    $u->ord;
    $u = uchr($num);
A Unicode::String object represents a sequence of Unicode characters. The Unicode Standard is a fixed-width, uniform encoding scheme for written characters and text. This encoding treats alphabetic characters, ideographic characters, and symbols identically, which means that they can be used in any mixture and with equal facility. Unicode is modeled on the ASCII character set, but uses a 16-bit encoding to support full multilingual text.
Internally a Unicode::String object is a string of 2 byte values in network byte order (big-endian). The class provides various methods to convert to and from external formats, and all string manipulations are performed on strings in this internal 16-bit format.
The functions utf16(), utf8(), utf7(), ucs2(), ucs4(), latin1() and uchr() can be imported from the Unicode::String module and will work as constructors initializing strings of the corresponding encoding. ucs2() and utf16() are really aliases for the same function.
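As a minimal sketch of the imported constructors, the following builds the same character (U+00E5) from three external forms. It assumes Unicode::String is installed from CPAN; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Unicode::String qw(utf8 latin1 uchr);

# LATIN SMALL LETTER A WITH RING ABOVE (U+00E5), built three ways.
my $from_utf8   = utf8("\xC3\xA5");   # two UTF-8 bytes
my $from_latin1 = latin1("\xE5");     # one Latin-1 byte
my $from_code   = uchr(0xE5);         # a code point number

# Each object holds one character with the same character code.
for my $u ($from_utf8, $from_latin1, $from_code) {
    printf "code 0x%04X, length %d\n", $u->ord, $u->length;
}
```

The arguments are written as byte escapes so the example does not depend on the encoding of the source file.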
The Unicode::String objects overload various operators, so they will normally work like plain 8-bit strings in Perl. This includes conversions to strings, numbers and booleans as well as assignment, concatenation and repetition.
The following methods are available:
stringify_as() returns the current encoding constructor function. The encoding argument ($enc) is a string with one of the following values: "ucs4", "ucs2", "utf16", "utf8", "utf7", "latin1", "hex". The default is "utf8".
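The sketch below shows how the stringification encoding might be switched. It assumes that stringify_as() is called as a class method on Unicode::String and that the returned value is a code reference; neither detail is spelled out in the text above.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $u = latin1("ab");

# Assumption: stringify_as() is invoked as a class method and
# returns the previously active encoding function.
my $previous = Unicode::String->stringify_as("hex");
my $as_hex   = "$u";      # interpolation now uses the hex format

Unicode::String->stringify_as("latin1");
my $as_latin1 = "$u";     # interpolation now yields Latin-1 bytes

print "$as_hex\n$as_latin1\n";
```

Because the setting is global to the class, code that changes it temporarily may want to restore the earlier encoding afterwards.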
If an initial value is given, it is decoded according to the stringify_as() encoding and used to initialize the newly created object.
Normally you create Unicode::String objects by importing some of the encoding methods below as functions into your namespace and calling them with an appropriately encoded argument.
The ucs4() method always returns the old value of $us and, if given an argument, decodes the UCS-4 string and sets this as the new value of $us. The characters in $newval must be in the range 0x0 .. 0x10FFFF. Characters outside this range are ignored.
ucs2() and utf16() are really just different names for the same method. The UCS-2 encoding uses 16 bits per character. The UTF-16 encoding is identical to UCS-2, but includes the use of surrogate pairs. Surrogates make it possible to encode characters in the range 0x010000 .. 0x10FFFF with two consecutive 16-bit chars. Encoded as a Perl string, each character (or surrogate code) takes 2 bytes in network byte order.
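The surrogate arithmetic can be sketched in plain Perl without the module. The helper name surrogate_pair is hypothetical and is not part of Unicode::String.

```perl
use strict;
use warnings;

# Split a code point above 0xFFFF into a UTF-16 surrogate pair.
# (Illustrative helper; not a Unicode::String function.)
sub surrogate_pair {
    my ($cp) = @_;
    die "not in 0x010000 .. 0x10FFFF\n"
        if $cp < 0x10000 || $cp > 0x10FFFF;
    my $v    = $cp - 0x10000;           # 20 significant bits remain
    my $high = 0xD800 + ($v >> 10);     # top 10 bits
    my $low  = 0xDC00 + ($v & 0x3FF);   # bottom 10 bits
    return ($high, $low);
}

my ($high, $low) = surrogate_pair(0x1F600);

# Two 16-bit values, each packed as 2 bytes in network byte order.
my $bytes = pack("n2", $high, $low);
printf "U+%04X U+%04X (%d bytes)\n", $high, $low, length $bytes;
# prints "U+D83D U+DE00 (4 bytes)"
```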
The ucs2() method always returns the old value of $us and, if given an argument, sets this as the new value of $us.
The utf8() method always returns the old value of $us encoded using UTF-8 and, if given an argument, decodes the UTF-8 string and sets this as the new value of $us.
The utf7() method always returns the old value of $us encoded using UTF-7 and, if given an argument, decodes the UTF-7 string and sets this as the new value of $us.
If the (global) variable $Unicode::String::UTF7_OPTIONAL_DIRECT_CHARS is TRUE, then a wider range of characters is encoded as themselves. It is TRUE by default. The characters affected by this are:
! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
The latin1() method always returns the old value of $us and, if given an argument, sets this as the new value of $us. Characters outside the 0x0 .. 0xFF range are ignored when returning a Latin-1 string. If you want more control over the mapping from Unicode to Latin-1, use the Unicode::Map8 class. This is also the way to deal with other 8-bit character sets.
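A small sketch of the lossy conversion, assuming Unicode::String is installed: a character above 0xFF is appended and then dropped when the string is read back as Latin-1.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1 uchr);

my $u = latin1("caf\xE9");     # 4 characters, all within Latin-1
$u->append(uchr(0x263A));      # U+263A has no Latin-1 equivalent

# The object keeps all 5 characters; only the Latin-1 view loses one.
my $bytes = $u->latin1;
printf "%d characters, %d Latin-1 bytes\n", $u->length, length $bytes;
```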
The hex() method returns a plain ASCII string where each Unicode character is represented by a "U+XXXX" string, with the characters separated by a single space. This format can also be used to set the value of $us (in which case the "U+" is optional).
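As a hedged illustration of the hex format in both directions, assuming Unicode::String is installed and that an empty object can be created with Unicode::String->new:

```perl
use strict;
use warnings;
use Unicode::String qw(utf8);

# Reading: one "U+XXXX" group per 16-bit value.
my $u = utf8("Hi!");
my $hex = $u->hex;
print "$hex\n";

# Writing: the "U+" prefix may be present or omitted.
my $v = Unicode::String->new;
$v->hex("U+0048 0069 U+0021");
print $v->utf8, "\n";
```

Characters above 0xFFFF would appear as two groups, one per surrogate code, since the format reflects the internal 16-bit values.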
The default stringify_as() method is "utf8".
$us x $count
$us . $other_string
$us .= $other_string
Unicode reserves the character U+FEFF as a byte order mark. This works because the swapped character, U+FFFE, is reserved and never valid. For strings that have the byte order mark as the first character, we can guarantee to get the byte order right with the following code:
$ustr->byteswap if $ustr->ord == 0xFFFE;
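A fuller sketch of the same check, assuming Unicode::String is installed. The input string is a made-up example of little-endian data with a leading byte order mark.

```perl
use strict;
use warnings;
use Unicode::String qw(utf16);

# Hypothetical input: "Hi" as little-endian 16-bit data with a BOM.
my $raw  = "\xFF\xFE" . "H\0" . "i\0";
my $ustr = utf16($raw);

# Swap only if the first character reads as the reversed mark U+FFFE.
$ustr->byteswap if $ustr->ord == 0xFFFE;

# The first character is now the byte order mark itself; skip it.
my $text = $ustr->substr(1);
print $text->latin1, "\n";
```

The check is harmless when the data is already in network byte order, because the first character is then U+FEFF and no swap takes place.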
The ord() method deals with surrogate pairs, which gives a result range of 0x0 .. 0x10FFFF. If the $us string is empty, undef is returned.
The following utility functions are provided. They will be exported on request.
the Unicode::CharName manpage, the Unicode::Map8 manpage, http://www.unicode.org/
Copyright 1997-2000 Gisle Aas.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.