NAME

utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code

SYNOPSIS

 use utf8;
 use utf8 'Greek', 'Arabic';  # allow these scripts in identifiers
 no utf8;

 # Convert the internal representation of a Perl scalar to/from UTF-8.

 $num_octets = utf8::upgrade($string);
 $success    = utf8::downgrade($string[, $fail_ok]);

 # Change each character of a Perl scalar to/from a series of
 # characters that represent the UTF-8 bytes of each original character.

 utf8::encode($string);  # "\x{100}"  becomes "\xc4\x80"
 utf8::decode($string);  # "\xc4\x80" becomes "\x{100}"

 # Convert a code point from the platform native character set to
 # Unicode, and vice-versa.
 $unicode = utf8::native_to_unicode(ord('A')); # returns 65 on both
                                               # ASCII and EBCDIC
                                               # platforms
 $native = utf8::unicode_to_native(65);        # returns 65 on ASCII
                                               # platforms; 193 on
                                               # EBCDIC

 $flag = utf8::is_utf8($string); # since Perl 5.8.1
 $flag = utf8::valid($string);

DESCRIPTION

The use utf8 pragma tells the Perl parser to allow UTF-8, and certain mixed scripts other than Latin, Common and Inherited, in the program text in the current lexical scope, for identifiers (package and symbol names, function and variable names) and literals. It does not declare strings in the source to be UTF-8 encoded or Unicode; for that, see "The 'unicode_strings' feature" in feature.

The no utf8 pragma tells Perl to switch back to treating the source text as literal bytes in the current lexical scope. (On EBCDIC platforms the pragma technically allows UTF-EBCDIC rather than UTF-8, but the distinction is academic, so this document uses the term UTF-8 to mean both.)

Do not use this pragma for anything other than telling Perl that your script is written in UTF-8. The utility functions described below are directly usable without use utf8;.

Because it is not possible to reliably tell UTF-8 from native 8-bit encodings, you need either a Byte Order Mark at the beginning of your source code, or use utf8;, to instruct perl.

When UTF-8 becomes the standard source format, this pragma without any argument will effectively become a no-op.

See also the effects of the -C switch and its cousin, the PERL_UNICODE environment variable, in perlrun.

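Enabling the utf8 pragma has the following effect:

Bytes in the source text that are not in the ASCII character set are treated as being part of a literal UTF-8 sequence. This includes most literals such as identifier names, string constants, and constant regular expression patterns.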

Note that if you have non-ASCII, non-UTF-8 bytes in your script (for example embedded Latin-1 in your string literals), use utf8 will be unhappy. If you want to have such bytes under use utf8, you can disable this pragma until the end of the block (or file, if at top level) with no utf8;.
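The difference between the two scopes can be seen by measuring the same literal under each. This is a minimal sketch that assumes the file is saved as UTF-8 and that the literal is the precomposed character U+00E9; the variable names are illustrative only.

```perl
use strict;
use warnings;

my ($chars, $bytes);

{
    use utf8;                # the literal is read as UTF-8 text
    $chars = length("é");    # one character, U+00E9
}

{
    no utf8;                 # the literal is read as raw bytes
    $bytes = length("é");    # two bytes, 0xC3 0xA9
}

print "use utf8: $chars, no utf8: $bytes\n";
```

Under those assumptions the first length is 1 and the second is 2, because the same two bytes in the file are interpreted either as one character or as two.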

Valid scripts

use utf8 takes any valid UCD script names as arguments. This declares those scripts as valid for all identifiers; all other scripts besides 'Latin', 'Common' and 'Inherited' are invalid. The declaration is currently global, not lexically scoped. Requiring valid scripts to be declared disallows Unicode confusables from different language families, which might look the same but are not. This does not affect strings, only names, literals and numbers.
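As an illustration of the argument form described here, the sketch below declares the Greek script and then uses a Greek identifier. It assumes the file is saved as UTF-8; the variable name is hypothetical. On a perl that does not interpret script arguments, the declaration is not expected to have the restricting effect described in this section.

```perl
use strict;
use warnings;
use utf8 'Greek';    # declare the Greek script for identifiers

my $αβγ = 42;        # Greek letters in a variable name
print $αβγ + 1, "\n";
```

Without the script declaration, the identifier above is the kind of name this section describes as invalid.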

The Unicode standard 9.0 defines 137 scripts, i.e. written language families. They can be listed with:

    perl -alne'/; (\w+) #/ && print $1' lib/unicore/Scripts.txt | \
        sort -u

Adlam Ahom Anatolian_Hieroglyphs Arabic Armenian Avestan Balinese Bamum Bassa_Vah Batak Bengali Bhaiksuki Bopomofo Brahmi Braille Buginese Buhid Canadian_Aboriginal Carian Caucasian_Albanian Chakma Cham Cherokee Common Coptic Cuneiform Cypriot Cyrillic Deseret Devanagari Duployan Egyptian_Hieroglyphs Elbasan Ethiopic Georgian Glagolitic Gothic Grantha Greek Gujarati Gurmukhi Han Hangul Hanunoo Hatran Hebrew Hiragana Imperial_Aramaic Inherited Inscriptional_Pahlavi Inscriptional_Parthian Javanese Kaithi Kannada Katakana Kayah_Li Kharoshthi Khmer Khojki Khudawadi Lao Latin Lepcha Limbu Linear_A Linear_B Lisu Lycian Lydian Mahajani Malayalam Mandaic Manichaean Marchen Meetei_Mayek Mende_Kikakui Meroitic_Cursive Meroitic_Hieroglyphs Miao Modi Mongolian Mro Multani Myanmar Nabataean New_Tai_Lue Newa Nko Ogham Ol_Chiki Old_Hungarian Old_Italic Old_North_Arabian Old_Permic Old_Persian Old_South_Arabian Old_Turkic Oriya Osage Osmanya Pahawh_Hmong Palmyrene Pau_Cin_Hau Phags_Pa Phoenician Psalter_Pahlavi Rejang Runic Samaritan Saurashtra Sharada Shavian Siddham SignWriting Sinhala Sora_Sompeng Sundanese Syloti_Nagri Syriac Tagalog Tagbanwa Tai_Le Tai_Tham Tai_Viet Takri Tamil Tangut Telugu Thaana Thai Tibetan Tifinagh Tirhuta Ugaritic Vai Warang_Citi Yi

Note that this matches the UCD, and differs slightly from the old-style casing returned by "charscript()" in previous versions of Unicode::UCD.
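To find out which script a given character belongs to, "charscript()" can be called with a code point. A minimal sketch, using three single-word script names so that the casing difference mentioned above does not come into play:

```perl
use strict;
use warnings;
use Unicode::UCD 'charscript';

# charscript() takes a code point and returns the name of its script.
for my $cp (0x41, 0x3B1, 0x430) {
    printf "U+%04X  %s\n", $cp, charscript($cp);
}
```

The three code points are LATIN CAPITAL LETTER A, GREEK SMALL LETTER ALPHA and CYRILLIC SMALL LETTER A, so the reported scripts are Latin, Greek and Cyrillic.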

We add some aliases for languages using multiple scripts:

   :Japanese => Katakana Hiragana Han
   :Korean   => Hangul Han
   :Hanb     => Han Bopomofo

These three aliases need not be declared. They are allowed scripts in the Highly Restrictive Level for identifiers.

Certain scripts don't need to be declared:

We follow the Moderately Restrictive Level for identifiers. That is, all characters in each identifier must be from a single script, or from any of the following combinations:

Latin + Han + Hiragana + Katakana; or equivalently: Latn + Jpan

Latin + Han + Bopomofo; or equivalently: Latn + Hanb

Latin + Han + Hangul; or equivalently: Latn + Kore

Latin may also be combined with other Recommended or Aspirational scripts, except Cyrillic and Greek.
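The single-script rule and the three fixed combinations can be sketched in plain Perl with Unicode::UCD. This is only an illustration of the rule as stated above, not the parser's implementation: the helper names scripts_in and scripts_ok are hypothetical, and the final rule about Recommended and Aspirational scripts is deliberately left out.

```perl
use strict;
use warnings;
use Unicode::UCD 'charscript';

# Script combinations permitted in addition to single-script identifiers.
my @allowed_mixes = (
    [qw(Latin Han Hiragana Katakana)],
    [qw(Latin Han Bopomofo)],
    [qw(Latin Han Hangul)],
);

# Return the scripts used in $name, ignoring Common and Inherited.
sub scripts_in {
    my ($name) = @_;
    my %seen;
    for my $ch (split //, $name) {
        my $script = charscript(ord $ch);
        next if !defined $script
             || $script eq 'Common'
             || $script eq 'Inherited';
        $seen{$script} = 1;
    }
    return sort keys %seen;
}

# True if $name uses one script, or a subset of one allowed combination.
sub scripts_ok {
    my ($name) = @_;
    my @scripts = scripts_in($name);
    return 1 if @scripts <= 1;
    for my $mix (@allowed_mixes) {
        my %in_mix = map { $_ => 1 } @$mix;
        return 1 unless grep { !$in_mix{$_} } @scripts;
    }
    return 0;
}

for my $name ("count", "c\x{43E}unt", "x\x{65E5}\x{672C}") {
    printf "%d scripts: %s\n",
        scalar(my @s = scripts_in($name)),
        scripts_ok($name) ? "ok" : "mixed";
}
```

The second name replaces the Latin "o" with CYRILLIC SMALL LETTER O (U+043E), so it mixes Latin and Cyrillic and is rejected; the third combines Latin with Han, which falls under the allowed combinations.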

So these scripts always need to be declared:

Cyrillic Greek Ahom Anatolian_Hieroglyphs Avestan Balinese Bamum Bassa_Vah Batak Brahmi Braille Buginese Buhid Carian Caucasian_Albanian Chakma Cham Cherokee Coptic Cuneiform Cypriot Deseret Duployan Egyptian_Hieroglyphs Elbasan Glagolitic Gothic Grantha Hanunoo Hatran Imperial_Aramaic Inscriptional_Pahlavi Inscriptional_Parthian Javanese Kaithi Kayah_Li Kharoshthi Khojki Khudawadi Lepcha Limbu Linear_A Linear_B Lisu Lycian Lydian Mahajani Mandaic Manichaean Meetei_Mayek Mende_Kikakui Meroitic_Cursive Meroitic_Hieroglyphs Modi Mro Multani Nabataean New_Tai_Lue Nko Ogham Ol_Chiki Old_Hungarian Old_Italic Old_North_Arabian Old_Permic Old_Persian Old_South_Arabian Old_Turkic Osmanya Pahawh_Hmong Palmyrene Pau_Cin_Hau Phags_Pa Phoenician Psalter_Pahlavi Rejang Runic Samaritan Saurashtra Sharada Shavian Siddham SignWriting Sora_Sompeng Sundanese Syloti_Nagri Syriac Tagalog Tagbanwa Tai_Le Tai_Tham Tai_Viet Takri Tirhuta Ugaritic Vai Warang_Citi

Utility functions

The following functions are defined in the utf8:: package by the Perl core. You do not need to say use utf8 to use them, and in fact you should not say it unless you really want to have UTF-8 source code.
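A short sketch of calling these functions without loading the pragma, using the same values as the SYNOPSIS. The variable names are illustrative, and the byte counts assume an ASCII platform.

```perl
use strict;
use warnings;

my $string = "\x{100}";              # one character, U+0100

utf8::encode($string);               # now the two bytes "\xc4\x80"
my $encoded_len = length $string;

my $ok = utf8::decode($string);      # back to the single character
my $decoded_len = length $string;

# Map a native code point to its Unicode code point.
my $unicode = utf8::native_to_unicode(ord 'A');

printf "encoded: %d, decoded: %d, ord: 0x%X, A: %d\n",
    $encoded_len, $decoded_len, ord($string), $unicode;
```

Note that utf8::encode and utf8::decode modify their argument in place, which is why the lengths are captured between the calls.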

utf8::encode is like utf8::upgrade, but the UTF8 flag is cleared. See perlunicode, and the C API functions "sv_utf8_upgrade" in perlapi, "sv_utf8_downgrade" in perlapi, "sv_utf8_encode" in perlapi, and "sv_utf8_decode" in perlapi, which are wrapped by the Perl functions utf8::upgrade, utf8::downgrade, utf8::encode and utf8::decode. Also, the functions utf8::is_utf8, utf8::valid, utf8::encode, utf8::decode, utf8::upgrade, and utf8::downgrade are actually internal, and thus always available, without a require utf8 statement.
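The following sketch shows how utf8::upgrade and utf8::downgrade change only the internal representation, leaving the characters of the string untouched. It assumes an ASCII platform; the variable names are illustrative.

```perl
use strict;
use warnings;

my $string = "caf\xe9";                         # 4 characters, native 8-bit

my $flag_before = utf8::is_utf8($string) ? 1 : 0;
my $octets      = utf8::upgrade($string);       # internal form becomes UTF-8
my $flag_after  = utf8::is_utf8($string) ? 1 : 0;
my $chars       = length $string;               # character count is unchanged

my $downgraded  = utf8::downgrade($string);     # every character fits in a byte

my $wide        = "\x{100}";
my $wide_result = utf8::downgrade($wide, 1);    # cannot fit; $fail_ok avoids dying

printf "flag %d -> %d, octets %d, chars %d\n",
    $flag_before, $flag_after, $octets, $chars;
```

After the upgrade the string needs five octets internally but still has four characters, whereas utf8::encode on the same string would have produced a five-character byte string.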

BUGS

Some filesystems may not support UTF-8 file names, or they may be supported incompatibly with Perl. Therefore UTF-8 names that are visible to the filesystem, such as module names, may not work.

perl5 upstream has allowed mixed-script confusables, as described in http://www.unicode.org/reports/tr39/, since 5.16 and is therefore considered insecure.

perl5 upstream has not normalized its Unicode identifiers, as described in http://www.unicode.org/reports/tr15/, since 5.16 and is therefore considered insecure. See http://www.unicode.org/reports/tr36/ for the security risks.
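Why normalization matters can be shown with two strings that render identically but differ as character sequences. This is a minimal sketch using the core Unicode::Normalize module on ordinary strings; it illustrates the underlying issue and does not reproduce how any parser handles identifiers.

```perl
use strict;
use warnings;
use Unicode::Normalize 'NFC';

my $precomposed = "\x{e9}";      # U+00E9, e with acute
my $decomposed  = "e\x{301}";    # e followed by COMBINING ACUTE ACCENT

my $same_raw = $precomposed eq $decomposed           ? 1 : 0;
my $same_nfc = NFC($precomposed) eq NFC($decomposed) ? 1 : 0;

print "equal as written: $same_raw, equal after NFC: $same_nfc\n";
```

The two forms compare unequal as written and equal after normalization, so names built from them would be distinct unless they are normalized first.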

SEE ALSO

perlunitut, perluniintro, perlrun, bytes, perlunicode.

http://www.unicode.org/reports/tr36/#Mixed_Script_Spoofing, http://unicode.org/reports/tr39/#Mixed_Script_Confusables.