Unicode::Japanese(3pm)         User Contributed Perl Documentation         Unicode::Japanese(3pm)

NAME
       Unicode::Japanese - Convert encoding of Japanese text

SYNOPSIS
        use Unicode::Japanese;
        use Unicode::Japanese qw(unijp);

        # convert utf8 -> sjis

        print Unicode::Japanese->new($str)->sjis;
        print unijp($str)->sjis; # same as above.

        # convert sjis -> utf8

        print Unicode::Japanese->new($str,'sjis')->get;

        # convert sjis (imode_EMOJI) -> utf8

        print Unicode::Japanese->new($str,'sjis-imode')->get;

        # convert zenkaku (utf8) -> hankaku (utf8)

        print Unicode::Japanese->new($str)->z2h->get;

DESCRIPTION
       The Unicode::Japanese module converts Japanese text from one encoding to another.

   FEATURES
       o An instance of Unicode::Japanese internally holds a string in UTF-8.

       o This module is implemented in two ways: XS and pure perl. If efficiency is important to
         you, you should build and install the XS module. If you don't want to, or if you can't
         build the XS module, you may use the pure perl module instead. In that case, all you
         have to do is copy Japanese.pm to somewhere in @INC.

       o This module can convert characters from zenkaku (full-width) form to hankaku
         (half-width) form, and vice versa. Conversion between hiragana and katakana (the two
         Japanese phonetic syllabaries) is also supported.

       o This module has mapping tables for emoji (graphic characters) defined by various
         Japanese mobile phone services: DoCoMo i-mode, ASTEL dot-i and J-PHONE J-Sky. These
         characters are mapped to the Unicode Private Use Area, so the Unicode strings the
         module outputs remain valid even if they contain emoji, and you can safely pass them
         to other software that can handle Unicode.

       o This module can map some emoji from one set to another. Different mobile phones define
         different sets of emoji, so mapping between them is not always possible. But since some
         emoji exist in two or more sets with similar appearance, this module considers those
         emoji to be the same.

       o This module uses the mapping table for MS-CP932 instead of the standard Shift_JIS. The
         Shift_JIS encoding used by MS-Windows (MS-SJIS/MS-CP932) slightly differs from the
         standard.

       o When the module converts strings from Unicode to Shift_JIS, EUC-JP or ISO-2022-JP,
         Unicode characters which can't be represented in those encodings are encoded in
         "&#dddd;" form (decimal character reference). Note, however, that characters in the
         Unicode Private Use Area are replaced with a '?' mark ('QUESTION MARK'; U+003F) instead
         of being encoded. In addition, encoding to character sets for mobile phones replaces
         every unrepresentable character with a '?' mark.

       o On perl-5.8.0 or later, this module handles the UTF-8 flag: the method utf8() returns a
         UTF-8 byte string, and the method getu() returns a UTF-8 character string.

         Currently the method get() returns a UTF-8 byte string, but this behavior may change in
         the future.

         Methods such as sjis(), jis() and utf8() return byte strings. The new(), set() and
         getcode() methods ignore the UTF-8 flag of the strings they take.
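
       The fallback rule for unrepresentable characters described above can be sketched in
       plain Perl. This is only an illustration under stated assumptions, not the module's own
       code: the helper names is_private_use() and fallback_for() are hypothetical, and the
       caller is assumed to have already decided that the code point cannot be represented in
       plain Shift_JIS, EUC-JP or ISO-2022-JP.

```perl
use strict;
use warnings;

# True if $cp lies in one of Unicode's three Private Use ranges.
sub is_private_use {
    my ($cp) = @_;
    my $pua = ($cp >= 0xE000   && $cp <= 0xF8FF)
           || ($cp >= 0xF0000  && $cp <= 0xFFFFD)
           || ($cp >= 0x100000 && $cp <= 0x10FFFD);
    return $pua ? 1 : 0;
}

# Hypothetical helper: replacement text for a code point that the target
# encoding cannot represent. Private Use characters become '?', everything
# else becomes a decimal character reference.
sub fallback_for {
    my ($cp) = @_;
    return is_private_use($cp) ? '?' : sprintf('&#%d;', $cp);
}

print fallback_for(0x20AC),  "\n";   # &#8364;
print fallback_for(0xFF800), "\n";   # ?
```

       For the mobile phone encodings, the same sketch would return '?' in both branches.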

REQUIREMENT
       o   perl 5.10.x, 5.8.x, etc. (5.004 and later)

       o   (optional) C compiler.  This module supports both XS and pure perl.  If you have no C
           compiler, Unicode::Japanese will be installed as a pure perl module.

       o   (optional) Test.pm and Test::More for testing.

       No other modules are required at run time.

METHODS
       $s = Unicode::Japanese->new($str [, $icode [, $encode]])
           Create a new instance of Unicode::Japanese.

           Any given parameters will be internally passed to the method "set"().

       $s = unijp($str [, $icode [, $encode]])
           Same as Unicode::Japanese->new(...).

       $s->set($str [, $icode [, $encode]])
           $str: string
           $icode: optional character encoding (default: 'utf8')
           $encode: optional binary encoding (default: no binary encodings are assumed)

           Store a string into the instance.

           Possible character encodings are:

            auto
            utf8 ucs2 ucs4
            utf16-be utf16-le utf16
            utf32-be utf32-le utf32
            sjis cp932 euc euc-jp jis
            sjis-imode sjis-imode1 sjis-imode2
            utf8-imode utf8-imode1 utf8-imode2
            sjis-doti sjis-doti1
            sjis-jsky sjis-jsky1 sjis-jsky2
            jis-jsky  jis-jsky1  jis-jsky2
            utf8-jsky utf8-jsky1 utf8-jsky2
            sjis-au sjis-au1 sjis-au2
            jis-au  jis-au1  jis-au2
            sjis-icon-au sjis-icon-au1 sjis-icon-au2
            euc-icon-au  euc-icon-au1  euc-icon-au2
            jis-icon-au  jis-icon-au1  jis-icon-au2
            utf8-icon-au utf8-icon-au1 utf8-icon-au2
            ascii binary

           (see also "SUPPORTED ENCODINGS".)

           If you want Unicode::Japanese to detect the character encoding of the string, you must
           explicitly specify 'auto' as the second argument. In that case, the given string will
           be passed to the method getcode() to guess the encoding.

           For binary encodings, only 'base64' is currently supported. If you specify 'base64' as
           the third argument, the given string will be decoded using Base64 decoder.

           Specify 'binary' as the second argument if you want your string to be stored without
           modification.

           When you specify 'sjis-imode' or 'sjis-doti' as the character encoding, any occurrences
           of '&#dddd;' (decimal character reference) in the string will be interpreted and
           decoded as code points of emoji, just like emoji embedded in the string in binary
           form.

           Since the encoded forms of strings in various encodings are not clearly distinguishable
           from each other, it is not always possible to detect which encoding is used for a
           given string.

           When a given string can be interpreted as either Shift_JIS or UTF-8, this module
           considers it to be encoded in Shift_JIS. And if the encoding cannot be distinguished
           between 'sjis-au' and 'sjis-doti', this module considers it 'sjis-au'.

       $str = $s->get
           $str: string (UTF-8)

           Get the internal string in UTF-8.

           This method currently returns a byte string (whose UTF-8 flag is turned off), but this
           behavior may be changed in the future.

           If you absolutely want a byte string, you should use the method utf8() instead. And if
           you want a character string (whose UTF-8 flag is turned on), you have to use the
           method getu().

       $str = $s->getu
           $str: string (UTF-8)

           Get the internal string in UTF-8.

           On perl-5.8.0 or later, this method returns a character string with its UTF-8 flag
           turned on.
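
           The difference between a UTF-8 byte string (as returned by utf8()) and a character
           string with the UTF-8 flag on (as returned by getu()) can be seen with perl's
           built-in utf8:: functions alone. The following is a minimal sketch that does not call
           Unicode::Japanese at all; the helper name utf8_bytes_to_chars() is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a UTF-8 byte string into a character string.
# utf8::decode() works in place and returns false if the bytes are not
# valid UTF-8.
sub utf8_bytes_to_chars {
    my ($bytes) = @_;
    utf8::decode($bytes) or die "not valid UTF-8";
    return $bytes;
}

my $bytes = "\xE3\x81\x82";                 # UTF-8 bytes of U+3042
my $chars = utf8_bytes_to_chars($bytes);

printf "bytes: length=%d flag=%s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'on' : 'off';   # length=3 flag=off
printf "chars: length=%d flag=%s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';   # length=1 flag=on
```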

       $code = $s->getcode($str)
           $str: string
           $code: name of character encoding

           Detect the character encoding of given string.

           Note that this method, exceptionally, doesn't operate on the internal string of an
           instance.

           To guess the encoding, the following algorithm is used:

           (For pure perl implementation)

           1.  If the string has a UTF-32 BOM, its encoding is 'utf32'.

           2.  If it has a UTF-16 BOM, its encoding is 'utf16'.

           3.  If it is valid for UTF-32BE, its encoding is 'utf32-be'.

           4.  If it is valid for UTF-32LE, its encoding is 'utf32-le'.

           5.  If it contains no ESC characters or bytes whose eighth bit is on, its encoding is
               'ascii'. All ASCII control characters (0x00-0x1F and 0x7F) except ESC (0x1B) are
               considered to be in the range of 'ascii'.

           6.  If it contains escape sequences of ISO-2022-JP, its encoding is 'jis'.

           7.  If it contains any emoji defined for J-PHONE, its encoding is 'sjis-jsky'.

           8.  If it is valid for EUC-JP, its encoding is 'euc'.

           9.  If it is valid for Shift_JIS, its encoding is 'sjis'.

           10. If it contains any emoji defined for au, and everything else is valid for
               Shift_JIS, its encoding is 'sjis-au'.

           11. If it contains any emoji defined for i-mode, and everything else is valid for
               Shift_JIS, its encoding is 'sjis-imode'.

           12. If it contains any emoji defined for dot-i, and everything else is valid for
               Shift_JIS, its encoding is 'sjis-doti'.

           13. If it is valid for UTF-8, its encoding is 'utf8'.

           14. If no conditions above are fulfilled, its encoding is 'unknown'.
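
           As a rough illustration of how the first checks in the list above fit together, the
           following sketch implements steps 1, 2 and 5 only. It is not the module's code: the
           function name guess_simple() is hypothetical, steps 3 and 4 (BOM-less UTF-32) and all
           Japanese encodings are deliberately left out, and it returns undef where the real
           algorithm would continue with the remaining steps.

```perl
use strict;
use warnings;

# Partial sketch of steps 1, 2 and 5; returns nothing (undef in scalar
# context) when none of them applies.
sub guess_simple {
    my ($bytes) = @_;

    # Step 1: UTF-32 BOM (big-endian or little-endian). This must be tested
    # before UTF-16 because FF FE 00 00 also starts with the UTF-16LE BOM.
    return 'utf32' if $bytes =~ /\A(?:\x00\x00\xFE\xFF|\xFF\xFE\x00\x00)/;

    # Step 2: UTF-16 BOM.
    return 'utf16' if $bytes =~ /\A(?:\xFE\xFF|\xFF\xFE)/;

    # Step 5: no ESC and no byte with the eighth bit set.
    return 'ascii' if $bytes !~ /[\x1B\x80-\xFF]/;

    return;
}

print guess_simple("hello\n")            // 'undecided', "\n";   # ascii
print guess_simple("\xFF\xFE\x00\x00")   // 'undecided', "\n";   # utf32
print guess_simple("\xFE\xFF\x30\x42")   // 'undecided', "\n";   # utf16
print guess_simple("\e\$B")              // 'undecided', "\n";   # undecided
```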

           (For XS implementation)

           1.  If the string has a UTF-32 BOM, its encoding is 'utf32'.

           2.  If it has a UTF-16 BOM, its encoding is 'utf16'.

           3.  Find all possible encodings that might have been applied to the string from the
               following:

               ascii / euc / sjis / jis / utf8 / utf32-be / utf32-le / sjis-jsky / sjis-imode /
               sjis-au / sjis-doti

           4.  If any encodings have been found possible, this module picks out one encoding
               having the highest priority among them. The priority order is as follows:

               utf32-be / utf32-le / ascii / jis / euc / sjis / sjis-jsky / sjis-imode / sjis-au
               / sjis-doti / utf8

           5.  If no conditions above are fulfilled, its encoding is 'unknown'.
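
           Steps 4 and 5 of the XS algorithm amount to picking the first match from a fixed
           priority list. The sketch below shows only that selection step; the function name
           pick_encoding() is hypothetical, and the list of candidate encodings is assumed to
           have been produced elsewhere (step 3).

```perl
use strict;
use warnings;

# Priority order as listed above, highest priority first.
my @PRIORITY = qw(
    utf32-be utf32-le ascii jis euc sjis
    sjis-jsky sjis-imode sjis-au sjis-doti utf8
);

# Hypothetical helper: choose one encoding from a list of candidates.
sub pick_encoding {
    my @candidates = @_;
    my %possible = map { $_ => 1 } @candidates;
    for my $encoding (@PRIORITY) {
        return $encoding if $possible{$encoding};
    }
    return 'unknown';
}

print pick_encoding('utf8', 'sjis'), "\n";   # sjis
print pick_encoding(),               "\n";   # unknown
```

           The first call shows why a string that is valid as both Shift_JIS and UTF-8 is
           reported as 'sjis': 'utf8' has the lowest priority in the list.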

           Pay attention to the following pitfalls in the above algorithm:

           o UTF-8 strings might be accidentally considered to be encoded in Shift_JIS.

           o UCS-2 strings (sequences of raw UCS-2 characters in big-endian; each character
             always takes 2 bytes) can't be detected because they look like nothing but sequences
             of random bytes whose length is an even number.

           o UTF-16 strings must have a BOM to be detected.

           o Emoji are only recognized if they are embedded in the string in binary form.
             If they are written in '&#dddd;' form, they aren't considered to be emoji.

           Since the XS and pure perl implementations use different algorithms to guess the
           encoding, they may guess differently for the same string. In particular, the pure perl
           implementation recognizes Shift_JIS strings containing an ESC character (0x1B) as
           Shift_JIS, but the XS implementation doesn't. This is because such strings can hardly
           be distinguished from 'sjis-jsky'. In addition, the XS implementation also rejects
           EUC-JP strings containing an ESC character for the same reason.

       $code = $s->getcodelist($str)
           $str: string
           $code: names of possible character encodings

           Detect the character encoding of given string.

           Unlike the method getcode(), getcodelist() returns a list of possible encodings.

       $str = $s->conv($ocode, $encode)
           $ocode: character encoding (possible encodings are:)
              utf8 ucs2 ucs4 utf16
              sjis cp932 euc euc-jp jis
              sjis-imode sjis-imode1 sjis-imode2
              utf8-imode utf8-imode1 utf8-imode2
              sjis-doti sjis-doti1
              sjis-jsky sjis-jsky1 sjis-jsky2
              jis-jsky  jis-jsky1  jis-jsky2
              utf8-jsky utf8-jsky1 utf8-jsky2
              sjis-au sjis-au1 sjis-au2
              jis-au  jis-au1  jis-au2
              sjis-icon-au sjis-icon-au1 sjis-icon-au2
              euc-icon-au  euc-icon-au1  euc-icon-au2
              jis-icon-au  jis-icon-au1  jis-icon-au2
              utf8-icon-au utf8-icon-au1 utf8-icon-au2
              binary

             (see also "SUPPORTED ENCODINGS".)

             Some encodings for mobile phones have a trailing digit like 'sjis-au2'. Those digits
             represent the version number of encodings. Such encodings have a variant with no
             trailing digits, like 'sjis-au', which is the same as the latest version among its
             variants.

           $encode: optional binary encoding
           $str: string

           Get the internal string of the instance, encoded in the given character encoding.

           If you want the resulting string to be encoded in Base64, specify 'base64' as the
           second argument.

           On perl-5.8.0 or later, the UTF-8 flag of the resulting string is turned off even if
           you specify 'utf8' as the first argument.

       $s->tag2bin
           Interpret decimal character references (&#dddd;) in the instance, and replace them
           with the single characters they represent.
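
           The idea behind decoding decimal character references can be sketched with a single
           substitution in plain Perl. This is an illustration of the technique only, not the
           module's implementation; the function name decode_decimal_refs() is hypothetical and
           no range checking is done on the number.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each "&#dddd;" with the character whose
# code point is the decimal number dddd.
sub decode_decimal_refs {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

my $decoded = decode_decimal_refs('A&#12354;B');
printf "length=%d second=U+%04X\n",
    length($decoded), ord(substr($decoded, 1, 1));   # length=3 second=U+3042
```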

       $s->z2h
           Replace zenkaku (full-width) letters in the instance with hankaku (half-width)
           letters.

       $s->h2z
           Replace hankaku (half-width) letters in the instance with zenkaku (full-width)
           letters.

       $s->hira2kata
           Replace any hiragana in the instance with katakana.

       $s->kata2hira
           Replace any katakana in the instance with hiragana.
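
           Conversions of this kind are commonly done by shifting code points between Unicode
           blocks that are laid out in parallel. The sketch below uses perl's tr/// operator on
           character strings and covers only the basic hiragana and katakana letters and the
           full-width ASCII range. The function names are hypothetical, and the module's own
           conversion tables may cover more characters (for example half-width katakana) than
           this simplified version.

```perl
use strict;
use warnings;

# Hiragana U+3041..U+3096 and katakana U+30A1..U+30F6 are parallel ranges.
sub hira_to_kata {
    my ($text) = @_;
    $text =~ tr/\x{3041}-\x{3096}/\x{30A1}-\x{30F6}/;
    return $text;
}

sub kata_to_hira {
    my ($text) = @_;
    $text =~ tr/\x{30A1}-\x{30F6}/\x{3041}-\x{3096}/;
    return $text;
}

# Full-width ASCII variants U+FF01..U+FF5E map onto U+0021..U+007E.
sub fullwidth_ascii_to_halfwidth {
    my ($text) = @_;
    $text =~ tr/\x{FF01}-\x{FF5E}/\x{0021}-\x{007E}/;
    return $text;
}

printf "U+%04X\n", ord hira_to_kata("\x{3042}");            # U+30A2
print fullwidth_ascii_to_halfwidth("\x{FF21}\x{FF11}"), "\n";   # A1
```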

       $str = $s->jis
           $str: byte string in ISO-2022-JP

           Get the internal string of the instance, encoded in ISO-2022-JP.

       $str = $s->euc
           $str: byte string in EUC-JP

           Get the internal string of the instance, encoded in EUC-JP.

       $str = $s->utf8
           $str: byte string in UTF-8

           Get the internal UTF-8 string of the instance.

           On perl-5.8.0 or later, the UTF-8 flag of the resulting string is turned off.

       $str = $s->ucs2
           $str: byte string in UCS-2

           Get the internal string of the instance as a sequence of raw UCS-2 characters in
           big-endian. Note that this is different from UTF-16BE, as a raw UCS-2 sequence has no
           concept of surrogate pairs.

       $str = $s->ucs4
           $str: byte string in UCS-4

           Get the internal string of the instance as a sequence of raw UCS-4 characters in
           big-endian. This is practically the same as UTF-32BE.

       $str = $s->utf16
           $str: byte string in UTF-16

           Get the internal string of the instance, encoded in big-endian UTF-16 with no BOM
           prepended.

       $str = $s->sjis
           $str: byte string in Shift_JIS

           Get the internal string of the instance, encoded in Shift_JIS (MS-SJIS / MS-CP932).

       $str = $s->sjis_imode
           $str: byte string in 'sjis-imode'

           Get the internal string of the instance, encoded in 'sjis-imode'.

       $str = $s->sjis_imode1
           $str: byte string in 'sjis-imode1'

           Get the internal string of the instance, encoded in 'sjis-imode1'.

       $str = $s->sjis_imode2
           $str: byte string in 'sjis-imode2'

           Get the internal string of the instance, encoded in 'sjis-imode2'.

       $str = $s->sjis_doti
           $str: byte string in 'sjis-doti'

           Get the internal string of the instance, encoded in 'sjis-doti'.

       $str = $s->sjis_jsky
           $str: byte string in 'sjis-jsky'

           Get the internal string of the instance, encoded in 'sjis-jsky'.

       $str = $s->sjis_jsky1
           $str: byte string in 'sjis-jsky1'

           Get the internal string of the instance, encoded in 'sjis-jsky1'.

       $str = $s->sjis_jsky2
           $str: byte string in 'sjis-jsky2'

           Get the internal string of the instance, encoded in 'sjis-jsky2'.

       $str = $s->sjis_icon_au
           $str: byte string in 'sjis-icon-au'

           Get the internal string of the instance, encoded in 'sjis-icon-au'.

       $str_arrayref = $s->strcut($len)
           $len: maximum length of each chunk (in number of full-width characters)
           $str_arrayref: reference to array of strings

           Split the internal string of the instance into chunks of a given length.

           On perl-5.8.0 or later, the UTF-8 flag of each chunk is turned on.

       $len = $s->strlen
           $len: character width of the internal string

           Calculate the character width of the internal string. Half-width characters have width
           of one unit, and full-width characters have width of two units.
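
           The width rule above can be sketched as follows. This is a simplified illustration,
           not the module's code: the function name display_width() is hypothetical, it expects a
           character string, and it assumes that only ASCII and half-width katakana
           (U+FF61 - U+FF9F) count as half-width, which may differ from the module's own tables.

```perl
use strict;
use warnings;

# Hypothetical helper: one unit per half-width character, two units for
# everything else.
sub display_width {
    my ($chars) = @_;
    my $width = 0;
    for my $ch (split //, $chars) {
        my $cp = ord $ch;
        my $is_half = $cp <= 0x7F || ($cp >= 0xFF61 && $cp <= 0xFF9F);
        $width += $is_half ? 1 : 2;
    }
    return $width;
}

print display_width("abc"), "\n";                  # 3
print display_width("a\x{3042}\x{3044}"), "\n";    # 5
```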

       $s->join_csv(@values);
           @values: array of strings

           Build a line of CSV from the arguments, and store it into the instance. The resulting
           line has a trailing line break ("\n").

       @values = $s->split_csv;
           @values: array of strings

           Parse a line of CSV in the instance and return its columns. The line will be
           chomp()ed before being parsed.

           If the internal string was decoded from 'binary' encoding (see methods new() and
           set()), the UTF-8 flags of the resulting array of strings are turned off. Otherwise
           the flags are turned on.

SUPPORTED ENCODINGS
        +---------------+----+-----+-------+
        |encoding       | in | out | guess |
        +---------------+----+-----+-------+
        |auto           : OK : --  | ----- |
        +---------------+----+-----+-------+
        |utf8           : OK : OK  | OK    |
        |ucs2           : OK : OK  | ----- |
        |ucs4           : OK : OK  | ----- |
        |utf16-be       : OK : --  | ----- |
        |utf16-le       : OK : --  | ----- |
        |utf16          : OK : OK  | OK(#) |
        |utf32-be       : OK : --  | OK    |
        |utf32-le       : OK : --  | OK    |
        |utf32          : OK : --  | OK(#) |
        +---------------+----+-----+-------+
        |sjis           : OK : OK  | OK    |
        |cp932          : OK : OK  | ----- |
        |euc            : OK : OK  | OK    |
        |euc-jp         : OK : OK  | ----- |
        |jis            : OK : OK  | OK    |
        +---------------+----+-----+-------+
        |sjis-imode     : OK : OK  | OK    |
        |sjis-imode1    : OK : OK  | ----- |
        |sjis-imode2    : OK : OK  | ----- |
        |utf8-imode     : OK : OK  | ----- |
        |utf8-imode1    : OK : OK  | ----- |
        |utf8-imode2    : OK : OK  | ----- |
        +---------------+----+-----+-------+
        |sjis-doti      : OK : OK  | OK    |
        |sjis-doti1     : OK : OK  | ----- |
        +---------------+----+-----+-------+
        |sjis-jsky      : OK : OK  | OK    |
        |sjis-jsky1     : OK : OK  | ----- |
        |sjis-jsky2     : OK : OK  | ----- |
        |jis-jsky       : OK : OK  | ----- |
        |jis-jsky1      : OK : OK  | ----- |
        |jis-jsky2      : OK : OK  | ----- |
        |utf8-jsky      : OK : OK  | ----- |
        |utf8-jsky1     : OK : OK  | ----- |
        |utf8-jsky2     : OK : OK  | ----- |
        +---------------+----+-----+-------+
        |sjis-au        : OK : OK  | OK    |
        |sjis-au1       : OK : OK  | ----- |
        |sjis-au2       : OK : OK  | ----- |
        |jis-au         : OK : OK  | ----- |
        |jis-au1        : OK : OK  | ----- |
        |jis-au2        : OK : OK  | ----- |
        |sjis-icon-au   : OK : OK  | ----- |
        |sjis-icon-au1  : OK : OK  | ----- |
        |sjis-icon-au2  : OK : OK  | ----- |
        |euc-icon-au    : OK : OK  | ----- |
        |euc-icon-au1   : OK : OK  | ----- |
        |euc-icon-au2   : OK : OK  | ----- |
        |jis-icon-au    : OK : OK  | ----- |
        |jis-icon-au1   : OK : OK  | ----- |
        |jis-icon-au2   : OK : OK  | ----- |
        |utf8-icon-au   : OK : OK  | ----- |
        |utf8-icon-au1  : OK : OK  | ----- |
        |utf8-icon-au2  : OK : OK  | ----- |
        +---------------+----+-----+-------+
        |ascii          : OK : --  | OK    |
        |binary         : OK : OK  | ----- |
        +---------------+----+-----+-------+
        (#): guessed only when the string has a BOM.

   GUESSING ORDER
        1.  utf32 (#)
        2.  utf16 (#)
        3.  utf32-be
        4.  utf32-le
        5.  ascii
        6.  jis
        7.  sjis-jsky (pp)
        8.  euc
        9.  sjis
        10. sjis-jsky (xs)
        11. sjis-au
        12. sjis-imode
        13. sjis-doti
        14. utf8
        15. unknown

DESCRIPTION OF UNICODE MAPPING
       Transcoding between Unicode encodings and other ones is performed as below:

       Shift_JIS
         This module uses the mapping table of MS-CP932.

         <ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT>

         When the module converts a Unicode string to Shift_JIS, it represents most characters
         which aren't available in Shift_JIS as decimal character references ('&#dddd;'). There
         is one exception to this: all graphic characters for mobile phones are replaced with a
         '?' mark.

         For variants of Shift_JIS defined for mobile phones, every unrepresentable character
         is replaced with a '?' mark, unlike plain Shift_JIS.

       EUC-JP/ISO-2022-JP
         This module doesn't convert Unicode strings directly to or from EUC-JP or ISO-2022-JP:
         it first converts to or from Shift_JIS and then performs the remaining translation. So
         characters which aren't available in Shift_JIS cannot be properly translated.

       DoCoMo i-mode
         This module maps emoji in the range of F800 - F9FF to U+0FF800 - U+0FF9FF.

       ASTEL dot-i
         This module maps emoji in the range of F000 - F4FF to U+0FF000 - U+0FF4FF.

       J-PHONE J-SKY
         The encoding method defined by J-SKY is as follows: first an escape sequence "\e\$"
         indicates the beginning of emoji, then comes the first byte of an emoji, then the
         second bytes of one or more emoji, and finally "\x0f" indicates the end of emoji. If a
         string contains a series of emoji whose first bytes are identical, such a sequence can
         be compressed by appending their second bytes to the single first byte.

         This module considers a pair of those first and second bytes to be one character, and
         maps them from 4500 - 47FF to U+0FFB00 - U+0FFDFF.

         When the module encodes J-SKY emoji, it performs the compression automatically.

       AU
         This module maps AU emoji to U+0FF500 - U+0FF6FF.
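
       The range mappings and the J-SKY compression described above can be expressed as simple
       arithmetic. The sketch below is derived only from the ranges and the byte layout given in
       this section; the function names are hypothetical, and the module itself may use lookup
       tables rather than fixed offsets.

```perl
use strict;
use warnings;

# i-mode F800 - F9FF and dot-i F000 - F4FF both map to U+0FFxxx,
# i.e. the original code plus 0xF0000.
sub imode_to_unicode {
    my ($code) = @_;
    die "not an i-mode emoji code" unless $code >= 0xF800 && $code <= 0xF9FF;
    return $code + 0xF0000;
}

sub doti_to_unicode {
    my ($code) = @_;
    die "not a dot-i emoji code" unless $code >= 0xF000 && $code <= 0xF4FF;
    return $code + 0xF0000;
}

# J-SKY 4500 - 47FF maps to U+0FFB00 - U+0FFDFF (offset 0xFB600).
sub jsky_to_unicode {
    my ($code) = @_;
    die "not a J-SKY emoji code" unless $code >= 0x4500 && $code <= 0x47FF;
    return $code + 0xFB600;
}

# Encode a list of two-byte J-SKY codes, sharing one "\e\$" + first byte
# prefix among consecutive emoji whose first bytes are identical.
sub jsky_encode {
    my @codes = @_;
    my $out = '';
    my $open;                       # first byte of the run currently open
    for my $code (@codes) {
        my ($first, $second) = ($code >> 8, $code & 0xFF);
        if (!defined $open || $open != $first) {
            $out .= "\x0f" if defined $open;     # close the previous run
            $out .= "\e\$" . chr($first);        # open a new run
            $open = $first;
        }
        $out .= chr($second);
    }
    $out .= "\x0f" if defined $open;
    return $out;
}

printf "U+%06X\n", imode_to_unicode(0xF89F);   # U+0FF89F
printf "U+%06X\n", jsky_to_unicode(0x4521);    # U+0FFB21
print length(jsky_encode(0x4521, 0x4522, 0x4621)), "\n";   # 11
```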

PurePerl mode
          use Unicode::Japanese qw(PurePerl);

       If you want to explicitly use the pure perl implementation, pass 'PurePerl' as an
       argument to the "use" statement.

BUGS
       Please report bugs and requests to "bug-unicode-japanese at rt.cpan.org" or
       <http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Unicode-Japanese>. If you report them
       through the web interface, any progress on your report will be automatically sent back to
       you.

       o This module doesn't convert Unicode strings directly to or from EUC-JP or ISO-2022-JP:
         it first converts to or from Shift_JIS and then performs the remaining translation. So
         characters which aren't available in Shift_JIS cannot be properly translated.

       o The XS implementation of getcode() fails to detect the encoding when the given string
         contains \e while its encoding is EUC-JP or Shift_JIS.

       o Japanese.pm is composed of a textual perl script and a binary character conversion
         table. If you transfer it over FTP in ASCII mode, the file will be corrupted.

SUPPORT
       You can find documentation for this module with the perldoc command.

           perldoc Unicode::Japanese

       You can find more information at:

       o   AnnoCPAN: Annotated CPAN documentation

           <http://annocpan.org/dist/Unicode-Japanese>

       o   CPAN Ratings

           <http://cpanratings.perl.org/d/Unicode-Japanese>

       o   RT: CPAN's request tracker

           <http://rt.cpan.org/NoAuth/Bugs.html?Dist=Unicode-Japanese>

       o   Search CPAN

           <http://search.cpan.org/dist/Unicode-Japanese>

CREDITS
       Thanks very much to:

       NAKAYAMA Nao

       SUGIURA Tatsuki & Debian JP Project

COPYRIGHT & LICENSE
       Copyright 2001-2008 SANO Taku (SAWATARI Mikage) and YAMASHINA Hio, all rights reserved.

       This program is free software; you can redistribute it and/or modify it under the same
       terms as Perl itself.

perl v5.34.0                                2022-02-06                     Unicode::Japanese(3pm)
