PHP's mbstring extension provides a large collection of functions for handling multibyte strings such as Japanese text. Two of the most useful are for checking whether a string is valid in a given encoding, and for converting a string from one encoding to another.
To check if $string is in UTF-8 encoding, we call mb_check_encoding() like this:
if (mb_check_encoding($string, "UTF-8")) {
    // do_something();
}
To convert $string, which is currently Shift-JIS, to UTF-8, we call mb_convert_encoding() like this:
$convertedString = mb_convert_encoding($string, "UTF-8", "Shift-JIS");
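To make the conversion concrete, here is a minimal round-trip sketch. The sample text and variable names are purely illustrative (they are not from the code above), and it uses the "SJIS" name for Shift-JIS. The sample string is written with \u{...} escapes so that the encoding of the script file itself does not matter.

```php
<?php
// Illustrative sample text: the three kanji for "nihongo" (Japanese language).
$utf8 = "\u{65E5}\u{672C}\u{8A9E}";

// Convert UTF-8 to Shift-JIS, then back again.
$sjis      = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');
$roundTrip = mb_convert_encoding($sjis, 'UTF-8', 'SJIS');

// Same characters, different byte counts: 3 bytes each in UTF-8, 2 bytes each in Shift-JIS.
var_dump(strlen($utf8));        // int(9)
var_dump(strlen($sjis));        // int(6)
var_dump($roundTrip === $utf8); // bool(true)
```

The byte counts differ because strlen() counts bytes, not characters, which is exactly why the multibyte functions are needed.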
A convenient feature of mb_convert_encoding() is that the third argument can be a comma-separated list of candidate source encodings instead of a single one, in which case the function tries to work out which of them the string is actually in before converting. This can come in very handy if you want to convert all Japanese multibyte string encodings to UTF-8, or something else. There are actually 18 Japanese-specific multibyte encodings (that I know of), not including all the Unicode variants like UTF-8, UTF-16, etc. A lot of them come from the Japanese mobile phone carriers.
Let’s put all of this together and check if a string is UTF-8, and if it’s not, meaning it is one of the other 18 Japanese encoding types, let’s convert it to UTF-8.
if (!mb_check_encoding($string, "UTF-8")) {
    $string = mb_convert_encoding($string, "UTF-8", "Shift-JIS, EUC-JP, JIS, SJIS, JIS-ms, eucJP-win, SJIS-win, ISO-2022-JP, ISO-2022-JP-MS, SJIS-mac, SJIS-Mobile#DOCOMO, SJIS-Mobile#KDDI, SJIS-Mobile#SOFTBANK, UTF-8-Mobile#DOCOMO, UTF-8-Mobile#KDDI-A, UTF-8-Mobile#KDDI-B, UTF-8-Mobile#SOFTBANK, ISO-2022-JP-MOBILE#KDDI");
}
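If you do this in more than one place, the check-then-convert pattern can be wrapped in a small helper. The following is only a sketch under a few assumptions: the function name toUtf8() and the short default candidate list are made up for illustration, and it calls mb_detect_encoding() in strict mode as a separate step so that a string matching none of the candidates raises an error instead of being converted blindly.

```php
<?php
/**
 * Hypothetical helper (not part of mbstring): return $input as UTF-8.
 *
 * $candidates is the list of encodings to try when $input is not already
 * valid UTF-8. The default list here is an assumption; extend it as needed.
 */
function toUtf8(string $input, array $candidates = ['ISO-2022-JP', 'EUC-JP', 'SJIS']): string
{
    // Already valid UTF-8: nothing to do.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }

    // Strict detection returns false if no candidate fits the bytes.
    $from = mb_detect_encoding($input, $candidates, true);
    if ($from === false) {
        throw new InvalidArgumentException('Input does not match any candidate encoding.');
    }

    return mb_convert_encoding($input, 'UTF-8', $from);
}
```

Usage would look like $string = toUtf8($string); for the default list, or toUtf8($string, ['SJIS-win', 'eucJP-win']) to supply your own. Keep in mind that detection is a best guess: short strings can be valid in several encodings at once, so the order and size of the candidate list matter.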
I’m having some trouble with this.
When I run the following code, it does not return the actual character encoding. How did you get this to work properly?
Oops, it didn’t print the code. Here it is:
--------
setlocale(LC_ALL, "ja_JP.utf8");
echo "Enter website URL: ";
$string = trim(fgets(STDIN));
if (!($handle = file_get_contents($string))) die("Cannot read or access the specified URL.");
echo "Encoding used: " . mb_detect_encoding($string, "Shift-JIS, EUC-JP, JIS, SJIS, JIS-ms, eucJP-win, SJIS-win, ISO-2022-JP, ISO-2022-JP-MS, SJIS-mac, SJIS-Mobile#DOCOMO, SJIS-Mobile#KDDI, SJIS-Mobile#SOFTBANK, UTF-8-Mobile#DOCOMO, UTF-8-Mobile#KDDI-A, UTF-8-Mobile#KDDI-B, UTF-8-Mobile#SOFTBANK, ISO-2022-JP-MOBILE#KDDI") . "\n";
--------
Got it!;)
I’ve extended your code a bit to take concatenations of kanji from Asahi news’ articles. Here’s the article and code: http://asia-gazette.com/news/japan/109