Three small tips about dealing with character sets

Character sets in HTTP, PHP and MySQL come with many oddities, and handling them correctly can get quite complex. Here are three small tips about working with character sets in such settings. I assume you use PHP's multibyte string extension (the mb_* functions) and that you are using UTF-8 as your output character set.

Overloading functions

It may seem convenient to use the mbstring.func_overload configuration setting to overload the typical string functions such as strlen() and substr(). However, remember that you cannot set this in your PHP script by means of ini_set(); it has to be set in php.ini or the server configuration. This is documented behavior, but you will not receive a warning when you try to do this, and it may seem as if everything is working okay because you usually test with ASCII characters, which are single-byte characters in UTF-8. Note also that the setting was deprecated in PHP 7.2 and removed in PHP 8.0, so calling the mb_* functions directly is the more durable choice.
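The silent failure can be made visible by checking the return value of ini_set(), which is false when a setting cannot be changed at runtime. The following is a minimal sketch; the variable names are only illustrative, and the byte count in the comment assumes overloading is not enabled in php.ini:

```php
<?php
// ini_set() returns false when the setting cannot be changed from a script
// (on PHP 8 the setting no longer exists, which also yields false).
$result = ini_set('mbstring.func_overload', '2');
var_dump($result);

// "café" encoded as UTF-8: the é takes two bytes.
$word = "caf\xC3\xA9";

$bytes = strlen($word);              // counts bytes when not overloaded
$chars = mb_strlen($word, 'UTF-8');  // counts characters

printf("bytes: %d, characters: %d\n", $bytes, $chars);
```

With a pure ASCII test string both counts would be identical, which is exactly why the problem tends to go unnoticed.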

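If you often use constructs such as if (strlen($string) > 0) in order to test if a string is non-empty, remember that this is slower with the string functions overloaded by the multibyte extension, especially when the string is non-empty. Just compare against the empty string: if ($string !== ''). Do not use if (!empty($string)) as a drop-in replacement, because empty() also treats the string "0" as empty.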
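One caveat when replacing strlen() checks: empty() considers the string "0" to be empty, so it is not equivalent to a length test. The sketch below compares a strict comparison against the empty string with !empty() on a few sample values; is_non_empty() is a hypothetical helper name used only for illustration:

```php
<?php
// is_non_empty() is a hypothetical helper, not a PHP built-in.
function is_non_empty($string)
{
    // A plain comparison: no string function is called, so overloading
    // cannot slow it down or change its meaning.
    return $string !== '';
}

foreach (array('', '0', 'abc') as $sample) {
    printf(
        "%-6s strict: %-6s !empty: %s\n",
        var_export($sample, true),
        var_export(is_non_empty($sample), true),
        var_export(!empty($sample), true)
    );
}
```

The two tests agree for '' and 'abc' but disagree for '0', which matters for input such as quantities or form fields.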

The difference between mb_convert_encoding() and iconv()

PHP offers two functions, from two different extensions, that convert a text string from one character set to another: mb_convert_encoding() and iconv(). I strongly prefer iconv(), because of the additional possibilities this function offers over its counterpart from the multibyte extension.

The function iconv() enables you to ignore or transliterate characters that are present in the source string but cannot be represented in the target character set. So, if you use transliteration when you are converting a string such as "€ 22.95" from UTF-8 to ISO-8859-1, you will get something like "EUR 22.95" (the exact result depends on the iconv implementation on your system). In order to use this, append the //TRANSLIT flag to the target encoding. For example: $string = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $string);. Remember that if you specify neither the //IGNORE nor the //TRANSLIT flag and iconv() encounters a character that cannot be represented in the target encoding or is invalid in the source encoding, iconv() raises an E_NOTICE and returns false, so you lose the whole string.
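Because a failed conversion costs you the whole string, it can help to wrap the call and check the result explicitly. This is a minimal sketch under the assumption that the input is UTF-8; to_latin1() is a hypothetical helper name, and what //TRANSLIT produces for the euro sign varies between iconv implementations:

```php
<?php
// to_latin1() is a hypothetical helper, not part of PHP.
// It returns the converted string, or null when the conversion fails.
function to_latin1($utf8)
{
    // @ silences the E_NOTICE raised on failure; the return value
    // is checked explicitly instead.
    $converted = @iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $utf8);

    return $converted === false ? null : $converted;
}

// "€ 22.95" in UTF-8: the euro sign does not exist in ISO-8859-1,
// so it has to be transliterated.
$price = to_latin1("\xE2\x82\xAC 22.95");
var_dump($price);

// "café": the é exists in ISO-8859-1 and converts to the single byte 0xE9.
$word = to_latin1("caf\xC3\xA9");
var_dump(bin2hex($word));
```

Returning null instead of false is only a design choice for this sketch; the point is that the caller can tell a failed conversion apart from a legitimately empty string.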

HTML, XML and UTF-8

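In HTML and XML, entities are required only for characters that cannot be represented in the character set the document is in. Thus, in order to represent a € you have to use &euro; in ISO-8859-1 documents, but in UTF-8 you can use the € character directly. No entities required. For XML documents this is mandatory: XML predefines only &lt;, &gt;, &amp;, &quot; and &apos;, so a document that uses HTML's named entities will not parse unless its DTD declares them. Be aware that htmlentities() converts every character that has a named entity, even when you pass UTF-8 as the character set. Thus, escape with htmlspecialchars() instead and always pass the character set, like this: echo htmlspecialchars($string, ENT_QUOTES, mb_internal_encoding());. This escapes only the markup characters and keeps your XML document valid.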
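The difference between the two escaping functions is easy to see on UTF-8 input. The sketch below is only an illustration with a made-up sample string; it passes 'UTF-8' explicitly rather than relying on mb_internal_encoding():

```php
<?php
// "€ 22.95 & up" encoded as UTF-8.
$text = "\xE2\x82\xAC 22.95 & up";

// htmlentities() replaces every character that has a named HTML entity,
// including the euro sign, even though UTF-8 can represent it directly.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

// htmlspecialchars() only escapes the markup characters & < > " '.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

echo $entities, "\n"; // &euro; 22.95 &amp; up
echo $special, "\n";  // the euro sign is kept, only & becomes &amp;
```

The first result contains &euro;, which an XML parser rejects unless the entity is declared; the second contains only &amp;, which XML predefines.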

Creating images with text

One extra tip: the function imagettftext() expects the $text parameter to be encoded in UTF-8. Convert your text to UTF-8 (if it isn’t already) in order to use non-ASCII characters. You can use the iconv() function as mentioned above.
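A rough sketch of that conversion step follows. ensure_utf8() is a hypothetical helper, the ISO-8859-1 source encoding is an assumption you should replace with whatever your data actually uses, and the font path and output file name are placeholders:

```php
<?php
// ensure_utf8() is a hypothetical helper, not part of PHP.
// $assumedSource is a guess about the input encoding (an assumption).
function ensure_utf8($text, $assumedSource = 'ISO-8859-1')
{
    // Leave the text alone when it is already valid UTF-8.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    return iconv($assumedSource, 'UTF-8', $text);
}

$label = ensure_utf8("caf\xE9"); // "café" in ISO-8859-1

$font = '/path/to/font.ttf'; // placeholder path, adjust to a real font

// Only draw when GD with FreeType support and the font are available.
if (function_exists('imagettftext') && is_file($font)) {
    $image = imagecreatetruecolor(300, 60);
    $white = imagecolorallocate($image, 255, 255, 255);
    imagettftext($image, 20, 0, 10, 40, $white, $font, $label);
    imagepng($image, 'label.png');
}
```

Checking for valid UTF-8 first avoids double-encoding text that was already converted, although short ISO-8859-1 strings can occasionally pass that check by accident.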

Feel free to post any questions in the comments.

