Converting Web Applications to UTF-8
An Overview of UTF-8 in PHP, Smarty, Oracle, and Apache, with data exports to PDF, RTF, email, and text
Here at the Penn Med School we recently switched our web and database applications from Western/ISO encoding to Unicode/UTF-8. We did this so we can provide better support for international character sets (Greek, Japanese, etc.). As sometimes happens with projects that involve computers, it grew into a big, hairy beast that was way beyond anything we initially anticipated. I was partly responsible for managing the transition, and since I found no comprehensive guide to help us through it, I thought I’d write one now that we’re done. We’re using two-thirds of the open source PHP-Apache-MySQL trinity, with Oracle instead of MySQL. If you have a different mix of applications, the concepts I’ll describe probably still apply to your situation, though the specifics will differ.
First, if you need some orientation in understanding character sets, start with The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). It’s actually quite readable, even if you’re not a techie.
Second, you need to read the Oracle document An Overview on Globalizing Oracle PHP Applications. It’s an excellent starting point, but unfortunately it doesn’t always explain the reasons behind its recommendations, which means you’ll get stuck if things don’t happen to work after you follow their instructions. I’ll try to fill those gaps here.
Persuading Apache and Oracle to talk to each other in UTF-8
PHP web applications run under the Apache web server, which itself runs in a user account (assuming you’re in a Unix environment). So the first step is to set the environment of that account correctly, so it will know how to “speak” UTF-8 to Oracle. You do this by setting the NLS_LANG environment variable in the Apache configuration. The Oracle Overview document says to set it to .AL32UTF8, but doesn’t explain why. So when this didn’t do the trick for me, I had to do some more research. I read the Oracle Character Set descriptions, and found that .AL32UTF8 corresponds to Unicode 3.1. After talking with our DBA I learned that our Oracle database was set to Unicode 3.0, which meant I needed to set NLS_LANG=.UTF8 instead (we ultimately switched to .AL32UTF8, since it is Oracle’s recommended standard). The key point here is that NLS_LANG must exactly match the character set you’re using in Oracle.
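To make the matching rule concrete, here is a minimal PHP sketch that reports which character set the web server account is actually passing to Oracle. The helper name nls_lang_charset() and the expected value AL32UTF8 are assumptions for illustration only; substitute whatever your DBA reports for your own database.

```php
<?php
// Hypothetical helper (not part of PHP or OCI8): returns the character set
// portion of an NLS_LANG value such as "AMERICAN_AMERICA.AL32UTF8" or
// ".UTF8", or null when no character set is present.
function nls_lang_charset(string $nlsLang): ?string
{
    $dot = strrpos($nlsLang, '.');
    if ($dot === false || $dot === strlen($nlsLang) - 1) {
        return null;
    }
    return strtoupper(substr($nlsLang, $dot + 1));
}

// Assumption: the database character set is AL32UTF8.
$expected = 'AL32UTF8';

// getenv() returns false when the variable is unset; the cast turns that into ''.
$actual = nls_lang_charset((string) getenv('NLS_LANG'));

if ($actual !== $expected) {
    echo 'NLS_LANG charset is ' . ($actual ?? 'not set') . ", expected {$expected}\n";
}
```

Run from a page served by Apache, a check like this shows the environment PHP really sees, which is not necessarily the one in your login shell.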
Serving your web pages to users in UTF-8
There are a few different aspects to this:
1. If you want all the documents on your server to default to UTF-8, set the AddDefaultCharset directive in the Apache configuration to UTF-8. You should do either #2 or #3 below in addition to this (see the Apache documentation for the reason).
2. If you want all your PHP documents served in UTF-8, but not necessarily other document types, set default_charset=UTF-8 in your php.ini file. It’s OK if the PHP charset is different from the Apache charset: the PHP charset will apply to PHP files, and the Apache charset will apply to all other types (this goes for #3 below as well).
3. If you only want certain PHP documents in UTF-8, specify UTF-8 in the Content-type header of those documents. It’s important to point out here that, if you haven’t done #1 or #2 above, then you must set this header with the PHP header() function. If you try to set it with an HTML Meta tag, the charset defined in Apache will override your Meta tag.
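As a rough illustration of the per-document approach, the following PHP sketch sends the Content-Type header for a single page. The helper utf8_content_type() is a hypothetical name introduced here, not a built-in; the only real API used is PHP’s header() and headers_sent().

```php
<?php
// Hypothetical helper: builds the header line for a given MIME type.
function utf8_content_type(string $mime = 'text/html'): string
{
    return "Content-Type: {$mime}; charset=UTF-8";
}

// header() only works before any output has been sent to the browser.
if (!headers_sent()) {
    header(utf8_content_type());
}

// Greek and Japanese sample text, written with Unicode escapes.
echo "<p>\u{0395}\u{03BB}\u{03BB}\u{03B7}\u{03BD}\u{03B9}\u{03BA}\u{03AC}, \u{65E5}\u{672C}\u{8A9E}</p>\n";
```

The call has to come before any HTML, whitespace, or echo output in the script, otherwise PHP has already sent its headers and the charset cannot be changed.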
UTF-8 in form submissions
In Windows 95 and 98, Microsoft used the Windows ANSI character set. If you ever copied and pasted text from Microsoft Word into a web form under Windows 9x, chances are some upper ASCII characters, such as curly quotes, turned into something else in the web form. This is because the web page was probably Western ISO 8859-1 encoded, and that character set organizes part of the upper ASCII range differently from Windows ANSI. So the web page thought it was receiving a different character than what you intended. Windows NT, 2000, and XP use Unicode, so you won’t have this problem under the newer versions of Windows. Macs and most other modern OSs use either Western ISO 8859-1 or Unicode. The 256 characters of Western ISO 8859-1 occupy the same positions as the first 256 characters of Unicode. So your Unicode encoded web form should correctly interpret upper ASCII text provided by anyone not using Windows 9x (or a completely foreign, non-Unicode character set).
Additional PHP and Oracle configurations
You will want to enable multi-byte character support in PHP. Compile PHP with the --enable-mbstring option, and set mbstring.internal_encoding=UTF-8 in your php.ini file. Also, you should definitely look over the PHP documentation for multi-byte string functions. Note that if you haven’t upgraded to PHP 5 yet, the html_entity_decode() function will fail hard if you pass it a UTF-8 string. This was the only UTF-8 incompatibility we found in PHP 4.3.
You may want to implement PHP’s function overloading (the mbstring.func_overload setting). An example will illustrate why this is important: in UTF-8, a string that is 4 characters long could occupy anywhere from 4 to 16 bytes, depending on the multi-byte characters in it. The mb_strlen() function will correctly tell you the number of characters in such a string, but the regular strlen() function won’t (it’ll tell you the number of bytes). Enabling function overloading will cause PHP to automatically assume it’s handling multi-byte strings, so, in this example, it will execute mb_strlen() when you call strlen(). If you’re making a wholesale conversion to UTF-8, and you don’t want to tweak all your existing code, implementing function overloading makes sense. But there is one exception: you may not want to do function overloading on mail(). I’ll get to that in a minute.
Related to this, in Oracle 9, you can set NLS_LENGTH_SEMANTICS to use either character length or byte length semantics for the tables you create. That is, you can use it to indicate whether, for example, a varchar(10) column is 10 characters, or 10 bytes.
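The character-versus-byte distinction described above is easy to see in a short PHP sketch. It assumes the mbstring extension is loaded and that function overloading is off, so strlen() still counts bytes. The sample strings are arbitrary illustrations.

```php
<?php
// Four characters of increasing width in UTF-8:
// "a" (1 byte), e-acute (2 bytes), a kanji (3 bytes), an emoji (4 bytes).
$mixed = "a\u{00E9}\u{65E5}\u{1F600}";

echo 'bytes: ' . strlen($mixed) . "\n";                 // bytes: 10
echo 'characters: ' . mb_strlen($mixed, 'UTF-8') . "\n"; // characters: 4

// Five Japanese characters, three bytes each.
$japanese = "\u{65E5}\u{672C}\u{8A9E}\u{3067}\u{3059}";

// The same value passes a 10-character limit but fails a 10-byte limit,
// which is the difference between character and byte length semantics.
$fitsAsCharacters = mb_strlen($japanese, 'UTF-8') <= 10; // true (5 characters)
$fitsAsBytes      = strlen($japanese) <= 10;             // false (15 bytes)

var_dump($fitsAsCharacters, $fitsAsBytes);
```

If your columns use byte semantics, validating input with mb_strlen() alone can let through strings that the database will then reject.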
If you’re using Smarty with PHP, you’ll need to override the escape() function. It calls the PHP htmlentities() and htmlspecialchars() functions, but it doesn’t provide them with the necessary charset argument so they’ll work with UTF-8. Make a copy of the escape() modifier and tweak it to pass along a charset argument to PHP, and then use it to override the original.
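A replacement modifier along those lines might look like the sketch below. The function name smarty_modifier_escape_utf8() and its parameter list are assumptions for illustration, not Smarty’s own code, and only two escape types are handled; a real override would need to carry over every type your templates use. How you register it depends on your Smarty version.

```php
<?php
// Hypothetical UTF-8-aware stand-in for Smarty's escape modifier.
// Only the 'html' and 'htmlall' types are sketched here.
function smarty_modifier_escape_utf8($string, $esc_type = 'html', $charset = 'UTF-8')
{
    switch ($esc_type) {
        case 'html':
            // Escapes only the HTML special characters; the charset argument
            // tells PHP to treat the input as UTF-8.
            return htmlspecialchars($string, ENT_QUOTES, $charset);

        case 'htmlall':
            // Converts every character that has a named entity.
            return htmlentities($string, ENT_QUOTES, $charset);

        default:
            // Unknown types are passed through untouched in this sketch.
            return $string;
    }
}

echo smarty_modifier_escape_utf8("<b>caf\u{00E9}</b>") . "\n";
echo smarty_modifier_escape_utf8("caf\u{00E9}", 'htmlall') . "\n";
```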
Exporting to other formats
As you’ll see below, it may not always be wise to do data exports in UTF-8. Sometimes you need to change the character set before performing the export. Take a look at PHP’s utf8_decode() and iconv() functions to learn about converting UTF-8 to a single-byte encoding. Note that utf8_decode(), while easy to use, only converts to ISO 8859-1 (Latin-1); see the user contributed notes on the PHP utf8_decode() page for tips on dealing with other character sets.
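For a sense of what such a conversion does to the data, here is a small PHP sketch using iconv(). The wrapper name utf8_to_latin1() is a hypothetical helper introduced for this example, and it assumes the iconv extension is available.

```php
<?php
// Hypothetical wrapper: converts a UTF-8 string to ISO 8859-1 before export.
// iconv() returns false if the input contains a character that has no
// ISO 8859-1 equivalent, so callers should check the result.
function utf8_to_latin1(string $utf8)
{
    return iconv('UTF-8', 'ISO-8859-1', $utf8);
}

$utf8   = "caf\u{00E9}";          // e-acute takes two bytes in UTF-8
$latin1 = utf8_to_latin1($utf8);  // and one byte in ISO 8859-1

echo 'UTF-8 bytes: ' . strlen($utf8) . "\n";        // UTF-8 bytes: 5
echo 'ISO 8859-1 bytes: ' . strlen($latin1) . "\n"; // ISO 8859-1 bytes: 4
```

Checking for a false return value before writing the export is a cheap way to notice when non-Latin characters have started showing up in your data.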
- PDF: we use PDFlib on our web server to create PDF documents on the fly. For it to work with UTF-8 data, you need to use it with a UTF-8 compatible font. The standard Arial font supports Greek and Cyrillic in UTF-8, which is generally sufficient (don’t confuse standard Arial with Microsoft’s Arial Unicode MS font – while it can print just about any UTF-8 character, it’s 32MB, so you probably don’t want to load it on your web server!). Also, Gentium is a very nice UTF-8 compatible serif font that supports Greek and Cyrillic.
- RTF: we are moving away from RTF, but we still have some applications that generate RTF files. RTF does not provide good UTF-8 support. Our solution is to do a utf8_decode() on our data before generating RTF files (we can get away with this since none of the data going into our RTF files contain non-Latin characters – hopefully we’ll get rid of RTF before non-Latin characters start showing up).
- Text: we also do data exports to text files, mainly in .csv format for use in spreadsheets. Surprisingly, Microsoft Excel does not support importing UTF-8 encoded text files. Again, our solution is to perform a utf8_decode() before generating these text files.
- Email: I recommend not doing function overloading on PHP mail(). The reason has to do with line breaks. In Unix, a line break is represented by a line feed (LF) character. On Macs, it’s represented by a carriage return (CR) character. And on Windows, by a CR+LF. For email to work between platforms, an email standard was agreed upon in the early days of the Internet, which is CR+LF. So, for example, on Unix, sendmail will add a CR as needed to each LF it finds in the body of an email message. But when an email is UTF-8, mailers don’t try to wade through the multi-byte encoding, and they don’t “fix” the line breaks. We found that the line breaks in UTF-8 emails (generated on Unix) were interpreted as desired in Mac and Unix mail readers, and by Microsoft Outlook on Windows, but not by Eudora 6.2 (and previous versions) on Windows. In Eudora, the messages displayed with no line breaks. You can’t say it’s a Eudora bug, since the line breaks weren’t meeting the standard. At this time, the emails we generate only contain basic Latin characters, so sticking with the standard mail() function meets our needs for now.
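If you do end up sending UTF-8 email, one possible way to meet the CR+LF convention yourself is to normalize the line breaks before handing the message to the mailer. This is only a sketch: normalize_crlf() is a hypothetical helper, and some mail transports rewrite line endings on their own, so test the result against your own setup before relying on it.

```php
<?php
// Hypothetical helper: rewrites every line break (CR+LF, lone CR, or lone LF)
// as CR+LF. The CR+LF alternative is listed first so that an existing
// CR+LF pair is matched as a unit and not doubled.
function normalize_crlf(string $text): string
{
    return preg_replace('/\r\n|\r|\n/', "\r\n", $text);
}

$body = "First line\nSecond line\r\nThird line\rFourth line";

// Show the normalized body with the control characters made visible.
echo addcslashes(normalize_crlf($body), "\r\n") . "\n";
```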