Converting Web Applications to UTF-8

UPDATE: I expanded this to a full length article, which was published in the May 2005 issue of php|architect. They had it available for several years as a free download, but it’s no longer available there, so you can download it from me as a PDF. My apologies for not responding to earlier comments – I had a newborn baby at the time.

An Overview of UTF-8 in PHP, Smarty, Oracle, and Apache, with data exports to PDF, RTF, email, and text

Here at the Penn Med School we recently switched our web and database applications from Western/ISO encoding to Unicode/UTF-8. We did this so we can provide better support for international character sets (Greek, Japanese, etc.). As sometimes happens with projects that involve computers, it grew into a big, hairy beast that was way beyond anything we initially anticipated. I was partly responsible for managing the transition, and since I found no comprehensive guide to help us through it, I thought I’d write one now that we’re done. We’re using two-thirds of the open source PHP-Apache-MySQL trinity, with Oracle instead of MySQL. If you have a different mix of applications, the concepts I’ll describe are probably still applicable to your situation, even if the specifics are different.

Getting Started

First, if you need some orientation in understanding character sets, start with The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). It’s actually quite readable, even if you’re not a techie.

Second, you need to read the Oracle document An Overview on Globalizing Oracle PHP Applications. It’s an excellent starting point, but unfortunately it doesn’t always explain the reasons behind its recommendations, which means you’ll get stuck if things don’t happen to work after you follow their instructions. I’ll try to fill those gaps here.

Persuading Apache and Oracle to talk to each other in UTF-8

PHP web applications run under the Apache web server, which itself runs in a user account (assuming you’re in a Unix environment). So the first step is to set the environment of that account correctly, so that the Oracle client used by PHP will know how to “speak” UTF-8 to Oracle. You do this by setting the NLS_LANG environment variable, either directly in the account Apache runs under or by exporting it from the Apache configuration. The Oracle Overview document says to set it to .AL32UTF8, but doesn’t explain why. So when this didn’t do the trick for me, I had to do some more research. I found the Oracle Character Set descriptions, and found that .AL32UTF8 corresponds to Unicode 3.1. After talking with our DBA I learned that our Oracle database is set to Unicode 3.0, which meant I needed to set NLS_LANG=.UTF8 (we ultimately switched to .AL32UTF8, since it is Oracle’s recommended standard). The key point here is that NLS_LANG must exactly match the character set you’re using in Oracle.
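
One way to confirm that the setting actually reached PHP is to read it back at runtime. The following is a minimal sketch, not part of our setup: nls_lang_charset() is a hypothetical helper, and AL32UTF8 is only an example of the value you might expect. It relies on the NLS_LANG format LANGUAGE_TERRITORY.CHARSET, where the language and territory may be left out.

```php
<?php
// Hypothetical helper: extract the character-set part of an NLS_LANG value.
// NLS_LANG has the form LANGUAGE_TERRITORY.CHARSET; ".CHARSET" alone is allowed.
function nls_lang_charset($nlsLang)
{
    $dot = strrpos($nlsLang, '.');
    if ($dot === false) {
        return null; // no character set specified
    }
    $charset = substr($nlsLang, $dot + 1);
    if ($charset === '' || $charset === false) {
        return null; // nothing after the dot
    }
    return strtoupper($charset);
}

// getenv() returns false when the variable is not set in PHP's environment.
$nlsLang = getenv('NLS_LANG');
$charsetOk = nls_lang_charset($nlsLang === false ? '' : $nlsLang) === 'AL32UTF8';
// If $charsetOk is false, the environment Apache runs in is not what you expected.
```

If the check fails on a page served by Apache but passes from your own shell, the variable is probably set in your account rather than in the one Apache runs under.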

Serving your web pages to users in UTF-8

There are a few different aspects to this:

  1. If you want all the documents on your server to default to UTF-8, then set the AddDefaultCharset directive in the Apache configuration to UTF-8. You should do either #2 or #3 below in addition to this (see the Apache documentation for the reason).
  2. If you want all your PHP documents served in UTF-8, but not necessarily other document types, set default_charset=UTF-8 in your php.ini file. It’s OK if the PHP charset is different from the Apache charset: the PHP charset will apply to PHP files, and the Apache charset will apply to all other types (this goes for #3 below as well).
  3. If you only want certain PHP documents in UTF-8, specify UTF-8 in the Content-type header of those documents. It’s important to point out here that, if you haven’t done #1 or #2 above, then you must set this header with the PHP header() function. If you try to set it with an HTML Meta tag, the charset defined in Apache will override your Meta tag.
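
For option #3, a minimal sketch of setting the header from PHP might look like this. The content_type_header() helper is hypothetical, introduced only to keep the header string in one place; header() and headers_sent() are standard PHP functions.

```php
<?php
// Hypothetical helper: build the Content-Type header line for a page.
function content_type_header($mimeType = 'text/html', $charset = 'UTF-8')
{
    return 'Content-Type: ' . $mimeType . '; charset=' . $charset;
}

// header() only works before any output has been sent to the browser,
// so call it at the very top of the script.
if (!headers_sent()) {
    header(content_type_header());
}
```

Because this is sent as a real HTTP header, it is not subject to the meta tag problem described in #3.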

UTF-8 in form submissions

In Windows 95 and 98, Microsoft used the Windows ANSI character set (Windows-1252 for Western languages). If you ever copied and pasted text from Microsoft Word into a web form under Windows 9x, chances are that some characters outside the ASCII range, such as curly quotes and long dashes, turned into garbage in the web form. This is because the web page was probably encoded as Western ISO 8859-1, and that character set assigns some of the byte values above 127 (notably 0x80 through 0x9F) differently from Windows ANSI. So the web page thought it was receiving a different character than the one you intended. Windows NT, 2000, and XP use Unicode, so you won’t have this problem under the newer versions of Windows. Macs and most other modern OSs use either Western ISO 8859-1 or Unicode. The 256 characters of Western ISO 8859-1 have the same code points as the first 256 characters of Unicode. So your Unicode encoded web form should correctly interpret non-ASCII text provided by anyone not using Windows 9x (or a completely foreign, non-Unicode character set).
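
To see how one byte can be read as two different characters, here is a small sketch using mb_convert_encoding(). It assumes the mbstring extension is loaded; the byte 0x93 is just one example from the range where the two character sets disagree.

```php
<?php
// The single byte 0x93 means different things in the two character sets.
$byte = "\x93";

// Read as Windows-1252, it is a left curly quote (U+201C).
$asWindows1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

// Read as ISO 8859-1, it is an unprintable control character (U+0093).
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

$windowsHex = bin2hex($asWindows1252); // "e2809c"
$latin1Hex  = bin2hex($asLatin1);      // "c293"
```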

Additional PHP and Oracle configurations

You will want to enable multi-byte character support in PHP. Compile PHP with the --enable-mbstring option, and set mbstring.internal_encoding=UTF-8 in your php.ini file. Also, you should definitely look over the PHP documentation for multi-byte string functions. Note that if you haven’t upgraded to PHP 5 yet, the html_entity_decode() function will fail hard if you pass it a UTF-8 string. This was the only UTF-8 incompatibility we found in PHP 4.3.

You may want to implement PHP’s function overloading. An example will illustrate why this is important: in UTF-8, a string that is 4 characters long could occupy anywhere from 4 to 16 bytes, since each character takes between 1 and 4 bytes. The mb_strlen() function will correctly tell you the number of characters in such a string, but the regular strlen() function won’t (it’ll tell you the number of bytes). Enabling function overloading will cause PHP to automatically assume it’s handling multi-byte strings, so, in this example, it will execute mb_strlen() when you call strlen(). If you’re making a wholesale conversion to UTF-8, and you don’t want to tweak all your existing code, implementing function overloading makes sense. But there is one exception: you may not want to do function overloading on mail() – I’ll get to that in a minute.
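
The difference between the byte count and the character count is easy to see by calling the mb_ functions explicitly, which is also the alternative to overloading (the mbstring.func_overload setting was deprecated in PHP 7.2 and removed in PHP 8.0). This sketch assumes the file is saved as UTF-8 and that mbstring is loaded; the sample string is arbitrary.

```php
<?php
// Four Greek letters; each one takes two bytes in UTF-8.
$greek = 'αβγδ';

$byteCount = strlen($greek);             // 8: strlen() counts bytes
$charCount = mb_strlen($greek, 'UTF-8'); // 4: mb_strlen() counts characters

// The same difference applies when taking substrings.
$firstTwoChars = mb_substr($greek, 0, 2, 'UTF-8'); // "αβ"
$firstTwoBytes = substr($greek, 0, 2);             // "α" only: one two-byte character
```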

Related to this, in Oracle 9, you can set NLS_LENGTH_SEMANTICS to CHAR or BYTE to choose between character length and byte length semantics for the tables you create. That is, you can use it to indicate whether, for example, a VARCHAR2(10) column holds 10 characters or 10 bytes.
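
To make the two semantics concrete, here is a rough PHP sketch of the check a column limit implies. fits_column() is a hypothetical helper, not an Oracle or PHP function, and it assumes UTF-8 data with mbstring loaded.

```php
<?php
// Hypothetical helper: would $value fit in a column limited to $limit units?
// $semantics is 'CHAR' or 'BYTE', mirroring the NLS_LENGTH_SEMANTICS values.
function fits_column($value, $limit, $semantics = 'BYTE')
{
    $length = ($semantics === 'CHAR')
        ? mb_strlen($value, 'UTF-8') // count characters
        : strlen($value);            // count bytes
    return $length <= $limit;
}

// Ten Greek letters: 10 characters, but 20 bytes in UTF-8.
$tenGreekLetters = 'αβγδεζηθικ';

$fitsAsChars = fits_column($tenGreekLetters, 10, 'CHAR'); // true
$fitsAsBytes = fits_column($tenGreekLetters, 10, 'BYTE'); // false
```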

Smarty

If you’re using Smarty with PHP, you’ll need to override the escape modifier. It calls the PHP htmlentities() and htmlspecialchars() functions, but it doesn’t provide them with the charset argument they need to work with UTF-8. Make a copy of the escape modifier, tweak it to pass along a charset argument to PHP, and then use it to override the original.
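
The core of such a replacement might look like the sketch below. The function name smarty_modifier_escape_utf8 is hypothetical, only two escape types are shown, and how you register it in place of the original depends on your Smarty version, so treat this as an outline rather than a drop-in plugin.

```php
<?php
// Hypothetical UTF-8 aware version of the escape modifier.
// Only the 'html' and 'htmlall' types are sketched here.
function smarty_modifier_escape_utf8($string, $escType = 'html', $charset = 'UTF-8')
{
    switch ($escType) {
        case 'html':
            // Escapes only the HTML special characters (&, <, >, quotes).
            return htmlspecialchars($string, ENT_QUOTES, $charset);
        case 'htmlall':
            // Converts every character that has a named HTML entity.
            return htmlentities($string, ENT_QUOTES, $charset);
        default:
            // Other escape types are left out of this sketch.
            return $string;
    }
}
```

The only real change from calling the PHP functions without a charset is the third argument, which tells them to treat the input as UTF-8 instead of a single-byte encoding.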

Exporting to other formats

As you’ll see below, it may not always be wise to do data exports in UTF-8. Sometimes you need to change the character set before performing the export. Take a look at PHP’s utf8_decode() and iconv functions to learn about converting UTF-8 to a single-byte encoding. Note that utf8_decode(), while easy to use, only converts to ISO 8859-1 (Latin-1); see the user contributed notes on the PHP utf8_decode() page for tips on dealing with other character sets.
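
As a rough illustration of the conversion step, the sketch below uses mb_convert_encoding() from mbstring, which performs the same UTF-8 to ISO 8859-1 conversion when given these arguments. The to_latin1() wrapper is a hypothetical name, and the sample string is arbitrary.

```php
<?php
// Hypothetical wrapper: convert UTF-8 text to ISO 8859-1 before an export.
// Characters with no ISO 8859-1 equivalent are replaced with mbstring's
// substitute character, so only use this when the data is known to be Latin.
function to_latin1($utf8Text)
{
    return mb_convert_encoding($utf8Text, 'ISO-8859-1', 'UTF-8');
}

// "café" is 5 bytes in UTF-8 (the é takes two) and 4 bytes in ISO 8859-1.
$utf8   = "caf\xC3\xA9";
$latin1 = to_latin1($utf8); // "caf\xE9"
```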

  • PDF: we use PDFlib on our web server to create PDF documents on the fly. For it to work with UTF-8 data, you need to use it with a UTF-8 compatible font. The standard Arial font supports Greek and Cyrillic in UTF-8, which is generally sufficient (don’t confuse standard Arial with Microsoft’s Arial Unicode MS font – while it can print just about any UTF-8 character, it’s 32MB, so you probably don’t want to load it on your web server!). Also, Gentium is a very nice UTF-8 compatible serif font that supports Greek and Cyrillic.
  • RTF: we are moving away from RTF, but we still have some applications that generate RTF files. RTF does not provide good UTF-8 support. Our solution is to do a utf8_decode() on our data before generating RTF files (we can get away with this since none of the data going into our RTF files contain non-Latin characters – hopefully we’ll get rid of RTF before non-Latin characters start showing up).
  • Text: we also do data exports to text files, mainly in .csv format for use in spreadsheets. Surprisingly, Microsoft Excel does not support importing UTF-8 encoded text files. Again, our solution is to perform a utf8_decode() before generating these text files.
  • Email: I recommend not doing function overloading on PHP mail(). The reason has to do with line breaks. In Unix, a line break is represented by a line feed (LF) character. On Macs, it’s represented by a carriage return (CR) character. And on Windows, by a CR+LF. For email to work between platforms, an email standard was agreed upon in the early days of the Internet, which is CR+LF. So, for example, on Unix, sendmail will add a CR as needed to each LF it finds in the body of an email message. But when an email is UTF-8, mailers don’t try to wade through the multi-byte encoding, and they don’t “fix” the line breaks. We found that the line breaks in UTF-8 emails (generated on Unix) were interpreted as desired in Mac and Unix mail readers, and by Microsoft Outlook on Windows, but not by Eudora 6.2 (and previous versions) on Windows. In Eudora, the messages displayed with no line breaks. You can’t say it’s a Eudora bug, since the line breaks weren’t meeting the standard. At this time, the emails we generate only contain basic Latin characters, so sticking with the standard mail() function meets our needs for now.
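
If you do need to send UTF-8 email, one possible workaround for the line break problem is to normalize the line breaks yourself before handing the message to the mailer. This is an assumption on my part rather than something tested against Eudora, and normalize_crlf() is a hypothetical helper.

```php
<?php
// Hypothetical helper: rewrite every line break (CRLF, CR, or LF) as CRLF,
// the line ending the email standard calls for.
function normalize_crlf($text)
{
    // CRLF is listed first so an existing CRLF is matched as one unit
    // rather than as a CR followed by an LF.
    return preg_replace('/\r\n|\r|\n/', "\r\n", $text);
}

$body = normalize_crlf("Line one\nLine two\n");
// $body is now "Line one\r\nLine two\r\n"
```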

6 Comments

  1. Bran, August 9, 2005

    I have a rather specific question. I’m running php5 and mysql 4.1, php was compiled with MB enabled. My MySQL tables and fields are all marked as utf8_general_ci. When I use phpmyadmin and insert rows with Japanese hiragana characters, they are preserved and view properly inside phpmyadmin. Yet when I try to pull out these characters using php5’s mysql functions, they display inside my webpage as ???? (question marks).

    On the other hand, if I use POST to query and store a string in Japanese into the database, phpmyadmin will not show the proper characters, but they display properly when pulled into a webpage using mysql queries in php (some characters are still messed up; this is due to the variable length of UTF-8 chars, I presume?).

    The first lines in my php files are
    header("Content-Type: text/plain; charset=utf-8");

    I also have the meta defined in addition

  2. Anand, September 29, 2005

    Wonderful blog. Though I can relate to most of the setup you mentioned, we have a somewhat more complex environment, and I wonder if I can get some pointers.

    We have a J2EE environment where the application talks to an SAP backend. When users cut and paste text with accented chars from a Word document into the web page, we store it to SAP straight (using JCO). When we retrieve the same data from SAP for display, the web page shows these special chars converted to ??s ## or something like that.

    I wonder if you or others have any checkpoints for us. We did set utf-8 encoding on the webpage and at appserver level.

  3. Roger Lancefield, December 1, 2005

    Very informative and useful. Thanks for taking the time to share your experiences.

  4. Teo HuiMing, May 29, 2006

    Hi, currently I have a website at a commercial web host. The host runs PHP as CGI and mbstring.func_overload is set to 0 by default.

    I notice func_overload can only be set either in .htaccess (only if PHP runs as an Apache module) or in php.ini/httpd.conf (which I can’t access).

    Is there any alternative way to activate the overloaded mbstring functions in this case?

  5. Robert Dupuy, March 7, 2007

    I found your writing style to be confusing, because you are talking about the oracle client configuration, and it seems like you are talking about an apache setting.

    You are not talking about any apache setting at all, when you talk about the ‘apache configuration.’ You sort of hinted at it, by stating that apache runs under a user account…

    the idea is, wherever you are running your oracle client, that oracle client has settings that need to match the server.

    You just saying, change the settings…really is vague and would have people looking in httpd.conf…in my opinion this isn’t written well.

  6. Mike, March 19, 2007

    Robert,

    Sorry you found it confusing. It sounds like you’re confused about the NLS_LANG setting (if it’s something else that was confusing, please let me know). The goal is to set utf-8 as the character encoding in Oracle, Apache, and PHP. You need to do it for all three. For Apache, you set it as an environment variable in the account apache runs under – you can do that directly in the account, or export it from the apache configuration like I said (I intentionally didn’t specify httpd.conf, as I’ve seen people use different configuration files for different purposes, so it doesn’t necessarily have to go specifically in httpd.conf).

    I recommend reading the PHP Architect article over the blog entry – it goes into more detail –
    http://www.phparch.com/issuedata/articles/article_179.pdf.

    Good luck,

    Mike T
