Tikiwiki, PHP, UTF-8 character encoding and MySQL

What is UTF-8 and why should one use it ?


In short, UTF-8 is a variable-length character encoding that uses 1 to 4 bytes for each character: ASCII characters take 1 byte, and most other characters in everyday use take 2 or 3.
It is one of the existing character encodings of the UCS (Universal Character Set), which contains well over a hundred thousand abstract characters (including the ASCII characters).
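
As a quick illustration of the variable byte length, here is a minimal Python sketch (Python is used only for illustration; the sample characters and the helper name `utf8_length` are assumptions, not part of Tikiwiki). It prints how many bytes UTF-8 needs for a few characters. Note that characters outside the Basic Multilingual Plane, such as emoji, need 4 bytes.

```python
# Sample characters chosen for illustration: ASCII, accented Latin,
# the euro sign, and an emoji.
samples = ["A", "é", "€", "😉"]


def utf8_length(char: str) -> int:
    """Return how many bytes the UTF-8 encoding of one character takes."""
    return len(char.encode("utf-8"))


for char in samples:
    # ord() gives the Unicode code point of the character.
    print(f"U+{ord(char):04X} {char!r}: {utf8_length(char)} byte(s)")
```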

UTF-8 greatly simplifies the task of internationalization by replacing the many alternative character encodings (such as ISO8859-15 Latin-9, which covers the characters of English, French, German, Spanish and Portuguese that are not available in ASCII).

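To learn more about UTF-8, you should read these :

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
  • Wikipedia UTF-8 article
  • Wikipedia Universal Character Set article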

Is Tikiwiki fully UTF-8 ?


In most cases, it seems to be... but the answer is : No 😉

A typical MySQL installation (before MySQL 8.0, which changed the default to utf8mb4) creates Latin1 databases by default, not UTF8 ones. And since Tikiwiki doesn't specify the database, table, or field encodings in its installation scripts, the tables are typically created in Latin1. So data ends up stored in Latin1 tables and not in UTF8 ones.

Good point! To create an installation with a UTF-8 database (as the default), see below.

Why does everything seem to work, and what's wrong ?


Because it's not completely impossible to have UTF8 data in Latin1 tables !

The most frequent mistake is to think that specifying the encoding at the web server level (through HTTP headers) or in the HTML code is enough. The mistake is hard to spot, especially because this setup can work (or appear to work) in many situations.

In fact, you should also use a UTF8 database with UTF8 content. Many people miss this part of the problem, or don't even know that their databases are not correctly configured to handle UTF8 data.
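
To see why a misconfigured setup can appear to work, here is a minimal Python sketch. It is a simulation, not MySQL itself: it assumes the server treats every incoming byte as one Latin1 character and hands the same bytes back unchanged, and it uses Python's `latin-1` codec as a stand-in for MySQL's Latin1. The sample text is arbitrary.

```python
text = "café €"

# What the application sends: UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# A server that believes the data is Latin1 reads each byte as one character.
stored = utf8_bytes.decode("latin-1")

# On the way out, each stored character becomes one byte again.
returned_bytes = stored.encode("latin-1")

# The application decodes those bytes as UTF-8 and sees the original text.
round_tripped = returned_bytes.decode("utf-8")

print(round_tripped == text)     # the page displays correctly
print(len(text), len(stored))    # but the server counts more "characters"
```

The round trip is lossless, so pages display correctly, yet the server holds 9 "characters" for a 6-character string. Anything the server does with the text itself (length limits, counting, substrings) works on the wrong units.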

What kind of problems may I encounter if I use Latin1 tables for UTF8 data ?


As said before, some UTF8 characters use 3 bytes. Since your database server is not aware of the real character encoding, you may run into problems such as :

  • Truncated data.

Example : If you try to store an 8-character string that contains 5 of those 3-byte UTF8 characters, your data will require 18 bytes of space (5 × 3 + 3 × 1) and not 8 bytes. So, if you try to store this string in a database field defined as a string with a maximum length of 8 characters, it will be truncated to 8 bytes.

  • Wrong results from some database functions.

Example : If you use functions that count the number of characters in a string or that return a substring, you may get, respectively, a character count greater than it should be and a substring shorter than the one expected.
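
Both problems can be reproduced with a minimal Python sketch. This is a simulation under stated assumptions, not actual MySQL behaviour: the server is modelled as treating each byte as one Latin1 character, and the sample string (five euro signs followed by "abc") is chosen only to match the 8-character example.

```python
text = "€€€€€abc"              # 8 characters, 5 of them need 3 bytes in UTF-8
raw = text.encode("utf-8")     # 5 * 3 + 3 * 1 = 18 bytes

# Truncated data: an 8-"character" Latin1 field keeps only the first 8 bytes.
truncated = raw[:8]
# The last two bytes are an incomplete character, so they are dropped here.
survivors = truncated.decode("utf-8", errors="ignore")

# Wrong function results: the server sees one character per byte.
as_latin1 = raw.decode("latin-1")
count = len(as_latin1)         # byte count, not character count

# A 3-"character" substring is really 3 bytes, i.e. a single euro sign.
prefix = as_latin1[:3].encode("latin-1").decode("utf-8")

print(len(text), len(raw))     # 8 18
print(survivors)               # €€
print(count)                   # 18
print(prefix)                  # €
```

Only two of the eight characters survive the truncation, the character count comes out as 18 instead of 8, and a substring of length 3 yields one character instead of three.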

Ok, so... how does all this stuff work ?


There are three major components you should consider when trying to understand how encoding works for an application like Tikiwiki :

  • The web browser
  • PHP in association with a web server
  • The database server
