Tikiwiki, PHP, UTF-8 character encoding and MySQL
What is UTF-8 and why should one use it?
UTF-8 greatly simplifies the task of internationalization by replacing the many alternative character encodings (such as ISO 8859-15, also called Latin-9, which encodes the English, French, German, Spanish and Portuguese characters not available in ASCII) with a single one. To learn more about UTF-8, you should read this:
Is Tikiwiki fully UTF-8?
Nowadays, a typical MySQL installation creates Latin1 databases by default, not UTF-8 ones. Since Tikiwiki doesn't specify the database, table, or field encodings in its installation scripts, the data sources are typically created in Latin1. So data is stored in Latin1 tables and not in UTF-8 ones. To create an installation with a UTF-8 database (as the default), see below.
Why does everything seem to work, and what's wrong?
The most frequent mistake is to think that specifying the encoding at the web server level (through HTTP headers) or in the HTML code is enough. This is not easy to see, especially because it can work (or let you think it works) in many situations. In fact you should also use a UTF-8 database with UTF-8 content. Many people miss this part of the problem, or don't even know that their databases are not correctly configured to handle UTF-8 data.
What kind of problems may I encounter if I use Latin1 tables for UTF-8 data?
Example: If you try to store an 8-character string that contains five 3-byte UTF-8 characters, your data will require 18 bytes of space (3 × 1 + 5 × 3) and not 8 bytes. So, if you try to store this string in a database field defined as a string with a maximum length of 8 characters, your string will be truncated to 8 bytes, possibly in the middle of a multi-byte character.
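The arithmetic in this example can be checked with a short sketch. It is written in Python only because that makes the bytes easy to inspect; the sample string and variable names are hypothetical, and slicing the byte string merely imitates what a Latin1 field of 8 characters would keep, it is not what MySQL executes internally.

```python
# Hypothetical sample string: 8 characters, 5 of which ("€") need 3 bytes in UTF-8.
text = "abc" + "€" * 5

utf8_bytes = text.encode("utf-8")
char_count = len(text)          # characters as the application sees them
byte_count = len(utf8_bytes)    # 3 * 1 + 5 * 3 bytes

# A Latin1 field of 8 characters holds 8 bytes, so only the first 8 bytes survive.
truncated = utf8_bytes[:8]

# The cut falls inside the second "€"; errors="ignore" drops the incomplete tail.
recovered = truncated.decode("utf-8", errors="ignore")

# A Latin1-based length function counts one character per byte.
latin1_count = len(utf8_bytes.decode("latin-1"))

print(char_count, byte_count, latin1_count)  # 8 18 18
print(recovered)                             # abc€
```

Only 4 of the 8 original characters come back intact, and any byte-based count reports 18 "characters" for the full string instead of 8.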
Example: If you use functions that count the number of characters in a string or that return a substring, you may get, respectively, a character count greater than it should be and a substring shorter than the one expected.
OK, so... how does all this stuff work?
The web browser
This bad feature exists only because web browsers are designed to work with most content, including content created by people who forgot to specify the encoding or don't really know what it is.
How to set UTF-8 as your default encoding
There is more than one method, because:
Now that the browser knows the encoding of the web server output and input, let's talk about the two other components...
PHP in association with a web server
(It is also possible to set default_charset = "UTF-8" in the php.ini configuration file. Comments in this file explain that: "As of 4.0b4, PHP always outputs a character encoding by default in the Content-type: header. To disable sending of the charset, simply set it to be empty.")
Configuration example:
default_charset = UTF-8
mbstring.language = Neutral
If you don't have access to the server's php.ini configuration file, you can use this syntax in .htaccess files:
php_value mbstring.language "Neutral"
php_value mbstring.internal_encoding "UTF-8" ...
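The database server (MySQL in this article)
MySQL is configured to communicate with "clients". PHP (through its mysql extensions) is one of them.
There are two ways to specify the encoding used for this communication:
(1) In the [client] section of the MySQL configuration file (/etc/mysql/my.cnf):
[client]
default-character-set = utf8
(2) With an SQL statement sent on the connection:
SET NAMES 'UTF8';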
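Why this client encoding matters can be illustrated without a database. The sketch below is Python, used only to manipulate bytes; it imitates a server that assumes the connection is Latin1 while the client actually sends UTF-8, then converts what it received for a UTF-8 column. The sample word and variable names are hypothetical, and the steps are a simplified model of the mismatch, not a trace of what MySQL does internally.

```python
# Hypothetical sample word containing one non-ASCII character.
original = "café"

# What a UTF-8 web page hands to the database client: "é" becomes 2 bytes.
sent = original.encode("utf-8")

# The connection is declared as Latin1, so each byte is read as one character.
misread = sent.decode("latin-1")

# The server then converts those "characters" to UTF-8 for storage.
stored = misread.encode("utf-8")

print(len(sent), len(stored))    # 5 7
print(stored.decode("utf-8"))    # cafÃ©

# With the connection declared as UTF-8, the bytes are read back unchanged.
print(sent.decode("utf-8"))      # café
```

In this model the stored value is both longer than the original and wrong when read back as UTF-8, which is the kind of silent corruption a correct client encoding avoids.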
In fact, for solution (1), the PHP mysql extension doesn't read the /etc/mysql/my.cnf file and doesn't use the encoding specified there. To handle this, you need to use the mysqli (MySQL Improved) extension for PHP. Fortunately, this one is already usable with Tikiwiki, because:
To switch to the mysqli extension, you need to:
(For Latin1 to UTF-8 data migration, you can find articles via Google 😉)
CREATE DATABASE `tikiwiki` CHARACTER SET utf8;
(There is no need to specify a collation; the default one will be based on the character set.)
Just for information, if you want to change your MySQL default character set (the one used when creating a database without specifying a character set), add this to your my.cnf configuration file:
[mysqld]
default-character-set = utf8
(On MySQL 5.5 and later, this server option is named character-set-server instead.)
Interaction between the three components
What about MySQL collations?
Related page at tikiwiki.org