
History: Character Encoding

Source of version: 4 (current)

Please see ((Understanding Encoding)), as the information below is quite old.


---


Related: ((tw:utf-8))

Character Encoding settings allow various language character sets to be displayed properly in a browser.  

For more on internationalization see: ((i18n))

All data in Tiki is supposed to be encoded in UTF-8, which can handle all characters defined by the Unicode standard while still being relatively compact. Problems show up as strange characters on the site when you choose a non-English language.

One concern is that your browser has to support UTF-8 and, furthermore, has to figure out that the Tikiwiki pages use this encoding. If your browser interprets the UTF-8 encoded output as something else, for example Latin-1, which is the default in Western Europe, then the characters 'æ', 'ø', and 'å' in the Danish locale look like 'Ã¦', 'Ã¸', and 'Ã¥'.
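
The misreading described above can be reproduced outside the browser. The following is a minimal sketch, not Tiki code: it encodes the three Danish letters as UTF-8 and then decodes the same bytes as Latin-1, the way a browser with the wrong charset assumption would.

```python
# Danish letters as they would be stored in a UTF-8 encoded page.
text = "æøå"
utf8_bytes = text.encode("utf-8")

# A browser that wrongly assumes Latin-1 maps every byte to one character.
misread = utf8_bytes.decode("latin-1")

# Decoding with the right charset restores the original letters.
correct = utf8_bytes.decode("utf-8")

print(misread)
print(correct)
```

Each letter occupies two bytes in UTF-8, so the misread string is twice as long as the original.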

^Redflo [http://tikiwiki.cvs.sourceforge.net/tikiwiki/tiki/tiki-setup.php?r1=1.173.2.42&r2=1.173.2.43|solved this in August 2004.]^ All releases since then are unaffected...


! Background

The Unicode standard defines the Universal Character Set (UCS), which gives numbers to all the characters in all the alphabets of the world. The UCS is a superset of Latin-1 (ISO-8859-1), which in turn is a superset of ASCII. ASCII defines the first 128 characters and Latin-1 defines another 128 characters, thereby exhausting all the bits in an 8-bit byte.
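
The superset relationship can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are made up for the example.

```python
# Every Latin-1 byte value decodes to the Unicode code point with the same number.
latin1_matches_ucs = all(
    ord(bytes([value]).decode("latin-1")) == value for value in range(256)
)

# The first 128 byte values decode identically as ASCII and as Latin-1.
ascii_bytes = bytes(range(128))
ascii_matches_latin1 = ascii_bytes.decode("ascii") == ascii_bytes.decode("latin-1")

print(latin1_matches_ucs, ascii_matches_latin1)
```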

The UCS defines many more characters, so 1 byte per character is not enough. The UCS was originally laid out as a 31-bit code space (since restricted to code points up to U+10FFFF), so the logical size of each character would be 4 bytes (32 bits). The problem with such wide characters is that they are only needed if your use of the ~2 billion possible characters is evenly distributed. Most people use no more than 256 of those characters in their documents, so a lot of space is wasted.
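
To make the wasted space concrete, here is a small sketch that compares a fixed-width 4-byte encoding (UTF-32, used here as a stand-in for "4 bytes per character") with UTF-8 for a plain English string.

```python
text = "Character Encoding"

# UTF-32 (big-endian, no byte order mark) spends 4 bytes on every character.
utf32_size = len(text.encode("utf-32-be"))

# UTF-8 spends a single byte on each of these ASCII characters.
utf8_size = len(text.encode("utf-8"))

print(utf32_size, utf8_size)
```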

The UTF-8 encoding is a way of transforming 4-byte wide characters into variable-width sequences of 1-6 bytes (1-4 bytes in current UTF-8). It is backwards compatible with ASCII, meaning that text encoded in ASCII is automatically valid UTF-8 as well. All other characters (including the upper half of Latin-1) take two or more bytes each. That's why 'æ', 'ø', and 'å' turn into two-letter combinations when a UTF-8 encoded text is viewed as Latin-1.
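
As a rough illustration of the variable width, the sketch below measures how many bytes UTF-8 needs for a few sample characters and confirms that pure ASCII text produces the same bytes in both encodings. The sample characters are arbitrary choices for the example.

```python
# One character each from ASCII, Latin-1, Japanese, and the emoji range.
samples = ["a", "æ", "日", "😀"]
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

# ASCII text is byte-for-byte identical when encoded as UTF-8.
ascii_text = "Tiki"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

print(byte_lengths, same_bytes)
```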

All of the above is dealt with in much more detail in the [http://www.cl.cam.ac.uk/~mgk25/unicode.html|UTF-8 and Unicode FAQ for Unix/Linux], which is useful for a lot more than just Unix/Linux.

! If Tiki on your server doesn't look fine:

I've added a line
 header('Content-Type: text/html; charset=utf-8');
in tiki-setup.php. This should fix the "browser character encoding decision" problems.
redflo.

Some servers (like [http://httpd.apache.org/|Apache] with the default Debian config) add a -+charset=iso-8859-1+- to the ~np~Content-Type~/np~ header.  The browser (e.g. Mozilla) first looks for a charset value in the ~np~Content-Type~/np~ header and only then for META tags, so the header overrides the META tag inserted by TikiWiki.
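
The precedence described above can be sketched in Python. This models only the rule stated here (header first, META tag as fallback) and not a full browser algorithm; -+charset_from_header+- and -+effective_charset+- are hypothetical helper names, and the parsing relies on the standard library's -+email.message.Message+-.

```python
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def effective_charset(content_type, meta_charset):
    """Hypothetical helper: the header charset wins, the META tag is the fallback."""
    return charset_from_header(content_type) or meta_charset


# The server-added charset hides the utf-8 META tag.
print(effective_charset("text/html; charset=iso-8859-1", "utf-8"))
# Without a charset in the header, the META tag decides.
print(effective_charset("text/html", "utf-8"))
```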

To solve this with Apache, check that there is no -+AddDefaultCharset iso-8859-1+- option in the -+httpd.conf+- file. Some distributions set -+AddDefaultCharset on+- in -+httpd.conf+- for security reasons (see [http://www.apache.org/info/css-security/|Apache CSS Security]), so keep in mind that you can always override the default settings in a virtualhost directive in the -+httpd.conf+- file.

Another solution is to move the -+doc/htaccess+- file in your Tikiwiki installation to -+.htaccess+- and uncomment -+AddDefaultCharset utf-8+-, which will add the correct header for UTF-8 output.

! Test on this page itself
(pl) Czy polski ogónki funkcjonują tutaj, pisany przez mozilli? Np, tu trochę treść. (ja) 日本語. (en) Good, this looks OK :).

! Editing the translations

If you change the -+language.php+- file in Tikiwiki, remember to save the file in UTF-8 encoding. If you're using [http://www.gnu.org/software/emacs/emacs.html|Emacs], it's easy to change the encoding of a file. Simply open the file and type '-+C-x RET f+-', which runs the command -+set-buffer-file-coding-system+-. Now choose -+utf-8+- from the list.
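
If you don't use Emacs, the same conversion can be scripted. The following is a minimal sketch, assuming the file is currently stored in Latin-1; -+reencode_to_utf8+- is a hypothetical helper name, and you should keep a backup because the file is overwritten in place.

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding="latin-1"):
    """Rewrite the file at `path` as UTF-8, assuming it is stored in `source_encoding`."""
    file_path = Path(path)
    # Decode with the encoding the file really has, then write the text back as UTF-8.
    text = file_path.read_bytes().decode(source_encoding)
    file_path.write_bytes(text.encode("utf-8"))
```

For example, -+reencode_to_utf8("language.php")+- would convert a Latin-1 copy of the translation file. Running it on a file that is already UTF-8 would double-encode the non-ASCII characters, so check the current encoding first.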

[http://tikiwiki.org/tiki-index.php?page=CharacterEncodingTrouble|original page]