Thursday, November 26, 2015

Mojibake and the horrors of database encoding.

I gave several pieces of sound and wise advice (if I say so myself) in the past when writing about character encodings, while stopping on the threshold of DBMS encodings with some smartass comment like "God help you if you must".

So it was only appropriate that I should eventually take on that same task, and that it should bite me in the arse in revenge. (Background: my company hosts several legacy LAMP sites which had originally been set up with very little thought given to character encodings and all that.)

And so it was that - having read a few StackOverflow posts on the subject - I started handling a multilingual, latin1 encoded MySql database with what turned out to be a very hazy understanding of the subject, took several misguided turns and ended up with a character soup. In truth, the character soup was probably there to begin with and I just made it worse, although the changes I made in my (panicked) attempts to fix it only conspired to make it quite hard to sort out.

Of course, all this happened before I ran across two wise posts (1 and 2) on the subject by Joni Salonen. He knows his stuff and, had I read them in advance, I would have toiled less. So most of what follows was learned the hard way (trial and error); I am presenting it here combined with what I learned from Joni's posts.

Mojibake (a Japanese word, and not by chance: Asians, I guess, have a lot of encoding fun) was, for me, entirely out of the equation - I know better now. For Europeans, this is a common form of mojibake (courtesy of Richard Tobin's useful tool):



Character:                         è
Character name:                    LATIN SMALL LETTER E WITH GRAVE
Hex code point:                    00E8
Decimal code point:                232
Hex UTF-8 bytes:                   C3 A8
Octal UTF-8 bytes:                 303 250
UTF-8 bytes as Latin-1 characters: Ã ¨


Mojibake (in one of its simplest forms) is when you start with a UTF8 encoded character (top row) which is, at some point of its life, mistaken for two latin1 chars (bottom row). This is usually compounded by something (or someone) taking the two not-quite-latin1 characters and encoding them back to UTF8 (for European alphabets this normally means you now have four bytes).
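
A quick way to watch this happen, if a MySQL client is at hand (a sketch: it assumes a reachable server, a UTF-8 terminal and a utf8 client connection; the UNHEX trick keeps the demo independent of how this page itself is encoded):


# C3 A8 are the UTF-8 bytes of 'è' (U+00E8)
mysql -N -e "SELECT CONVERT(UNHEX('C3A8') USING utf8)"     # è   - the bytes read as UTF-8
mysql -N -e "SELECT CONVERT(UNHEX('C3A8') USING latin1)"   # Ã¨  - the same bytes read as latin1
# re-encoding the misread pair back to UTF-8 turns two bytes into four: double-encoded mojibake
mysql -N -e "SELECT HEX(CONVERT(CONVERT(UNHEX('C3A8') USING latin1) USING utf8))"   # C383C2A8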

Oh, and let's not forget about Windows-1252, which helpfully uses bytes 128-159 for characters that are UTF8-encoded as three-byte sequences (most often sighted as single/double quotes, usually the result of copy-pasting from Word). E.g.:



Character:                         „
Character name:                    DOUBLE LOW-9 QUOTATION MARK
Hex code point:                    201E
Decimal code point:                8222
Hex UTF-8 bytes:                   E2 80 9E
Octal UTF-8 bytes:                 342 200 236
UTF-8 bytes as Latin-1 characters: â [80] [9E]


This is just the beginning of the story, though. When trying to understand what is going on in your DBMS (mysql, in this case) you should bear in mind that each link of the chain has its own idea of what encoding is being used, and may (or may not) decide to make things "easier" for you by converting from the encoding it thinks your strings are in to the one it is using. It may get it wrong; even if it does not, you will (if you believe the characters you see) and will end up with egg on your face (mojibake, scrambled).

The links you should be aware of are:
  • the server (holds the truth, but won't tell)
  • the client (it lies: it has a default encoding it won't tell you about)
  • the terminal (may be lying: check your $LANG)
  • the operating system (probably an innocent bystander)
  • your editor (It lies. Maybe. See later.)

The server. Mysql has what it calls 'default encodings': they exist for the database, for the table, and for each individual column. If you tell it nothing, mysql defaults to latin1 (there are good reasons for this; holding a multilingual catalog, however, is not one of them). The clients, meanwhile, have their own idea of the encoding that the incoming/outgoing characters are in. The most typical SNAFU: UTF8 characters stored as pairs of latin1 chars in a latin1 table. They will display as character soup, and if you now ALTER TABLE to change the encoding (with mysql doing automatic conversions for you), you will end up with the doubly encoded mojibake I mentioned above.
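
To see which defaults are actually in effect at each level before touching anything, information_schema (the same views queried by the script in the appendix) will tell you - a quick sketch, with hypothetical database/table names:


# database-level default (hypothetical name 'mydb')
mysql -e "SELECT default_character_set_name, default_collation_name
          FROM information_schema.schemata WHERE schema_name = 'mydb'"
# table default plus any per-column CHARACTER SET overrides
mysql -e "SHOW CREATE TABLE mydb.mytable\G"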

You can have even more fun if you have the (not so) brilliant idea (advocated by several posts on the 'net) of doing the DB conversion by dumping the database to a file (mysqldump), editing it to change the default latin1 encodings to utf8, then running a conversion tool like iconv/uconv to turn the (supposedly) latin1 characters into utf8, and finally feeding the result to mysql. Don't. What you are not realising here is that, when dumping, the characters are output in the encoding that the client is using - which these days is most often utf8, if the client is the command line mysql tool and your terminal is using a .utf8 locale. Running iconv on the dump, then, accomplishes the "fine" result of doubly encoding all the characters that are already utf8.
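
If you do end up dumping, at least look at what actually landed in the file before pushing it through any converter (file and database names below are hypothetical):


mysqldump --default-character-set=utf8 mydb > dump.sql
file dump.sql                    # "UTF-8 Unicode text" means multibyte sequences are already there
grep -m1 'SET NAMES' dump.sql    # the charset the dump declares for reloading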

What to do, then? There is one good case, and a host of bad cases.

Regardless of the situation you are in, you should try your best to know exactly what you are looking at - when in doubt, reach for a hex editor/viewer (bvi, emacs' hexl, hexdump, xxd). This seems silly, but consider that, when you see the sequence 'Ã¨', your instinct will tell you that you are looking at a UTF8 char displayed on a latin1 device. But throw mojibake in, and you may well be looking at two UTF8 characters displayed by a UTF8 device. Looking at the actual bytes will give you certainty.
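
Assuming a UTF-8 terminal, the difference is immediately visible at the byte level:


printf 'è'  | xxd     # 00000000: c3a8       -> one character, two UTF-8 bytes
printf 'Ã¨' | xxd     # 00000000: c383 c2a8  -> two characters, four bytes: mojibake already happened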

Also, locking every link of your chain to utf8 will help, and will also prevent future mixups. This is how I do it. For mysql, I put this in my.cnf:



[mysqld]
; this inhibits start on v5.5 (a bug)
; default-character-set=utf8
character-set-server=utf8
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
skip-character-set-client-handshake
... 
[client]
default-character-set=utf8


This alone should be enough. However, you always want to be explicit, so make sure that (where you are in charge of these things) the DB connection strings specify the encoding, e.g. (PHP, PDO):



"mysql:host=localhost;dbname=test;charset=utf8"


In your terminal, make sure that the locale-related variables are utf8 aware (e.g.: LANG=en_US.UTF-8). Your editor should be modern, UTF8 aware, and prefer the UTF8 encoding when in doubt (e.g., for emacs, run a version greater than 21.3, and use 'M-x prefer-coding-system utf-8' or have (prefer-coding-system 'utf-8) in your .emacs).
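
A quick sanity check on the terminal side (the exact values will vary with your distribution):


locale | grep -E 'LANG|LC_CTYPE'   # you want to see something ending in .UTF-8 here
echo $LANG                         # e.g. en_US.UTF-8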

With this, you will be sure that every link of your chain is thinking UTF8 - for the command line mysql client, you can confirm it with:



mysql> SHOW VARIABLES LIKE 'character_set%'; SHOW VARIABLES LIKE 'collation%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8                       |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

+----------------------+-----------------+
| Variable_name        | Value           |
+----------------------+-----------------+
| collation_connection | utf8_unicode_ci |
| collation_database   | utf8_unicode_ci |
| collation_server     | utf8_unicode_ci |
+----------------------+-----------------+
3 rows in set (0.00 sec)


I also put



AddDefaultCharset UTF-8
 ...
AddCharset utf-8 .html .css .php .txt .js 


in Apache's configuration for the web sites involved.
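
You can then check what the server actually sends (the URL is of course hypothetical):


curl -sI http://www.example.com/index.html | grep -i '^content-type'
# Content-Type: text/html; charset=UTF-8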

And note that here I am talking about small DBs, therefore giving little, if any, consideration to performance, conversion times and the like. If you have multi-GB DBs with full-text indexes that take days to rebuild, you should research the topic for your specific requirements.

In the good case, you have a coherent latin1 encoding environment. If so, just issuing



ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;


should do the deed. Beware that



ALTER TABLE tbl CHARACTER SET utf8


(notice the missing "convert to"?) will change the default encoding but will do no conversion: it leaves the encoding of the existing columns as it is and will confuse the hell out of you.
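
If in doubt about what a past ALTER actually did, SHOW CREATE TABLE spells it out - per-column CHARACTER SET clauses that differ from the table default are exactly the confusion described above (table name hypothetical):


mysql -e "SHOW CREATE TABLE mydb.tbl\G"
# look for "CHARACTER SET latin1" on individual columns vs "DEFAULT CHARSET=utf8" at the bottom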

And if you are already in this unfortunate situation, this may sort of help you (assuming your column is CHAR(50)):



ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET utf8;
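
A related trick, documented in the MySQL manual and worth knowing: if the latin1 column actually holds UTF-8 bytes (the first SNAFU described above), converting it straight to utf8 would transcode - and therefore double-encode - its contents. Round-tripping through a binary type changes the label without touching the bytes (names and sizes below are hypothetical; back up first):


# relabel, don't transcode: binary columns have no charset, so no conversion happens
mysql mydb -e "ALTER TABLE t MODIFY col1 VARBINARY(50);
               ALTER TABLE t MODIFY col1 CHAR(50) CHARACTER SET utf8;"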


This snippet - http://stackoverflow.com/questions/8906813/how-to-change-the-default-charset-of-a-mysql-table - will cobble together the conversion queries for your entire set of tables:



SELECT concat('alter table ', table_name,
              ' CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;')
  FROM information_schema.tables
 WHERE table_schema = 'your_database_name'
   AND table_collation != 'utf8_general_ci'
 GROUP BY table_name;
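
One way to apply the result (assuming you save the query above as, say, gen.sql - the file names are hypothetical):


mysql -N information_schema < gen.sql > convert.sql   # -N drops the header row
less convert.sql                                      # read before you leap
mysql your_database_name < convert.sql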


As an alternative, you may mysqldump your (smallish) database, edit the resulting file, change all latin1 occurrences to utf8 and reload into a fresh database (defaulting to utf8, of course).
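
A minimal sketch of that route (database and file names are hypothetical; check the sed patterns against your actual dump before trusting them):


mysqldump --default-character-set=utf8 olddb > olddb.sql
sed -i -e 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/g' \
       -e 's/CHARACTER SET latin1/CHARACTER SET utf8/g' olddb.sql
mysql -e "CREATE DATABASE newdb CHARACTER SET utf8 COLLATE utf8_unicode_ci"
mysql newdb < olddb.sql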

Bad cases are plentiful but, in my experience, have mostly to do with double encodings (see above) or mixed encodings. Because I do not think my personal struggles with them are of great interest, I will refer you again to Joni's excellent posts on the subject. Print them. Study them. And may mojibake never come close to your DBs.

Appendix:

The following bash script reports on the encodings in effect for a particular database, table or column (or for all of them).



#!/bin/bash
ver=0.1
author="Alessandro Forghieri "
name=`basename $0`
sa_usage () {  echo "usage: $0 [-l] [-t] [-c] [db [table [column]]]" ; }
usage () {
 echo "$name $ver $author"
 echo
 echo $(sa_usage)
 echo
 echo "        -l   like dbname "
 echo "        -t   like tablename"
 echo "        -c   like columnname"
 echo "report on database encodings"
 echo
}

# getopts optstring name [args]
while getopts dvltc opt ; do
    case "$opt" in
        d) set -v ;;
        v) opt_v=1 ;;
        l) opt_l=1 ;;
        t) opt_t=1 ;;
        c) opt_c=1 ;;
        ?) usage; exit ;;
    esac
done

shift `expr $OPTIND - 1`

#http://stackoverflow.com/questions/1049728/how-do-i-see-what-character-set-a-mysql-database-table-column-is

echo "======================"

echo "SHOW VARIABLES LIKE 'character_set%'; SHOW VARIABLES LIKE 'collation%';" | mysql

echo "======================"


CSET_DB="SCHEMA_NAME AS name,DEFAULT_CHARACTER_SET_NAME AS encoding,DEFAULT_COLLATION_NAME AS collation"


if [[ x$1 == x ]]; then
    CLAUSE_DB=""
else
    if [[ x$opt_l == x ]]; then
        CLAUSE_DB="WHERE schema_name = '$1'"
    else
        CLAUSE_DB="WHERE schema_name like '$1'"
    fi
fi
echo "SELECT $CSET_DB  FROM information_schema.schemata $CLAUSE_DB" | mysql

[[ x$1 == x ]] && exit 0;
[[ x$opt_l  != x ]] && exit 0;

echo "======================"

if [[ x$opt_l == x ]]; then
    CLAUSE_DB=" AND  T.table_schema='$1' "
else
    CLAUSE_DB=" WHERE T.table_schema like '$1' "
fi

if [[ x$2 != x ]] ; then
    if [[ x$opt_t == x ]]; then
        CLAUSE_TABLE=" AND T.table_name = '$2' "
    else
        CLAUSE_TABLE=" AND T.table_name like '$2' "
    fi
fi

CSET2="T.TABLE_SCHEMA, T.TABLE_NAME, CCSA.character_set_name"

echo "SELECT $CSET2 FROM information_schema.TABLES T,information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA WHERE CCSA.collation_name = T.table_collation  $CLAUSE_DB $CLAUSE_TABLE" | mysql

echo "======================"

if [[ x$opt_l == x ]]; then
    CLAUSE_DB=" WHERE table_schema='$1' "
else
    CLAUSE_DB=" WHERE table_schema like '$1' "
fi


if [[ x$2 != x ]] ; then
    if [[ x$opt_t == x ]]; then
        CLAUSE_TABLE=" AND table_name = '$2' "
    else
        CLAUSE_TABLE=" AND table_name like '$2' "
    fi
fi

if [[ x$3 != x ]] ; then
    if [[ x$opt_c == x ]]; then
        CLAUSE_COLUMN=" AND column_name = '$3' "
    else
        CLAUSE_COLUMN=" AND column_name like '$3' "
    fi
fi

CSET3="TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME , CHARACTER_SET_NAME"

echo "SELECT $CSET3 FROM information_schema.COLUMNS $CLAUSE_DB $CLAUSE_TABLE $CLAUSE_COLUMN" | mysql