Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl + checksetup.pl)
Hi,
I am experiencing serious trouble moving an existing Bugzilla non-
UTF-8 installation to a new database with UTF-8 with Bugzilla 3.0.
I have found some hack to get it to work, but my understanding of the
inner workings of this is rather limited, so I'd appreciate some
feedback :-)
Old (source) system:
* Bugzilla 2.22.2, on iso-8859-1 (German), coming there all the way up
from 2.18
* mysql 4.0.15 with default character set "latin1"
* Apache 2.x
* Old Perl
* Old Linux box
New (target) system
* Bugzilla 3.0 on UTF-8
* mysql 5.0.37 with default character set "utf-8"
* Apache 2.x
* Current? Perl
[mysql, Apache, Perl all come from XAMPP, [url]http://www.apachefriends.org/][/url]
* Windows 2003 Server box
Intent: Get data from old to new :-)
Steps taken:
Source: mysqldump -all-databases > mydumpfile.dump
Target: mysql < mydumpfile.dump
So far, so good, and the first observation: The data made it into the
mysql database just fine, i.e. all the funky German characters are
visible just fine from within phpMyAdmin.
Running contrib/recode.pl on the database works, but does not seem to
do anything (i.e. it does not seem to alter to the worse or better -
the data is simple visible the same in phpMyAdmin as it was before
running contrib/recode.pl)
Running checkconfig.pl now does the expected things, emits the
appropriate stern warning regarding data getting deleted in case of
bad encoding and successfully runs through all
Converting table storage format to UTF-8. This may take a while.
steps. Fine. But now the data inside the database most definitely is
corrupt - i.e. all text is cut off at the "funky German character"
boundary (or, more formally, at any character that is outside of ASCII
and would require UTF-8 escaping). Example: Where the old text would
be "Müßiggänger", the (scrambled) text, both in phpMyAdmin and
Bugzilla would just be "M".
Having lost quite a bit of hair over this, I decided to try switching
one peculiar umlaut-prone column to utf-8 encoding straight from
within phpMyAdmin. This did not loose any data inside that column
(great!). Running checksetup.pl on top had the effect that all encoded
text was broken (ok, no surprise), except for all the text in the one
column that I had converted manually. Humm.
Since I was not in the mood to manually convert every single text
field from latin1 encoding to utf-8 encoding, I decided to take a peek
at the upgrade approach taken by checksetup.pl and noticted that it
eventually performs two steps (in Bugzilla/DB/Mysql.pm):
a) change column from source to "binary"
b) change column from the then "binary" to "utf-8"
[I sort of understand the intent of this - contrib/recode.pl was
expected to have performed the data "content" conversion, but then
contrib/recode.pl had not done anything for me]
That looked odd, to be honest, and not applicable to my scenario, so
trying to skip the "binary" conversion, I ended up with (in Bugzilla/
DB/Mysql.pm)
************************
my $type = $self->_bz_schema-[color=blue]
>convert_type($col_info->{TYPE});[/color]
$binary =~ s/(\Q$type\E)/$1 CHARACTER SET binary/;
$utf8 =~ s/(\Q$type\E)/$1 CHARACTER SET utf8/;
#
# $self->do("ALTER TABLE $table CHANGE COLUMN $name
$name
# $binary");
#
$self->do("ALTER TABLE $table CHANGE COLUMN $name
$name
$utf8");
************************
Run. Check with phpMyAdmin. Check with Bugzilla 3.0. Celebrate - a
complete success.
But ... how come?
In essence, checksetup.pl behaved the same, regardless of whether
contrib/recode.pl had been run or not. Unmodified, it would always
corrupt my data.
I did not see contrib/recode.pl doing anything useful, but the ALTER
TABLE to utf8 worked fine for me, out of the box, no running of
contrib/recode.pl involved at all.
Is this a peculiarity of the XAMPP distribution on Windows? The rather
massive switch of things (not one thing at a time, but change OS,
database version, Perl version, and Bugzilla version, all at the same
time)?
Can someone please help me understand what is going on here?
Thanks!
Re: Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl +checksetup.pl)
On 15 May 2007 08:43:25 -0700 Stefan Hoffmeister
<stefan.hoffmeister@gmail.com> wrote:[color=blue]
> Running contrib/recode.pl on the database works, but does not seem to
> do anything (i.e. it does not seem to alter to the worse or better -
> the data is simple visible the same in phpMyAdmin as it was before
> running contrib/recode.pl)[/color]
Okay. The steps you need to do are:
1) Run checksetup.pl until it tells you to run recode.pl.
2) Run contrib/recode.pl --charset=cp1252
3) Run checksetup.pl again.
[color=blue]
> But ... how come?[/color]
Because the data in your database was in the actual encoding
the database claimed it was in. That is, your data was in Latin1, and
the DB was encoded in Latin1. This isn't the case for a vast number of
Bugzilla installations.
-Max
--
[url]http://www.everythingsolved.com/[/url]
Competent, Friendly Bugzilla Services. And Everything Else, too.
Re: Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl + checksetup.pl)
On 15 Mai, 21:46, Max Kanat-Alexander <mka...@bugzilla.org> wrote:[color=blue]
> On 15 May 2007 08:43:25 -0700 Stefan Hoffmeister[/color]
Max, many thanks for your reply!
Your steps did work (and this included the expected temporary garbling
of text inside the mysql database), but they open another can of worms
with questions :-)
[color=blue]
> 2) Run contrib/recode.pl --charset=cp1252[/color]
Why cp1252? This is a Windows code page?
I had used "--charset=iso-8859-1" before, as that is latin1 (which is
the mysql box on Linux). Also the columns had shown up as latin1 in
phpMyAdmin after the reimport into the Windows-based mysql database.
I am currently adding tests with a variety of "international"
characters (German umlauts, characters from the set where ISO 8859-1
and Windows 1252 do not match - [url]http://en.wikipedia.org/wiki/Windows-1252[/url]
to the rescue!) - if I run into additional difficulties, I'll be sure
to speak up ;-).
Re: Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl + checksetup.pl)
Stefan Hoffmeister wrote:[color=blue][color=green]
>> 2) Run contrib/recode.pl --charset=cp1252[/color]
>
> Why cp1252? This is a Windows code page?[/color]
Yes, but it fully includes iso-8859-1.
So it's never a problem to convert iso-8859-1 data using the cp1252
codepage, but it will be a problem to convert using iso-8859-1 if any
cp1252 character has slipped in.
[color=blue]
> I had used "--charset=iso-8859-1" before, as that is latin1 (which is
> the mysql box on Linux)[/color]
What counts is the OS of the user that entered the data in the bugs.
Those under linux who used iso-8859-1 won't be a problem, but there will
almost always be many who used windows, and those might have inserted
some cp1252 characters, which the old bugzilla will not have filtered
out before inserting in the base.
Re: Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl + checksetup.pl)
On 16 Mai, 13:40, Jean-Marc Desperrier <jmd...@alussinan.org> wrote:
[color=blue]
> What counts is the OS of the user that entered the data in the bugs.[/color]
Ah! That's it, then.
I am confident that the "international" users only made use of Windows
clients (both IE and Firefox). Will make sure, though, that the
migration setup works, regardless of which client created the text.
What I still am surprised about, though, is that I could hack things
to work just fine by
* mysql bugs < the.dump
* edit Bugzilla's migration script to not use "locale encoding ->
binary -> utf8" but to "locale encoding -> utf8"
* ./checksetup.pl
I do not recall, though, whether the Euro symbol (primary suspect,
because its binary encoding is different in 8859-15 and cp1252)
converted fine with the hack. This test I added later, and this test
definitely passes when following Max's steps :-)
Re: Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl + checksetup.pl)
Thanks a lot - helped me alot moving Bugzilla 2.20 to 3.6.3 from Redhat to Centos machine - Saved my day and hours :)
Re: Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl + checksetup.pl)
Thanks for sharing the informative post.