Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl + checksetup.pl) - Mozilla

This is a discussion on Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl + checksetup.pl) - Mozilla ; Hi, I am experiencing serious trouble moving an existing Bugzilla non- UTF-8 installation to a new database with UTF-8 with Bugzilla 3.0. I have found some hack to get it to work, but my understanding of the inner workings of ...

+ Reply to Thread
Results 1 to 7 of 7

Thread: Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl + checksetup.pl)

  1. Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl + checksetup.pl)


    Hi,

    I am experiencing serious trouble moving an existing Bugzilla non-
    UTF-8 installation to a new database with UTF-8 with Bugzilla 3.0.

    I have found some hack to get it to work, but my understanding of the
    inner workings of this is rather limited, so I'd appreciate some
    feedback :-)

    Old (source) system:

    * Bugzilla 2.22.2, on iso-8859-1 (German), coming there all the way up
    from 2.18
    * mysql 4.0.15 with default character set "latin1"
    * Apache 2.x
    * Old Perl
    * Old Linux box

    New (target) system

    * Bugzilla 3.0 on UTF-8
    * mysql 5.0.37 with default character set "utf-8"
    * Apache 2.x
    * Current? Perl
    [mysql, Apache, Perl all come from XAMPP, http://www.apachefriends.org/]
    * Windows 2003 Server box

    Intent: Get data from old to new :-)

    Steps taken:

    Source: mysqldump -all-databases > mydumpfile.dump
    Target: mysql < mydumpfile.dump

    So far, so good, and the first observation: The data made it into the
    mysql database just fine, i.e. all the funky German characters are
    visible just fine from within phpMyAdmin.

    Running contrib/recode.pl on the database works, but does not seem to
    do anything (i.e. it does not seem to alter to the worse or better -
    the data is simple visible the same in phpMyAdmin as it was before
    running contrib/recode.pl)

    Running checkconfig.pl now does the expected things, emits the
    appropriate stern warning regarding data getting deleted in case of
    bad encoding and successfully runs through all

    Converting table storage format to UTF-8. This may take a while.

    steps. Fine. But now the data inside the database most definitely is
    corrupt - i.e. all text is cut off at the "funky German character"
    boundary (or, more formally, at any character that is outside of ASCII
    and would require UTF-8 escaping). Example: Where the old text would
    be "Müßiggänger", the (scrambled) text, both in phpMyAdmin and
    Bugzilla would just be "M".

    Having lost quite a bit of hair over this, I decided to try switching
    one peculiar umlaut-prone column to utf-8 encoding straight from
    within phpMyAdmin. This did not loose any data inside that column
    (great!). Running checksetup.pl on top had the effect that all encoded
    text was broken (ok, no surprise), except for all the text in the one
    column that I had converted manually. Humm.

    Since I was not in the mood to manually convert every single text
    field from latin1 encoding to utf-8 encoding, I decided to take a peek
    at the upgrade approach taken by checksetup.pl and noticted that it
    eventually performs two steps (in Bugzilla/DB/Mysql.pm):

    a) change column from source to "binary"
    b) change column from the then "binary" to "utf-8"

    [I sort of understand the intent of this - contrib/recode.pl was
    expected to have performed the data "content" conversion, but then
    contrib/recode.pl had not done anything for me]

    That looked odd, to be honest, and not applicable to my scenario, so
    trying to skip the "binary" conversion, I ended up with (in Bugzilla/
    DB/Mysql.pm)

    ************************
    my $type = $self->_bz_schema-
    >convert_type($col_info->{TYPE});

    $binary =~ s/(\Q$type\E)/$1 CHARACTER SET binary/;
    $utf8 =~ s/(\Q$type\E)/$1 CHARACTER SET utf8/;

    #
    # $self->do("ALTER TABLE $table CHANGE COLUMN $name
    $name
    # $binary");
    #

    $self->do("ALTER TABLE $table CHANGE COLUMN $name
    $name
    $utf8");
    ************************

    Run. Check with phpMyAdmin. Check with Bugzilla 3.0. Celebrate - a
    complete success.

    But ... how come?

    In essence, checksetup.pl behaved the same, regardless of whether
    contrib/recode.pl had been run or not. Unmodified, it would always
    corrupt my data.

    I did not see contrib/recode.pl doing anything useful, but the ALTER
    TABLE to utf8 worked fine for me, out of the box, no running of
    contrib/recode.pl involved at all.

    Is this a peculiarity of the XAMPP distribution on Windows? The rather
    massive switch of things (not one thing at a time, but change OS,
    database version, Perl version, and Bugzilla version, all at the same
    time)?

    Can someone please help me understand what is going on here?

    Thanks!


  2. Re: Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl +checksetup.pl)

    On 15 May 2007 08:43:25 -0700 Stefan Hoffmeister
    wrote:
    > Running contrib/recode.pl on the database works, but does not seem to
    > do anything (i.e. it does not seem to alter to the worse or better -
    > the data is simple visible the same in phpMyAdmin as it was before
    > running contrib/recode.pl)


    Okay. The steps you need to do are:

    1) Run checksetup.pl until it tells you to run recode.pl.
    2) Run contrib/recode.pl --charset=cp1252
    3) Run checksetup.pl again.

    > But ... how come?


    Because the data in your database was in the actual encoding
    the database claimed it was in. That is, your data was in Latin1, and
    the DB was encoded in Latin1. This isn't the case for a vast number of
    Bugzilla installations.

    -Max
    --
    http://www.everythingsolved.com/
    Competent, Friendly Bugzilla Services. And Everything Else, too.

  3. Re: Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl + checksetup.pl)

    On 15 Mai, 21:46, Max Kanat-Alexander wrote:
    > On 15 May 2007 08:43:25 -0700 Stefan Hoffmeister


    Max, many thanks for your reply!

    Your steps did work (and this included the expected temporary garbling
    of text inside the mysql database), but they open another can of worms
    with questions :-)

    > 2) Run contrib/recode.pl --charset=cp1252


    Why cp1252? This is a Windows code page?

    I had used "--charset=iso-8859-1" before, as that is latin1 (which is
    the mysql box on Linux). Also the columns had shown up as latin1 in
    phpMyAdmin after the reimport into the Windows-based mysql database.

    I am currently adding tests with a variety of "international"
    characters (German umlauts, characters from the set where ISO 8859-1
    and Windows 1252 do not match - http://en.wikipedia.org/wiki/Windows-1252
    to the rescue!) - if I run into additional difficulties, I'll be sure
    to speak up ;-).



  4. Re: Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl + checksetup.pl)

    Stefan Hoffmeister wrote:
    >> 2) Run contrib/recode.pl --charset=cp1252

    >
    > Why cp1252? This is a Windows code page?


    Yes, but it fully includes iso-8859-1.
    So it's never a problem to convert iso-8859-1 data using the cp1252
    codepage, but it will be a problem to convert using iso-8859-1 if any
    cp1252 character has slipped in.

    > I had used "--charset=iso-8859-1" before, as that is latin1 (which is
    > the mysql box on Linux)


    What counts is the OS of the user that entered the data in the bugs.

    Those under linux who used iso-8859-1 won't be a problem, but there will
    almost always be many who used windows, and those might have inserted
    some cp1252 characters, which the old bugzilla will not have filtered
    out before inserting in the base.

  5. Re: Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl + checksetup.pl)

    On 16 Mai, 13:40, Jean-Marc Desperrier wrote:

    > What counts is the OS of the user that entered the data in the bugs.


    Ah! That's it, then.

    I am confident that the "international" users only made use of Windows
    clients (both IE and Firefox). Will make sure, though, that the
    migration setup works, regardless of which client created the text.

    What I still am surprised about, though, is that I could hack things
    to work just fine by

    * mysql bugs < the.dump
    * edit Bugzilla's migration script to not use "locale encoding ->
    binary -> utf8" but to "locale encoding -> utf8"
    * ./checksetup.pl

    I do not recall, though, whether the Euro symbol (primary suspect,
    because its binary encoding is different in 8859-15 and cp1252)
    converted fine with the hack. This test I added later, and this test
    definitely passes when following Max's steps :-)


  6. Re: Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl + checksetup.pl)

    Thanks a lot - helped me alot moving Bugzilla 2.20 to 3.6.3 from Redhat to Centos machine - Saved my day and hours

  7. Re: Trouble moving to UTF-8 with Bugzilla 3.0 (recode.pl + checksetup.pl)

    Thanks for sharing the informative post.

+ Reply to Thread