Thread: Recommended approach for upgrading DBs with nonmatching encodings

Recommended approach for upgrading DBs with nonmatching encodings

From

Martin Pitt

Date:

30 March 2008, 15:40:51

Hello all,

as already known, 8.3 now enforces a match between DB encoding and
server locale [1]. I agree that this is a Good Thing=E2=84=A2, but it causes
automatic upgrades from previous versions to 8.3 to fail with
something like

pg_restore: [archiver (db)] could not execute query: ERROR:  encoding LATIN=
1 does not match server's locale it_IT.UTF-8
DETAIL:  The server's LC_CTYPE setting requires encoding UTF8.
    Command was:=20
CREATE DATABASE lixpergroupware WITH TEMPLATE =3D template0 ENCODING =3D 'L=
ATIN1';

There were some proposals [3][4] to have the server run under locale C
or POSIX, but IMHO this is rather an aggravation than a viable
solution (locales are a good thing), so I rather don't do that.

My gut feeling is that the right approach would be to create all
target (8.3) databases with a correct encoding that matches the server
locale and have the character data from the pg_dump converted on the
fly (with iconv, or with pg_dump itself), so that people get from a
broken 8.1/8.2 setup to a good 8.3 setup painlessly.

I'm seeking some input on (1) whether above approach is the correct
one, and (2) suggestions how to implement it properly.

My current pg_upgradecluster uses pg_dumpall to copy the schema, and a
per-db pg_dump to copy the DB table contents. Will calling pg_dump
with  --encoding=3D<encoding of 8.3 server's locale> always DTRT and is
it meant to solve this problem? The common case is --encoding=3DUTF-8,
but of course in theory someone might also have it the other way
around, so the upgrade should fail if it encounters an UTF-8 character
which cannot be encoded into an. e. g.  LATIN1 character.

Thank you all for any suggestion,

Martin

[1] http://www.nabble.com/BUG--3924:-Create-Database-with-another-encoding-=
as-the-encoding-from-postgres-td15246051.html
[2] http://bugs.debian.org/472930
[3] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=3D472930#10
[4] https://bugs.launchpad.net/207779

--=20
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)

Re: Recommended approach for upgrading DBs with nonmatching encodings

From

Martin Pitt

Date:

30 March 2008, 17:39:37

Hi again,
Martin Pitt [2008-03-30 20:40 +0200]:
> My current pg_upgradecluster uses pg_dumpall to copy the schema, and a
> per-db pg_dump to copy the DB table contents. Will calling pg_dump
> with  --encoding=3D<encoding of 8.3 server's locale> always DTRT and is
> it meant to solve this problem?=20

One problem with this is that while pg_dump -E UTF8 works (with SQL
output), -E does not seem to have any effect when using -Fc. However,
-Fc is a must for a reasonably robust automatic upgrade script.

Thanks in advance,

Martin
--=20
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)

Re: Recommended approach for upgrading DBs with nonmatching encodings

From

Tom Lane

Date:

30 March 2008, 17:44:01

Martin Pitt <martin@piware.de> writes:
> One problem with this is that while pg_dump -E UTF8 works (with SQL
> output), -E does not seem to have any effect when using -Fc.

Huh?  Please provide a test case.

            regards, tom lane

Re: Recommended approach for upgrading DBs with nonmatching encodings

From

Martin Pitt

Date:

30 March 2008, 17:58:47

Tom Lane [2008-03-30 16:43 -0400]:
> Martin Pitt <martin@piware.de> writes:
> > One problem with this is that while pg_dump -E UTF8 works (with SQL
> > output), -E does not seem to have any effect when using -Fc.
>
> Huh?  Please provide a test case.

Ah, I got it. This fails:

  pg_dump -Fc -E UTF8 -p 5432  latin1test | pg_restore -p 5433 -d template1 -C

(5432 is 8.1, 5433 is 8.3, both with locale ru_RU.UTF-8;
createdb -E latin1 latin1test)

But if I create the DB beforehand (with correct encoding) and then
dump/restore without using -C, it works fine:

  createdb -p 5433 latin1test
  pg_dump -Fc -p 5432  latin1test | pg_restore -p 5433 -d latin1test

In that case I do not even need to specify -E. Seems that
pg_dump/pg_restore are clever enough to detect encodings and necessary
conversions.

So this seems to be the cleanest approach to me, and it's free of
hacks. pg_restore restores the correct owner of the DB, so calling
createdb as the DB superuser does not harm.

Thanks,

Martin
--
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)

Re: Recommended approach for upgrading DBs with nonmatching encodings

From

Tom Lane

Date:

30 March 2008, 18:15:22

Martin Pitt <martin@piware.de> writes:
> Tom Lane [2008-03-30 16:43 -0400]:
>> Huh?  Please provide a test case.

> Ah, I got it. This fails:

>   pg_dump -Fc -E UTF8 -p 5432  latin1test | pg_restore -p 5433 -d template1 -C

> (5432 is 8.1, 5433 is 8.3, both with locale ru_RU.UTF-8;
> createdb -E latin1 latin1test)

Yeah.  This will try to create the new latin1test with all the same
properties it had before, including encoding, and 8.3 intentionally
rejects that.  Your 8.1 setup is pretty broken too (it will misbehave
in various ways because of the encoding mismatch), but 8.1 fails to
realize that.

> In that case I do not even need to specify -E. Seems that
> pg_dump/pg_restore are clever enough to detect encodings and necessary
> conversions.

Yeah, there's usually little value in -E unless you're planning to
do something else with the dump than just feed it to pg_restore.
(If you wanted to export to some other DBMS, for example, it could
be useful.)  In particular -E has entirely zip bearing on what
database encoding will be assigned during restore.

            regards, tom lane

Re: Recommended approach for upgrading DBs with nonmatching encodings

From

Martin Pitt

Date:

31 March 2008, 03:22:52

Hi Tom,

Tom Lane [2008-03-30 17:15 -0400]:
> >   pg_dump -Fc -E UTF8 -p 5432  latin1test | pg_restore -p 5433 -d template1 -C
>
> Yeah.  This will try to create the new latin1test with all the same
> properties it had before, including encoding

I see, so it's intended to behave like this. man pg_dump is a little
unclear on this.

>, and 8.3 intentionally rejects that.

Rightly so.

> Your 8.1 setup is pretty broken too (it will misbehave in various
> ways because of the encoding mismatch), but 8.1 fails to realize
> that.

Right, I know. It is the reproducer for the upgrade problems many
people have (see quoted bug reports in initial mail), and what I now
use in the postgresql-common test suite.

> Yeah, there's usually little value in -E unless you're planning to
> do something else with the dump than just feed it to pg_restore.
> (If you wanted to export to some other DBMS, for example, it could
> be useful.)  In particular -E has entirely zip bearing on what
> database encoding will be assigned during restore.

Thanks for the clarification. So it seems createdb+pg_restore is the
way to go, and pg_restore -C does not DTRT for my purpose (correcting
DB encodings on upgrades).

BTW, many people seem to run the server under C and use different
encodings in their DBs (latin, UTF8). Shouldn't that cause similar
problems with collation, data type checking (ischar(), etc.)? What do
you recommend should the upgrade script do if it encounters an 8.[12]
server running under C?

Thanks,

Martin

--
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)

Re: Recommended approach for upgrading DBs with nonmatching encodings

From

Tom Lane

Date:

31 March 2008, 03:38:55

Martin Pitt <martin@piware.de> writes:
> BTW, many people seem to run the server under C and use different
> encodings in their DBs (latin, UTF8). Shouldn't that cause similar
> problems with collation, data type checking (ischar(), etc.)? What do
> you recommend should the upgrade script do if it encounters an 8.[12]
> server running under C?

Yeah, C locale will work with any encoding, for small values of "work"
--- one problem is that you get byte-by-byte sorting which doesn't act
all that sensibly in a lot of encodings, and another is that
upper()/lower() only work for the base ASCII letters.

I gather that this behavior is just fine with a lot of the Far Eastern
contingent, though you'd have to ask them for details.  In any case
you should not second-guess such a configuration during upgrade.
People do use it.

            regards, tom lane