Thread: Recommended approach for upgrading DBs with nonmatching encodings
Hello all, as already known, 8.3 now enforces a match between DB encoding and server locale [1]. I agree that this is a Good Thing=E2=84=A2, but it causes automatic upgrades from previous versions to 8.3 to fail with something like pg_restore: [archiver (db)] could not execute query: ERROR: encoding LATIN= 1 does not match server's locale it_IT.UTF-8 DETAIL: The server's LC_CTYPE setting requires encoding UTF8. Command was:=20 CREATE DATABASE lixpergroupware WITH TEMPLATE =3D template0 ENCODING =3D 'L= ATIN1'; There were some proposals [3][4] to have the server run under locale C or POSIX, but IMHO this is rather an aggravation than a viable solution (locales are a good thing), so I rather don't do that. My gut feeling is that the right approach would be to create all target (8.3) databases with a correct encoding that matches the server locale and have the character data from the pg_dump converted on the fly (with iconv, or with pg_dump itself), so that people get from a broken 8.1/8.2 setup to a good 8.3 setup painlessly. I'm seeking some input on (1) whether above approach is the correct one, and (2) suggestions how to implement it properly. My current pg_upgradecluster uses pg_dumpall to copy the schema, and a per-db pg_dump to copy the DB table contents. Will calling pg_dump with --encoding=3D<encoding of 8.3 server's locale> always DTRT and is it meant to solve this problem? The common case is --encoding=3DUTF-8, but of course in theory someone might also have it the other way around, so the upgrade should fail if it encounters an UTF-8 character which cannot be encoded into an. e. g. LATIN1 character. Thank you all for any suggestion, Martin [1] http://www.nabble.com/BUG--3924:-Create-Database-with-another-encoding-= as-the-encoding-from-postgres-td15246051.html [2] http://bugs.debian.org/472930 [3] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=3D472930#10 [4] https://bugs.launchpad.net/207779 --=20 Martin Pitt | http://www.piware.de Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)
Hi again, Martin Pitt [2008-03-30 20:40 +0200]: > My current pg_upgradecluster uses pg_dumpall to copy the schema, and a > per-db pg_dump to copy the DB table contents. Will calling pg_dump > with --encoding=3D<encoding of 8.3 server's locale> always DTRT and is > it meant to solve this problem?=20 One problem with this is that while pg_dump -E UTF8 works (with SQL output), -E does not seem to have any effect when using -Fc. However, -Fc is a must for a reasonably robust automatic upgrade script. Thanks in advance, Martin --=20 Martin Pitt | http://www.piware.de Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)
Martin Pitt <martin@piware.de> writes: > One problem with this is that while pg_dump -E UTF8 works (with SQL > output), -E does not seem to have any effect when using -Fc. Huh? Please provide a test case. regards, tom lane
Tom Lane [2008-03-30 16:43 -0400]: > Martin Pitt <martin@piware.de> writes: > > One problem with this is that while pg_dump -E UTF8 works (with SQL > > output), -E does not seem to have any effect when using -Fc. > > Huh? Please provide a test case. Ah, I got it. This fails: pg_dump -Fc -E UTF8 -p 5432 latin1test | pg_restore -p 5433 -d template1 -C (5432 is 8.1, 5433 is 8.3, both with locale ru_RU.UTF-8; createdb -E latin1 latin1test) But if I create the DB beforehand (with correct encoding) and then dump/restore without using -C, it works fine: createdb -p 5433 latin1test pg_dump -Fc -p 5432 latin1test | pg_restore -p 5433 -d latin1test In that case I do not even need to specify -E. Seems that pg_dump/pg_restore are clever enough to detect encodings and necessary conversions. So this seems to be the cleanest approach to me, and it's free of hacks. pg_restore restores the correct owner of the DB, so calling createdb as the DB superuser does not harm. Thanks, Martin -- Martin Pitt | http://www.piware.de Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)
Martin Pitt <martin@piware.de> writes: > Tom Lane [2008-03-30 16:43 -0400]: >> Huh? Please provide a test case. > Ah, I got it. This fails: > pg_dump -Fc -E UTF8 -p 5432 latin1test | pg_restore -p 5433 -d template1 -C > (5432 is 8.1, 5433 is 8.3, both with locale ru_RU.UTF-8; > createdb -E latin1 latin1test) Yeah. This will try to create the new latin1test with all the same properties it had before, including encoding, and 8.3 intentionally rejects that. Your 8.1 setup is pretty broken too (it will misbehave in various ways because of the encoding mismatch), but 8.1 fails to realize that. > In that case I do not even need to specify -E. Seems that > pg_dump/pg_restore are clever enough to detect encodings and necessary > conversions. Yeah, there's usually little value in -E unless you're planning to do something else with the dump than just feed it to pg_restore. (If you wanted to export to some other DBMS, for example, it could be useful.) In particular -E has entirely zip bearing on what database encoding will be assigned during restore. regards, tom lane
Hi Tom, Tom Lane [2008-03-30 17:15 -0400]: > > pg_dump -Fc -E UTF8 -p 5432 latin1test | pg_restore -p 5433 -d template1 -C > > Yeah. This will try to create the new latin1test with all the same > properties it had before, including encoding I see, so it's intended to behave like this. man pg_dump is a little unclear on this. >, and 8.3 intentionally rejects that. Rightly so. > Your 8.1 setup is pretty broken too (it will misbehave in various > ways because of the encoding mismatch), but 8.1 fails to realize > that. Right, I know. It is the reproducer for the upgrade problems many people have (see quoted bug reports in initial mail), and what I now use in the postgresql-common test suite. > Yeah, there's usually little value in -E unless you're planning to > do something else with the dump than just feed it to pg_restore. > (If you wanted to export to some other DBMS, for example, it could > be useful.) In particular -E has entirely zip bearing on what > database encoding will be assigned during restore. Thanks for the clarification. So it seems createdb+pg_restore is the way to go, and pg_restore -C does not DTRT for my purpose (correcting DB encodings on upgrades). BTW, many people seem to run the server under C and use different encodings in their DBs (latin, UTF8). Shouldn't that cause similar problems with collation, data type checking (ischar(), etc.)? What do you recommend should the upgrade script do if it encounters an 8.[12] server running under C? Thanks, Martin -- Martin Pitt | http://www.piware.de Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)
Martin Pitt <martin@piware.de> writes: > BTW, many people seem to run the server under C and use different > encodings in their DBs (latin, UTF8). Shouldn't that cause similar > problems with collation, data type checking (ischar(), etc.)? What do > you recommend should the upgrade script do if it encounters an 8.[12] > server running under C? Yeah, C locale will work with any encoding, for small values of "work" --- one problem is that you get byte-by-byte sorting which doesn't act all that sensibly in a lot of encodings, and another is that upper()/lower() only work for the base ASCII letters. I gather that this behavior is just fine with a lot of the Far Eastern contingent, though you'd have to ask them for details. In any case you should not second-guess such a configuration during upgrade. People do use it. regards, tom lane