Thread: Why do we let CREATE DATABASE reassign encoding?
If I have locale set to C, I can do this: regression=# create database u8 encoding 'utf8'; CREATE DATABASE regression=# create database l1 encoding 'latin1' template u8; CREATE DATABASE Had I had any actual utf8 data in u8, l1 would now contain encoding-corrupt information. Given that we've tried to clamp down on encoding violations in recent releases, I wonder why this case is still allowed. (In non-C locales, this will typically fail because the two different encodings can't both match the locale. But I don't believe it's our policy to enforce encoding validity only for non-C locales.) We should presumably let the encoding be changed when cloning from template0, and probably it's reasonable to trust the user if either source or destination DB encoding is SQL_ASCII. In other cases I'm thinking it should fail. regards, tom lane
Tom Lane wrote: > If I have locale set to C, I can do this: > > regression=# create database u8 encoding 'utf8'; > CREATE DATABASE > regression=# create database l1 encoding 'latin1' template u8; > CREATE DATABASE > > Had I had any actual utf8 data in u8, l1 would now contain > encoding-corrupt information. Given that we've tried to > clamp down on encoding violations in recent releases, I wonder > why this case is still allowed. Wow, I'm surprised we allow that. Never occurred to me to try. > We should presumably let the encoding be changed when cloning > from template0, and probably it's reasonable to trust the user > if either source or destination DB encoding is SQL_ASCII. > In other cases I'm thinking it should fail. Agreed, that's exactly what we did with per-database collation. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
In response to Tom Lane <tgl@sss.pgh.pa.us>: > If I have locale set to C, I can do this: > > regression=# create database u8 encoding 'utf8'; > CREATE DATABASE > regression=# create database l1 encoding 'latin1' template u8; > CREATE DATABASE > > Had I had any actual utf8 data in u8, l1 would now contain > encoding-corrupt information. Given that we've tried to > clamp down on encoding violations in recent releases, I wonder > why this case is still allowed. > > (In non-C locales, this will typically fail because the two > different encodings can't both match the locale. But I don't > believe it's our policy to enforce encoding validity only for > non-C locales.) > > We should presumably let the encoding be changed when cloning > from template0, and probably it's reasonable to trust the user > if either source or destination DB encoding is SQL_ASCII. > In other cases I'm thinking it should fail. On a pedantic level, doesn't this remove the ability to have databases on a single cluster that are different encodings? I mean, if template1 is utf8, and I can't change that using CREATE DATABASE, then I'm stuck with utf8 for all databases on that cluster ... unless I'm missing something. Granted, there's the potential for special cases with databases used only for templates, but as I see it, this should be allowed, it should just fail if any data in the template can't be converted to the desired encoding. I mean, I can always alter template1 by inserting non-utf8 data, and then try to use it to create a utf8 encoded database ... -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/
Tom Lane wrote: > If I have locale set to C, I can do this: > > regression=# create database u8 encoding 'utf8'; > CREATE DATABASE > regression=# create database l1 encoding 'latin1' template u8; > CREATE DATABASE > > Had I had any actual utf8 data in u8, l1 would now contain > encoding-corrupt information. Given that we've tried to > clamp down on encoding violations in recent releases, I wonder > why this case is still allowed. > > (In non-C locales, this will typically fail because the two > different encodings can't both match the locale. But I don't > believe it's our policy to enforce encoding validity only for > non-C locales.) > > We should presumably let the encoding be changed when cloning > from template0, and probably it's reasonable to trust the user > if either source or destination DB encoding is SQL_ASCII. > In other cases I'm thinking it should fail. > > > Really? You want to forbid selecting an encoding when the source is template1, which is the default, and template1 is not SQL_ASCII? So the following sequence woiuld be illegal: initdb -E latin1 createdb -E utf8 I think we have a bit more thinking to do on this - I don't have a reasonable solution immediately in my head. cheers andrew
Bill Moran <wmoran@potentialtech.com> writes: > In response to Tom Lane <tgl@sss.pgh.pa.us>: >> We should presumably let the encoding be changed when cloning >> from template0, and probably it's reasonable to trust the user >> if either source or destination DB encoding is SQL_ASCII. >> In other cases I'm thinking it should fail. > On a pedantic level, doesn't this remove the ability to have > databases on a single cluster that are different encodings? I mean, > if template1 is utf8, and I can't change that using CREATE > DATABASE, then I'm stuck with utf8 for all databases on that > cluster ... unless I'm missing something. You're supposed to clone from template0, not template1, when creating DBs that are different in either encoding or locale from the installation default. We already enforce this except for having missed the special case of C locale. (There might be some corner cases involving UTF8 on Windows, too; not sure about that.) The reason is that template0 is expected to contain only ASCII data, but template1 might not. regards, tom lane
Bill Moran wrote: > In response to Tom Lane <tgl@sss.pgh.pa.us>: >> We should presumably let the encoding be changed when cloning >> from template0, and probably it's reasonable to trust the user >> if either source or destination DB encoding is SQL_ASCII. >> In other cases I'm thinking it should fail. > > On a pedantic level, doesn't this remove the ability to have > databases on a single cluster that are different encodings? I mean, > if template1 is utf8, and I can't change that using CREATE > DATABASE, then I'm stuck with utf8 for all databases on that > cluster ... unless I'm missing something. You could still use template0 as template for a database with any encoding, like: CREATE DATABASE .. TEMPLATE = template0; We can special case template0 because we know its contents are pure 7-bit ascii which is compatible with any server encoding. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > So the following sequence woiuld be illegal: > initdb -E latin1 > createdb -E utf8 Yes, that's rather the point. Note that it already *is* illegal unless you happen to have selected C locale; AFAICS that is an oversight and not intentional. For instance, going in the other direction in en_US locale, I get $ createdb -E latin1 l1 createdb: database creation failed: ERROR: encoding LATIN1 does not match locale en_US.utf8 DETAIL: The chosen LC_CTYPE setting requires encoding UTF8. You can get around this by cloning template0 instead of template1 (we assume template0 contains nothing that's encoding-specific). Possibly the docs will need to be improved to emphasize that. regards, tom lane
Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > >> So the following sequence woiuld be illegal: >> > > >> initdb -E latin1 >> createdb -E utf8 >> > > Yes, that's rather the point. Note that it already *is* illegal > unless you happen to have selected C locale; AFAICS that is an > oversight and not intentional. > Ok, then I agree that we need a big warning on that in the docs. cheers andrew
In response to Tom Lane <tgl@sss.pgh.pa.us>: > Bill Moran <wmoran@potentialtech.com> writes: > > In response to Tom Lane <tgl@sss.pgh.pa.us>: > >> We should presumably let the encoding be changed when cloning > >> from template0, and probably it's reasonable to trust the user > >> if either source or destination DB encoding is SQL_ASCII. > >> In other cases I'm thinking it should fail. > > > On a pedantic level, doesn't this remove the ability to have > > databases on a single cluster that are different encodings? I mean, > > if template1 is utf8, and I can't change that using CREATE > > DATABASE, then I'm stuck with utf8 for all databases on that > > cluster ... unless I'm missing something. > > You're supposed to clone from template0, not template1, when creating > DBs that are different in either encoding or locale from the > installation default. We already enforce this except for having missed > the special case of C locale. Ah ... was not aware of that. It hasn't come up in my usage. > (There might be some corner cases > involving UTF8 on Windows, too; not sure about that.) The reason is > that template0 is expected to contain only ASCII data, but template1 > might not. Makes sense, with that explanation. -- Bill Moran http://www.potentialtech.com http://people.collaborativefusion.com/~wmoran/
So it would still be possible to byass this check by cloning a database into SQL_ASCII and then cloning it into the desired encoding? Doesn't sound like it really accomplishes much. I do seem to recall some discussion about this way back. I don't recall the conclusion but I remember some talk about detecting an empty template database and the ned to reindex. -- Greg On 23 Apr 2009, at 19:46, Bill Moran <wmoran@potentialtech.com> wrote: > In response to Tom Lane <tgl@sss.pgh.pa.us>: > >> If I have locale set to C, I can do this: >> >> regression=# create database u8 encoding 'utf8'; >> CREATE DATABASE >> regression=# create database l1 encoding 'latin1' template u8; >> CREATE DATABASE >> >> Had I had any actual utf8 data in u8, l1 would now contain >> encoding-corrupt information. Given that we've tried to >> clamp down on encoding violations in recent releases, I wonder >> why this case is still allowed. >> >> (In non-C locales, this will typically fail because the two >> different encodings can't both match the locale. But I don't >> believe it's our policy to enforce encoding validity only for >> non-C locales.) >> >> We should presumably let the encoding be changed when cloning >> from template0, and probably it's reasonable to trust the user >> if either source or destination DB encoding is SQL_ASCII. >> In other cases I'm thinking it should fail. > > On a pedantic level, doesn't this remove the ability to have > databases on a single cluster that are different encodings? I mean, > if template1 is utf8, and I can't change that using CREATE > DATABASE, then I'm stuck with utf8 for all databases on that > cluster ... unless I'm missing something. > > Granted, there's the potential for special cases with databases used > only for templates, but as I see it, this should be allowed, it should > just fail if any data in the template can't be converted to the > desired encoding. I mean, I can always alter template1 by inserting > non-utf8 data, and then try to use it to create a utf8 encoded > database ... > > -- > Bill Moran > http://www.potentialtech.com > http://people.collaborativefusion.com/~wmoran/ > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers
Greg Stark <greg.stark@enterprisedb.com> writes: > So it would still be possible to byass this check by cloning a > database into SQL_ASCII and then cloning it into the desired encoding? > Doesn't sound like it really accomplishes much. Well, it accomplishes preventing stupid encoding violations. The point came to mind when I saw a post a little bit ago on -general in which the poster seemed to imagine that CREATE DATABASE would convert encodings for him. Since that is not happening in the foreseeable future, I think we need to prevent the system from acting like it would work. If we wanted to be entirely anal about this, we could allow SQL_ASCII destination with a different source encoding, but not the reverse. However, we currently consider that you're on your own to ensure sanity when using SQL_ASCII as far as locale goes, so I'm not sure why the policy would be different for encoding. regards, tom lane
Tom Lane wrote: > If we wanted to be entirely anal about this, we could allow SQL_ASCII > destination with a different source encoding, but not the reverse. > However, we currently consider that you're on your own to ensure sanity > when using SQL_ASCII as far as locale goes, so I'm not sure why the > policy would be different for encoding. > > > The trouble is that people won't know the provenance of their database. I think we should try to guarantee as far as possible that if a database has encoding X then all the data in it is valid in that encoding. cheers andrew
On Apr 23, 2009, at 12:00 PM, Tom Lane wrote: > You can get around this by cloning template0 instead of template1 > (we assume template0 contains nothing that's encoding-specific). > Possibly the docs will need to be improved to emphasize that. I was just about to suggest that. With this change, template0 is suddenly going to be a lot more important for people to know about and make use of. Best, David
On Thursday 23 April 2009 22:00:25 Tom Lane wrote: > Andrew Dunstan <andrew@dunslane.net> writes: > > So the following sequence woiuld be illegal: > > > > initdb -E latin1 > > createdb -E utf8 > > Yes, that's rather the point. Note that it already is illegal > unless you happen to have selected C locale; AFAIR, the only reason that we haven't disallowed this sort of stuff years and years ago is that people use it; the Japanese in particular. I don't see what is different now.
Peter Eisentraut <peter_e@gmx.net> writes: > AFAIR, the only reason that we haven't disallowed this sort of stuff > years and years ago is that people use it; the Japanese in particular. > I don't see what is different now. What's different now is that 8.4 has already established the principle that you have to clone template0 if you want to change the locale of a database. I think this is a good time to establish the same principle for encodings. (Or in other words, if we don't fix it now, when will be a better time?) regards, tom lane
I wrote: > Peter Eisentraut <peter_e@gmx.net> writes: >> AFAIR, the only reason that we haven't disallowed this sort of stuff >> years and years ago is that people use it; the Japanese in particular. >> I don't see what is different now. > What's different now is that 8.4 has already established the principle > that you have to clone template0 if you want to change the locale of a > database. I think this is a good time to establish the same principle > for encodings. (Or in other words, if we don't fix it now, when will > be a better time?) Attached is a proposed patch (without documentation changes as yet) for this. Since the code is already enforcing exact locale match when cloning a non-template0 database, I just made it act the same for encoding, without any strange exceptions for SQL_ASCII. I found that mbregress.sh was already broken by the existing restrictions, if you try to use it in a database whose default locale isn't C. The patch adds switches to fix that. The patch also incidentally fixes a few ereport's that were missing errcode values. Last chance for objections ... regards, tom lane Index: src/backend/commands/dbcommands.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/commands/dbcommands.c,v retrieving revision 1.223 diff -c -r1.223 dbcommands.c *** src/backend/commands/dbcommands.c 5 May 2009 23:39:55 -0000 1.223 --- src/backend/commands/dbcommands.c 6 May 2009 00:30:59 -0000 *************** *** 361,367 **** #endif (encoding == PG_SQL_ASCII && superuser()))) ereport(ERROR, ! (errmsg("encoding %s does not match locale %s", pg_encoding_to_char(encoding), dbctype), errdetail("The chosen LC_CTYPE setting requires encoding %s.", --- 361,368 ---- #endif (encoding == PG_SQL_ASCII && superuser()))) ereport(ERROR, ! (errcode(ERRCODE_INVALID_PARAMETER_VALUE), ! errmsg("encoding %s does not match locale %s", pg_encoding_to_char(encoding), dbctype), errdetail("The chosen LC_CTYPE setting requires encoding %s.", *************** *** 374,402 **** #endif (encoding == PG_SQL_ASCII && superuser()))) ereport(ERROR, ! (errmsg("encoding %s does not match locale %s", pg_encoding_to_char(encoding), dbcollate), errdetail("The chosen LC_COLLATE setting requires encoding %s.", pg_encoding_to_char(collate_encoding)))); /* ! * Check that the new locale is compatible with the source database. * ! * We know that template0 doesn't contain any indexes that depend on ! * collation or ctype, so template0 can be used as template for ! * any locale. */ if (strcmp(dbtemplate, "template0") != 0) { if (strcmp(dbcollate, src_collate) != 0) ereport(ERROR, ! (errmsg("new collation is incompatible with the collation of the template database (%s)", src_collate), errhint("Use the same collation as in the template database, or use template0 as template."))); if (strcmp(dbctype, src_ctype) != 0) ereport(ERROR, ! (errmsg("new LC_CTYPE is incompatible with LC_CTYPE of the template database (%s)", src_ctype), errhint("Use the same LC_CTYPE as in the template database, or use template0 as template."))); } --- 375,419 ---- #endif (encoding == PG_SQL_ASCII && superuser()))) ereport(ERROR, ! (errcode(ERRCODE_INVALID_PARAMETER_VALUE), ! errmsg("encoding %s does not match locale %s", pg_encoding_to_char(encoding), dbcollate), errdetail("The chosen LC_COLLATE setting requires encoding %s.", pg_encoding_to_char(collate_encoding)))); /* ! * Check that the new encoding and locale settings match the source ! * database. We insist on this because we simply copy the source data --- ! * any non-ASCII data would be wrongly encoded, and any indexes sorted ! * according to the source locale would be wrong. * ! * However, we assume that template0 doesn't contain any non-ASCII data ! * nor any indexes that depend on collation or ctype, so template0 can be ! * used as template for creating a database with any encoding or locale. */ if (strcmp(dbtemplate, "template0") != 0) { + if (encoding != src_encoding) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("new encoding (%s) is incompatible with the encoding of the template database (%s)", + pg_encoding_to_char(encoding), + pg_encoding_to_char(src_encoding)), + errhint("Use the same encoding as in the template database, or use template0 as template."))); + if (strcmp(dbcollate, src_collate) != 0) ereport(ERROR, ! (errcode(ERRCODE_INVALID_PARAMETER_VALUE), ! errmsg("new collation (%s) is incompatible with the collation of the template database (%s)", ! dbcollate, src_collate), errhint("Use the same collation as in the template database, or use template0 as template."))); if (strcmp(dbctype, src_ctype) != 0) ereport(ERROR, ! (errcode(ERRCODE_INVALID_PARAMETER_VALUE), ! errmsg("new LC_CTYPE (%s) is incompatible with the LC_CTYPE of the template database (%s)", ! dbctype, src_ctype), errhint("Use the same LC_CTYPE as in the template database, or use template0 as template."))); } *************** *** 1099,1105 **** continue; ereport(ERROR, ! (errmsg("some relations of database \"%s\" are already in tablespace \"%s\"", dbname, tblspcname), errhint("You must move them back to the database's default tablespace before using this command."))); } --- 1116,1123 ---- continue; ereport(ERROR, ! (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), ! errmsg("some relations of database \"%s\" are already in tablespace \"%s\"", dbname, tblspcname), errhint("You must move them back to the database's default tablespace before using this command."))); } Index: src/test/mb/mbregress.sh =================================================================== RCS file: /cvsroot/pgsql/src/test/mb/mbregress.sh,v retrieving revision 1.9 diff -c -r1.9 mbregress.sh *** src/test/mb/mbregress.sh 24 Jun 2005 15:11:59 -0000 1.9 --- src/test/mb/mbregress.sh 6 May 2009 00:30:59 -0000 *************** *** 15,21 **** fi dropdb utf8 ! createdb -E UTF8 utf8 PSQL="psql -n -e -q" tests="euc_jp sjis euc_kr euc_cn euc_tw big5 utf8 mule_internal" --- 15,21 ---- fi dropdb utf8 ! createdb -T template0 -l C -E UTF8 utf8 PSQL="psql -n -e -q" tests="euc_jp sjis euc_kr euc_cn euc_tw big5 utf8 mule_internal" *************** *** 36,42 **** unset PGCLIENTENCODING else dropdb $i >/dev/null 2>&1 ! createdb -E `echo $i | tr 'abcdefghijklmnopqrstuvwxyz' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'` $i >/dev/null $PSQL $i < sql/${i}.sql > results/${i}.out 2>&1 fi --- 36,42 ---- unset PGCLIENTENCODING else dropdb $i >/dev/null 2>&1 ! createdb -T template0 -l C -E `echo $i | tr 'abcdefghijklmnopqrstuvwxyz' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'` $i >/dev/null $PSQL $i < sql/${i}.sql > results/${i}.out 2>&1 fi