Thread: Why do we let CREATE DATABASE reassign encoding?

Why do we let CREATE DATABASE reassign encoding?

From

Tom Lane

Date:

23 April 2009, 14:46:47

If I have locale set to C, I can do this:

regression=# create database u8 encoding 'utf8';
CREATE DATABASE
regression=# create database l1 encoding 'latin1' template u8;
CREATE DATABASE

Had I had any actual utf8 data in u8, l1 would now contain
encoding-corrupt information.  Given that we've tried to
clamp down on encoding violations in recent releases, I wonder
why this case is still allowed.

(In non-C locales, this will typically fail because the two
different encodings can't both match the locale.  But I don't
believe it's our policy to enforce encoding validity only for
non-C locales.)

We should presumably let the encoding be changed when cloning
from template0, and probably it's reasonable to trust the user
if either source or destination DB encoding is SQL_ASCII.
In other cases I'm thinking it should fail.
        regards, tom lane

Re: Why do we let CREATE DATABASE reassign encoding?

From

Heikki Linnakangas

Date:

23 April 2009, 15:09:42

Tom Lane wrote:
> If I have locale set to C, I can do this:
> 
> regression=# create database u8 encoding 'utf8';
> CREATE DATABASE
> regression=# create database l1 encoding 'latin1' template u8;
> CREATE DATABASE
> 
> Had I had any actual utf8 data in u8, l1 would now contain
> encoding-corrupt information.  Given that we've tried to
> clamp down on encoding violations in recent releases, I wonder
> why this case is still allowed.

Wow, I'm surprised we allow that. Never occurred to me to try.

> We should presumably let the encoding be changed when cloning
> from template0, and probably it's reasonable to trust the user
> if either source or destination DB encoding is SQL_ASCII.
> In other cases I'm thinking it should fail.

Agreed, that's exactly what we did with per-database collation.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: Why do we let CREATE DATABASE reassign encoding?

From

Bill Moran

Date:

23 April 2009, 15:47:08

In response to Tom Lane <tgl@sss.pgh.pa.us>:

> If I have locale set to C, I can do this:
> 
> regression=# create database u8 encoding 'utf8';
> CREATE DATABASE
> regression=# create database l1 encoding 'latin1' template u8;
> CREATE DATABASE
> 
> Had I had any actual utf8 data in u8, l1 would now contain
> encoding-corrupt information.  Given that we've tried to
> clamp down on encoding violations in recent releases, I wonder
> why this case is still allowed.
> 
> (In non-C locales, this will typically fail because the two
> different encodings can't both match the locale.  But I don't
> believe it's our policy to enforce encoding validity only for
> non-C locales.)
> 
> We should presumably let the encoding be changed when cloning
> from template0, and probably it's reasonable to trust the user
> if either source or destination DB encoding is SQL_ASCII.
> In other cases I'm thinking it should fail.

On a pedantic level, doesn't this remove the ability to have
databases on a single cluster that are different encodings?  I mean,
if template1 is utf8, and I can't change that using CREATE
DATABASE, then I'm stuck with utf8 for all databases on that
cluster ... unless I'm missing something.

Granted, there's the potential for special cases with databases used
only for templates, but as I see it, this should be allowed, it should
just fail if any data in the template can't be converted to the
desired encoding.  I mean, I can always alter template1 by inserting
non-utf8 data, and then try to use it to create a utf8 encoded
database ...

-- 
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

Re: Why do we let CREATE DATABASE reassign encoding?

From

Andrew Dunstan

Date:

23 April 2009, 15:50:57


Tom Lane wrote:
> If I have locale set to C, I can do this:
>
> regression=# create database u8 encoding 'utf8';
> CREATE DATABASE
> regression=# create database l1 encoding 'latin1' template u8;
> CREATE DATABASE
>
> Had I had any actual utf8 data in u8, l1 would now contain
> encoding-corrupt information.  Given that we've tried to
> clamp down on encoding violations in recent releases, I wonder
> why this case is still allowed.
>
> (In non-C locales, this will typically fail because the two
> different encodings can't both match the locale.  But I don't
> believe it's our policy to enforce encoding validity only for
> non-C locales.)
>
> We should presumably let the encoding be changed when cloning
> from template0, and probably it's reasonable to trust the user
> if either source or destination DB encoding is SQL_ASCII.
> In other cases I'm thinking it should fail.
>
>             
>   


Really? You want to forbid selecting an encoding when the source is 
template1, which is the default, and template1 is not SQL_ASCII? So the 
following sequence woiuld be illegal:

initdb -E latin1
createdb -E utf8

I think we have a bit more thinking to do on this - I don't have a 
reasonable solution immediately in my head.


cheers

andrew

Re: Why do we let CREATE DATABASE reassign encoding?

From

Tom Lane

Date:

23 April 2009, 15:53:03

Bill Moran <wmoran@potentialtech.com> writes:
> In response to Tom Lane <tgl@sss.pgh.pa.us>:
>> We should presumably let the encoding be changed when cloning
>> from template0, and probably it's reasonable to trust the user
>> if either source or destination DB encoding is SQL_ASCII.
>> In other cases I'm thinking it should fail.

> On a pedantic level, doesn't this remove the ability to have
> databases on a single cluster that are different encodings?  I mean,
> if template1 is utf8, and I can't change that using CREATE
> DATABASE, then I'm stuck with utf8 for all databases on that
> cluster ... unless I'm missing something.

You're supposed to clone from template0, not template1, when creating
DBs that are different in either encoding or locale from the
installation default.  We already enforce this except for having missed
the special case of C locale.  (There might be some corner cases
involving UTF8 on Windows, too; not sure about that.)  The reason is
that template0 is expected to contain only ASCII data, but template1
might not.
        regards, tom lane

Re: Why do we let CREATE DATABASE reassign encoding?

From

Heikki Linnakangas

Date:

23 April 2009, 15:54:26

Bill Moran wrote:
> In response to Tom Lane <tgl@sss.pgh.pa.us>:
>> We should presumably let the encoding be changed when cloning
>> from template0, and probably it's reasonable to trust the user
>> if either source or destination DB encoding is SQL_ASCII.
>> In other cases I'm thinking it should fail.
> 
> On a pedantic level, doesn't this remove the ability to have
> databases on a single cluster that are different encodings?  I mean,
> if template1 is utf8, and I can't change that using CREATE
> DATABASE, then I'm stuck with utf8 for all databases on that
> cluster ... unless I'm missing something.

You could still use template0 as template for a database with any 
encoding, like:

CREATE DATABASE .. TEMPLATE = template0;

We can special case template0 because we know its contents are pure 
7-bit ascii which is compatible with any server encoding.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: Why do we let CREATE DATABASE reassign encoding?

From

Tom Lane

Date:

23 April 2009, 16:01:45

Andrew Dunstan <andrew@dunslane.net> writes:
> So the following sequence woiuld be illegal:

> initdb -E latin1
> createdb -E utf8

Yes, that's rather the point.  Note that it already *is* illegal
unless you happen to have selected C locale; AFAICS that is an
oversight and not intentional.  For instance, going in the other
direction in en_US locale, I get

$ createdb -E latin1 l1
createdb: database creation failed: ERROR:  encoding LATIN1 does not match locale en_US.utf8
DETAIL:  The chosen LC_CTYPE setting requires encoding UTF8.

You can get around this by cloning template0 instead of template1
(we assume template0 contains nothing that's encoding-specific).
Possibly the docs will need to be improved to emphasize that.
        regards, tom lane

Re: Why do we let CREATE DATABASE reassign encoding?

From

Andrew Dunstan

Date:

23 April 2009, 16:06:28


Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> So the following sequence woiuld be illegal:
>>     
>
>   
>> initdb -E latin1
>> createdb -E utf8
>>     
>
> Yes, that's rather the point.  Note that it already *is* illegal
> unless you happen to have selected C locale; AFAICS that is an
> oversight and not intentional.  
>   

Ok, then I agree that we need a big warning on that in the docs.

cheers

andrew

Re: Why do we let CREATE DATABASE reassign encoding?

From

Bill Moran

Date:

23 April 2009, 16:07:29

In response to Tom Lane <tgl@sss.pgh.pa.us>:

> Bill Moran <wmoran@potentialtech.com> writes:
> > In response to Tom Lane <tgl@sss.pgh.pa.us>:
> >> We should presumably let the encoding be changed when cloning
> >> from template0, and probably it's reasonable to trust the user
> >> if either source or destination DB encoding is SQL_ASCII.
> >> In other cases I'm thinking it should fail.
> 
> > On a pedantic level, doesn't this remove the ability to have
> > databases on a single cluster that are different encodings?  I mean,
> > if template1 is utf8, and I can't change that using CREATE
> > DATABASE, then I'm stuck with utf8 for all databases on that
> > cluster ... unless I'm missing something.
> 
> You're supposed to clone from template0, not template1, when creating
> DBs that are different in either encoding or locale from the
> installation default.  We already enforce this except for having missed
> the special case of C locale.

Ah ... was not aware of that.  It hasn't come up in my usage.

> (There might be some corner cases
> involving UTF8 on Windows, too; not sure about that.)  The reason is
> that template0 is expected to contain only ASCII data, but template1
> might not.

Makes sense, with that explanation.

-- 
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

Re: Why do we let CREATE DATABASE reassign encoding?

From

Greg Stark

Date:

23 April 2009, 16:08:45

So it would still be possible to byass this check by cloning a  
database into SQL_ASCII and then cloning it into the desired encoding?  
Doesn't sound like it really accomplishes much.

I do seem to recall some discussion about this way back. I don't  
recall the conclusion but I remember some talk about detecting an  
empty template database and the ned to reindex.

-- 
Greg


On 23 Apr 2009, at 19:46, Bill Moran <wmoran@potentialtech.com> wrote:

> In response to Tom Lane <tgl@sss.pgh.pa.us>:
>
>> If I have locale set to C, I can do this:
>>
>> regression=# create database u8 encoding 'utf8';
>> CREATE DATABASE
>> regression=# create database l1 encoding 'latin1' template u8;
>> CREATE DATABASE
>>
>> Had I had any actual utf8 data in u8, l1 would now contain
>> encoding-corrupt information.  Given that we've tried to
>> clamp down on encoding violations in recent releases, I wonder
>> why this case is still allowed.
>>
>> (In non-C locales, this will typically fail because the two
>> different encodings can't both match the locale.  But I don't
>> believe it's our policy to enforce encoding validity only for
>> non-C locales.)
>>
>> We should presumably let the encoding be changed when cloning
>> from template0, and probably it's reasonable to trust the user
>> if either source or destination DB encoding is SQL_ASCII.
>> In other cases I'm thinking it should fail.
>
> On a pedantic level, doesn't this remove the ability to have
> databases on a single cluster that are different encodings?  I mean,
> if template1 is utf8, and I can't change that using CREATE
> DATABASE, then I'm stuck with utf8 for all databases on that
> cluster ... unless I'm missing something.
>
> Granted, there's the potential for special cases with databases used
> only for templates, but as I see it, this should be allowed, it should
> just fail if any data in the template can't be converted to the
> desired encoding.  I mean, I can always alter template1 by inserting
> non-utf8 data, and then try to use it to create a utf8 encoded
> database ...
>
> -- 
> Bill Moran
> http://www.potentialtech.com
> http://people.collaborativefusion.com/~wmoran/
>
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

Re: Why do we let CREATE DATABASE reassign encoding?

From

Tom Lane

Date:

23 April 2009, 16:10:26

Greg Stark <greg.stark@enterprisedb.com> writes:
> So it would still be possible to byass this check by cloning a  
> database into SQL_ASCII and then cloning it into the desired encoding?  
> Doesn't sound like it really accomplishes much.

Well, it accomplishes preventing stupid encoding violations.  The point
came to mind when I saw a post a little bit ago on -general in which
the poster seemed to imagine that CREATE DATABASE would convert
encodings for him.  Since that is not happening in the foreseeable
future, I think we need to prevent the system from acting like it
would work.

If we wanted to be entirely anal about this, we could allow SQL_ASCII
destination with a different source encoding, but not the reverse.
However, we currently consider that you're on your own to ensure sanity
when using SQL_ASCII as far as locale goes, so I'm not sure why the
policy would be different for encoding.
        regards, tom lane

Re: Why do we let CREATE DATABASE reassign encoding?

From

Andrew Dunstan

Date:

23 April 2009, 16:27:14

Tom Lane wrote:
> If we wanted to be entirely anal about this, we could allow SQL_ASCII
> destination with a different source encoding, but not the reverse.
> However, we currently consider that you're on your own to ensure sanity
> when using SQL_ASCII as far as locale goes, so I'm not sure why the
> policy would be different for encoding.
>
>             
>   

The trouble is that people won't know the provenance of their database. 
I think we should try to guarantee as far as possible that if a database 
has encoding X then all the data in it is valid in that encoding.

cheers

andrew

Re: Why do we let CREATE DATABASE reassign encoding?

From

"David E. Wheeler"

Date:

23 April 2009, 16:32:25

On Apr 23, 2009, at 12:00 PM, Tom Lane wrote:

> You can get around this by cloning template0 instead of template1
> (we assume template0 contains nothing that's encoding-specific).
> Possibly the docs will need to be improved to emphasize that.

I was just about to suggest that. With this change, template0 is  
suddenly going to be a lot more important for people to know about and  
make use of.

Best,

David

Re: Why do we let CREATE DATABASE reassign encoding?

From

Peter Eisentraut

Date:

23 April 2009, 18:40:16

On Thursday 23 April 2009 22:00:25 Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > So the following sequence woiuld be illegal:
> >
> > initdb -E latin1
> > createdb -E utf8
>
> Yes, that's rather the point.  Note that it already is illegal
> unless you happen to have selected C locale;

AFAIR, the only reason that we haven't disallowed this sort of stuff years and 
years ago is that people use it; the Japanese in particular.  I don't see what 
is different now.

Re: Why do we let CREATE DATABASE reassign encoding?

From

Tom Lane

Date:

23 April 2009, 20:00:44

Peter Eisentraut <peter_e@gmx.net> writes:
> AFAIR, the only reason that we haven't disallowed this sort of stuff
> years and years ago is that people use it; the Japanese in particular.
> I don't see what is different now.

What's different now is that 8.4 has already established the principle
that you have to clone template0 if you want to change the locale of a
database.  I think this is a good time to establish the same principle
for encodings.  (Or in other words, if we don't fix it now, when will
be a better time?)
        regards, tom lane

Re: Why do we let CREATE DATABASE reassign encoding?

From

Tom Lane

Date:

05 May 2009, 21:41:12

I wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> AFAIR, the only reason that we haven't disallowed this sort of stuff
>> years and years ago is that people use it; the Japanese in particular.
>> I don't see what is different now.

> What's different now is that 8.4 has already established the principle
> that you have to clone template0 if you want to change the locale of a
> database.  I think this is a good time to establish the same principle
> for encodings.  (Or in other words, if we don't fix it now, when will
> be a better time?)

Attached is a proposed patch (without documentation changes as yet)
for this.  Since the code is already enforcing exact locale match when
cloning a non-template0 database, I just made it act the same for
encoding, without any strange exceptions for SQL_ASCII.

I found that mbregress.sh was already broken by the existing
restrictions, if you try to use it in a database whose default
locale isn't C.  The patch adds switches to fix that.

The patch also incidentally fixes a few ereport's that were missing
errcode values.

Last chance for objections ...

            regards, tom lane

Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.223
diff -c -r1.223 dbcommands.c
*** src/backend/commands/dbcommands.c    5 May 2009 23:39:55 -0000    1.223
--- src/backend/commands/dbcommands.c    6 May 2009 00:30:59 -0000
***************
*** 361,367 ****
  #endif
            (encoding == PG_SQL_ASCII && superuser())))
          ereport(ERROR,
!                 (errmsg("encoding %s does not match locale %s",
                          pg_encoding_to_char(encoding),
                          dbctype),
               errdetail("The chosen LC_CTYPE setting requires encoding %s.",
--- 361,368 ----
  #endif
            (encoding == PG_SQL_ASCII && superuser())))
          ereport(ERROR,
!                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
!                  errmsg("encoding %s does not match locale %s",
                          pg_encoding_to_char(encoding),
                          dbctype),
               errdetail("The chosen LC_CTYPE setting requires encoding %s.",
***************
*** 374,402 ****
  #endif
            (encoding == PG_SQL_ASCII && superuser())))
          ereport(ERROR,
!                 (errmsg("encoding %s does not match locale %s",
                          pg_encoding_to_char(encoding),
                          dbcollate),
               errdetail("The chosen LC_COLLATE setting requires encoding %s.",
                         pg_encoding_to_char(collate_encoding))));

      /*
!      * Check that the new locale is compatible with the source database.
       *
!      * We know that template0 doesn't contain any indexes that depend on
!      * collation or ctype, so template0 can be used as template for
!      * any locale.
       */
      if (strcmp(dbtemplate, "template0") != 0)
      {
          if (strcmp(dbcollate, src_collate) != 0)
              ereport(ERROR,
!                     (errmsg("new collation is incompatible with the collation of the template database (%s)",
src_collate),
                       errhint("Use the same collation as in the template database, or use template0 as template.")));

          if (strcmp(dbctype, src_ctype) != 0)
              ereport(ERROR,
!                     (errmsg("new LC_CTYPE is incompatible with LC_CTYPE of the template database (%s)", src_ctype),
                       errhint("Use the same LC_CTYPE as in the template database, or use template0 as template.")));
      }

--- 375,419 ----
  #endif
            (encoding == PG_SQL_ASCII && superuser())))
          ereport(ERROR,
!                 (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
!                  errmsg("encoding %s does not match locale %s",
                          pg_encoding_to_char(encoding),
                          dbcollate),
               errdetail("The chosen LC_COLLATE setting requires encoding %s.",
                         pg_encoding_to_char(collate_encoding))));

      /*
!      * Check that the new encoding and locale settings match the source
!      * database.  We insist on this because we simply copy the source data ---
!      * any non-ASCII data would be wrongly encoded, and any indexes sorted
!      * according to the source locale would be wrong.
       *
!      * However, we assume that template0 doesn't contain any non-ASCII data
!      * nor any indexes that depend on collation or ctype, so template0 can be
!      * used as template for creating a database with any encoding or locale.
       */
      if (strcmp(dbtemplate, "template0") != 0)
      {
+         if (encoding != src_encoding)
+             ereport(ERROR,
+                     (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                      errmsg("new encoding (%s) is incompatible with the encoding of the template database (%s)",
+                             pg_encoding_to_char(encoding),
+                             pg_encoding_to_char(src_encoding)),
+                      errhint("Use the same encoding as in the template database, or use template0 as template.")));
+
          if (strcmp(dbcollate, src_collate) != 0)
              ereport(ERROR,
!                     (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
!                      errmsg("new collation (%s) is incompatible with the collation of the template database (%s)",
!                             dbcollate, src_collate),
                       errhint("Use the same collation as in the template database, or use template0 as template.")));

          if (strcmp(dbctype, src_ctype) != 0)
              ereport(ERROR,
!                     (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
!                      errmsg("new LC_CTYPE (%s) is incompatible with the LC_CTYPE of the template database (%s)",
!                             dbctype, src_ctype),
                       errhint("Use the same LC_CTYPE as in the template database, or use template0 as template.")));
      }

***************
*** 1099,1105 ****
                  continue;

              ereport(ERROR,
!                     (errmsg("some relations of database \"%s\" are already in tablespace \"%s\"",
                              dbname, tblspcname),
                       errhint("You must move them back to the database's default tablespace before using this
command.")));
          }
--- 1116,1123 ----
                  continue;

              ereport(ERROR,
!                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
!                      errmsg("some relations of database \"%s\" are already in tablespace \"%s\"",
                              dbname, tblspcname),
                       errhint("You must move them back to the database's default tablespace before using this
command.")));
          }
Index: src/test/mb/mbregress.sh
===================================================================
RCS file: /cvsroot/pgsql/src/test/mb/mbregress.sh,v
retrieving revision 1.9
diff -c -r1.9 mbregress.sh
*** src/test/mb/mbregress.sh    24 Jun 2005 15:11:59 -0000    1.9
--- src/test/mb/mbregress.sh    6 May 2009 00:30:59 -0000
***************
*** 15,21 ****
  fi

  dropdb utf8
! createdb -E UTF8 utf8

  PSQL="psql -n -e -q"
  tests="euc_jp sjis euc_kr euc_cn euc_tw big5 utf8 mule_internal"
--- 15,21 ----
  fi

  dropdb utf8
! createdb -T template0 -l C -E UTF8 utf8

  PSQL="psql -n -e -q"
  tests="euc_jp sjis euc_kr euc_cn euc_tw big5 utf8 mule_internal"
***************
*** 36,42 ****
          unset PGCLIENTENCODING
      else
          dropdb $i >/dev/null 2>&1
!         createdb -E `echo $i | tr 'abcdefghijklmnopqrstuvwxyz' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'` $i >/dev/null
          $PSQL $i < sql/${i}.sql > results/${i}.out 2>&1
      fi

--- 36,42 ----
          unset PGCLIENTENCODING
      else
          dropdb $i >/dev/null 2>&1
!         createdb -T template0 -l C -E `echo $i | tr 'abcdefghijklmnopqrstuvwxyz' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'` $i
>/dev/null
          $PSQL $i < sql/${i}.sql > results/${i}.out 2>&1
      fi