Thread: check_locale() and the empty string

check_locale() and the empty string

From
Jeff Davis
Date:
The following SQL succeeds:

  create database foodb with
    template = template0
    encoding = 'UTF8'
    lc_collate=''
    lc_ctype='';

(any other template fails because it actually checks for a match against
the template).

The problem seems to be in check_locale(), which just checks for a
non-NULL return value from setlocale(). However, the manual for
setlocale() says:

  If locale is "", each part of the locale that should be modified
  is set according  to  the  environment  variables.   The details
  are implementation-dependent.

Surely we don't want it to be set from the environment, right?

Regards,
    Jeff Davis

Re: check_locale() and the empty string

From
Jeff Davis
Date:
On Sun, 2012-03-11 at 11:20 -0700, Jeff Davis wrote:
> The problem seems to be in check_locale(), which just checks for a
> non-NULL return value from setlocale(). However, the manual for
> setlocale() says:
>
>   If locale is "", each part of the locale that should be modified
>   is set according  to  the  environment  variables.   The details
>   are implementation-dependent.

Trivial patch attached.

Regards,
    Jeff Davis

Attachment

Re: check_locale() and the empty string

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> The following SQL succeeds:
>   create database foodb with
>     template = template0
>     encoding = 'UTF8'
>     lc_collate=''
>     lc_ctype='';

Sure.

> Surely we don't want it to be set from the environment, right?

Why not?  We have always done that, and in fact the various lc_xxx GUC
variables are documented thusly:

    If this variable is set to the empty string (which is the
    default) then the value is inherited from the execution
    environment of the server in a system-dependent way.

The "trivial patch" you propose breaks that behavior.

I do agree that it's probably unwise to store an empty string as the
value of pg_database.datcollate or datctype, because that would mean
that if the server is restarted with different LC_XXX environment values
then it will think the database has different locale settings, leading
to havoc.  However, ISTM the right fix is to replace an empty-string
value with the implied locale name at createdb time.  Proposed patch
attached.

Note 1: there's no need to change the behavior for the locale GUCs,
since we don't have any assumptions that those hold still over server
restarts.

Note 2: there is code in initdb that is supposed to be kept parallel
to this, but it's not currently making any attempt to canonicalize
non-empty locale names.  Should we make it do that too?

            regards, tom lane

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 91d74815287c1f6d46359b1a0ad0bddd0fd763be..9721ce9e0a6562b8b934c786adcc01eafd28b20c 100644
*** a/src/backend/commands/dbcommands.c
--- b/src/backend/commands/dbcommands.c
*************** createdb(const CreatedbStmt *stmt)
*** 123,128 ****
--- 123,129 ----
      const char *dbtemplate = NULL;
      char       *dbcollate = NULL;
      char       *dbctype = NULL;
+     char       *canonname;
      int            encoding = -1;
      int            dbconnlimit = -1;
      int            notherbackends;
*************** createdb(const CreatedbStmt *stmt)
*** 318,332 ****
                  (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                   errmsg("invalid server encoding %d", encoding)));

!     /* Check that the chosen locales are valid */
!     if (!check_locale(LC_COLLATE, dbcollate))
          ereport(ERROR,
                  (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                   errmsg("invalid locale name %s", dbcollate)));
!     if (!check_locale(LC_CTYPE, dbctype))
          ereport(ERROR,
                  (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                   errmsg("invalid locale name %s", dbctype)));

      check_encoding_locale_matches(encoding, dbcollate, dbctype);

--- 319,335 ----
                  (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                   errmsg("invalid server encoding %d", encoding)));

!     /* Check that the chosen locales are valid, and get canonical spellings */
!     if (!check_locale(LC_COLLATE, dbcollate, &canonname))
          ereport(ERROR,
                  (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                   errmsg("invalid locale name %s", dbcollate)));
!     dbcollate = canonname;
!     if (!check_locale(LC_CTYPE, dbctype, &canonname))
          ereport(ERROR,
                  (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                   errmsg("invalid locale name %s", dbctype)));
+     dbctype = canonname;

      check_encoding_locale_matches(encoding, dbcollate, dbctype);

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 9f112e8c5cb7f0f3b4f9a6a522b041e418b90b23..627172f48e02faa303215962aaab69de05448dfe 100644
*** a/src/backend/utils/adt/pg_locale.c
--- b/src/backend/utils/adt/pg_locale.c
*************** pg_perm_setlocale(int category, const ch
*** 222,233 ****

  /*
   * Is the locale name valid for the locale category?
   */
  bool
! check_locale(int category, const char *value)
  {
      char       *save;
!     bool        ret;

      save = setlocale(category, NULL);
      if (!save)
--- 222,240 ----

  /*
   * Is the locale name valid for the locale category?
+  *
+  * If successful, and canonname isn't NULL, a palloc'd copy of the locale's
+  * canonical name is stored there.  This is especially useful for figuring out
+  * what locale name "" means (ie, the server environment value).
   */
  bool
! check_locale(int category, const char *locale, char **canonname)
  {
      char       *save;
!     char       *res;
!
!     if (canonname)
!         *canonname = NULL;        /* in case of failure */

      save = setlocale(category, NULL);
      if (!save)
*************** check_locale(int category, const char *v
*** 237,250 ****
      save = pstrdup(save);

      /* set the locale with setlocale, to see if it accepts it. */
!     ret = (setlocale(category, value) != NULL);

      /* restore old value. */
      if (!setlocale(category, save))
          elog(WARNING, "failed to restore old locale");
      pfree(save);

!     return ret;
  }


--- 244,261 ----
      save = pstrdup(save);

      /* set the locale with setlocale, to see if it accepts it. */
!     res = setlocale(category, locale);
!
!     /* save canonical name if requested. */
!     if (res && canonname)
!         *canonname = pstrdup(res);

      /* restore old value. */
      if (!setlocale(category, save))
          elog(WARNING, "failed to restore old locale");
      pfree(save);

!     return (res != NULL);
  }


*************** check_locale(int category, const char *v
*** 262,268 ****
  bool
  check_locale_monetary(char **newval, void **extra, GucSource source)
  {
!     return check_locale(LC_MONETARY, *newval);
  }

  void
--- 273,279 ----
  bool
  check_locale_monetary(char **newval, void **extra, GucSource source)
  {
!     return check_locale(LC_MONETARY, *newval, NULL);
  }

  void
*************** assign_locale_monetary(const char *newva
*** 274,280 ****
  bool
  check_locale_numeric(char **newval, void **extra, GucSource source)
  {
!     return check_locale(LC_NUMERIC, *newval);
  }

  void
--- 285,291 ----
  bool
  check_locale_numeric(char **newval, void **extra, GucSource source)
  {
!     return check_locale(LC_NUMERIC, *newval, NULL);
  }

  void
*************** assign_locale_numeric(const char *newval
*** 286,292 ****
  bool
  check_locale_time(char **newval, void **extra, GucSource source)
  {
!     return check_locale(LC_TIME, *newval);
  }

  void
--- 297,303 ----
  bool
  check_locale_time(char **newval, void **extra, GucSource source)
  {
!     return check_locale(LC_TIME, *newval, NULL);
  }

  void
*************** check_locale_messages(char **newval, voi
*** 322,328 ****
       * On Windows, we can't even check the value, so accept blindly
       */
  #if defined(LC_MESSAGES) && !defined(WIN32)
!     return check_locale(LC_MESSAGES, *newval);
  #else
      return true;
  #endif
--- 333,339 ----
       * On Windows, we can't even check the value, so accept blindly
       */
  #if defined(LC_MESSAGES) && !defined(WIN32)
!     return check_locale(LC_MESSAGES, *newval, NULL);
  #else
      return true;
  #endif
diff --git a/src/include/utils/pg_locale.h b/src/include/utils/pg_locale.h
index 448c01a69712c99ee04bfff44d69138bafb76ece..3c38aa272926f775de41eb5d685f5ce4de84ac2f 100644
*** a/src/include/utils/pg_locale.h
--- b/src/include/utils/pg_locale.h
*************** extern void assign_locale_numeric(const
*** 42,48 ****
  extern bool check_locale_time(char **newval, void **extra, GucSource source);
  extern void assign_locale_time(const char *newval, void *extra);

! extern bool check_locale(int category, const char *locale);
  extern char *pg_perm_setlocale(int category, const char *locale);

  extern bool lc_collate_is_c(Oid collation);
--- 42,48 ----
  extern bool check_locale_time(char **newval, void **extra, GucSource source);
  extern void assign_locale_time(const char *newval, void *extra);

! extern bool check_locale(int category, const char *locale, char **canonname);
  extern char *pg_perm_setlocale(int category, const char *locale);

  extern bool lc_collate_is_c(Oid collation);

Re: check_locale() and the empty string

From
Jeff Davis
Date:
On Sat, 2012-03-24 at 19:07 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > Surely we don't want it to be set from the environment, right?
>
> Why not?

I agree that we shouldn't change the documented behavior of those GUCs.

But a SQL command like CREATE DATABASE being environment sensitive does
seem like surprising behavior to me, and it did indirectly
lead to a bug.

> I do agree that it's probably unwise to store an empty string as the
> value of pg_database.datcollate or datctype, because that would mean
> that if the server is restarted with different LC_XXX environment values
> then it will think the database has different locale settings, leading
> to havoc.

Yes, that's the worst of the problem. I should have mentioned that more
explicitly in the original report.

> However, ISTM the right fix is to replace an empty-string
> value with the implied locale name at createdb time.  Proposed patch
> attached.

+1.

> Note 2: there is code in initdb that is supposed to be kept parallel
> to this, but it's not currently making any attempt to canonicalize
> non-empty locale names.  Should we make it do that too?

I assume you are talking about the code that results in writing the
settings to postgresql.conf?

It doesn't look quite as dangerous in that area because (a) it ignores
zero-length strings; and (b) setting the GUC to the wrong value will
either be prevented or will not cause any major problem. However, it
does seem like a good idea to canonicalize the settings unless there is
some reason not to.

Regards,
    Jeff Davis

Re: check_locale() and the empty string

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2012-03-24 at 19:07 -0400, Tom Lane wrote:
>> Jeff Davis <pgsql@j-davis.com> writes:
>>> Surely we don't want it to be set from the environment, right?

>> Why not?

> I agree that we shouldn't change the documented behavior of those GUCs.
> But a SQL command like CREATE DATABASE being environment sensitive does
> seem like surprising behavior to me, and it did indirectly
> lead to a bug.

Well, our locale documentation makes it pretty clear that a lot of this
behavior is inherited from the server's environment by default:
http://www.postgresql.org/docs/devel/static/locale.html
So I don't see anything wrong with that in principle.  But we'd better
make sure that a database's lc_collate/lc_ctype are locked down after
creation.

>> Note 2: there is code in initdb that is supposed to be kept parallel
>> to this, but it's not currently making any attempt to canonicalize
>> non-empty locale names.  Should we make it do that too?

> I assume you are talking about the code that results in writing the
> settings to postgresql.conf?

> It doesn't look quite as dangerous in that area because (a) it ignores
> zero-length strings; and (b) setting the GUC to the wrong value will
> either be prevented or will not cause any major problem. However, it
> does seem like a good idea to canonicalize the settings unless there is
> some reason not to.

When I wrote the proposed patch, I was imagining that setlocale()
would in fact do some canonicalization of the supplied string.
Experimentation shows that's not so, though, at least not on current
Linux and OS X: you seem to get back the supplied string in all cases
except where it's "".

The reason I was hoping for canonicalization is that right now we
consider locale names like "en_US.utf8" and "en_US.UTF-8" to be
different, even though libc very probably treats them the same;
this results in CREATE DATABASE rejecting lc_collate/ctype settings
that are only cosmetically different from the template database's
values.  Canonicalization by setlocale() would fix that without
requiring us to make any unsupported assumptions about which names
mean the same.  Oh well.

However, it might still be that somewhere there is a libc that does
take the opportunity to canonicalize the locale name, and if that
did happen then it would be important for initdb to do it the same
way as CREATE DATABASE does.  Otherwise, we might end up rejecting
a CREATE DATABASE lc_collate/ctype setting that's identical to what
the user told initdb to use, because one got canonicalized and the
other not.  So this roundabout series of assumptions leads me to
think that initdb needs to be tweaked too.

            regards, tom lane