Thread: "could not adopt C locale" failure at startup on Windows

"could not adopt C locale" failure at startup on Windows

From
Tom Lane
Date:
We've seen several reports of $SUBJECT lately.  I have no idea what's
going on, but it occurred to me that it could be informative to tweak
init_locale() so that it reports which category failed to be set.
Any objections to squeezing that into today's releases?
        regards, tom lane



Re: "could not adopt C locale" failure at startup on Windows

From
Andres Freund
Date:
On 2015-06-09 11:20:06 -0400, Tom Lane wrote:
> We've seen several reports of $SUBJECT lately.  I have no idea what's
> going on, but it occurred to me that it could be informative to tweak
> init_locale() so that it reports which category failed to be set.
> Any objections to squeezing that into today's releases?

No objection, +1. Seems fairly low risk.

The error seems odd. The only even remotely related change between 9.4.1
and .2 seems to be ca325941. Could also be a build environment change.



Re: "could not adopt C locale" failure at startup on Windows

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2015-06-09 11:20:06 -0400, Tom Lane wrote:
>> We've seen several reports of $SUBJECT lately.  I have no idea what's
>> going on, but it occurred to me that it could be informative to tweak
>> init_locale() so that it reports which category failed to be set.
>> Any objections to squeezing that into today's releases?

> No objection, +1. Seems fairly low risk.

> The error seems odd. The only even remotely related change between 9.4.1
> and .2 seems to be ca325941. Could also be a build environment change.

Yeah, my first instinct was to blame ca325941 as well, but I don't think
any of that code executes during init_locale().  Also,
http://www.postgresql.org/message-id/20150326040321.2492.24716@wrigleys.postgresql.org
says it's been seen in 9.4.1.  Of course, it might be premature to assume
that report had an identical cause to the later ones.

What I plan to do is this:

static void
init_locale(const char *categoryname, int category, const char *locale)
{   if (pg_perm_setlocale(category, locale) == NULL &&       pg_perm_setlocale(category, "C") == NULL)
elog(FATAL,"could not adopt either \"%s\" locale or C locale for %s", locale, categoryname);
 
}

with the obvious changes to the call sites to pass the string names of
the categories not just their numeric codes.  (We could just log the
numbers, but it'd be much harder to interpret.)  This might be overkill,
but when you don't know what you're looking for, every little bit helps ...
        regards, tom lane



Re: "could not adopt C locale" failure at startup on Windows

From
Noah Misch
Date:
On Tue, Jun 09, 2015 at 12:24:02PM -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > The error seems odd. The only even remotely related change between 9.4.1
> > and .2 seems to be ca325941. Could also be a build environment change.
> 
> Yeah, my first instinct was to blame ca325941 as well, but I don't think
> any of that code executes during init_locale().  Also,
> http://www.postgresql.org/message-id/20150326040321.2492.24716@wrigleys.postgresql.org
> says it's been seen in 9.4.1.

The return value check and error message were new in 9.4.1.  I suspect the
underlying problem has been present far longer, undetected.

I can reproduce this with "initdb --locale=C",
postgresql-9.4.3-1-windows-binaries.zip (32-bit), Windows 7 x64, and the
Windows ANSI code page set to CP936.  (Choose "Chinese (Simplified, PRC)" in
Control Panel -> Region and Language -> Administrative -> Language for
non-Unicode programs.)  It is neither necessary nor sufficient to change
Control Panel -> Region and Language -> Formats -> Format.  Binaries from
postgresql-9.4.3-1-windows-x64-binaries.zip do not exhibit the problem.  Note
that CP936 is a PG_ENCODING_IS_CLIENT_ONLY() encoding.

> What I plan to do is this:
> 
> static void
> init_locale(const char *categoryname, int category, const char *locale)
> {
>     if (pg_perm_setlocale(category, locale) == NULL &&
>         pg_perm_setlocale(category, "C") == NULL)
>             elog(FATAL, "could not adopt either \"%s\" locale or C locale for %s", locale, categoryname);
> }

Good to have.



Re: "could not adopt C locale" failure at startup on Windows

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Tue, Jun 09, 2015 at 12:24:02PM -0400, Tom Lane wrote:
>> Yeah, my first instinct was to blame ca325941 as well, but I don't think
>> any of that code executes during init_locale().  Also,
>> http://www.postgresql.org/message-id/20150326040321.2492.24716@wrigleys.postgresql.org
>> says it's been seen in 9.4.1.

> The return value check and error message were new in 9.4.1.  I suspect the
> underlying problem has been present far longer, undetected.

Oooh ... I'd forgotten that 6fdba8ceb was so recent.  Agreed, what we are
seeing is probably a situation that's been there for a long time, but we
were ignoring the failure up to now (and, evidently, it wasn't really
creating any problems).

> I can reproduce this with "initdb --locale=C",
> postgresql-9.4.3-1-windows-binaries.zip (32-bit), Windows 7 x64, and the
> Windows ANSI code page set to CP936.  (Choose "Chinese (Simplified, PRC)" in
> Control Panel -> Region and Language -> Administrative -> Language for
> non-Unicode programs.)  It is neither necessary nor sufficient to change
> Control Panel -> Region and Language -> Formats -> Format.  Binaries from
> postgresql-9.4.3-1-windows-x64-binaries.zip do not exhibit the problem.  Note
> that CP936 is a PG_ENCODING_IS_CLIENT_ONLY() encoding.

Hm.  I could understand getting encoding difficulties in that environment,
but it's hard to see why they'd manifest like this.  Can you trace through
pg_perm_setlocale and figure out why it's reporting failure?
        regards, tom lane



Re: "could not adopt C locale" failure at startup on Windows

From
Noah Misch
Date:
On Wed, Jun 10, 2015 at 10:09:38AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > I can reproduce this with "initdb --locale=C",
> > postgresql-9.4.3-1-windows-binaries.zip (32-bit), Windows 7 x64, and the
> > Windows ANSI code page set to CP936.  (Choose "Chinese (Simplified, PRC)" in
> > Control Panel -> Region and Language -> Administrative -> Language for
> > non-Unicode programs.)  It is neither necessary nor sufficient to change
> > Control Panel -> Region and Language -> Formats -> Format.  Binaries from
> > postgresql-9.4.3-1-windows-x64-binaries.zip do not exhibit the problem.  Note
> > that CP936 is a PG_ENCODING_IS_CLIENT_ONLY() encoding.
>
> Hm.  I could understand getting encoding difficulties in that environment,
> but it's hard to see why they'd manifest like this.  Can you trace through
> pg_perm_setlocale and figure out why it's reporting failure?

A faster test is to set LC_CTYPE=C in the environment and run "postgres
--version".  The root cause is a bug my commit 5f538ad introduced at the start
of the 9.4 cycle.  pg_perm_setlocale() now calls pg_bind_textdomain_codeset(),
which calls setlocale(LC_CTYPE, NULL).  POSIX permits that to clobber all
previous setlocale() return values, which it did here[1].  The ensuing
putenv("LC_CTYPE=<garbage bytes>") at the end of pg_perm_setlocale() fails
under Windows ANSI code page 936, because the garbage bytes often aren't a
valid CP936 string.  I would expect the same symptom on other multibyte
Windows locales.

While Windows was the bellwether, harm potential is greater on non-Windows
systems.  pg_perm_setlocale() sets the LC_CTYPE environment variable to help
PL/Perl avoid clobbering the process locale; see plperl_init_interp()
comments.  However, that function has bespoke code for Windows, on which
setting the environment variable doesn't help.  I don't know which other
platforms invalidate previous setlocale() return values on setlocale(LC_CTYPE,
NULL).  Therefore, I propose committing the attached diagnostic patch and
reverting it after about one buildfarm cycle.  It will make affected
configurations fail hard, and then I'll have a notion about the prevalence of
damage to expect in the field.

The actual fix is trivial, attached second.  This is for back-patch to 9.4.


[1] It does so in 32-bit "release" (non-debug), NLS builds done under Visual
Studio 2012 and Visual Studio 2013.  The official binaries used VS2013.  The
symptoms are slightly different under VS2012.  I did not test earlier
versions.  Debug builds and 64-bit builds were unaffected.

Attachment

Re: "could not adopt C locale" failure at startup on Windows

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Wed, Jun 10, 2015 at 10:09:38AM -0400, Tom Lane wrote:
>> Hm.  I could understand getting encoding difficulties in that environment,
>> but it's hard to see why they'd manifest like this.  Can you trace through
>> pg_perm_setlocale and figure out why it's reporting failure?

> A faster test is to set LC_CTYPE=C in the environment and run "postgres
> --version".  The root cause is a bug my commit 5f538ad introduced at the start
> of the 9.4 cycle.  pg_perm_setlocale() now calls pg_bind_textdomain_codeset(),
> which calls setlocale(LC_CTYPE, NULL).  POSIX permits that to clobber all
> previous setlocale() return values, which it did here[1].

Ah-hah.

> While Windows was the bellwether, harm potential is greater on non-Windows
> systems.  pg_perm_setlocale() sets the LC_CTYPE environment variable to help
> PL/Perl avoid clobbering the process locale; see plperl_init_interp()
> comments.  However, that function has bespoke code for Windows, on which
> setting the environment variable doesn't help.  I don't know which other
> platforms invalidate previous setlocale() return values on setlocale(LC_CTYPE,
> NULL).  Therefore, I propose committing the attached diagnostic patch and
> reverting it after about one buildfarm cycle.  It will make affected
> configurations fail hard, and then I'll have a notion about the prevalence of
> damage to expect in the field.

I doubt this will teach us anything; if any buildfarm systems were
exhibiting the issue, they'd have been failing all along, no?  This should
break at least the bootstrap/initdb case on any affected system.

> The actual fix is trivial, attached second.  This is for back-patch to 9.4.

Looks sane to me.
        regards, tom lane



Re: "could not adopt C locale" failure at startup on Windows

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> A faster test is to set LC_CTYPE=C in the environment and run "postgres
> --version".  The root cause is a bug my commit 5f538ad introduced at the start
> of the 9.4 cycle.  pg_perm_setlocale() now calls pg_bind_textdomain_codeset(),
> which calls setlocale(LC_CTYPE, NULL).  POSIX permits that to clobber all
> previous setlocale() return values, which it did here[1].

After further thought, ISTM that this bug is evidence that 5f538ad
was badly designed, and the proposed fix has blinkers on.  If
pg_bind_textdomain_codeset() is looking at the locale environment,
we should not be calling it from inside pg_perm_setlocale() at all,
but from higher level code *after* we're done setting up the whole libc
locale environment --- thus, after the unsetenv("LC_ALL") call in main.c,
and somewhere near the bottom of CheckMyDatabase() in postinit.c.
It's mere chance that the order of calls to pg_perm_setlocale() is
such that the code works now; and it's not hard to imagine future
changes in gettext, or reordering of our own code, that would break it.
So I think having to duplicate that call is a reasonable price to pay
for not having surprising entanglements in what should be a very thin
wrapper around setlocale(3).
        regards, tom lane



Re: "could not adopt C locale" failure at startup on Windows

From
Noah Misch
Date:
On Mon, Jun 15, 2015 at 08:47:12AM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > While Windows was the bellwether, harm potential is greater on non-Windows
> > systems.  pg_perm_setlocale() sets the LC_CTYPE environment variable to help
> > PL/Perl avoid clobbering the process locale; see plperl_init_interp()
> > comments.  However, that function has bespoke code for Windows, on which
> > setting the environment variable doesn't help.  I don't know which other
> > platforms invalidate previous setlocale() return values on setlocale(LC_CTYPE,
> > NULL).  Therefore, I propose committing the attached diagnostic patch and
> > reverting it after about one buildfarm cycle.  It will make affected
> > configurations fail hard, and then I'll have a notion about the prevalence of
> > damage to expect in the field.
> 
> I doubt this will teach us anything; if any buildfarm systems were
> exhibiting the issue, they'd have been failing all along, no?

No; most systems let environment variables carry arbitrary strings of non-nul
bytes, so they don't see $SUBJECT.  I want to probe for all systems that are
currently issuing putenv("LC_CTYPE=<garbage>"), not just the ones where a
picky putenv() illuminates it.



Re: "could not adopt C locale" failure at startup on Windows

From
Noah Misch
Date:
On Mon, Jun 15, 2015 at 12:37:43PM -0400, Tom Lane wrote:
> After further thought, ISTM that this bug is evidence that 5f538ad
> was badly designed, and the proposed fix has blinkers on.  If
> pg_bind_textdomain_codeset() is looking at the locale environment,
> we should not be calling it from inside pg_perm_setlocale() at all,
> but from higher level code *after* we're done setting up the whole libc
> locale environment --- thus, after the unsetenv("LC_ALL") call in main.c,
> and somewhere near the bottom of CheckMyDatabase() in postinit.c.
> It's mere chance that the order of calls to pg_perm_setlocale() is
> such that the code works now; and it's not hard to imagine future
> changes in gettext, or reordering of our own code, that would break it.

pg_bind_textdomain_codeset() looks at one piece of the locale environment,
namely setlocale(LC_CTYPE, NULL), so the order of pg_perm_setlocale() calls
does not affect it.  There's nothing acutely bad about the alternative you
identify here, but it's no better equipped to forestall mistakes.  Moving the
call later would slightly expand the window during which translated messages
are garbled.



Re: "could not adopt C locale" failure at startup on Windows

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Mon, Jun 15, 2015 at 12:37:43PM -0400, Tom Lane wrote:
>> It's mere chance that the order of calls to pg_perm_setlocale() is
>> such that the code works now; and it's not hard to imagine future
>> changes in gettext, or reordering of our own code, that would break it.

> pg_bind_textdomain_codeset() looks at one piece of the locale environment,
> namely setlocale(LC_CTYPE, NULL), so the order of pg_perm_setlocale() calls
> does not affect it.

Well, my point is that that is a larger assumption about the behavior of
pg_bind_textdomain_codeset() than I think we ought to make, even if it
happens to be true today.

> There's nothing acutely bad about the alternative you
> identify here, but it's no better equipped to forestall mistakes.  Moving the
> call later would slightly expand the window during which translated messages
> are garbled.

I'm not exactly sure that they wouldn't be garbled anyway during the
interval where we're setting these things up.  Until DatabaseEncoding,
ClientEncoding, and gettext's internal notions are all in sync, we
are at risk of that type of issue no matter what.
        regards, tom lane



Re: "could not adopt C locale" failure at startup on Windows

From
Noah Misch
Date:
On Wed, Jun 17, 2015 at 01:43:55PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Mon, Jun 15, 2015 at 12:37:43PM -0400, Tom Lane wrote:
> >> It's mere chance that the order of calls to pg_perm_setlocale() is
> >> such that the code works now; and it's not hard to imagine future
> >> changes in gettext, or reordering of our own code, that would break it.
> 
> > pg_bind_textdomain_codeset() looks at one piece of the locale environment,
> > namely setlocale(LC_CTYPE, NULL), so the order of pg_perm_setlocale() calls
> > does not affect it.
> 
> Well, my point is that that is a larger assumption about the behavior of
> pg_bind_textdomain_codeset() than I think we ought to make, even if it
> happens to be true today.

Perhaps it's just me, but I can envision changes of similar plausibility that
break under each approach and not the other.  Without a way to distinguish on
that basis, I'm left shrugging about a proposal to switch.  For that matter,
if pg_bind_textdomain_codeset() starts to act on other facets of the locale,
that's liable to be a bug independent of our choice here.  However locale
facets conflict, we expect LC_CTYPE to control the message codeset.

> > There's nothing acutely bad about the alternative you
> > identify here, but it's no better equipped to forestall mistakes.  Moving the
> > call later would slightly expand the window during which translated messages
> > are garbled.
> 
> I'm not exactly sure that they wouldn't be garbled anyway during the
> interval where we're setting these things up.  Until DatabaseEncoding,
> ClientEncoding, and gettext's internal notions are all in sync, we
> are at risk of that type of issue no matter what.

True; the window exists and is small enough both ways.  This is merely one
more reason to fix the bug without fixing what ain't broke.



Re: "could not adopt C locale" failure at startup on Windows

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Wed, Jun 17, 2015 at 01:43:55PM -0400, Tom Lane wrote:
>> I'm not exactly sure that they wouldn't be garbled anyway during the
>> interval where we're setting these things up.  Until DatabaseEncoding,
>> ClientEncoding, and gettext's internal notions are all in sync, we
>> are at risk of that type of issue no matter what.

> True; the window exists and is small enough both ways.  This is merely one
> more reason to fix the bug without fixing what ain't broke.

[ shrug... ]  I'm not planning to waste a whole lot more breath on this
topic, but to my mind the current definition of pg_perm_setlocale() *is*
broke.  Previously, that function only interacted with the standard libc
locale facilities.  Now it's also entangled with gettext(), as well as
SetMessageEncoding and GetDatabaseEncoding.  IMO that extra cross-module
entanglement is the exact reason we have a bug here.  It's a layering
violation and I think we should get rid of it.
        regards, tom lane