Thread: "could not adopt C locale" failure at startup on Windows
We've seen several reports of $SUBJECT lately. I have no idea what's going on, but it occurred to me that it could be informative to tweak init_locale() so that it reports which category failed to be set. Any objections to squeezing that into today's releases? regards, tom lane
On 2015-06-09 11:20:06 -0400, Tom Lane wrote: > We've seen several reports of $SUBJECT lately. I have no idea what's > going on, but it occurred to me that it could be informative to tweak > init_locale() so that it reports which category failed to be set. > Any objections to squeezing that into today's releases? No objection, +1. Seems fairly low risk. The error seems odd. The only even remotely related change between 9.4.1 and .2 seems to be ca325941. Could also be a build environment change.
Andres Freund <andres@anarazel.de> writes: > On 2015-06-09 11:20:06 -0400, Tom Lane wrote: >> We've seen several reports of $SUBJECT lately. I have no idea what's >> going on, but it occurred to me that it could be informative to tweak >> init_locale() so that it reports which category failed to be set. >> Any objections to squeezing that into today's releases? > No objection, +1. Seems fairly low risk. > The error seems odd. The only even remotely related change between 9.4.1 > and .2 seems to be ca325941. Could also be a build environment change. Yeah, my first instinct was to blame ca325941 as well, but I don't think any of that code executes during init_locale(). Also, http://www.postgresql.org/message-id/20150326040321.2492.24716@wrigleys.postgresql.org says it's been seen in 9.4.1. Of course, it might be premature to assume that report had an identical cause to the later ones. What I plan to do is this: static void init_locale(const char *categoryname, int category, const char *locale) { if (pg_perm_setlocale(category, locale) == NULL && pg_perm_setlocale(category, "C") == NULL) elog(FATAL,"could not adopt either \"%s\" locale or C locale for %s", locale, categoryname); } with the obvious changes to the call sites to pass the string names of the categories not just their numeric codes. (We could just log the numbers, but it'd be much harder to interpret.) This might be overkill, but when you don't know what you're looking for, every little bit helps ... regards, tom lane
On Tue, Jun 09, 2015 at 12:24:02PM -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > The error seems odd. The only even remotely related change between 9.4.1 > > and .2 seems to be ca325941. Could also be a build environment change. > > Yeah, my first instinct was to blame ca325941 as well, but I don't think > any of that code executes during init_locale(). Also, > http://www.postgresql.org/message-id/20150326040321.2492.24716@wrigleys.postgresql.org > says it's been seen in 9.4.1. The return value check and error message were new in 9.4.1. I suspect the underlying problem has been present far longer, undetected. I can reproduce this with "initdb --locale=C", postgresql-9.4.3-1-windows-binaries.zip (32-bit), Windows 7 x64, and the Windows ANSI code page set to CP936. (Choose "Chinese (Simplified, PRC)" in Control Panel -> Region and Language -> Administrative -> Language for non-Unicode programs.) It is neither necessary nor sufficient to change Control Panel -> Region and Language -> Formats -> Format. Binaries from postgresql-9.4.3-1-windows-x64-binaries.zip do not exhibit the problem. Note that CP936 is a PG_ENCODING_IS_CLIENT_ONLY() encoding. > What I plan to do is this: > > static void > init_locale(const char *categoryname, int category, const char *locale) > { > if (pg_perm_setlocale(category, locale) == NULL && > pg_perm_setlocale(category, "C") == NULL) > elog(FATAL, "could not adopt either \"%s\" locale or C locale for %s", locale, categoryname); > } Good to have.
Noah Misch <noah@leadboat.com> writes: > On Tue, Jun 09, 2015 at 12:24:02PM -0400, Tom Lane wrote: >> Yeah, my first instinct was to blame ca325941 as well, but I don't think >> any of that code executes during init_locale(). Also, >> http://www.postgresql.org/message-id/20150326040321.2492.24716@wrigleys.postgresql.org >> says it's been seen in 9.4.1. > The return value check and error message were new in 9.4.1. I suspect the > underlying problem has been present far longer, undetected. Oooh ... I'd forgotten that 6fdba8ceb was so recent. Agreed, what we are seeing is probably a situation that's been there for a long time, but we were ignoring the failure up to now (and, evidently, it wasn't really creating any problems). > I can reproduce this with "initdb --locale=C", > postgresql-9.4.3-1-windows-binaries.zip (32-bit), Windows 7 x64, and the > Windows ANSI code page set to CP936. (Choose "Chinese (Simplified, PRC)" in > Control Panel -> Region and Language -> Administrative -> Language for > non-Unicode programs.) It is neither necessary nor sufficient to change > Control Panel -> Region and Language -> Formats -> Format. Binaries from > postgresql-9.4.3-1-windows-x64-binaries.zip do not exhibit the problem. Note > that CP936 is a PG_ENCODING_IS_CLIENT_ONLY() encoding. Hm. I could understand getting encoding difficulties in that environment, but it's hard to see why they'd manifest like this. Can you trace through pg_perm_setlocale and figure out why it's reporting failure? regards, tom lane
On Wed, Jun 10, 2015 at 10:09:38AM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > I can reproduce this with "initdb --locale=C", > > postgresql-9.4.3-1-windows-binaries.zip (32-bit), Windows 7 x64, and the > > Windows ANSI code page set to CP936. (Choose "Chinese (Simplified, PRC)" in > > Control Panel -> Region and Language -> Administrative -> Language for > > non-Unicode programs.) It is neither necessary nor sufficient to change > > Control Panel -> Region and Language -> Formats -> Format. Binaries from > > postgresql-9.4.3-1-windows-x64-binaries.zip do not exhibit the problem. Note > > that CP936 is a PG_ENCODING_IS_CLIENT_ONLY() encoding. > > Hm. I could understand getting encoding difficulties in that environment, > but it's hard to see why they'd manifest like this. Can you trace through > pg_perm_setlocale and figure out why it's reporting failure? A faster test is to set LC_CTYPE=C in the environment and run "postgres --version". The root cause is a bug my commit 5f538ad introduced at the start of the 9.4 cycle. pg_perm_setlocale() now calls pg_bind_textdomain_codeset(), which calls setlocale(LC_CTYPE, NULL). POSIX permits that to clobber all previous setlocale() return values, which it did here[1]. The ensuing putenv("LC_CTYPE=<garbage bytes>") at the end of pg_perm_setlocale() fails under Windows ANSI code page 936, because the garbage bytes often aren't a valid CP936 string. I would expect the same symptom on other multibyte Windows locales. While Windows was the bellwether, harm potential is greater on non-Windows systems. pg_perm_setlocale() sets the LC_CTYPE environment variable to help PL/Perl avoid clobbering the process locale; see plperl_init_interp() comments. However, that function has bespoke code for Windows, on which setting the environment variable doesn't help. I don't know which other platforms invalidate previous setlocale() return values on setlocale(LC_CTYPE, NULL). Therefore, I propose committing the attached diagnostic patch and reverting it after about one buildfarm cycle. It will make affected configurations fail hard, and then I'll have a notion about the prevalence of damage to expect in the field. The actual fix is trivial, attached second. This is for back-patch to 9.4. [1] It does so in 32-bit "release" (non-debug), NLS builds done under Visual Studio 2012 and Visual Studio 2013. The official binaries used VS2013. The symptoms are slightly different under VS2012. I did not test earlier versions. Debug builds and 64-bit builds were unaffected.
Attachment
Noah Misch <noah@leadboat.com> writes: > On Wed, Jun 10, 2015 at 10:09:38AM -0400, Tom Lane wrote: >> Hm. I could understand getting encoding difficulties in that environment, >> but it's hard to see why they'd manifest like this. Can you trace through >> pg_perm_setlocale and figure out why it's reporting failure? > A faster test is to set LC_CTYPE=C in the environment and run "postgres > --version". The root cause is a bug my commit 5f538ad introduced at the start > of the 9.4 cycle. pg_perm_setlocale() now calls pg_bind_textdomain_codeset(), > which calls setlocale(LC_CTYPE, NULL). POSIX permits that to clobber all > previous setlocale() return values, which it did here[1]. Ah-hah. > While Windows was the bellwether, harm potential is greater on non-Windows > systems. pg_perm_setlocale() sets the LC_CTYPE environment variable to help > PL/Perl avoid clobbering the process locale; see plperl_init_interp() > comments. However, that function has bespoke code for Windows, on which > setting the environment variable doesn't help. I don't know which other > platforms invalidate previous setlocale() return values on setlocale(LC_CTYPE, > NULL). Therefore, I propose committing the attached diagnostic patch and > reverting it after about one buildfarm cycle. It will make affected > configurations fail hard, and then I'll have a notion about the prevalence of > damage to expect in the field. I doubt this will teach us anything; if any buildfarm systems were exhibiting the issue, they'd have been failing all along, no? This should break at least the bootstrap/initdb case on any affected system. > The actual fix is trivial, attached second. This is for back-patch to 9.4. Looks sane to me. regards, tom lane
Noah Misch <noah@leadboat.com> writes: > A faster test is to set LC_CTYPE=C in the environment and run "postgres > --version". The root cause is a bug my commit 5f538ad introduced at the start > of the 9.4 cycle. pg_perm_setlocale() now calls pg_bind_textdomain_codeset(), > which calls setlocale(LC_CTYPE, NULL). POSIX permits that to clobber all > previous setlocale() return values, which it did here[1]. After further thought, ISTM that this bug is evidence that 5f538ad was badly designed, and the proposed fix has blinkers on. If pg_bind_textdomain_codeset() is looking at the locale environment, we should not be calling it from inside pg_perm_setlocale() at all, but from higher level code *after* we're done setting up the whole libc locale environment --- thus, after the unsetenv("LC_ALL") call in main.c, and somewhere near the bottom of CheckMyDatabase() in postinit.c. It's mere chance that the order of calls to pg_perm_setlocale() is such that the code works now; and it's not hard to imagine future changes in gettext, or reordering of our own code, that would break it. So I think having to duplicate that call is a reasonable price to pay for not having surprising entanglements in what should be a very thin wrapper around setlocale(3). regards, tom lane
On Mon, Jun 15, 2015 at 08:47:12AM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > While Windows was the bellwether, harm potential is greater on non-Windows > > systems. pg_perm_setlocale() sets the LC_CTYPE environment variable to help > > PL/Perl avoid clobbering the process locale; see plperl_init_interp() > > comments. However, that function has bespoke code for Windows, on which > > setting the environment variable doesn't help. I don't know which other > > platforms invalidate previous setlocale() return values on setlocale(LC_CTYPE, > > NULL). Therefore, I propose committing the attached diagnostic patch and > > reverting it after about one buildfarm cycle. It will make affected > > configurations fail hard, and then I'll have a notion about the prevalence of > > damage to expect in the field. > > I doubt this will teach us anything; if any buildfarm systems were > exhibiting the issue, they'd have been failing all along, no? No; most systems let environment variables carry arbitrary strings of non-nul bytes, so they don't see $SUBJECT. I want to probe for all systems that are currently issuing putenv("LC_CTYPE=<garbage>"), not just the ones where a picky putenv() illuminates it.
On Mon, Jun 15, 2015 at 12:37:43PM -0400, Tom Lane wrote: > After further thought, ISTM that this bug is evidence that 5f538ad > was badly designed, and the proposed fix has blinkers on. If > pg_bind_textdomain_codeset() is looking at the locale environment, > we should not be calling it from inside pg_perm_setlocale() at all, > but from higher level code *after* we're done setting up the whole libc > locale environment --- thus, after the unsetenv("LC_ALL") call in main.c, > and somewhere near the bottom of CheckMyDatabase() in postinit.c. > It's mere chance that the order of calls to pg_perm_setlocale() is > such that the code works now; and it's not hard to imagine future > changes in gettext, or reordering of our own code, that would break it. pg_bind_textdomain_codeset() looks at one piece of the locale environment, namely setlocale(LC_CTYPE, NULL), so the order of pg_perm_setlocale() calls does not affect it. There's nothing acutely bad about the alternative you identify here, but it's no better equipped to forestall mistakes. Moving the call later would slightly expand the window during which translated messages are garbled.
Noah Misch <noah@leadboat.com> writes: > On Mon, Jun 15, 2015 at 12:37:43PM -0400, Tom Lane wrote: >> It's mere chance that the order of calls to pg_perm_setlocale() is >> such that the code works now; and it's not hard to imagine future >> changes in gettext, or reordering of our own code, that would break it. > pg_bind_textdomain_codeset() looks at one piece of the locale environment, > namely setlocale(LC_CTYPE, NULL), so the order of pg_perm_setlocale() calls > does not affect it. Well, my point is that that is a larger assumption about the behavior of pg_bind_textdomain_codeset() than I think we ought to make, even if it happens to be true today. > There's nothing acutely bad about the alternative you > identify here, but it's no better equipped to forestall mistakes. Moving the > call later would slightly expand the window during which translated messages > are garbled. I'm not exactly sure that they wouldn't be garbled anyway during the interval where we're setting these things up. Until DatabaseEncoding, ClientEncoding, and gettext's internal notions are all in sync, we are at risk of that type of issue no matter what. regards, tom lane
On Wed, Jun 17, 2015 at 01:43:55PM -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > On Mon, Jun 15, 2015 at 12:37:43PM -0400, Tom Lane wrote: > >> It's mere chance that the order of calls to pg_perm_setlocale() is > >> such that the code works now; and it's not hard to imagine future > >> changes in gettext, or reordering of our own code, that would break it. > > > pg_bind_textdomain_codeset() looks at one piece of the locale environment, > > namely setlocale(LC_CTYPE, NULL), so the order of pg_perm_setlocale() calls > > does not affect it. > > Well, my point is that that is a larger assumption about the behavior of > pg_bind_textdomain_codeset() than I think we ought to make, even if it > happens to be true today. Perhaps it's just me, but I can envision changes of similar plausibility that break under each approach and not the other. Without a way to distinguish on that basis, I'm left shrugging about a proposal to switch. For that matter, if pg_bind_textdomain_codeset() starts to act on other facets of the locale, that's liable to be a bug independent of our choice here. However locale facets conflict, we expect LC_CTYPE to control the message codeset. > > There's nothing acutely bad about the alternative you > > identify here, but it's no better equipped to forestall mistakes. Moving the > > call later would slightly expand the window during which translated messages > > are garbled. > > I'm not exactly sure that they wouldn't be garbled anyway during the > interval where we're setting these things up. Until DatabaseEncoding, > ClientEncoding, and gettext's internal notions are all in sync, we > are at risk of that type of issue no matter what. True; the window exists and is small enough both ways. This is merely one more reason to fix the bug without fixing what ain't broke.
Noah Misch <noah@leadboat.com> writes: > On Wed, Jun 17, 2015 at 01:43:55PM -0400, Tom Lane wrote: >> I'm not exactly sure that they wouldn't be garbled anyway during the >> interval where we're setting these things up. Until DatabaseEncoding, >> ClientEncoding, and gettext's internal notions are all in sync, we >> are at risk of that type of issue no matter what. > True; the window exists and is small enough both ways. This is merely one > more reason to fix the bug without fixing what ain't broke. [ shrug... ] I'm not planning to waste a whole lot more breath on this topic, but to my mind the current definition of pg_perm_setlocale() *is* broke. Previously, that function only interacted with the standard libc locale facilities. Now it's also entangled with gettext(), as well as SetMessageEncoding and GetDatabaseEncoding. IMO that extra cross-module entanglement is the exact reason we have a bug here. It's a layering violation and I think we should get rid of it. regards, tom lane