Re: BUG #7493: Postmaster messages unreadable in a Windows console - Mailing list pgsql-hackers

From Noah Misch
Subject Re: BUG #7493: Postmaster messages unreadable in a Windows console
Date
Msg-id 20130623165359.GB782173@tornado.leadboat.com
Whole thread Raw
In response to Re: BUG #7493: Postmaster messages unreadable in a Windows console  (Alexander Law <exclusion@gmail.com>)
Responses Re: BUG #7493: Postmaster messages unreadable in a Windows console
List pgsql-hackers
On Wed, Feb 20, 2013 at 04:00:00PM +0400, Alexander Law wrote:
> 15.02.2013 02:59, Noah Misch wrote:
>>>> With your proposed change, the problem will resurface in an actual SQL_ASCII
>>>> database.  At the problem's root is write_console()'s assumption that messages
>>>> are in the database encoding.  pg_bind_textdomain_codeset() tries to make that
>>>> so, but it only works for encodings with a pg_enc2gettext_tbl entry.  That
>>>> excludes SQL_ASCII, MULE_INTERNAL, and others.  write_console() needs to
>>>> behave differently in such cases.
>>> Thank you for the notice. So it seems that "DatabaseEncoding" variable
>>> alone can't present a database encoding (for communication with a
>>> client) and current process messages encoding (for logging messages) at
>>> once. There should be another variable, something like
>>> CurrentProcessEncoding, that will be set to OS encoding at start and can
>>> be changed to encoding of a connected database (if
>>> bind_textdomain_codeset succeeded).
>> I'd call it MessageEncoding unless it corresponds with similar rigor to a
>> broader concept.
> Please look at the next version of the patch.

I studied this patch.  I like the APIs it adds, but the implementation details
required attention.

Your patch assumes the message encoding will be a backend encoding, but this
need not be so; locale "Japanese (Japan)" uses CP932 aka SJIS.  I've also
added the non-backend encodings to pg_enc2gettext_tbl; I'd welcome sanity
checks from folks who know those encodings well.

write_console() still garbled the payload in all cases where it used write()
to a console instead of WriteConsoleW().  You can observe this by configuring
Windows for Russian, running initdb with default locale settings, and running
"select 1/0" in template1.  See comments for the technical details; I fixed
this by removing the logic to preemptively skip WriteConsoleW() for
encoding-related reasons.  (Over in write_eventlog(), I am suspicious of the
benefits we get from avoiding ReportEventW() where possible.  I have not
removed that, though.)

> --- a/src/backend/main/main.c
> +++ b/src/backend/main/main.c
> @@ -100,6 +100,8 @@ main(int argc, char *argv[])
>
>      set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("postgres"));
>
> +    SetMessageEncoding(GetPlatformEncoding());

SetMessageEncoding() and GetPlatformEncoding() can both elog()/ereport(), but
this is too early in the life of a PostgreSQL process for that.

The goal of this line of code is to capture gettext's default encoding, which
is platform-specific.  On non-Windows platforms, it's the encoding implied by
LC_CTYPE.  On Windows, it's the Windows ANSI code page.  GetPlatformEncoding()
always gives the encoding implied by LC_CTYPE, which is therefore not correct
on Windows.  You can observe broken output by setting your locale (in control
panel "Region and Language" -> Formats -> Format) to something unrelated to
your Windows ANSI code page (Region and Language -> Administrative -> Change
system locale...).  Let's make PostgreSQL independent of gettext's
Windows-specific default by *always* calling bind_textdomain_codeset() on
Windows.  In the postmaster and other non-database-attached processes, as well
as in SQL_ASCII databases, we would pass the encoding implied by LC_CTYPE.
Besides removing a discrepancy in PostgreSQL behavior between Windows and
non-Windows platforms, this has the great practical advantage of reducing
PostgreSQL's dependence on the deprecated Windows ANSI code page, which can
only be changed on a system-wide basis, by an administrator, after a reboot.

For message creation purposes, the encoding implied by LC_CTYPE on Windows is
somewhat arbitrary.  Microsoft has deprecated them all in favor of UTF16, and
some locales (e.g. "Hindi (India)") have no legacy code page.  I can see
adding a postmaster_encoding GUC to be used in place of consulting LC_CTYPE.
I haven't done that now; I just mention it for future reference.

On a !ENABLE_NLS build, messages are ASCII with database-encoding strings
sometimes interpolated.  Therefore, such builds should keep the message
encoding equal to the database encoding.

The MessageEncoding was not maintained accurately on non-Windows systems.  For
example, connecting to an lc_ctype=LATIN1, encoding=LATIN1 database does not
require a bind_textdomain_codeset() call on such systems, so we did not update
the message encoding.  This was academic for the time being, because
MessageEncoding is only consulted on Windows.


The attached revision fixes all above points.  Would you look it over?  The
area was painfully light on comments, so I added some.  I renamed
pgwin32_toUTF16(), which ceases to be a good name now that it converts from
message encoding, not database encoding.  GetPlatformEncoding() became unused,
so I removed it.  (If we have cause reintroduce the exact same concept later,
GetTTYEncoding() would name it more accurately.)


What should we do for the back branches, if anything?  Fixes for garbled
display on consoles and event logs are fair to back-patch, but users might be
accustomed to the present situation for SQL_ASCII databases.  Given the low
incidence of complaints and the workaround of using logging_collector, I am
inclined to put the whole thing in master only.

Thanks,
nm

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com

Attachment

pgsql-hackers by date:

Previous
From: Kevin Grittner
Date:
Subject: Re: changeset generation v5-01 - Patches & git tree
Next
From: Kevin Grittner
Date:
Subject: Re: changeset generation v5-01 - Patches & git tree