On Sun, Feb 10, 2013 at 06:47:30PM -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > Following some actual testing, I see that we treat postgresql.conf values as
> > byte sequences; any reinterpretation as encoded text happens later. Hence,
> > contrary to my earlier suspicion, your patch does not make that situation
> > worse. The present situation is bad; among other things, current_setting() is
> > a vector for injecting invalid text data. But unconditionally validating
> > postgresql.conf values in the platform encoding would not be an improvement.
> > Suppose you have a UTF-8 platform encoding and KOI8R databases. You may wish
> > to put KOI8R strings in a GUC, say search_path. That's possible today; if we
> > required that postgresql.conf conform to the platform encoding and no other,
> > it would become impossible. This area warrants improvement, but doing so will
> > entail careful design.
>
> The key problem, ISTM, is that it's not at all clear what encoding to
> expect the incoming data to be in. I'm concerned about trying to fix
> that by assuming it's in some "platform encoding" --- for one thing,
> while that might be a well-defined concept on Windows, I don't believe
> it is anywhere else.

GetPlatformEncoding() imposes a sufficiently-portable definition. I just
don't think that definition leads to a value that can be presumed desirable
and adequate for postgresql.conf.
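To make the KOI8R case quoted above concrete, here is a minimal Python sketch. It is not PostgreSQL code, and the schema name is made up. It shows a byte string that is a valid value for a KOI8R database yet would be rejected if postgresql.conf were validated as UTF-8:

```python
def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical schema name, as the bytes a KOI8R database would expect.
koi8r_value = "схема".encode("koi8_r")

print(is_valid(koi8r_value, "koi8_r"))  # checked as KOI8-R
print(is_valid(koi8r_value, "utf-8"))   # checked as UTF-8
```

Every lowercase Cyrillic letter in KOI8-R is a single byte in the range 0xC0-0xDF, none of which is a valid UTF-8 continuation byte, so a string like this one cannot pass a UTF-8 check.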
> If we knew that postgresql.conf was stored in, say, UTF8, then it would
> probably be possible to perform encoding conversion to get string
> variables into the database encoding. Perhaps we should allow some
> magic syntax to tell us the encoding of a config file?
>
> file_encoding = 'utf8' # must precede any non-ASCII in the file
>
> There would still be a lot of practical problems to solve, like what to
> do if we fail to convert some string into the database encoding. But at
> least the problems would be somewhat well-defined.

Agreed. That's a promising direction.
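As a rough illustration of how such a declaration might be honored, here is a Python sketch under several assumptions: file_encoding is only the syntax proposed above and does not exist today, the function names are hypothetical, and encoding names are passed straight to Python's codec registry rather than mapped from PostgreSQL's names. Until the declaration is seen, only ASCII is accepted, which enforces the "must precede any non-ASCII" rule.

```python
import re

# Hypothetical syntax from the proposal above; no such setting exists today.
_DECLARATION = re.compile(rb"^\s*file_encoding\s*=\s*'([A-Za-z0-9_-]+)'")

def decode_config(raw: bytes) -> str:
    """Decode config-file bytes, honoring a file_encoding declaration."""
    encoding = "ascii"  # until declared, only ASCII is accepted
    lines = []
    for line in raw.splitlines():
        match = _DECLARATION.match(line)
        if match:
            encoding = match.group(1).decode("ascii")
        # Raises UnicodeDecodeError on bytes invalid in the current encoding,
        # including any non-ASCII byte that precedes the declaration.
        lines.append(line.decode(encoding))
    return "\n".join(lines)

def to_database_encoding(value: str, db_encoding: str) -> bytes:
    """Convert a setting value; raises UnicodeEncodeError if unrepresentable."""
    return value.encode(db_encoding)
```

The second function is where the remaining practical problem shows up: converting a Cyrillic search_path into a LATIN1 database raises an error, and the sketch takes no position on whether that should be fatal, a warning, or something else.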
> While we're thinking about this, it'd be nice to fix our handling (or
> rather lack of handling) of encoding considerations for database names,
> user names, and passwords. I could imagine adding some sort of encoding
> marker to connection request packets, which could fix the
> don't-know-the-encoding problem as far as incoming data is concerned.

That deserves a TODO entry under Wire Protocol Changes to avoid losing it.

> But how
> shall we deal with storing the strings in shared catalogs, which have to
> be readable from multiple databases possibly of different encodings?

I suppose we would pick an encoding sufficient for all values we intend to
support (UTF8? MULE_INTERNAL?), then store the data in that encoding using
either bytea or a new type, say "omnitext".
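A minimal Python sketch of that idea, assuming UTF-8 is the encoding picked (MULE_INTERNAL has no Python codec, so it is not modeled here). The function names are hypothetical and nothing below reflects actual catalog code:

```python
STORAGE_ENCODING = "utf-8"  # assumption: one of the candidates floated above

def store_shared_name(value: bytes, source_encoding: str) -> bytes:
    """Transcode a name from its source encoding to the storage encoding."""
    return value.decode(source_encoding).encode(STORAGE_ENCODING)

def read_shared_name(stored: bytes, db_encoding: str) -> bytes:
    """Transcode a stored name into the reading database's encoding.

    Raises UnicodeEncodeError when the name is not representable there.
    """
    return stored.decode(STORAGE_ENCODING).encode(db_encoding)
```

A name written from a KOI8R session round-trips unchanged when read from another KOI8R database, while reading it from a LATIN1 database raises an error. That is the same unconvertible-string question as for postgresql.conf, just moved to read time.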
Thanks,
nm