Noah Misch <noah@leadboat.com> writes:
> Following some actual testing, I see that we treat postgresql.conf values as
> byte sequences; any reinterpretation as encoded text happens later. Hence,
> contrary to my earlier suspicion, your patch does not make that situation
> worse. The present situation is bad; among other things, current_setting() is
> a vector for injecting invalid text data. But unconditionally validating
> postgresql.conf values in the platform encoding would not be an improvement.
> Suppose you have a UTF-8 platform encoding and KOI8R databases. You may wish
> to put KOI8R strings in a GUC, say search_path. That's possible today; if we
> required that postgresql.conf conform to the platform encoding and no other,
> it would become impossible. This area warrants improvement, but doing so will
> entail careful design.

The key problem, ISTM, is that it's not at all clear what encoding to
expect the incoming data to be in. I'm concerned about trying to fix
that by assuming it's in some "platform encoding" --- for one thing,
while that might be a well-defined concept on Windows, I don't believe
it is anywhere else.

If we knew that postgresql.conf was stored in, say, UTF8, then it would
probably be possible to perform encoding conversion to get string
variables into the database encoding. Perhaps we should allow some
magic syntax to tell us the encoding of a config file?

	file_encoding = 'utf8'	# must precede any non-ASCII in the file

There would still be a lot of practical problems to solve, like what to
do if we fail to convert some string into the database encoding. But at
least the problems would be somewhat well-defined.
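
To make the idea concrete, here is a rough Python sketch of how such a marker might be processed. It is only an illustration under stated assumptions: file_encoding is the hypothetical directive proposed above (PostgreSQL has no such setting), read_settings is an invented helper, only single-quoted values are handled, and both encodings are assumed to be ASCII-compatible so that pure-ASCII values can pass through untouched. A conversion failure simply raises, which is one possible policy among several.

```python
import re

# Hypothetical directive from the proposal above; not an existing PostgreSQL setting.
_ENCODING_RE = re.compile(rb"^\s*file_encoding\s*=\s*'([A-Za-z0-9_-]+)'")
# Simplified: only name = 'value' lines with single-quoted values are recognized.
_SETTING_RE = re.compile(rb"^\s*([A-Za-z_][A-Za-z0-9_.]*)\s*=\s*'([^']*)'")


def read_settings(raw, database_encoding):
    """Return {name: value bytes}, with values converted to database_encoding."""
    file_encoding = None
    settings = {}
    for lineno, line in enumerate(raw.splitlines(), start=1):
        m = _ENCODING_RE.match(line)
        if m:
            file_encoding = m.group(1).decode("ascii")
            continue
        m = _SETTING_RE.match(line)
        if not m:
            continue  # comments, blank lines, unquoted values: ignored in this sketch
        name, value = m.group(1).decode("ascii"), m.group(2)
        if value.isascii():
            settings[name] = value  # ASCII bytes are valid in either encoding
            continue
        if file_encoding is None:
            raise ValueError(
                "line %d: non-ASCII value before file_encoding" % lineno)
        # decode() raises UnicodeDecodeError if the bytes are not valid in the
        # declared file encoding; encode() raises UnicodeEncodeError if the
        # string cannot be represented in the database encoding.
        text = value.decode(file_encoding)
        settings[name] = text.encode(database_encoding)
    return settings
```

With a UTF-8 file and a KOI8R database, a Cyrillic search_path would come out as KOI8R bytes, while a value containing, say, Japanese characters would fail at the encode step. That failure is exactly the case for which a policy would still have to be chosen.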

While we're thinking about this, it'd be nice to fix our handling (or
rather lack of handling) of encoding considerations for database names,
user names, and passwords. I could imagine adding some sort of encoding
marker to connection request packets, which could fix the don't-know-
the-encoding problem as far as incoming data is concerned. But how
shall we deal with storing the strings in shared catalogs, which have to
be readable from multiple databases possibly of different encodings?

			regards, tom lane