Re: psql weird behaviour with charset encodings - Mailing list pgsql-general

From Tom Lane
Subject Re: psql weird behaviour with charset encodings
Msg-id 3797.1273276002@sss.pgh.pa.us
In response to Re: psql weird behaviour with charset encodings  (hernan gonzalez <hgonzalez@gmail.com>)
Responses Re: psql weird behaviour with charset encodings  (hgonzalez@gmail.com)
List pgsql-general
hernan gonzalez <hgonzalez@gmail.com> writes:
> The issue is that psql tries (apparently) to convert to UTF8
> (even when it plans to output the raw text, LATIN9 in this case)
> just for computing the length of the field, to build the table.
> And for this computation it (apparently) relies on the string
> routines with its own locale, instead of the DB or client encoding.

I didn't believe this, since I know perfectly well that the formatting
code doesn't rely on any OS-supplied width calculations.  But when I
tested it out, I found I could reproduce Hernan's problem on Fedora 11.
Some tracing showed that the problem is here:

                fprintf(fout, "%.*s", bytes_to_output,
                        this_line->ptr + bytes_output[j]);

As the variable name indicates, psql has carefully calculated the number
of *bytes* it wants to print.  However, it appears that glibc's printf
code interprets the parameter as the number of *characters* to print,
and to determine what's a character it assumes the string is in the
environment LC_CTYPE's encoding.  I haven't dug into the glibc code to
check, but it's presumably barfing because the string isn't valid
according to UTF8 encoding, and then failing to print anything.

It appears to me that this behavior violates the Single Unix Spec,
which says very clearly that the count is a count of bytes:
http://www.opengroup.org/onlinepubs/007908799/xsh/fprintf.html
However, I'm quite sure that our chances of persuading the glibc boys
that this is a bad idea are zero.  I think we're going to have to
change the code to not rely on %.*s here.  Even without the charset
mismatch in Hernan's example, we'd be printing the wrong amount of
data anytime the LC_CTYPE charset is multibyte.  (IOW, the code should
do the wrong thing with forced-line-wrap cases if LC_CTYPE is UTF8,
even if client_encoding is too; anybody want to check?)
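
For illustration only, here is a minimal sketch of one way to avoid %.*s:
hand the byte count to fwrite(), which never consults LC_CTYPE.  The
helper name put_n_bytes is hypothetical, not anything that exists in psql
today, and this is not necessarily the fix that will get committed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not psql's actual code): write exactly nbytes bytes
 * starting at ptr to fout, with no locale-dependent interpretation.
 * Returns 0 on success, -1 if the write came up short.
 */
static int
put_n_bytes(FILE *fout, const char *ptr, size_t nbytes)
{
    assert(fout != NULL && ptr != NULL);

    /* fwrite counts bytes, so the LC_CTYPE encoding never enters into it */
    return (fwrite(ptr, 1, nbytes, fout) == nbytes) ? 0 : -1;
}
```

Under that assumption, the fprintf call shown above would turn into
put_n_bytes(fout, this_line->ptr + bytes_output[j], bytes_to_output).
Note that unlike %.*s, this does not stop early at a NUL terminator, so
the caller has to be sure the byte count stays within the string.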

The above coding is new in 8.4, but it's probably not the only use of
%.*s --- we had better go looking for other trouble spots, too.

            regards, tom lane
