Thread: Re: [GENERAL] psql weird behaviour with charset encodings
hernan gonzalez <hgonzalez@gmail.com> writes: > The issue is that psql tries (apparently) to convert to UTF8 > (even when he plans to output the raw text -LATIN9 in this case) > just for computing the lenght of the field, to build the table. > And because for this computation he (apparently) rely on the string > routines with it's own locale, instead of the DB or client encoding. I didn't believe this, since I know perfectly well that the formatting code doesn't rely on any OS-supplied width calculations. But when I tested it out, I found I could reproduce Hernan's problem on Fedora 11. Some tracing showed that the problem is here: fprintf(fout, "%.*s", bytes_to_output, this_line->ptr + bytes_output[j]); As the variable name indicates, psql has carefully calculated the number of *bytes* it wants to print. However, it appears that glibc's printf code interprets the parameter as the number of *characters* to print, and to determine what's a character it assumes the string is in the environment LC_CTYPE's encoding. I haven't dug into the glibc code to check, but it's presumably barfing because the string isn't valid according to UTF8 encoding, and then failing to print anything. It appears to me that this behavior violates the Single Unix Spec, which says very clearly that the count is a count of bytes: http://www.opengroup.org/onlinepubs/007908799/xsh/fprintf.html However, I'm quite sure that our chances of persuading the glibc boys that this is a bad idea are zero. I think we're going to have to change the code to not rely on %.*s here. Even without the charset mismatch in Hernan's example, we'd be printing the wrong amount of data anytime the LC_CTYPE charset is multibyte. (IOW, the code should do the wrong thing with forced-line-wrap cases if LC_CTYPE is UTF8, even if client_encoding is too; anybody want to check?) The above coding is new in 8.4, but it's probably not the only use of %.*s --- we had better go looking for other trouble spots, too. regards, tom lane
> However, it appears that glibc's printf
code interprets the parameter as the number of *characters* to print,
and to determine what's a character it assumes the string is in the
environment LC_CTYPE's encoding.
Well, I myself have problems to believe that :-)
This would be nasty... Are you sure?
I couldn reproduce that.
I made a quick test, passing a utf-8 encoded string
(5 bytes correspoding to 4 unicode chars: "niño")
And my glib (same Fedora 12) seems to count bytes,
as it should.
#include<stdio.h>
main () {
char s[] = "ni\xc3\xb1o";
printf("|%.*s|\n",5,s);
}
This, compiled with gcc 4.4.3, run with my root locale (utf8)
did not padded a blank. i.e. it worked as expected.
Hernán
code interprets the parameter as the number of *characters* to print,
and to determine what's a character it assumes the string is in the
environment LC_CTYPE's encoding.
Well, I myself have problems to believe that :-)
This would be nasty... Are you sure?
I couldn reproduce that.
I made a quick test, passing a utf-8 encoded string
(5 bytes correspoding to 4 unicode chars: "niño")
And my glib (same Fedora 12) seems to count bytes,
as it should.
#include<stdio.h>
main () {
char s[] = "ni\xc3\xb1o";
printf("|%.*s|\n",5,s);
}
This, compiled with gcc 4.4.3, run with my root locale (utf8)
did not padded a blank. i.e. it worked as expected.
Hernán
Sorry about a error in my previous example (mixed width and precision). But the conclusion is the same - it works on bytes: #include<stdio.h> main () { char s[] = "ni\xc3\xb1o"; /* 5 bytes , 4 utf8 chars */ printf("|%*s|\n",6,s); /* this should pad a black */ printf("|%.*s|\n",4,s); /* this should eat a char */ } [root@myserv tmp]# ./a.out | od -t cx1 0000000 | n i 303 261 o | \n | n i 303 261 | \n 7c 20 6e 69 c3 b1 6f 7c 0a 7c 6e 69 c3 b1 7c 0a Hernán On Fri, May 7, 2010 at 10:48 PM, <hgonzalez@gmail.com> wrote: >> However, it appears that glibc's printf > code interprets the parameter as the number of *characters* to print, > and to determine what's a character it assumes the string is in the > environment LC_CTYPE's encoding. > > Well, I myself have problems to believe that :-) > This would be nasty... Are you sure? > > I couldn reproduce that. > I made a quick test, passing a utf-8 encoded string > (5 bytes correspoding to 4 unicode chars: "niño") > And my glib (same Fedora 12) seems to count bytes, > as it should. > > #include<stdio.h> > main () { > char s[] = "ni\xc3\xb1o"; > printf("|%.*s|\n",5,s); > } > > This, compiled with gcc 4.4.3, run with my root locale (utf8) > did not padded a blank. i.e. it worked as expected. > > Hernán
hernan gonzalez <hgonzalez@gmail.com> writes: > Sorry about a error in my previous example (mixed width and precision). > But the conclusion is the same - it works on bytes: This example works like that because it's running in C locale always. Try something like this: #include<stdio.h> #include<locale.h> int main () { char s[] = "ni\xc3qo"; /* 5 bytes , not valid utf8 */ setlocale(LC_ALL, ""); printf("|%.*s|\n",3,s); return 0; } I get different (and undesirable) effects depending on LANG. regards, tom lane
Wow, you are right, this is bizarre... And it's not that glibc intends to compute the length in unicode chars, it actually counts bytes (c plain chars) -as it should- for computing field widths... But, for some strange reason, when there is some width calculation involved it tries so parse the char[] using the locale encoding (when there's no point in doing it!) and if it fails, it truncates (silently) the printf output. So it seems more a glib bug to me than an interpretion issue (bytes vs chars). I posted some details in stackoverflow: http://stackoverflow.com/questions/2792567/printf-field-width-bytes-or-chars BTW, I understand that postgresql uses locale semantics in the server code. But is this really necessary/appropiate in the client (psql) side? Couldnt we stick with C locale here? -- Hernán J. González http://hjg.com.ar/
Well, I finally found some related -rather old- issues in Bugzilla (glib)<br /><br />http://sources.redhat.com/bugzilla/show_bug.cgi?id=6530<br />http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=208308<br/>http://sources.redhat.com/bugzilla/show_bug.cgi?id=649<br /><br/>The last explains why they do not consider it a bug:<br /><br /> ISO C99 requires for %.*s to only write completecharacters that fit below the<br /> precision number of bytes. If you are using say UTF-8 locale, but ISO-8859-1<br/> characters as shown in the input file you provided, some of the strings are<br /> not valid UTF-8 strings,therefore sprintf fails with -1 because of the<br /> encoding error. That's not a bug in glibc.<br /><br />It's clear,though it's also rather ugly, from a specification point of view (we must <br />count raw bytes for the width field,but also must decode the utf8 chars for finding <br />character boundaries). I guess we must live with that. <br /><br/>Hernán J. González
hgonzalez@gmail.com writes: > http://sources.redhat.com/bugzilla/show_bug.cgi?id=649 > The last explains why they do not consider it a bug: > ISO C99 requires for %.*s to only write complete characters that fit below > the > precision number of bytes. If you are using say UTF-8 locale, but ISO-8859-1 > characters as shown in the input file you provided, some of the strings are > not valid UTF-8 strings, therefore sprintf fails with -1 because of the > encoding error. That's not a bug in glibc. Yeah, that was about the position I thought they'd take. So the bottom line here is that we're best off to avoid %.*s because it may fail if the string contains data that isn't validly encoded according to libc's idea of the prevailing encoding. I think that means the patch I committed earlier is still a good idea, but the comments need a bit of adjustment. Will fix. regards, tom lane
hernan gonzalez <hgonzalez@gmail.com> writes: > BTW, I understand that postgresql uses locale semantics in the server code. > But is this really necessary/appropiate in the client (psql) side? > Couldnt we stick with C locale here? As far as that goes, I think we have to turn on that machinery in order to have gettext() work (ie, to have localized error messages). regards, tom lane
On Sat, May 08, 2010 at 09:24:45PM -0400, Tom Lane wrote: > hgonzalez@gmail.com writes: > > http://sources.redhat.com/bugzilla/show_bug.cgi?id=649 > > > The last explains why they do not consider it a bug: > > > ISO C99 requires for %.*s to only write complete characters that fit below > > the > > precision number of bytes. If you are using say UTF-8 locale, but ISO-8859-1 > > characters as shown in the input file you provided, some of the strings are > > not valid UTF-8 strings, therefore sprintf fails with -1 because of the > > encoding error. That's not a bug in glibc. > > Yeah, that was about the position I thought they'd take. GNU libc eventually revisited that conclusion and fixed the bug through commit 715a900c9085907fa749589bf738b192b1a2bda5. RHEL 7.1 is fixed, but RHEL 6.6 and RHEL 5.11 are still affected; the bug will be relevant for another 8+ years. > So the bottom line here is that we're best off to avoid %.*s because > it may fail if the string contains data that isn't validly encoded > according to libc's idea of the prevailing encoding. Yep. Immediate precisions like %.10s trigger the bug as effectively as %.*s, so tarCreateHeader() [_tarWriteHeader() in 9.2 and earlier] is also affected. Switching to strlcpy(), as attached, fixes the bug while simplifying the code. The bug symptom is error 'pg_basebackup: unrecognized link indicator "0"' when the name of a file in the data directory is not a valid multibyte string. Commit 6dd9584 introduced a new use of .*s, to pg_upgrade. It works reliably for now, because it always runs in the C locale. pg_upgrade never calls set_pglocale_pgservice() or otherwise sets its permanent locale. It would be natural for us to fix that someday, at which point non-ASCII database names would perturb this status output. It would be good to purge the code of precisions on "s" conversion specifiers, then Assert(!pointflag) in fmtstr() to catch new introductions. I won't plan to do it myself, but it would be a nice little defensive change.
Attachment
<div dir="ltr"><br /><br />On Sun, May 24, 2015 at 2:43 AM, Noah Misch <<a href="mailto:noah@leadboat.com">noah@leadboat.com</a>>wrote:<br />> On Sat, May 08, 2010 at 09:24:45PM -0400, Tom Lanewrote:<br />>> <a href="mailto:hgonzalez@gmail.com">hgonzalez@gmail.com</a> writes:<br />>> > <a href="http://sources.redhat.com/bugzilla/show_bug.cgi?id=649">http://sources.redhat.com/bugzilla/show_bug.cgi?id=649</a><br />>><br/>>> > The last explains why they do not consider it a bug:<br />>><br />>> > ISO C99requires for %.*s to only write complete characters that fit below<br />>> > the<br />>> > precisionnumber of bytes. If you are using say UTF-8 locale, but ISO-8859-1<br />>> > characters as shown in theinput file you provided, some of the strings are<br />>> > not valid UTF-8 strings, therefore sprintf fails with-1 because of the<br />>> > encoding error. That's not a bug in glibc.<br />>><br />>> Yeah, thatwas about the position I thought they'd take.<br />><br />> GNU libc eventually revisited that conclusion and fixedthe bug through commit<br />> 715a900c9085907fa749589bf738b192b1a2bda5. RHEL 7.1 is fixed, but RHEL 6.6 and<br />>RHEL 5.11 are still affected; the bug will be relevant for another 8+ years.<br />><br />>> So the bottomline here is that we're best off to avoid %.*s because<br />>> it may fail if the string contains data that isn'tvalidly encoded<br />>> according to libc's idea of the prevailing encoding.<br />><br />> Yep. Immediateprecisions like %.10s trigger the bug as effectively as %.*s,<br />> so tarCreateHeader() [_tarWriteHeader()in 9.2 and earlier] is also affected.<br />> Switching to strlcpy(), as attached, fixes the bug whilesimplifying the code.<br />> The bug symptom is error 'pg_basebackup: unrecognized link indicator "0"' when<br />>the name of a file in the data directory is not a valid multibyte string.<br />><br />><br />> Commit 6dd9584introduced a new use of .*s, to pg_upgrade. It works reliably<br />> for now, because it always runs in the Clocale. pg_upgrade never calls<br />> set_pglocale_pgservice() or otherwise sets its permanent locale. It would be<br/>> natural for us to fix that someday, at which point non-ASCII database names<br />> would perturb this statusoutput.<br /><br />I caught up the following places that need attention on top of the 4 ones in tar.c:<br />src/backend/nodes/read.c: elog(ERROR, "unrecognized integer: \"%.*s\"",<br/>src/backend/nodes/read.c: elog(ERROR, "unrecognizedOID: \"%.*s\"",<br />src/backend/nodes/read.c: elog(ERROR, "unrecognized token:\"%.*s\"", tok_len, token);<br />src/backend/nodes/readfuncs.c: elog(ERROR, "unrecognized token: \"%.*s\"",length, token);<br />src/backend/nodes/readfuncs.c: elog(ERROR, "unrecognized token: \"%.*s\"", length,token);<br />src/backend/nodes/readfuncs.c: elog(ERROR, "unrecognized integer: \"%.*s\"", length,token);<br />src/backend/nodes/readfuncs.c: elog(ERROR, "unrecognized boolop \"%.*s\"", length, token);<br/>src/backend/nodes/readfuncs.c: elog(ERROR, "badly formatted node string \"%.32s\"...", token);<br />src/backend/tsearch/wparser_def.c: * Use of %.*s here is a bit risky since it can misbehave if the data is<br />src/backend/tsearch/wparser_def.c: fprintf(stderr, "parsing \"%.*s\"\n", len, str);<br />src/backend/tsearch/wparser_def.c: /* See note above about %.*s */<br />src/backend/tsearch/wparser_def.c: fprintf(stderr,"parsing copy of \"%.*s\"\n", prs->lenstr, prs->str);<br />src/backend/utils/adt/datetime.c: * Note: the uses of %.*s in this function would be risky if the<br />src/backend/utils/adt/datetime.c: sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);<br/>src/backend/utils/adt/datetime.c: sprintf(str + strlen(str), " %.*s", MAXTZLEN,tzn);<br />src/backend/utils/adt/datetime.c: sprintf(str + strlen(str), "%.*s", MAXTZLEN, tzn);<br />src/backend/utils/adt/datetime.c: /* %.*s is safe since all our tokensare ASCII */<br />src/backend/utils/adt/datetime.c: elog(LOG, "token too long in %s table: \"%.*s\"",<br/>src/interfaces/ecpg/ecpglib/error.c: /* %.*s is safe here as long as sqlstate is all-ASCII */<br />src/interfaces/ecpg/ecpglib/error.c: ecpg_log("raising sqlstate %.*s (sqlcode %ld): %s\n",<br />src/interfaces/ecpg/pgtypeslib/dt_common.c: * Note: the uses of %.*s in this function would be riskyif the<br />src/interfaces/ecpg/pgtypeslib/dt_common.c: sprintf(str + strlen(str)," %.*s", MAXTZLEN, tzn);<br />src/interfaces/ecpg/pgtypeslib/dt_common.c: sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);<br />src/interfaces/ecpg/pgtypeslib/dt_common.c: sprintf(str + strlen(str), " %.*s", MAXTZLEN, tzn);<br />src/bin/pg_basebackup/pg_basebackup.c: ngettext("%*s/%s kB (%d%%), %d/%d tablespace (%s%-*.*s)",<br />src/bin/pg_basebackup/pg_basebackup.c: "%*s/%s kB (%d%%), %d/%d tablespaces(%s%-*.*s)",<br />src/bin/pg_upgrade/util.c: printf(" %s%-*.*s\r",<br /><br />>It would be good to purge the code of precisions on "s" conversion specifiers,<br />> then Assert(!pointflag) infmtstr() to catch new introductions. I won't plan<br />> to do it myself, but it would be a nice little defensive change.<br/><br />This sounds like a good protection idea, but as it impacts existing backend code relying in sprintf's portversion we should only do the assertion in HEAD in my opinion, and mention it in the release notes of the next majorversion at the time a patch in this area is applied. I guess that we had better backpatch the places using .*s thoughon back-branches.<br />-- <br />Michael</div>
On Tue, Jun 2, 2015 at 4:19 PM, Michael Paquier wrote:
> It would be good to purge the code of precisions on "s" conversion specifiers,
> then Assert(!pointflag) in fmtstr() to catch new introductions. I won't plan
> to do it myself, but it would be a nice little defensive change.
This sounds like a good protection idea, but as it impacts existing backend code relying in sprintf's port version we should only do the assertion in HEAD in my opinion, and mention it in the release notes of the next major version at the time a patch in this area is applied. I guess that we had better backpatch the places using .*s though on back-branches.
Attached is a patch purging a bunch of places from using %.*s, this will make the code more solid when facing non-ASCII strings. I let pg_upgrade and pg_basebackup code paths alone as it reduces the code lisibility by moving out of this separator. We may want to fix them though if file path names have non-ASCII characters, but it seems less critical.
Thoughts?
--
Michael
Attachment
On Wed, Jun 03, 2015 at 05:25:45PM +0900, Michael Paquier wrote: > On Tue, Jun 2, 2015 at 4:19 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > > On Sun, May 24, 2015 at 2:43 AM, Noah Misch <noah@leadboat.com> wrote: > > > It would be good to purge the code of precisions on "s" conversion specifiers, > > > then Assert(!pointflag) in fmtstr() to catch new introductions. I won't plan > > > to do it myself, but it would be a nice little defensive change. > > > > This sounds like a good protection idea, but as it impacts existing > > backend code relying in sprintf's port version we should only do the > > assertion in HEAD in my opinion, and mention it in the release notes of the > > next major version at the time a patch in this area is applied. I guess Adding the assertion would be master-only. We don't necessarily release-note C API changes. > > that we had better backpatch the places using .*s though on back-branches. I would tend to back-patch only the ones that cause interesting bugs. For example, we should not reach the read.c elog() calls anyway, so it's not a big deal if the GNU libc bug makes them a bit less helpful in back branches. (Thanks for the list of code sites; it was more complete than anything I had.) So far, only tar.c looks harmed enough to back-patch. > Attached is a patch purging a bunch of places from using %.*s, this will > make the code more solid when facing non-ASCII strings. I let pg_upgrade > and pg_basebackup code paths alone as it reduces the code lisibility by > moving out of this separator. We may want to fix them though if file path > names have non-ASCII characters, but it seems less critical. To add the assertion, we must of course fix all uses. Having seen the patch I requested, I don't like the result as much as I expected to like it. The patched code is definitely harder to read and write: > @@ -1534,7 +1541,10 @@ parseNodeString(void) > return_value = _readDeclareCursorStmt(); > else > { > - elog(ERROR, "badly formatted node string \"%.32s\"...", token); > + char buf[33]; > + memcpy(buf, token, 32); > + buf[33] = '\0'; > + elog(ERROR, "badly formatted node string \"%s\"...", buf); > return_value = NULL; /* keep compiler quiet */ > } (Apropos, that terminator belongs in buf[32], not buf[33].) Perhaps we're better off setting aside the whole idea, or forcing use of snprintf.c on configurations having the bug? Thanks, nm
On Thu, Jun 18, 2015 at 9:47 AM, Noah Misch <noah@leadboat.com> wrote: > On Wed, Jun 03, 2015 at 05:25:45PM +0900, Michael Paquier wrote: >> On Tue, Jun 2, 2015 at 4:19 PM, Michael Paquier <michael.paquier@gmail.com> wrote: >> > On Sun, May 24, 2015 at 2:43 AM, Noah Misch <noah@leadboat.com> wrote: >> > > It would be good to purge the code of precisions on "s" conversion specifiers, >> > > then Assert(!pointflag) in fmtstr() to catch new introductions. I won't plan >> > > to do it myself, but it would be a nice little defensive change. >> > >> > This sounds like a good protection idea, but as it impacts existing >> > backend code relying in sprintf's port version we should only do the >> > assertion in HEAD in my opinion, and mention it in the release notes of the >> > next major version at the time a patch in this area is applied. I guess > > Adding the assertion would be master-only. We don't necessarily release-note > C API changes. Cool. So we are on the same page. >> > that we had better backpatch the places using .*s though on back-branches. > > I would tend to back-patch only the ones that cause interesting bugs. For > example, we should not reach the read.c elog() calls anyway, so it's not a big > deal if the GNU libc bug makes them a bit less helpful in back branches. > (Thanks for the list of code sites; it was more complete than anything I had.) > So far, only tar.c looks harmed enough to back-patch. > >> Attached is a patch purging a bunch of places from using %.*s, this will >> make the code more solid when facing non-ASCII strings. I let pg_upgrade >> and pg_basebackup code paths alone as it reduces the code lisibility by >> moving out of this separator. We may want to fix them though if file path >> names have non-ASCII characters, but it seems less critical. > > To add the assertion, we must of course fix all uses. Sure. > Having seen the patch I requested, I don't like the result as much as I > expected to like it. The patched code is definitely harder to read and write: > >> @@ -1534,7 +1541,10 @@ parseNodeString(void) >> return_value = _readDeclareCursorStmt(); >> else >> { >> - elog(ERROR, "badly formatted node string \"%.32s\"...", token); >> + char buf[33]; >> + memcpy(buf, token, 32); >> + buf[33] = '\0'; >> + elog(ERROR, "badly formatted node string \"%s\"...", buf); >> return_value = NULL; /* keep compiler quiet */ >> } We could spread what the first patch did in readfuncs.c by having some more macros doing the duplicated work. Not that it would improve the code readability of those macros.. > (Apropos, that terminator belongs in buf[32], not buf[33].) Indeed. > Perhaps we're better off setting aside the whole idea, At least on OSX (10.8), I am seeing that no more than the number of bytes defined by the precision is written. So it looks that we are safe there. So yes thinking long-term this looks the better alternative. And I am wondering about the potential breakage that this could actually have with Postgres third-part tools using src/port's snprintf. > or forcing use of snprintf.c on configurations having the bug? I am less sure about that. It doesn't seem worth it knowing that the tendance is to evaluate the precision in terms in bytes and not characters. -- Michael