Thread: [HACKERS] Tightening isspace() tests where behavior should match SQL parser
The proximate cause of bug #14662, https://www.postgresql.org/message-id/flat/20170519162316.29945.5021%40wrigleys.postgresql.org appears to be that SplitIdentifierString's isspace() checks are identifying some bytes of a multibyte character as spaces. It's not quite clear to me whether there's an encoding configuration problem involved in this specific report, but in any case feeding individual bytes of a multibyte character to <ctype.h> functions is highly dubious. The best you can say about that is that the behavior will be platform-dependent. I think that the easiest way to resolve this is to replace the isspace() calls with scanner_isspace(), on the grounds that we are trying to parse identifiers the same way that the core SQL lexer would, and therefore we should use its definition of whitespace. I went looking through all our other calls of isspace() to see if we have the same problem anywhere else, and identified these functions as being at risk: parse_ident() regproc.c's parseNameAndArgTypes (for regprocedurein and siblings) SplitIdentifierString() SplitDirectoriesString() In all four cases, we must allow for non-ASCII input and it's arguable that correct behavior is to match the core lexer, so I propose replacing isspace() with scanner_isspace() in these places. There are quite a lot more isspace() calls than that of course, but the rest of them seem probably all right to me. As an example, I don't see a need to make float8in() stricter about what it will allow as trailing whitespace. We're not expecting any non-ASCII input really, and if we do see multibyte characters and all the bytes manage to pass isspace(), not much harm is done. Attached is a proposed patch. I'm vacillating on whether to back-patch this --- it will fix a reported bug, but it seems at least somewhat conceivable that it will also break cases that were working acceptably before. Thoughts? regards, tom lane diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c index 7dcecb2..9cc0b08 100644 --- a/src/backend/utils/adt/misc.c +++ b/src/backend/utils/adt/misc.c @@ -770,7 +770,7 @@ parse_ident(PG_FUNCTION_ARGS) nextp = qualname_str; /* skip leading whitespace */ - while (isspace((unsigned char) *nextp)) + while (scanner_isspace(*nextp)) nextp++; for (;;) @@ -858,14 +858,14 @@ parse_ident(PG_FUNCTION_ARGS) text_to_cstring(qualname)))); } - while (isspace((unsigned char) *nextp)) + while (scanner_isspace(*nextp)) nextp++; if (*nextp == '.') { after_dot = true; nextp++; - while (isspace((unsigned char) *nextp)) + while (scanner_isspace(*nextp)) nextp++; } else if (*nextp == '\0') diff --git a/src/backend/utils/adt/regproc.c b/src/backend/utils/adt/regproc.c index a90452c..ba879e8 100644 --- a/src/backend/utils/adt/regproc.c +++ b/src/backend/utils/adt/regproc.c @@ -32,6 +32,7 @@ #include "lib/stringinfo.h" #include "miscadmin.h" #include "parser/parse_type.h" +#include "parser/scansup.h" #include "utils/builtins.h" #include "utils/lsyscache.h" #include "utils/syscache.h" @@ -1769,7 +1770,7 @@ parseNameAndArgTypes(const char *string, bool allowNone, List **names, ptr2 = ptr + strlen(ptr); while (--ptr2 > ptr) { - if (!isspace((unsigned char) *ptr2)) + if (!scanner_isspace(*ptr2)) break; } if (*ptr2 != ')') @@ -1786,7 +1787,7 @@ parseNameAndArgTypes(const char *string, bool allowNone, List **names, for (;;) { /* allow leading whitespace */ - while (isspace((unsigned char) *ptr)) + while (scanner_isspace(*ptr)) ptr++; if (*ptr == '\0') { @@ -1842,7 +1843,7 @@ parseNameAndArgTypes(const char *string, bool allowNone, List **names, /* Lop off trailing whitespace */ while (--ptr2 >= typename) { - if (!isspace((unsigned char) *ptr2)) + if (!scanner_isspace(*ptr2)) break; *ptr2 = '\0'; } diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c index be399f4..a0dd391 100644 --- a/src/backend/utils/adt/varlena.c +++ b/src/backend/utils/adt/varlena.c @@ -3252,7 +3252,7 @@ SplitIdentifierString(char *rawstring, char separator, *namelist = NIL; - while (isspace((unsigned char) *nextp)) + while (scanner_isspace(*nextp)) nextp++; /* skip leading whitespace */ if (*nextp == '\0') @@ -3290,7 +3290,7 @@ SplitIdentifierString(char *rawstring, char separator, curname = nextp; while (*nextp && *nextp != separator && - !isspace((unsigned char) *nextp)) + !scanner_isspace(*nextp)) nextp++; endp = nextp; if (curname == nextp) @@ -3312,13 +3312,13 @@ SplitIdentifierString(char *rawstring, char separator, pfree(downname); } - while (isspace((unsigned char) *nextp)) + while (scanner_isspace(*nextp)) nextp++; /* skip trailing whitespace */ if (*nextp == separator) { nextp++; - while (isspace((unsigned char) *nextp)) + while (scanner_isspace(*nextp)) nextp++; /* skip leading whitespace for next */ /* we expect another name, so done remains false */ } @@ -3377,7 +3377,7 @@ SplitDirectoriesString(char *rawstring, char separator, *namelist = NIL; - while (isspace((unsigned char) *nextp)) + while (scanner_isspace(*nextp)) nextp++; /* skip leading whitespace */ if (*nextp == '\0') @@ -3414,7 +3414,7 @@ SplitDirectoriesString(char *rawstring, char separator, while (*nextp && *nextp != separator) { /* trailing whitespace should not be included in name */ - if (!isspace((unsigned char) *nextp)) + if (!scanner_isspace(*nextp)) endp = nextp + 1; nextp++; } @@ -3422,13 +3422,13 @@ SplitDirectoriesString(char *rawstring, char separator, return false; /* empty unquoted name not allowed */ } - while (isspace((unsigned char) *nextp)) + while (scanner_isspace(*nextp)) nextp++; /* skip trailing whitespace */ if (*nextp == separator) { nextp++; - while (isspace((unsigned char) *nextp)) + while (scanner_isspace(*nextp)) nextp++; /* skip leading whitespace for next */ /* we expect another name, so done remains false */ } -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Tightening isspace() tests where behavior should matchSQL parser
From
Heikki Linnakangas
Date:
On 05/20/2017 01:48 PM, Tom Lane wrote: > Attached is a proposed patch. I'm vacillating on whether to > back-patch this --- it will fix a reported bug, but it seems > at least somewhat conceivable that it will also break cases > that were working acceptably before. Thoughts? +1 for back-patching. If I understand correctly, it would change the behavior when you pass a string with non-ASCII leading or trailing whitespace characters to those functions. I suppose that can happen, but it's only pure luck if the current code happens to work in that case. And even if so, it's IMO still quite reasonable to change the behavior, on the grounds that the new behavior is what the core SQL lexer would do. - Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes: > On 05/20/2017 01:48 PM, Tom Lane wrote: >> Attached is a proposed patch. I'm vacillating on whether to >> back-patch this --- it will fix a reported bug, but it seems >> at least somewhat conceivable that it will also break cases >> that were working acceptably before. Thoughts? > +1 for back-patching. If I understand correctly, it would change the > behavior when you pass a string with non-ASCII leading or trailing > whitespace characters to those functions. I suppose that can happen, but > it's only pure luck if the current code happens to work in that case. Well, it'd work properly for e.g. no-break space in LATINn. But that seems like a very narrow use-case, and probably not enough to justify the misbehavior you might get in multi-byte character sets. A compromise possibly worth considering is to apply isspace() only in single-byte encodings. I think that's more complication than it's worth, but others might think differently. regards, tom lane
I wrote: > Heikki Linnakangas <hlinnaka@iki.fi> writes: >> +1 for back-patching. If I understand correctly, it would change the >> behavior when you pass a string with non-ASCII leading or trailing >> whitespace characters to those functions. I suppose that can happen, but >> it's only pure luck if the current code happens to work in that case. > Well, it'd work properly for e.g. no-break space in LATINn. Actually, it's dubious that treating no-break space as whitespace is correct at all in these use-cases. The core scanner would think it's an identifier character, so it's not impossible that somebody would consider it cute to write as part of a SQL identifier. If the core scanner accepts that, so must these functions. Hence, applied and back-patched. regards, tom lane
Re: [HACKERS] Tightening isspace() tests where behavior should matchSQL parser
From
Peter Eisentraut
Date:
On 5/24/17 15:34, Tom Lane wrote: > I wrote: >> Heikki Linnakangas <hlinnaka@iki.fi> writes: >>> +1 for back-patching. If I understand correctly, it would change the >>> behavior when you pass a string with non-ASCII leading or trailing >>> whitespace characters to those functions. I suppose that can happen, but >>> it's only pure luck if the current code happens to work in that case. > >> Well, it'd work properly for e.g. no-break space in LATINn. > > Actually, it's dubious that treating no-break space as whitespace is > correct at all in these use-cases. The core scanner would think it's > an identifier character, so it's not impossible that somebody would > consider it cute to write as part of a SQL identifier. If > the core scanner accepts that, so must these functions. The SQL standard might permit non-breaking spaces or similar things as token delimiters, so it could be legitimate to look into changing that sometime. But in any case it should be consistent, so it's correct to make this change now. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services