Thread: BUG #5590: undefined shift behavior
The following bug has been logged online: Bug reference: 5590 Logged by: John Regehr Email address: regehr@cs.utah.edu PostgreSQL version: head 8/2/10 Operating system: OSX Description: undefined shift behavior Details: During a "make check" the left-shift operator at tsquery_util.c 48:18 is passed a negative right-hand argument a number of times.
"John Regehr" <regehr@cs.utah.edu> writes: > Bug reference: 5590 > Logged by: John Regehr > Email address: regehr@cs.utah.edu > PostgreSQL version: head 8/2/10 > Operating system: OSX > Description: undefined shift behavior > Details: > During a "make check" the left-shift operator at tsquery_util.c 48:18 is > passed a negative right-hand argument a number of times. Hmm. valcrc is declared as signed int32, so depending on what your compiler thinks the semantics of % is, this clearly can potentially happen. I notice the same problem in makeTSQuerySign() in tsquery_op.c. The fix is presumably to cast the valcrc value to unsigned int before executing %. However, I'm a bit worried about whether this could change the results, and if it did whether that would invalidate any on-disk data structures. Oleg, Teodor, do either TSQuerySign or QTNode.sign ever get to disk? John: how did you detect this? regards, tom lane
Hi Tom, One of my students has hacked Clang to detect integer undefined behaviors in C, like this shift problem or signed overflows. This was the only problem that came up during a "make check" of a postgresql with this checking turned on, which is pretty cool. I'd expect to be able to find more problems if I could get hold of a good fuzz tester for postgresql, or at least some much larger test inputs. Are there any of these you folks would suggest that I use? Thanks, John On 08/02/2010 09:06 AM, Tom Lane wrote: > "John Regehr" <regehr@cs.utah.edu> writes: >> Bug reference: 5590 >> Logged by: John Regehr >> Email address: regehr@cs.utah.edu >> PostgreSQL version: head 8/2/10 >> Operating system: OSX >> Description: undefined shift behavior >> Details: > >> During a "make check" the left-shift operator at tsquery_util.c 48:18 is >> passed a negative right-hand argument a number of times. > > Hmm. valcrc is declared as signed int32, so depending on what your > compiler thinks the semantics of % is, this clearly can potentially > happen. I notice the same problem in makeTSQuerySign() in tsquery_op.c. > > The fix is presumably to cast the valcrc value to unsigned int before > executing %. However, I'm a bit worried about whether this could change > the results, and if it did whether that would invalidate any on-disk > data structures. Oleg, Teodor, do either TSQuerySign or QTNode.sign > ever get to disk? > > John: how did you detect this? > > regards, tom lane >
John Regehr <regehr@cs.utah.edu> writes: > On 08/02/2010 09:06 AM, Tom Lane wrote: >> John: how did you detect this? > One of my students has hacked Clang to detect integer undefined > behaviors in C, like this shift problem or signed overflows. Cool. > This was > the only problem that came up during a "make check" of a postgresql with > this checking turned on, which is pretty cool. Hrm, I'd have expected you to see a few integer overflows during the regression tests --- we do test that the overflow checks in places like int4pl work. You might be interested in this concurrent thread: http://archives.postgresql.org/pgsql-hackers/2010-08/msg00024.php particularly the comments about overflow. > I'd expect to be able to find more problems if I could get hold of a > good fuzz tester for postgresql, or at least some much larger test > inputs. Are there any of these you folks would suggest that I use? Yeah, the PG regression tests aren't amazingly good coverage-wise (although running the contrib tests as well as core helps --- did you do that?). I'm afraid I haven't got a good suggestion for you. regards, tom lane
> Hrm, I'd have expected you to see a few integer overflows during the > regression tests --- we do test that the overflow checks in places > like int4pl work. I saw no signed overflows. Our patch still has some rough edges, but this part is pretty well tested. Perhaps the int4pl checks fire before the overflowing operation can be performed, stopping them from happening? Anyway when I get a bit of time I'll stub out the checks and make sure that I see some checks firing. > You might be interested in this concurrent thread: > http://archives.postgresql.org/pgsql-hackers/2010-08/msg00024.php > particularly the comments about overflow. Thanks. I've been corresponding with Neil about this a bit. Nice to hear that Clang has -fwrapv now, I hadn't known about that. > Yeah, the PG regression tests aren't amazingly good coverage-wise > (although running the contrib tests as well as core helps --- did you > do that?). I'm afraid I haven't got a good suggestion for you. I haven't tried the contrib tests, I'll do that, thanks. Sooner or later we'll push our patch into LLVM and then anyone testing your code can help out. Just randomly: it's trivial to compile postgresql using clang on OSX. However, I cannot get it to work on Linux with or without Peter Eisentraut's patches. It's no big deal since I have a Mac sitting around but I could get stuff done faster on Linux! John
Tom, on the list you said: "I would be ecstatic if we could write int4pl like this: if (sum_overflows(arg1, arg2)) elog(ERROR, "overflow"); else return arg1 + arg2; " This is effectively what our clang patch does (automatically, for all integer operations). Right now our stuff isn't very efficient (~20% slowdown in SPEC INT) but LLVM does support intrinsics for detecting overflows using processor flags. Currently we cannot use these because they're buggy but we're working to get them fixed. I'd expect our integer-checked binaries to be pretty efficient once this is all working. John On 8/2/2010 10:16 AM, Tom Lane wrote: > John Regehr<regehr@cs.utah.edu> writes: >> On 08/02/2010 09:06 AM, Tom Lane wrote: >>> John: how did you detect this? > >> One of my students has hacked Clang to detect integer undefined >> behaviors in C, like this shift problem or signed overflows. > > Cool. > >> This was >> the only problem that came up during a "make check" of a postgresql with >> this checking turned on, which is pretty cool. > > Hrm, I'd have expected you to see a few integer overflows during the > regression tests --- we do test that the overflow checks in places > like int4pl work. You might be interested in this concurrent thread: > http://archives.postgresql.org/pgsql-hackers/2010-08/msg00024.php > particularly the comments about overflow. > >> I'd expect to be able to find more problems if I could get hold of a >> good fuzz tester for postgresql, or at least some much larger test >> inputs. Are there any of these you folks would suggest that I use? > > Yeah, the PG regression tests aren't amazingly good coverage-wise > (although running the contrib tests as well as core helps --- did you > do that?). I'm afraid I haven't got a good suggestion for you. > > regards, tom lane >
Aha-- the -fwrapv flag (which I had though was a nop) screws up our checks. Another rough edge to fix. Removing this flag caused us to find a bunch of integer overflows. I'll start reporting them later today. John
I wrote: > "John Regehr" <regehr@cs.utah.edu> writes: >> During a "make check" the left-shift operator at tsquery_util.c 48:18 is >> passed a negative right-hand argument a number of times. > Hmm. valcrc is declared as signed int32, so depending on what your > compiler thinks the semantics of % is, this clearly can potentially > happen. I notice the same problem in makeTSQuerySign() in tsquery_op.c. On further investigation, the reason makeTSQuerySign didn't show up in John's test is that it's coded like this: sign |= ((TSQuerySign) 1) << (ptr->qoperand.valcrc % TSQS_SIGLEN); where TSQS_SIGLEN is a macro that involves sizeof(TSQuerySign), and therefore will produce an unsigned result (on most machines anyway). So the % is done in unsigned arithmetic and there's no problem. I think it'd still be a good idea to cast valcrc to unsigned int explicitly here, but this is just for future-proofing not because there's a real bug ATM. Which is a good thing, because TSQuerySign does go to disk so we'd have a backwards-compatibility mess if we had to change this computation. The problem in QT2QTN() is real, on the other hand, because it's coded as node->sign = 1 << (in->qoperand.valcrc % 32); so the modulo is signed. Some investigation shows that Intel machines seem to give the right answer anyway, which has masked the problem for most people. But I find that on PPC, the result of the shift is *zero* if valcrc is negative --- haven't checked, but I wonder if that hardware interprets a negative left shift as a right shift. So on PPC the "sign" comes out as zero about half the time. As best I can tell at the moment, this only results in some inefficiency (because the Bloom filters don't work as intended) in some not-too- heavily-used operations; and the QTNode structures aren't written to disk so there's no compatibility issue from fixing the computation. To sum up: we should insert these casts in HEAD but at the moment I don't see a strong reason to back-patch, nor any indication that we'd be creating a need for a forced initdb. regards, tom lane