Thread: random crashes on -HEAD for a few days now

random crashes on -HEAD for a few days now

From
Stefan Kaltenbrunner
Date:
at least one of my buildfarm members (emu) is crashing on what seems 
totally unrelated regression tests for a few days now:

http://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=emu&br=HEAD

it took me about 10 tries to reproduce that manually and I'm getting the 
following stacktrace:


#0  varbit_out (fcinfo=0x88c75000) at varbit.c:549
549             x = *sp;
(gdb) bt
#0  varbit_out (fcinfo=0x88c75000) at varbit.c:549
#1  0x1c217930 in FunctionCall1 (flinfo=0x1, arg1=2294763520) at fmgr.c:1195
#2  0x1c036fae in printtup (slot=0x88c730b0, self=0x7ecf4bc8) at 
printtup.c:326
#3  0x1c10ab13 in ExecSelect (slot=0x88c730b0, dest=0x88c75000, 
estate=0x88c7301c) at execMain.c:1427
#4  0x1c10a8b3 in ExecutePlan (estate=0x88c7301c, planstate=0x88c731b4, 
operation=CMD_SELECT, numberTuples=0, direction=-2000203776, 
dest=0x7ecf4bc8) at execMain.c:1353
#5  0x1c109793 in ExecutorRun (queryDesc=0x7fc60574, 
direction=ForwardScanDirection, count=0) at execMain.c:243
#6  0x1c19b917 in PortalRunSelect (portal=0x7fd9f01c, forward=1 '\001', 
count=0, dest=0x7ecf4bc8) at pquery.c:943
#7  0x1c19b63e in PortalRun (portal=0x7fd9f01c, count=2147483647, 
isTopLevel=1 '\001', dest=0x7ecf4bc8, altdest=0x7ecf4bc8, 
completionTag=0xcfbd1e50 "") at pquery.c:797
#8  0x1c19709b in exec_simple_query (    query_string=0x7e4e301c "SELECT v,\n       SUBSTRING(v FROM 2 FOR 
4) AS sub_2_4,\n       SUBSTRING(v FROM 7 FOR 13) AS sub_7_13,\n 
SUBSTRING(v FROM 6) AS sub_6\n       FROM VARBIT_TABLE;") at postgres.c:962
#9  0x1c199fe2 in PostgresMain (argc=4, argv=0x894395c0, 
username=0x89439454 "mastermind") at postgres.c:3529
#10 0x1c171fbe in BackendRun (port=0x88ac1400) at postmaster.c:3177
#11 0x1c171864 in BackendStartup (port=0x88ac1400) at postmaster.c:2800
#12 0x1c16f8af in ServerLoop () at postmaster.c:1272
#13 0x1c16ee85 in PostmasterMain (argc=6, argv=0xcfbd224c) at 
postmaster.c:1027
#14 0x1c12bf74 in main (argc=6, argv=0xcfbd224c) at main.c:188



Stefan


Re: random crashes on -HEAD for a few days now

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> at least one of my buildfarm members (emu) is crashing on what seems 
> totally unrelated regression tests for a few days now:

I was wondering about that ...

> it took me about 10 tries to reproduce that manually and I'm getting the 
> following stacktrace:

> #0  varbit_out (fcinfo=0x88c75000) at varbit.c:549
> 549             x = *sp;

Just eyeballing that code, it looks like it will try to fetch the byte
immediately beyond the end of the bit array, when the number of bits is
an exact multiple of 8.  This is unlikely to cause a problem but it
*could* happen that the input is right up against the end of memory.
Could you check whether that is what happened here?  (The important
question is whether the input seems to be sane, ie, "len" isn't huge.)
        regards, tom lane


Re: random crashes on -HEAD for a few days now

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> at least one of my buildfarm members (emu) is crashing on what seems 
>> totally unrelated regression tests for a few days now:
> 
> I was wondering about that ...
> 
>> it took me about 10 tries to reproduce that manually and I'm getting the 
>> following stacktrace:
> 
>> #0  varbit_out (fcinfo=0x88c75000) at varbit.c:549
>> 549             x = *sp;
> 
> Just eyeballing that code, it looks like it will try to fetch the byte
> immediately beyond the end of the bit array, when the number of bits is
> an exact multiple of 8.  This is unlikely to cause a problem but it
> *could* happen that the input is right up against the end of memory.
> Could you check whether that is what happened here?  (The important
> question is whether the input seems to be sane, ie, "len" isn't huge.)

"end of memory" sounds familiar to:

http://archives.postgresql.org/pgsql-hackers/2005-06/msg00819.php

which is how emu is (still) set up.

as for len it seems to be 0:

#0  varbit_out (fcinfo=0x88c75000) at varbit.c:549        s = (VarBit *) 0x88c75000        result = 0x84d33128 ""
r = 0x84d33128 ""        sp = (bits8 *) 0x88c75000 <Address 0x88c75000 out of bounds>        x = 0 '\0'        i = 0
   k = 0        len = 0
 


Stefan


Re: random crashes on -HEAD for a few days now

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> as for len it seems to be 0:

> #0  varbit_out (fcinfo=0x88c75000) at varbit.c:549
>          s = (VarBit *) 0x88c75000
>          result = 0x84d33128 ""
>          r = 0x84d33128 ""
>          sp = (bits8 *) 0x88c75000 <Address 0x88c75000 out of bounds>
>          x = 0 '\0'
>          i = 0
>          k = 0
>          len = 0

Hmm ... s and sp really shouldn't be equal, nor equal to fcinfo, but
it's likely that the compiler optimized them into the same register.
We need to confirm what actually got passed as the argument.  Could you
go to frame 1 and see what is in its local fcinfo var, in particular see
what args[0] is?  I'm betting it's 0x88c75000 minus 8 ... if so, look at
what is in those last 8 bytes.  If that's int32 8 followed by int32 0,
then indeed we have a zero-length bitstring at the end of memory.
        regards, tom lane


Re: random crashes on -HEAD for a few days now

From
Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>> as for len it seems to be 0:
> 
>> #0  varbit_out (fcinfo=0x88c75000) at varbit.c:549
>>          s = (VarBit *) 0x88c75000
>>          result = 0x84d33128 ""
>>          r = 0x84d33128 ""
>>          sp = (bits8 *) 0x88c75000 <Address 0x88c75000 out of bounds>
>>          x = 0 '\0'
>>          i = 0
>>          k = 0
>>          len = 0
> 
> Hmm ... s and sp really shouldn't be equal, nor equal to fcinfo, but
> it's likely that the compiler optimized them into the same register.
> We need to confirm what actually got passed as the argument.  Could you
> go to frame 1 and see what is in its local fcinfo var, in particular see
> what args[0] is?  I'm betting it's 0x88c75000 minus 8 ... if so, look at
> what is in those last 8 bytes.  If that's int32 8 followed by int32 0,
> then indeed we have a zero-length bitstring at the end of memory.

with a bit of help from alvaro:

(gdb) frame 1
#1  0x1c217930 in FunctionCall1 (flinfo=0x1, arg1=2294763520) at fmgr.c:1195
1195            result = FunctionCallInvoke(&fcinfo);
(gdb) info args
flinfo = (FmgrInfo *) 0x1
arg1 = 2294763520
(gdb) info locals
fcinfo = {flinfo = 0x84d33028, context = 0x0, resultinfo = 0x0, isnull = 0 '\0', nargs = 1, arg = {2294763512, 16,
2377208416,1,   2343471056, 2343471056, 4294967295, 2342861632, 0, 0, 3485276712,   470248306, 11, 1, 4294967295, 257,
2294762772,6, 2294762772,
 
227882802,   197, 0, 0, 20480, 3703223788, 4098, 4294967295, 0, 1, 0, 3485276792,   471883625, 470050980, 1560,
227725220,764289000, 3703223788,
 
2228453376,   3485276872, 3485276864, 2234862596, 2263890620, 3485277048, 471889873,   2234862596, 2263890620, 1560,
3485277024,5, 533, 533, 0, 0,
 
3485276904,   4294967295, 2228432896, 2263890588, 483, 4031427043, 9314280, 0,
262142,   0, 471661208, 184, 65538, 0, 2137853048, 0, 1560, 0 <repeats 21 times>,   2228433032, 2228433044, 1565,
3485277048,471950646, 1565, 0, 0, 0}, argnull =
 

"\000\001\000\000\000\000\000\000\030\006\000\000\2100?204\000\000\000\000\003\000\000\000\230\033??220L5\205\000\004\000\000???\033??220L5\205\b\000\000\000X\022?213?\033??216?\"\034\220L5\205\b\000\000\000\002\000\000\000&@\022\034
\000\000\000X\022?213??210?D\005\034??210"}
result = 2228432924

(gdb) x/16x 0x88c75000 - 8
0x88c74ff8:     0x00000020      0x00000000      Cannot access memory at
address 0x88c75000

is that what you are interested in ?


Stefan


Re: random crashes on -HEAD for a few days now

From
Tom Lane
Date:
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> (gdb) info locals
> fcinfo = {flinfo = 0x84d33028, context = 0x0, resultinfo = 0x0,
>   isnull = 0 '\0', nargs = 1, arg = {2294763512, 16, 2377208416, 1,
> ...
> (gdb) x/16x 0x88c75000 - 8
> 0x88c74ff8:     0x00000020      0x00000000      Cannot access memory at
> address 0x88c75000

> is that what you are interested in ?

Yup, that seems pretty conclusive.  Patch committed --- thanks for
tracking it down!
        regards, tom lane