Thread: Failed assert ((data - start) == data_size) in heaptuple.c

Failed assert ((data - start) == data_size) in heaptuple.c

From
Brendan Jurd
Date:
Hi folks,

I am running a 9.0.3 Hot Standy + Streaming Replication slave which
occasionally segfaults (every 1-2 days).  I rebuilt Postgres with
--enable-cassert and --enable-debug, switched on core dumping and
waited for some results.

The first crash since enabling debugging was a failed assert in heaptuple.c:

TRAP: FailedAssertion("!((data - start) == data_size)", File:
"heaptuple.c", Line: 255)
2011-04-07 04:20:20 EST LOG:  server process (PID 32195) was
terminated by signal 6: Aborted
2011-04-07 04:20:20 EST LOG:  terminating any other active server processes

For context, the only things running on this server are the slave
database, and a tomcat instance.  The tomcat instance is the only
connection into this database, which continually runs through a series
of SELECTs (100ms sleep between each run).

The slave database is basically stock standard 9.0.3 config, apart
from the replication setup and shared_buffers increased to 2GB.

Here's the backtrace:

Core was generated by `postgres: backend surecast 127.0.0.1(37155)
SELECT                            '.
Program terminated with signal 6, Aborted.
#0  0x00007f40a93aba75 in *__GI_raise (sig=<value optimised out>)   at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.       in
../nptl/sysdeps/unix/sysv/linux/raise.c
(gdb) bt
#0  0x00007f40a93aba75 in *__GI_raise (sig=<value optimised out>)   at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f40a93af5c0 in *__GI_abort () at abort.c:92
#2  0x00000000006f861d in ExceptionalCondition (conditionName=<value
optimised out>,   errorType=<value optimised out>, fileName=<value optimised out>,   lineNumber=<value optimised out>)
atassert.c:57
 
#3  0x0000000000459b07 in heap_form_minimal_tuple
(tupleDescriptor=0xf6bdd0, values=0x8,   isnull=0xf6c680 "") at heaptuple.c:1459
#4  0x0000000000580d12 in ExecFetchSlotMinimalTuple (slot=0xf6bb90) at
execTuples.c:684
#5  0x0000000000588d10 in ExecHashTableInsert (hashtable=0xf4c3b0, slot=0x7dc3,   hashvalue=6) at nodeHash.c:697
#6  0x0000000000589bf6 in MultiExecHash (node=<value optimised out>)
at nodeHash.c:123
#7  0x000000000058a9ab in ExecHashJoin (node=0xf24008) at nodeHashjoin.c:154
#8  0x00000000005788a8 in ExecProcNode (node=0xf24008) at execProcnode.c:427
#9  0x000000000058fb21 in ExecNestLoop (node=0xf8eb98) at nodeNestloop.c:120
#10 0x00000000005788c8 in ExecProcNode (node=0xf8eb98) at execProcnode.c:419
#11 0x000000000057756d in ExecutePlan (queryDesc=0xf8bc30,
direction=32195, count=0)   at execMain.c:1187
#12 standard_ExecutorRun (queryDesc=0xf8bc30, direction=32195,
count=0) at execMain.c:280
#13 0x0000000000642e28 in PortalRunSelect (portal=0xf11f88,
forward=<value optimised out>,   count=0, dest=0xe00120) at pquery.c:952
#14 0x00000000006442e9 in PortalRun (portal=<value optimised out>,   count=<value optimised out>, isTopLevel=<value
optimisedout>,   dest=<value optimised out>, altdest=<value optimised out>,   completionTag=<value optimised out>) at
pquery.c:796
#15 0x00000000006419e3 in exec_execute_message (argc=<value optimised out>,   argv=<value optimised out>,
username=<valueoptimised out>) at
 
postgres.c:2003
#16 PostgresMain (argc=<value optimised out>, argv=<value optimised out>,   username=<value optimised out>) at
postgres.c:3988
#17 0x0000000000606a07 in BackendRun () at postmaster.c:3555
#18 BackendStartup () at postmaster.c:3242
#19 ServerLoop () at postmaster.c:1431
#20 0x000000000060931d in PostmasterMain (argc=14918336, argv=0xdfb160)   at postmaster.c:1092
#21 0x00000000005a9310 in main (argc=5, argv=0xdfb140) at main.c:188

Let me know if there is any additional information I can provide.

Cheers,
BJ


Re: [BUGS] Failed assert ((data - start) == data_size) in heaptuple.c

From
Tom Lane
Date:
Brendan Jurd <direvus@gmail.com> writes:
> TRAP: FailedAssertion("!((data - start) == data_size)", File:
> "heaptuple.c", Line: 255)

[ scratches head ... ]  That implies that heap_fill_tuple came to a
different conclusion about a tuple's data size than the immediately
preceding heap_compute_data_size.  Which I would sure want to believe
is impossible.  Have you checked for flaky memory on this machine?
        regards, tom lane


Re: [BUGS] Failed assert ((data - start) == data_size) in heaptuple.c

From
Brendan Jurd
Date:
On 7 April 2011 16:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Brendan Jurd <direvus@gmail.com> writes:
>> TRAP: FailedAssertion("!((data - start) == data_size)", File:
>> "heaptuple.c", Line: 255)
>
> [ scratches head ... ]  That implies that heap_fill_tuple came to a
> different conclusion about a tuple's data size than the immediately
> preceding heap_compute_data_size.  Which I would sure want to believe
> is impossible.  Have you checked for flaky memory on this machine?
>

We are doing so now -- although the RAM is ECC and just a few months
old, so flakiness seems a distant possibility.  I will report back
after we've given it a proper thrashing with memtest.

Cheers,
BJ


Re: [BUGS] Failed assert ((data - start) == data_size) in heaptuple.c

From
Craig Ringer
Date:
On 04/07/2011 03:07 PM, Brendan Jurd wrote:
> On 7 April 2011 16:56, Tom Lane<tgl@sss.pgh.pa.us>  wrote:
>> Brendan Jurd<direvus@gmail.com>  writes:
>>> TRAP: FailedAssertion("!((data - start) == data_size)", File:
>>> "heaptuple.c", Line: 255)
>>
>> [ scratches head ... ]  That implies that heap_fill_tuple came to a
>> different conclusion about a tuple's data size than the immediately
>> preceding heap_compute_data_size.  Which I would sure want to believe
>> is impossible.  Have you checked for flaky memory on this machine?
>>
>
> We are doing so now -- although the RAM is ECC and just a few months
> old, so flakiness seems a distant possibility.  I will report back
> after we've given it a proper thrashing with memtest.

Apparently bad RAM can also mean faulty CPU (bad cache, heat problems, 
etc). memtest86 seems ... rough ... at best when it comes to finding 
issues; I've had some systems run it for a day yet continuously segfault 
in real-world use until the RAM was re-seated or swapped out.

--
Craig Ringer


Re: [BUGS] Failed assert ((data - start) == data_size) in heaptuple.c

From
Alvaro Herrera
Date:
Excerpts from Brendan Jurd's message of jue abr 07 03:07:32 -0300 2011:
> Hi folks,
> 
> I am running a 9.0.3 Hot Standy + Streaming Replication slave which
> occasionally segfaults (every 1-2 days).  I rebuilt Postgres with
> --enable-cassert and --enable-debug, switched on core dumping and
> waited for some results.

What's the platform, and what's the query?  Are there funny datatypes
involved?

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: [BUGS] Failed assert ((data - start) == data_size) in heaptuple.c

From
Brendan Jurd
Date:
On 8 April 2011 00:16, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Excerpts from Brendan Jurd's message of jue abr 07 03:07:32 -0300 2011:
>> I am running a 9.0.3 Hot Standy + Streaming Replication slave which
>> occasionally segfaults (every 1-2 days).  I rebuilt Postgres with
>> --enable-cassert and --enable-debug, switched on core dumping and
>> waited for some results.
>
> What's the platform, and what's the query?  Are there funny datatypes
> involved?

Ubuntu 10.04 x64 on:

HP DL380R05
1x Quad Core Xeon E5440
10GB PC 5400 DDR ECC
2x HP 146GB 15krpm SAS drives in RAID 1+0

The tomcat instance repeatedly runs a series of some 9 queries, I'm
not sure which of the queries is the culprit or even whether it is the
same one each time.  However, they are all straightforward SELECTs.
The one with the most complicated plan joins a whole six tables.  I do
keep the transaction open until I have executed all the SELECTs in the
series, then commit and start over again with a fresh transaction.
That's just to make sure all of the queries are pulling data from the
same snapshot.

As for datatypes, I do have one type that I have defined which is used
in one of the queries.  It's just an RGB colour value, defined as a
composite type:

CREATE DOMAIN colour_channel AS smallint   CHECK (VALUE >= 0 AND VALUE < 256);

CREATE TYPE rgb AS (   red         colour_channel,   green       colour_channel,   blue        colour_channel
);

All of the user-defined functions I have written for this db are
either SQL or PL/pgSQL, and all of the functions called by these
queries are either STABLE or IMMUTABLE.

Cheers,
BJ


Re: [BUGS] Failed assert ((data - start) == data_size) in heaptuple.c

From
Brendan Jurd
Date:
On 7 April 2011 16:56, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Brendan Jurd <direvus@gmail.com> writes:
>> TRAP: FailedAssertion("!((data - start) == data_size)", File:
>> "heaptuple.c", Line: 255)
>
> [ scratches head ... ]  That implies that heap_fill_tuple came to a
> different conclusion about a tuple's data size than the immediately
> preceding heap_compute_data_size.  Which I would sure want to believe
> is impossible.  Have you checked for flaky memory on this machine?
>

Memtest didn't report any errors.  I intend to try swapping out the
RAM tomorrow, but in the meantime we got a *different* assertion
failure today.  The fact that we are tripping over various different
assertions seems to lend further weight to the flaky hardware
hypothesis.

TRAP: FailedAssertion("!(((lpp)->lp_flags == 1))", File: "heapam.c", Line: 727)

#0  0x00007f2773f23a75 in *__GI_raise (sig=<value optimised out>)   at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f2773f275c0 in *__GI_abort () at abort.c:92
#2  0x00000000006f9eed in ExceptionalCondition (conditionName=<value
optimised out>,   errorType=<value optimised out>, fileName=<value optimised out>,   lineNumber=<value optimised out>)
atassert.c:57 
#3  0x0000000000473641 in heapgettup_pagemode (scan=0x2366da8,
dir=<value optimised out>,   nkeys=<value optimised out>, key=<value optimised out>) at heapam.c:727
#4  0x0000000000474b16 in heap_getnext (scan=0x2366da8,
direction=1495) at heapam.c:1322
#5  0x0000000000590fcb in SeqNext (node=<value optimised out>) at
nodeSeqscan.c:66
#6  0x00000000005808ff in ExecScanFetch (node=0x22d5ff8,
accessMtd=<value optimised out>,   recheckMtd=<value optimised out>) at execScan.c:82
#7  ExecScan (node=0x22d5ff8, accessMtd=<value optimised out>,   recheckMtd=<value optimised out>) at execScan.c:164
#8  0x0000000000578d58 in ExecProcNode (node=0x22d5ff8) at execProcnode.c:378
#9  0x000000000058abf7 in ExecHashJoinOuterGetTuple (node=0x22d4a60)
at nodeHashjoin.c:562
#10 ExecHashJoin (node=0x22d4a60) at nodeHashjoin.c:187
#11 0x0000000000578ca8 in ExecProcNode (node=0x22d4a60) at execProcnode.c:427
#12 0x000000000058abf7 in ExecHashJoinOuterGetTuple (node=0x22d3430)
at nodeHashjoin.c:562
#13 ExecHashJoin (node=0x22d3430) at nodeHashjoin.c:187
#14 0x0000000000578ca8 in ExecProcNode (node=0x22d3430) at execProcnode.c:427
#15 0x0000000000590021 in ExecNestLoop (node=0x22d26d8) at nodeNestloop.c:120
#16 0x0000000000578cc8 in ExecProcNode (node=0x22d26d8) at execProcnode.c:419
#17 0x0000000000590021 in ExecNestLoop (node=0x22c0c88) at nodeNestloop.c:120
#18 0x0000000000578cc8 in ExecProcNode (node=0x22c0c88) at execProcnode.c:419
#19 0x0000000000591bf9 in ExecSort (node=0x22c0a50) at nodeSort.c:102
#20 0x0000000000578c88 in ExecProcNode (node=0x22c0a50) at execProcnode.c:438
#21 0x000000000057795e in ExecutePlan (queryDesc=0x23151f0,
direction=1495, count=0)   at execMain.c:1187
#22 standard_ExecutorRun (queryDesc=0x23151f0, direction=1495,
count=0) at execMain.c:280
#23 0x0000000000643d67 in PortalRunSelect (portal=0x229bf78,   forward=<value optimised out>, count=0, dest=0x218a120)
atpquery.c:952 
#24 0x0000000000645210 in PortalRun (portal=<value optimised out>,   count=<value optimised out>, isTopLevel=<value
optimisedout>,   dest=<value optimised out>, altdest=<value optimised out>,   completionTag=<value optimised out>) at
pquery.c:796
#25 0x00000000006428dc in exec_execute_message (argc=<value optimised out>,   argv=<value optimised out>,
username=<valueoptimised out>) at 
postgres.c:2003
#26 PostgresMain (argc=<value optimised out>, argv=<value optimised out>,   username=<value optimised out>) at
postgres.c:3988
#27 0x0000000000607351 in BackendRun () at postmaster.c:3555
#28 BackendStartup () at postmaster.c:3242
#29 ServerLoop () at postmaster.c:1431
#30 0x0000000000609c6d in PostmasterMain (argc=35406528, argv=0x2185160)   at postmaster.c:1092
#31 0x00000000005a99a0 in main (argc=5, argv=0x2185140) at main.c:188


Re: [BUGS] Failed assert ((data - start) == data_size) in heaptuple.c

From
Alvaro Herrera
Date:
Excerpts from Brendan Jurd's message of vie abr 08 06:00:22 -0300 2011:

> Memtest didn't report any errors.  I intend to try swapping out the
> RAM tomorrow, but in the meantime we got a *different* assertion
> failure today.  The fact that we are tripping over various different
> assertions seems to lend further weight to the flaky hardware
> hypothesis.
> 
> TRAP: FailedAssertion("!(((lpp)->lp_flags == 1))", File: "heapam.c", Line: 727)

Yep.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: [BUGS] Failed assert ((data - start) == data_size) in heaptuple.c

From
Brendan Jurd
Date:
On 9 April 2011 00:41, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Excerpts from Brendan Jurd's message of vie abr 08 06:00:22 -0300 2011:
>> Memtest didn't report any errors.  I intend to try swapping out the
>> RAM tomorrow, but in the meantime we got a *different* assertion
>> failure today.  The fact that we are tripping over various different
>> assertions seems to lend further weight to the flaky hardware
>> hypothesis.
>>
>> TRAP: FailedAssertion("!(((lpp)->lp_flags == 1))", File: "heapam.c", Line: 727)
>
> Yep.
>

I swapped the RAM with another machine, and after a few hours running
the other machine popped a segfault.  The faulty RAM diagnosis is now
official, so I won't be bothering you folks about this any further.

Cheers,
BJ