Thread: BUG #5238: frequent signal 11 segfaults

BUG #5238: frequent signal 11 segfaults

From: "Daniel Nagy"
The following bug has been logged online:

Bug reference:      5238
Logged by:          Daniel Nagy
Email address:      nagy.daniel@telekom.hu
PostgreSQL version: 8.4.1
Operating system:   Debian Lenny 5.0.3 x86_64. Kernel: 2.6.31.6-grsec
Description:        frequent signal 11 segfaults
Details:

I get postgres segfaults several times a day.
Postgres log:
Dec  9 21:15:07 goldbolt postgres[4515]: [292-1] user=,db= LOG:  00000:
server process (PID 8354) was terminated by signal 11: Segmentation fault
Dec  9 21:15:07 goldbolt postgres[4515]: [292-2] user=,db= LOCATION:
LogChildExit, postmaster.c:2725
Dec  9 21:15:07 goldbolt postgres[4515]: [293-1] user=,db= LOG:  00000:
terminating any other active server processes
Dec  9 21:15:07 goldbolt postgres[4515]: [293-2] user=,db= LOCATION:
HandleChildCrash, postmaster.c:2552

dmesg output:
postmaster[8354]: segfault at 7fbfbde42ee2 ip 00000000004534d0 sp
00007fff4b220f90 error 4 in postgres[400000+446000]
grsec: Segmentation fault occurred at 00007fbfbde42ee2 in
/usr/local/postgres-8.4.1/bin/postgres[postmaster:8354] uid/euid:111/111
gid/egid:114/114, parent
/usr/local/postgres-8.4.1/bin/postgres[postmaster:4515] uid/euid:111/111
gid/egid:114/114

Notes:
- Postgres was built from source with --enable-thread-safety
- Tried several kernels, no luck
- I also run postgres on different hardware; the same problem happens there too
- There are no signs of HW (memory, disk) errors
- No other daemons (apache, nginx) segfault, only postgres

The binaries are not stripped; how can I help find the cause?

Thanks a lot,

Daniel

Re: BUG #5238: frequent signal 11 segfaults

From: Tom Lane
"Daniel Nagy" <nagy.daniel@telekom.hu> writes:
> The binaries are not stripped; how can I help find the cause?

Get a stack trace from the core dump using gdb.

            regards, tom lane
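
[A minimal sketch of that workflow. The binary path is taken from the
grsec log above; the core file name and location are assumptions (a
crashing backend normally dumps core in its data directory, if core
dumps are enabled):]

ulimit -c unlimited          # in the environment that starts the postmaster
gdb /usr/local/postgres-8.4.1/bin/postgres /path/to/data/core
(gdb) bt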

Re: BUG #5238: frequent signal 11 segfaults

From: Craig Ringer
On 10/12/2009 5:12 AM, Daniel Nagy wrote:
>
> The following bug has been logged online:
>
> Bug reference:      5238
> Logged by:          Daniel Nagy
> Email address:      nagy.daniel@telekom.hu
> PostgreSQL version: 8.4.1
> Operating system:   Debian Lenny 5.0.3 x86_64. Kernel: 2.6.31.6-grsec
> Description:        frequent signal 11 segfaults
> Details:
>
> I get postgres segfaults several times a day.


http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

--
Craig Ringer
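
[The short version of what that page describes, as a sketch; procpid
and current_query are the 8.4 names of the pg_stat_activity columns,
and the database name is a placeholder:]

psql -d yourdb -c "SELECT procpid, current_query FROM pg_stat_activity;"
gdb -p <procpid>             # attach to the suspect backend
(gdb) continue               # gdb stops again when the SIGSEGV arrives
(gdb) bt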

Re: BUG #5238: frequent signal 11 segfaults

From: Tom Lane
Nagy Daniel <nagy.daniel@telekom.hu> writes:
> (gdb) backtrace
> #0  0x0000000000453415 in slot_deform_tuple ()
> #1  0x000000000045383a in slot_getattr ()
> #2  0x0000000000550dac in ExecHashGetHashValue ()
> #3  0x0000000000552a98 in ExecHashJoin ()
> #4  0x0000000000543368 in ExecProcNode ()
> #5  0x0000000000552aa6 in ExecHashJoin ()
> #6  0x0000000000543368 in ExecProcNode ()

Not terribly informative (these binaries are apparently not as
un-stripped as you thought).  However, this suggests it's a specific
query going wrong --- "p debug_query_string" in gdb might tell you what.
Please see if you can extract a test case.

            regards, tom lane

Re: BUG #5238: frequent signal 11 segfaults

From: Tom Lane
Nagy Daniel <nagy.daniel@telekom.hu> writes:
> (gdb) p debug_query_string
> $1 = 12099472

Huh, your stripped build is being quite unhelpful :-(.  I think
"p (char *) debug_query_string" would have produced something useful.

            regards, tom lane
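
[Spelled out: without debug symbols gdb has no type information for the
symbol, so it prints the pointer as a bare integer; the cast tells it to
dereference the pointer and print a C string. Sketched with the value
from above (12099472 is 0xb89f90 in hex; the query text shown is
illustrative):]

(gdb) p debug_query_string
$1 = 12099472
(gdb) p (char *) debug_query_string
$2 = 0xb89f90 "SELECT ..."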

Re: BUG #5238: frequent signal 11 segfaults

From: Nagy Daniel
Hi Guys,

Here you are:

nagyd@goldbolt:~$ gdb /usr/local/pgsql/bin/postgres core
GNU gdb 6.8-debian
...
warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libnss_compat.so.2...done.
Loaded symbols for /lib/libnss_compat.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libnss_nis.so.2...done.
Loaded symbols for /lib/libnss_nis.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Core was generated by `postgres: randir lovehunter 127.0.0.1(33247)
SELECT                           '.
Program terminated with signal 11, Segmentation fault.
[New process 11764]
#0  0x0000000000453415 in slot_deform_tuple ()
(gdb)
(gdb) backtrace
#0  0x0000000000453415 in slot_deform_tuple ()
#1  0x000000000045383a in slot_getattr ()
#2  0x0000000000550dac in ExecHashGetHashValue ()
#3  0x0000000000552a98 in ExecHashJoin ()
#4  0x0000000000543368 in ExecProcNode ()
#5  0x0000000000552aa6 in ExecHashJoin ()
#6  0x0000000000543368 in ExecProcNode ()
#7  0x0000000000552aa6 in ExecHashJoin ()
#8  0x0000000000543368 in ExecProcNode ()
#9  0x0000000000557251 in ExecSort ()
#10 0x0000000000543290 in ExecProcNode ()
#11 0x0000000000555308 in ExecMergeJoin ()
#12 0x0000000000543380 in ExecProcNode ()
#13 0x0000000000557251 in ExecSort ()
#14 0x0000000000543290 in ExecProcNode ()
#15 0x0000000000540e92 in standard_ExecutorRun ()
#16 0x00000000005ecc27 in PortalRunSelect ()
#17 0x00000000005edfd9 in PortalRun ()
#18 0x00000000005e93a7 in exec_simple_query ()
#19 0x00000000005ea977 in PostgresMain ()
#20 0x00000000005bf2a8 in ServerLoop ()
#21 0x00000000005c0037 in PostmasterMain ()
#22 0x0000000000569b48 in main ()
Current language:  auto; currently asm



Thanks,

Daniel


Craig Ringer wrote:
> On 10/12/2009 5:12 AM, Daniel Nagy wrote:
>> The following bug has been logged online:
>>
>> Bug reference:      5238
>> Logged by:          Daniel Nagy
>> Email address:      nagy.daniel@telekom.hu
>> PostgreSQL version: 8.4.1
>> Operating system:   Debian Lenny 5.0.3 x86_64. Kernel: 2.6.31.6-grsec
>> Description:        frequent signal 11 segfaults
>> Details:
>>
>> I get postgres segfaults several times a day.
>
>
> http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
>
> --
> Craig Ringer

Re: BUG #5238: frequent signal 11 segfaults

From: Nagy Daniel
(gdb) p debug_query_string
$1 = 12099472

Now I've recompiled pg with --enable-debug and am waiting for a new core dump.
I'll post the backtrace and the debug_query_string output ASAP.
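
[For anyone following along, the rebuild looks roughly like this; the
prefix and --enable-thread-safety are from earlier in the thread, the
rest is an assumption:]

./configure --prefix=/usr/local/postgres-8.4.1 \
            --enable-thread-safety --enable-debug
make && make install
# then restart the postmaster so the debug-enabled binary is running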

Please let me know if there is anything more I can do.

Thanks,

Daniel



Tom Lane wrote:
> Nagy Daniel <nagy.daniel@telekom.hu> writes:
>> (gdb) backtrace
>> #0  0x0000000000453415 in slot_deform_tuple ()
>> #1  0x000000000045383a in slot_getattr ()
>> #2  0x0000000000550dac in ExecHashGetHashValue ()
>> #3  0x0000000000552a98 in ExecHashJoin ()
>> #4  0x0000000000543368 in ExecProcNode ()
>> #5  0x0000000000552aa6 in ExecHashJoin ()
>> #6  0x0000000000543368 in ExecProcNode ()
>
> Not terribly informative (these binaries are apparently not as
> un-stripped as you thought).  However, this suggests it's a specific
> query going wrong --- "p debug_query_string" in gdb might tell you what.
> Please see if you can extract a test case.
>
>             regards, tom lane

Re: BUG #5238: frequent signal 11 segfaults

From: Tom Lane
Nagy Daniel <nagy.daniel@telekom.hu> writes:
> Here's a better backtrace:

The crash location suggests a problem with a corrupted tuple, but it's
impossible to guess where the tuple came from.  In particular I can't
guess whether this reflects on-disk data corruption or some internal
bug.  Now that you have (some of) the query, can you put together a test
case?  Or try "select * from" each of the tables used in the query to
check for on-disk corruption.

            regards, tom lane
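
[A sketch of that check, using the database and table names visible in
the query string above; the output is discarded since only errors
matter here:]

for t in valogatas_valasz useradat; do
    psql -d lovehunter -c "SELECT * FROM $t" > /dev/null || echo "$t: scan failed"
done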

Re: BUG #5238: frequent signal 11 segfaults

From: Nagy Daniel
Here's a better backtrace:

(gdb) bt
#0  slot_deform_tuple (slot=0xc325b8, natts=21) at heaptuple.c:1130
#1  0x00000000004535f0 in slot_getsomeattrs (slot=0xc325b8, attnum=21)
at heaptuple.c:1340
#2  0x0000000000543cc6 in ExecProject (projInfo=0xc44c98,
isDone=0x7fffe33f30a4) at execQual.c:5164
#3  0x00000000005528fb in ExecHashJoin (node=0xc3f130) at nodeHashjoin.c:282
#4  0x0000000000543368 in ExecProcNode (node=0xc3f130) at execProcnode.c:412
#5  0x0000000000552aa6 in ExecHashJoin (node=0xc3dc90) at nodeHashjoin.c:598
#6  0x0000000000543368 in ExecProcNode (node=0xc3dc90) at execProcnode.c:412
#7  0x0000000000552aa6 in ExecHashJoin (node=0xc37140) at nodeHashjoin.c:598
#8  0x0000000000543368 in ExecProcNode (node=0xc37140) at execProcnode.c:412
#9  0x0000000000557251 in ExecSort (node=0xc37030) at nodeSort.c:102
#10 0x0000000000543290 in ExecProcNode (node=0xc37030) at execProcnode.c:423
#11 0x0000000000555308 in ExecMergeJoin (node=0xc36220) at
nodeMergejoin.c:626
#12 0x0000000000543380 in ExecProcNode (node=0xc36220) at execProcnode.c:408
#13 0x0000000000557251 in ExecSort (node=0xc34000) at nodeSort.c:102
#14 0x0000000000543290 in ExecProcNode (node=0xc34000) at execProcnode.c:423
#15 0x0000000000540e92 in standard_ExecutorRun (queryDesc=0xbc1c10,
direction=ForwardScanDirection, count=0) at execMain.c:1504
#16 0x00000000005ecc27 in PortalRunSelect (portal=0xc30160,
forward=<value optimized out>, count=0, dest=0x7f7219b7f3e8)
    at pquery.c:953
#17 0x00000000005edfd9 in PortalRun (portal=0xc30160,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f7219b7f3e8,
    altdest=0x7f7219b7f3e8, completionTag=0x7fffe33f3620 "") at pquery.c:779
#18 0x00000000005e93a7 in exec_simple_query (
    query_string=0xb89a00 "SELECT w1.kivel, date_max(w1.mikor,w2.mikor),
w1.megnezes, u.* FROM valogatas_valasz w1, valogatas_valasz w2, useradat
u WHERE w1.ki=65549 AND not w1.del AND w2.kivel=65549 AND w1.megnezes=0
AND w1.ki"...) at postgres.c:991
#19 0x00000000005ea977 in PostgresMain (argc=4, argv=<value optimized
out>, username=0xaceb10 "randir") at postgres.c:3614
#20 0x00000000005bf2a8 in ServerLoop () at postmaster.c:3447
#21 0x00000000005c0037 in PostmasterMain (argc=3, argv=0xac9820) at
postmaster.c:1040
#22 0x0000000000569b48 in main (argc=3, argv=0xac9820) at main.c:188

(gdb) p debug_query_string
$1 = 0xb89a00 "SELECT w1.kivel, date_max(w1.mikor,w2.mikor),
w1.megnezes, u.* FROM valogatas_valasz w1, valogatas_valasz w2, useradat
u WHERE w1.ki=65549 AND not w1.del AND w2.kivel=65549 AND w1.megnezes=0
AND w1.ki"...


Thanks,

Daniel



Tom Lane wrote:
> Nagy Daniel <nagy.daniel@telekom.hu> writes:
>> (gdb) p debug_query_string
>> $1 = 12099472
>
> Huh, your stripped build is being quite unhelpful :-(.  I think
> "p (char *) debug_query_string" would have produced something useful.
>
>             regards, tom lane

Re: BUG #5238: frequent signal 11 segfaults

From: Nagy Daniel
I ran "select * from" on both tables. All rows were returned
successfully, no error logs were produced during the selects.

However there are usually many 23505 errors in indices, like:
Dec 13 10:02:13 goldbolt postgres[21949]: [26-1]
user=randirw,db=lovehunter ERROR:  23505: duplicate key value violates
unique constraint "kepek_eredeti_uid_meret_idx"
Dec 13 10:02:13 goldbolt postgres[21949]: [26-2]
user=randirw,db=lovehunter LOCATION:  _bt_check_unique, nbtinsert.c:301

There are many 58P01 errors as well, like:
Dec 13 10:05:18 goldbolt postgres[7931]: [23-1] user=munin,db=lovehunter
ERROR:  58P01: could not open segment 1 of relation base/16400/19856
(target block 3014766): No such file or directory
Dec 13 10:05:18 goldbolt postgres[7931]: [23-2] user=munin,db=lovehunter
LOCATION:  _mdfd_getseg, md.c:1572
Dec 13 10:05:18 goldbolt postgres[7931]: [23-3] user=munin,db=lovehunter
STATEMENT:  SELECT count(*) FROM users WHERE nem='t'

Reindexing sometimes helps, but the error logs appear again within
hours.
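
[As a sketch, the reindexing just mentioned would be roughly the
following; the index name is the one from the 23505 error above, and
the database-wide variant is an assumption:]

psql -d lovehunter -c "REINDEX INDEX kepek_eredeti_uid_meret_idx;"
psql -d lovehunter -c "REINDEX DATABASE lovehunter;"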

Recently a new error appeared:

Dec 13 03:46:55 goldbolt postgres[18628]: [15-1]
user=randir,db=lovehunter ERROR:  XX000: tuple offset out of range: 0
Dec 13 03:46:55 goldbolt postgres[18628]: [15-2]
user=randir,db=lovehunter LOCATION:  tbm_add_tuples, tidbitmap.c:286
Dec 13 03:46:55 goldbolt postgres[18628]: [15-3]
user=randir,db=lovehunter STATEMENT:  SELECT * FROM valogatas WHERE
uid!='16208' AND eletkor BETWEEN 39 AND 55 AND megyeid='1' AND
keresettnem='f' AND dom='iwiw.hu' AND appid='2001434963' AND nem='t'
ORDER BY random() DESC



If there is on-disk corruption, would a complete dump and
restore to another directory fix it?

Apart from that, I think that pg shouldn't crash in case of
on-disk corruption, but should log an error message instead.
I'm sure it's not as easy to implement as it seems,
but nothing is impossible :)


Regards,

Daniel


Tom Lane wrote:
> Nagy Daniel <nagy.daniel@telekom.hu> writes:
>> Here's a better backtrace:
>
> The crash location suggests a problem with a corrupted tuple, but it's
> impossible to guess where the tuple came from.  In particular I can't
> guess whether this reflects on-disk data corruption or some internal
> bug.  Now that you have (some of) the query, can you put together a test
> case?  Or try "select * from" each of the tables used in the query to
> check for on-disk corruption.
>
>             regards, tom lane

Re: BUG #5238: frequent signal 11 segfaults

From: Pavel Stehule
2009/12/13 Nagy Daniel <nagy.daniel@telekom.hu>:
> I ran "select * from" on both tables. All rows were returned
> successfully, no error logs were produced during the selects.
>
> However there are usually many 23505 errors in indices, like:
> Dec 13 10:02:13 goldbolt postgres[21949]: [26-1]
> user=randirw,db=lovehunter ERROR:  23505: duplicate key value violates
> unique constraint "kepek_eredeti_uid_meret_idx"
> Dec 13 10:02:13 goldbolt postgres[21949]: [26-2]
> user=randirw,db=lovehunter LOCATION:  _bt_check_unique, nbtinsert.c:301
>
> There are many 58P01 errors as well, like:
> Dec 13 10:05:18 goldbolt postgres[7931]: [23-1] user=munin,db=lovehunter
> ERROR:  58P01: could not open segment 1 of relation base/16400/19856
> (target block 3014766): No such file or directory
> Dec 13 10:05:18 goldbolt postgres[7931]: [23-2] user=munin,db=lovehunter
> LOCATION:  _mdfd_getseg, md.c:1572
> Dec 13 10:05:18 goldbolt postgres[7931]: [23-3] user=munin,db=lovehunter
> STATEMENT:  SELECT count(*) FROM users WHERE nem='t'
>
> Reindexing sometimes helps, but the error logs appear again within
> hours.
>

This could be a hardware problem. Please check your hardware; at a
minimum, run a memory test.

Regards
Pavel Stehule

> Recently a new error appeared:
>
> Dec 13 03:46:55 goldbolt postgres[18628]: [15-1]
> user=randir,db=lovehunter ERROR:  XX000: tuple offset out of range: 0
> Dec 13 03:46:55 goldbolt postgres[18628]: [15-2]
> user=randir,db=lovehunter LOCATION:  tbm_add_tuples, tidbitmap.c:286
> Dec 13 03:46:55 goldbolt postgres[18628]: [15-3]
> user=randir,db=lovehunter STATEMENT:  SELECT * FROM valogatas WHERE
> uid!='16208' AND eletkor BETWEEN 39 AND 55 AND megyeid='1' AND
> keresettnem='f' AND dom='iwiw.hu' AND appid='2001434963' AND nem='t'
> ORDER BY random() DESC
>
>
>
> If there is on-disk corruption, would a complete dump and
> restore to another directory fix it?
>
> Apart from that, I think that pg shouldn't crash in case of
> on-disk corruption, but should log an error message instead.
> I'm sure it's not as easy to implement as it seems,
> but nothing is impossible :)
>
>
> Regards,
>
> Daniel
>
>
> Tom Lane wrote:
>> Nagy Daniel <nagy.daniel@telekom.hu> writes:
>>> Here's a better backtrace:
>>
>> The crash location suggests a problem with a corrupted tuple, but it's
>> impossible to guess where the tuple came from.  In particular I can't
>> guess whether this reflects on-disk data corruption or some internal
>> bug.  Now that you have (some of) the query, can you put together a
>> test case?  Or try "select * from" each of the tables used in the
>> query to check for on-disk corruption.
>>
>>             regards, tom lane
>
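
[On Pavel's memory-test point: memtest86+ from a boot disk is the
thorough option; a lighter in-place check is possible with the
userspace memtester tool, sketched here with an arbitrary size and
pass count:]

memtester 2048M 5            # locks and tests 2 GB of RAM for 5 passes; run as root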

Re: BUG #5238: frequent signal 11 segfaults

From: Tom Lane
Nagy Daniel <nagy.daniel@telekom.hu> writes:
> I ran "select * from" on both tables. All rows were returned
> successfully, no error logs were produced during the selects.

Well, that would seem to eliminate the initial theory of on-disk
corruption, except that these *other* symptoms that you just mentioned
for the first time look a lot like index corruption.  I concur with
Pavel that intermittent hardware problems are looking more and more
likely.  Try a memory test first --- a patch of bad RAM could easily
produce symptoms like this.

> Apart from that, I think that pg shouldn't crash in case of
> on-disk corruption, but should log an error message instead.

There is very little that software can do to protect itself from
flaky hardware :-(

            regards, tom lane

Re: BUG #5238: frequent signal 11 segfaults

From: Nagy Daniel
I have pg segfaults on two boxes, a DL160G6 and a DL380g5.
I've just checked their memory with memtest86+ v2.11
No errors were detected.

We also monitor the boxes via IPMI, and there are no signs
of HW failures.

Regards,

Daniel


Tom Lane wrote:
> Nagy Daniel <nagy.daniel@telekom.hu> writes:
>> I ran "select * from" on both tables. All rows were returned
>> successfully, no error logs were produced during the selects.
>
> Well, that would seem to eliminate the initial theory of on-disk
> corruption, except that these *other* symptoms that you just mentioned
> for the first time look a lot like index corruption.  I concur with
> Pavel that intermittent hardware problems are looking more and more
> likely.  Try a memory test first --- a patch of bad RAM could easily
> produce symptoms like this.
>
>> Apart from that, I think that pg shouldn't crash in case of
>> on-disk corruption, but should log an error message instead.
>
> There is very little that software can do to protect itself from
> flaky hardware :-(
>
>             regards, tom lane

Re: BUG #5238: frequent signal 11 segfaults

From: Tom Lane
Nagy Daniel <nagy.daniel@telekom.hu> writes:
> I have pg segfaults on two boxes, a DL160G6 and a DL380g5.
> I've just checked their memory with memtest86+ v2.11
> No errors were detected.
> We also monitor the boxes via IPMI, and there are no signs
> of HW failures.

Hm.  Well, now that 8.4.2 is out, the first thing you ought to do is
update and see if this happens to be resolved by any of the recent
fixes.  (I'm not too optimistic about that, because it doesn't look
exactly like any of the known symptoms, but an update is certainly
worth your time in any case.)

If you still see it after that, please try to extract a reproducible
test case.

            regards, tom lane

Re: BUG #5238: frequent signal 11 segfaults

From: Nagy Daniel
I upgraded to 8.4.2 and did a full reindex and vacuum (there were
no errors), but it still segfaults:

Core was generated by `postgres: randir lovehunter 127.0.0.1(48268)
SELECT                           '.
Program terminated with signal 11, Segmentation fault.
[New process 7262]
#0  slot_deform_tuple (slot=0xc0d3b8, natts=20) at heaptuple.c:1130
1130                    off = att_align_pointer(off, thisatt->attalign, -1,
(gdb) bt
#0  slot_deform_tuple (slot=0xc0d3b8, natts=20) at heaptuple.c:1130
#1  0x0000000000453b9a in slot_getattr (slot=0xc0d3b8, attnum=20,
isnull=0x7fffe48130af "") at heaptuple.c:1253
#2  0x000000000054418c in ExecEvalNot (notclause=<value optimized out>,
econtext=0x1dda4503, isNull=0x7fffe48130af "",
    isDone=<value optimized out>) at execQual.c:2420
#3  0x000000000054466b in ExecQual (qual=<value optimized out>,
econtext=0xc17000, resultForNull=0 '\0') at execQual.c:4909
#4  0x000000000054ae55 in ExecScan (node=0xc16ef0, accessMtd=0x550b80
<BitmapHeapNext>) at execScan.c:131
#5  0x0000000000543da0 in ExecProcNode (node=0xc16ef0) at execProcnode.c:373
#6  0x0000000000553516 in ExecHashJoin (node=0xc15dd0) at nodeHashjoin.c:598
#7  0x0000000000543de8 in ExecProcNode (node=0xc15dd0) at execProcnode.c:412
#8  0x0000000000556367 in ExecNestLoop (node=0xc14cc0) at nodeNestloop.c:120
#9  0x0000000000543e18 in ExecProcNode (node=0xc14cc0) at execProcnode.c:404
#10 0x0000000000556367 in ExecNestLoop (node=0xc12ee0) at nodeNestloop.c:120
#11 0x0000000000543e18 in ExecProcNode (node=0xc12ee0) at execProcnode.c:404
#12 0x0000000000556367 in ExecNestLoop (node=0xc110e0) at nodeNestloop.c:120
#13 0x0000000000543e18 in ExecProcNode (node=0xc110e0) at execProcnode.c:404
#14 0x0000000000557cc1 in ExecSort (node=0xc0eec0) at nodeSort.c:102
#15 0x0000000000543d10 in ExecProcNode (node=0xc0eec0) at execProcnode.c:423
#16 0x00000000005418b2 in standard_ExecutorRun (queryDesc=0xbb95e0,
direction=ForwardScanDirection, count=0) at execMain.c:1504
#17 0x00000000005ed687 in PortalRunSelect (portal=0xbf7000,
forward=<value optimized out>, count=0, dest=0x7f863746adb8)
    at pquery.c:953
#18 0x00000000005eea39 in PortalRun (portal=0xbf7000,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f863746adb8,
    altdest=0x7f863746adb8, completionTag=0x7fffe4813650 "") at pquery.c:779
#19 0x00000000005e9e07 in exec_simple_query (
    query_string=0xb82e10 "SELECT * FROM valogatas WHERE uid!='64708'
AND eletkor BETWEEN 40 AND 52 AND megyeid='9' AND keresettnem='t' AND
dom='iwiw.hu' AND appid='2001434963' AND nem='f' ORDER BY random()
DESC") at postgres.c:991
#20 0x00000000005eb3d7 in PostgresMain (argc=4, argv=<value optimized
out>, username=0xacfe90 "randir") at postgres.c:3614
#21 0x00000000005bfe18 in ServerLoop () at postmaster.c:3449
#22 0x00000000005c0ba7 in PostmasterMain (argc=5, argv=0xacaa90) at
postmaster.c:1040
#23 0x000000000056a568 in main (argc=5, argv=0xacaa90) at main.c:188

(gdb) p (char *) debug_query_string
$1 = 0xb82e10 "SELECT * FROM valogatas WHERE uid!='64708' AND eletkor
BETWEEN 40 AND 52 AND megyeid='9' AND keresettnem='t' AND dom='iwiw.hu'
AND appid='2001434963' AND nem='f' ORDER BY random() DESC"

When I run this query manually, it works.
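
[That alone doesn't prove much: per the backtrace the crash happens
inside a particular plan shape (a bitmap heap scan under a hash join
and nested loops), and a manual run may pick a different plan.
Comparing plans would be one way to check, as a sketch:]

psql -d lovehunter -c "EXPLAIN SELECT * FROM valogatas WHERE uid!='64708'
  AND eletkor BETWEEN 40 AND 52 AND megyeid='9' AND keresettnem='t'
  AND dom='iwiw.hu' AND appid='2001434963' AND nem='f'
  ORDER BY random() DESC"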

Regards,

Daniel



Tom Lane wrote:
> Nagy Daniel <nagy.daniel@telekom.hu> writes:
>> I have pg segfaults on two boxes, a DL160G6 and a DL380g5.
>> I've just checked their memory with memtest86+ v2.11
>> No errors were detected.
>> We also monitor the boxes via IPMI, and there are no signs
>> of HW failures.
>
> Hm.  Well, now that 8.4.2 is out, the first thing you ought to do is
> update and see if this happens to be resolved by any of the recent
> fixes.  (I'm not too optimistic about that, because it doesn't look
> exactly like any of the known symptoms, but an update is certainly
> worth your time in any case.)
>
> If you still see it after that, please try to extract a reproducible
> test case.
>
>             regards, tom lane

Re: BUG #5238: frequent signal 11 segfaults

From: Nagy Daniel
I don't know if it's related, but we often have index problems
as well. When performing a full vacuum, many indexes contain
one more row version than their tables:

WARNING:  index "iwiw_start_top_napi_fast_idx" contains 10932 row
versions, but table contains 10931 row versions
WARNING:  index "iwiw_start_top_napi_fast_idx" contains 10932 row
versions, but table contains 10931 row versions
WARNING:  index "iwiw_jatekok_ertek_idx" contains 17 row versions, but
table contains 16 row versions
WARNING:  index "ujtema_nehezseg_idx" contains 696 row versions, but
table contains 695 row versions

etc...

Daniel



Tom Lane wrote:
> Nagy Daniel <nagy.daniel@telekom.hu> writes:
>> I have pg segfaults on two boxes, a DL160G6 and a DL380g5.
>> I've just checked their memory with memtest86+ v2.11
>> No errors were detected.
>> We also monitor the boxes via IPMI, and there are no signs
>> of HW failures.
>
> Hm.  Well, now that 8.4.2 is out, the first thing you ought to do is
> update and see if this happens to be resolved by any of the recent
> fixes.  (I'm not too optimistic about that, because it doesn't look
> exactly like any of the known symptoms, but an update is certainly
> worth your time in any case.)
>
> If you still see it after that, please try to extract a reproducible
> test case.
>
>             regards, tom lane

Re: BUG #5238: frequent signal 11 segfaults

From: Nagy Daniel
More info: we disabled autovacuum (we do vacuuming via cron)
and the segfaults seem to be gone.

Daniel


Tom Lane wrote:
> Nagy Daniel <nagy.daniel@telekom.hu> writes:
>> I have pg segfaults on two boxes, a DL160G6 and a DL380g5.
>> I've just checked their memory with memtest86+ v2.11
>> No errors were detected.
>> We also monitor the boxes via IPMI, and there are no signs
>> of HW failures.
>
> Hm.  Well, now that 8.4.2 is out, the first thing you ought to do is
> update and see if this happens to be resolved by any of the recent
> fixes.  (I'm not too optimistic about that, because it doesn't look
> exactly like any of the known symptoms, but an update is certainly
> worth your time in any case.)
>
> If you still see it after that, please try to extract a reproducible
> test case.
>
>             regards, tom lane
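
[For completeness, the arrangement Daniel describes would look roughly
like this; the schedule and vacuumdb options are assumptions:]

# postgresql.conf
autovacuum = off

# nightly crontab entry for the postgres user
30 3 * * * /usr/local/postgres-8.4.1/bin/vacuumdb --all --analyze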

Re: BUG #5238: frequent signal 11 segfaults

From: Alvaro Herrera
Nagy Daniel wrote:
> More info: we disabled autovacuum (we do vacuuming via cron)
> and the segfaults seem to be gone.

That's pretty weird if it means that autovacuum was crashing and your
cron-based vacuums are not.


--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: BUG #5238: frequent signal 11 segfaults

From: Tom Lane
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Nagy Daniel wrote:
>> More info: we disabled autovacuum (we do vacuuming via cron)
>> and the segfaults seem to be gone.

> That's pretty weird if it means that autovacuum was crashing and your
> cron-based vacuums are not.

We've seen crashes in the autovac control logic before ... except that
the backtrace said it was nowhere near there.

            regards, tom lane