Thread: BUG #5238: frequent signal 11 segfaults
The following bug has been logged online:

Bug reference:      5238
Logged by:          Daniel Nagy
Email address:      nagy.daniel@telekom.hu
PostgreSQL version: 8.4.1
Operating system:   Debian Lenny 5.0.3 x86_64. Kernel: 2.6.31.6-grsec
Description:        frequent signal 11 segfaults
Details:

I get postgres segfaults several times a day.

Postgres log:

Dec 9 21:15:07 goldbolt postgres[4515]: [292-1] user=,db= LOG: 00000: server process (PID 8354) was terminated by signal 11: Segmentation fault
Dec 9 21:15:07 goldbolt postgres[4515]: [292-2] user=,db= LOCATION: LogChildExit, postmaster.c:2725
Dec 9 21:15:07 goldbolt postgres[4515]: [293-1] user=,db= LOG: 00000: terminating any other active server processes
Dec 9 21:15:07 goldbolt postgres[4515]: [293-2] user=,db= LOCATION: HandleChildCrash, postmaster.c:2552

dmesg output:

postmaster[8354]: segfault at 7fbfbde42ee2 ip 00000000004534d0 sp 00007fff4b220f90 error 4 in postgres[400000+446000]
grsec: Segmentation fault occurred at 00007fbfbde42ee2 in /usr/local/postgres-8.4.1/bin/postgres[postmaster:8354] uid/euid:111/111 gid/egid:114/114, parent /usr/local/postgres-8.4.1/bin/postgres[postmaster:4515] uid/euid:111/111 gid/egid:114/114

Notes:
- Postgres was built from source with --enable-thread-safety
- I have tried several kernels, with no luck
- I run PostgreSQL on different hardware as well, and the same problem happens there too
- There are no signs of hardware (memory, disk) errors
- No other daemons (apache, nginx) segfault, only postgres

The binaries are not stripped; how can I help find the cause?

Thanks a lot,

Daniel
"Daniel Nagy" <nagy.daniel@telekom.hu> writes: > The binaries are not stripped, how can I help finding the cause? Get a stack trace from the core dump using gdb. regards, tom lane
On 10/12/2009 5:12 AM, Daniel Nagy wrote:
> The following bug has been logged online:
>
> Bug reference:      5238
> Logged by:          Daniel Nagy
> Email address:      nagy.daniel@telekom.hu
> PostgreSQL version: 8.4.1
> Operating system:   Debian Lenny 5.0.3 x86_64. Kernel: 2.6.31.6-grsec
> Description:        frequent signal 11 segfaults
> Details:
>
> I get postgres segfaults several times a day.

http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

--
Craig Ringer
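
For the attach-to-a-live-backend variant that page describes, the rough outline is (the PID here is a placeholder, and you need a user allowed to ptrace the backend):

    -- in the psql session you want to watch
    SELECT pg_backend_pid();

    # then attach from a shell
    gdb -p <backend pid>
    (gdb) continue     # let the backend run until it hits the SIGSEGV
    (gdb) bt           # once gdb reports the fault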
Nagy Daniel <nagy.daniel@telekom.hu> writes:
> (gdb) backtrace
> #0  0x0000000000453415 in slot_deform_tuple ()
> #1  0x000000000045383a in slot_getattr ()
> #2  0x0000000000550dac in ExecHashGetHashValue ()
> #3  0x0000000000552a98 in ExecHashJoin ()
> #4  0x0000000000543368 in ExecProcNode ()
> #5  0x0000000000552aa6 in ExecHashJoin ()
> #6  0x0000000000543368 in ExecProcNode ()

Not terribly informative (these binaries are apparently not as
un-stripped as you thought).  However, this suggests it's a specific
query going wrong --- "p debug_query_string" in gdb might tell you what.
Please see if you can extract a test case.

			regards, tom lane
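
A quick way to see whether a binary actually carries debug information, as opposed to just a symbol table, is to look for .debug sections:

    file /usr/local/postgres-8.4.1/bin/postgres
    # "not stripped" here only means the symbol table is present
    readelf -S /usr/local/postgres-8.4.1/bin/postgres | grep -i debug
    # empty output means the build had no -g debug info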
Nagy Daniel <nagy.daniel@telekom.hu> writes:
> (gdb) p debug_query_string
> $1 = 12099472

Huh, your stripped build is being quite unhelpful :-(.  I think
"p (char *) debug_query_string" would have produced something useful.

			regards, tom lane
Hi Guys,

Here you are:

nagyd@goldbolt:~$ gdb /usr/local/pgsql/bin/postgres core
GNU gdb 6.8-debian
...
warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib/libnss_compat.so.2...done.
Loaded symbols for /lib/libnss_compat.so.2
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libnss_nis.so.2...done.
Loaded symbols for /lib/libnss_nis.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Core was generated by `postgres: randir lovehunter 127.0.0.1(33247) SELECT '.
Program terminated with signal 11, Segmentation fault.
[New process 11764]
#0  0x0000000000453415 in slot_deform_tuple ()
(gdb)
(gdb) backtrace
#0  0x0000000000453415 in slot_deform_tuple ()
#1  0x000000000045383a in slot_getattr ()
#2  0x0000000000550dac in ExecHashGetHashValue ()
#3  0x0000000000552a98 in ExecHashJoin ()
#4  0x0000000000543368 in ExecProcNode ()
#5  0x0000000000552aa6 in ExecHashJoin ()
#6  0x0000000000543368 in ExecProcNode ()
#7  0x0000000000552aa6 in ExecHashJoin ()
#8  0x0000000000543368 in ExecProcNode ()
#9  0x0000000000557251 in ExecSort ()
#10 0x0000000000543290 in ExecProcNode ()
#11 0x0000000000555308 in ExecMergeJoin ()
#12 0x0000000000543380 in ExecProcNode ()
#13 0x0000000000557251 in ExecSort ()
#14 0x0000000000543290 in ExecProcNode ()
#15 0x0000000000540e92 in standard_ExecutorRun ()
#16 0x00000000005ecc27 in PortalRunSelect ()
#17 0x00000000005edfd9 in PortalRun ()
#18 0x00000000005e93a7 in exec_simple_query ()
#19 0x00000000005ea977 in PostgresMain ()
#20 0x00000000005bf2a8 in ServerLoop ()
#21 0x00000000005c0037 in PostmasterMain ()
#22 0x0000000000569b48 in main ()
Current language: auto; currently asm

Thanks,

Daniel

Craig Ringer wrote:
> http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
(gdb) p debug_query_string
$1 = 12099472

I have now recompiled pg with --enable-debug and am waiting for a new
core dump. I'll post the backtrace and the debug_query_string output
ASAP.

Please let me know if there is anything more I can do.

Thanks,

Daniel

Tom Lane wrote:
> Not terribly informative (these binaries are apparently not as
> un-stripped as you thought).  However, this suggests it's a specific
> query going wrong --- "p debug_query_string" in gdb might tell you what.
> Please see if you can extract a test case.
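
For reference, a from-source build with full debug information might look like the following; the -O0 is optional but keeps gdb from reporting <value optimized out> everywhere, and flags other than --enable-debug are just carried over from the original build (prefix as seen in the earlier paths):

    ./configure --prefix=/usr/local/postgres-8.4.1 --enable-thread-safety --enable-debug CFLAGS="-O0 -g"
    make
    make install
    # then restart the cluster on the new binaries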
Nagy Daniel <nagy.daniel@telekom.hu> writes:
> Here's a better backtrace:

The crash location suggests a problem with a corrupted tuple, but it's
impossible to guess where the tuple came from.  In particular I can't
guess whether this reflects on-disk data corruption or some internal
bug.  Now that you have (some of) the query, can you put together a test
case?  Or try "select * from" each of the tables used in the query to
check for on-disk corruption.

			regards, tom lane
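
A simple way to force full scans of the tables involved (table and database names taken from the query and the core file; the output itself doesn't matter, the point is to read every heap tuple):

    psql -d lovehunter -c "SELECT * FROM valogatas_valasz" > /dev/null
    psql -d lovehunter -c "SELECT * FROM useradat"         > /dev/null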
Here's a better backtrace:

(gdb) bt
#0  slot_deform_tuple (slot=0xc325b8, natts=21) at heaptuple.c:1130
#1  0x00000000004535f0 in slot_getsomeattrs (slot=0xc325b8, attnum=21) at heaptuple.c:1340
#2  0x0000000000543cc6 in ExecProject (projInfo=0xc44c98, isDone=0x7fffe33f30a4) at execQual.c:5164
#3  0x00000000005528fb in ExecHashJoin (node=0xc3f130) at nodeHashjoin.c:282
#4  0x0000000000543368 in ExecProcNode (node=0xc3f130) at execProcnode.c:412
#5  0x0000000000552aa6 in ExecHashJoin (node=0xc3dc90) at nodeHashjoin.c:598
#6  0x0000000000543368 in ExecProcNode (node=0xc3dc90) at execProcnode.c:412
#7  0x0000000000552aa6 in ExecHashJoin (node=0xc37140) at nodeHashjoin.c:598
#8  0x0000000000543368 in ExecProcNode (node=0xc37140) at execProcnode.c:412
#9  0x0000000000557251 in ExecSort (node=0xc37030) at nodeSort.c:102
#10 0x0000000000543290 in ExecProcNode (node=0xc37030) at execProcnode.c:423
#11 0x0000000000555308 in ExecMergeJoin (node=0xc36220) at nodeMergejoin.c:626
#12 0x0000000000543380 in ExecProcNode (node=0xc36220) at execProcnode.c:408
#13 0x0000000000557251 in ExecSort (node=0xc34000) at nodeSort.c:102
#14 0x0000000000543290 in ExecProcNode (node=0xc34000) at execProcnode.c:423
#15 0x0000000000540e92 in standard_ExecutorRun (queryDesc=0xbc1c10, direction=ForwardScanDirection, count=0) at execMain.c:1504
#16 0x00000000005ecc27 in PortalRunSelect (portal=0xc30160, forward=<value optimized out>, count=0, dest=0x7f7219b7f3e8) at pquery.c:953
#17 0x00000000005edfd9 in PortalRun (portal=0xc30160, count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f7219b7f3e8, altdest=0x7f7219b7f3e8, completionTag=0x7fffe33f3620 "") at pquery.c:779
#18 0x00000000005e93a7 in exec_simple_query (
    query_string=0xb89a00 "SELECT w1.kivel, date_max(w1.mikor,w2.mikor), w1.megnezes, u.* FROM valogatas_valasz w1, valogatas_valasz w2, useradat u WHERE w1.ki=65549 AND not w1.del AND w2.kivel=65549 AND w1.megnezes=0 AND w1.ki"...) at postgres.c:991
#19 0x00000000005ea977 in PostgresMain (argc=4, argv=<value optimized out>, username=0xaceb10 "randir") at postgres.c:3614
#20 0x00000000005bf2a8 in ServerLoop () at postmaster.c:3447
#21 0x00000000005c0037 in PostmasterMain (argc=3, argv=0xac9820) at postmaster.c:1040
#22 0x0000000000569b48 in main (argc=3, argv=0xac9820) at main.c:188

(gdb) p debug_query_string
$1 = 0xb89a00 "SELECT w1.kivel, date_max(w1.mikor,w2.mikor), w1.megnezes, u.* FROM valogatas_valasz w1, valogatas_valasz w2, useradat u WHERE w1.ki=65549 AND not w1.del AND w2.kivel=65549 AND w1.megnezes=0 AND w1.ki"...

Thanks,

Daniel

Tom Lane wrote:
> Huh, your stripped build is being quite unhelpful :-(.  I think
> "p (char *) debug_query_string" would have produced something useful.
I ran "select * from" on both tables. All rows were returned successfully, no error logs were produced during the selects. However there are usually many 23505 errors in indices, like: Dec 13 10:02:13 goldbolt postgres[21949]: [26-1] user=randirw,db=lovehunter ERROR: 23505: duplicate key value violates unique constraint "kepek_eredeti_uid_meret_idx" Dec 13 10:02:13 goldbolt postgres[21949]: [26-2] user=randirw,db=lovehunter LOCATION: _bt_check_unique, nbtinsert.c:301 There are many 58P01 errors as well, like: Dec 13 10:05:18 goldbolt postgres[7931]: [23-1] user=munin,db=lovehunter ERROR: 58P01: could not open segment 1 of relation base/16 400/19856 (target block 3014766): No such file or directory Dec 13 10:05:18 goldbolt postgres[7931]: [23-2] user=munin,db=lovehunter LOCATION: _mdfd_getseg, md.c:1572 Dec 13 10:05:18 goldbolt postgres[7931]: [23-3] user=munin,db=lovehunter STATEMENT: SELECT count(*) FROM users WHERE nem='t' Reindexing sometimes helps, but the error logs appear again within hours. Recently a new error appeared: Dec 13 03:46:55 goldbolt postgres[18628]: [15-1] user=randir,db=lovehunter ERROR: XX000: tuple offset out of range: 0 Dec 13 03:46:55 goldbolt postgres[18628]: [15-2] user=randir,db=lovehunter LOCATION: tbm_add_tuples, tidbitmap.c:286 Dec 13 03:46:55 goldbolt postgres[18628]: [15-3] user=randir,db=lovehunter STATEMENT: SELECT * FROM valogatas WHERE uid!='16208' AND eletkor BETWEEN 39 AND 55 AND megyeid='1' AND keresettnem='f' AND dom='iwiw.hu' AND appid='2001434963' AND nem='t' ORDER BY random() DESC If there is on-disk corruption, would a complete dump and restore to an other directory fix it? Apart from that, I think that pg shouldn't crash in case of on-disk corruptions, but log an error message instead. I'm sure that it's not that easy to implement as it seems, but nothing is impossible :) Regards, Daniel Tom Lane wrote: > Nagy Daniel <nagy.daniel@telekom.hu> writes: >> Here's a better backtrace: > > The crash location suggests a problem with a corrupted tuple, but it's > impossible to guess where the tuple came from. In particular I can't > guess whether this reflects on-disk data corruption or some internal > bug. Now that you have (some of) the query, can you put together a test > case? Or try "select * from" each of the tables used in the query to > check for on-disk corruption. > > regards, tom lane
2009/12/13 Nagy Daniel <nagy.daniel@telekom.hu>:
> I ran "select * from" on both tables. All rows were returned
> successfully, and no error logs were produced during the selects.
>
> However, there are usually many 23505 errors on the indexes, like:
>
> Dec 13 10:02:13 goldbolt postgres[21949]: [26-1] user=randirw,db=lovehunter ERROR: 23505: duplicate key value violates unique constraint "kepek_eredeti_uid_meret_idx"
> Dec 13 10:02:13 goldbolt postgres[21949]: [26-2] user=randirw,db=lovehunter LOCATION: _bt_check_unique, nbtinsert.c:301
>
> There are many 58P01 errors as well, like:
>
> Dec 13 10:05:18 goldbolt postgres[7931]: [23-1] user=munin,db=lovehunter ERROR: 58P01: could not open segment 1 of relation base/16400/19856 (target block 3014766): No such file or directory
> Dec 13 10:05:18 goldbolt postgres[7931]: [23-2] user=munin,db=lovehunter LOCATION: _mdfd_getseg, md.c:1572
> Dec 13 10:05:18 goldbolt postgres[7931]: [23-3] user=munin,db=lovehunter STATEMENT: SELECT count(*) FROM users WHERE nem='t'
>
> Reindexing sometimes helps, but the errors reappear in the logs within
> hours.

You may have a hardware problem. Please check your hardware; at a
minimum, run a memory test.

Regards

Pavel Stehule
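
Besides an offline memtest86+ run, a userspace pass can be done without rebooting, assuming the memtester and smartmontools packages are installed (the size, loop count, and device name below are arbitrary examples):

    memtester 2048M 3      # locks 2 GB and runs 3 passes of pattern tests
    smartctl -a /dev/sda   # quick look at disk health while at it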
Nagy Daniel <nagy.daniel@telekom.hu> writes:
> I ran "select * from" on both tables. All rows were returned
> successfully, no error logs were produced during the selects.

Well, that would seem to eliminate the initial theory of on-disk
corruption, except that these *other* symptoms that you just mentioned
for the first time look a lot like index corruption.  I concur with
Pavel that intermittent hardware problems are looking more and more
likely.  Try a memory test first --- a patch of bad RAM could easily
produce symptoms like this.

> Apart from that, I think that pg shouldn't crash in case of
> on-disk corruptions, but log an error message instead.

There is very little that software can do to protect itself from
flaky hardware :-(

			regards, tom lane
I have pg segfaults on two boxes, a DL160G6 and a DL380g5.
I've just checked their memory with memtest86+ v2.11.
No errors were detected.

We also monitor the boxes via IPMI, and there are no signs
of HW failures.

Regards,

Daniel

Tom Lane wrote:
> I concur with Pavel that intermittent hardware problems are looking
> more and more likely.  Try a memory test first --- a patch of bad RAM
> could easily produce symptoms like this.
Nagy Daniel <nagy.daniel@telekom.hu> writes:
> I have pg segfaults on two boxes, a DL160G6 and a DL380g5.
> I've just checked their memory with memtest86+ v2.11.
> No errors were detected.
> We also monitor the boxes via IPMI, and there are no signs
> of HW failures.

Hm.  Well, now that 8.4.2 is out, the first thing you ought to do is
update and see if this happens to be resolved by any of the recent
fixes.  (I'm not too optimistic about that, because it doesn't look
exactly like any of the known symptoms, but an update is certainly
worth your time in any case.)

If you still see it after that, please try to extract a reproducible
test case.

			regards, tom lane
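
Since this install was built from source, a minor-version update is normally just a rebuild of the new tarball plus a restart; no dump/reload is needed within 8.4.x. Roughly (the prefix follows the paths seen earlier, the data directory path is illustrative):

    tar xzf postgresql-8.4.2.tar.gz && cd postgresql-8.4.2
    ./configure --prefix=/usr/local/postgres-8.4.2 --enable-thread-safety --enable-debug
    make && make install
    pg_ctl -D /path/to/data stop -m fast
    # point the init script / PATH at the 8.4.2 binaries, then:
    pg_ctl -D /path/to/data start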
I upgraded to 8.4.2 and did a full reindex and vacuum (there were no
errors). But it segfaults as well:

Core was generated by `postgres: randir lovehunter 127.0.0.1(48268) SELECT '.
Program terminated with signal 11, Segmentation fault.
[New process 7262]
#0  slot_deform_tuple (slot=0xc0d3b8, natts=20) at heaptuple.c:1130
1130                    off = att_align_pointer(off, thisatt->attalign, -1,
(gdb) bt
#0  slot_deform_tuple (slot=0xc0d3b8, natts=20) at heaptuple.c:1130
#1  0x0000000000453b9a in slot_getattr (slot=0xc0d3b8, attnum=20, isnull=0x7fffe48130af "") at heaptuple.c:1253
#2  0x000000000054418c in ExecEvalNot (notclause=<value optimized out>, econtext=0x1dda4503, isNull=0x7fffe48130af "", isDone=<value optimized out>) at execQual.c:2420
#3  0x000000000054466b in ExecQual (qual=<value optimized out>, econtext=0xc17000, resultForNull=0 '\0') at execQual.c:4909
#4  0x000000000054ae55 in ExecScan (node=0xc16ef0, accessMtd=0x550b80 <BitmapHeapNext>) at execScan.c:131
#5  0x0000000000543da0 in ExecProcNode (node=0xc16ef0) at execProcnode.c:373
#6  0x0000000000553516 in ExecHashJoin (node=0xc15dd0) at nodeHashjoin.c:598
#7  0x0000000000543de8 in ExecProcNode (node=0xc15dd0) at execProcnode.c:412
#8  0x0000000000556367 in ExecNestLoop (node=0xc14cc0) at nodeNestloop.c:120
#9  0x0000000000543e18 in ExecProcNode (node=0xc14cc0) at execProcnode.c:404
#10 0x0000000000556367 in ExecNestLoop (node=0xc12ee0) at nodeNestloop.c:120
#11 0x0000000000543e18 in ExecProcNode (node=0xc12ee0) at execProcnode.c:404
#12 0x0000000000556367 in ExecNestLoop (node=0xc110e0) at nodeNestloop.c:120
#13 0x0000000000543e18 in ExecProcNode (node=0xc110e0) at execProcnode.c:404
#14 0x0000000000557cc1 in ExecSort (node=0xc0eec0) at nodeSort.c:102
#15 0x0000000000543d10 in ExecProcNode (node=0xc0eec0) at execProcnode.c:423
#16 0x00000000005418b2 in standard_ExecutorRun (queryDesc=0xbb95e0, direction=ForwardScanDirection, count=0) at execMain.c:1504
#17 0x00000000005ed687 in PortalRunSelect (portal=0xbf7000, forward=<value optimized out>, count=0, dest=0x7f863746adb8) at pquery.c:953
#18 0x00000000005eea39 in PortalRun (portal=0xbf7000, count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f863746adb8, altdest=0x7f863746adb8, completionTag=0x7fffe4813650 "") at pquery.c:779
#19 0x00000000005e9e07 in exec_simple_query (
    query_string=0xb82e10 "SELECT * FROM valogatas WHERE uid!='64708' AND eletkor BETWEEN 40 AND 52 AND megyeid='9' AND keresettnem='t' AND dom='iwiw.hu' AND appid='2001434963' AND nem='f' ORDER BY random() DESC") at postgres.c:991
#20 0x00000000005eb3d7 in PostgresMain (argc=4, argv=<value optimized out>, username=0xacfe90 "randir") at postgres.c:3614
#21 0x00000000005bfe18 in ServerLoop () at postmaster.c:3449
#22 0x00000000005c0ba7 in PostmasterMain (argc=5, argv=0xacaa90) at postmaster.c:1040
#23 0x000000000056a568 in main (argc=5, argv=0xacaa90) at main.c:188

(gdb) p (char *) debug_query_string
$1 = 0xb82e10 "SELECT * FROM valogatas WHERE uid!='64708' AND eletkor BETWEEN 40 AND 52 AND megyeid='9' AND keresettnem='t' AND dom='iwiw.hu' AND appid='2001434963' AND nem='f' ORDER BY random() DESC"

When I run this query manually, it works.

Regards,

Daniel

Tom Lane wrote:
> Hm.  Well, now that 8.4.2 is out, the first thing you ought to do is
> update and see if this happens to be resolved by any of the recent
> fixes.  (I'm not too optimistic about that, because it doesn't look
> exactly like any of the known symptoms, but an update is certainly
> worth your time in any case.)
>
> If you still see it after that, please try to extract a reproducible
> test case.
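
(With a debug build like this, the next core can also be used to inspect the tuple that slot_deform_tuple is chewing on, along these lines; the struct field names are as in the 8.4 sources, so adjust if gdb complains:)

    (gdb) frame 0
    (gdb) p *slot
    (gdb) p *slot->tts_tuple
    (gdb) x/64xb slot->tts_tuple->t_data     # raw bytes of the tuple header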
I don't know if it's related, but we often have index problems as well.
When performing a full vacuum, many indexes contain one more row version
than their tables:

WARNING: index "iwiw_start_top_napi_fast_idx" contains 10932 row versions, but table contains 10931 row versions
WARNING: index "iwiw_start_top_napi_fast_idx" contains 10932 row versions, but table contains 10931 row versions
WARNING: index "iwiw_jatekok_ertek_idx" contains 17 row versions, but table contains 16 row versions
WARNING: index "ujtema_nehezseg_idx" contains 696 row versions, but table contains 695 row versions
etc...

Daniel

Tom Lane wrote:
> If you still see it after that, please try to extract a reproducible
> test case.
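
For warnings like those, rebuilding just the named indexes gets them back in sync with the heap, though it doesn't explain how they diverged in the first place. For example (index names taken from the warnings above):

    reindexdb -d lovehunter -i iwiw_start_top_napi_fast_idx
    reindexdb -d lovehunter -i iwiw_jatekok_ertek_idx
    reindexdb -d lovehunter -i ujtema_nehezseg_idx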
More info: we disabled autovacuum (we do vacuuming via cron)
and the segfaults seem to be gone.

Daniel

Tom Lane wrote:
> If you still see it after that, please try to extract a reproducible
> test case.
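
(A cron-only vacuum setup like that amounts to roughly the following; the schedule, flags, and paths here are only an example:)

    # postgresql.conf, followed by a reload
    autovacuum = off

    # crontab for the postgres user
    0 3 * * *  /usr/local/postgres-8.4.2/bin/vacuumdb --all --analyze --quiet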
Nagy Daniel wrote:
> More info: we disabled autovacuum (we do vacuuming via cron)
> and the segfaults seem to be gone.

That's pretty weird if it means that autovacuum was crashing and your
cron-based vacuums are not.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Nagy Daniel wrote:
>> More info: we disabled autovacuum (we do vacuuming via cron)
>> and the segfaults seem to be gone.

> That's pretty weird if it means that autovacuum was crashing and your
> cron-based vacuums are not.

We've seen crashes in the autovac control logic before ... except that
the backtrace said it was nowhere near there.

			regards, tom lane