Thread: 3rd time is a charm.....right sibling is not next child crash.
Not looking for help...just putting some data out there.

2 previous crashes caused by corrupt slony indexes:
http://archives.postgresql.org/pgsql-general/2010-02/msg00022.php
http://archives.postgresql.org/pgsql-general/2009-12/msg01172.php

New one yesterday:

Jun 7 15:05:01 db-1 postgres[9334]: [ID 748848 local0.crit] [3989781-1] 2010-06-07 15:05:01.087 CDT 9334 PANIC: right sibling 169 of block 168 is not next child of 249 in index "sl_seqlog_idx"

We are on the eve of switching off our SAN to some direct attached storage and upgrading postgres and slony in the process this weekend....so any thoughts that it might be hardware, driver, or even postgres/slony should be alleviated by the fact that everything is changing.

That being said, the fact that each time this has happened, it has been a slony index that has been corrupt, I find 'odd'. While I can't imagine a bug in slony corrupting postgres indexes...and I can't imagine a bug in postgres corrupting only slony indexes, I don't really know what to think. Just putting this out there in case anyone has similar issues or can use this data in some meaningful way.

Stack trace looks similar to last time.

Program terminated with signal 6, Aborted.
#0  0xfecba227 in _lwp_kill () from /lib/libc.so.1
(gdb) bt
#0  0xfecba227 in _lwp_kill () from /lib/libc.so.1
#1  0xfecb598f in thr_kill () from /lib/libc.so.1
#2  0xfec61ed3 in raise () from /lib/libc.so.1
#3  0xfec41d0d in abort () from /lib/libc.so.1
#4  0x0821b8a6 in errfinish (dummy=0) at elog.c:471
#5  0x0821c74b in elog_finish (elevel=22, fmt=0x82b7780 "right sibling %u of block %u is not next child of %u in index \"%s\"") at elog.c:964
#6  0x0809e1a0 in _bt_pagedel (rel=0x867bcd8, buf=139905, stack=0x86b3768, vacuum_full=0 '\0') at nbtpage.c:1141
#7  0x0809f835 in btvacuumscan (info=0x8043f70, stats=0x86b5c30, callback=0, callback_state=0x0, cycleid=29488) at nbtree.c:936
#8  0x0809fc65 in btbulkdelete (fcinfo=0x0) at nbtree.c:547
#9  0x0821f424 in FunctionCall4 (flinfo=0x0, arg1=0, arg2=0, arg3=0, arg4=0) at fmgr.c:1215
#10 0x0809a89f in index_bulk_delete (info=0x8043f70, stats=0x0, callback=0x812ffc8 <lazy_tid_reaped>, callback_state=0x86b5818) at indexam.c:573
#11 0x0812ff54 in lazy_vacuum_index (indrel=0x867bcd8, stats=0x86b5b70, vacrelstats=0x86b5818) at vacuumlazy.c:660
#12 0x0813055a in lazy_vacuum_rel (onerel=0x867b7f8, vacstmt=0x86659b8) at vacuumlazy.c:487
#13 0x0812e910 in vacuum_rel (relid=140925368, vacstmt=0x86659b8, expected_relkind=114 'r') at vacuum.c:1107
#14 0x0812f95a in vacuum (vacstmt=0x86659b8, relids=0x8665bc0) at vacuum.c:400
#15 0x08186e16 in AutoVacMain (argc=0, argv=0x0) at autovacuum.c:914
#16 0x08187278 in autovac_start () at autovacuum.c:178
#17 0x0818bfed in ServerLoop () at postmaster.c:1252
#18 0x0818d16d in PostmasterMain (argc=3, argv=0x833adc8) at postmaster.c:966
#19 0x08152cce in main (argc=3, argv=0x833adc8) at main.c:188
(gdb)
On Tue, Jun 8, 2010 at 9:26 AM, Jeff Amiel <becauseimjeff@yahoo.com> wrote:
> That being said, the fact that each time this has happened, it has been a
> slony index that has been corrupt, I find 'odd'. While I can't imagine a bug
> in slony corrupting postgres indexes...and I can't imagine a bug in postgres
> corrupting only slony indexes, I don't really know what to think. Just
> putting this out there in case anyone has similar issues or can use this
> data in some meaningful way.

Slony is just a client to postgres, nothing more, nothing less. Any index corruption is going to be the fault of the Postgres server or hardware. We don't know what version of postgres you're running, so nobody can say anything for sure.

I will say that we observed an index corruption (on a zero-row "status" table, so it grows and changes a lot during the course of a day) about a week ago. It was not a slony table. A quick reindex on that table and we were off and running. We noticed it because slony was trying to insert into the table and getting a read error. This has happened to us maybe one other time in the 10+ years this application has been running. I'm on Pg 8.3.7.
Jeff Amiel <becauseimjeff@yahoo.com> writes:
> New one yesterday.
> Jun 7 15:05:01 db-1 postgres[9334]: [ID 748848 local0.crit] [3989781-1]
> 2010-06-07 15:05:01.087 CDT 9334 PANIC: right sibling 169 of block 168 is
> not next child of 249 in index "sl_seqlog_idx"

In your original report you mentioned that the next autovacuum attempt on the same table succeeded without incident. Has that been true each time? I wonder whether this is some transient state, rather than actual corruption that you need to REINDEX to recover from.

			regards, tom lane
On 6/8/10 11:23 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> In your original report you mentioned that the next autovacuum attempt
> on the same table succeeded without incident. Has that been true each
> time? I wonder whether this is some transient state, rather than actual
> corruption that you need to REINDEX to recover from.

I didn't waste time during this event and reindexed as quickly as I could. I will note, however, that these indexes (specifically the one that generated the error) were in use (successfully) AFTER the event and BEFORE the reindex by the slony triggers (by virtue of inserts into the log tables and such), even though I stopped autovacuum and the slon daemon. Not reads, obviously...just inserts/updates.

If nobody else has seen/is seeing this, I will chalk this whole scenario up to oddball SAN issues (we've logged many a write/read error over the year(s) causing corruption on these heavily manipulated indexes). I'll be glad to move to my direct attached storage as quickly as possible.

On a side note, I am 100% sure that autovacuum was disabled when I brought the database back up after the core dump(s). However, minutes after restarting, some of my larger tables started getting vacuumed by the pgsql user. Any way that a vacuum would kick off for a particular table (or series of tables) even when autovacuum was off in the postgresql.conf? My only manual vacuum process is kicked off late at night, so this was not it.

Also....before I had a chance to disable the slon daemon, I also noticed other slony tables being vacuum analyzed (even though autovacuum was off). Does Slony manage its own vacuuming separate from postgres' autovacuum?
Jeff Amiel <jamiel@istreamimaging.com> writes:
> On a side note, I am 100% sure that autovacuum was disabled when I brought
> the database back up after the core dump(s). However, minutes after
> restarting, some of my larger tables started getting vacuumed by pgsql user.
> Any way that a vacuum would kick off for a particular table (or series of
> tables) even when autovacuum was off in the postgresql.conf?

Anti-transaction-wraparound vacuums, perhaps?

			regards, tom lane
On 6/8/10 12:56 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> Jeff Amiel <jamiel@istreamimaging.com> writes:
>> On a side note, I am 100% sure that autovacuum was disabled when I brought
>> the database back up after the core dump(s). However, minutes after
>> restarting, some of my larger tables started getting vacuumed by pgsql user.
>> Any way that a vacuum would kick off for a particular table (or series of
>> tables) even when autovacuum was off in the postgresql.conf?
>
> Anti-transaction-wraparound vacuums, perhaps?

I would hope not. :) This is postgres 8.2.X. Autovacuum has been enabled forever (seemingly with no errors). Anything I can look for? (I searched the logs for references to "must be vacuumed within" but found nothing....)

SELECT datname, age(datfrozenxid) FROM pg_database;

 postgres  | 178649703
 prod      | 204588751
 template1 | 178653277
 template0 | 178653131
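[For what it's worth, those ages can be sanity-checked against the stock 8.2 default autovacuum_freeze_max_age of 200 million, past which the server launches anti-wraparound vacuums even with autovacuum off. The threshold here is an assumption (check SHOW autovacuum_freeze_max_age on the actual server); a minimal check:]

```python
# Compare the age(datfrozenxid) numbers above against the threshold at
# which 8.2 forces anti-wraparound vacuums even with autovacuum = off.
# 200 million is the stock autovacuum_freeze_max_age default; it is an
# assumption here, not verified against this server's config.
AUTOVACUUM_FREEZE_MAX_AGE = 200000000

ages = {
    "postgres": 178649703,
    "prod": 204588751,
    "template1": 178653277,
    "template0": 178653131,
}

for datname, age in sorted(ages.items()):
    over = age > AUTOVACUUM_FREEZE_MAX_AGE
    print("%-10s %12d  %s" % (datname, age,
          "past threshold: forced vacuum expected" if over else "ok"))
```

[By that reckoning "prod" is already past the threshold, which would fit Tom's guess. If memory serves, the "must be vacuumed within" warnings only start much nearer the actual wraparound point (age around 2 billion), so their absence from the logs wouldn't rule this out.]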
On Tue, Jun 8, 2010 at 12:49 PM, Jeff Amiel <jamiel@istreamimaging.com> wrote:
> ....Does Slony manage its own vacuuming separate from postgres' autovacuum?

Yes it does: http://www.slony.info/documentation/maintenance.html

--
Jaime Casanova
www.2ndQuadrant.com
Soporte y capacitación de PostgreSQL
On 6/8/10 1:15 PM, "Jaime Casanova" <jaime@2ndquadrant.com> wrote:
> On Tue, Jun 8, 2010 at 12:49 PM, Jeff Amiel <jamiel@istreamimaging.com> wrote:
>> ....Does Slony manage its own vacuuming separate from postgres' autovacuum?
>
> Yes it does: http://www.slony.info/documentation/maintenance.html

"It seems preferable to configure autovacuum to avoid vacuum Slony-I-managed configuration tables."

Hmmm....I don't do this. Surely this is not related to my corrupt indexes....2 attempted vacuums on the same indexes?
Excerpts from Jeff Amiel's message of mar jun 08 14:19:02 -0400 2010:
> "It seems preferable to configure autovacuum to avoid vacuum
> Slony-I-managed configuration tables."
>
> Hmmm....I don't do this.
> Surely this is not related to my corrupt indexes....2 attempted vacuums on
> the same indexes?

Pretty unlikely.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Excerpts from Jeff Amiel's message of mar jun 08 09:26:25 -0400 2010:
> Not looking for help...just putting some data out there.
>
> 2 previous crashes caused by corrupt slony indexes
>
> http://archives.postgresql.org/pgsql-general/2010-02/msg00022.php
> http://archives.postgresql.org/pgsql-general/2009-12/msg01172.php
>
> New one yesterday.
>
> Jun 7 15:05:01 db-1 postgres[9334]: [ID 748848 local0.crit] [3989781-1]
> 2010-06-07 15:05:01.087 CDT 9334 PANIC: right sibling 169 of block 168 is
> not next child of 249 in index "sl_seqlog_idx"

I've seen this problem (and others) in a high-load environment. Not Slony related though.

I wrote a small tool to check btree index files for consistency problems such as this one, by parsing pg_filedump output. I've seen strange things such as index pointers pointing to pages that shouldn't have been pointed to; mismatching sibling pointers; and others.

Do you have a copy of the broken index file?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
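[The sibling-pointer half of such a checker can be sketched in a few lines of Python. This is a hypothetical reconstruction, not Álvaro's actual tool, and it assumes pg_filedump's "Block N" headers and its btree special-section line "Blocks: Previous (x)  Next (y) ..."; adjust the regexes if your pg_filedump version prints these differently:]

```python
import re
from collections import OrderedDict

# Cross-check btree sibling pointers from `pg_filedump -i <indexfile>`
# output: the page that a page's "Next" pointer names should name this
# page back as its "Previous". Mismatches are the "mismatching sibling
# pointers" class of corruption.
BLOCK_RE = re.compile(r'^Block\s+(\d+)')
SIBLING_RE = re.compile(r'Blocks:\s*Previous\s*\((\d+)\)\s*Next\s*\((\d+)\)')

def parse_siblings(lines):
    """Return {block: (prev, next)} parsed from pg_filedump output lines."""
    siblings = OrderedDict()
    current = None
    for line in lines:
        m = BLOCK_RE.match(line)
        if m:
            current = int(m.group(1))
            continue
        m = SIBLING_RE.search(line)
        if m and current is not None:
            siblings[current] = (int(m.group(1)), int(m.group(2)))
    return siblings

def check_siblings(siblings):
    """Yield a message for each left/right sibling-pointer mismatch."""
    for blk, (_prev, nxt) in siblings.items():
        if nxt == 0 or nxt not in siblings:
            continue  # Next (0) means rightmost page at its level
        back_ptr = siblings[nxt][0]
        if back_ptr != blk:
            yield ("block %d says its right sibling is %d, but block %d "
                   "says its left sibling is %d" % (blk, nxt, nxt, back_ptr))
```

[Note this only walks the left/right chain; the actual panic here is a parent/child mismatch in _bt_pagedel, and checking that would also require parsing the downlinks out of the internal pages.]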
On 6/8/10 2:03 PM, "Alvaro Herrera" <alvherre@commandprompt.com> wrote:
> I've seen this problem (and others) in a high-load environment. Not
> Slony related though.
>
> I wrote a small tool to check btree index files for consistency problems
> such as this one, by parsing pg_filedump output. I've seen strange
> things such as index pointers pointing to pages that shouldn't have been
> pointed to; mismatching sibling pointers; and others.
>
> Do you have a copy of the broken index file?

Alas, no. But I have another 5 days to reproduce the problem before changing hardware/versions (which will probably bring a whole new set of 'opportunities' to deal with). If it happens again, I will snag the index file for analysis. Any way you can share your tool source so I can proactively look for issues in my indexes?
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Excerpts from Jeff Amiel's message of mar jun 08 09:26:25 -0400 2010:
>> Jun 7 15:05:01 db-1 postgres[9334]: [ID 748848 local0.crit] [3989781-1]
>> 2010-06-07 15:05:01.087 CDT 9334 PANIC: right sibling 169 of block 168 is
>> not next child of 249 in index "sl_seqlog_idx"

> I've seen this problem (and others) in a high-load environment. Not
> Slony related though.

> I wrote a small tool to check btree index files for consistency problems
> such as this one, by parsing pg_filedump output. I've seen strange
> things such as index pointers pointing to pages that shouldn't have been
> pointed to; mismatching sibling pointers; and others.

I spent some more time today looking for possible causes of this (within our code, that is; the obvious elephant in the room is the possibility of the disk storage system losing an update). I didn't find anything, but it did occur to me that there is a simple way to ameliorate this problem: we could rearrange the code in _bt_pagedel() so it checks for this case before entering its critical section. Then, corruption of this kind is at least only an ERROR, not a PANIC. Any objections to doing that?

			regards, tom lane