Thread: Index corruption
We have now experienced index corruption on two separate but identical slony clusters. In each case the slony subscriber failed after attempting to insert a duplicate record. In each case reindexing the sl_log_1 table on the provider fixed the problem.

The latest occurrence was on our production cluster yesterday. This has only happened since we performed kernel upgrades, and we are uncertain whether this represents a kernel bug, or a postgres bug exposed by different timings in the new kernel.

Our systems are:

Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
kernel boot option: elevator=deadline
16 Gigs of RAM
postgresql-8.0.3-1PGDG
Bonded e1000/tg3 NICs with 8192 MTU.
Slony 1.1.0

NetApp FAS270 OnTap 7.0.3
Mounted with the NFS options
rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
Jumbo frames 8192 MTU.

All postgres data and logs are stored on the netapp.

In the latest episode, the index corruption was coincident with a slony-induced vacuum. I don't know if this was the case with our test system failures.

What can we do to help identify the cause of this? I believe we will be able to reproduce this on a test system if there is some useful investigation we can perform.

__
Marc
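A minimal sketch of the reindexing step Marc describes, assuming a stock Slony-I installation where the cluster schema is the cluster name prefixed with an underscore; "_mycluster" is a hypothetical name standing in for the real one:

    -- Hypothetical cluster namespace "_mycluster"; substitute the real schema.
    -- Rebuild every index on the Slony log table on the provider node:
    REINDEX TABLE _mycluster.sl_log_1;

    -- The rebuilt index definitions can then be inspected with:
    SELECT indexname, indexdef
      FROM pg_indexes
     WHERE schemaname = '_mycluster'
       AND tablename  = 'sl_log_1';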
Marc Munro <marc@bloodnok.com> writes:
> We have now experienced index corruption on two separate but identical
> slony clusters. In each case the slony subscriber failed after
> attempting to insert a duplicate record. In each case reindexing the
> sl_log_1 table on the provider fixed the problem.

Please be more specific about what you mean by "index corruption" ... what were the exact symptoms?

> postgresql-8.0.3-1PGDG

The *first* thing you should do is update to 8.0.8 and see if the problem is still there.

            regards, tom lane
On Tom Lane's advice, we upgraded to Postgres 8.0.8. We also upgraded slony to 1.1.5, due to some rpm issues. Apart from that, everything is as described below.

We were able to corrupt the index within 90 minutes of starting up our cluster. A slony-induced vacuum was under way on the provider at the time the subscriber failed.

What can we do to help identify the cause of this? We have a test system that seems able to reproduce this fairly easily.

__
Marc

On Wed, 2006-06-28 at 09:28 -0700, Marc Munro wrote:
> We have now experienced index corruption on two separate but identical
> slony clusters. In each case the slony subscriber failed after
> attempting to insert a duplicate record. In each case reindexing the
> sl_log_1 table on the provider fixed the problem.
>
> The latest occurrence was on our production cluster yesterday. This has
> only happened since we performed kernel upgrades and we are uncertain
> whether this represents a kernel bug, or a postgres bug exposed by
> different timings in the new kernel.
>
> Our systems are:
>
> Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
> Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
> kernel boot option: elevator=deadline
> 16 Gigs of RAM
> postgresql-8.0.3-1PGDG
> Bonded e1000/tg3 NICs with 8192 MTU.
> Slony 1.1.0
>
> NetApp FAS270 OnTap 7.0.3
> Mounted with the NFS options
> rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
> Jumbo frames 8192 MTU.
>
> All postgres data and logs are stored on the netapp.
>
> In the latest episode, the index corruption was coincident with a
> slony-induced vacuum. I don't know if this was the case with our test
> system failures.
>
> __
> Marc
Marc Munro <marc@bloodnok.com> writes:
> On Tom Lane's advice, we upgraded to Postgres 8.0.8.

OK, so it's not an already-known problem.

> We were able to corrupt the index within 90 minutes of starting up our
> cluster. A slony-induced vacuum was under way on the provider at the
> time the subscriber failed.

You still haven't given any details. What do you mean by "corrupt the index" --- what actual symptoms are you seeing?

            regards, tom lane
On Thu, 2006-06-29 at 12:11 -0400, Tom Lane wrote:
> OK, so it's not an already-known problem.
>
> > We were able to corrupt the index within 90 minutes of starting up our
> > cluster. A slony-induced vacuum was under way on the provider at the
> > time the subscriber failed.
>
> You still haven't given any details. What do you mean by "corrupt the
> index" --- what actual symptoms are you seeing?

We have a two node slony cluster with load tests running against the provider node. After resyncing the subscriber, and running load tests for about an hour, the slony subscriber begins to fail. From the log file:

2006-06-28 17:58:43 PDT ERROR remoteWorkerThread_1: "insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22001','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22002','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22003','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22004','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22005','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22006','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22007','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22007','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520942','22009','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520942','22010','0');
" ERROR: duplicate key violates unique constraint "table_trans_attr_pk"

As you see, slony is attempting to enter one tuple ('374520943','22007','0') two times.

Each previous time we have had this problem, rebuilding the indexes on the slony log table (sl_log_1) has fixed the problem. I have not reindexed the table this time as I do not want to destroy any usable evidence.

__
Marc
Marc Munro <marc@bloodnok.com> writes:
> As you see, slony is attempting to enter one tuple
> ('374520943','22007','0') two times.

> Each previous time we have had this problem, rebuilding the indexes on
> the slony log table (sl_log_1) has fixed the problem. I have not reindexed
> the table this time as I do not want to destroy any usable evidence.

Good. How big is the log table --- would it be possible for you to send me the physical table and index files? (Not a pg_dump, the actual disk files)

            regards, tom lane
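Locating the physical files Tom asks for can be done from SQL. A minimal sketch, assuming the table lives in the default tablespace and using the hypothetical cluster schema "_mycluster":

    -- Map sl_log_1 and its indexes to their on-disk file names.  The files
    -- live under $PGDATA/base/<database_oid>/<relfilenode>, with .1, .2, ...
    -- suffixes for segments beyond 1 GB.
    SELECT c.relname,
           c.relfilenode,
           (SELECT d.oid FROM pg_database d
             WHERE d.datname = current_database()) AS database_oid
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE n.nspname = '_mycluster'   -- placeholder Slony schema
       AND c.relname IN ('sl_log_1', 'sl_log_1_idx1', 'sl_log_1_idx2');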
We have reproduced the problem again. This time it looks like vacuum is not part of the picture.

From the provider's log:

2006-06-29 14:02:41 CST DEBUG2 cleanupThread: 101.057 seconds for vacuuming

And from the subscriber's:

2006-06-29 13:00:43 PDT ERROR remoteWorkerThread_1: "insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374740387','22008','4000');

If my maths is correct and the logs are honest, the vacuum would have started at 14:01:00 CST (13:01 PDT), about 20 seconds after we first encountered the problem. The clocks on the two machines, though in different timezones, are currently synced.

Tom, I will create another tarball of the sl_log_1 table and indexes. It should be quite a bit smaller than the previous one.

__
Marc
marc@bloodnok.com (Marc Munro) writes:
> As you see, slony is attempting to enter one tuple
> ('374520943','22007','0') two times.
>
> Each previous time we have had this problem, rebuilding the indexes on
> the slony log table (sl_log_1) has fixed the problem. I have not reindexed
> the table this time as I do not want to destroy any usable evidence.

We have seen this phenomenon on 7.4.8 several times; pulled dumps of sl_log_1 and index files that Jan Wieck looked at, which alas hasn't led to a fix.

He did, mind you, find some concurrency pattern that led, if memory serves, to 7.4.12's release. We had experienced cases where there was some worse corruption that required that we rebuild replicas from scratch :-(.

-- 
output = reverse("gro.mca" "@" "enworbbc") http://cbbrowne.com/info/advocacy.html
"There is no psychiatrist in the world like a puppy licking your face."
-- Ben Williams
Marc Munro <marc@bloodnok.com> writes:
> Tom,
> we have a newer and much smaller (35M) file showing the same thing:

Thanks. Looking into this, what I find is that *both* indexes have duplicated entries for the same heap tuple:

idx1:

Item 190 -- Length: 24  Offset: 3616 (0x0e20)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 24  Has Nulls: 0  Has Varwidths: 0
  0e20: 00009e6c 02001800 01000000 0057d123  ...l.........W.#
  0e30: 1a781200 00000000                    .x......

Item 191 -- Length: 24  Offset: 3592 (0x0e08)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 24  Has Nulls: 0  Has Varwidths: 0
  0e08: 00009e6c 02001800 01000000 0057d123  ...l.........W.#
  0e18: 2e781200 00000000                    .x......

idx2:

Item 127 -- Length: 16  Offset: 6144 (0x1800)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 16  Has Nulls: 0  Has Varwidths: 0
  1800: 00009e6c 02001000 0057d123 00000000  ...l.....W.#....

Item 128 -- Length: 16  Offset: 6128 (0x17f0)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 16  Has Nulls: 0  Has Varwidths: 0
  17f0: 00009e6c 02001000 0057d123 00000000  ...l.....W.#....

However, the two entries in idx1 contain different data!!

What I speculate right at the moment is that we are not looking at index corruption at all, but at heap corruption: somehow, the first insertion into ctid (27806,2) got lost and the same ctid got re-used for the next inserted row. We fixed one bug like this before ...

            regards, tom lane
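The suspect heap tuple can also be examined from SQL via a TID scan on the ctid named above; a sketch, with "_mycluster" again standing in for the real Slony schema:

    -- Fetch the heap tuple that both duplicate index entries point at, and
    -- show which transaction/command inserted it.  (27806,2) is the ctid
    -- discussed above.
    SELECT ctid, xmin, cmin, log_origin, log_xid, log_actionseq
      FROM _mycluster.sl_log_1
     WHERE ctid = '(27806,2)';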
On Thursday, 2006-06-29 at 16:42, Chris Browne wrote:
> marc@bloodnok.com (Marc Munro) writes:
> > As you see, slony is attempting to enter one tuple
> > ('374520943','22007','0') two times.
> >
> > Each previous time we have had this problem, rebuilding the indexes on
> > the slony log table (sl_log_1) has fixed the problem. I have not reindexed
> > the table this time as I do not want to destroy any usable evidence.
>
> We have seen this phenomenon on 7.4.8 several times; pulled dumps of
> sl_log_1 and index files that Jan Wieck looked at, which alas hasn't
> led to a fix.
>
> He did, mind you, find some concurrency pattern that led, if memory
> serves, to 7.4.12's release. We had experienced cases where there was
> some worse corruption that required that we rebuild replicas from
> scratch :-(.

How well did you check the C-language triggers and special slony functions for possibly corrupting some backend/shared-mem structures?

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
On Thursday, 2006-06-29 at 17:23, Tom Lane wrote:
> Marc Munro <marc@bloodnok.com> writes:
> > Tom,
> > we have a newer and much smaller (35M) file showing the same thing:
>
> Thanks. Looking into this, what I find is that *both* indexes have
> duplicated entries for the same heap tuple:
> ...
> However, the two entries in idx1 contain different data!!
>
> What I speculate right at the moment is that we are not looking at index
> corruption at all, but at heap corruption: somehow, the first insertion
> into ctid (27806,2) got lost and the same ctid got re-used for the next
> inserted row. We fixed one bug like this before ...

Marc: do you have triggers on some replicated tables?

I remember having some corruption in a database with weird circular trigger structures, some of them being slony log triggers.

The thing that seemed to mess things up was when a change on a parent row fired a trigger that changed child table rows, and those rows fired another trigger that changed the same parent row again.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
On Fri, 2006-06-30 at 00:37 +0300, Hannu Krosing wrote:
> Marc: do you have triggers on some replicated tables?

We have a non-slony trigger on only 2 tables, neither of them involved in this transaction. We certainly have no circular trigger structures.

> I remember having some corruption in a database with weird circular
> trigger structures, some of them being slony log triggers.
>
> The thing that seemed to mess things up was when a change on a parent
> row fired a trigger that changed child table rows, and those rows fired
> another trigger that changed the same parent row again.

__
Marc
I wrote:
> What I speculate right at the moment is that we are not looking at index
> corruption at all, but at heap corruption: somehow, the first insertion
> into ctid (27806,2) got lost and the same ctid got re-used for the next
> inserted row. We fixed one bug like this before ...

Further study makes this look more probable. It seems that most of the activity around the trouble spot consists of transactions inserting exactly 13 rows into the table; for instance, here are traces of two successive transactions:

    Block   Item    XMIN        CMIN
    27800   7       600921860   3
    27800   8       600921860   7
    27800   9       600921860   11
    27800   10      600921860   15
    27800   11      600921860   19
    27800   12      600921860   23
    27800   13      600921860   27
    27800   14      600921860   31
    27800   15      600921860   35
    27800   16      600921860   39
    27800   17      600921860   43
    27800   18      600921860   47
    27800   19      600921860   51
    27800   20      600921870   3
    27800   21      600921870   7
    27800   22      600921870   11
    27800   23      600921870   15
    27800   24      600921870   19
    27800   25      600921870   23
    27800   26      600921870   27
    27800   27      600921870   31
    27800   28      600921870   35
    27800   29      600921870   39
    27800   30      600921870   43
    27800   31      600921870   47
    27800   32      600921870   51

The pattern of CMIN values is the same for all these transactions. But look at the rows inserted by 600921856 and 600921858, the two transactions that seem to be involved with the problem on page 27806:

    27787   50      600921856   3
    27795   41      600921858   3
    27795   42      600921858   7
    27795   43      600921858   11
    27795   44      600921858   15
    27795   45      600921858   19
    27795   46      600921858   23
    27795   47      600921858   27
    27795   48      600921858   31
    27795   49      600921858   35
    27795   50      600921858   39
    27795   51      600921858   43
    27806   1       600921858   47
    27806   2       600921856   15
    27806   3       600921856   19
    27806   4       600921856   23
    27806   5       600921856   27
    27806   6       600921856   31
    27806   7       600921856   35
    27806   8       600921856   39
    27806   9       600921856   43
    27806   10      600921856   47
    27806   11      600921856   51
    27806   12      600921858   51

(27787,50) and (27795,51) are both the last rows on their pages. What's evidently happened is that the two transactions filled those pages and then both seized on 27806 as the next page to insert into. I think that 600921856 tried to insert its CMIN 7 and 11 rows as items 1 and 2 on that page, and then something wiped the page, then 600921858 inserted its CMIN 47 row as item 1, and then 600921856 got control back and finished inserting its rows.

Further study of the indexes shows that there are two entries in each index pointing to each of (27806,1) and (27806,2) --- but since the xxid values are different for the two (27806,1) entries, those didn't show up as duplicates in my first look.

Given the apparent connection to vacuuming, this is looking a WHOLE lot like this bug fixed in 8.0.3:

2005-05-07 17:32  tgl

    * src/backend/: access/heap/hio.c, commands/vacuumlazy.c
    (REL7_3_STABLE), access/heap/hio.c, access/nbtree/nbtpage.c,
    access/nbtree/nbtree.c, commands/vacuumlazy.c (REL7_4_STABLE),
    access/heap/hio.c, commands/vacuumlazy.c (REL7_2_STABLE),
    access/heap/hio.c, access/nbtree/nbtpage.c, access/nbtree/nbtree.c,
    commands/vacuumlazy.c (REL8_0_STABLE), access/heap/hio.c,
    access/nbtree/nbtpage.c, access/nbtree/nbtree.c,
    commands/vacuumlazy.c: Repair very-low-probability race condition
    between relation extension and VACUUM: in the interval between
    adding a new page to the relation and formatting it, it was
    possible for VACUUM to come along and decide it should format the
    page too. Though not harmful in itself, this would cause data loss
    if a third transaction were able to insert tuples into the vacuumed
    page before the original extender got control back.

Are you *certain* this slave isn't running 8.0.2 or older?
If you can verify that, then I guess we need to look for another mechanism that could cause the same kind of thing. regards, tom lane
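Verifying the server version actually running on each node (rather than whichever client binaries happen to be installed) is a one-liner:

    -- Run on both provider and subscriber; reports the backend's own version.
    SELECT version();
    SHOW server_version;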
Marc Munro <marc@bloodnok.com> writes:
> If there's anything we can do to help debug this we will. We can apply
> patches, different build options, etc.

One easy thing that would be worth trying is to build with --enable-cassert and see if any Asserts get provoked during the failure case. I don't have a lot of hope for that, but it's something that would require only machine time not people time.

A couple other things to try, given that you can provoke the failure fairly easily:

1. In studying the code, it bothers me a bit that P_NEW is the same as InvalidBlockNumber. The intended uses of P_NEW appear to be adequately interlocked, but it's fairly easy to see how something like this could happen if there are any places where InvalidBlockNumber is unintentionally passed to ReadBuffer --- that would look like a P_NEW call and it *wouldn't* be interlocked. So it would be worth changing P_NEW to "(-2)" (this should just take a change in bufmgr.h and recompile) and adding an "Assert(blockNum != InvalidBlockNumber)" at the head of ReadBufferInternal(). Then rebuild with asserts enabled and see if the failure case provokes that assert.

2. I'm also eyeing this bit of code in hio.c:

    /*
     * If the FSM knows nothing of the rel, try the last page before
     * we give up and extend.  This avoids one-tuple-per-page syndrome
     * during bootstrapping or in a recently-started system.
     */
    if (targetBlock == InvalidBlockNumber)
    {
        BlockNumber nblocks = RelationGetNumberOfBlocks(relation);

        if (nblocks > 0)
            targetBlock = nblocks - 1;
    }

If someone else has just extended the relation, it's possible that this will allow a process to get to the page before the intended extender has finished initializing it. AFAICT that's not harmful because the page will look like it has no free space ... but it seems a bit fragile. If you dike out the above-mentioned code, can you still provoke the failure?

A different line of attack is to see if you can make a self-contained test case so other people can try to reproduce it. More eyeballs on the problem are always better.

Lastly, it might be interesting to look at the WAL logs for the period leading up to a failure. This would give us an idea of what was happening concurrently with the processes that seem directly involved.

            regards, tom lane
[ back to the start of the thread... ]

Marc Munro <marc@bloodnok.com> writes:
> We have now experienced index corruption on two separate but identical
> slony clusters. In each case the slony subscriber failed after
> attempting to insert a duplicate record. In each case reindexing the
> sl_log_1 table on the provider fixed the problem.

> Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
> Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
> ...
> NetApp FAS270 OnTap 7.0.3
> Mounted with the NFS options
> rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
> Jumbo frames 8192 MTU.
> All postgres data and logs are stored on the netapp.

BTW, a couple of thoughts here:

* If my theory about the low-level cause is correct, then reindexing sl_log_1 would make the "duplicate key" errors go away, but nonetheless you'd have lost data --- the overwritten rows would be gone. I suppose that this would result in the slave missing some rows that are present on the master. Have you tried comparing slave and master databases to see if you can find any discrepancies?

* One way that the problem could happen would be if a race condition in the kernel allowed an lseek(fd, 0, SEEK_END) to return a value less than the true end-of-file (as determined by another process' write() extending the EOF just slightly earlier --- ie, lseek fails to note the effects of the just-completed write, and returns the prior EOF value). PG does have internal locking that should guarantee that the lseek is not done until after the write completes ... but could there be a bug in the kernel allowing stale data to be returned? The SMP hardware is relevant (maybe one processor sees different data than the other) and frankly I don't trust NFS very far at all for questions such as this. It'd be interesting to see if you can reproduce the problem in a database on local storage.

            regards, tom lane
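A hedged sketch of one way to do the slave/master comparison suggested above: run the same bucketed summary on both nodes and diff the output. The table and column names are taken from the error messages earlier in the thread, table_transaction_id is assumed to be an integer column, the bucket size is arbitrary, and this only catches missing or extra rows, not rows whose other columns differ:

    -- Run identically on master and slave, save the output, and diff it.
    SELECT table_transaction_id / 100000 AS bucket,   -- arbitrary bucket size
           count(*)                      AS row_count,
           min(table_transaction_id)     AS min_id,
           max(table_transaction_id)     AS max_id
      FROM public.table_trans_attribute
     GROUP BY table_transaction_id / 100000
     ORDER BY 1;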
On Thu, 2006-06-29 at 21:47 -0400, Tom Lane wrote:
> One easy thing that would be worth trying is to build with
> --enable-cassert and see if any Asserts get provoked during the
> failure case. I don't have a lot of hope for that, but it's
> something that would require only machine time not people time.

I'll try this tomorrow.

> A couple other things to try, given that you can provoke the failure
> fairly easily:
>
> 1. In studying the code, it bothers me a bit that P_NEW is the same as
> InvalidBlockNumber. The intended uses of P_NEW appear to be adequately
> interlocked, but it's fairly easy to see how something like this could
> happen if there are any places where InvalidBlockNumber is
> unintentionally passed to ReadBuffer --- that would look like a P_NEW
> call and it *wouldn't* be interlocked. So it would be worth changing
> P_NEW to "(-2)" (this should just take a change in bufmgr.h and
> recompile) and adding an "Assert(blockNum != InvalidBlockNumber)"
> at the head of ReadBufferInternal(). Then rebuild with asserts enabled
> and see if the failure case provokes that assert.

I'll try this too.

> 2. I'm also eyeing this bit of code in hio.c:
>
>    /*
>     * If the FSM knows nothing of the rel, try the last page before
>     * we give up and extend.  This avoids one-tuple-per-page syndrome
>     * during bootstrapping or in a recently-started system.
>     */
>    if (targetBlock == InvalidBlockNumber)
>    {
>        BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
>
>        if (nblocks > 0)
>            targetBlock = nblocks - 1;
>    }
>
> If someone else has just extended the relation, it's possible that this
> will allow a process to get to the page before the intended extender has
> finished initializing it. AFAICT that's not harmful because the page
> will look like it has no free space ... but it seems a bit fragile.
> If you dike out the above-mentioned code, can you still provoke the
> failure?

By dike out, you mean remove? Please confirm and I'll try it.

> A different line of attack is to see if you can make a self-contained
> test case so other people can try to reproduce it. More eyeballs on the
> problem are always better.

Can't really see this being possible. This is clearly a very unusual problem and without similar hardware I doubt that anyone else will trigger it. We ran this system happily for nearly a year on the previous kernel without experiencing this problem (tcp lockups are a different matter). Also the load is provided by a bunch of servers and robots simulating rising and falling load.

> Lastly, it might be interesting to look at the WAL logs for the period
> leading up to a failure. This would give us an idea of what was
> happening concurrently with the processes that seem directly involved.

Next time we reproduce it, I'll take a copy of the WAL files too.

__
Marc
Marc Munro <marc@bloodnok.com> writes:
> By dike out, you mean remove? Please confirm and I'll try it.

Right, just remove (or comment out) the lines I quoted.

> We ran this system happily for nearly a year on the
> previous kernel without experiencing this problem (tcp lockups are a
> different matter).

I'm starting to think that a kernel bug is a realistic explanation, see other message. Whose kernel build is this exactly?

            regards, tom lane
On Thu, 2006-06-29 at 21:59 -0400, Tom Lane wrote:
> [ back to the start of the thread... ]
>
> BTW, a couple of thoughts here:
>
> * If my theory about the low-level cause is correct, then reindexing
> sl_log_1 would make the "duplicate key" errors go away, but nonetheless
> you'd have lost data --- the overwritten rows would be gone. I suppose
> that this would result in the slave missing some rows that are present
> on the master. Have you tried comparing slave and master databases to
> see if you can find any discrepancies?

Haven't done that yet - in test we tend to restart the old subscriber as the new provider and rebuild the cluster. I'll check the logs from our production failure to figure out what to compare and see what I can discover.

> * One way that the problem could happen would be if a race condition in
> the kernel allowed an lseek(fd, 0, SEEK_END) to return a value less than
> the true end-of-file (as determined by another process' write()
> extending the EOF just slightly earlier --- ie, lseek fails to note the
> effects of the just-completed write, and returns the prior EOF value).
> PG does have internal locking that should guarantee that the lseek is
> not done until after the write completes ... but could there be a bug in
> the kernel allowing stale data to be returned? The SMP hardware is
> relevant (maybe one processor sees different data than the other) and
> frankly I don't trust NFS very far at all for questions such as this.
> It'd be interesting to see if you can reproduce the problem in a
> database on local storage.

Unfortunately we haven't got any local storage that can stand the sort of loads we are putting through. With slower storage the CPUs mostly sit idle and we are very unlikely to trigger a timing-based bug if that's what it is.

I'll get back to you with kernel build information tomorrow. We'll also try to talk to some kernel hackers about this.

Many thanks for your efforts so far.

__
Marc
Marc Munro <marc@bloodnok.com> writes:
> I'll get back to you with kernel build information tomorrow. We'll also
> try to talk to some kernel hackers about this.

Some googling turned up recent discussions about race conditions in Linux NFS code:

http://threebit.net/mail-archive/linux-kernel/msg05313.html
http://lkml.org/lkml/2006/3/1/381

I don't know the kernel nearly well enough to guess if these are related ...

            regards, tom lane
On 6/30/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I don't know the kernel nearly well enough to guess if these are related
> ...

The sl_log_* tables are indexed on xid, where the relations between values are not exactly stable. When having high enough activity on one node or having nodes with XIDs on different enough positions funny things happen.

Have you already excluded this as a probable reason?

-- 
marko
"Marko Kreen" <markokr@gmail.com> writes: > The sl_log_* tables are indexed on xid, where the relations between > values are not exactly stable. When having high enough activity on > one node or having nodes with XIDs on different enough positions > funny things happen. Yeah, that was one of the first things I thought about, but the range of XIDs involved in these test cases isn't large enough to involve a wraparound, and anyway it's now looking like the problem is loss of heap entries, not index corruption at all. Slony's use of XID comparison semantics for indexes is definitely pretty scary though. If I were them I'd find a way to get rid of it. In theory, since the table is only supposed to contain "recent" XIDs, as long as you keep it vacuumed the index should never contain any inconsistently-comparable XIDs ... but in a big index, the boundary keys for upper-level index pages might hang around an awful long time ... regards, tom lane
I trawled through the first, larger dump you sent me, and found multiple index entries pointing to quite a few heap tuples:

    Occurrences   block    item
    2             43961    1
    2             43961    2
    2             43961    3
    2             43961    4
    2             43961    5
    2             43961    6
    2             43961    7
    2             43961    8
    2             43961    9
    2             119695   1
    2             119695   2
    2             119695   3
    2             126029   1
    2             126029   2
    2             126029   3
    2             166694   1
    2             166865   1
    2             166865   2
    2             166865   3
    2             166865   4
    2             166865   5
    2             166865   6
    2             166865   7
    2             206221   1
    2             247123   1
    2             327775   1
    2             327775   2
    2             327775   3
    2             327775   4
    2             327775   5
    2             327775   6
    2             327775   7
    2             327775   8
    2             327775   9
    2             327775   10
    2             327775   11

Both indexes show identical sets of duplicates, which makes it pretty hard to credit that it's a within-index problem. You mentioned that the test had been allowed to run for a good while after the first slave error was noted. So it seems there's no question that we are looking at some mechanism that allows the first few entries on a heap page to be lost and overwritten :-(, and that this happened several times over the course of the larger run.

            regards, tom lane
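One way to observe this kind of heap/index disagreement from SQL, without reading raw page dumps, is to force the same count through a sequential scan and through an index scan. This is a sketch only: it assumes the xxid comparison operators installed by Slony's xxid opclass, uses purely illustrative xid bounds, and uses "_mycluster" as a placeholder schema:

    -- Count rows in a suspect xid window via the heap only ...
    SET enable_indexscan = off;
    SET enable_seqscan   = on;
    SELECT count(*) FROM _mycluster.sl_log_1
     WHERE log_xid >= '600921850' AND log_xid < '600921900';

    -- ... and again via an index scan on sl_log_1_idx2.  A corrupt index
    -- can give a different answer, e.g. by visiting one heap tuple through
    -- two index entries.
    SET enable_indexscan = on;
    SET enable_seqscan   = off;
    SELECT count(*) FROM _mycluster.sl_log_1
     WHERE log_xid >= '600921850' AND log_xid < '600921900';

    RESET enable_indexscan;
    RESET enable_seqscan;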
On 6/30/2006 9:55 AM, Tom Lane wrote:
> "Marko Kreen" <markokr@gmail.com> writes:
>> The sl_log_* tables are indexed on xid, where the relations between
>> values are not exactly stable. When having high enough activity on
>> one node or having nodes with XIDs on different enough positions
>> funny things happen.
>
> Yeah, that was one of the first things I thought about, but the range of
> XIDs involved in these test cases isn't large enough to involve a
> wraparound, and anyway it's now looking like the problem is loss of heap
> entries, not index corruption at all.
>
> Slony's use of XID comparison semantics for indexes is definitely pretty
> scary though. If I were them I'd find a way to get rid of it. In
> theory, since the table is only supposed to contain "recent" XIDs,
> as long as you keep it vacuumed the index should never contain any
> inconsistently-comparable XIDs ... but in a big index, the boundary keys
> for upper-level index pages might hang around an awful long time ...

With the final implementation of log switching, the problem of xxid wraparound will be avoided entirely. Every now and then slony will switch from one to another log table and when the old one is drained and logically empty, it is truncated, which should reinitialize all indexes.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On 6/30/06, Jan Wieck <JanWieck@yahoo.com> wrote: > With the final implementation of log switching, the problem of xxid > wraparound will be avoided entirely. Every now and then slony will > switch from one to another log table and when the old one is drained and > logically empty, it is truncated, which should reinitialize all indexes. If the xxid-s come from different DB-s, then there can still be problems. -- marko
On 6/30/2006 11:17 AM, Marko Kreen wrote:
> On 6/30/06, Jan Wieck <JanWieck@yahoo.com> wrote:
>> With the final implementation of log switching, the problem of xxid
>> wraparound will be avoided entirely. Every now and then slony will
>> switch from one to another log table and when the old one is drained and
>> logically empty, it is truncated, which should reinitialize all indexes.
>
> If the xxid-s come from different DB-s, then there can still be problems.

How so? They are always part of a multi-key index having the originating node ID first.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes:
> On 6/30/2006 11:17 AM, Marko Kreen wrote:
>> If the xxid-s come from different DB-s, then there can still be problems.

> How so? They are always part of a multi-key index having the
> originating node ID first.

Really?

create table @NAMESPACE@.sl_log_1 (
    log_origin      int4,
    log_xid         @NAMESPACE@.xxid,
    log_tableid     int4,
    log_actionseq   int8,
    log_cmdtype     char,
    log_cmddata     text
);
create index sl_log_1_idx1 on @NAMESPACE@.sl_log_1
    (log_origin, log_xid @NAMESPACE@.xxid_ops, log_actionseq);

create index sl_log_1_idx2 on @NAMESPACE@.sl_log_1
    (log_xid @NAMESPACE@.xxid_ops);

sl_log_1_idx2 doesn't seem to have any such protection. When I was poking at Marc's example, though, it seemed that the numbers going into the table were all *locally generated* XIDs, in fact the same as the XID doing the insertions. If this is only true on the master, and slaves can be inserting XIDs coming from different masters, then I think it will break.

            regards, tom lane
Tom Lane wrote: > Marc Munro <marc@bloodnok.com> writes: >> I'll get back to you with kernel build information tomorrow. We'll also >> try to talk to some kernel hackers about this. > > Some googling turned up recent discussions about race conditions in > Linux NFS code: > > http://threebit.net/mail-archive/linux-kernel/msg05313.html > http://lkml.org/lkml/2006/3/1/381 > > I don't know the kernel nearly well enough to guess if these are related > ... It may or may not be the same issue, but for what it's worth, we've seen the same sl_log_1 corruption on AIX 5.1 and 5.3 -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
On 6/30/2006 11:55 AM, Tom Lane wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>> On 6/30/2006 11:17 AM, Marko Kreen wrote:
>>> If the xxid-s come from different DB-s, then there can still be problems.
>
>> How so? They are always part of a multi-key index having the
>> originating node ID first.
>
> Really?
>
> create table @NAMESPACE@.sl_log_1 (
>     log_origin      int4,
>     log_xid         @NAMESPACE@.xxid,
>     log_tableid     int4,
>     log_actionseq   int8,
>     log_cmdtype     char,
>     log_cmddata     text
> );
> create index sl_log_1_idx1 on @NAMESPACE@.sl_log_1
>     (log_origin, log_xid @NAMESPACE@.xxid_ops, log_actionseq);
>
> create index sl_log_1_idx2 on @NAMESPACE@.sl_log_1
>     (log_xid @NAMESPACE@.xxid_ops);

You're right ... forgot about that one. And yes, there can be transactions originating from multiple origins (masters) in the same log. The thing is, the index is only there because in a single origin situation (which most installations are), the log_origin is always the same. The optimizer therefore sometimes didn't think using an index at all would be good.

However, transactions from different origins are NEVER selected together and it wouldn't make sense to compare their xid's anyway. So the index might return index tuples for rows from another origin, but the following qualifications against the log_origin in the heap tuple will filter them out.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Brad Nicholson wrote: > Tom Lane wrote: >> Marc Munro <marc@bloodnok.com> writes: >>> I'll get back to you with kernel build information tomorrow. We'll also >>> try to talk to some kernel hackers about this. >> Some googling turned up recent discussions about race conditions in >> Linux NFS code: >> >> http://threebit.net/mail-archive/linux-kernel/msg05313.html >> http://lkml.org/lkml/2006/3/1/381 >> >> I don't know the kernel nearly well enough to guess if these are related >> ... > > It may or may not be the same issue, but for what it's worth, we've seen > the same sl_log_1 corruption on AIX 5.1 and 5.3 > On 7.4.12, I should add. -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
Brad Nicholson <bnichols@ca.afilias.info> writes: > It may or may not be the same issue, but for what it's worth, we've seen > the same sl_log_1 corruption on AIX 5.1 and 5.3 Hm, on what filesystem, and what PG version(s)? I'm not completely satisfied by the its-a-kernel-bug theory, because if it were then ISTM extending an index would be subject to the same risks as extending a table; but I see no evidence of index page lossage in Marc's dump. OTOH the usage patterns are different, so maybe PG isn't stressing the write-to-lseek path quite as hard for indexes. regards, tom lane
Tom Lane wrote: > Brad Nicholson <bnichols@ca.afilias.info> writes: >> It may or may not be the same issue, but for what it's worth, we've seen >> the same sl_log_1 corruption on AIX 5.1 and 5.3 > > Hm, on what filesystem, and what PG version(s)? > > I'm not completely satisfied by the its-a-kernel-bug theory, because if > it were then ISTM extending an index would be subject to the same risks > as extending a table; but I see no evidence of index page lossage in > Marc's dump. OTOH the usage patterns are different, so maybe PG isn't > stressing the write-to-lseek path quite as hard for indexes. > > regards, tom lane jfs2 in all cases. I don't recall the PG version for 5.1, but it was greater that 7.4.8. For 5.3, it was 7.4.12. -- Brad Nicholson 416-673-4106 Database Administrator, Afilias Canada Corp.
Jan Wieck <JanWieck@Yahoo.com> writes: > You're right ... forgot about that one. > However, transactions from different origins are NEVER selected together > and it wouldn't make sense to compare their xid's anyway. So the index > might return index tuples for rows from another origin, but the > following qualifications against the log_origin in the heap tuple will > filter them out. No, that's not the point here. The point here is that if the xids that are simultaneously present in the index span more than a 2G-XID range, btree *will fail* because it will be dealing with keys that do not obey the transitive law. You do have a problem --- it doesn't explain Marc's troubles, but sl_log_1_idx2 is broken for multi master situations. All you need is masters with sufficiently different XID counters. regards, tom lane
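For illustration, this is roughly what an origin-qualified variant of the second index would look like, mirroring sl_log_1_idx1 so that btree never orders xxids from different origins against each other. It is a sketch of the idea under discussion, not what Slony-I 1.1 ships, and it does nothing about wraparound within a single origin:

    -- Hypothetical variant of sl_log_1_idx2 with the originating node first,
    -- so xxid comparisons only ever happen between values from one origin:
    create index sl_log_1_idx2 on @NAMESPACE@.sl_log_1
        (log_origin, log_xid @NAMESPACE@.xxid_ops);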
On Friday, 2006-06-30 at 12:05, Jan Wieck wrote:
> On 6/30/2006 11:55 AM, Tom Lane wrote:
>
> > Jan Wieck <JanWieck@Yahoo.com> writes:
> >> On 6/30/2006 11:17 AM, Marko Kreen wrote:
> >>> If the xxid-s come from different DB-s, then there can still be problems.
> >
> >> How so? They are always part of a multi-key index having the
> >> originating node ID first.
> >
> > Really?
> >
> > create table @NAMESPACE@.sl_log_1 (
> >     log_origin      int4,
> >     log_xid         @NAMESPACE@.xxid,
> >     log_tableid     int4,
> >     log_actionseq   int8,
> >     log_cmdtype     char,
> >     log_cmddata     text
> > );
> > create index sl_log_1_idx1 on @NAMESPACE@.sl_log_1
> >     (log_origin, log_xid @NAMESPACE@.xxid_ops, log_actionseq);
> >
> > create index sl_log_1_idx2 on @NAMESPACE@.sl_log_1
> >     (log_xid @NAMESPACE@.xxid_ops);
>
> You're right ... forgot about that one. And yes, there can be
> transactions originating from multiple origins (masters) in the same
> log. The thing is, the index is only there because in a single origin
> situation (which most installations are), the log_origin is always the
> same. The optimizer therefore sometimes didn't think using an index at
> all would be good.
>
> However, transactions from different origins are NEVER selected together
> and it wouldn't make sense to compare their xid's anyway. So the index
> might return index tuples for rows from another origin, but the
> following qualifications against the log_origin in the heap tuple will
> filter them out.

The problem is not only that the index can return too many rows, but just as much that it can return too few. If too few rows are returned, some actions will simply be left out of replication and thus will be missing from the slaves.

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
On Thu, 2006-06-29 at 21:47 -0400, Tom Lane wrote:
> One easy thing that would be worth trying is to build with
> --enable-cassert and see if any Asserts get provoked during the
>
> A couple other things to try, given that you can provoke the failure
> fairly easily:
> . . .
> 1. In studying the code, it bothers me a bit that P_NEW is the same as
> InvalidBlockNumber. The intended uses of P_NEW appear to be adequately
> . . .
> 2. I'm also eyeing this bit of code in hio.c:
>
> If someone else has just extended the relation, it's possible that this
> will allow a process to get to the page before the intended extender has
> . . .

We tried all of these suggestions and still get the problem. Nothing interesting in the log file so I guess the Asserts did not fire.

We are going to try experimenting with different kernels now. Unless anyone has any other suggestions.

__
Marc
Marc Munro <marc@bloodnok.com> writes: > We tried all of these suggestions and still get the problem. Nothing > interesting in the log file so I guess the Asserts did not fire. Not surprising, it was a long shot that any of those things were really broken. But worth testing. > We are going to try experimenting with different kernels now. Unless > anyone has any other suggestions. Right at the moment I have no better ideas :-( regards, tom lane
For the record, here are the results of our (ongoing) investigation into the index/heap corruption problems I reported a couple of weeks ago.

We were able to trigger the problem with kernels 2.6.16, 2.6.17 and 2.6.18.rc1, with 2.6.16 seeming to be the most flaky.

By replacing the NFS-mounted netapp with a fibre-channel SAN, we have eliminated the problem on all kernels.

From this, it would seem to be an NFS bug introduced post 2.6.14, though we cannot rule out a postgres bug exposed by unusual timing issues.

Our starting systems are:

Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux (and others)
kernel boot option: elevator=deadline
16 Gigs of RAM
postgresql-8.0.8-1PGDG
Bonded e1000/tg3 NICs with 8192 MTU.
Slony 1.1.5

NetApp FAS270 OnTap 7.0.3
Mounted with the NFS options
rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
Jumbo frames 8192 MTU.

All postgres data and logs are stored on the netapp.

All test results were reproduced with postgres 8.0.8.

__
Marc

On Fri, 2006-06-30 at 23:20 -0400, Tom Lane wrote:
> Marc Munro <marc@bloodnok.com> writes:
> > We tried all of these suggestions and still get the problem. Nothing
> > interesting in the log file so I guess the Asserts did not fire.
>
> Not surprising, it was a long shot that any of those things were really
> broken. But worth testing.
>
> > We are going to try experimenting with different kernels now. Unless
> > anyone has any other suggestions.
>
> Right at the moment I have no better ideas :-(
>
>             regards, tom lane
Marc Munro wrote:
> For the record, here are the results of our (ongoing) investigation into
> the index/heap corruption problems I reported a couple of weeks ago.
>
> We were able to trigger the problem with kernels 2.6.16, 2.6.17 and
> 2.6.18.rc1, with 2.6.16 seeming to be the most flaky.
>
> By replacing the NFS-mounted netapp with a fibre-channel SAN, we have
> eliminated the problem on all kernels.
>
> From this, it would seem to be an NFS bug introduced post 2.6.14, though
> we cannot rule out a postgres bug exposed by unusual timing issues.
>
> Our starting systems are:
>
> Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
> Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux (and others)
> kernel boot option: elevator=deadline
> 16 Gigs of RAM
> postgresql-8.0.8-1PGDG
> Bonded e1000/tg3 NICs with 8192 MTU.
> Slony 1.1.5
>
> NetApp FAS270 OnTap 7.0.3
> Mounted with the NFS options
> rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
> Jumbo frames 8192 MTU.
>
> All postgres data and logs are stored on the netapp.
>
> All test results were reproduced with postgres 8.0.8.
>
> __
> Marc

On a good stock day, some levity is justified. How are hackers like politicians?