Thread: Re: BUG #7969: Postgres Recovery Fatal With: "incorrect local pin count:2"
Folks, So I'm a bit surprised that this bug report hasn't gotten a follow-up. Does this sound like the known 9.2.2 corruption issue, or is it potentially something else? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: Re: BUG #7969: Postgres Recovery Fatal With: "incorrect local pin count:2"
From
Heikki Linnakangas
Date:
On 27.03.2013 20:27, Josh Berkus wrote: > Folks, > > So I'm a bit surprised that this bug report hasn't gotten a follow-up. > Does this sound like the known 9.2.2 corruption issue, or is it > potentially something else? It seems like a new issue. At a quick glance, I think there's a bug in heap_xlog_update, ie. the redo routine of a heap update. If the new tuple is put on a different page, and at redo, the new page doesn't exist (that's normal if it was later vacuumed away), heap_xlog_update leaks a pin on the old page. Here: > { > nbuffer = XLogReadBuffer(xlrec->target.node, > ItemPointerGetBlockNumber(&(xlrec->newtid)), > false); > if (!BufferIsValid(nbuffer)) > return; > page = (Page) BufferGetPage(nbuffer); > > if (XLByteLE(lsn, PageGetLSN(page))) /* changes are applied */ > { > UnlockReleaseBuffer(nbuffer); > if (BufferIsValid(obuffer)) > UnlockReleaseBuffer(obuffer); > return; > } > } Notice how in the first 'return' above, obuffer is not released. I'll try to create a reproducible test case for this, and fix.. - Heikki
Re: Re: BUG #7969: Postgres Recovery Fatal With: "incorrect local pin count:2"
From
Heikki Linnakangas
Date:
On 27.03.2013 21:04, Heikki Linnakangas wrote: > On 27.03.2013 20:27, Josh Berkus wrote: >> Folks, >> >> So I'm a bit surprised that this bug report hasn't gotten a follow-up. >> Does this sound like the known 9.2.2 corruption issue, or is it >> potentially something else? > > It seems like a new issue. At a quick glance, I think there's a bug in > heap_xlog_update, ie. the redo routine of a heap update. If the new > tuple is put on a different page, and at redo, the new page doesn't > exist (that's normal if it was later vacuumed away), heap_xlog_update > leaks a pin on the old page. Here: > >> { >> nbuffer = XLogReadBuffer(xlrec->target.node, >> ItemPointerGetBlockNumber(&(xlrec->newtid)), >> false); >> if (!BufferIsValid(nbuffer)) >> return; >> page = (Page) BufferGetPage(nbuffer); >> >> if (XLByteLE(lsn, PageGetLSN(page))) /* changes are applied */ >> { >> UnlockReleaseBuffer(nbuffer); >> if (BufferIsValid(obuffer)) >> UnlockReleaseBuffer(obuffer); >> return; >> } >> } > > Notice how in the first 'return' above, obuffer is not released. > > I'll try to create a reproducible test case for this, and fix.. Ok, here's how to reproduce it: create table foo (i int4 primary key); insert into foo select generate_series(1,1000); checkpoint; -- update a tuple from the first page, new tuple goes to last page update foo set i = 10000 where i = 1; -- delete everything on pages > 1 delete from foo where i > 10; -- truncate the table, including the page the updated tuple went to vacuum verbose foo; pg_ctl stop -m immediate This bug was introduced by commit 8805ff6580621d0daee350826de5211d6bb36ec3, in 9.2.2 (and 9.1.7 and 9.0.11), which fixed multiple WAL replay issues with Hot Standby. Before that commit, replaying a heap update didn't try to keep both buffers locked at the same time, which is necessary for the correctness of hot standby. The patch fixed that, but missed releasing the old buffer in this corner case. I was not able to come up with a scenario with full_page_writes=on where this would fail, but I'm also not 100% sure it can't happen. I scanned through the commit, and couldn't see any other instances of this kind of a bug. heap_xlog_update is more complicated than other redo functions, with all the return statements inside it. It could use some refactoring, but for now, I'll commit the attached small fix. - Heikki
Attachment
Heikki Linnakangas <hlinnakangas@vmware.com> writes: > This bug was introduced by commit > 8805ff6580621d0daee350826de5211d6bb36ec3, in 9.2.2 (and 9.1.7 and > 9.0.11), which fixed multiple WAL replay issues with Hot Standby. Ooops. Thanks for finding that. regards, tom lane
On Mar 27, 2013, at 12:48 PM, Heikki Linnakangas <hlinnakangas@vmware.com> = wrote: >=20 > This bug was introduced by commit 8805ff6580621d0daee350826de5211d6bb36ec= 3, in 9.2.2 (and 9.1.7 and 9.0.11), which fixed multiple WAL replay issues = with Hot Standby. Before that commit, replaying a heap update didn't try to= keep both buffers locked at the same time, which is necessary for the corr= ectness of hot standby. The patch fixed that, but missed releasing the old = buffer in this corner case. I was not able to come up with a scenario with = full_page_writes=3Don where this would fail, but I'm also not 100% sure it = can't happen. >=20 > I scanned through the commit, and couldn't see any other instances of thi= s kind of a bug. heap_xlog_update is more complicated than other redo funct= ions, with all the return statements inside it. It could use some refactori= ng, but for now, I'll commit the attached small fix. >=20 > - Heikki > <fix-heap-update-redo-buffer-leak.patch> Heikki: Is it safe to apply the patch you attached? We'd really like to roll out th= is fix to production instead of waiting for this to be released as a postgr= es minor version update. -Yunong=
"Yunong Xiao" <yunong@joyent.com> writes: > On Mar 27, 2013, at 12:48 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: >> This bug was introduced by commit 8805ff6580621d0daee350826de5211d6bb36ec3, in 9.2.2 (and 9.1.7 and 9.0.11), which fixedmultiple WAL replay issues with Hot Standby. Before that commit, replaying a heap update didn't try to keep both bufferslocked at the same time, which is necessary for the correctness of hot standby. The patch fixed that, but missed releasingthe old buffer in this corner case. I was not able to come up with a scenario with full_page_writes=on where thiswould fail, but I'm also not 100% sure it can't happen. >> >> I scanned through the commit, and couldn't see any other instances of this kind of a bug. heap_xlog_update is more complicatedthan other redo functions, with all the return statements inside it. It could use some refactoring, but for now,I'll commit the attached small fix. >> >> - Heikki >> <fix-heap-update-redo-buffer-leak.patch> > Heikki: > Is it safe to apply the patch you attached? We'd really like to roll out this fix to production instead of waiting forthis to be released as a postgres minor version update. Eh? That patch *is* in the latest minor versions. regards, tom lane
On Apr 11, 2013, at 10:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Yunong Xiao" <yunong@joyent.com> writes: >> On Mar 27, 2013, at 12:48 PM, Heikki Linnakangas <hlinnakangas@vmware.co= m> wrote: >>> This bug was introduced by commit 8805ff6580621d0daee350826de5211d6bb36= ec3, in 9.2.2 (and 9.1.7 and 9.0.11), which fixed multiple WAL replay issue= s with Hot Standby. Before that commit, replaying a heap update didn't try = to keep both buffers locked at the same time, which is necessary for the co= rrectness of hot standby. The patch fixed that, but missed releasing the ol= d buffer in this corner case. I was not able to come up with a scenario wit= h full_page_writes=3Don where this would fail, but I'm also not 100% sure i= t can't happen. >>>=20 >>> I scanned through the commit, and couldn't see any other instances of t= his kind of a bug. heap_xlog_update is more complicated than other redo fun= ctions, with all the return statements inside it. It could use some refacto= ring, but for now, I'll commit the attached small fix. >>>=20 >>> - Heikki >>> <fix-heap-update-redo-buffer-leak.patch> >=20 >> Heikki: >=20 >> Is it safe to apply the patch you attached? We'd really like to roll out= this fix to production instead of waiting for this to be released as a pos= tgres minor version update. >=20 > Eh? That patch *is* in the latest minor versions. >=20 > regards, tom lane Indeed it is. My apologies. Thanks so much for the fix!=