Thread: Hot standby, overflowed snapshots, testing
In GetSnapshotData(), we set subcount to -1 if the snapshot was overflowed:

>     subcount = GetKnownAssignedTransactions(snapshot->subxip,
>                                             &xmin, xmax, &overflow);
>
>     /*
>      * See if we have removed any subxids from KnownAssignedXids that
>      * we might need to see. If so, mark snapshot overflowed.
>      */
>     if (overflow)
>         subcount = -1;    /* overflowed */

In XidInMVCCSnapshot we do this:

>     /*
>      * In recovery we store all xids in the subxact array because it
>      * is by far the bigger array, and we mostly don't know which xids
>      * are top-level and which are subxacts. The xip array is empty.
>      *
>      * We start by searching subtrans, if we overflowed.
>      */
>     if (snapshot->subxcnt < 0)
>     {
>         /* overflowed, so convert xid to top-level */
>         xid = SubTransGetTopmostTransaction(xid);
>
>         /*
>          * If xid was indeed a subxact, we might now have an xid < xmin, so
>          * recheck to avoid an array scan. No point in rechecking xmax.
>          */
>         if (TransactionIdPrecedes(xid, snapshot->xmin))
>             return false;
>     }
>
>     /*
>      * We now have either a top-level xid higher than xmin or an
>      * indeterminate xid. We don't know whether it's top level or subxact
>      * but it doesn't matter. If it's present, the xid is visible.
>      */
>     for (j = 0; j < snapshot->subxcnt; j++)
>     {
>         if (TransactionIdEquals(xid, snapshot->subxip[j]))
>             return true;
>     }

Note that if subxcnt is -1 to mark that the snapshot is overflowed, the
for-loop will do nothing. IOW, if the snapshot is overflowed,
XidInMVCCSnapshot always returns false.

This seems pretty straightforward to fix: we'll just need a separate flag
to mark whether the subxid array has overflowed. But I'm bringing this up
as a separate email because this is already the 2nd bug in
XidInMVCCSnapshot, the first one being the silly one that the
if-condition was backwards. It seems that no one has yet done any testing
of the subxact stuff and visibility, and that is pretty scary.

I got the impression earlier that you had some test environment set up
to test hot standby. Can you share any details of what test cases you've
run?

I think we're going to have a good number of volunteers to test Hot
Standby, but it would be useful to define some specific test cases to
exercise all the hairy recovery transaction tracking and snapshot related
things. Or at least provide a list of the difficult parts so that people
know what to test. If people know roughly what the tricky areas are,
we'll get better coverage than if people just kick the tires.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
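The failure mode described above can be reproduced in isolation. What follows is a minimal standalone sketch, not PostgreSQL code: MiniSnapshot, topmost_xid() and the suboverflowed flag are hypothetical names chosen for illustration, xids are compared as plain integers (wraparound is ignored), and the subtrans lookup is stubbed out. It shows the scan being skipped when subxcnt is -1, and one possible shape of the separate-flag fix that keeps the real count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;   /* stand-in for TransactionId; wraparound ignored */

/* Hypothetical, stripped-down snapshot; not the real SnapshotData layout. */
typedef struct MiniSnapshot
{
    Xid         xmin;
    const Xid  *subxip;         /* xids stored in the snapshot */
    int         subxcnt;        /* entries in subxip; -1 in the buggy encoding */
    bool        suboverflowed;  /* assumed separate flag, read only by the fix */
} MiniSnapshot;

/* Stub for the subtrans lookup: pretend every xid is already top-level. */
Xid
topmost_xid(Xid xid)
{
    return xid;
}

/* Mirrors the quoted logic: subxcnt == -1 doubles as the overflow marker. */
bool
xid_in_snapshot_buggy(const MiniSnapshot *snap, Xid xid)
{
    int         j;

    assert(snap != NULL);

    if (snap->subxcnt < 0)
    {
        xid = topmost_xid(xid);
        if (xid < snap->xmin)
            return false;
    }

    /* With subxcnt == -1 the test 0 < -1 fails at once: nothing is scanned. */
    for (j = 0; j < snap->subxcnt; j++)
    {
        if (xid == snap->subxip[j])
            return true;
    }
    return false;
}

/* Sketch of the fix: overflow lives in its own flag, subxcnt stays a count. */
bool
xid_in_snapshot_fixed(const MiniSnapshot *snap, Xid xid)
{
    int         j;

    assert(snap != NULL);

    if (snap->suboverflowed)
    {
        xid = topmost_xid(xid);
        if (xid < snap->xmin)
            return false;
    }

    /* subxcnt is still the true array length, so the scan actually runs. */
    for (j = 0; j < snap->subxcnt; j++)
    {
        if (xid == snap->subxip[j])
            return true;
    }
    return false;
}
```

With the array {100, 105, 110}, the buggy version called with subxcnt = -1 never finds 105, while the fixed version called with subxcnt = 3 and suboverflowed = true does. The point of the sketch is only that the overflow marker and the array length need separate storage; how the real patch names or places the flag is not specified in this thread.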
On Fri, 2009-11-13 at 22:19 +0200, Heikki Linnakangas wrote:

> I got the impression earlier that you had some test environment set up
> to test hot standby. Can you share any details of what test cases
> you've run?

Fair question. The Sep 15 submission happened too quickly for us to
mobilise testers, so the final submission was submitted with only manual
testing by me. Many last minute major bug fixes meant that the code was
much less tested than I would have hoped - you found some of those while
I lay exhausted from the efforts to hit a superimposed and unrealistic
deadline. I expected us to kick in to fix those but it never happened
and that was why I was keen to withdraw the patch about a week later.

You've been kicking hell out of it for a while now, rightly so, so I've
left it a while before commencing another set of changes and more
testing to follow.

It takes time, and money, to mobilise qualified testers, so that should
begin again shortly.

I agreed with you at PGday that we shouldn't expect a quick commit.
There are good reasons for that, but still no panic in my mind about
skipping this release.

--
 Simon Riggs           www.2ndQuadrant.com
Hi Simon and Heikki,

I can help set up automated basic tests for hot standby using 1+1 setups
on Amazon. I'm already working on tests for warm standby for our
commercial Tungsten implementation and need to solve the problem of
creating tests that adapt flexibly across different replication
mechanisms.

It would be nice to add a list of test cases to the write-up on the Hot
Standby wiki (http://wiki.postgresql.org/wiki/Hot_Standby). I would be
happy to help with that effort.

Cheers, Robert

On 11/13/09 1:43 PM PST, "Simon Riggs" <simon@2ndQuadrant.com> wrote:

> On Fri, 2009-11-13 at 22:19 +0200, Heikki Linnakangas wrote:
>
>> I got the impression earlier that you had some test environment set up
>> to test hot standby. Can you share any details of what test cases
>> you've run?
>
> Fair question. The Sep 15 submission happened too quickly for us to
> mobilise testers, so the final submission was submitted with only manual
> testing by me. Many last minute major bug fixes meant that the code was
> much less tested than I would have hoped - you found some of those while
> I lay exhausted from the efforts to hit a superimposed and unrealistic
> deadline. I expected us to kick in to fix those but it never happened
> and that was why I was keen to withdraw the patch about a week later.
>
> You've been kicking hell out of it for a while now, rightly so, so I've
> left it a while before commencing another set of changes and more
> testing to follow.
>
> It takes time, and money, to mobilise qualified testers, so that should
> begin again shortly.
>
> I agreed with you at PGday that we shouldn't expect a quick commit.
> There are good reasons for that, but still no panic in my mind about
> skipping this release.
>
> --
>  Simon Riggs           www.2ndQuadrant.com
On Sat, 2009-11-14 at 08:43 -0800, Robert Hodges wrote:

> I can help set up automated basic tests for hot standby using 1+1 setups on
> Amazon. I'm already working on tests for warm standby for our commercial
> Tungsten implementation and need to solve the problem of creating tests that
> adapt flexibly across different replication mechanisms.

I didn't leap immediately to say yes for a couple of reasons.

More than 50% of the bugs found on HS now have been theoretical-ish
issues that would be very difficult to observe, let alone isolate, with
black box testing. In many cases they are unlikely to happen, but that is
not our approach to quality. This shows there isn't a good substitute for
very long explanatory comments which are then read and challenged by a
reviewer, though I would note Heikki's particular skill in doing that.

The second most frequent class of bugs has been "unit test" bugs, where
the modules themselves need better unit testing. Black box testing only
works to address this when there is an exhaustive test-coverage driven
approach, but even then it's hard to inject real/appropriate conditions
into many deeply buried routines. The best way seems to be just multiple
debugger sessions and lots of time.

HS is characterised by a very low "additional feature" profile. It
leverages many existing modules to create something on the standby that
already exists on the primary. So in many ways it is a very different
sort of patch to many others.

There have been a few dumb-ass bugs and I hold my hand up to those,
though the reason is to do with timing of patch delivery and testing. I
don't see any long term issues, just unfortunate short term circumstance
because of patch churn.

--
 Simon Riggs           www.2ndQuadrant.com
On 11/15/09 2:25 AM PST, "Simon Riggs" <simon@2ndQuadrant.com> wrote:

> On Sat, 2009-11-14 at 08:43 -0800, Robert Hodges wrote:
>
>> I can help set up automated basic tests for hot standby using 1+1 setups on
>> Amazon. I'm already working on tests for warm standby for our commercial
>> Tungsten implementation and need to solve the problem of creating tests that
>> adapt flexibly across different replication mechanisms.
>
> I didn't leap immediately to say yes for a couple of reasons.

I'm easy on this. We are going to find some hot standby problems no
matter what from our own testing. At least I hope so.

It does sound to me as if there is a class of errors that would be
easiest to find by putting up a long running test that throws a lot of
different queries at the server over time. We have such tests already
written in our Bristlecone tools.

Cheers, Robert