Thread: Hot standby, overflowed snapshots, testing

Hot standby, overflowed snapshots, testing

From

Heikki Linnakangas

Date:

13 November 2009, 16:19:31

In GetSnapshotData(), we set subcount to -1 if the snapshot was overflowed:

>         subcount = GetKnownAssignedTransactions(snapshot->subxip,
>                                                 &xmin, xmax, &overflow);
> 
>         /*
>          * See if we have removed any subxids from KnownAssignedXids that
>          * we might need to see. If so, mark snapshot overflowed.
>          */
>         if (overflow)
>             subcount = -1;    /* overflowed */

In XidInMVCCSnapshot we do this:

>         /*
>          * In recovery we store all xids in the subxact array because it
>          * is by far the bigger array, and we mostly don't know which xids
>          * are top-level and which are subxacts. The xip array is empty.
>          *
>          * We start by searching subtrans, if we overflowed.
>          */
>         if (snapshot->subxcnt < 0)
>         {
>             /* overflowed, so convert xid to top-level */
>             xid = SubTransGetTopmostTransaction(xid);
> 
>             /*
>              * If xid was indeed a subxact, we might now have an xid < xmin, so
>              * recheck to avoid an array scan.    No point in rechecking xmax.
>              */
>             if (TransactionIdPrecedes(xid, snapshot->xmin))
>                 return false;
>         }
> 
>         /*
>          * We now have either a top-level xid higher than xmin or an
>          * indeterminate xid. We don't know whether it's top level or subxact
>          * but it doesn't matter. If it's present, the xid is visible.
>          */
>         for (j = 0; j < snapshot->subxcnt; j++)
>         {
>             if (TransactionIdEquals(xid, snapshot->subxip[j]))
>                 return true;
>         }

Note that if subxcnt is -1 to mark that the snapshot is overflowed, the
for-loop will do nothing. IOW, if the snapshot is overflowed,
XidInMVCCSnapshot always returns false.

This seems pretty straightforward to fix: we'll just need a separate flag
to mark whether the subxid array has overflowed. I'm bringing this
up as a separate email because this is already the 2nd bug in
XidInMVCCSnapshot, the first one being the silly one where the
if-condition was backwards. It seems that no one has yet done any testing
of the subxact stuff and visibility, and that is pretty scary.

I got the impression earlier that you had some test environment set up
to test hot standby. Can you share any details of what test cases you've
run?

I think we're going to have a good number of volunteers to test Hot
Standby, but it would be useful to define some specific test cases to
exercise all the hairy recovery transaction tracking and snapshot
related things. Or at least provide a list of the difficult parts so
that people know what to test. If people know roughly what the tricky
areas are, we'll get better coverage than if people just kick the tires.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com

Re: Hot standby, overflowed snapshots, testing

From

Simon Riggs

Date:

13 November 2009, 17:44:01

On Fri, 2009-11-13 at 22:19 +0200, Heikki Linnakangas wrote:

> I got the impression earlier that you had some test environment set up
> to test hot standby. Can you share any details of what test cases
> you've run?

Fair question. The Sep 15 submission happened too quickly for us to
mobilise testers, so the final submission was submitted with only manual
testing by me. Many last minute major bug fixes meant that the code was
much less tested than I would have hoped - you found some of those while
I lay exhausted from the efforts to hit a superimposed and unrealistic
deadline. I expected us to kick in to fix those but it never happened
and that was why I was keen to withdraw the patch about a week later.

You've been kicking hell out of it for a while now, rightly so, so I've
left it a while before commencing another set of changes and more
testing to follow.

It takes time, and money, to mobilise qualified testers, so that should
begin again shortly.

I agreed with you at PGday that we shouldn't expect a quick commit.
There are good reasons for that, but still no panic in my mind about
skipping this release.

-- Simon Riggs           www.2ndQuadrant.com

Re: Hot standby, overflowed snapshots, testing

From

Robert Hodges

Date:

14 November 2009, 12:44:14

Hi Simon and Heikki,

I can help set up automated basic tests for hot standby using 1+1 setups on
Amazon.   I'm already working on tests for warm standby for our commercial
Tungsten implementation and need to solve the problem of creating tests that
adapt flexibly across different replication mechanisms.

It would be nice to add a list of test cases to the write-up on the Hot
Standby wiki (http://wiki.postgresql.org/wiki/Hot_Standby).  I would be
happy to help with that effort.

Cheers, Robert

Re: Hot standby, overflowed snapshots, testing

From

Simon Riggs

Date:

15 November 2009, 06:28:34

On Sat, 2009-11-14 at 08:43 -0800, Robert Hodges wrote:

> I can help set up automated basic tests for hot standby using 1+1 setups on
> Amazon.   I'm already working on tests for warm standby for our commercial
> Tungsten implementation and need to solve the problem of creating tests that
> adapt flexibly across different replication mechanisms.

I didn't leap immediately to say yes for a couple of reasons.

More than 50% of the bugs found on HS now have been theoretical-ish
issues that would be very difficult to observe, let alone isolate with
black box testing. In many cases they are unlikely to happen, but that
is not our approach to quality. This shows there isn't a good substitute
for very long explanatory comments which are then read and challenged by
a reviewer, though I would note Heikki's particular skill in doing that.

The second most frequent class of bugs have been "unit test" bugs, where
the modules themselves need better unit testing.  Black box testing only
works to address this when there is an exhaustive test-coverage driven
approach, but even then it's hard to inject real/appropriate conditions
into many deeply buried routines. Best way seems to be just multiple
debugger sessions and lots of time.

HS is characterised by a very low "additional feature" profile. It
leverages many existing modules to create something on the standby that
already exists on the primary. So in many ways it is a very different
sort of patch to many others.

There have been a few dumb-ass bugs and I hold my hand up to those,
though the reason is to do with timing of patch delivery and testing. I
don't see any long term issues, just unfortunate short term circumstance
because of patch churn.

-- Simon Riggs           www.2ndQuadrant.com

Re: Hot standby, overflowed snapshots, testing

From

Robert Hodges

Date:

15 November 2009, 19:19:38

On 11/15/09 2:25 AM PST, "Simon Riggs" <simon@2ndQuadrant.com> wrote:

> On Sat, 2009-11-14 at 08:43 -0800, Robert Hodges wrote:
> 
>> I can help set up automated basic tests for hot standby using 1+1 setups on
>> Amazon.   I'm already working on tests for warm standby for our commercial
>> Tungsten implementation and need to solve the problem of creating tests that
>> adapt flexibly across different replication mechanisms.
> 
> I didn't leap immediately to say yes for a couple of reasons.
> 
I'm easy on this.  We are going to find some hot standby problems no matter
what from our own testing.  At least I hope so.

It does sound to me as if there is a class of errors that would be easiest
to find by putting up a long running test that throws a lot of different
queries at the server over time.  We have such tests already written in our
Bristlecone tools. 

Cheers, Robert