Thread: Hot Standby dev build (v8)

Hot Standby dev build (v8)

From
Simon Riggs
Date:
http://wiki.postgresql.org/wiki/Hot_Standby

v8 - resolves a dozen minor issues.

Dev build with various debugging enabled. Expect clean v10 on Friday.

All outstanding items from emails now listed on Wiki.

Some new refactoring targets and tuning opportunities identified as
knock-on effects from earlier refactoring.

Testing started on this about 3 hrs ago, nothing new found as yet.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: Hot Standby dev build (v8)

From
Simon Riggs
Date:
On Wed, 2009-01-14 at 23:55 +0000, Simon Riggs wrote:
> http://wiki.postgresql.org/wiki/Hot_Standby
>
> v8 - resolves a dozen minor issues.

Found and fixed another 8 issues, through code inspection, bash test and
specific feature tests.

Looking much better; still working - check Wiki for status.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Re: Hot Standby dev build (v8)

From
Heikki Linnakangas
Date:
I don't think RecentGlobalXmin is good enough here:

> !             /*
> !              * We would like to set an accurate latestRemovedXid, but there
> !              * is no easy way of obtaining a useful value. So we use the
> !              * probably far too conservative value of RecentGlobalXmin instead.
> !              */
> !             xlrec_delete.latestRemovedXid = RecentGlobalXmin;
> !             rdata[0].data = (char *) &xlrec_delete;
> !             rdata[0].len = SizeOfBtreeDelete;

RecentGlobalXmin is just a hint, it lags behind the real oldest xmin 
that GetOldestXmin() would return. If another backend has a more recent 
RecentGlobalXmin value, and has killed more recent tuples on the page, 
the latestRemovedXid written here is too old.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby dev build (v8)

From
Simon Riggs
Date:
On Fri, 2009-01-16 at 16:33 +0200, Heikki Linnakangas wrote:
> I don't think RecentGlobalXmin is good enough here:
> 
> > !             /*
> > !              * We would like to set an accurate latestRemovedXid, but there
> > !              * is no easy way of obtaining a useful value. So we use the
> > !              * probably far too conservative value of RecentGlobalXmin instead.
> > !              */
> > !             xlrec_delete.latestRemovedXid = RecentGlobalXmin;
> > !             rdata[0].data = (char *) &xlrec_delete;
> > !             rdata[0].len = SizeOfBtreeDelete;
> 
> RecentGlobalXmin is just a hint, it lags behind the real oldest xmin 
> that GetOldestXmin() would return. If another backend has a more recent 
> RecentGlobalXmin value, and has killed more recent tuples on the page, 
> the latestRemovedXid written here is too old.

What do you think we should do instead?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby dev build (v8)

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2009-01-16 at 16:33 +0200, Heikki Linnakangas wrote:
>> I don't think RecentGlobalXmin is good enough here:
>>
>>> !             /*
>>> !              * We would like to set an accurate latestRemovedXid, but there
>>> !              * is no easy way of obtaining a useful value. So we use the
>>> !              * probably far too conservative value of RecentGlobalXmin instead.
>>> !              */
>>> !             xlrec_delete.latestRemovedXid = RecentGlobalXmin;
>>> !             rdata[0].data = (char *) &xlrec_delete;
>>> !             rdata[0].len = SizeOfBtreeDelete;
>> RecentGlobalXmin is just a hint, it lags behind the real oldest xmin 
>> that GetOldestXmin() would return. If another backend has a more recent 
>> RecentGlobalXmin value, and has killed more recent tuples on the page, 
>> the latestRemovedXid written here is too old.
> 
> What do you think we should do instead?

Dunno. Maybe call GetOldestXmin().

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby dev build (v8)

From
Simon Riggs
Date:
On Fri, 2009-01-16 at 22:09 +0200, Heikki Linnakangas wrote:

> >> RecentGlobalXmin is just a hint, it lags behind the real oldest xmin 
> >> that GetOldestXmin() would return. If another backend has a more recent 
> >> RecentGlobalXmin value, and has killed more recent tuples on the page, 
> >> the latestRemovedXid written here is too old.
> > 
> > What do you think we should do instead?
> 
> Dunno. Maybe call GetOldestXmin().

We are discussing btree deletes, not btree vacuums. If we are doing
btree delete then we have an unreleased snapshot therefore we also have
a non-zero xmin. How can another backend have a later RecentGlobalXmin
or result from GetOldestXmin() than we do?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby dev build (v8)

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> On Fri, 2009-01-16 at 22:09 +0200, Heikki Linnakangas wrote:
> 
>>>> RecentGlobalXmin is just a hint, it lags behind the real oldest xmin 
>>>> that GetOldestXmin() would return. If another backend has a more recent 
>>>> RecentGlobalXmin value, and has killed more recent tuples on the page, 
>>>> the latestRemovedXid written here is too old.
>>> What do you think we should do instead?
>> Dunno. Maybe call GetOldestXmin().
> 
> We are discussing btree deletes, not btree vacuums. 

Pardon my ignorance, but what's the difference?

> If we are doing
> btree delete then we have an unreleased snapshot therefore we also have
> a non-zero xmin. How can another backend have a later RecentGlobalXmin
> or result from GetOldestXmin() than we do?

Sure it can, for example:

1. Transaction 1 begins in backend A
2. Transaction 2 begins in backend B, xmin = 1
3. Transaction 1 ends
4. Transaction 3 begins in backend C, xmin = 2
5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1
6. Transaction 2 ends.
7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, 
RecentGlobalXmin = 2
8. Transaction 4 kills tuple, using its RecentGlobalXmin of 1
9. Transaction 3 splits the page, emits a delete xlog record, setting 
latestRemovedXid to its RecentGlobalXmin of 2

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby dev build (v8)

From
Simon Riggs
Date:
On Mon, 2009-01-19 at 09:16 +0200, Heikki Linnakangas wrote: 
> Simon Riggs wrote:
> > On Fri, 2009-01-16 at 22:09 +0200, Heikki Linnakangas wrote:
> > 
> >>>> RecentGlobalXmin is just a hint, it lags behind the real oldest xmin 
> >>>> that GetOldestXmin() would return. If another backend has a more recent 
> >>>> RecentGlobalXmin value, and has killed more recent tuples on the page, 
> >>>> the latestRemovedXid written here is too old.
> >>> What do you think we should do instead?
> >> Dunno. Maybe call GetOldestXmin().
> > 
> > We are discussing btree deletes, not btree vacuums. 
> 
> Pardon my ignorance, but what's the difference?

In terms of current HEAD, not much. In terms of Hot Standby, a
significant difference - the two actions have been split, rather than
continuing to share the same WAL record.

XLOG_BTREE_VACUUM removes index tuples as a result of a vacuum. The
initial scan of the heap already generated an XLOG_HEAP2_CLEANUP_INFO
which gives the latestRemovedXid for that vacuum. So we don't need to
worry about putting a latestRemovedXid on XLOG_BTREE_VACUUM. The WAL
records also differ because the XLOG_BTREE_VACUUM contains details of
blocks that need to be pinned but not otherwise touched.

XLOG_BTREE_DELETE is different in 3 ways. It isn't part of a vacuum, so:
* we don't need to take a cleanup lock
* it doesn't contain info about other blocks we need to scan beforehand
for correctness purposes
* it wasn't preceded by an XLOG_HEAP2_CLEANUP_INFO record, so it must
have a *correct* (even if too conservative) value for latestRemovedXid
set.

So the only time we need to set latestRemovedXid correctly is during a
normal transaction, not during a vacuum.
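
To illustrate why the value must be correct and not merely present, here is a minimal sketch of how a standby might test a query against latestRemovedXid. It rests on an assumption that is not stated above: a standby snapshot conflicts when its xmin does not follow latestRemovedXid. The names xid_follows and count_conflicts are hypothetical, and the comparison only imitates wraparound-aware xid ordering for normal xids.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define INVALID_XID ((TransactionId) 0)

/* Wraparound-aware "a is newer than b" for normal xids (modulo 2^32). */
static int xid_follows(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) > 0;
}

/*
 * Count standby snapshots that could still see a tuple removed by a WAL
 * record stamped with latest_removed. INVALID_XID entries mean "no snapshot".
 * ASSUMPTION: a snapshot conflicts when its xmin does not follow
 * latest_removed.
 */
static size_t count_conflicts(TransactionId latest_removed,
                              const TransactionId *snapshot_xmins, size_t n)
{
    size_t conflicts = 0;

    assert(snapshot_xmins != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        if (snapshot_xmins[i] == INVALID_XID)
            continue;               /* slot has no active snapshot */
        if (!xid_follows(snapshot_xmins[i], latest_removed))
            conflicts++;            /* snapshot may need the removed tuple */
    }
    return conflicts;
}
```

With snapshots at xmin 90, 100 and 101, a record stamped 100 flags two of them, while a record stamped 95 flags only one. If the removed tuple really belonged to xid 100, the too-old stamp would let the query at xmin 100 run on without a conflict.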

> > If we are doing
> > btree delete then we have an unreleased snapshot therefore we also have
> > a non-zero xmin. How can another backend have a later RecentGlobalXmin
> > or result from GetOldestXmin() than we do?
> 
> Sure it can, for example:
> 
> 1. Transaction 1 begins in backend A
> 2. Transaction 2 begins in backend B, xmin = 1
> 3. Transaction 1 ends
> 4. Transaction 3 begins in backend C, xmin = 2
> 5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1
> 6. Transaction 2 ends.
> 7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, 
> RecentGlobalXmin = 2
> 8. Transaction 4 kills tuple, using its RecentGlobalXmin of 1
> 9. Transaction 3 splits the page, emits a delete xlog record, setting 
> latestRemovedXid to its RecentGlobalXmin of 2

Well, steps 7 and 8 don't make sense.

Your earlier comment was that it was possible for a WAL record to be
written with a RecentGlobalXmin that was lower than other backends
values. In step 9 the RecentGlobalXmin is *not* lower than any other
backend, it is the same. 

So if there is a proof, this isn't it. 

But I can't see how there can be one: Two concurrent vacuums can have
different OldestXmin values, but two concurrent transactions cannot.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby dev build (v8)

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> Well, steps 7 and 8 don't make sense.
> 
> Your earlier comment was that it was possible for a WAL record to be
> written with a RecentGlobalXmin that was lower than other backends
> values. In step 9 the RecentGlobalXmin is *not* lower than any other
> backend, it is the same. 
> 
> So if there is a proof, this isn't it. 

Yeah, you're right. I got steps 8 and 9 mixed. Let me try again:

1. Transaction 1 begins in backend A
2. Transaction 2 begins in backend B, xmin = 1
3. Transaction 1 ends
4. Transaction 3 begins in backend C, xmin = 2
5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1
6. Transaction 2 ends.
7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, 
RecentGlobalXmin = 2
8. Transaction 3 kills tuple, using its RecentGlobalXmin of 2
9. Transaction 4 splits the page, emits a delete xlog record, setting 
latestRemovedXid to its RecentGlobalXmin of 1

> But I can't see how there can be one: Two concurrent vacuums can have
> different OldestXmin values, but two concurrent transactions cannot.

Of course they can.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby dev build (v8)

From
Simon Riggs
Date:
On Mon, 2009-01-19 at 12:22 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Well, steps 7 and 8 don't make sense.
> > 
> > Your earlier comment was that it was possible for a WAL record to be
> > written with a RecentGlobalXmin that was lower than other backends
> > values. In step 9 the RecentGlobalXmin is *not* lower than any other
> > backend, it is the same. 
> > 
> > So if there is a proof, this isn't it. 
> 
> Yeah, you're right. I got steps 8 and 9 mixed. Let me try again:
> 
> 1. Transaction 1 begins in backend A
> 2. Transaction 2 begins in backend B, xmin = 1
> 3. Transaction 1 ends
> 4. Transaction 3 begins in backend C, xmin = 2
> 5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1
> 6. Transaction 2 ends.
> 7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, 
> RecentGlobalXmin = 2
> 8. Transaction 3 kills tuple, using its RecentGlobalXmin of 2
> 9. Transaction 4 splits the page, emits a delete xlog record, setting 
> latestRemovedXid to its RecentGlobalXmin of 1

One of us needs a coffee.

How does Transaction 4 have a RecentGlobalXmin of 2 in step (7), but at
step (9) the value of RecentGlobalXmin has gone backwards?

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby dev build (v8)

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> One of us needs a coffee.

Clearly, I put the kettle on...

> How does Transaction 4 have a RecentGlobalXmin of 2 in step (7), but at
> step (9) the value of RecentGlobalXmin has gone backwards?

Looks like I mixed up the xids of the two transactions in steps 8 and 9. 
Let's see if I got it right this time:

1. Transaction 1 begins in backend A
2. Transaction 2 begins in backend B, xmin = 1
3. Transaction 1 ends
4. Transaction 3 begins in backend C, xmin = 2
5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1
6. Transaction 2 ends.
7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, 
RecentGlobalXmin = 2
8. Transaction 4 kills tuple, using its RecentGlobalXmin of 2
9. Transaction 3 splits the page, emits a delete xlog record, setting 
latestRemovedXid to its RecentGlobalXmin of 1
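
The steps above can be replayed in a toy model. This is only a sketch: the struct, the three-slot proc array and every function name are hypothetical, xids are handed out at begin, and wraparound, locking and subtransactions are ignored. It shows only how a backend-local hint can lag behind another backend's hint.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;
#define INVALID_XID ((TransactionId) 0)

/* Toy per-backend state; field names are illustrative only. */
typedef struct
{
    TransactionId xid;                /* running transaction, if any */
    TransactionId xmin;               /* xmin published by its snapshot */
    TransactionId recent_global_xmin; /* backend-local cached hint */
} ToyBackend;

enum { BACKEND_A, BACKEND_B, BACKEND_C, NBACKENDS };

static ToyBackend procs[NBACKENDS];
static TransactionId next_xid;

static void begin_xact(int b)
{
    assert(b >= 0 && b < NBACKENDS);
    procs[b].xid = next_xid++;
}

static void end_xact(int b)
{
    assert(b >= 0 && b < NBACKENDS);
    procs[b].xid = INVALID_XID;
    procs[b].xmin = INVALID_XID;      /* the cached hint is not touched */
}

/* Simplified snapshot: no wraparound, locking, or subtransactions. */
static void take_snapshot(int me)
{
    TransactionId xmin = next_xid;
    TransactionId global;
    int i;

    assert(me >= 0 && me < NBACKENDS);
    for (i = 0; i < NBACKENDS; i++)
        if (procs[i].xid != INVALID_XID && procs[i].xid < xmin)
            xmin = procs[i].xid;      /* oldest running xid */

    global = xmin;
    for (i = 0; i < NBACKENDS; i++)
        if (procs[i].xmin != INVALID_XID && procs[i].xmin < global)
            global = procs[i].xmin;   /* oldest xmin published by anyone */

    procs[me].xmin = xmin;
    procs[me].recent_global_xmin = global;
}

/* Replays steps 1-7 and reports the hints held by backends C and A. */
static void run_scenario(TransactionId *hint_c, TransactionId *hint_a)
{
    memset(procs, 0, sizeof(procs));
    next_xid = 1;

    begin_xact(BACKEND_A);                            /* 1: xid 1 */
    begin_xact(BACKEND_B); take_snapshot(BACKEND_B);  /* 2: xid 2, xmin 1 */
    end_xact(BACKEND_A);                              /* 3 */
    begin_xact(BACKEND_C);                            /* 4: xid 3 */
    take_snapshot(BACKEND_C);                         /* 5: hint becomes 1 */
    end_xact(BACKEND_B);                              /* 6 */
    begin_xact(BACKEND_A); take_snapshot(BACKEND_A);  /* 7: hint becomes 2 */

    *hint_c = procs[BACKEND_C].recent_global_xmin;
    *hint_a = procs[BACKEND_A].recent_global_xmin;
}
```

Calling run_scenario() leaves backend C holding a hint of 1 and backend A holding a hint of 2, which is the precondition for steps 8 and 9. Because this model assigns xids at begin, the snapshot xmin it computes in step 7 need not match the numbering used in the list; only the two hints matter here.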

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby dev build (v8)

From
Simon Riggs
Date:
On Mon, 2009-01-19 at 12:50 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > One of us needs a coffee.
> 
> Clearly, I put the kettle on...

I had one too, just in case.

> > How does Transaction 4 have a RecentGlobalXmin of 2 in step (7), but at
> > step (9) the value of RecentGlobalXmin has gone backwards?
> 
> Looks like I mixed up the xids of the two transactions in steps 8 and 9. 
> Let's see if I got it right this time:
> 
> 1. Transaction 1 begins in backend A
> 2. Transaction 2 begins in backend B, xmin = 1
> 3. Transaction 1 ends
> 4. Transaction 3 begins in backend C, xmin = 2
> 5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1
> 6. Transaction 2 ends.
> 7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, 
> RecentGlobalXmin = 2
> 8. Transaction 4 kills tuple, using its RecentGlobalXmin of 2
> 9. Transaction 3 splits the page, emits a delete xlog record, setting 
> latestRemovedXid to its RecentGlobalXmin of 1

I don't see how step (5) is possible. GetSnapshotData() sets
RecentGlobalXmin to the result of the snapshot's xmin.

If step (5) is possible, then yes, step (9) can happen.

You are correct to say that RecentGlobalXmin is not always correctly
set. All I'm saying is that at the exact time, place and circumstance I
use it, it is correct. In other cases, it may not be.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby dev build (v8)

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> GetSnapshotData() sets
> RecentGlobalXmin to the result of the snapshot's xmin.

No. RecentGlobalXmin is set to the oldest *xmin* observed, across all 
running transactions. TransactionXmin is the xid of the oldest running 
transaction. IOW, RecentGlobalXmin is the xid of the oldest transaction 
that the oldest running transaction sees as running.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby dev build (v8)

From
Simon Riggs
Date:
On Mon, 2009-01-19 at 14:00 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > GetSnapshotData() sets
> > RecentGlobalXmin to the result of the snapshot's xmin.
> 
> No. RecentGlobalXmin is set to the oldest *xmin* observed, across all 
> running transactions. TransactionXmin is the xid of the oldest running 
> transaction. IOW, RecentGlobalXmin is the xid of the oldest transaction 
> that the oldest running transaction sees as running.

OK. That was fun.

These WAL records are annoying, no matter what the exact value of
latestRemovedXid is and they seem likely to conflict with many queries
on the standby. 

If we don't use RecentGlobalXmin then I can't see any easily derived
value that we can use in its place. It isn't worth the effort on the
master to derive a more exact value, not when we don't even know if it
matters yet. 

I suggest we handle this on the recovery side, not on the master, by
deriving the xmin at the point the WAL record arrives. We would
calculate it by looking at recovery procs only. That will likely give us
a later value than we would get from the master, but that can't be
helped.
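
A rough sketch of that recovery-side derivation, assuming each recovery proc records the xid of one master transaction not yet seen to commit or abort. The flat array, the function name derive_recovery_xmin and the next_xid fallback are all hypothetical; the comparison imitates wraparound-aware xid ordering for normal xids only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define INVALID_XID ((TransactionId) 0)

/* Wraparound-aware "a is older than b" for normal xids (modulo 2^32). */
static int xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Derive an xmin at the moment a WAL record is replayed, looking only at
 * the xids tracked by recovery procs. With no tracked xids, fall back to
 * next_xid (ASSUMPTION: the next xid the master is expected to assign).
 */
static TransactionId derive_recovery_xmin(const TransactionId *recovery_xids,
                                          size_t n, TransactionId next_xid)
{
    TransactionId xmin = next_xid;

    assert(recovery_xids != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        if (recovery_xids[i] == INVALID_XID)
            continue;               /* unused recovery proc slot */
        if (xid_precedes(recovery_xids[i], xmin))
            xmin = recovery_xids[i];
    }
    return xmin;
}
```

For tracked xids 105, 102 and 110 with next_xid 111 the result is 102. As noted above, this is likely to be later than the value the master would have supplied.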

For me, this makes it essential now that I put in place the deferred
cancellation mechanism. Some refactoring in this area is also required
because we need to handle two other types of conflict to correctly
support drop database and drop user, which is now underway.

Btree delete was an important optimisation when it first went in, but
now that we have HOT it is much less important. Another route might be to put
an option to turn off btree delete on the master, default = on. We
probably should consider turning it off entirely when it doesn't yield
significant benefit. Lots of scanning to remove the odd row is probably
pretty wasteful and likely adds contention at the very point we don't
want it - index splits.

Thoughts? 

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby dev build (v8)

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> I suggest we handle this on the recovery side, not on the master, by
> deriving the xmin at the point the WAL record arrives. We would
> calculate it by looking at recovery procs only. That will likely give us
> a later value than we would get from the master, but that can't be
> helped.

Hmm, that's an interesting idea. It presumes that we see an abort/commit 
WAL record at the right moment for every transaction that we have a 
recovery proc for. We just concluded in the other thread that we do 
always emit abort records when the database is running normally; I 
think that's good enough for this purpose.

A few other random ideas I had:

- in btree delete redo, follow the index pointers, and look at the xids 
on the heap tuples. That requires some random I/O, but will give the 
exact value we need. Since it's quite expensive, I think we'd only want 
to do it after using some more conservative but quicker test to 
determine that there might be a conflict.

- Add latestRemovedXid to b-tree page header, and update it as tuples 
are killed. Need to tolerate the fact that tuple kills are not WAL-logged.

> Btree delete was an important optimisation when it first went in, but
> now that we have HOT it is much less important. 

If HOT is working well for your application, there won't be many btree 
deletes anyway, and the whole issue is moot.

> Another route might be to put
> an option to turn off btree delete on the master, default = on. We
> probably should consider turning it off entirely when it doesn't yield
> significant benefit.

I'd rather put in a generic mechanism to prevent vacuuming of recent 
tuples that might still be needed in the standby. Like always 
subtracting a fixed amount of xids from OldestXmin/RecentGlobalXmin, or 
having a feedback loop from the standby to the master, allowing the 
master to say what its oldest xmin is. But that's a fair amount of 
work; I'd rather leave that as a future enhancement, and just figure out 
something simple for this specific issue. We'll need to handle it 
gracefully even if we try to avoid it by retaining dead tuples longer.

> Lots of scanning to remove the odd row is probably
> pretty wasteful and likely adds contention at the very point we don't
> want it - index splits.

Remember that if you can remove enough dead tuples from the index page, 
you've just made room on the page and don't need to split. Splitting is 
pretty expensive compared to scanning a few line pointers.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby dev build (v8)

From
Simon Riggs
Date:
On Mon, 2009-01-19 at 15:47 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > I suggest we handle this on the recovery side, not on the master, by
> > deriving the xmin at the point the WAL record arrives. We would
> > calculate it by looking at recovery procs only. That will likely give us
> > a later value than we would get from the master, but that can't be
> > helped.
> 
> Hmm, that's an interesting idea. It presumes that we see an abort/commit 
> WAL record at the right moment for every transaction that we have a 
> recovery proc for. We just concluded in the other thread that we do 
> always emit abort records when the database is running normally; I 
> think that's good enough for this purpose.

But not perfect.

> A few other random ideas I had:
> 
> - in btree delete redo, follow the index pointers, and look at the xids 
> on the heap tuples. That requires some random I/O, but will give the 
> exact value we need. Since it's quite expensive, I think we'd only want 
> to do it after using some more conservative but quicker test to 
> determine that there might be a conflict.

Ouch.

> - Add latestRemovedXid to b-tree page header, and update it as tuples 
> are killed. Need to tolerate the fact that tuple kills are not WAL-logged.

Sounds easy-ish. 

If tuple kills aren't WAL-logged, then after a crash latestRemovedXid
will remain as it was at the time of the last write. A later delete scan
will only remove the index tuples whose hint bits were set at the time of
that write, so the value would always be correct, no?

I'm somehow uncomfortable with this idea though. Care to persuade me
further?

> > Btree delete was an important optimisation when it first went in, but
> > now that we have HOT it is much less important. 
> 
> If HOT is working well for your application, there won't be many btree 
> deletes anyway, and the whole issue is moot.

That was my point.

> > Another route might be to put
> > an option to turn off btree delete on the master, default = on. We
> > probably should consider turning it off entirely when it doesn't yield
> > significant benefit.
> 
> I'd rather put in a generic mechanism to prevent vacuuming of recent 
> tuples that might still be needed in the standby. Like always 
> subtracting a fixed amount of xids from OldestXmin/RecentGlobalXmin, or 
> having a feedback loop from the standby to the master, allowing the 
> master to say what its oldest xmin is. But that's a fair amount of 
> work; I'd rather leave that as a future enhancement, and just figure out 
> something simple for this specific issue. We'll need to handle it 
> gracefully even if we try to avoid it by retaining dead tuples longer.

Yeh, looked at both of those also. Definitely after sync rep goes in
though.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby dev build (v8)

From
Simon Riggs
Date:
On Mon, 2009-01-19 at 12:54 +0000, Simon Riggs wrote:
> Some refactoring in this area is also required
> because we need to handle two other types of conflict to correctly
> support drop database and drop user, which is now underway.

I've hung the drop database conflict code in dbase_redo().

For drop role, there isn't an rmgr at all, but I can add code in a few
places.

* add XLOG_DBASE_DROP_USER - i.e. add drop user to the Database rmgr

* DropRole() takes an AccessExclusiveLock, so we do write a WAL record
for it. I could add a special case to the Relation rmgr.

* Add a new rmgr (unused slot 7) and have it handle DropRole.

I prefer the last one, but if you think otherwise, please shout.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Re: Hot Standby dev build (v8)

From
Heikki Linnakangas
Date:
Simon Riggs wrote:
> I prefer the last one, but if you think otherwise, please shout.

We're now emitting WAL records for relcache invalidations, IIRC. I 
wonder if those are useful for this?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Hot Standby dev build (v8)

From
Simon Riggs
Date:
On Mon, 2009-01-19 at 16:50 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > I prefer the last one, but if you think otherwise, please shout.
> 
> We're now emitting WAL records for relcache invalidations, IIRC. I 
> wonder if those are useful for this?

Tom already objected to putting strange inval messages into WAL.

Hmm, DROP USER is transactional, so we can only do this at commit. So
forget the other ideas I had.

We already know about the auth file update at commit.

So we should say: at commit, re-read the list of roleids in use, and if
any don't match a row in pg_user then disconnect those sessions. If we do
that after the flat file update and the actual commit that removes the
user, then we are guaranteed that no race condition exists that would let
new sessions log on as that user while we try to disconnect the old ones.
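
A small sketch of that commit-time check, under stated assumptions: the standby holds one role id per connected session in a plain array, and the current pg_user contents are available as a second array. The names role_exists and find_orphaned_roles are hypothetical, and a linear search stands in for whatever lookup would really be used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;

/* Linear search; fine for a sketch, not for large role lists. */
static int role_exists(Oid role, const Oid *existing, size_t n_existing)
{
    for (size_t i = 0; i < n_existing; i++)
        if (existing[i] == role)
            return 1;
    return 0;
}

/*
 * For each session's role id in in_use, copy it to out if it no longer
 * matches a row in existing. out must have room for n_in_use entries.
 * Returns the number of sessions that would need to be disconnected.
 */
static size_t find_orphaned_roles(const Oid *in_use, size_t n_in_use,
                                  const Oid *existing, size_t n_existing,
                                  Oid *out)
{
    size_t n_out = 0;

    assert(out != NULL || n_in_use == 0);
    for (size_t i = 0; i < n_in_use; i++)
        if (!role_exists(in_use[i], existing, n_existing))
            out[n_out++] = in_use[i];
    return n_out;
}
```

Running the check only after the commit that drops the role has been applied is what closes the race described above.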

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support