Re: Hot Standby dev build (v8) - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Hot Standby dev build (v8)
Msg-id 1232359728.2327.6.camel@ebony.2ndQuadrant
In response to Re: Hot Standby dev build (v8)  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses Re: Hot Standby dev build (v8)
List pgsql-hackers
On Mon, 2009-01-19 at 09:16 +0200, Heikki Linnakangas wrote: 
> Simon Riggs wrote:
> > On Fri, 2009-01-16 at 22:09 +0200, Heikki Linnakangas wrote:
> > 
> >>>> RecentGlobalXmin is just a hint, it lags behind the real oldest xmin 
> >>>> that GetOldestXmin() would return. If another backend has a more recent 
> >>>> RecentGlobalXmin value, and has killed more recent tuples on the page, 
> >>>> the latestRemovedXid written here is too old.
> >>> What do you think we should do instead?
> >> Dunno. Maybe call GetOldestXmin().
> > 
> > We are discussing btree deletes, not btree vacuums. 
> 
> Pardon my ignorance, but what's the difference?

In terms of current HEAD, not much. In terms of Hot Standby, the
difference is significant: the two actions have been split, rather than
continuing to share the same WAL record.

XLOG_BTREE_VACUUM removes index tuples as a result of a vacuum. The
initial scan of the heap already generated an XLOG_HEAP2_CLEANUP_INFO
which gives the latestRemovedXid for that vacuum. So we don't need to
worry about putting a latestRemovedXid on XLOG_BTREE_VACUUM. The WAL
records also differ because the XLOG_BTREE_VACUUM contains details of
blocks that need to be pinned but not otherwise touched.

XLOG_BTREE_DELETE is different in 3 ways. It isn't part of a vacuum, so:
* we don't need to take a cleanup lock
* it doesn't contain info about other blocks we need to scan beforehand
  for correctness purposes
* it wasn't preceded by an XLOG_HEAP2_CLEANUP_INFO record, so it must
  have a *correct* (even if too conservative) value for latestRemovedXid
  set.
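
As a rough illustration of the split described above, here is a minimal
standalone sketch in C. It is not the patch's actual WAL record layout:
SketchRecord, RecKind, n_pin_only_blocks, needs_cleanup_lock() and
conflict_xid() are all hypothetical names invented for this example, and
TransactionId is a local stand-in for the PostgreSQL type. It only encodes
which record kind carries its own latestRemovedXid and which one requires a
cleanup lock on replay.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins; assumptions for this sketch, not PostgreSQL headers. */
typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Hypothetical, simplified record kinds (not the real WAL record types). */
typedef enum
{
    REC_HEAP_CLEANUP_INFO,  /* emitted by the heap scan of a vacuum */
    REC_BTREE_VACUUM,       /* index tuples removed by a vacuum */
    REC_BTREE_DELETE        /* index tuples removed by a normal transaction */
} RecKind;

typedef struct SketchRecord
{
    RecKind       kind;
    TransactionId latestRemovedXid;  /* not set for REC_BTREE_VACUUM */
    uint32_t      n_pin_only_blocks; /* REC_BTREE_VACUUM only: blocks to pin
                                      * but not otherwise touch */
} SketchRecord;

/* Only the vacuum variant needs a cleanup lock on replay. */
static bool
needs_cleanup_lock(const SketchRecord *rec)
{
    return rec->kind == REC_BTREE_VACUUM;
}

/*
 * Xid carried by this record for conflict handling on the standby, or
 * InvalidTransactionId if the record carries none of its own.
 */
static TransactionId
conflict_xid(const SketchRecord *rec)
{
    switch (rec->kind)
    {
        case REC_HEAP_CLEANUP_INFO:
        case REC_BTREE_DELETE:
            return rec->latestRemovedXid;
        case REC_BTREE_VACUUM:
            /* already covered by the earlier cleanup-info record */
            return InvalidTransactionId;
    }
    return InvalidTransactionId;
}
```

In this sketch a REC_BTREE_DELETE record has to be self-sufficient, which is
why its latestRemovedXid must be correct even if conservative, while a
REC_BTREE_VACUUM record relies on the cleanup-info record that preceded it.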

So the only time we need to set latestRemovedXid correctly is during a
normal transaction, not during a vacuum.

> > If we are doing
> > btree delete then we have an unreleased snapshot therefore we also have
> > a non-zero xmin. How can another backend have a later RecentGlobalXmin
> > or result from GetOldestXmin() than we do?
> 
> Sure it can, for example:
> 
> 1. Transaction 1 begins in backend A
> 2. Transaction 2 begins in backend B, xmin = 1
> 3. Transaction 1 ends
> 4. Transaction 3 begins in backend C, xmin = 2
> 5. Backend C gets snapshot, TransactionXmin = 2, RecentGlobalXmin = 1
> 6. Transaction 2 ends.
> 7. Transaction 4 begins in backend A, gets snapshot TransactionXmin = 2, 
> RecentGlobalXmin = 2
> 8. Transaction 4 kills tuple, using its RecentGlobalXmin of 1
> 9. Transaction 3 splits the page, emits a delete xlog record, setting 
> latestRemovedXid to its RecentGlobalXmin of 2

Well, steps 7 and 8 don't make sense.

Your earlier comment was that it was possible for a WAL record to be
written with a RecentGlobalXmin that was lower than other backends'
values. In step 9 the RecentGlobalXmin is *not* lower than any other
backend's; it is the same.

So if there is a proof, this isn't it. 

But I can't see how there can be one: Two concurrent vacuums can have
different OldestXmin values, but two concurrent transactions cannot.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


