Re: Transaction Snapshots and Hot Standby - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Transaction Snapshots and Hot Standby
Date
Msg-id 1221510431.3913.1563.camel@ebony.2ndQuadrant
In response to Re: Transaction Snapshots and Hot Standby  ("Florian G. Pflug" <fgp@phlo.org>)
List pgsql-hackers
On Mon, 2008-09-15 at 19:20 +0100, Florian G. Pflug wrote:
> Simon Riggs wrote:
> > On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:
> > 
> >> The main idea was to invert the meaning of the xid array in the snapshot
> >> struct - instead of storing all the xid's between xmin and xmax that are
> >> to be considered "in-progress", the array contained all the xid's >
> >> xmin that are to be considered "completed".
> > 
> >> The downside is that the size of the read-only snapshot is theoretically
> >> unbounded, which poses a bit of a problem if it's supposed to live
> >> inside shared memory...
> > 
> > Why do it inverted? That clearly has problems.
> 
> Because it solves the problem of "spontaneously" appearing XIDs in the
> WAL. At least prior to 8.3 with virtual xids, a transaction might have 
> allocated its xid long before actually writing anything to disk, and
> therefore long before this XID ever shows up in the WAL. And with a 
> non-inverted snapshot such an XID would be considered to be "completed" 
> by transactions on the slave... So, one either needs to periodically log 
> a snapshot on the master or log XID allocations which both seem to cause 
> considerable additional load on the master. With an inverted snapshot, 
> it's sufficient to log the current RecentXmin - a value that is readily
> available on the master, and therefore the cost amounts to just one 
> additional 4-byte field per xlog entry.
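
To make the difference between the two representations concrete, here is a
rough Python sketch. The class and field names are hypothetical and are not
PostgreSQL's actual snapshot struct; the xid values are made up, and xid
wraparound is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class NormalSnapshot:
    xmin: int  # every xid below xmin has completed
    xmax: int  # every xid at or above xmax is treated as in progress
    in_progress: set = field(default_factory=set)  # running xids in [xmin, xmax)

    def is_in_progress(self, xid):
        if xid < self.xmin:
            return False
        if xid >= self.xmax:
            return True
        return xid in self.in_progress


@dataclass
class InvertedSnapshot:
    xmin: int  # every xid below xmin has completed
    completed: set = field(default_factory=set)  # xids > xmin known to be done

    def is_in_progress(self, xid):
        if xid < self.xmin:
            return False
        # Anything not explicitly listed as completed counts as in progress.
        return xid not in self.completed


# Hypothetical state on the slave: 45 is running, 46 has committed, 48 has
# written WAL, and 47 was allocated on the master but has not reached WAL yet.
normal = NormalSnapshot(xmin=45, xmax=49, in_progress={45, 48})
inverted = InvertedSnapshot(xmin=45, completed={46})

print(normal.is_in_progress(47))    # the unseen xid looks completed
print(inverted.is_in_progress(47))  # the unseen xid stays in progress
```

The inverted form errs on the side of "in progress" for any xid the slave has
not heard about, which is what makes late-appearing xids harmless, at the
price of a completed set that has no fixed upper bound on its size.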

I think I understand what you're saying now, though I think it
mostly/only applies before your brilliant idea in 8.3.

If we have a transaction history that looks like this:

ReadA, WriteB, WriteA (both commit in either order)

then pre-8.3 we would have xidA < xidB, whereas at 8.3 and above we see
that xidA is actually higher than xidB. Now, TransactionIds are assigned
in the order of their first page write and *not* in the order of
transaction start as was previously the case, which isn't very obvious.
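
The change in assignment order can be illustrated with a toy simulation. This
is only a sketch: the function name, the event encoding and the starting xid
of 100 are assumptions made for illustration, not anything taken from the
server code.

```python
from itertools import count


def assign_xids(history, lazy):
    """Return {transaction: xid} for a list of (operation, transaction) events.

    lazy=False: the xid is assigned at the transaction's first event
                (the pre-8.3 behaviour described above).
    lazy=True:  the xid is assigned only at the transaction's first write
                (the 8.3 and later behaviour).
    """
    next_xid = count(100)  # arbitrary starting xid, illustration only
    xids = {}
    for op, txn in history:
        if txn not in xids and (op == "write" or not lazy):
            xids[txn] = next(next_xid)
    return xids


# The history from the text: ReadA, WriteB, WriteA
history = [("read", "A"), ("write", "B"), ("write", "A")]

eager = assign_xids(history, lazy=False)
lazy = assign_xids(history, lazy=True)

print(eager["A"] < eager["B"])  # pre-8.3: xidA is lower than xidB
print(lazy["A"] > lazy["B"])    # 8.3+: xidA is higher than xidB
```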

So when we replay WAL, we know that WAL is only written with locks held,
so WriteA and WriteB must be independent of each other. That means the
locks held by ReadA can be ignored and we can assume that the above
history is the same as

WriteB, WriteA

So if we took a snapshot in the middle of WriteB we would be safe to say
that only transaction B was in progress and that transaction A was not
yet started. So the snapshot we derive on the standby is different to
the one we would have derived on the client, yet the serializable order
is the same. In general, this means that all reads in a transaction
prior to its first write can be reordered later, so that they can be
assumed to occur immediately before the first write of that transaction.
(Please shoot me down, if incorrect).

So when we see the first WAL record of a transaction we know that there
are no in-progress transactions with a *lower* xid that we have not yet
seen. So we cannot be confused about whether a transaction is
in-progress, or not.

Almost. Now having written all of that I see there is an obvious race
condition between assignment of an xid and actions that result in the
acquisition of WALInsertLock. So even though the above seems mostly
correct, there is still a gap to plug, but a much smaller one.

So when writing the first WAL record for xid=48, it is possible that
xid=47 has just been assigned and is also just about to write a WAL
record. Thus the absence of a WAL record for xid=47 is not evidence that
xid=47 is complete because it was read-only.
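
A small sketch of the gap, assuming a deliberately naive reader of WAL that
treats an xid as in progress only once it has seen a record for it. The
record format and the function name are hypothetical.

```python
def naive_in_progress(wal_records):
    """Naive rule: an xid is running only if WAL has shown a write for it
    and no commit or abort yet. Records are (kind, xid) tuples."""
    running = set()
    for kind, xid in wal_records:
        if kind == "write":
            running.add(xid)
        elif kind in ("commit", "abort"):
            running.discard(xid)
    return running


# xid=47 was assigned first, but xid=48 won the race to insert its WAL record.
wal_so_far = [("write", 48)]

running = naive_in_progress(wal_so_far)
print(48 in running)  # xid=48 is known to be running
print(47 in running)  # xid=47 is running too, but the reader cannot tell
```

A snapshot built from this set would treat xid=47 as not running, which is
exactly the wrong conclusion described above.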

We might handle this with the inverted technique you describe, but there
should be an easier way to track dense packing of xid sequence.

We expect xids to be written to WAL in the order assigned, so we might
check whether a newly assigned xid is 1 higher than the last highest
value to have been inserted into WAL. If it is not, then we can write a short
WAL record to inform readers of WAL that the missing xids in sequence
are in progress also. So readers of WAL will "see" xids in the correct
sequence and are thus able to construct valid snapshots direct from WAL.
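
A rough sketch of how that could look, with loud caveats: the class, method
and record names are all hypothetical, xid wraparound is ignored, and it
assumes every xid named in a gap record later produces a commit or abort
record of its own.

```python
class WalWriter:
    """Toy writer that logs a gap record when xids reach WAL out of sequence."""

    def __init__(self, highest_logged_xid):
        self.highest_logged_xid = highest_logged_xid
        self.wal = []  # list of (kind, payload) records

    def log_first_write(self, xid):
        expected = self.highest_logged_xid + 1
        if xid > expected:
            # Tell readers that the skipped xids are in progress as well.
            self.wal.append(("xids_in_progress", list(range(expected, xid))))
        self.highest_logged_xid = max(self.highest_logged_xid, xid)
        self.wal.append(("write", xid))

    def log_commit(self, xid):
        self.wal.append(("commit", xid))


def replay(wal):
    """Reader side: rebuild the set of in-progress xids directly from WAL."""
    running = set()
    for kind, payload in wal:
        if kind == "xids_in_progress":
            running.update(payload)
        elif kind == "write":
            running.add(payload)
        elif kind in ("commit", "abort"):
            running.discard(payload)
    return running


writer = WalWriter(highest_logged_xid=46)
writer.log_first_write(48)   # 47 is missing, so a gap record is written first
print(replay(writer.wal))    # both 47 and 48 are seen as in progress

writer.log_first_write(47)   # 47 catches up; no gap record is needed
writer.log_commit(47)
print(replay(writer.wal))    # only 48 remains in progress
```

The extra record is only written when the sequence is broken, so the cost
depends entirely on how often xids reach WAL out of order.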

I think I should measure how often that occurs to see what problem or
overhead this might cause, if any.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


