Re: Transaction Snapshots and Hot Standby - Mailing list pgsql-hackers
From | Simon Riggs |
---|---|
Subject | Re: Transaction Snapshots and Hot Standby |
Date | |
Msg-id | 1221510431.3913.1563.camel@ebony.2ndQuadrant |
In response to | Re: Transaction Snapshots and Hot Standby ("Florian G. Pflug" <fgp@phlo.org>) |
List | pgsql-hackers |
On Mon, 2008-09-15 at 19:20 +0100, Florian G. Pflug wrote:
> Simon Riggs wrote:
> > On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote:
> >
> >> The main idea was to invert the meaning of the xid array in the snapshot
> >> struct - instead of storing all the xid's between xmin and xmax that are
> >> to be considered "in-progress", the array contained all the xid's >
> >> xmin that are to be considered "completed".
> >
> >> The downside is that the size of the read-only snapshot is theoretically
> >> unbounded, which poses a bit of a problem if it's supposed to live
> >> inside shared memory...
> >
> > Why do it inverted? That clearly has problems.
>
> Because it solves the problem of "spontaneously" appearing XIDs in the
> WAL. At least prior to 8.3 with virtual xids, a transaction might have
> allocated its xid long before actually writing anything to disk, and
> therefore long before this XID ever shows up in the WAL. And with a
> non-inverted snapshot such an XID would be considered to be "completed"
> by transactions on the slave... So, one either needs to periodically log
> a snapshot on the master or log XID allocations, which both seem to cause
> considerable additional load on the master. With an inverted snapshot,
> it's sufficient to log the current RecentXmin - a value that is readily
> available on the master, and therefore the cost amounts to just one
> additional 4-byte field per xlog entry.

I think I understand what you're saying now, though I think it
mostly/only applies before your brilliant idea in 8.3.

If we have a transaction history that looks like this:

ReadA, WriteB, WriteA (both commit in either order)

then pre-8.3 we would have xidA < xidB, whereas at 8.3 and above we see
that xidA is actually higher than xidB. TransactionIds are now assigned
in the order of their first page write and *not* in the order of
transaction start, as was previously the case, which isn't very obvious.
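To make the ordering difference concrete, here is a minimal Python sketch of the history above. It is only an illustration, not PostgreSQL code: the function name `assign_xids`, the `lazy` flag and the starting value 100 are all made up, and "transaction start" is approximated by a transaction's first operation of any kind.

```python
def assign_xids(history, lazy, first_xid=100):
    """Return {transaction: xid} for a list of (operation, transaction) steps.

    lazy=False: an xid is handed out at the transaction's first operation
                (stands in for the pre-8.3 behaviour described above).
    lazy=True:  an xid is handed out only at the transaction's first write
                (stands in for the 8.3-and-later behaviour).
    first_xid is an arbitrary starting value chosen for the example.
    """
    xids = {}
    next_xid = first_xid
    for op, txn in history:
        needs_xid = (op == "write") if lazy else True
        if needs_xid and txn not in xids:
            xids[txn] = next_xid
            next_xid += 1
    return xids


# The history from the text: ReadA, WriteB, WriteA
history = [("read", "A"), ("write", "B"), ("write", "A")]
eager = assign_xids(history, lazy=False)  # A gets the lower xid
lazy = assign_xids(history, lazy=True)    # B gets the lower xid
```

With eager assignment A receives the lower xid, and with lazy assignment B does, which is the reversal described above.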
So when we replay WAL, we know that WAL is only written with locks held,
so that WriteA and WriteB must be independent of each other. That means
the locks held by ReadA can be ignored and we can assume that the above
history is the same as

WriteB, WriteA

So if we took a snapshot in the middle of WriteB we would be safe to say
that only transaction B was in progress and that transaction A was not
yet started. So the snapshot we derive on the standby is different to
the one we would have derived on the client, yet the serializable order
is the same. In general, this means that all reads in a transaction
prior to its first write can be reordered later, so that they can be
assumed to occur immediately before the first write of that transaction.
(Please shoot me down, if incorrect).

So when we see the first WAL record of a transaction we know that there
are no in-progress transactions with a *lower* xid that we have not yet
seen. So we cannot be confused about whether a transaction is
in-progress, or not.

Almost.

Now having written all of that I see there is an obvious race condition
between assignment of an xid and actions that result in the acquisition
of WALInsertLock. So even though the above seems mostly correct, there
is still a gap to plug, but a much smaller one.

So when writing the first WAL record for xid=48, it is possible that
xid=47 has just been assigned and is also just about to write a WAL
record. Thus the absence of a WAL record for xid=47 is not evidence that
xid=47 is complete because it was read-only.

We might handle this with the inverted technique you describe, but there
should be an easier way to track dense packing of the xid sequence.

We expect xids to be written to WAL in the order assigned, so we might
check whether a newly assigned xid is 1 higher than the last highest
value to have been inserted into WAL. If it is not, then we can write a
short WAL record to inform readers of WAL that the missing xids in
sequence are in progress also.
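A rough Python sketch of that gap check, to show the bookkeeping involved. Nothing here is PostgreSQL code: `WalWriter`, `first_record`, `commit` and `replay` are hypothetical names, the "log" is just a Python list, and xid wraparound, aborts and subtransactions are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class WalWriter:
    """Master side: remembers the highest xid that has appeared in WAL."""
    highest_logged_xid: int
    log: list = field(default_factory=list)

    def first_record(self, xid):
        """Log the first WAL record of a transaction."""
        # If this xid is more than 1 above the highest xid seen so far,
        # emit a short record naming the skipped xids as in progress.
        if xid > self.highest_logged_xid + 1:
            missing = list(range(self.highest_logged_xid + 1, xid))
            self.log.append(("gap", missing))
        self.log.append(("write", xid))
        self.highest_logged_xid = max(self.highest_logged_xid, xid)

    def commit(self, xid):
        self.log.append(("commit", xid))


def replay(log):
    """Standby side: return the set of xids in progress after replaying log."""
    in_progress = set()
    for kind, payload in log:
        if kind == "gap":
            in_progress.update(payload)    # skipped xids count as running
        elif kind == "write":
            in_progress.add(payload)
        elif kind == "commit":
            in_progress.discard(payload)
    return in_progress


# The xid=47 / xid=48 race from the text: 48 reaches WAL first.
writer = WalWriter(highest_logged_xid=46)
writer.first_record(48)
snapshot = replay(writer.log)  # contains both 47 and 48
```

In this sketch the standby treats xid=47 as in progress as soon as it replays the first record for xid=48, even though 47 has not yet written anything itself.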
So readers of WAL will "see" xids in the correct sequence and are thus
able to construct valid snapshots direct from WAL.

I think I should measure how often that occurs to see what problem or
overhead this might cause, if any.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support