Transaction Snapshots and Hot Standby - Mailing list pgsql-hackers
From:      Simon Riggs
Subject:   Transaction Snapshots and Hot Standby
Date:
Msg-id:    1221091675.3913.806.camel@ebony.2ndQuadrant
Responses: Re: Transaction Snapshots and Hot Standby (3 replies)
List:      pgsql-hackers
Transaction snapshots are probably the most difficult problem for Hot Standby to resolve. We *must* have a transaction snapshot to access table data in a consistent way, or we must accept some inconsistency, though that is not simple either.

We can derive transaction snapshots
* remotely from the primary node
* locally on the standby node

If we derive a snapshot locally, then we will end up with a situation where the xmin of the local snapshot precedes the xmin of the primary node. When this occurs it will then be possible for WAL records to arrive on the standby that request removal of rows that a transaction might wish to see. Preventing that situation can be done either by deferring WAL apply or by cancelling queries; the conflict condition is sketched below.

We could defer WAL apply for particular tables only, but this could significantly complicate the apply process and is not a suggested option for the first release of this feature.

We might control query cancellation by tracking which tables have had rows removed that would have been visible to particular queries. Again possible, but this suffers from the problem that tables on which HOT is frequently active would be almost unusable, so it is not a general solution.

Simply ignoring WAL removal has been discussed and rejected.
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00753.php

Explicitly defining the tables a transaction wishes to see has also been discussed and rejected.
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00268.php

So the most generally applicable and simplest initial solution for generating snapshots is to take them from the remote primary node. The proposal for this follows:

Transaction snapshots will be managed by a new process, Snapproc. Snapproc will start only in recovery mode and will exit when normal processing mode begins. The snapshot process will connect to the remote primary node, export snapshot data and copy it to shared memory on the standby node. When all standby backends have run UnregisterSnapshot() the snapshot will then be unregistered on the remote primary node as well.

The standby must not think a transaction is visible until all changes made by it have been applied. So snapshots from the primary cannot be used on the standby until the LSN at which they were taken has been reached by WAL apply on the standby. Snapshots don't normally have an LSN, so we must derive that information as well when we take a snapshot. With asynchronous commits a transaction can be committed and yet not yet written to disk, so we cannot rely on the current WAL write pointer. Taking the WAL insert pointer is the safest thing to do, though it is the most heavily contended. We don't want to hold ProcArrayLock while requesting WALInsertLock, so we will get the LSN of the WAL insert pointer *after* the snapshot has been derived (it doesn't matter that much, as long as it's not before the snapshot). This ordering is shown in the second sketch below.

So snapshots from the primary take time before they can be used. The delay is exactly the current processing delay from primary to standby, which is another reason why we do not wish to fall behind.

Taking snapshots from the primary has a few disadvantages:
* snapshots take time before they are usable
* requesting new snapshots requires a remote request
* snapshots on the primary prevent row removal (but this was also an advantage of this technique!)
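To make the cancellation condition concrete, here is a minimal sketch in C. The names (replay_would_conflict, latest_removed_xid, standby_oldest_xmin) are invented for this example and are not the backend's actual API, and a real implementation must compare xids modulo wraparound rather than with a plain "<=":

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Sketch only: before replaying a WAL record that removes row versions
 * deleted by transactions up to latest_removed_xid, check whether the
 * oldest snapshot xmin still held by any standby query might need
 * those rows.
 */
static bool
replay_would_conflict(TransactionId latest_removed_xid,
                      TransactionId standby_oldest_xmin)
{
    /*
     * If some standby snapshot's xmin has not advanced past the newest
     * removed row version, that snapshot could still need to see a
     * removed row: we must defer WAL apply or cancel the query.
     */
    return standby_oldest_xmin <= latest_removed_xid;
}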
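And here is a sketch of the lock ordering and maturity rule just described: derive the snapshot first, read the WAL insert pointer afterwards, and treat the snapshot as usable on the standby only once replay passes that LSN. All types and helper names here are illustrative stand-ins, not actual backend code:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t XLogRecPtr;        /* byte position in the WAL stream */

typedef struct PrimarySnapshot
{
    TransactionId xmin;             /* oldest xid still running */
    TransactionId xmax;             /* next xid to be assigned */
    XLogRecPtr    lsn;              /* insert pointer, read after snapshot */
    /* in-progress xid list omitted */
} PrimarySnapshot;

/* Hypothetical stand-ins for backend internals. */
extern void       derive_snapshot(PrimarySnapshot *snap);   /* holds ProcArrayLock */
extern XLogRecPtr wal_insert_pointer(void);                 /* holds WALInsertLock */

/*
 * Take a snapshot on the primary and tag it with an LSN.  The insert
 * pointer is read only after the snapshot is derived, so the two locks
 * are never held at the same time; any LSN at or after the snapshot is
 * safe, just not one before it.
 */
void
take_tagged_snapshot(PrimarySnapshot *snap)
{
    derive_snapshot(snap);
    snap->lsn = wal_insert_pointer();
}

/*
 * On the standby, a snapshot "matures" and becomes usable only once
 * WAL replay has reached the LSN it was tagged with.
 */
bool
snapshot_is_usable(const PrimarySnapshot *snap, XLogRecPtr replay_lsn)
{
    return replay_lsn >= snap->lsn;
}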
If the primary and standby nodes are connected by private ethernet then the snapshot request time will be ~10ms, though that is probably 100 times slower than current snapshot access. If primary and standby are on opposite sides of an ocean then times could be as high as 500ms. Taking snapshots directly could be annoyingly slow for small read-only statements, so we need to buffer this process in various ways. For larger queries this may not be a problem at all, but we do not wish to limit or assume the uses of the standby node.

First of all, each backend will have the opportunity to reuse previous snapshots, both within transactions and across them. A userset parameter, snapshot_reuse_window = 0..60000ms, will define the time window in which any request for a new snapshot will simply result in being fed the last snapshot again. When the window on a snapshot has expired, a newer snapshot will be presented instead. This idea is similar to serializable transactions, which continually reuse the same snapshot. This is a useful parameter for normal processing as well, since it will reduce contention on the ProcArrayLock for many applications. (A sketch of the reuse logic follows below.)

Snapshots can be reused across transactions in recovery mode, since they are held in shared memory. The amount of shared memory dedicated to storing snapshots will be max_connections * the maximum size of a snapshot.

Since there is a delay between obtaining a new snapshot and it becoming usable, the snapshot process will buffer snapshots until they become "mature", like a good Whiskey. The snapshot process will take regular snapshots and pre-age them, so that when a backend requests a snapshot it will be given the most recently matured snapshot. The time between snapshots is set by snapshot_preread_timeout = 0..60000ms. If a request for a snapshot arrives and there are no snapshots waiting to mature then this will trigger the snapshot process to request a new primary snapshot. (This parameter could be set automatically based upon the arrival rate of snapshot requests, but that is something to consider later.) If snapshot_reuse_window = 0 then a backend will be presented with a freshly obtained snapshot and will then wait until the exact first moment it can be used before returning.

We can continue to reuse snapshots from the primary even if the primary crashes, becomes disconnected or is shut down/restarted. New snapshots are obviously not possible until it appears again. It's not that common for us to lose contact with the primary *and* for it to *not* be a failover, so this seems like an acceptable restriction.
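A rough sketch of the reuse window, assuming the snapshot_reuse_window parameter proposed above; the helper names and the shared-memory snapshot machinery are invented for illustration:

#include <stddef.h>
#include <time.h>

/* Proposed GUC from this mail: 0..60000 ms; 0 disables reuse. */
static int snapshot_reuse_window = 1000;

/* Hypothetical stand-ins for the shared-memory snapshot machinery. */
typedef struct StandbySnapshot StandbySnapshot;
extern StandbySnapshot *most_recently_matured_snapshot(void);

static StandbySnapshot *last_snapshot;  /* snapshot last handed out */
static struct timespec  last_taken;     /* when it was handed out */

static long
ms_since(const struct timespec *then)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - then->tv_sec) * 1000L
         + (now.tv_nsec - then->tv_nsec) / 1000000L;
}

/*
 * Within the reuse window every request is fed the previous snapshot
 * again; once the window expires, the most recently matured snapshot
 * is handed out instead.
 */
StandbySnapshot *
get_snapshot(void)
{
    if (last_snapshot != NULL &&
        snapshot_reuse_window > 0 &&
        ms_since(&last_taken) <= snapshot_reuse_window)
        return last_snapshot;

    last_snapshot = most_recently_matured_snapshot();
    clock_gettime(CLOCK_MONOTONIC, &last_taken);
    return last_snapshot;
}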
Other alternatives? Maybe. In earlier discussions I suggested that we might use "Read Uncommitted" mode with Hot Standby. Tom pointed out that what I had suggested was not "Read Uncommitted" as described by the SQL Standard. For Hot Standby, I've had a look at the various transaction isolation modes possible, to see which, if any, are desirable:

1. The first is to just accept that transaction *isolation* is not possible, but carry on anyway. We might call this READ INCONSISTENT, which carries the problems noted by Tom earlier. This mode allows us to "see" committed deletes because their row versions might be missing from our query; it will cause updated records to disappear from queries and yet at the same time newly inserted data to be invisible. However, if the table is insert-only or read-only this will give a consistent result, so this mode *is* appealing to many potential users. (It's been suggested to me twice, independently.) We would need to derive a snapshot based upon which transactions are "running" according to the replay of transactions during recovery.

2. The second is to throw an error if transaction isolation is violated. Currently we can only do this approximately, by recording the global LSN at the start of the query and then aborting the query if we touch a data block that has changes made after that LSN. This is a slightly more extreme version of (1), but always consistent if it completes. (Somebody has joked we could make the error message "snapshot too old".) A sketch of this check appears below.

We can implement both of those with an additional parameter, allow_xmin_advance = off (default) | on.
Mode (1) is allow_xmin_advance = on in READ COMMITTED mode.
Mode (2) is allow_xmin_advance = on in SERIALIZABLE mode.

3. We can also implement Read Uncommitted mode itself in the database. We would get inconsistent results for updates, deletes *and* inserts, but at least we would always see the latest version of an updated row, rather than skipping updated rows completely. On a read-only table this might be very useful. On an insert-only table this might result in some long-running queries as a scan struggles to keep up with inserts!

For (3) we would need to retrieve tuples from executor scan nodes in SnapshotDirty mode. If we see a tuple we would follow its tuple chain to another row on the same block using EvalPlanQual() logic. If we are using an IndexScan and the update chain goes off-block then there will be another index pointer that we (may) see to access the tuple, so we would stop at that point. If we are using a SeqScan we should follow the tuple chain until we see the top tuple, even if it goes off-block. Bitmap scans would cause problems.
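A minimal sketch of the approximate check described in (2): record the global LSN at query start and abort if the query touches a block changed after that point. The function names and error handling are invented for illustration; the backend would raise a proper error rather than calling exit():

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical: global LSN recorded when the query started. */
static XLogRecPtr query_start_lsn;

/*
 * Hypothetically called for every data block a query touches.  If the
 * block was changed after the query began, rows the query should see
 * may already have been removed, so abort rather than return an
 * inconsistent result.  In the backend this would be ereport(ERROR).
 */
static void
check_block_lsn(XLogRecPtr page_lsn)
{
    if (page_lsn > query_start_lsn)
    {
        fprintf(stderr, "ERROR: snapshot too old\n");
        exit(EXIT_FAILURE);
    }
}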
Deriving snapshots from the master seems the most obvious approach, but let's see where the dice roll...

Other topics on Hot Standby will be covered later; this is just the first part.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support