Thread: Transaction Snapshots and Hot Standby
Transaction snapshots are probably the most difficult problem for Hot Standby to resolve. We *must* have a transaction snapshot to access table data in a consistent way, or we must accept some inconsistency, though that is not simple either.

We can derive transaction snapshots

* remotely from primary node
* locally on the standby node

If we derive a snapshot locally, then we will end up with a situation where the xmin of the local snapshot precedes the xmin of the primary node. When this occurs it will then be possible for WAL records to arrive on the standby that request removal of rows that a transaction might wish to see. Preventing that situation can be done by either deferring WAL apply or by cancelling queries.

We could defer WAL apply for particular tables only, but this could significantly complicate the apply process and is not a suggested option for the first release of this feature.

We might control query cancellation by tracking which tables have had rows removed that would have been visible to particular queries. Again possible, but it suffers from the problem that tables on which HOT is frequently active would be almost unusable. So not a general solution.

Simply ignoring WAL removal has been discussed and rejected.
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00753.php

Explicitly defining the tables a transaction wishes to see has also been discussed and rejected.
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00268.php

So the most generally applicable and simplest initial solution for generating snapshots is to take them from the remote primary node. The proposal for this follows:

Transaction snapshots will be managed by a new process, Snapproc. Snapproc will start only in recovery mode and will exit when normal processing mode begins. The snapshot process will connect to the remote primary node, export snapshot data and copy it to shared memory on the standby node. When all standby backends have run UnregisterSnapshot() the snapshot will then be unregistered on the remote primary node also.

The standby must not think a transaction is visible until all changes made by it have been applied. So snapshots from the primary cannot be used on the standby until the LSN at which they were taken has been reached by WAL apply on the standby. Snapshots don't normally have an LSN, so we must derive that information as well when we take a snapshot. With asynchronous commits a transaction can be committed and yet not yet written to disk, so we cannot rely on the current WAL write pointer. Taking the WAL insert pointer is the safest thing to do, though it is the most heavily contended. We don't want to hold ProcArrayLock while requesting WALInsertLock, so we will get the LSN of the WAL insert pointer *after* the snapshot has been derived (it doesn't matter that much, as long as it's not before the snapshot).

So snapshots from the primary take time before they can be used. The delay is exactly the current processing delay from primary to standby, another reason why we do not wish to fall behind.

Taking snapshots from primary has a few disadvantages

* snapshots take time before they are usable
* requesting new snapshots is via a remote request
* snapshots on primary prevent row removal (but this was also an
  advantage of this technique!)

If primary and standby node are connected by private ethernet then the snapshot request time will be ~10ms, though that is probably 100 times slower than current snapshot access.
If primary and standby are on opposite sides of an ocean then times could be as high as 500ms. Taking snapshots directly could be annoyingly slow for small read-only statements, so we need to buffer this process in various ways. For larger queries this may not be a problem at all, but we do not wish to limit or assume the uses of the standby node.

First of all, each backend will have the opportunity to reuse previous snapshots both within transactions and across them. A userset parameter

  snapshot_reuse_window = 0..60000ms

will define the time window in which any request for a new snapshot will simply result in being fed the last snapshot again. When the window on a snapshot has expired a newer snapshot will be presented instead. This idea is similar to serializable transactions, which continually reuse the same snapshot. This is a useful parameter for normal processing as well, since it will reduce contention on the ProcArrayLock for many applications. Snapshots can be reused across transactions in recovery mode, since they are held in shared memory. The amount of shared memory dedicated to storing snapshots will be max_connections * the maximum size of a snapshot.

Since there is a delay between obtaining a new snapshot and it becoming usable, the snapshot process will buffer them until they become "mature", like a good whiskey. The snapshot process will take regular snapshots and pre-age them so that when a backend requests a snapshot it will be given the most recently matured snapshot. Time between snapshots is set by

  snapshot_preread_timeout = 0..60000ms

If a request for a snapshot arrives and there are no snapshots waiting to mature then this will trigger the snapshot process to request a new primary snapshot. (This parameter could be set automatically based upon the arrival rate of snapshot requests, but that is something to consider later.) If snapshot_reuse_window = 0 then a backend will be presented with a freshly obtained snapshot and will then wait until the exact first moment it can be used before returning.

We can continue to reuse snapshots from the primary even if the primary crashes, becomes disconnected or is shutdown/restarted. New snapshots are obviously not possible until it appears again. It's not that common for us to lose contact with the primary *and* for it to *not* be a failover, so this seems like an acceptable restriction.

Other alternatives? Maybe. In earlier discussions, I suggested that we might use "Read Uncommitted" mode for use with Hot Standby. Tom pointed out that what I had suggested was not "Read Uncommitted" as described by the SQL Standard. For Hot Standby, I've had a look at the various transaction isolation modes possible to see which, if any, are desirable:

1. The first is to just accept that transaction *isolation* is not possible, but carry on anyway. We might call this READ INCONSISTENT - which carries the problems noted by Tom earlier. This mode allows us to "see" committed deletes because their row versions might be missing from our query; it will cause updated records to disappear from queries and yet at the same time newly inserted data will be invisible. However, if the table is insert-only or read-only this will give a consistent result, so this mode *is* appealing to many potential users. (It's been suggested to me twice, independently.) We would need to derive a snapshot based upon which transactions are "running" according to the replay of transactions during recovery.

2. The second is to throw an error if transaction isolation is violated.
Currently we can only do this approximately, by recording the global LSN at the start of a query and then aborting the query if we touch a data block that has changes made after that LSN. A slightly more extreme version of (1), but always consistent if it completes. (Somebody has joked we could make the error message "snapshot too old".)

We can implement both of those with an additional parameter

  allow_xmin_advance = off (default) | on

Mode (1) is allow_xmin_advance = on in READ COMMITTED mode.
Mode (2) is allow_xmin_advance = on in SERIALIZABLE mode.

3. We can also implement Read Uncommitted mode itself in the database. We would get inconsistent results for updates and deletes *and* inserts, but at least we would always see the latest version of an updated row, rather than skipping it completely. On a read-only table this might be very useful. On an insert-only table this might result in some long running queries as a scan struggles to keep up with inserts!

For (3) we would need to retrieve tuples from executor scan nodes in SnapshotDirty mode. If we see a tuple we would follow its tuple chain to another row on the same block using EvalPlanQual() logic. If we are using an IndexScan and the update chain goes off-block then there will be another index pointer that we (may) see to access the tuple, so we would stop at that point. If we are using a SeqScan we should follow the tuple chain until we see the top tuple, even if it goes off-block. Bitmap scans would cause problems.

Deriving snapshots from the master seems the most obvious, but let's see where the dice roll...

Other topics on Hot Standby will be covered later; this is just the first part.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
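To make the "maturity" rule in the proposal above concrete: a standby backend given a primary snapshot has to wait until local WAL apply has passed the LSN recorded with it. A minimal sketch, where every type and function is an illustrative stand-in rather than existing backend code:

/* Rough sketch of the snapshot "maturity" rule. All names are
 * illustrative stand-ins, not existing code. */
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t SketchLSN;     /* simplified stand-in for a WAL location */

typedef struct PrimarySnapshot
{
    SketchLSN   taken_lsn;      /* WAL insert pointer on the primary, read
                                 * *after* the snapshot itself was derived */
    /* ... xmin, xmax, xip[] as usual ... */
} PrimarySnapshot;

extern SketchLSN GetLastReplayedLSN(void);      /* hypothetical */
extern void      SleepBriefly(void);            /* hypothetical */

/* A primary snapshot may not be used on the standby until WAL apply
 * has passed the LSN captured when the snapshot was taken. */
static bool
SnapshotIsMature(const PrimarySnapshot *snap)
{
    return GetLastReplayedLSN() >= snap->taken_lsn;
}

static void
WaitForSnapshotToMature(const PrimarySnapshot *snap)
{
    while (!SnapshotIsMature(snap))
        SleepBriefly();
}

With snapshot_reuse_window > 0 most backends would never hit this wait, since they would be handed an already-matured snapshot.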
Simon Riggs wrote:
> Taking snapshots from primary has a few disadvantages
>
> ...
> * snapshots on primary prevent row removal (but this was also an
> advantage of this technique!)

That makes it an awful solution for high availability. A backend hung in transaction-in-progress state in the slave will prevent row removal on the master. Isolating the master from queries performed in the slave is exactly the reason why people use hot standby. And running long reporting queries in the standby is again a very typical use case.

And still we can't escape the scenario that the slave receives a WAL record that vacuums away a tuple that's still visible according to a snapshot used in the slave. Even with the proposed scheme, this can happen:

1. Slave receives a snapshot from master
2. A long-running transaction begins on the slave, using that snapshot
3. Network connection is lost
4. Master hits a timeout, and decides to discard the snapshot it sent to
   the slave
5. A tuple visible to the snapshot is vacuumed
6. Network connection is re-established
7. Slave receives the vacuum WAL record, even though the long-running
   transaction still needs the tuple.

I like the idea of acquiring snapshots locally in the slave much more. As you mentioned, the options there are to defer applying WAL, or cancel queries. I think both options need the same ability to detect when you're about to remove a tuple that's still visible to some snapshot, just the action is different. We should probably provide a GUC to control which you want.

However, if we still want to provide the behavior that "as long as the network connection works, the master will not remove tuples still needed in the slave" as an option, a lot simpler implementation is to periodically send the slave's oldest xmin to the master. The master can take that into account when calculating its own oldest xmin. That requires a lot less communication than the proposed scheme to send snapshots back and forth. A softer version of that is also possible, where the master obeys the slave's oldest xmin, but only up to a point.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
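A minimal sketch of that last idea, seen from the master's side. All names below are invented for illustration, and a real implementation would use TransactionIdPrecedes() for xid comparisons to handle wraparound, which this ignores:

/* Sketch only: the master clamps its cleanup horizon by the xmin the
 * standby last reported, but no further back than some limit
 * ("only up to a point"). */
#include <stdint.h>

typedef uint32_t SketchXid;

extern SketchXid GetLocalOldestXmin(void);       /* hypothetical */
extern SketchXid GetReportedStandbyXmin(void);   /* hypothetical: from the standby */
extern SketchXid OldestXminFloor(SketchXid);     /* hypothetical: applies the cap */

static SketchXid
EffectiveOldestXmin(void)
{
    SketchXid result  = GetLocalOldestXmin();
    SketchXid standby = GetReportedStandbyXmin();
    SketchXid floor   = OldestXminFloor(result);

    if (standby < result)          /* honour the standby's horizon ...        */
        result = standby;
    if (result < floor)            /* ... but only up to the configured point */
        result = floor;
    return result;
}

VACUUM and HOT pruning would then use this value instead of the purely local horizon.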
On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:
> I like the idea of acquiring snapshots locally in the slave much more.
> As you mentioned, the options there are to defer applying WAL, or cancel
> queries.

More exotic ways to defer applying WAL include using some smart filesystems to get per-backend data snapshots, using either copy-on-write overlay filesystems or filesystem/disk level snapshots. At least the disk level snapshots exist in SANs with the aim of easing backups, though I'm not sure how effective they are for the intended hot standby use. Using any of those requires detecting and bypassing shared buffers if they hold "too new" data pages and reading those pages directly from the disk snapshot.

> I think both options need the same ability to detect when
> you're about to remove a tuple that's still visible to some snapshot,
> just the action is different. We should probably provide a GUC to
> control which you want.

We probably need to have two LSNs per page to make maximal use of our MVCC in a Hot Standby situation, so we can distinguish an addition to a page, which implies no data loss, from a row removal, which does. Currently only VACUUM and HOT pruning can cause row removal.

> However, if we still want to provide the behavior that "as long as the
> network connection works, the master will not remove tuples still needed
> in the slave" as an option, a lot simpler implementation is to
> periodically send the slave's oldest xmin to the master. The master can take
> that into account when calculating its own oldest xmin. That requires a
> lot less communication than the proposed scheme to send snapshots back
> and forth. A softer version of that is also possible, where the master
> obeys the slave's oldest xmin, but only up to a point.

That point could be statement_timeout or a (currently missing) transaction_timeout.

Also, the decision to advance xmin should probably be sent to the slave as well, even though it is not something that is needed in the local WAL.

--------------
Hannu
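A sketch of what the "two LSNs per page" idea might look like. Purely illustrative: today's page header has a single LSN and no such field exists.

#include <stdint.h>
#include <stdbool.h>

typedef uint64_t SketchLSN;

typedef struct SketchPageHeader
{
    SketchLSN   pd_lsn;           /* LSN of the last change of any kind */
    SketchLSN   pd_cleanup_lsn;   /* hypothetical: LSN of the last VACUUM or
                                   * HOT prune that removed row versions */
} SketchPageHeader;

/* Only removals endanger standby snapshots; additions to a page are
 * harmless to look at. */
static bool
PageRemovedRowsSince(const SketchPageHeader *page, SketchLSN query_start_lsn)
{
    return page->pd_cleanup_lsn > query_start_lsn;
}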
On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Taking snapshots from primary has a few disadvantages
> >
> > ...
> > * snapshots on primary prevent row removal (but this was also an
> > advantage of this technique!)
>
> That makes it an awful solution for high availability.

Please be careful about making such statements. People might think you were opposing the whole idea of Hot Standby, rather than giving an opinion about one suggestion out of many implementation proposals.

Looks like you've got some good additional suggestions later in the post. I'll reply later to those, so thanks for that.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Heikki Linnakangas wrote:
> Simon Riggs wrote:
>> Taking snapshots from primary has a few disadvantages
>>
>> ...
>> * snapshots on primary prevent row removal (but this was also an
>> advantage of this technique!)
>
> That makes it an awful solution for high availability. A backend hung in
> transaction-in-progress state in the slave will prevent row removal on
> the master. Isolating the master from queries performed in the
> slave is exactly the reason why people use hot standby. And running long
> reporting queries in the standby is again a very typical use case.

I have to say I agree with Heikki here. Blocking the master based on what the slave is doing seems to make hot standby less useful than warm.

> I like the idea of acquiring snapshots locally in the slave much more.

It's the option that I can see people (well, me) understanding the easiest. All the others sound like ways to get things wrong. As for inconsistent query results - that way madness lies. How on earth will anyone be able to diagnose or report bugs when they occur?

> As you mentioned, the options there are to defer applying WAL, or cancel
> queries. I think both options need the same ability to detect when
> you're about to remove a tuple that's still visible to some snapshot,
> just the action is different. We should probably provide a GUC to
> control which you want.

I think there's only one value here: "hot standby wal delay time before cancelling query". Might be a shorter name.

--
Richard Huxton
Archonet Ltd
On Thu, 2008-09-11 at 11:11 +0100, Richard Huxton wrote:
> I have to say I agree with Heikki here. Blocking the master based on
> what the slave is doing seems to make hot standby less useful than warm.

I agree also, that's why I flagged it up for discussion.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Thu, 2008-09-11 at 11:11 +0100, Richard Huxton wrote:
>
>> I have to say I agree with Heikki here. Blocking the master based on
>> what the slave is doing seems to make hot standby less useful than warm.
>
> I agree also, that's why I flagged it up for discussion.

So as far as I can see there are basically two options here. Either

a) transactions with live snapshots on the slave prevent the master from being
   able to vacuum tuples (which defeats the purpose of having a live standby
   server for some users).

or

b) vacuum on the server which cleans up a tuple the slave has in scope has to
   block WAL replay on the slave (which I suppose defeats the purpose of having
   a live standby for users concerned more with fail-over latency).

Is there any middle ground or brilliant ways to get the best of both worlds?

If not, it seems to me the latter is preferable since at least the consequence of having a long-running query on the slave occurs on the same machine running the query. And the work-around -- killing the long-running query -- requires taking action on the same machine as the consequences. Also, when you take action it fixes the problem immediately as WAL replay can recommence, which seems like a better deal than a bloated database.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
On Thu, Sep 11, 2008 at 7:18 AM, Gregory Stark <stark@enterprisedb.com> wrote:
> a) transactions with live snapshots on the slave prevent the master from being
>    able to vacuum tuples (which defeats the purpose of having a live standby
>    server for some users).
>
> or
>
> b) vacuum on the server which cleans up a tuple the slave has in scope has to
>    block WAL replay on the slave (which I suppose defeats the purpose of having
>    a live standby for users concerned more with fail-over latency).

There is nothing stopping you from setting up two (or more) slave servers, with one designated as failover that doesn't serve queries, right?

Option b seems pretty reasonable to me, although I'd prefer to block WAL replay vs. canceling queries... although it might be nice to be able to manually force WAL replay 'with query cancel' via a checkpoint.

merlin
Merlin Moncure wrote:
> There is nothing stopping you from setting up two (or more) slave
> servers, with one designated as failover that doesn't serve queries,
> right?

I'd imagine that even if applying the WAL on the slave is blocked, it's still streamed from the master to the slave, and in case of failover the slave will fast-forward before starting up as the new master. Of course, if it has fallen 3 days behind because of a giant reporting query, it can take a while to replay all the WAL that has accumulated.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-09-11 at 15:23 +0300, Heikki Linnakangas wrote:
> I'd imagine that even if applying the WAL on the slave is blocked, it's
> still streamed from the master to the slave, and in case of failover the
> slave will fast-forward before starting up as the new master.

Which begs the question: what happens with a query which is running on the slave at the moment when the slave switches from recovery mode and starts up? Should the running queries be canceled if they are blocking applying of WAL, to allow start-up, or let them finish?

Cheers,
Csaba.
Gregory Stark wrote:
> b) vacuum on the server which cleans up a tuple the slave has in scope has to
>    block WAL replay on the slave (which I suppose defeats the purpose of having
>    a live standby for users concerned more with fail-over latency).

One problem with this, BTW, is that if there's a continuous stream of medium-length transactions in the slave, each new snapshot taken will prevent progress in the WAL replay, so the WAL replay will advance in "baby steps", and can fall behind indefinitely. As soon as there's a moment that there's no active snapshot, it can catch up, but if the slave is seriously busy, that might never happen.

Nevertheless, I think it's a much nicer approach.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Csaba Nagy wrote:
> On Thu, 2008-09-11 at 15:23 +0300, Heikki Linnakangas wrote:
>> I'd imagine that even if applying the WAL on the slave is blocked, it's
>> still streamed from the master to the slave, and in case of failover the
>> slave will fast-forward before starting up as the new master.
>
> Which begs the question: what happens with a query which is running on
> the slave at the moment when the slave switches from recovery mode and
> starts up? Should the running queries be canceled if they are blocking
> applying of WAL, to allow start-up, or let them finish?

Depends on the application, I'd say. I guess we'll need both, like the smart and fast shutdown modes.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
> One problem with this, BTW, is that if there's a continuous stream of
> medium-length transactions in the slave, each new snapshot taken will
> prevent progress in the WAL replay, so the WAL replay will advance in
> "baby steps", and can fall behind indefinitely.

Why would it fall behind indefinitely? It should only fall behind to the "blocking horizon", which should be the start of the longest currently running transaction... which should be continually advancing and not too far in the past if there are only medium-length transactions involved.

Isn't normal WAL recovery also doing baby-steps, one WAL record at a time? ;-)

Cheers,
Csaba.
On Thu, Sep 11, 2008 at 2:07 AM, Simon Riggs wrote:
> Transaction snapshots are probably the most difficult problem for Hot
> Standby to resolve.

> * remotely from primary node
> * locally on the standby node
>
> If we derive a snapshot locally, then we will end up with a situation
> where the xmin of the local snapshot precedes the xmin of the primary
> node. When this occurs it will then be possible for WAL records to
> arrive on the standby that request removal of rows that a transaction
> might wish to see. Preventing that situation can be done by either
> deferring WAL apply or by cancelling queries.

Which operations can request row removal? Isn't that just specific operations that have their own 'is this safe to remove' calculations anyway (i.e. vacuum and HOT prune)?

What I am thinking about is a design where the primary node regularly pushes an OldestXMin into the WAL, the WAL apply process on the standby nodes pushes it into shared memory, and the backends keep an OldestMasterXMin in shared memory. The standby nodes then regularly push back the oldest OldestMasterXMin from all backends to the master. Vacuum and HOT prune could then base their calculations on an OldestXMin that is not the OldestXMin of the master itself, but of the master and the standby nodes. That way removal of records that are still visible on one of the standby nodes is prevented on the master instead of worked around on the standby nodes.

The obvious downside would be bloat on the master (which could get out of hand if a slave is a few days behind due to a long report), but I think in terms of visibility and consistency this would work. Or am I completely misunderstanding the problem?

Jochem
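A rough standby-side sketch of that feedback loop. Illustrative names only: none of these structures or functions exist, and real code would use TransactionIdPrecedes() for xid comparisons and proper locking.

#include <stdint.h>

typedef uint32_t SketchXid;

#define SKETCH_MAX_BACKENDS 64            /* arbitrary for the sketch */

typedef struct StandbyFeedbackShmem
{
    SketchXid   oldest_master_xmin;                  /* latest value replayed from WAL */
    SketchXid   backend_xmin[SKETCH_MAX_BACKENDS];   /* per-backend xmin, 0 = idle */
} StandbyFeedbackShmem;

extern StandbyFeedbackShmem *feedback;               /* hypothetical shared area */
extern void SendXminToMaster(SketchXid xmin);        /* hypothetical */

/* WAL apply process: remember the OldestXMin the master pushed into WAL */
static void
RememberMasterXmin(SketchXid xmin)
{
    feedback->oldest_master_xmin = xmin;
}

/* Periodically: report the oldest xmin any standby backend still needs */
static void
ReportStandbyXmin(void)
{
    SketchXid oldest = feedback->oldest_master_xmin;

    for (int i = 0; i < SKETCH_MAX_BACKENDS; i++)
    {
        SketchXid x = feedback->backend_xmin[i];

        if (x != 0 && x < oldest)
            oldest = x;
    }
    SendXminToMaster(oldest);
}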
Csaba Nagy wrote:
> On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
>> One problem with this, BTW, is that if there's a continuous stream of
>> medium-length transactions in the slave, each new snapshot taken will
>> prevent progress in the WAL replay, so the WAL replay will advance in
>> "baby steps", and can fall behind indefinitely.
>
> Why would it fall behind indefinitely? It should only fall behind to
> the "blocking horizon", which should be the start of the longest
> currently running transaction... which should be continually advancing
> and not too far in the past if there are only medium-length transactions
> involved.

Well, yes, but you can fall behind indefinitely that way. Imagine that each transaction on the slave lasts, say 10 minutes, with a new transaction starting every 5 minutes. On the master, there's a table that's being vacuumed (or HOT-updated) frequently, say after each transaction for simplicity. What can happen is that every transaction that finishes on the slave will only let the WAL replay advance by one XID before blocking on the snapshot of the next slave transaction. The WAL replay will advance at a rate of 0.2 TPM, while the master is generating 1.0 TPM.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
> I'd imagine that even if applying the WAL on the slave is blocked, it's
> still streamed from the master to the slave, and in case of failover the
> slave will fast-forward before starting up as the new master. Of course, if
> it has fallen 3 days behind because of a giant reporting query, it can take
> a while to replay all the WAL that has accumulated.

Yes, and meanwhile, any other queries that are running on that box are seeing three-day old data as well.

In an ideal world, it would be nice if the slave could keep tuples around even after they are dead and vacuumed on the master. Pushing Xmin from the slave to the master creates the possibility of bloating the master due to activity on the slave, which is not going to improve reliability. OTOH, not pushing Xmin leads to several pathological query behaviors on the slave: cancelled queries, inconsistent reads, and falling way behind on WAL application. Either way, it seems to me a massive and uncomfortable violation of the POLA.

If it were possible for tuples that had been vacuumed on the master to stick around on the slave for as long as the slave still needed them, then you'd have the best of both worlds, but I have a feeling someone's going to say that that's just about impossible to implement. Against that, all I can say is that neither of the behaviors described thus far sounds very appealing as a feature, though I'm certain there are some people who, with sufficient jiggering, could make effective use of them.

...Robert
On Thu, 2008-09-11 at 16:19 +0300, Heikki Linnakangas wrote:
> Well, yes, but you can fall behind indefinitely that way. Imagine that
> each transaction on the slave lasts, say 10 minutes, with a new
> transaction starting every 5 minutes. On the master, there's a table
> that's being vacuumed (or HOT-updated) frequently, say after each
> transaction for simplicity. What can happen is that every transaction
> that finishes on the slave will only let the WAL replay advance by one
> XID before blocking on the snapshot of the next slave transaction. The
> WAL replay will advance at a rate of 0.2 TPM, while the master is
> generating 1.0 TPM.

Aha, now I see where I was mistaken... I thought in terms of time and not transaction IDs. So the time distance between the slave transactions does not matter at all, only the distance in recovered XIDs matters for the "blocking horizon"... and if the WAL recovery is blocked, the "blocking horizon" is stalled as well, so the next transaction on the slave will in fact require the same "blocking horizon" as all currently running ones. Now I got it... and that means in fact that if you have continuously overlapping small transactions, the "blocking horizon" could even be blocked forever, as there'll always be a query running, and the new queries will always have the snapshot of the currently running ones because WAL recovery is stalled... or at least that's what I understand from the whole thing...

Cheers,
Csaba.
On Thursday 11 September 2008, Heikki Linnakangas wrote:
> Well, yes, but you can fall behind indefinitely that way. Imagine that
> each transaction on the slave lasts, say 10 minutes, with a new
> transaction starting every 5 minutes. On the master, there's a table
> that's being vacuumed (or HOT-updated) frequently, say after each
> transaction for simplicity. What can happen is that every transaction
> that finishes on the slave will only let the WAL replay advance by one
> XID before blocking on the snapshot of the next slave transaction. The
> WAL replay will advance at a rate of 0.2 TPM, while the master is
> generating 1.0 TPM.

What would stop the slave from choosing to replay all currently lagging WAL each time it's given the chance to advance a little?

--
dim
On Thu, 2008-09-11 at 15:33 +0200, Dimitri Fontaine wrote:
> What would stop the slave from choosing to replay all currently lagging WAL
> each time it's given the chance to advance a little?

Well, now that I think I understand what Heikki meant, I also think the problem is that there's no chance at all to advance, because the new queries will simply have the same snapshot as the currently running ones as long as WAL replay is blocked... further blocking the WAL replay. When saying this I suppose that the snapshot is in fact based on the last recovered XID, and not on any slave-local XID. In that case once WAL recovery is blocked, the snapshot is stalled too, further blocking WAL recovery, and so on...

Cheers,
Csaba.
On Thursday 11 September 2008, Csaba Nagy wrote:
> Well, now that I think I understand what Heikki meant, I also think the
> problem is that there's no chance at all to advance, because the new
> queries will simply have the same snapshot as the currently running ones as
> long as WAL replay is blocked... further blocking the WAL replay. When
> saying this I suppose that the snapshot is in fact based on the last
> recovered XID, and not on any slave-local XID. In that case once WAL
> recovery is blocked, the snapshot is stalled too, further blocking WAL
> recovery, and so on...

Well, it may be possible to instruct the WAL replay daemon to stop being polite sometimes: when a given max_lag_delay is reached, it could take locks and, as soon as it obtains them, replay all remaining WAL. The max_lag_delay would be a GUC setting the threshold between continuing the replay and running queries.

There could be some other nice ideas that don't invent yet another GUC, but this one was easier for me to think about ;)

--
dim
Csaba Nagy wrote:
> and that means in fact that if you have
> continuously overlapping small transactions, the "blocking horizon"
> could even be blocked forever, as there'll always be a query running,
> and the new queries will always have the snapshot of the currently
> running ones because WAL recovery is stalled...

Hmm, no, I don't think the WAL recovery can become completely stalled. To completely stop progressing, we'd need to take a new snapshot that includes transaction X, and at the same time be blocked on a vacuum record that vacuums a tuple that's visible to transaction X. I don't think that can happen, because for such a scenario to arise, at the corresponding point in time on the master, there would've been a scenario where the vacuum would've removed a tuple that would have been visible to a newly starting transaction. Which can't happen. I think..

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Thanks for the detailed thinking. At least one very good new idea here, some debate on other points.

On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:
> And still we can't escape the scenario that the slave receives a WAL
> record that vacuums away a tuple that's still visible according to a
> snapshot used in the slave. Even with the proposed scheme, this can happen:
>
> 1. Slave receives a snapshot from master
> 2. A long-running transaction begins on the slave, using that snapshot
> 3. Network connection is lost
> 4. Master hits a timeout, and decides to discard the snapshot it sent to
>    the slave
> 5. A tuple visible to the snapshot is vacuumed
> 6. Network connection is re-established
> 7. Slave receives the vacuum WAL record, even though the long-running
>    transaction still needs the tuple.

Interesting point. (4) is a problem, though not for the reason you suggest. If we were to stop and start the master, that would be sufficient to discard the snapshot that the standby is using and so cause problems. So the standby *must* tell the master the recentxmin it is using, as you suggest later, so good thinking. So part of the handshake between primary and standby must be "what is your recentxmin?". The primary will then use the lower/earliest of the two.

> I like the idea of acquiring snapshots locally in the slave much more.

Me too. We just need to know how, if at all.

> As you mentioned, the options there are to defer applying WAL, or cancel
> queries. I think both options need the same ability to detect when
> you're about to remove a tuple that's still visible to some snapshot,
> just the action is different. We should probably provide a GUC to
> control which you want.

I don't see any practical way of telling whether a tuple removal will affect a snapshot or not. Each removed row would need to be checked against each standby snapshot. Even if those were available, it would be too costly. And even if we could do that, ISTM that neither option is acceptable: if we cancel queries then touching a frequently updated table is nearly impossible, and if we delay applying WAL then the standby could fall behind, impairing its ability to be used for HA. (If there were a way, yes, we should have a parameter for it.)

It was also suggested we might take the removed rows and put them in a side table, but that makes me think of the earlier ideas for HOT and so I've steered clear of that.

You might detect blocks that have had tuples removed from them *after* a query started by either

* keeping a hash table of changed blocks - it would be a very big data
  structure and hard to keep clean
* adding an additional "last cleaned LSN" onto every data block
* keeping an extra LSN in the bufhdr for each of the shared_buffers, plus
  keeping a hash table of blocks that have been cleaned and then paged out

Once detected, your only option is to cancel the query.

ISTM if we want to try to avoid making recentxmin the same on both primary and standby then the only viable options are the 3 in the original post.

> However, if we still want to provide the behavior that "as long as the
> network connection works, the master will not remove tuples still needed
> in the slave" as an option, a lot simpler implementation is to
> periodically send the slave's oldest xmin to the master. The master can take
> that into account when calculating its own oldest xmin. That requires a
> lot less communication than the proposed scheme to send snapshots back
> and forth.
> A softer version of that is also possible, where the master
> obeys the slave's oldest xmin, but only up to a point.

I like this very much. Much simpler implementation and no need for a delay in granting snapshots. I'll go for this as the default implementation. Thanks for the idea.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> So part of the handshake between
> primary and standby must be "what is your recentxmin?". The primary will
> then use the lower/earliest of the two.

Even then, the master might already have vacuumed away tuples that are visible to an already running transaction in the slave, before the slave connects. Presumably the master doesn't wait for the slave to connect before starting to accept new connections.

>> As you mentioned, the options there are to defer applying WAL, or cancel
>> queries. I think both options need the same ability to detect when
>> you're about to remove a tuple that's still visible to some snapshot,
>> just the action is different. We should probably provide a GUC to
>> control which you want.
>
> I don't see any practical way of telling whether a tuple removal will
> affect a snapshot or not. Each removed row would need to be checked
> against each standby snapshot. Even if those were available, it would be
> too costly.

How about using the same method as we use in HeapTupleSatisfiesVacuum? Before replaying a vacuum record, look at the xmax of the tuple (assuming it committed). If it's < slave's OldestXmin, it can be removed. Otherwise not. Like HeapTupleSatisfiesVacuum, it's conservative, but doesn't require any extra bookkeeping.

And vice versa: if we implement the more precise book-keeping, with all snapshots in shared memory or something, we might as well use it in HeapTupleSatisfiesVacuum. That has been discussed before, but it's a separate project.

> It was also suggested we might take the removed rows and put them in a
> side table, but that makes me think of the earlier ideas for HOT and so
> I've steered clear of that.

Yeah, that's non-trivial. Basically a whole new, different implementation of MVCC, but without changing any on-disk formats.

BTW, we haven't talked about how to acquire a snapshot in the slave. You'll somehow need to know which transactions have not yet committed, but will in the future. In the master, we keep track of in-progress transactions in the ProcArray, so I suppose we'll need to do the same in the slave. Very similar to prepared transactions, actually. I believe the Abort records, which are not actually needed for normal operation, become critical here. The slave will need to put an entry into the ProcArray for any new XLogRecord.xl_xid it sees in the WAL, and remove the entry at a Commit or Abort record. And clear them all at a shutdown record.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
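Sketched out, that conservative test on the standby could look like this. Illustrative only: in particular, the idea that the cleanup WAL record carries the newest xmax among the removed tuples is an assumption, not something the current records provide.

#include <stdint.h>
#include <stdbool.h>

typedef uint32_t SketchXid;

extern SketchXid GetStandbyOldestXmin(void);   /* hypothetical: oldest xmin of any
                                                * snapshot in use on the standby */

/* latest_removed_xid: newest xmax among the tuples this cleanup record
 * removes (assumed to be included in the record by the primary). */
static bool
CleanupRecordConflictsWithQueries(SketchXid latest_removed_xid)
{
    /* If every removed tuple was deleted by a transaction older than
     * anything still visible on the standby, replay is safe; otherwise
     * we must wait or cancel queries. Real code would need
     * TransactionIdPrecedes() to cope with xid wraparound. */
    return !(latest_removed_xid < GetStandbyOldestXmin());
}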
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> BTW, we haven't talked about how to acquire a snapshot in the slave. You'll
> somehow need to know which transactions have not yet committed, but will in the
> future.

I'm not sure why you need to know which ones will commit in the future. ISTM you need the same information you normally have, which is just which ones have committed as of the point of WAL replay you're at.

However it does occur to me that if you know in advance that a transaction will abort in the future you could skip applying its WAL records when you see them. I'm not sure it's worth adding such an optimization though, and it might get weird around vacuum.

> In the master, we keep track of in-progress transactions in the ProcArray, so
> I suppose we'll need to do the same in the slave. Very similar to prepared
> transactions, actually. I believe the Abort records, which are not actually
> needed for normal operation, become critical here. The slave will need to
> put an entry into the ProcArray for any new XLogRecord.xl_xid it sees in the WAL,
> and remove the entry at a Commit or Abort record. And clear them all at a
> shutdown record.

That's how I envisioned it when the topic came up, but another solution was also bandied about -- I'm not sure who originally proposed it. The alternative method is to have the master periodically insert a data structure describing a snapshot into the WAL. The slave then keeps a copy of the last snapshot it saw in the WAL and any new query which starts up uses that.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
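That alternative could be sketched as a periodic WAL record like the one below. The layout is invented for illustration; nothing like it exists in the WAL today.

#include <stdint.h>

typedef uint32_t SketchXid;

typedef struct SketchRunningXactsRecord
{
    SketchXid   oldest_xmin;     /* oldest xmin on the master at record time */
    SketchXid   next_xid;        /* next xid to be assigned on the master */
    uint32_t    xcnt;            /* number of running xids that follow */
    SketchXid   xids[];          /* currently running xids on the master */
} SketchRunningXactsRecord;

extern void InstallStandbySnapshot(const SketchRunningXactsRecord *rec);  /* hypothetical */

/* On replay, the standby keeps the latest such record; a newly starting
 * query builds its snapshot from it, treating every xid in xids[] as
 * still in progress. */
static void
ReplayRunningXactsRecord(const SketchRunningXactsRecord *rec)
{
    InstallStandbySnapshot(rec);
}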
Gregory Stark wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>
>> BTW, we haven't talked about how to acquire a snapshot in the slave. You'll
>> somehow need to know which transactions have not yet committed, but will in the
>> future.
>
> I'm not sure why you need to know which ones will commit in the future.

Hmm, I phrased that badly. We need to know which transactions *might* commit in the future, IOW, are still in progress. Because we want to mark those as in-progress in the snapshots that are taken in the slave. Otherwise, when they do commit, they will suddenly become visible in the snapshots that didn't know that they were in progress.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
heikki.linnakangas@enterprisedb.com (Heikki Linnakangas) writes:
> Simon Riggs wrote:
>> Taking snapshots from primary has a few disadvantages
>>
>> ...
>> * snapshots on primary prevent row removal (but this was also an
>> advantage of this technique!)
>
> That makes it an awful solution for high availability. A backend hung
> in transaction-in-progress state in the slave will prevent row removal
> on the master. Isolating the master from queries performed in the
> slave is exactly the reason why people use hot standby. And running
> long reporting queries in the standby is again a very typical use case.

I agree that this is a demerit of this approach. Whether or not, on balance, it makes it an 'awful solution for high availability' is much more in the eye of the beholder, and NOT obvious on the face of it.

--
let name="cbbrowne" and tld="linuxdatabases.info" in name ^ "@" ^ tld;;
http://linuxfinances.info/info/sgml.html
Question: How many surrealists does it take to change a light bulb?
Answer: Two, one to hold the giraffe, and the other to fill the bathtub
with brightly colored machine tools.
On Thu, 2008-09-11 at 17:58 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > So part of the handshake between
> > primary and standby must be "what is your recentxmin?". The primary will
> > then use the lower/earliest of the two.
>
> Even then, the master might already have vacuumed away tuples that are
> visible to an already running transaction in the slave, before the slave
> connects. Presumably the master doesn't wait for the slave to connect
> before starting to accept new connections.

Yep, OK.

> >> As you mentioned, the options there are to defer applying WAL, or cancel
> >> queries. I think both options need the same ability to detect when
> >> you're about to remove a tuple that's still visible to some snapshot,
> >> just the action is different. We should probably provide a GUC to
> >> control which you want.
> >
> > I don't see any practical way of telling whether a tuple removal will
> > affect a snapshot or not. Each removed row would need to be checked
> > against each standby snapshot. Even if those were available, it would be
> > too costly.
>
> How about using the same method as we use in HeapTupleSatisfiesVacuum?
> Before replaying a vacuum record, look at the xmax of the tuple
> (assuming it committed). If it's < slave's OldestXmin, it can be
> removed. Otherwise not. Like HeapTupleSatisfiesVacuum, it's
> conservative, but doesn't require any extra bookkeeping.
>
> And vice versa: if we implement the more precise book-keeping, with all
> snapshots in shared memory or something, we might as well use it in
> HeapTupleSatisfiesVacuum. That has been discussed before, but it's a
> separate project.

Tuple removals earlier than the slave's OldestXmin are easy, that's true. I'm not sure what you had in mind for "Otherwise not"? Maybe you mean "stop applying WAL until slave's OldestXmin is > tuple removal xid". Not sure, reading other subthreads of this post.

I think it's possible to defer removal actions on specific blocks only, but that is an optimisation that's best left for a while.

BTW, tuple removals would need a cleanup lock on a block, just as they do on the master server. So WAL apply can be delayed momentarily by pin holders anyway, whatever we do.

> > It was also suggested we might take the removed rows and put them in a
> > side table, but that makes me think of the earlier ideas for HOT and so
> > I've steered clear of that.
>
> Yeah, that's non-trivial. Basically a whole new, different
> implementation of MVCC, but without changing any on-disk formats.
>
> BTW, we haven't talked about how to acquire a snapshot in the slave.
> You'll somehow need to know which transactions have not yet committed,
> but will in the future. In the master, we keep track of in-progress
> transactions in the ProcArray, so I suppose we'll need to do the same in
> the slave. Very similar to prepared transactions, actually. I believe
> the Abort records, which are not actually needed for normal operation,
> become critical here. The slave will need to put an entry into the ProcArray
> for any new XLogRecord.xl_xid it sees in the WAL, and remove the entry
> at a Commit or Abort record. And clear them all at a shutdown record.

I wouldn't do it like that. I was going to maintain a "current snapshot" in shared memory, away from the PROCARRAY. Each time we see a TransactionId we check whether it has already been seen; if not, we insert it. When a transaction commits or aborts we remove the stated xid. If we see a shutdown checkpoint we clear the array completely.
When query backends want a snapshot they just read the array. It doesn't matter whether queries commit or abort, since those changes can't be seen by queries until commit anyway.

The reason for doing it this way is that the PROCARRAY may be full of query backends, so having dummy backends in there as well sounds confusing.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
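A minimal sketch of that shared array. Illustrative only: real code would need locking, overflow handling (subtransactions), and TransactionIdPrecedes-style comparisons.

#include <stdint.h>

typedef uint32_t SketchXid;

#define SKETCH_MAX_XIDS 1024              /* arbitrary for the sketch */

static SketchXid known_running_xids[SKETCH_MAX_XIDS];
static int       n_known_running;

/* WAL apply: any xid we see is treated as running until its commit or
 * abort record arrives. */
static void
NoteXidRunning(SketchXid xid)
{
    for (int i = 0; i < n_known_running; i++)
        if (known_running_xids[i] == xid)
            return;                        /* already known */

    if (n_known_running < SKETCH_MAX_XIDS)
        known_running_xids[n_known_running++] = xid;
    /* a real implementation must handle overflow rather than drop xids */
}

/* WAL apply: commit or abort record seen for this xid */
static void
NoteXidFinished(SketchXid xid)
{
    for (int i = 0; i < n_known_running; i++)
        if (known_running_xids[i] == xid)
        {
            known_running_xids[i] = known_running_xids[--n_known_running];
            return;
        }
}

/* WAL apply: shutdown checkpoint means nothing was running */
static void
ClearKnownRunningXids(void)
{
    n_known_running = 0;
}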
On Thu, 2008-09-11 at 11:38 +0300, Hannu Krosing wrote:
> On Thu, 2008-09-11 at 09:24 +0300, Heikki Linnakangas wrote:
>
> > I like the idea of acquiring snapshots locally in the slave much more.
> > As you mentioned, the options there are to defer applying WAL, or cancel
> > queries.
>
> More exotic ways to defer applying WAL include using some smart
> filesystems to get per-backend data snapshots, using either
> copy-on-write overlay filesystems or filesystem/disk level
> snapshots.

That's certainly possible. That would mean we maintain a single consistent viewpoint of the whole database for a period of time. The frozen viewpoint could move forwards by operator command, or we might keep multiple frozen views.

We can have a LookasideHash table in memory that keeps track of which blocks have had rows removed from them since the "frozen view" was taken.

If you request a block, we check to see whether there is a lookaside copy of it prior to the tuple removals. We then redirect the block request to a viewpoint relation's block. Each viewpoint gets a separate relfilenode. We do the switcheroo while holding a cleanup lock on the block.

So effectively we would be adding "frozen snapshot" technology to Postgres, although we would want to do copy-on-clean rather than copy-on-write, which could mean significantly better performance. That might be encumbered by patent in some way.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
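A sketch of the lookaside redirection. Everything below is illustrative; no such hash table or viewpoint relation exists.

#include <stdint.h>
#include <stddef.h>

typedef uint32_t SketchBlockNumber;
typedef uint32_t SketchRelFileNode;

typedef struct LookasideKey
{
    SketchRelFileNode  rel;
    SketchBlockNumber  block;
} LookasideKey;

typedef struct LookasideEntry
{
    LookasideKey       key;
    SketchRelFileNode  viewpoint_rel;     /* relfilenode of the frozen copy */
    SketchBlockNumber  viewpoint_block;   /* pre-removal image of the block */
} LookasideEntry;

extern LookasideEntry *LookasideLookup(const LookasideKey *key);  /* hypothetical */

/* On a block read: if rows were removed from this block after the frozen
 * viewpoint was taken, redirect the read to the saved copy. */
static void
ResolveBlockForViewpoint(SketchRelFileNode rel, SketchBlockNumber block,
                         SketchRelFileNode *read_rel, SketchBlockNumber *read_block)
{
    LookasideKey    key = { rel, block };
    LookasideEntry *entry = LookasideLookup(&key);

    if (entry != NULL)
    {
        *read_rel = entry->viewpoint_rel;
        *read_block = entry->viewpoint_block;
    }
    else
    {
        *read_rel = rel;
        *read_block = block;
    }
}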
On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
> Gregory Stark wrote:
> > b) vacuum on the server which cleans up a tuple the slave has in scope has to
> >    block WAL replay on the slave (which I suppose defeats the purpose of having
> >    a live standby for users concerned more with fail-over latency).
>
> One problem with this, BTW, is that if there's a continuous stream of
> medium-length transactions in the slave, each new snapshot taken will
> prevent progress in the WAL replay, so the WAL replay will advance in
> "baby steps", and can fall behind indefinitely. As soon as there's a
> moment that there's no active snapshot, it can catch up, but if the
> slave is seriously busy, that might never happen.

It should be possible to do mixed mode.

Stall WAL apply for up to X seconds, then cancel queries. Some people may want X = 0 or low, others might find X = very high acceptable (Merlin et al).

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2008-09-12 at 09:38 +0100, Simon Riggs wrote:
> If you request a block, we check to see whether there is a lookaside
> copy of it prior to the tuple removals. We then redirect the block
> request to a viewpoint relation's block. Each viewpoint gets a separate
> relfilenode. We do the switcheroo while holding a cleanup lock on the block.

Wouldn't it make sense to also have a hint bit on the pages which are copied away? Then instead of looking up a hash table, you first would look up that bit, and if not set you won't look up the hash table at all. Then when you clean up the "lookaside copies" you clear those bits too...

That would probably perform somewhat better for reading than always looking up a potentially big hash table, and the cost of setting the hint is probably a lot less than copying away the page in the first place. Resetting the hint bit might be a bit more expensive.

Cheers,
Csaba.
On Fri, 2008-09-12 at 11:21 +0200, Csaba Nagy wrote:
> On Fri, 2008-09-12 at 09:38 +0100, Simon Riggs wrote:
> > If you request a block, we check to see whether there is a lookaside
> > copy of it prior to the tuple removals. We then redirect the block
> > request to a viewpoint relation's block. Each viewpoint gets a separate
> > relfilenode. We do the switcheroo while holding a cleanup lock on the block.
>
> Wouldn't it make sense to also have a hint bit on the pages which are
> copied away?

There is no need for a hint bit (and one bit would not be enough anyway, as we may need to keep multiple versions). An LSN of the last row removal (VACUUM or HOT pruning) would serve as a hint to start digging around in the hash tables.

It seems, though, that you may have to look into several disk snapshots to find the page you need.

> Then instead of looking up a hash table, you first would
> look up that bit, and if not set you won't look up the hash table at
> all. Then when you clean up the "lookaside copies" you clear those bits
> too...
>
> That would probably perform somewhat better for reading than always
> looking up a potentially big hash table, and the cost of setting the
> hint is probably a lot less than copying away the page in the first
> place. Resetting the hint bit might be a bit more expensive.
>
> Cheers,
> Csaba.
On Fri, 2008-09-12 at 09:45 +0100, Simon Riggs wrote:
> On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote:
> > Gregory Stark wrote:
> > > b) vacuum on the server which cleans up a tuple the slave has in scope has to
> > >    block WAL replay on the slave (which I suppose defeats the purpose of having
> > >    a live standby for users concerned more with fail-over latency).
> >
> > One problem with this, BTW, is that if there's a continuous stream of
> > medium-length transactions in the slave, each new snapshot taken will
> > prevent progress in the WAL replay, so the WAL replay will advance in
> > "baby steps", and can fall behind indefinitely. As soon as there's a
> > moment that there's no active snapshot, it can catch up, but if the
> > slave is seriously busy, that might never happen.
>
> It should be possible to do mixed mode.
>
> Stall WAL apply for up to X seconds, then cancel queries. Some people
> may want X = 0 or low, others might find X = very high acceptable (Merlin
> et al).

Or an even milder version:

* Stall WAL apply for up to X seconds,
* then stall new queries, let old ones run to completion (with optional
  fallback to canceling after Y sec),
* apply WAL.
* Repeat.

-------------
Hannu
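As a sketch, the apply loop for that milder policy could look roughly like this. All helpers and knobs below are invented for illustration.

#include <stdbool.h>

extern bool NextRecordRemovesVisibleRows(void);  /* hypothetical */
extern bool WaitForHorizonToAdvance(int secs);   /* hypothetical: true if it did */
extern void BlockNewQueries(void);               /* hypothetical */
extern void UnblockNewQueries(void);             /* hypothetical */
extern bool WaitForOldQueries(int secs);         /* hypothetical: true if all done */
extern void CancelConflictingQueries(void);      /* hypothetical optional fallback */
extern void ApplyNextWALRecord(void);            /* hypothetical */

static void
ApplyLoop(int stall_wal_secs, int stall_query_secs)
{
    for (;;)
    {
        if (NextRecordRemovesVisibleRows())
        {
            /* 1. stall WAL apply for up to X seconds */
            if (!WaitForHorizonToAdvance(stall_wal_secs))
            {
                /* 2. stop admitting new queries, let old ones drain */
                BlockNewQueries();
                if (!WaitForOldQueries(stall_query_secs))
                    CancelConflictingQueries();
                UnblockNewQueries();
            }
        }
        /* 3. apply WAL, then repeat */
        ApplyNextWALRecord();
    }
}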
On Thu, 2008-09-11 at 15:17 +0200, Jochem van Dieten wrote:
> On Thu, Sep 11, 2008 at 2:07 AM, Simon Riggs wrote:
> > Transaction snapshots are probably the most difficult problem for Hot
> > Standby to resolve.
>
> > * remotely from primary node
> > * locally on the standby node
> >
> > If we derive a snapshot locally, then we will end up with a situation
> > where the xmin of the local snapshot precedes the xmin of the primary
> > node. When this occurs it will then be possible for WAL records to
> > arrive on the standby that request removal of rows that a transaction
> > might wish to see. Preventing that situation can be done by either
> > deferring WAL apply or by cancelling queries.
>
> Which operations can request row removal? Isn't that just specific
> operations that have their own 'is this safe to remove' calculations
> anyway (i.e. vacuum and HOT prune)?
>
> What I am thinking about is a design where the primary node regularly
> pushes an OldestXMin into the WAL, the WAL apply process on
> the standby nodes pushes it into shared memory, and the backends keep
> an OldestMasterXMin in shared memory. The standby nodes then regularly
> push back the oldest OldestMasterXMin from all backends to the
> master. Vacuum and HOT prune could then base their calculations on an
> OldestXMin that is not the OldestXMin of the master itself, but of the
> master and the standby nodes. That way removal of records that are
> still visible on one of the standby nodes is prevented on the master
> instead of worked around on the standby nodes.
>
> The obvious downside would be bloat on the master (which could get out
> of hand if a slave is a few days behind due to a long report), but I
> think in terms of visibility and consistency this would work. Or am I
> completely misunderstanding the problem?

Yes, just sending the xmin from standby to primary is the best way of having primary and standby work together. I will include this, thanks.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, 2008-09-11 at 17:04 +0300, Heikki Linnakangas wrote:
> Csaba Nagy wrote:
> > and that means in fact that if you have
> > continuously overlapping small transactions, the "blocking horizon"
> > could even be blocked forever, as there'll always be a query running,
> > and the new queries will always have the snapshot of the currently
> > running ones because WAL recovery is stalled...
>
> Hmm, no, I don't think the WAL recovery can become completely stalled. To
> completely stop progressing, we'd need to take a new snapshot that
> includes transaction X, and at the same time be blocked on a vacuum
> record that vacuums a tuple that's visible to transaction X. I don't
> think that can happen, because for such a scenario to arise, at the
> corresponding point in time on the master, there would've been a
> scenario where the vacuum would've removed a tuple that would have been
> visible to a newly starting transaction. Which can't happen. I think..

ISTM Csaba is correct. If WAL apply blocks, the xids don't change and so the snapshots never change, so WAL stays blocked. The only way out of that is to store up removals for particular blocks, but that's complicated.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, 2008-09-11 at 01:07 +0100, Simon Riggs wrote:
> Transaction snapshots are probably the most difficult problem for Hot
> Standby to resolve.

In summary of the thread so far:

When queries on the standby run for significantly longer than the longest queries on the primary, some problems can occur. Various people have argued for these responses to the problems:

1. Master uses Standby's OldestXmin

   Effects:
   * Long running queries on standby...
       Can delay row removal on primary
       Do not delay apply of WAL records on standby
   * Queries on standby give consistent answers in all cases.

2. Master ignores Standby's OldestXmin

   Effects:
   * Long running queries on standby...
       Have no effect on primary
       Can delay apply of WAL records on standby
   * Queries on standby give consistent answers in all cases.

3. Ignore problem

   Effects:
   * Long running queries on standby...
       Have no effect on primary
       Do not delay apply of WAL records on standby
   * Queries on standby give inconsistent answers in some cases, though
     this doesn't generate any messages to show that an inconsistency
     occurred. Acceptable for read-only and insert-only tables only.

Hot Standby should provide all 3 responses as options.

(1) would be implemented by sending the Standby's OldestXmin to the primary. Snapshots would not be sent from the primary; they will be derived locally from the transactions currently being applied.

(2) would be implemented by setting a timer. When the Startup process has waited for more than "redo_max_delay"/"max_lag_delay" (SIGHUP) we cancel queries. If the timeout is 0 we aggressively cancel queries without waiting.

(3) would be implemented using read_consistency = on (default) | off, a USERSET parameter. When read_consistency = off we ignore the backend's xmin when deciding whether to wait before applying WAL or not.

Summary OK for everyone?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
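For (2) and (3), the decision in the Startup process might be sketched like this. Illustrative only: the parameter names are the proposed ones, the helpers are invented.

#include <stdbool.h>

extern int  max_lag_delay;                         /* proposed parameter, milliseconds */
extern bool AnyConsistentQueryStillNeedsRow(void); /* hypothetical: ignores backends
                                                    * running with read_consistency=off */
extern bool WaitMillis(int ms);                    /* hypothetical: true if the
                                                    * conflict cleared while waiting */
extern void CancelConflictingQueries(void);        /* hypothetical */

static void
ResolveCleanupConflict(void)
{
    if (!AnyConsistentQueryStillNeedsRow())
        return;                         /* (3): nobody cares, just apply the record */

    if (max_lag_delay > 0 && WaitMillis(max_lag_delay))
        return;                         /* conflict went away while we stalled */

    CancelConflictingQueries();         /* (2): timeout expired, cancel and apply */
}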
On Thu, 2008-09-11 at 12:18 +0100, Gregory Stark wrote:
> Is there any middle ground or brilliant ways to get the best of both worlds?

Possibly. Nobody has commented yet on the other ideas on the post itself.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2008-09-12 at 11:19 +0100, Simon Riggs wrote:
> On Thu, 2008-09-11 at 01:07 +0100, Simon Riggs wrote:
> > Transaction snapshots are probably the most difficult problem for Hot
> > Standby to resolve.
>
> In summary of the thread so far:
>
> When queries on the standby run for significantly longer than the longest
> queries on the primary, some problems can occur. Various people have argued
> for these responses to the problems:
>
> 1. Master uses Standby's OldestXmin
...
> 2. Master ignores Standby's OldestXmin
...
> 3. Ignore problem
...
> Summary OK for everyone?

Maybe we should at least mention option 4.

4. Slave keeps copies of removed pages or rows when WAL apply removes old versions.

Possible ways to do this:

* inside Slave - have some backup store tied to OldestXmin intervals
    * variant 1 - have one global store, accessed through shared mem
    * variant 2 - present removed pages to interested backends and
      let them (decide to) keep them

* outside Slave - having file system keep old snapshots as long as
  needed, still must tie to OldestXmin intervals, but most of work done by
  storage layer (SAN or overlay file system).

-------------
Hannu
On Fri, 2008-09-12 at 13:53 +0300, Hannu Krosing wrote:
> 4. Slave keeps copies of removed pages or rows when WAL apply removes
> old versions.
>
> Possible ways to do this:
>
> * inside Slave - have some backup store tied to OldestXmin intervals
>     * variant 1 - have one global store, accessed through shared mem
>     * variant 2 - present removed pages to interested backends and
>       let them (decide to) keep them
>
> * outside Slave - having file system keep old snapshots as long as
>   needed, still must tie to OldestXmin intervals, but most of work done by
>   storage layer (SAN or overlay file system).

Possible options for "outside Slave" filesystem snapshotting:

- ZFS ( http://en.wikipedia.org/wiki/ZFS ) is very likely usable

- Linux LVM + XFS may be usable -
  http://arstechnica.com/articles/columns/linux/linux-20041013.ars

- Possibly also http://en.wikipedia.org/wiki/Btrfs .

----------------
Hannu
Simon Riggs <simon@2ndQuadrant.com> writes: > 3. Ignore problem > Effects: > * Long running queries on standby... > Have no effect on primary > Do not delay apply of WAL records on standby > * Queries on standby give inconsistent answers in some cases, though > doesn't generate any messages to show inconsistency occurred. Acceptable > for read-only and insert only tables only. This seems like a non-starter. Your comment about read-only and insert-only tuples only seems to make sense if you assume there are other tables being updated simultaneously. Otherwise of course there would be no WAL records for tuple removals. In that case the problem is dealing with different usage patterns on different tables. There might be a way to solve just that use case such as deferring WAL records for those tables. That doesn't guarantee inter-table data consistency if there were other queries which read from those tables and updated other tables based on that data though. Perhaps there's a solution for that too though. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
Gregory Stark wrote: > In that case the problem is dealing with different usage patterns on different > tables. There might be a way to solve just that use case such as deferring WAL > records for those tables. That doesn't guarantee inter-table data consistency > if there were other queries which read from those tables and updated other > tables based on that data though. Perhaps there's a solution for that too > though. There was a suggestion (Simon - from you?) of a transaction voluntarily restricting itself to a set of tables. That would obviously reduce the impact of all the options where the accessed tables weren't being updated (where update = vacuum + HOT if I've got this straight). -- Richard Huxton Archonet Ltd
On Fri, 2008-09-12 at 12:31 +0300, Hannu Krosing wrote: > On Fri, 2008-09-12 at 09:45 +0100, Simon Riggs wrote: > > On Thu, 2008-09-11 at 15:42 +0300, Heikki Linnakangas wrote: > > > Gregory Stark wrote: > > > > b) vacuum on the server which cleans up a tuple the slave has in scope has to > > > > block WAL reply on the slave (which I suppose defeats the purpose of having > > > > a live standby for users concerned more with fail-over latency). > > > > > > One problem with this, BTW, is that if there's a continuous stream of > > > medium-length transaction in the slave, each new snapshot taken will > > > prevent progress in the WAL replay, so the WAL replay will advance in > > > "baby steps", and can fall behind indefinitely. As soon as there's a > > > moment that there's no active snapshot, it can catch up, but if the > > > slave is seriously busy, that might never happen. > > > > It should be possible to do mixed mode. > > > > Stall WAL apply for up to X seconds, then cancel queries. Some people > > may want X=0 or low, others might find X = very high acceptable (Merlin > > et al). > > Or even milder version. > > * Stall WAL apply for up to X seconds, > * then stall new queries, let old ones run to completion (with optional > fallback to canceling after Y sec), > * apply WAL. > * Repeat. Now that I have thought a little more about delegating keeping old versions to filesystem level (ZFS, XFS+LVM) snapshots I'd like to propose the following: 0. run queries and apply WAL freely until WAL application would remove old rows. 1. stall applying WAL for up to N seconds 2. stall starting new queries for up to M seconds 3. if some backends are still running long queries, then 3.1. make filesystem level snapshot (FS snapshot), 3.2. mount the FS snapshot somewhere (maybe as data.at.OldestXmin in parallel to $PGDATA) and 3.3. hand this mounted FS snapshot over to those backends 4. apply WAL 5. GoTo 0. Of course we need to do the filesystem level snapshots in 3. only if the long-running queries don't already have one given to them. Or maybe also if they are running in READ COMMITTED mode and have acquired a new PG snapshot since they got their FS snapshot, in which case they need a new one. Also, snapshots need to be reference counted, so we can unmount and destroy them once all their users have finished. I think that enabling long-running queries this way is both low-hanging fruit (or at least medium-height-hanging ;) ) and also consistent with the PostgreSQL philosophy of not replicating effort. As an example we trust OS's file system cache and don't try to write our own. ---------------- Hannu
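For concreteness, a rough C sketch of the numbered loop proposed above; every helper named here (ApplyWALUntilRowRemovalNeeded, CreateAndMountFsSnapshot and so on) is hypothetical and simply mirrors the steps in the mail.

    #include "postgres.h"

    typedef struct FsSnapshot FsSnapshot;   /* opaque: mount point, refcount, ... */

    extern void        ApplyWALUntilRowRemovalNeeded(void);          /* assumed */
    extern bool        WaitForOldQueriesToFinish(int seconds);       /* assumed */
    extern void        StallNewQueries(int seconds);                 /* assumed */
    extern bool        LongQueriesStillRunning(void);                /* assumed */
    extern FsSnapshot *CreateAndMountFsSnapshot(const char *where);  /* assumed */
    extern void        HandSnapshotToRunningBackends(FsSnapshot *s); /* assumed */
    extern void        ApplyPendingRowRemovalWAL(void);              /* assumed */

    static void
    RecoveryMainLoopSketch(int n_seconds, int m_seconds)
    {
        for (;;)
        {
            ApplyWALUntilRowRemovalNeeded();                  /* step 0 */

            if (!WaitForOldQueriesToFinish(n_seconds))        /* step 1: stall WAL apply */
            {
                StallNewQueries(m_seconds);                   /* step 2 */
                if (LongQueriesStillRunning())                /* step 3 */
                {
                    /* 3.1 + 3.2: snapshot and mount it beside $PGDATA */
                    FsSnapshot *snap = CreateAndMountFsSnapshot("data.at.OldestXmin");

                    /* 3.3: ref-counted; unmounted when the last user finishes */
                    HandSnapshotToRunningBackends(snap);
                }
            }
            ApplyPendingRowRemovalWAL();                      /* step 4, then repeat (step 5) */
        }
    }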
On Fri, 2008-09-12 at 12:31 +0100, Richard Huxton wrote: > There was a suggestion (Simon - from you?) of a transaction voluntarily > restricting itself to a set of tables. While thinking about how easy it would be for the DBA to specify the set of tables a single query is accessing, first I thought that it should be straightforward enough to look at the query itself for that. Then I thought what about views, rules, triggers, user functions etc. ? All those have the potential to access more than you see in the query itself. And then the actually interesting question: what will the slave do with views, rules, triggers ? I guess triggers are out of the question to be executed, what about rules ? Probably must be also ignored... user functions will probably get errors if they try to update something... Views should probably function correctly. So in any case the functionality available for querying slaves would be less than for the primary. This is probably good enough for most purposes... Cheers, Csaba.
> I think that enabling long-running queries this way is both > low-hanging > fruit (or at least medium-height-hanging ;) ) and also consistent with > the PostgreSQL philosophy of not replicating effort. As an example we trust > OS's file system cache and don't try to write our own. I have again questions (unfortunately I only have questions usually): * how will the buffers keep 2 different versions of the same page ? * how will you handle the creation of snapshots ? I guess there's no portable and universal API for that (just guessing), or there is some POSIX thing which is supported or not by the specific FS ? So if the FS is not supporting it, you skip the snapshot step ? And if there's no universal API, will it be handled by plugins providing a specified API for snapshotting the FS ? I hope my continuous questioning is not too annoying... Cheers, Csaba.
On Fri, 2008-09-12 at 12:25 +0100, Gregory Stark wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > > 3. Ignore problem > > Effects: > > * Long running queries on standby... > > Have no effect on primary > > Do not delay apply of WAL records on standby > > * Queries on standby give inconsistent answers in some cases, though > > doesn't generate any messages to show inconsistency occurred. Acceptable > > for read-only and insert only tables only. > > This seems like a non-starter. It works, and is proposed as a non-default option since a number of people have independently said to me that this would be acceptable/preferred. > Your comment about read-only and insert-only tuples only seems to make sense > if you assume there are other tables being updated simultaneously. Otherwise > of course there would be no WAL records for tuple removals. Yeh, you got it. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Fri, 2008-09-12 at 13:54 +0200, Csaba Nagy wrote: > > I think that enabling long-running queries this way is both > > low-hanging > > fruit (or at least medium-height-hanging ;) ) and also consistent with > > the PostgreSQL philosophy of not replicating effort. As an example we trust > > OS's file system cache and don't try to write our own. > > I have again questions (unfortunately I only have questions usually): > > * how will the buffers keep 2 different versions of the same page ? As the FS snapshot is mounted as a different directory, it will have its own buffer pages. To conserve RAM, one could go to FS snapshot files only in case main pages have LSN too big to be trusted. > * how will you handle the creation of snapshots ? probably an external command, possibly a shell script, similar to the current "archive_command" for WAL copying; maybe 'create_fs_snapshot_command' and 'destroy_fs_snapshot_command' > I guess there's no portable and universal API for that (just guessing), > or there is some POSIX thing which is supported or not by the specific FS ? > So if the FS is not supporting it, you skip the snapshot step ? Yes, if FS snapshots are not supported, we fall back to either inconsistent read or killing long-running queries. > And if there's no universal API, will it be handled by plugins providing > a specified API for snapshotting the FS ? Yes, the simplest one being external command. As FS snapshots are supposed to happen not-too often, at least not every second, just having external commands may be enough. > I hope my continuous questioning is not too annoying... On the contrary, much appreciated. :) ---------------- Hannu
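A sketch of how such an external command might be shelled out, modelled loosely on the way archive_command is run today. The GUC name comes from the mail above; the script path, the append-the-argument convention, and the helper itself are only assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdbool.h>

    /* Hypothetical GUC value, e.g. set in postgresql.conf. */
    static const char *create_fs_snapshot_command = "/usr/local/bin/create_pg_snapshot.sh";

    /* Run the configured command with the snapshot name as its argument,
     * much as archive_command is shelled out with %p/%f substituted.
     * A real implementation would do %-style substitution instead of
     * simply appending the name. */
    static bool
    RunCreateFsSnapshotCommand(const char *snapshot_name)
    {
        char cmd[1024];

        snprintf(cmd, sizeof(cmd), "%s %s", create_fs_snapshot_command, snapshot_name);
        return system(cmd) == 0;    /* non-zero exit = snapshot creation failed */
    }

A matching 'destroy_fs_snapshot_command' would be invoked the same way once the snapshot's reference count drops to zero.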
On Fri, 2008-09-12 at 15:08 +0300, Hannu Krosing wrote: > > * how will the buffers keep 2 different versions of the same page ? > > As the FS snapshot is mounted as a different directory, it will have > it's own buffer pages. Lack of knowledge about this shows my ignorance about the implementation of the page buffers... > > * how will you handle the creation of snapshots ? > > probably an external command, possibly shell script. > similar to current "archive_command" for wal copying > > maybe 'create_fs_snapshot_command' and 'destroy_fs_snapshot_command' [snip] > Yes, the simplest one being external command. As FS snapshots are > supposed to happen not-too often, at least not every second, just having > external commands may be enough. You could restrict the creation of snapshots to some minimum amount of time between them, and maybe also restrict the maximum number of concurrent snapshots possible. Then if the time limit (as calculated from the last open snapshot) is currently not met, any new query could reuse that last snapshot. The time intervals do not need to be evenly distributed BTW, it could be a function of the already opened snapshots, like increase the minimum interval exponentially with the number of already opened snapshots. That would help to catch more long running queries to just a few snapshots. > > I hope my continuous questioning is not too annoying... > > On the contrary, much appreciated. :) Ok, then I'll continue :-) I would like to see this feature succeed, but there's slim chance I'll ever code well in C... Cheers, Csaba.
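A tiny sketch of the reuse rule suggested above: hand back the last filesystem snapshot while a minimum interval has not elapsed, and let that interval grow with the number of snapshots already open. The doubling policy, the types and every name here are assumptions for illustration only.

    #include <time.h>
    #include <stddef.h>

    typedef struct FsSnapshot { time_t created; /* plus mount point, refcount, ... */ } FsSnapshot;

    extern FsSnapshot *CreateAndMountFsSnapshot(void);   /* hypothetical, as in earlier sketches */

    static FsSnapshot *last_snapshot = NULL;
    static int n_open_snapshots = 0;
    static int base_interval_secs = 60;      /* assumed minimum gap between snapshots */

    /* Reuse the last snapshot while the (growing) minimum interval has not elapsed. */
    static FsSnapshot *
    GetFsSnapshotForQuery(void)
    {
        time_t now = time(NULL);
        /* grow the interval exponentially with open snapshots (overflow ignored here) */
        int    min_interval = base_interval_secs << n_open_snapshots;

        if (last_snapshot != NULL && now - last_snapshot->created < min_interval)
            return last_snapshot;            /* piggy-back on the existing snapshot */

        last_snapshot = CreateAndMountFsSnapshot();
        n_open_snapshots++;
        return last_snapshot;
    }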
On Fri, 2008-09-12 at 15:08 +0300, Hannu Krosing wrote: > On Fri, 2008-09-12 at 13:54 +0200, Csaba Nagy wrote: > > > I think that enabling long-running queries this way is both > > > low-hanging > > > fruit (or at least medium-height-hanging ;) ) and also consistent with > > > the PostgreSQL philosophy of not replicating effort. As an example we trust > > > OS's file system cache and don't try to write our own. > > > > I have again questions (unfortunately I only have questions usually): > > > > * how will the buffers keep 2 different versions of the same page ? > > As the FS snapshot is mounted as a different directory, it will have > its own buffer pages. RelFileNode has a spcNode which can be redirected to a temporary filesystem snapshot. So it's relatively easy to imagine redirecting access to a table from its normal tablespace to the snapshot one. > To conserve RAM, one could go to FS snapshot files only in case main > pages have LSN too big to be trusted. That would mean you'd need to do two I/Os, one to get the newly changed page to get its LSN and another to get the old COW copy. We might waste buffer space with that technique also. Since we'd be trying to avoid caching bigger tables anyway (since 8.3) it seems easier to just go straight to the COW copy. So I think it's fairly straightforward to support temporary snapshots in Postgres, with creation/destruction handled in the way you say. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
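A sketch of the redirection described here: swap the tablespace OID in a RelFileNode so that reads resolve inside the mounted snapshot. RelFileNode and its spcNode/dbNode/relNode fields are real; the snapshot-tablespace lookup is an assumed helper.

    #include "postgres.h"
    #include "storage/relfilenode.h"    /* RelFileNode: spcNode, dbNode, relNode */

    extern Oid LookupSnapshotTablespace(Oid spcNode);   /* hypothetical helper */

    /* Redirect a relation's storage reference into the COW snapshot tablespace,
     * e.g. one mounted under data.at.OldestXmin, leaving db/rel OIDs untouched. */
    static RelFileNode
    RedirectToFsSnapshot(RelFileNode rnode)
    {
        Oid snapSpc = LookupSnapshotTablespace(rnode.spcNode);

        if (OidIsValid(snapSpc))
            rnode.spcNode = snapSpc;    /* reads now go straight to the COW copy */
        return rnode;
    }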
Heikki Linnakangas wrote: > BTW, we haven't talked about how to acquire a snapshot in the slave. > You'll somehow need to know which transactions have not yet > committed, but will in the future. In the master, we keep track of > in-progress transaction in the ProcArray, so I suppose we'll need to > do the same in the slave. Very similar to prepared transactions, > actually. I believe the Abort records, which are not actually needed > for normal operation, become critical here. The slave will need to > put an entry to ProcArray for any new XLogRecord.xl_xid it sees in > the WAL, and remove the entry at a Commit and Abort record. And clear > them all at a shutdown record. For reference, here is how I solved the snapshot problem in my Summer-of-Code project last year, which dealt exactly with executing read-only queries on PITR slaves (But sadly never came out of alpha stage due to both my and Simon's lack of time). The main idea was to invert the meaning of the xid array in the snapshot struct - instead of storing all the xid's between xmin and xmax that are to be considered "in-progress", the array contained all the xid's > xmin that are to be considered "completed". The current read-only snapshot (where "current" means the corresponding state on the master at the time the last replayed WAL record was generated) was maintained in shared memory. Its xmin field was continually updated with the (newly added) XLogRecord.xl_xmin field, which contained the xid of the oldest running query on the master, with a pruning step after each ReadOnlySnapshot.xmin update to remove all entries < xmin from the xid array. If a commit was seen for an xid, that xid was added to the ReadOnlySnapshot.xid array. The advantage of this concept is that it handles snapshotting on the slave without too much additional work for the master (The only change is the addition of the xl_xmin field to XLogRecord). It especially removes the need to track ShmemVariableCache->nextXid. The downside is that the size of the read-only snapshot is theoretically unbounded, which poses a bit of a problem if it's supposed to live inside shared memory... regards, Florian Pflug
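A sketch of what a visibility test against this inverted snapshot might look like: an xid is treated as still running unless it precedes xmin or appears in the "completed" array. The struct fields follow the description above; the function itself and its name are only an illustration.

    #include "postgres.h"
    #include "access/transam.h"     /* TransactionIdPrecedes, TransactionIdEquals */

    typedef struct ReadOnlySnapshot
    {
        TransactionId  xmin;        /* oldest xid still running on the master */
        int            nxids;
        TransactionId *xids;        /* xids > xmin known to have *completed* */
    } ReadOnlySnapshot;

    /* Inverted test: "in progress" unless older than xmin or listed as completed. */
    static bool
    XidInProgressInverted(TransactionId xid, ReadOnlySnapshot *snap)
    {
        int i;

        if (TransactionIdPrecedes(xid, snap->xmin))
            return false;                        /* finished before anything still running */

        for (i = 0; i < snap->nxids; i++)
            if (TransactionIdEquals(xid, snap->xids[i]))
                return false;                    /* its commit was already replayed */

        return true;                             /* no completion seen yet in the WAL */
    }

The unbounded-size problem mentioned above comes from the xids[] array: it grows with every commit replayed until xmin advances and the pruning step trims it.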
On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote: > The main idea was to invert the meaning of the xid array in the snapshot > struct - instead of storing all the xid's between xmin and xmax that are > to be considered "in-progress", the array contained all the xid's > > xmin that are to be considered "completed". > The downside is that the size of the read-only snapshot is theoretically > unbounded, which poses a bit of a problem if it's supposed to live > inside shared memory... Why do it inverted? That clearly has problems. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote: > The current read-only snapshot (where "current" means the > corresponding state on the master at the time the last replayed WAL > record was generated) was maintained in shared memory. Its xmin field > was continually updated with the (newly added) XLogRecord.xl_xmin > field, which contained the xid of the oldest running query on the > master, with a pruning step after each ReadOnlySnapshot.xmin update to > remove all entries < xmin from the xid array. If a commit was seen for > an xid, that xid was added to the ReadOnlySnapshot.xid array. > > The advantage of this concept is that it handles snapshotting on the > slave without too much additional work for the master (The only change > is the addition of the xl_xmin field to XLogRecord). It especially > removes the need to track ShmemVariableCache->nextXid. Snapshots only need to know which transactions are currently "running" during WAL apply. The standby can't remove any tuples itself, so it doesn't need to know what the master's OldestXmin is. So passing xl_xmin from master to standby seems not necessary to me. The standby's OldestXmin needs to be passed through to the master, not the other way around so that WAL records for tuple removal are not generated. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > So passing xl_xmin from master to standby seems not necessary to me. The > standby's OldestXmin needs to be passed through to the master, not the > other way around so that WAL records for tuple removal are not > generated. I think most people were pretty leery of doing it that way because it means activity on the standby database can cause the master to bloat. The consensus seemed to be headed towards having WAL replay on the standby stall if it meets a tuple removal record for a tuple that's visible to a query running on it. Probably with a mechanism to configure a maximum amount of time it can be stalled before shooting those queries. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
On Mon, 2008-09-15 at 13:13 +0100, Gregory Stark wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > > So passing xl_xmin from master to standby seems not necessary to me. The > > standby's OldestXmin needs to be passed through to the master, not the > > other way around so that WAL records for tuple removal are not > > generated. > > I think most people were pretty leery of doing it that way because it means > activity on the standby database can cause the master to bloat. The consensus > seemed to be headed towards having WAL replay on the standby stall if it meets > a tuple removal record for a tuple that's visible to a query running on it. > Probably with a mechanism to configure a maximum amount of time it can be > stalled before shooting those queries. Well, my impression from all inputs is there is no single preferred route. Any one of the approaches seems to have a possible objection, depending upon the needs of the user. So I'm going to give options. In any case it's not just a two horse race. There are other options, favoured by some people, that you've not personally commented on in any of your summaries (that's up to you, of course). And "Standby causing master to bloat" is not such a big problem. It's no different to running queries directly on the master. But please don't take it that I don't see the problem or think it the best solution in all cases. Certainly, halting WAL replay for any length of time might mean it can never catch up again, so that won't be acceptable for many cases. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Mon, Sep 15, 2008 at 8:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > > On Mon, 2008-09-15 at 13:13 +0100, Gregory Stark wrote: >> Simon Riggs <simon@2ndQuadrant.com> writes: >> >> > So passing xl_xmin from master to standby seems not necessary to me. The >> > standby's OldestXmin needs to be passed through to the master, not the >> > other way around so that WAL records for tuple removal are not >> > generated. >> >> I think most people were pretty leery of doing it that way because it means >> activity on the standby database can cause the master to bloat. The consensus >> seemed to be headed towards having WAL replay on the standby stall if it meets >> a tuple removal record for a tuple that's visible to a query running on it. >> Probably with a mechanism to configure a maximum amount of time it can be >> stalled before shooting those queries. > > Well, my impression from all inputs is there is no single preferred > route. Any one of the approaches seems to have a possible objection, > depending upon the needs of the user. So I'm going to give options. > > In any case it's not just a two horse race. There are other options, > favoured by some people, that you've not personally commented on in any > of your summaries (thats up to you, of course). > > And "Standby causing master to bloat" is not such a big problem. It's no > different to running queries directly on the master. But please don't > take it that I don't see the problem or think it the best solution in > all cases. It's not a problem, but a relative disadvantage. What makes warm standby really attractive relative to other data transfer solutions is that there are no side effects on the master outside of setting the archive command...communication is one way. Any 'master bloat' style approach seems to be increasingly fragile if you want to increase the number of standby servers, if it's even possible to do that. I'm not saying it should be the only way to do it (and it may not even be possible)...but it would be very attractive to be able to run a hot standby much the same as a warm standby is running today...I could, for example easily script a second standby but keep it a week behind for example. merlin
Gregory Stark wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > >> So passing xl_xmin from master to standby seems not necessary to me. The >> standby's OldestXmin needs to be passed through to the master, not the >> other way around so that WAL records for tuple removal are not >> generated. > > I think most people were pretty leery of doing it that way because it means > activity on the standby database can cause the master to bloat. The consensus > seemed to be headed towards having WAL replay on the standby stall if it meets > a tuple removal record for a tuple that's visible to a query running on it. > Probably with a mechanism to configure a maximum amount of time it can be > stalled before shooting those queries. Yes, but we can quite easily provide an option on top of that to advertise the slave xmin back to the master, for those who prefer some bloat on master over stalling replay or killing queries in the slave. In fact, I think a lot of people will choose some compromise, where the slave xmin is advertised back to the master, but the master will obey it only up to some limit, after which the slave will need to stall or kill queries again. It's not something that needs to be done in the first phase, but should be straightforward to add after the basics are working. In any case, we'll need the capability in the slave to notice when it's about to remove a tuple that's still visible to a snapshot in the slave. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Mon, 2008-09-15 at 16:26 +0300, Heikki Linnakangas wrote: > In any case, we'll need the capability in the slave to notice when > it's about to remove a tuple that's still visible to a snapshot in the > slave. Agreed. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Mon, 2008-09-15 at 09:07 -0400, Merlin Moncure wrote: > > Well, my impression from all inputs is there is no single preferred > > route. Any one of the approaches seems to have a possible objection, > > depending upon the needs of the user. So I'm going to give options. > > > > In any case it's not just a two horse race. There are other options, > > favoured by some people, that you've not personally commented on in > any > > of your summaries (that's up to you, of course). > > > > And "Standby causing master to bloat" is not such a big problem. It's no > > different to running queries directly on the master. But please don't > > take it that I don't see the problem or think it the best solution in > > all cases. > > It's not a problem, but a relative disadvantage. In some circumstances, yes, in others, not. But it's not the only consideration and different people attach different weightings in their decision making. > What makes warm > standby really attractive relative to other data transfer solutions is > that there are no side effects on the master outside of setting the > archive command...communication is one way. Any 'master bloat' style > approach seems to be increasingly fragile if you want to increase the > number of standby servers, if it's even possible to do that. > > I'm not saying it should be the only way to do it (and it may not even > be possible)...but it would be very attractive to be able to run a hot > standby much the same as a warm standby is running today...I could, > for example easily script a second standby but keep it a week behind > for example. I hope to provide options for all users, not just a single approach. If you choose never to use the option to link standby and master, I respect that and understand. I've listened to your requirements and believe I'm including things to allow it to work for you the way you've said. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Mon, 2008-09-15 at 16:26 +0300, Heikki Linnakangas wrote: > In any case, > we'll need the capability in the slave to notice when it's about to > remove a tuple that's still visible to a snapshot in the slave. Looks like what I'll do is this: Alter functions in pruneheap.c so that we bubble up the latest xid that is being removed as part of block cleaning. We then add that xid into the WAL record for cleaning. If the latest xid of the clean is ahead of the OldestXmin of running queries on the standby then the Startup process needs to take action of some kind. Re-examining the tuples in WAL apply seems a bad plan, since we'd have to touch stuff in the block twice and juggle the locks. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
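A sketch of that mechanism: the cleanup WAL record carries the newest xid whose row versions were removed, and redo on the standby compares it against the oldest xmin of running standby queries before touching the block. The record layout and helper names are illustrative, not the actual pruneheap.c/heapam WAL format.

    #include "postgres.h"
    #include "access/transam.h"         /* TransactionIdFollowsOrEquals */
    #include "storage/block.h"          /* BlockNumber */
    #include "storage/relfilenode.h"

    /* Illustrative cleanup record; the real xl_heap_clean looks different. */
    typedef struct xl_heap_clean_sketch
    {
        RelFileNode   node;
        BlockNumber   block;
        TransactionId latestRemovedXid;   /* bubbled up from the prune functions */
    } xl_heap_clean_sketch;

    extern TransactionId GetStandbyOldestXmin(void);            /* assumed */
    extern void ResolveRecoveryConflict(TransactionId xid);     /* assumed: wait, then cancel */

    static void
    heap_xlog_clean_sketch(xl_heap_clean_sketch *rec)
    {
        /* If the cleanup removed rows that some standby query could still see,
         * take action (wait, then cancel) before replaying the cleanup. */
        if (TransactionIdFollowsOrEquals(rec->latestRemovedXid, GetStandbyOldestXmin()))
            ResolveRecoveryConflict(rec->latestRemovedXid);

        /* ... then redo the page cleanup itself ... */
    }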
On Friday 12 September 2008 07:44:36 Csaba Nagy wrote: > And then > the actually interesting question: what will the slave do with views, > rules, triggers ? I guess triggers are out of the question to be > executed, what about rules ? Probably must be also ignored... user > functions will probably get errors if they try to update something... > Views should probably function correctly. > If we don't have rules, we don't get views, so those need to be supported. Really we should allow anything that would work in the context of a read-only transaction. (i.e. functions that don't change anything should be fine to call) -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
Simon Riggs wrote: > On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote: > >> The main idea was to invert the meaning of the xid array in the snapshot >> struct - instead of storing all the xid's between xmin and xmax that are >> to be considered "in-progress", the array contained all the xid's > >> xmin that are to be considered "completed". > >> The downside is that the size of the read-only snapshot is theoretically >> unbounded, which poses a bit of a problem if it's supposed to live >> inside shared memory... > > Why do it inverted? That clearly has problems. Because it solves the problem of "spontaneously" appearing XIDs in the WAL. At least prior to 8.3 with virtual xids, a transaction might have allocated its xid long before actually writing anything to disk, and therefore long before this XID ever shows up in the WAL. And with a non-inverted snapshot such an XID would be considered to be "completed" by transactions on the slave... So, one either needs to periodically log a snapshot on the master or log XID allocations which both seem to cause considerable additional load on the master. With an inverted snapshot, it's sufficient to log the current RecentXmin - a value that is readily available on the master, and therefore the cost amounts to just one additional 4-byte field per xlog entry. regards, Florian Pflug
Simon Riggs wrote: > On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote: > >> The current read-only snapshot (where "current" means the >> corresponding state on the master at the time the last replayed WAL >> record was generated) was maintained in shared memory. Its xmin field >> was continually updated with the (newly added) XLogRecord.xl_xmin >> field, which contained the xid of the oldest running query on the >> master, with a pruning step after each ReadOnlySnapshot.xmin update to >> remove all entries < xmin from the xid array. If a commit was seen for >> an xid, that xid was added to the ReadOnlySnapshot.xid array. >> >> The advantage of this concept is that it handles snapshotting on the >> slave without too much additional work for the master (The only change >> is the addition of the xl_xmin field to XLogRecord). It especially >> removes the need to track ShmemVariableCache->nextXid. > > Snapshots only need to know which transactions are currently "running" > during WAL apply. The standby can't remove any tuples itself, so it > doesn't need to know what the master's OldestXmin is. I used the logged xmin value to track the shared snapshot's xmin, which in turn allowed me to prune the xid array, eliminating all xids < that xmin. regards, Florian Pflug
On Mon, 2008-09-15 at 19:20 +0100, Florian G. Pflug wrote: > Simon Riggs wrote: > > On Sat, 2008-09-13 at 10:48 +0100, Florian G. Pflug wrote: > > > >> The main idea was to invert the meaning of the xid array in the snapshot > >> struct - instead of storing all the xid's between xmin and xmax that are > >> to be considered "in-progress", the array contained all the xid's > > >> xmin that are to be considered "completed". > > > >> The downside is that the size of the read-only snapshot is theoretically > >> unbounded, which poses a bit of a problem if it's supposed to live > >> inside shared memory... > > > > Why do it inverted? That clearly has problems. > > Because it solves the problem of "spontaneously" appearing XIDs in the > WAL. At least prior to 8.3 with virtual xids, a transaction might have > allocated its xid long before actually writing anything to disk, and > therefore long before this XID ever shows up in the WAL. And with a > non-inverted snapshot such an XID would be considered to be "completed" > by transactions on the slave... So, one either needs to periodically log > a snapshot on the master or log XID allocations which both seem to cause > considerable additional load on the master. With an inverted snapshot, > it's sufficient to log the current RecentXmin - a value that is readily > available on the master, and therefore the cost amounts to just one > additional 4-byte field per xlog entry. I think I understand what you're saying now, though I think it mostly/only applies before your brilliant idea in 8.3. If we have a transaction history that looks like this: ReadA, WriteB, WriteA (both commit in either order) then pre-8.3 we would have xidA < xidB, whereas at 8.3 and above we see that xidA is actually higher than xidB. Now, TransactionIds are assigned in the order of their first page write and *not* in the order of transaction start as was previously the case, which isn't very obvious. So when we replay WAL, we know that WAL is only written with locks held, so that WriteA and WriteB must be independent of each other. So that means the locks held by ReadA can be ignored and we can assume that the above history is the same as WriteB, WriteA. So if we took a snapshot in the middle of WriteB we would be safe to say that only transaction B was in progress and that transaction A was not yet started. So the snapshot we derive on the standby is different to the one we would have derived on the client, yet the serializable order is the same. In general, this means that all reads on a transaction prior to the first write can be reordered later so that they can be assumed to occur exactly prior to the first write of a transaction. (Please shoot me down, if incorrect). So when we see the first WAL record of a transaction we know that there are no in progress transactions with a *lower* xid that we have not yet seen. So we cannot be confused about whether a transaction is in-progress, or not. Almost. Now having written all of that I see there is an obvious race condition between assignment of an xid and actions that result in the acquisition of WALInsertLock. So even though the above seems mostly correct, there is still a gap to plug, but a much smaller one. So when writing the first WAL record for xid=48, it is possible that xid=47 has just been assigned and is also just about to write a WAL record. Thus the absence of a WAL record for xid=47 is not evidence that xid=47 is complete because it was read-only.
We might handle this with the inverted technique you describe, but there should be an easier way to track dense packing of the xid sequence. We expect xids to be written to WAL in the order assigned, so we might check whether a newly assigned xid is 1 higher than the last highest value to have been inserted into WAL. If it is not, then we can write a short WAL record to inform readers of WAL that the missing xids in sequence are in progress also. So readers of WAL will "see" xids in the correct sequence and are thus able to construct valid snapshots direct from WAL. I think I should measure how often that occurs to see what problem or overhead this might cause, if any. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
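A sketch of that gap check: when a transaction inserts its first WAL record, compare its xid against the last xid seen in WAL and, if some xids were skipped (presumably still read-only), log a short record marking them as in progress. The names are hypothetical and xid wraparound is ignored for simplicity.

    #include "postgres.h"
    #include "access/transam.h"     /* TransactionIdIsValid, InvalidTransactionId */

    static TransactionId lastXidInWAL = InvalidTransactionId;

    extern void XLogXidGapRecord(TransactionId first, TransactionId last);  /* assumed */

    /* Called (hypothetically) when a transaction writes its first WAL record. */
    static void
    NoteFirstWALUseOfXid(TransactionId xid)
    {
        if (TransactionIdIsValid(lastXidInWAL) && xid != lastXidInWAL + 1)
        {
            /* xids lastXidInWAL+1 .. xid-1 were assigned but have written
             * nothing yet: tell WAL readers they are still in progress. */
            XLogXidGapRecord(lastXidInWAL + 1, xid - 1);
        }
        lastXidInWAL = xid;
    }

Measuring how often the gap record would actually be emitted, as suggested above, would show whether the extra WAL traffic matters in practice.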
Simon Riggs wrote: > 2. Master ignores Standby's OldestXmin > Effects: > * Long running queries on standby... > Have no effect on primary > Can delay apply of WAL records on standby > * Queries on standby give consistent answers in all cases. Just for clarification, if you set a max_slave_delay it means it is the maximum amount of time WAL replay can be delayed on the slave, _and_ it is the maximum amount of time a query/snapshot can be guaranteed to run without the possibility of being canceled. My point is that these two concepts are linked to the same setting. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote: > Simon Riggs wrote: > > 2. Master ignores Standby's OldestXmin > > Effects: > > * Long running queries on standby... > > Have no effect on primary > > Can delay apply of WAL records on standby > > * Queries on standby give consistent answers in all cases. > > Just for clarification, if you set a max_slave_delay it means it is the > maximum amount of time WAL replay can be delayed on the slave, _and_ it > is the maximum amount of time a query/snapshot can be guaranteed to run > without the possibility of being canceled. My point is that these two > concepts are linked to the same setting. FYI, max_slave_delay does not relate to the amount of time of data loss in the case of master failure, assuming the WAL files are stored on the slave in some way during the delay. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Wed, 2008-09-24 at 21:19 -0400, Bruce Momjian wrote: > Simon Riggs wrote: > > 2. Master ignores Standby's OldestXmin > > Effects: > > * Long running queries on standby... > > Have no effect on primary > > Can delay apply of WAL records on standby > > * Queries on standby give consistent answers in all cases. > > Just for clarification, if you set a max_slave_delay it means it is the > maximum amount of time WAL replay can be delayed on the slave, _and_ it > is the maximum amount of time a query/snapshot can be guaranteed to run > without the possibility of being canceled. My point is that these two > concepts are linked to the same setting. I didn't even see them as separate, but now you mention it they could be. Startup process may need to wait up to max_slave_delay before applying WAL, if it sees that it must cancel a query to continue. max_slave_delay is set in postgresql.conf and honoured by the startup process. What is not yet defined is whether max_slave_delay is a per-blockage-event setting, or a total time that could occur because of a single long blockage or many smaller ones. The latter makes more sense, but it would only be sensible to calculate when we have WAL streaming, using calculations similar to the bgwriter delay calculations. That definition makes sense for the sysadmin because it is the amount of extra time a standby could take to start up after a failover. If we take max_slave_delay to mean the latter then there is no guarantee for user queries, since we may already have used up all our time waiting on one query and there may be no wait time left for the current user query. It would depend heavily on the rows accessed, so queries might easily run to completion even though they exceed the max_slave_delay. That makes sense because if you are asking a question like "what is the current total of widgets available" you understand that answer changes quickly over time, whereas the "how many widgets were reordered on day X" doesn't change over time at all. If you define this as a "per wait event" setting it provides a guarantee to the user queries, but also puts WAL apply into free-fall since you can't control the number of blockage points queries might cause. From what Merlin was saying, it would be sensible to have multiple slaves: one with a max_slave_delay of 30 seconds to be your HA fast-startup standby and another where we set max_slave_delay to 5 hours to guarantee execution time for large reporting queries. Anyway, it's clear that max_slave_delay needs to be settable while running. From my perspective, all I can say is "all things are possible" and I appreciate the need for options to satisfy different use cases. I'll build the basic mechanics and then we can add fine tuning over time. So I'll get the basic mechanics in place, suggest we observe how that works and then decide on the control mechanisms because they sound relatively simple to add. Another important note is Hannu's suggestion of using snapshot filesystems with Postgres. We can include that feature very quickly and it will complement Hot Standby very well. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
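A trivial sketch contrasting the two interpretations of max_slave_delay discussed here: a budget per blockage event versus one shared budget across all blockages. Both the GUC and the boolean switch are assumptions made only to illustrate the difference.

    #include "postgres.h"

    extern int  max_slave_delay;        /* milliseconds, as discussed above */
    extern bool per_event_delay;        /* assumed switch between the two meanings */

    /* Does the startup process still have waiting budget left? */
    static bool
    MayKeepWaiting(int waited_this_event_ms, int waited_total_ms)
    {
        if (per_event_delay)
            return waited_this_event_ms < max_slave_delay;  /* guarantee per conflict */
        else
            return waited_total_ms < max_slave_delay;       /* bound on total failover lag */
    }

Only the cumulative form bounds how far behind the standby can fall, which is why it is the one that matters for failover time.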
> Simon Riggs wrote: > > 2. Master ignores Standby's OldestXmin > > Effects: > > * Long running queries on standby... > > Have no effect on primary > > Can delay apply of WAL records on standby > > * Queries on standby give consistent answers in all cases. > > Just for clarification, if you set a max_slave_delay it means it is the > maximum amount of time WAL replay can be delayed on the slave, _and_ it > is the maximum amount of time a query/snapshot can be guaranteed to run > without the possibility of being canceled. My point is that these two > concepts are linked to the same setting. I wonder whether the cancel can be delayed until a tuple/page is actually accessed that shows a too new xid. The procedure would be like this: Instead of cancel, the backend gets a message with a lsn_horizon. From there on, whenever the backend reads a page/tuple with a LSN > lsn_horizon it cancels. I think that should allow some more queries to complete. Especially such that target static tables, or static parts of large tables using appropriate index btree ranges that are also static. Andreas
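A sketch of the check described here, as it might run in a standby backend each time it reads a buffer: instead of being cancelled outright, the backend remembers a conflict LSN horizon and cancels itself only on actually touching a newer page. PageGetLSN is real; the horizon variables and the signal that would set them are assumptions.

    #include "postgres.h"
    #include "access/xlogdefs.h"    /* XLogRecPtr, XLByteLT */
    #include "storage/bufpage.h"    /* Page, PageGetLSN */

    /* Set (hypothetically) by a signal from the startup process instead of
     * an immediate cancel; no conflict is pending until the flag is set. */
    static XLogRecPtr conflict_lsn_horizon;
    static bool       conflict_pending = false;

    static void
    CheckPageAgainstHorizon(Page page)
    {
        XLogRecPtr page_lsn = PageGetLSN(page);

        /* Cancel only if the page was modified past the query's safe horizon. */
        if (conflict_pending && XLByteLT(conflict_lsn_horizon, page_lsn))
            ereport(ERROR,
                    (errmsg("canceling statement: page was modified by WAL replay "
                            "past this query's snapshot horizon")));
    }

Queries that only touch static tables, or static index ranges of large tables, would never trip the check and could run to completion, which is exactly the benefit claimed above.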
On Thu, 2008-09-25 at 11:14 +0200, Zeugswetter Andreas OSB sIT wrote: > > Simon Riggs wrote: > > > 2. Master ignores Standby's OldestXmin > > > Effects: > > > * Long running queries on standby... > > > Have no effect on primary > > > Can delay apply of WAL records on standby > > > * Queries on standby give consistent answers in all cases. > > > > Just for clarification, if you set a max_slave_delay it means it is the > > maximum amount of time WAL replay can be delayed on the slave, _and_ it > > is the maximum amount of time a query/snapshot can be guaranteed to run > > without the possibility of being canceled. My point is that these two > > concepts are linked to the same setting. > > I wonder whether the cancel can be delayed until a tuple/page is actually accessed > that shows a too new xid. Yes, it's feasible and is now part of the design. This is all about what happens *if* we need to remove rows that a query can still see. We might also make HOT and VACUUM slightly less aggressive at removing rows on the master. We can't use anything later than OldestXmin for row removal, but we might choose to use something earlier. That would be an alternative to using the standby's OldestXmin. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wed, 2008-09-24 at 21:22 -0400, Bruce Momjian wrote: > Bruce Momjian wrote: > > Simon Riggs wrote: > > > 2. Master ignores Standby's OldestXmin > > > Effects: > > > * Long running queries on standby... > > > Have no effect on primary > > > Can delay apply of WAL records on standby > > > * Queries on standby give consistent answers in all cases. > > > > Just for clarification, if you set a max_slave_delay it means it is the > > maximum amount of time WAL replay can be delayed on the slave, _and_ it > > is the maximum amount of time a query/snapshot can be guaranteed to run > > without the possibility of being canceled. My point is that these two > > concepts are linked to the same setting. > > FYI, max_slave_delay does not relate to the amount of time of data loss > in the case of master failure, assuming the WAL files are stored on the > slave in some way during the delay. Another way to handle row removal would be to check whether any users have access to particular tables or not. If we could issue a permission to prevent access to security definer functions, then we'd be able to work out with certainty whether a row removal would ever be visible to certain users. That would allow us to isolate fast changing tables from slow changing ones. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
> > I wonder whether the cancel can be delayed until a tuple/page is actually accessed > > that shows a too new xid. > > Yes, its feasible and is now part of the design. > > This is all about what happens *if* we need to remove rows that a query > can still see. I was describing a procedure for exactly that case. If a slave backend has a snapshot that we cannot guarantee any more (because max_slave_delay has been exceeded): > > Instead of cancel, the backend gets a message with a lsn_horizon. > > From there on, whenever the backend reads a page/tuple with a LSN > lsn_horizon it cancels. but not before (at the time max_slave_delay has been reached), as described earlier. Andreas
On Thu, 2008-09-25 at 12:34 +0200, Zeugswetter Andreas OSB sIT wrote: > > > I wonder whether the cancel can be delayed until a tuple/page is actually accessed > > > that shows a too new xid. > > > > Yes, it's feasible and is now part of the design. > > > > This is all about what happens *if* we need to remove rows that a query > > can still see. > > I was describing a procedure for exactly that case. Ok, I see, sorry. > If a slave backend has a snapshot that we cannot guarantee any more > (because max_slave_delay has been exceeded): > > > > Instead of cancel, the backend gets a message with a lsn_horizon. > > > From there on, whenever the backend reads a page/tuple with a LSN > lsn_horizon it cancels. > > but not before (at the time max_slave_delay has been reached), as described earlier. Like that. OK, so in full: Each WAL record that cleans tuples has a latestxid on it. If latestxid is later than a running query on standby then we wait. When we stop waiting we tell all at-risk queries the LSN of the first WAL record that has potentially removed tuples they can see. If they see a block with a higher LSN they cancel *themselves*. This works OK, since SeqScans never read blocks at end of file that didn't exist when they started, so long queries need not be cancelled when they access growing tables. That combines all the suggested approaches into one. It still leaves the possibility of passing the standby's OldestXmin to the primary, but does not rely upon it. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Thu, 2008-09-11 at 17:58 +0300, Heikki Linnakangas wrote: > BTW, we haven't talked about how to acquire a snapshot in the slave. > You'll somehow need to know which transactions have not yet committed, > but will in the future. In the master, we keep track of in-progress > transaction in the ProcArray, so I suppose we'll need to do the same > in the slave. Very similar to prepared transactions, actually. I > believe the Abort records, which are not actually needed for normal > operation, become critical here. The slave will need to put an entry > to ProcArray for any new XLogRecord.xl_xid it sees in the WAL, and > remove the entry at a Commit and Abort record. And clear them all at a > shutdown record. Although I said differently earlier, it seems cleaner to make recovery snapshots work by emulating ProcArray entries as you suggest. Fatal errors were my problem there, but I think that's solvable now. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
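A sketch of the ProcArray emulation agreed to here: redo notes each new xl_xid as a running transaction and drops it again when a commit or abort record is replayed. The tracking functions are placeholders, not the actual implementation.

    #include "postgres.h"
    #include "access/transam.h"
    #include "access/xlog.h"        /* XLogRecord, xl_xid */

    extern bool RecoveryXidIsKnown(TransactionId xid);       /* assumed */
    extern void RecoveryAddRunningXid(TransactionId xid);    /* assumed: emulated ProcArray entry */
    extern void RecoveryRemoveRunningXid(TransactionId xid); /* assumed */

    /* Called for every record replayed by the startup process (sketch only);
     * a shutdown checkpoint record would clear all tracked xids at once. */
    static void
    RecoveryTrackXid(XLogRecord *record, bool is_commit_or_abort)
    {
        TransactionId xid = record->xl_xid;

        if (!TransactionIdIsValid(xid))
            return;                              /* record not tied to a transaction */

        if (is_commit_or_abort)
            RecoveryRemoveRunningXid(xid);       /* transaction has finished */
        else if (!RecoveryXidIsKnown(xid))
            RecoveryAddRunningXid(xid);          /* first time this xid appears in WAL */
    }

With this in place, a standby snapshot is simply the set of tracked xids at the moment the last WAL record was applied, which is what makes locally derived snapshots workable without asking the primary.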