Thread: Hot standby and b-tree killed items
Whenever a B-tree index scan fetches a heap tuple that turns out to be dead, the B-tree item is marked as killed by calling _bt_killitems. When the page gets full, all the killed items are removed by calling _bt_vacuum_one_page. That's a problem for hot standby. If any of the killed b-tree items point to a tuple that is still visible to a running read-only transaction, we have the same situation as with vacuum, and have to either wait for the read-only transaction to finish before applying the WAL record or kill the transaction. It looks like there are some cosmetic changes related to that in the patch; the signature of _bt_delitems is modified, but there are no actual changes that would handle that situation. I didn't see it on the TODO on the hot standby wiki either. Am I missing something, or the patch? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2008-12-19 at 10:49 +0200, Heikki Linnakangas wrote: > Whenever a B-tree index scan fetches a heap tuple that turns out to be > dead, the B-tree item is marked as killed by calling _bt_killitems. When > the page gets full, all the killed items are removed by calling > _bt_vacuum_one_page. > > That's a problem for hot standby. If any of the killed b-tree items > point to a tuple that is still visible to a running read-only > transaction, we have the same situation as with vacuum, and have to > either wait for the read-only transaction to finish before applying the > WAL record or kill the transaction. > > It looks like there's some cosmetic changes related to that in the > patch, the signature of _bt_delitems is modified, but there's no actual > changes that would handle that situation. I didn't see it on the TODO on > the hot standby wiki either. Am I missing something, or the patch? ResolveRedoVisibilityConflicts() describes the current patch's position on this point, which on review is wrong, I agree. It looks like I assumed that _bt_delitems is only called during VACUUM, even though I knew that isn't the case. I know I was going to split XLOG_BTREE_VACUUM into two record types at one point, one for delete, one for vacuum. In the end I didn't. Anyhow, it's wrong. We have infrastructure in place to make this work correctly, just need to add latestRemovedXid field to xl_btree_vacuum. So that part is easily solved. Thanks for spotting it. More like that please! -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
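As a rough illustration of the record change being proposed here: the exact layout below is an assumption, not the committed patch; the point is only that the btree cleanup record gains a field that redo can compare against standby snapshots, just as it does for heap vacuum records.

    typedef struct xl_btree_vacuum
    {
        RelFileNode     node;               /* index being cleaned */
        BlockNumber     block;              /* page the killed items are removed from */
        TransactionId   latestRemovedXid;   /* proposed: newest xid whose heap tuples
                                             * the removed index items could still
                                             * point to; redo waits for, or cancels,
                                             * standby queries older than this */
        /* TARGET OFFSET NUMBERS FOLLOW AT THE END */
    } xl_btree_vacuum;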
Simon Riggs wrote: > We have infrastructure in place to make this work correctly, just need > to add latestRemovedXid field to xl_btree_vacuum. So that part is easily > solved. That's tricky because there's no xmin/xmax on index tuples. You could conservatively use OldestXmin as latestRemovedXid, but that could stall the WAL redo a lot more than necessary. Or you could store latestRemovedXid in the page header, but that would need to be WAL-logged to ensure that it's valid after crash. Or you could look at the heap to fetch the xmin/xmax, but that would be expensive. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 2008-12-19 at 12:24 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > We have infrastructure in place to make this work correctly, just need > > to add latestRemovedXid field to xl_btree_vacuum. So that part is easily > > solved. > > That's tricky because there's no xmin/xmax on index tuples. Doh. > You could > conservatively use OldestXmin as latestRemovedXid, but that could stall > the WAL redo a lot more than necessary. Or you could store > latestRemovedXid in the page header, but that would need to be > WAL-logged to ensure that it's valid after crash. Or you could look at > the heap to fetch the xmin/xmax, but that would be expensive. Agreed. Probably need to use OldestXmin then. If I was going to add anything to the btree page header, it would be latestRemovedLSN, only set during recovery. That way we don't have to explicitly kill queries, we can do a wait on OldestXmin then let them ERROR out when they find a page that has been modified. I have a suspicion that we may need some modification of that solution for all data blocks, so we don't kill too many queries. Hmmm. I wonder if we can track latestRemovedLSN for all of shared_buffers. That was initially rejected, but if we set the latestRemovedLSN to be the block's LSN when we read it in, that would be fairly useful. Either way we use 8 bytes of RAM per buffer. BTW, I noticed the other day that Oracle 11g only allows you to have a read only slave *or* allows you to continue replaying. You need to manually switch back and forth between those modes. They can't do *both*, as Postgres will be able to do. That's because their undo information is stored off-block in the Undo Tablespace, so is not available for standby queries. Nice one, Postgres. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
I'm confused; shouldn't read-only transactions on the slave just be hacked to not set any hint bits, including lp_delete? -- Greg On 19 Dec 2008, at 03:49, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Whenever a B-tree index scan fetches a heap tuple that turns out to > be dead, the B-tree item is marked as killed by calling > _bt_killitems. When the page gets full, all the killed items are > removed by calling > _bt_vacuum_one_page. > > That's a problem for hot standby. If any of the killed b-tree items > point to a tuple that is still visible to a running read-only > transaction, we have the same situation as with vacuum, and have to > either wait for the read-only transaction to finish before applying > the WAL record or kill the transaction. > > It looks like there's some cosmetic changes related to that in the > patch, the signature of _bt_delitems is modified, but there's no > actual changes that would handle that situation. I didn't see it on > the TODO on the hot standby wiki either. Am I missing something, or > the patch? > > -- > Heikki Linnakangas > EnterpriseDB http://www.enterprisedb.com
On Fri, 2008-12-19 at 09:22 -0500, Greg Stark wrote: > I'm confused shouldn't read-only transactions on the slave just be > hacked to not set any hint bits including lp_delete? They could be, though I see no value in doing so. But that is not Heikki's point. He is discussing what happens on the primary and the effects that must then occur on the standby. He has rightly pointed out a (pluggable) hole in my logic. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
>>> Simon Riggs <simon@2ndQuadrant.com> wrote: > If I was going to add anything to the btree page header, it would be > latestRemovedLSN, only set during recovery. That way we don't have to > explicitly kill queries, we can do the a wait on OldestXmin then let > them ERROR out when they find a page that has been modified. > > I have a suspicion that we may need some modification of that solution > for all data blocks, so we don't kill too many queries. If the failure is caused by the timing of various database transactions, and the query is likely to run successfully after a delay and a retry, please use SQLSTATE of '40001'. Some software (ours, for one) will recognize this and retry the query automatically, so that the user impact is essentially the same as blocking. -Kevin
On Fri, 2008-12-19 at 11:54 -0600, Kevin Grittner wrote: > >>> Simon Riggs <simon@2ndQuadrant.com> wrote: > > > If I was going to add anything to the btree page header, it would be > > latestRemovedLSN, only set during recovery. That way we don't have > to > > explicitly kill queries, we can do the a wait on OldestXmin then let > > them ERROR out when they find a page that has been modified. > > > > I have a suspicion that we may need some modification of that > solution > > for all data blocks, so we don't kill too many queries. > > If the failure is caused by the timing of various database > transactions, and the query is likely to run successfully after a > delay and a retry, please use SQLSTATE of '40001'. Some software > (ours, for one) will recognize this and retry the query automatically, > so that the user impact is essentially the same as blocking. I understand the need, but we won't be using SQLSTATE = 40001. That corresponds to ERRCODE_T_R_SERIALIZATION_FAILURE, which that error would not be. The error message ought to be "snapshot too old", which could raise a chuckle, so I called it something else. The point you raise is a good one and I think we should publish a list of retryable error messages. I once contemplated proposing a special log level for a retryable error, but decided it wasn't quite a good idea. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Fri, 2008-12-19 at 10:52 +0000, Simon Riggs wrote: > > You could > > conservatively use OldestXmin as latestRemovedXid, but that could stall > > the WAL redo a lot more than necessary. Or you could store > > latestRemovedXid in the page header, but that would need to be > > WAL-logged to ensure that it's valid after crash. Or you could look at > > the heap to fetch the xmin/xmax, but that would be expensive. > > Agreed. Probably need to use OldestXmin then. Just finished coding this up, plus TODO item to pin every index page. Will post after some further testing. Used RecentOldestXmin. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
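A rough sketch of the write-side stamping this implies, assuming the record layout sketched earlier and the RecentOldestXmin value mentioned above (illustrative only; variable and field names may differ in the actual patch):

    /* in _bt_delitems, when building the WAL record for the removed items */
    xlrec.node = rel->rd_node;
    xlrec.block = BufferGetBlockNumber(buf);
    xlrec.latestRemovedXid = RecentOldestXmin;  /* conservative: index tuples carry
                                                 * no xmin/xmax, so assume the worst */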
>>> Simon Riggs <simon@2ndQuadrant.com> wrote: > I understand the need, but we won't be using SQLSTATE = 40001. > > That corresponds to ERRCODE_T_R_SERIALIZATION_FAILURE, which that error > would not be. Isn't it a problem with serialization of database transactions? You hit it in a different way, but if it is a temporary failure due to the timing of the transactions, I strongly feel that that is the correct SQLSTATE to use. Perhaps more information to provide any useful context could be in the info or hint areas? -Kevin
Simon Riggs <simon@2ndQuadrant.com> writes: > The error message ought to be "snapshot too old", which could raise a > chuckle, so I called it something else. > > The point you raise is a good one and I think we should publish a list > of retryable error messages. I contemplated once proposing a special log > level for a retryable error, but not quite a good idea. I'm a bit concerned about the idea of killing off queries to allow WAL to proceed. While I have nothing against that being an option I think we should be aiming to make it not necessary for correctness and not the default. By default I think WAL replay should stick to stalling WAL replay and only resort to killing queries if the user specifically requests it. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
On Fri, 2008-12-19 at 18:59 +0000, Gregory Stark wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > > > The error message ought to be "snapshot too old", which could raise a > > chuckle, so I called it something else. > > > > The point you raise is a good one and I think we should publish a list > > of retryable error messages. I contemplated once proposing a special log > > level for a retryable error, but not quite a good idea. > > I'm a bit concerned about the idea of killing off queries to allow WAL to > proceed. While I have nothing against that being an option I think we should > be aiming to make it not necessary for correctness and not the default. By > default I think WAL replay should stick to stalling WAL replay and only resort > to killing queries if the user specifically requests it. Increasing the waiting time increases the failover time and thus decreases the value of the standby as an HA system. Others value high availability higher than you do, and so we had agreed to provide an option to allow the max waiting time to be set. max_standby_delay is set in recovery.conf, value 0 (forever) - 2,000,000 secs, settable in milliseconds. So think of it like a deadlock detector for recovery apply. Also, there is a set of functions to control the way recovery proceeds, much as you might control an MP3 player (start, stop, pause). There are also functions to pause at a specific xid, pause at a specific time, or pause at the next cleanup record. That allows you to set max_standby_delay lower and then freeze the server for longer to run a long query if required. It also allows you to do PITR by trial and error rather than one-shot specify-in-advance settings. There is a function to manually end recovery at a useful place if desired. I hope your needs and wishes are catered for by that? (I have a Plan B in case we need it during wider user testing, as explained up thread.) -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
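As a concrete illustration of the knob described above (the syntax and values here are assumptions based on this description, not documented settings), a standby's recovery.conf under this patch might contain:

    # recovery.conf on the standby -- illustrative example
    restore_command = 'cp /archive/%f %p'
    # Cancel conflicting standby queries once they have blocked WAL replay
    # for 5 seconds; 0 means wait forever, maximum is 2,000,000 seconds.
    max_standby_delay = 5000        # milliseconds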
>>> Simon Riggs <simon@2ndQuadrant.com> wrote: > max_standby_delay is set in recovery.conf, value 0 (forever) - 2,000,000 > secs, settable in milliseconds. So think of it like a deadlock detector > for recovery apply. Aha! A deadlock is a type of serialization failure. (In fact, on databases with lock-based concurrency control rather than MVCC, it can be the ONLY type of serialization failure.) -Kevin
On Fri, 2008-12-19 at 13:47 -0600, Kevin Grittner wrote: > >>> Simon Riggs <simon@2ndQuadrant.com> wrote: > > > max_standby_delay is set in recovery.conf, value 0 (forever) - > 2,000,000 > > secs, settable in milliseconds. So think of it like a deadlock > detector > > for recovery apply. > > Aha! A deadlock is a type of serialization failure. (In fact, on > databases with lock-based concurrency control rather than MVCC, it can > be the ONLY type of serialization failure.) The SQL Standard specifically names this error as thrown when "it detects the inability to guarantee the serializability of two or more concurrent SQL-transactions". Now that really should only apply when running with SERIALIZABLE transactions, but I grant you the standard doesn't explicitly say that. You give me the strange sense that you want this because of some quirk in your software, rather than an overwhelming desire to see these two situations described the same. I guess making it that SQLSTATE would make it simpler to understand why the error occurs and also how to handle it (i.e. resubmit). So there probably is a wider argument for making developers' jobs a little easier by doing it; i.e., usability will be improved if we do that. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes: > Increasing the waiting time increases the failover time and thus > decreases the value of the standby as an HA system. Others value high > availability higher than you and so we had agreed to provide an option > to allow the max waiting time to be set. Sure, it's a nice option to have. But I think the default should be to pause WAL replay. The question I had was whether your solution for btree pointers marked dead and later dropped from the index works when the user hasn't configured a timeout and doesn't want standby queries killed. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >>>> Simon Riggs <simon@2ndQuadrant.com> wrote: > >> max_standby_delay is set in recovery.conf, value 0 (forever) - > 2,000,000 >> secs, settable in milliseconds. So think of it like a deadlock > detector >> for recovery apply. > > Aha! A deadlock is a type of serialization failure. (In fact, on > databases with lock-based concurrency control rather than MVCC, it can > be the ONLY type of serialization failure.) I think the fundamental difference is that a deadlock or serialization failure can be predicted as a potential problem when writing the code. This is something that can happen for any query any time, even plain old read-only select queries. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
>>> Simon Riggs <simon@2ndQuadrant.com> wrote: > The SQL Standard specifically names this error as thrown when "it > detects the inability to guarantee the serializability of two or more > concurrent SQL-transactions". Now that really should only apply when > running with SERIALIZABLE transactions, I disagree. Database integrity could not be guaranteed without detection of conflicting modifications in READ COMMITTED on up, and this is the normal means of indicating these problems. > but I grant you the standard doesn't explicitly say that. I think that's intentional. > You give me the strange sense that you want this because of some quirk > in your software, rather than an overwhelming desire to see these two > situations described the same. Well, we are very unlikely to ever use this feature, so it's not really something I care about for us; it just struck me that there may be others who care about categorizing errors accurately according to the SQL standard, and that what you were describing sounded like a new type of serialization failure in the PostgreSQL environment, and should be classified that way. The primary quirkiness of our software is that it needs to be able to run with a number of different database products, and we do want to take advantage of whatever information is available in a portable format. This is not the only standard SQLSTATE we look for and handle appropriately for the documented meaning, but it is an important one, as it has simplified application programming and reduced the confusing error messages which reach our end users. > I guess making it that SQLSTATE would make it simpler to understand why > the error occurs and also how to handle it (i.e. resubmit). Precisely. -Kevin
>>> Gregory Stark <stark@enterprisedb.com> wrote: > I think the fundamental difference is that a deadlock or serialization > failure > can be predicted as a potential problem when writing the code. This is > something that can happen for any query any time, even plain old read-only > select queries. I've heard that on Oracle it is (or at least was) possible to get a serialization failure on a single SELECT statement which was the only user-requested activity on the system, because it could conflict with automatic maintenance operations. In Sybase and Microsoft databases it is definitely possible for a plain old read-only SELECT statement to be a deadlock victim (reported as a serialization failure) if some of the data it is referencing is being updated concurrently. In these (and many other) products, a lock must be acquired before a row can be read. Imagine, the SELECT locks a row against updates, another transaction locks some other row against any access, then the UPDATE tries to change the row locked by the SELECT while the SELECT tries to read the row locked by the UPDATE. PostgreSQL is much less prone to serialization failures, but it is certainly understandable if hot standby replication introduces new cases of it. -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > PostgreSQL is much less prone to serialization failures, but it is > certainly understandable if hot standby replication introduces new > cases of it. In this case it will be possible to get this error even if you're just running a single SELECT query -- and that's the *only* query in the database at all. A vacuum being replayed -- even in a different database -- could trigger the error. Or with the btree split issue, a data load -- again even in a different database -- would be quite likely cause your SELECT to be killed. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB'sPostgreSQL training!
>>> Gregory Stark <stark@enterprisedb.com> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > >> PostgreSQL is much less prone to serialization failures, but it is >> certainly understandable if hot standby replication introduces new >> cases of it. > > In this case it will be possible to get this error even if you're just > running > a single SELECT query -- and that's the *only* query in the database at all. > > A vacuum being replayed -- even in a different database -- could trigger the > error. Or with the btree split issue, a data load -- again even in a different > database -- would be quite likely cause your SELECT to be killed. OK. Does that make serialization failure a bad description of the problem? If these steps are serialized (run one after the other), is there a problem? It just seems that the hot standby near-synchronous replication creates a situation where tasks on multiple, linked databases might need to be serialized. It does seem like it will be important to provide as much information as possible to the user about what's causing the problem, and hints about what to do. PostgreSQL has nice features for that, though. Since I have no vested interest here, I'm not inclined to belabor the point. I was really just trying to make sure the feature was as useful as possible to others, some of whom might be looking for standard SQLSTATE values to help the software take the right course. If others feel the 40001 code would confuse more than enlighten, I'll respect that. -Kevin
On Friday 19 December 2008 05:52:42 Simon Riggs wrote: > BTW, I noticed the other day that Oracle 11g only allows you to have a > read only slave *or* allows you to continue replaying. You need to > manually switch back and forth between those modes. They can't do > *both*, as Postgres will be able to do. That's because their undo > information is stored off-block in the Undo Tablespace, so is not > available for standby queries. Nice one, Postgres. > I think this is true for physical replay, but Oracle also offers the option to do logical replay (where transaction logs are converted into SQL and run against the standby; I believe this is similar to what Continuent is trying to do with their latest offering). In that scenario you can do read and replay at the same time, though I think there are some conflicts possible; fewer than what Postgres will have, since I think most of their DDL can be done online. (This might require some extra modules / a high-end version of Oracle; please consult your local Oracle wizard for more details.) -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com
On Fri, 2008-12-19 at 20:54 +0000, Gregory Stark wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > > > PostgreSQL is much less prone to serialization failures, but it is > > certainly understandable if hot standby replication introduces new > > cases of it. > > In this case it will be possible to get this error even if you're just running > a single SELECT query -- and that's the *only* query in the database at all. > > A vacuum being replayed -- even in a different database -- could trigger the > error. Or with the btree split issue, a data load -- again even in a different > database -- would be quite likely cause your SELECT to be killed. Quite likely? "You're all doomed I say!", his eyes rolling wildly. :-) The standby is an extension of the primary and is quite literally running the same transactions. This "only query" idea isn't the right way to think about it. It's fairly easy to predict that it will happen, and it will happen only in the same database as the transactions on the primary. And as we just said, you can control whether and/or after how long this will happen in some detail. Industry context: In the worst case this is as bad as Oracle 11g. In many/most cases it is much better. Perhaps we should listen to the people that have said they don't want queries cancelled, even if the alternative is inconsistent answers. That is easily possible yet is not currently an option. Plus we have the option I referred to up thread, which is to defer query cancel until the query reads a modified data block. I'm OK with implementing either of those, as non-default options. Do we need those options or are we OK? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Fri, 2008-12-19 at 19:29 -0500, Robert Treat wrote: > On Friday 19 December 2008 05:52:42 Simon Riggs wrote: > > BTW, I noticed the other day that Oracle 11g only allows you to have a > > read only slave *or* allows you to continue replaying. You need to > > manually switch back and forth between those modes. They can't do > > *both*, as Postgres will be able to do. That's because their undo > > information is stored off-block in the Undo Tablespace, so is not > > available for standby queries. Nice one, Postgres. > > > > I think this is true for physical replay, but Oracle also offers the option to > do logical replay (where transaction logs are converted into sql and run > against the standby; i believe this is similar to what continuant is trying > to do with thier latest offering). In that scenario you can do read and > replay at the same time, though I think there are some conflicts possible; > fewer than what postgres will have, since I think most of thier DDL can be > done online. That is also an option I have argued that we need, BTW. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Gregory Stark wrote: > Simon Riggs <simon@2ndQuadrant.com> writes: > >> Increasing the waiting time increases the failover time and thus >> decreases the value of the standby as an HA system. Others value high >> availability higher than you and so we had agreed to provide an option >> to allow the max waiting time to be set. > > Sure, it's a nice option to have. But I think the default should be to pause > WAL replay. I think I agree that pausing should be the default. If for no other reason, because I can't think of a good default for max_standby_delay. It would be nice to have a setting to specify the max. amount of unapplied WAL before killing queries. When the primary isn't doing much, you might want wait longer before killing queries, and if you're falling behind a lot, you might want to kill queries more aggressively to catch up. I guess that doesn't quite fit the current architecture; you'd need to peek ahead to see how much unapplied WAL there is. > The question I had was whether your solution for btree pointers marked dead > and later dropped from the index works when the user hasn't configured a > timeout and doesn't want standby queries killed. Yes, it's not any different from vacuum WAL records. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Gregory Stark wrote: >> The question I had was whether your solution for btree pointers marked >> dead >> and later dropped from the index works when the user hasn't configured a >> timeout and doesn't want standby queries killed. > > Yes, it's not any different from vacuum WAL records. No wait, there is a nasty corner-case. When an index tuple is marked as killed, no WAL record is written. Since there's no WAL record, it won't be killed in the slave yet. But if we take a full-page image of that page later for some other operation, the LP_DEAD flag is included in the full-page image. If the flag sneaks into the slave without an explicit WAL record like that, there's no latestRemovedXid for the slave to wait on. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sat, 2008-12-20 at 09:21 +0200, Heikki Linnakangas wrote: > Gregory Stark wrote: > > Simon Riggs <simon@2ndQuadrant.com> writes: > > > >> Increasing the waiting time increases the failover time and thus > >> decreases the value of the standby as an HA system. Others value high > >> availability higher than you and so we had agreed to provide an option > >> to allow the max waiting time to be set. > > > > Sure, it's a nice option to have. But I think the default should be to pause > > WAL replay. > > I think I agree that pausing should be the default. If for no other > reason, because I can't think of a good default for max_standby_delay. I would rather err on the side of caution. If we do as you suggest, somebody will lose their database and start shouting "stupid default". So I would suggest we set it to say 5 seconds to start with and let people that read the manual set it higher, or at least read the manual after they receive their first query cancellation. > It would be nice to have a setting to specify the max. amount of > unapplied WAL before killing queries. Agreed. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs wrote: > On Sat, 2008-12-20 at 09:21 +0200, Heikki Linnakangas wrote: >> Gregory Stark wrote: >>> Simon Riggs <simon@2ndQuadrant.com> writes: >>> >>>> Increasing the waiting time increases the failover time and thus >>>> decreases the value of the standby as an HA system. Others value high >>>> availability higher than you and so we had agreed to provide an option >>>> to allow the max waiting time to be set. >>> Sure, it's a nice option to have. But I think the default should be to pause >>> WAL replay. >> I think I agree that pausing should be the default. If for no other >> reason, because I can't think of a good default for max_standby_delay. > > I would rather err on the side of caution. If we do as you suggest, > somebody will lose their database and start shouting "stupid default". Even if we stop applying the WAL, it should still be archived safely, right? So no data should be lost, although the standby can fall very much behind, and it can take a while to catch up. > So I would suggest we set it to say 5 seconds to start with and let > people that read the manual set it higher, or at least read the manual > after they receive their first query cancellation. I don't feel strongly either way... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote: > A vacuum being replayed -- even in a different database -- could trigger the > error. Or with the btree split issue, a data load -- again even in a different > database -- would be quite likely cause your SELECT to be killed. Hmm, I wonder if we should/could track the "latestRemovedXid" separately for each database. There's no reason why we need to kill a read-only query in database X when a table in database Y is vacuumed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Gregory Stark wrote: >> A vacuum being replayed -- even in a different database -- could trigger the >> error. Or with the btree split issue, a data load -- again even in a different >> database -- would be quite likely cause your SELECT to be killed. > > Hmm, I wonder if we should/could track the "latestRemovedXid" separately > for each database. There's no reason why we need to kill a read-only > query in database X when a table in database Y is vacuumed. What about shared catalogs? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > Heikki Linnakangas wrote: >> Gregory Stark wrote: >>> A vacuum being replayed -- even in a different database -- could trigger the >>> error. Or with the btree split issue, a data load -- again even in a different >>> database -- would be quite likely cause your SELECT to be killed. >> Hmm, I wonder if we should/could track the "latestRemovedXid" separately >> for each database. There's no reason why we need to kill a read-only >> query in database X when a table in database Y is vacuumed. > > What about shared catalogs? True, vacuums on shared catalogs would affect read-only queries on all databases. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > Alvaro Herrera wrote: >> Heikki Linnakangas wrote: >>> Gregory Stark wrote: >>>> A vacuum being replayed -- even in a different database -- could trigger the >>>> error. Or with the btree split issue, a data load -- again even in a different >>>> database -- would be quite likely cause your SELECT to be killed. >>> Hmm, I wonder if we should/could track the "latestRemovedXid" >>> separately for each database. There's no reason why we need to kill >>> a read-only query in database X when a table in database Y is >>> vacuumed. >> >> What about shared catalogs? > > True, vacuums on shared catalogs would affect read-only queries on all > databases. Maybe it's possible to track latestRemovedXid for each database, and additionally another counter that tracks vacuums on shared catalogs. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Sat, 2008-12-20 at 22:07 +0200, Heikki Linnakangas wrote: > Gregory Stark wrote: > > A vacuum being replayed -- even in a different database -- could trigger the > > error. Or with the btree split issue, a data load -- again even in a different > > database -- would be quite likely cause your SELECT to be killed. > > Hmm, I wonder if we should/could track the "latestRemovedXid" separately > for each database. There's no reason why we need to kill a read-only > query in database X when a table in database Y is vacuumed. Already implemented in the code; see ResolveRedoVisibilityConflicts(). -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
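A hypothetical sketch of what such per-database filtering during redo could look like; apart from the PGPROC fields, the names and structure here are illustrative rather than the actual patch code:

    /* Called while replaying a cleanup record that carries latestRemovedXid. */
    static void
    ResolveRedoVisibilityConflicts(Oid dbOid, TransactionId latestRemovedXid)
    {
        int         i;

        LWLockAcquire(ProcArrayLock, LW_SHARED);
        for (i = 0; i < procArray->numProcs; i++)
        {
            PGPROC     *proc = procArray->procs[i];

            /* InvalidOid means a shared catalog: it conflicts with every database */
            if (OidIsValid(dbOid) && proc->databaseId != dbOid)
                continue;

            /* Only backends that might still see the removed rows are in conflict. */
            if (TransactionIdIsValid(proc->xmin) &&
                TransactionIdPrecedesOrEquals(proc->xmin, latestRemovedXid))
            {
                /* wait up to max_standby_delay, then cancel the backend (elided) */
            }
        }
        LWLockRelease(ProcArrayLock);
    }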
On Sat, 2008-12-20 at 20:09 -0300, Alvaro Herrera wrote: > Heikki Linnakangas wrote: > > Gregory Stark wrote: > >> A vacuum being replayed -- even in a different database -- could trigger the > >> error. Or with the btree split issue, a data load -- again even in a different > >> database -- would be quite likely cause your SELECT to be killed. > > > > Hmm, I wonder if we should/could track the "latestRemovedXid" separately > > for each database. There's no reason why we need to kill a read-only > > query in database X when a table in database Y is vacuumed. > > What about shared catalogs? Hot Standby already covers this special case. Patch uses GetCurrentVirtualXIDs(), which treats a dbOid of 0 to match against all databases. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Fri, 2008-12-19 at 14:23 -0600, Kevin Grittner wrote: > > I guess making it that SQLSTATE would make it simpler to understand > why > > the error occurs and also how to handle it (i.e. resubmit). > > Precisely. Just confirming I will implement the SQLSTATE as requested. I recognize my own and Greg's arguments that the match is not perfect, but the Standard isn't clear on this and Kevin's interpretation is the more useful behaviour for developers. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
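For reference, reporting the cancellation with that SQLSTATE is a small change at the point where the query is killed. A sketch, with the message wording assumed rather than taken from the patch:

    ereport(ERROR,
            (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),    /* SQLSTATE 40001 */
             errmsg("canceling statement due to conflict with recovery"),
             errdetail("User query might have needed to see row versions "
                       "that must be removed.")));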
On Saturday 20 December 2008 04:10:21 Heikki Linnakangas wrote: > Simon Riggs wrote: > > On Sat, 2008-12-20 at 09:21 +0200, Heikki Linnakangas wrote: > >> Gregory Stark wrote: > >>> Simon Riggs <simon@2ndQuadrant.com> writes: > >>>> Increasing the waiting time increases the failover time and thus > >>>> decreases the value of the standby as an HA system. Others value high > >>>> availability higher than you and so we had agreed to provide an option > >>>> to allow the max waiting time to be set. > >>> > >>> Sure, it's a nice option to have. But I think the default should be to > >>> pause WAL replay. > >> > >> I think I agree that pausing should be the default. If for no other > >> reason, because I can't think of a good default for max_standby_delay. > > > > I would rather err on the side of caution. If we do as you suggest, > > somebody will lose their database and start shouting "stupid default". > > Even if we stop applying the WAL, it should still be archived safely, > right? So no data should be lost, although the standby can fall very > much behind, and it can take a while to catch up. > I was thinking the condition Simon was concerned about was that on a very busy slave with WAL delay, you could theoretically fill up the disks and destroy the slave. With query cancel, you might be annoyed to see the queries canceled, but there's no way that you would destroy the slave. (That might not have been what he meant though.) -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com
On Friday 19 December 2008 19:36:42 Simon Riggs wrote: > Perhaps we should listen to the people that have said they don't want > queries cancelled, even if the alternative is inconsistent answers. That > is easily possible yet is not currently an option. Plus we have the > option I referred to up thread, which is to defer query cancel until the > query reads a modified data block. I'm OK with implementing either of > those, as non-default options. Do we need those options or are we ok? > Haven't seen any feedback on this, but I think the two options of cancel query for replay, and pause replay for queries, are probably enough for a first go-around (especially if you can get the query canceling to work only when changes are made to the specific database in question). -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com
On Tue, 2008-12-23 at 23:59 -0500, Robert Treat wrote: > On Friday 19 December 2008 19:36:42 Simon Riggs wrote: > > Perhaps we should listen to the people that have said they don't want > > queries cancelled, even if the alternative is inconsistent answers. That > > is easily possible yet is not currently an option. Plus we have the > > option I referred to up thread, which is to defer query cancel until the > > query reads a modified data block. I'm OK with implementing either of > > those, as non-default options. Do we need those options or are we ok? > > > > Haven't seen any feed back on this, but I think the two options of cancel > query for replay, and pause replay for queries, are probably enough for a > first go around (especially if you can get the query canceling to work only > when changes are made to the specific database in question) Thanks for picking up on this. This question is the #1 usability issue for Hot Standby, since at least May 2008. There are many potential additions and we need to track this carefully over the next few months to see if we have it just right. I'll take viewpoints at any time on that; this door is never closed, though tempus fugit. Greg and Heikki have highlighted in this thread some aspects of btree garbage collection that will increase the chance of queries being cancelled in various circumstances. If this is important enough to trigger additional actions then we need to highlight that now so we have time to take those corrective actions. I've listened to many different viewpoints on and off list. Everybody takes a slightly different angle on it and I'm in favour of giving everybody what they want with the right set of options. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Wed, Dec 24, 2008 at 4:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > Greg and Heikki have highlighted in this thread some aspects of btree > garbage collection that will increase the chance of queries being > cancelled in various circumstances Even HOT-prune may lead to frequent query cancellations and, unlike VACUUM, there is no way the user can control the frequency of prune operations. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-12-24 at 16:48 +0530, Pavan Deolasee wrote: > On Wed, Dec 24, 2008 at 4:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > > > > Greg and Heikki have highlighted in this thread some aspects of btree > > garbage collection that will increase the chance of queries being > > cancelled in various circumstances > > Even HOT-prune may lead to frequent query cancellations and unlike > VACUUM there is no way user can control the frequency of prune > operations. The patch does go to some trouble to handle that case, as I'm sure you've seen. Are you saying that part of the patch is ineffective and should be removed, or? Should/could there be a way to control frequency of prune operations? We could maintain cleanupxmin as a constant minimum distance from xmax, for example. Are we saying we should take further measures, as I asked upthread? If it is a consensus that I take some action, then I will. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Wed, Dec 24, 2008 at 5:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > The patch does go to some trouble to handle that case, as I'm sure > you've seen. Are you saying that part of the patch is ineffective and > should be removed, or? > Umm... are you talking about the "wait" mechanism? That's the only thing I remember. Otherwise, a prune record is pretty much the same as any vacuum cleanup record. > Should/could there be a way to control frequency of prune operations? We > could maintain cleanupxmin as a constant minimum distance from xmax, for > example. > Well, there can be. But tuning any such thing might be difficult and would have implications on the primary. I am not saying we can't do that, but we will need additional tests to see its impact. > Are we saying we should take further measures, as I asked upthread? If > it is a consensus that I take some action, then I will. > Again, I haven't seen how frequently queries may get canceled, or, if the delay is set to a large value, how far behind the standby may get during replication, so I can't really comment. Have you done any tests on reasonable hardware and checked whether moderately long read queries can be run on the standby without it lagging behind a lot? I would prefer to have a solution which can be smarter than canceling all queries as soon as a cleanup record comes and the timeout occurs. For example, if the queries are being run on a completely different set of tables whereas the updates/deletes are happening on another set of tables, there is no reason why those queries should be canceled. I think it would be very common to have large history tables which may receive long read-only queries but no updates/deletes, whereas other, frequently updated tables receive very few queries on the standby. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
On Wed, 2008-12-24 at 17:56 +0530, Pavan Deolasee wrote: > On Wed, Dec 24, 2008 at 5:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > > > > > The patch does go to some trouble to handle that case, as I'm sure > > you've seen. Are you saying that part of the patch is ineffective and > > should be removed, or? > > > > Umm.. are you talking about the "wait" mechanism ? That's the only > thing I remember. Otherwise, prune record is pretty much same as any > vacuum cleanup record. With respect, I was hoping you might look in the patch and see if you agree with the way it is handled. No need to remember. The whole latestRemovedXid concept is designed to help with this. > > Should/could there be a way to control frequency of prune operations? We > > could maintain cleanupxmin as a constant minimum distance from xmax, for > > example. > > > > Well, there can be. But tuning any such thing might be difficult and > would have implications on the primary. I am not saying we can do > that, but we will need additional tests to see its impact. > > > Are we saying we should take further measures, as I asked upthread? If > > it is a consensus that I take some action, then I will. > > > > Again, I haven't seen how frequently queries may get canceled. Or if > the delay is set to a large value, how far behind standby may get > during replication, so I can't really comment. Have you done any tests > on a reasonable hardware and checked if moderately long read queries > can be run on standby without standby lagging behind a lot. Queries get cancelled if data they need to see is removed and the max_standby_delay expires. So lag will be max_standby_delay, by definition. Not sure what further tests would show. Queries that run for longer than max_standby_delay plus the mean time between cleanup records will currently end up being cancelled. > I would prefer to have a solution which can be smarter than canceling > all queries as soon as a cleanup record comes and timeout occurs. Currently it is the consensus view that queries should be cancelled, though there are other options still on the table. It's discussed in Design Notes on the Wiki. "Simply ignoring WAL removal has been discussed and rejected (so far). http://archives.postgresql.org/pgsql-hackers/2008-05/msg00753.php Explicitly defining the tables a transaction wishes to see has also been discussed and rejected (so far). http://archives.postgresql.org/pgsql-hackers/2008-08/msg00268.php" > For > example, if the queries are being run on a completely different set of > tables where as the updates/deletes are happening on another set of > tables, there is no reason why those queries should be canceled. I > think it would be very common to have large history tables which may > receive long read-only queries, but no updates/deletes. Whereas other > frequently updated tables which receive very few queries on the > standby. There is currently no way to tell which tables a query will touch during the course of its execution, nor is there likely to be, because of user-defined volatile functions. I attempted to find ways to explicitly limit the set of tables over which a query might venture, but that came to nothing also. We've also discussed storing lastCleanedLSN for each buffer, so queries can cancel themselves if they need to read a buffer that has had data removed from it that they would have needed to see. I'll write that up also. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
On Wednesday 24 December 2008 08:48:04 Simon Riggs wrote: > On Wed, 2008-12-24 at 17:56 +0530, Pavan Deolasee wrote: > > Again, I haven't seen how frequently queries may get canceled. Or if > > the delay is set to a large value, how far behind standby may get > > during replication, so I can't really comment. Have you done any tests > > on a reasonable hardware and checked if moderately long read queries > > can be run on standby without standby lagging behind a lot. > > Queries get cancelled if data they need to see if removed and the > max_standby_delay expires. So lag will be max_standby_delay, by > definition. > > Not sure what further tests would show. Queries that run for longer than > max_standby delay plus mean time between cleanup records will currently > end up being cancelled. > I think the uncertainty comes from people's experience with typical replication use cases versus a lack of experience with this current implementation. One such example is that it is pretty common to use read-only slaves to do horizontal scaling of read queries across a bunch of slaves. This is not the scenario of running reporting queries on a second machine to lower load; you would be running a large number of read-only, relatively short, OLTP-ish queries (think pg_bench's select-only test, I suppose), but you also have a fairly regular stream of inserts/updates going on with these same tables; it's just that you have a 95/5 split of read/write (or similar). This is standard practice in things like MySQL or using Slony or what have you. I suspect it's one of the first things people are going to want to do with hot standby. But it's unclear how well this will work because we don't have any experience with it yet, coupled with the two downsides being mentioned as canceled queries and replay lag, which happen to be probably the two worst downsides you would have in the above scenario. :-) Hmm.... I'm not sure why I didn't think of running this test before, but read/write pg_bench on a master with a pg_bench select test on the slave isn't that bad of a scenario to match the above; it might be a little too much activity on the master, but has anyone else run such a test? -- Robert Treat Conjecture: http://www.xzilla.net Consulting: http://www.omniti.com
On Wed, 2008-12-24 at 09:59 -0500, Robert Treat wrote: > I think the uncertainty comes from peoples experience with typical replication > use cases vs a lack of experience with this current implementation. Quite possibly. Publishing user feedback on this will be very important in making this a usable feature. I'd be very happy if you were to direct the search for optimal usability. > One such example is that it is pretty common to use read-only slaves to do > horizontal scaling of read queries across a bunch of slaves. This is not the > scenario of running reporting queries on a second machine to lower load; you > would be running a large number of read-only, relativly short, oltp-ish > queries (think pg_benchs select only test i suppose), but you also have a > fairly regular stream of inserts/updates going on with these same tables, its > just you have 95/5 split of read/write (or similar). One thing to consider also is latency of information. Sending queries to master or slave may return different answers if querying very recent data. > This is standard practice in things like mysql or using slony or what have > you. I suspect it's one of the first things people are going to want to do > with hot standby. But it's unclear how well this will work because we don't > have any experience with it yet, coupled with the two downsides being > mentioned as canceled queries and replay lag, which happen to be probably the > two worst downsides you would have in the above scenario. :-) > > Hmm.... I'm not sure why I didn't think of running this test before, but > read/write pg_bench on a master with pg_bench select test on slave isn't that > bad of a scenario to match the above; it might be a little too much activity > on the master, but has anyone else run such a test? > -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Wed, Dec 24, 2008 at 7:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > > > > With respect, I was hoping you might look in the patch and see if you > agree with the way it is handled. No need to remember. The whole > latestRemovedXid concept is designed to do help. > Well, that's common for all cleanup records, including vacuum. But reading your comment, it seemed as if there were something special to handle the HOT prune case, which I did not see. Anyway, the trouble with HOT prune is that tuples may be cleaned up very frequently and that can lead to query cancellation at the standby. That's what I wanted to emphasize. > > Queries get cancelled if data they need to see if removed and the > max_standby_delay expires. So lag will be max_standby_delay, by > definition. That's per cleanup record, isn't it? > We've also discussed storing lastCleanedLSN for each buffer, so queries > can cancel themselves if they need to read a buffer that has had data > removed from it that they would have needed to see. I'll write that up > also. > What if we do that at the table level? So if a query touches a table which had cleanup activity since the query was started, it cancels itself automatically. Happy X'mas to all of you! Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
> Perhaps we should listen to the people that have said they don't want > queries cancelled, even if the alternative is inconsistent answers. I think an alternative to that would be "if the WAL backlog is too big, let current queries finish and let incoming queries wait till the backlog gets smaller". Feel free to ignore me; as a non-hacker I'm not even supposed to be reading this list :-] Greetings Marcin
marcin mank wrote: >> Perhaps we should listen to the people that have said they don't want >> queries cancelled, even if the alternative is inconsistent answers. I don't like that much. PostgreSQL has traditionally tried very hard to avoid that. It's hard to tell what kind of inconsistencies you'd get, as it'd depend on what plan is created, when a vacuum happens to run on the master, etc. > I think an alternative to that would be "if the wal backlog is too > big, let current queries finish and let incoming queries wait till the > backlog gets smaller". Yeah, that makes sense too. Many approaches have been proposed, and they all have different tradeoffs and therefore fit different use cases. I'm not sure which ones are/will be included in the patch. We don't need them all in 8.4; one or two of the simplest ones will do just fine, and we can extend later. Let me summarize. Whenever a WAL record conflicts with a query-in-progress, we can: 1. kill the query, 2. wait for the query to finish, or 3. let the query proceed, producing invalid results. There are some combinations of those as well. Your proposal is a variation of 2, to avoid the problem of WAL application falling behind indefinitely. There's also the max_standby_delay option in the patch, to wait a while, and then kill the query. There are some additional optimizations that can be made to make those options less painful. Instead of killing all queries that might be affected by a vacuum record, only kill them when they actually hit a block that was vacuumed (Simon's idea of a latestRemovedLSN field in the page header). Another line of attack is to avoid getting into the situation in the first place, by affecting behavior on the master. If the standby has an online connection to the master (per the synch rep patch), it can tell the master what the slave's OldestXmin is, and the master can take that into account and not remove tuples still needed by the slave. That's not good from a high availability point of view, as you don't want a hung query in the slave to cause a long-running-transaction situation in the master, but for other use cases it would be fine. Or we can just add a constant # of transactions to OldestXmin in the master, to get some breathing room in the server. The bottom line is that we have enough options to make everyone happy. Some understanding of the issue is required to tune it properly, however, so documentation is important. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
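To make the "cancel only when the query actually touches an affected block" idea concrete, a hypothetical sketch; the per-buffer latestRemovedLSN bookkeeping, the snapshot field, and the helper functions are all illustrative names, not code from the patch:

    /* Checked when a standby query reads a buffer (sketch only). */
    static void
    CheckForCleanupConflict(Buffer buf, Snapshot snapshot)
    {
        /* hypothetical: LSN of the last cleanup record replayed into this buffer */
        XLogRecPtr  cleanupLSN = BufferGetLatestRemovedLSN(buf);

        /* hypothetical snapshot field: WAL position when the snapshot was taken */
        if (XLByteLT(snapshot->takenAtLSN, cleanupLSN))
            CancelMyQueryWithRetryableError();  /* raises the SQLSTATE 40001 error
                                                 * discussed earlier in the thread */
    }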
On Fri, 2008-12-19 at 09:22 -0500, Greg Stark wrote: > I'm confused shouldn't read-only transactions on the slave just be > hacked to not set any hint bits including lp_delete? It seems there are multiple issues involved and I saw only the first of these initially. I want to explicitly separate these issues so we can discuss them more easily. 1. When we replay an XLOG_BTREE_DELETE record, we may have to wait-then-cancel-etc other sessions. Possibly a pain, but these records are not very common now that we have HOT, except on certain kinds of queue table. 2. Should we ignore the LP_DEAD flag on btree rows when we are using the index during recovery? As Heikki points out, this hint bit is not WAL logged, but can appear in the standby as a result of full page writes. The LP_DEAD flags will have been set using a different xmin to the one on the standby and would cause index rows to be ignored that should have been included in a correct MVCC answer. So we need to either (a) always ignore LP_DEAD flags we see when reading index during recovery. (b) include an additional step to clean the full page writes to remove LP_DEAD hints from the incoming pages. (b) is feasible, but would need to be repeated each time a new full page arrived, so a page may need to be re-cleaned many times. Sounds like a bad plan, so we should choose (a). 3. Should we set LP_DELETE flag on btree rows when we are using the index during recovery? Not much point if we are ignoring them. There is no space for an additional flag, to distinguish between primary and standby hint bits. Issues (2) and (3) would go away entirely if both standby and primary always had the same xmin value as a system-wide setting. i.e. the standby and primary are locked together at their xmins. Perhaps that was Heikki's intention in recent suggestions? -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs wrote: > Issues (2) and (3) would go away entirely if both standby and primary > always had the same xmin value as a system-wide setting. i.e. the > standby and primary are locked together at their xmins. Perhaps that was > Heikki's intention in recent suggestions? No, I only suggested that as an optional optimization. We can't rely on it, because the queries on standby should still work correctly if the connection to primary is lost for some reason, or if the primary decides not to honor standby's xmin, perhaps to avoid the usual issues with long-running-transactions. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Simon Riggs wrote: > (a) always ignore LP_DEAD flags we see when reading index during > recovery. This sounds simplest, and it's nice to not clear the flags for the benefit of transactions running after the recovery is done. You have to be careful to ignore the flags in read-only transactions that started in hot standby mode, even if recovery has since ended and we're in normal operation now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Tue, 2008-12-30 at 18:31 +0200, Heikki Linnakangas wrote: > Simon Riggs wrote: > > (a) always ignore LP_DEAD flags we see when reading index during > > recovery. > > This sounds simplest, and it's nice to not clear the flags for the > benefit of transactions running after the recovery is done. Agreed. (Also: Transaction hint bits are always set correctly, because we would only ever see a full page write with hints set after the commit/abort record was processed. So I continue to honour transaction hint bit reading and setting during recovery). > You have to be careful to ignore the flags in read-only transactions > that started in hot standby mode, even if recovery has since ended and > we're in normal operation now. Got that. I'm setting ignore_killed_tuples = false at the start of any index scan during recovery. And kill_prior_tuples is never set true when in recovery. Both measures are AM-agnostic. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
On Tue, 2008-12-30 at 18:31 +0200, Heikki Linnakangas wrote: > You have to be careful to ignore the flags in read-only transactions > that started in hot standby mode, even if recovery has since ended and > we're in normal operation now. My initial implementation in v6 worked, but had a corner case if a transaction spanned the change from recovery into normal processing. I'm now done on a more complete fix, will be in v8. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
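A minimal sketch of where the two flags discussed in the last few messages end up, in the AM-independent index scan code; function names and exact placement are assumptions, not the v8 patch:

    /* At index scan setup: a transaction that started during recovery must not
     * trust LP_DEAD hints, since they may have arrived via full-page writes and
     * were set under the primary's xmin -- and it must keep ignoring them even
     * if recovery ends while the transaction is still running. */
    scan->ignore_killed_tuples = !TransactionStartedDuringRecovery();

    /* And never feed new hints back while WAL replay is in progress: */
    if (RecoveryInProgress())
        scan->kill_prior_tuple = false;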