Thread: Allow "snapshot too old" error, to prevent bloat
This patch is related to the "Reduce pinning in btree indexes" patch submitted here:

http://www.postgresql.org/message-id/721615179.3351449.1423959585771.JavaMail.yahoo@mail.yahoo.com

That post describes how the two patches evolved and how they relate; I won't duplicate that here.

Unlike the other patch, this one is more at the "proof of concept" phase, because it requires support in the heap and each index AM to work correctly; so far I have only had time to cover the heap and btree indexes. In spite of that, I have thrown the worst test cases I could think of at it (and only succeeded in uncovering a bug which was already out there in production), and it has shown its value in a two-day test simulating a 300 user load with complex real-world applications (although the only indexes it used were btree indexes). Without the patches the database growth was 39GB per day; with the patches it was 28.5GB per day. (The test does involve more inserts than deletes, so some growth is expected.) At the end of the tests, pgstattuple reported eight times as many dead tuples in the database without the patches. More importantly, without the patches the CPU load started at 60% and showed linear growth to 92% over the course of the first day; with the patches it stayed at a stable 60% throughout the test.

What this patch does is add a GUC called old_snapshot_threshold. It defaults to -1, which leaves behavior matching unpatched code. Above that, it allows tuples to be vacuumed away after the number of transaction IDs specified by the GUC have been consumed. It also saves the current insertion LSN into every snapshot when it is created. When reading from the heap or any index, if the snapshot is vulnerable to showing incorrect data because the threshold has been crossed since it was generated, reading any page with an LSN past the snapshot LSN causes a "snapshot too old" error to be thrown. Since this is LSN-based, the new logic is not used for any relation which is not WAL-logged. Note that if you don't read data from a page which has been modified after your snapshot was taken, the threshold doesn't matter.

All `make installcheck` tests succeed with any setting. With a setting of 0 (the most extreme), `make installcheck-world` sees four isolation tests fail. Those all pass if you raise the setting to 2. The postgres_fdw test needs a setting of 4 to succeed. I would expect most shops would want to tune this to something in the six-digit to eight-digit range. In the tests mentioned above it was set to 150000 (which corresponded to just under 4 minutes of txid consumption) and there were no "snapshot too old" errors, even though some cursors were left open for the entire two-day run.

The patch still lacks (as mentioned above) support for index AMs other than btree, and lacks documentation for the new GUC. I'm sure that there are some comments and README files that need adjustment, too. As I said, this is still POC.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
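A minimal sketch in C of the page-level check described above. The TestForOldSnapshot() name appears later in this thread, but the body shown here, the SnapshotTooOldThresholdCrossed() helper, the snapshot->lsn field, and the error code are illustrative assumptions, not the patch's actual code:

#include "postgres.h"

#include "storage/bufpage.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"

/* The new GUC (a number of transaction IDs in this version of the patch). */
extern int old_snapshot_threshold;

/* Hypothetical helper: has this snapshot aged past the threshold? */
extern bool SnapshotTooOldThresholdCrossed(Snapshot snapshot);

/* Assumed SQLSTATE for the new error; not defined in stock errcodes. */
#define ERRCODE_SNAPSHOT_TOO_OLD  MAKE_SQLSTATE('7','2','0','0','0')

/*
 * Sketch of the page-level check: error out if an old snapshot reads
 * a page stamped with an LSN newer than the LSN the snapshot saved at
 * creation (assumed here to live in snapshot->lsn, a field the patch
 * would add to SnapshotData).
 */
static void
TestForOldSnapshot(Snapshot snapshot, Relation rel, Page page)
{
    if (old_snapshot_threshold < 0)
        return;                 /* -1: unpatched behavior */

    if (!RelationNeedsWAL(rel))
        return;                 /* LSN-based, so WAL-logged relations only */

    if (SnapshotTooOldThresholdCrossed(snapshot) &&
        PageGetLSN(page) > snapshot->lsn)
        ereport(ERROR,
                (errcode(ERRCODE_SNAPSHOT_TOO_OLD),
                 errmsg("snapshot too old")));
}

The -1 and WAL-logged-only cases fall out directly from the description above: with the default setting, or for unlogged relations, the check is a no-op.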
Kevin Grittner <kgrittn@ymail.com> writes:
> What this patch does is add a GUC called old_snapshot_threshold. It
> defaults to -1, which leaves behavior matching unpatched code.
> Above that it allows tuples to be vacuumed away after the number of
> transaction IDs specified by the GUC have been consumed.

TBH, I'm not sure why we'd wish to emulate Oracle's single worst operational feature.

> Unlike the other patch, this one is more at the "proof of concept"
> phase, because it requires support in the heap and each index AM to
> work correctly; so far I have only had time to cover the heap and
> btree indexes.

But, having said that, why would the index AMs care? Seems like what you are describing should be strictly a matter for VACUUM's removal rules. If we're going to have something as ugly as this, I would much rather it had a very small code footprint.

regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Kevin Grittner <kgrittn@ymail.com> writes:
>> What this patch does is add a GUC called old_snapshot_threshold. It
>> defaults to -1, which leaves behavior matching unpatched code.
>> Above that it allows tuples to be vacuumed away after the number of
>> transaction IDs specified by the GUC have been consumed.
>
> TBH, I'm not sure why we'd wish to emulate Oracle's single worst
> operational feature.

I've run into cases where people have suffered horribly bloated databases because of one ill-behaved connection. Some companies don't want to be vulnerable to that and the disruption that recovery from that bloat causes. Note that the default is to maintain legacy PostgreSQL behavior; the postgresql.conf file must be modified to see this error.

>> Unlike the other patch, this one is more at the "proof of concept"
>> phase, because it requires support in the heap and each index AM to
>> work correctly; so far I have only had time to cover the heap and
>> btree indexes.
>
> But, having said that, why would the index AMs care? Seems like what
> you are describing should be strictly a matter for VACUUM's removal
> rules. If we're going to have something as ugly as this, I would much
> rather it had a very small code footprint.

If a tuple is vacuumed away, you would not notice that if you were running an index scan, because you would not visit the page it used to be on to see that the LSN had changed. We tried to sell the idea that *any* scan using a snapshot past the threshold should error out (which is the only way I can see to avoid making the index AMs aware of this), but they insisted that there was no reason to have scans fail when reading an unmodified table using an old snapshot, or even an unmodified page of a modified table.

Due to the cost to this company of modifying their software to deal with different behavior, they will not move without the behavior I'm shooting for with this patch. Due to the value of this customer to us, we will put this (or something to accomplish this behavior) into our fork. We're happy to share it with the community. If the community does not feel they need this behavior, or wants to generate the error more aggressively to avoid touching the index AMs, I understand.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Kevin Grittner <kgrittn@ymail.com> writes:
>>> What this patch does is add a GUC called old_snapshot_threshold. It
>>> defaults to -1, which leaves behavior matching unpatched code.
>>> Above that it allows tuples to be vacuumed away after the number of
>>> transaction IDs specified by the GUC have been consumed.

>> TBH, I'm not sure why we'd wish to emulate Oracle's single worst
>> operational feature.

> I've run into cases where people have suffered horribly bloated
> databases because of one ill-behaved connection. Some companies
> don't want to be vulnerable to that and the disruption that
> recovery from that bloat causes.

No doubt, preventing bloat is a good thing, but that doesn't mean this is the best API we could create for the issue. The proposition this patch offers to DBAs is: "You can turn this knob to reduce bloat by some hard-to-quantify factor. The cost is that some long-running transactions might fail. You won't know which ones are at risk, the failures won't be the same from time to time, and you won't be able to do anything to spare high-value transactions from that fate except by turning that knob back again globally (which requires a database restart)." Maybe refugees from Oracle will think that sounds good, but nobody else will.

I wonder if we couldn't achieve largely the same positive effects through adding a simple transaction-level timeout option. That would be far easier for users to understand and manage, it would be trivial to allow specific high-value transactions to run with a laxer setting, it does not create any implementation-detail-dependent behavior that we'd be having to explain to users forevermore, and (not incidentally) it would be a lot simpler and more trustworthy to implement. There's no well-defined correlation between your setting and the net effect on database bloat, but that's true with the "snapshot too old" approach as well.

regards, tom lane
On 2/15/15 10:36 AM, Tom Lane wrote:
> Kevin Grittner <kgrittn@ymail.com> writes:
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Kevin Grittner <kgrittn@ymail.com> writes:
>>>> What this patch does is add a GUC called old_snapshot_threshold. It
>>>> defaults to -1, which leaves behavior matching unpatched code.
>>>> Above that it allows tuples to be vacuumed away after the number of
>>>> transaction IDs specified by the GUC have been consumed.
>
>>> TBH, I'm not sure why we'd wish to emulate Oracle's single worst
>>> operational feature.
>
>> I've run into cases where people have suffered horribly bloated
>> databases because of one ill-behaved connection. Some companies
>> don't want to be vulnerable to that and the disruption that
>> recovery from that bloat causes.
>
> No doubt, preventing bloat is a good thing, but that doesn't mean this
> is the best API we could create for the issue. The proposition this
> patch offers to DBAs is: "You can turn this knob to reduce bloat by some
> hard-to-quantify factor. The cost is that some long-running transactions
> might fail. You won't know which ones are at risk, the failures won't be
> the same from time to time, and you won't be able to do anything to spare
> high-value transactions from that fate except by turning that knob back
> again globally (which requires a database restart)." Maybe refugees from
> Oracle will think that sounds good, but nobody else will.
>
> I wonder if we couldn't achieve largely the same positive effects through
> adding a simple transaction-level timeout option. That would be far
> easier for users to understand and manage, it would be trivial to allow
> specific high-value transactions to run with a laxer setting, it does not
> create any implementation-detail-dependent behavior that we'd be having to
> explain to users forevermore, and (not incidentally) it would be a lot
> simpler and more trustworthy to implement. There's no well-defined
> correlation between your setting and the net effect on database bloat,
> but that's true with the "snapshot too old" approach as well.

A common use-case is long-running reports hitting relatively stable data in a database that also has tables with a high churn rate (ie: queue tables). In those scenarios your only options right now are to suffer huge amounts of bloat in the high-churn tables or not do your reporting. A simple transaction timeout only "solves" this by denying you reporting queries.

An idea that I've had on this would be some way to "lock down" the tables that a long-running transaction could access. That would allow vacuum to ignore any snapshots that transaction had for tables it wasn't accessing. That's something that would be deterministic.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
Jim Nasby <Jim.Nasby@bluetreble.com> writes:
> On 2/15/15 10:36 AM, Tom Lane wrote:
>> I wonder if we couldn't achieve largely the same positive effects through
>> adding a simple transaction-level timeout option.

> A common use-case is long-running reports hitting relatively stable data
> in a database that also has tables with a high churn rate (ie: queue
> tables). In those scenarios your only options right now are to suffer
> huge amounts of bloat in the high-churn tables or not do your reporting.
> A simple transaction timeout only "solves" this by denying you reporting
> queries.

Agreed, but Kevin's proposal has exactly the same problem only worse, because (a) the reporting transactions might or might not fail (per Murphy, they'll fail consistently only when you're on deadline), and (b) if they do fail, there's nothing you can do short of increasing the slack db-wide.

> An idea that I've had on this would be some way to "lock down" the
> tables that a long-running transaction could access. That would allow
> vacuum to ignore any snapshots that transaction had for tables it wasn't
> accessing. That's something that would be deterministic.

There might be something in that, but again it's not much like this patch. The key point I think we're both making is that nondeterministic failures are bad, especially when you're talking about long-running, expensive-to-retry transactions.

regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jim Nasby <Jim.Nasby@bluetreble.com> writes:
>> On 2/15/15 10:36 AM, Tom Lane wrote:
>>> I wonder if we couldn't achieve largely the same positive effects through
>>> adding a simple transaction-level timeout option.

We suggested this to our customer and got out of the meeting with it looking like it *might* fly. In the next meeting, however, they said they had run it by others and reviewed the code and it was completely unacceptable -- they would not consider pg with this as the solution.

>> A common use-case is long-running reports hitting relatively stable data
>> in a database that also has tables with a high churn rate (ie: queue
>> tables). In those scenarios your only options right now are to suffer
>> huge amounts of bloat in the high-churn tables or not do your reporting.
>> A simple transaction timeout only "solves" this by denying you reporting
>> queries.
>
> Agreed, but Kevin's proposal has exactly the same problem only worse,
> because (a) the reporting transactions might or might not fail (per
> Murphy, they'll fail consistently only when you're on deadline), and
> (b) if they do fail, there's nothing you can do short of increasing the
> slack db-wide.

These they were comfortable with, and did *not* consider to be unpredictable or something they could not do something about. I really don't feel I can say more than that, though, without disclosing more than I should.

>> An idea that I've had on this would be some way to "lock down" the
>> tables that a long-running transaction could access. That would allow
>> vacuum to ignore any snapshots that transaction had for tables it wasn't
>> accessing. That's something that would be deterministic.

While this option was not specifically suggested, based on their reasons that numerous other options we suggested were unacceptable, I feel sure that this would not be acceptable. I think Tom hit the nail on the head when he said "Maybe refugees from Oracle will think that sounds good..." It is precisely those with very large code bases which have been modified over long periods of time to work with Oracle that would find this solution attractive, or perhaps *necessary* to consider a move to Postgres. That's a potential market we don't care to write off.

> There might be something in that, but again it's not much like this patch.
> The key point I think we're both making is that nondeterministic failures
> are bad, especially when you're talking about long-running, expensive-to-
> retry transactions.

What the customer most doesn't want to be "nondeterministic" is whether the error is generated only when the snapshot has been used to read a page which has been modified since the snapshot was taken. If tables or materialized views are set up before a query and then not subsequently modified during the execution of the query, that query must not be canceled even if it runs for days, but it must not cause unconstrained bloat during the run. So far I don't see any way to avoid including the LSN with the snapshot or modifying the index AMs. Let's be clear on the footprint for that; for the btree implementation it is:

 src/backend/access/nbtree/nbtinsert.c |  7 ++++---
 src/backend/access/nbtree/nbtpage.c   |  2 +-
 src/backend/access/nbtree/nbtsearch.c | 43 ++++++++++++++++++++++++++++++++++---------
 src/include/access/nbtree.h           |  7 ++++---
 4 files changed, 43 insertions(+), 16 deletions(-)

What this discussion has made me reconsider is the metric for considering a transaction "too old". The number of transaction IDs consumed seems inferior to LSN or time as the unit of measure for that.

It looks to me to be pretty trivial (on the order of maybe 30 lines of code) to specify this GUC in minutes rather than transaction IDs. At first glance this seems like it would be vulnerable to the usual complaints about mismanaged clocks, but those are easily answered by using a cached time in shared memory that we populate in such a way that it can never move backward. Done right, this could even allow the GUC to be changeable on reload rather than only at restart. A badly mismanaged system clock could not cause a query to generate incorrect results; the worst that could happen is that this feature would fail to control bloat as much as expected, or reads of modified data could generate the "snapshot too old" error around the time of the clock adjustment.

As before, this would default to a magic value to mean that you want the historical PostgreSQL behavior.

If that makes the idea more palatable to anyone, I can submit a patch to that effect within the next 24 hours.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
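The "cached time that can never move backward" idea might be sketched like this in C; the shared-memory struct, its initialization, and the function name are assumptions for illustration only, not code from any patch:

#include "postgres.h"

#include "storage/spin.h"
#include "utils/timestamp.h"

/*
 * Sketch only: a shared-memory cached timestamp that can never move
 * backward, so a badly adjusted system clock cannot make snapshot
 * aging misbehave.
 */
typedef struct OldSnapshotControlData
{
    slock_t     mutex;          /* protects cached_ts */
    TimestampTz cached_ts;      /* latest time used for snapshot aging */
} OldSnapshotControlData;

static OldSnapshotControlData *oldSnapshotControl;  /* lives in shared memory */

static TimestampTz
GetSnapshotCurrentTimestamp(void)
{
    TimestampTz now = GetCurrentTimestamp();

    SpinLockAcquire(&oldSnapshotControl->mutex);
    if (now > oldSnapshotControl->cached_ts)
        oldSnapshotControl->cached_ts = now;    /* normal case: advance */
    else
        now = oldSnapshotControl->cached_ts;    /* clock moved back: hold */
    SpinLockRelease(&oldSnapshotControl->mutex);

    return now;
}

Because every reader takes the maximum of the wall clock and the cached value, a clock stepped backward merely freezes the aging computation until real time catches up, which matches the worst-case behavior described above.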
On 2015-02-16 05:19:11 +0000, Kevin Grittner wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Jim Nasby <Jim.Nasby@bluetreble.com> writes:
>>> On 2/15/15 10:36 AM, Tom Lane wrote:
>>>> I wonder if we couldn't achieve largely the same positive effects through
>>>> adding a simple transaction-level timeout option.
>
> We suggested this to our customer and got out of the meeting with
> it looking like it *might* fly. In the next meeting, however, they
> said they had run it by others and reviewed the code and it was
> completely unacceptable -- they would not consider pg with this as
> the solution.
>
>>> A common use-case is long-running reports hitting relatively stable data
>>> in a database that also has tables with a high churn rate (ie: queue
>>> tables). In those scenarios your only options right now are to suffer
>>> huge amounts of bloat in the high-churn tables or not do your reporting.
>>> A simple transaction timeout only "solves" this by denying you reporting
>>> queries.
>>
>> Agreed, but Kevin's proposal has exactly the same problem only worse,
>> because (a) the reporting transactions might or might not fail (per
>> Murphy, they'll fail consistently only when you're on deadline), and
>> (b) if they do fail, there's nothing you can do short of increasing the
>> slack db-wide.
>
> These they were comfortable with, and did *not* consider to be
> unpredictable or something they could not do something about.
> I really don't feel I can say more than that, though, without
> disclosing more than I should.

I agree that we need to do something about the dangers of long snapshots; I'm not sure this is it. I'm not sure of the contrary either. What I'm definitely not a fan of, though, is this implementation. Having to add checks to a large number of places is a recipe for silently wrong query results.

One thing I was wondering about recently was introducing an optimization for repeatedly updated rows into the vacuum code: a row that has xmin = xmax, where these have committed, can be removed even if the xid is above the xmin horizon -- no other transaction is ever going to see it. While there's some hairy-ness implementing that, it doesn't seem too hard. And there's a fair number of cases where that'd allow less bloat to accumulate. Obviously it'd be better if we had logic to do that for other patterns as well (where the updates aren't in the same xact), but it seems like a good start.

>> There might be something in that, but again it's not much like this patch.
>> The key point I think we're both making is that nondeterministic failures
>> are bad, especially when you're talking about long-running, expensive-to-
>> retry transactions.
>
> What the customer most doesn't want to be "nondeterministic" is
> whether the error is generated only when the snapshot has been used
> to read a page which has been modified since the snapshot was
> taken. If tables or materialized views are set up before a query
> and then not subsequently modified during the execution of the
> query, that query must not be canceled even if it runs for days,
> but it must not cause unconstrained bloat during the run. So far I
> don't see any way to avoid including the LSN with the snapshot or
> modifying the index AMs. Let's be clear on the footprint for that;
> for the btree implementation it is:

IIUC you never would want cancellations when accessing the tables these long-running backends actually access. The errors about too-old snapshots are just a stopgap because we can't compute the xmin per table in a meaningful way atm. Right? Is the bloat caused by rows these transactions actually can see, or are the not-removed rows newer than the transaction's xmax?

Since you actually don't want cancellations for the long-running reporting queries, it very much might be sensible to switch to a more complicated version of HeapTupleSatisfiesVacuum() if there are long-running queries -- one that can recognize whether rows are actually visible to any backend, or whether there are just snapshots that see older rows. I've previously wondered how hard this would be, but I don't think it's *that* hard. And it'd be a much more general solution.

Greetings,

Andres Freund
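The xmin = xmax idea might be sketched as follows, deliberately ignoring the hairy details alluded to above (subtransactions, multixacts, combo CIDs, locked-only xmax); this is an illustration of the idea, not vacuum code:

#include "postgres.h"

#include "access/htup_details.h"
#include "access/transam.h"

/*
 * Sketch: a committed row version created and deleted by the same
 * transaction was never visible to any other transaction, so vacuum
 * could remove it even when its xid is above the global xmin horizon.
 */
static bool
DeadToAllRegardlessOfHorizon(HeapTupleHeader tuple)
{
    TransactionId xmin = HeapTupleHeaderGetXmin(tuple);
    TransactionId xmax = HeapTupleHeaderGetRawXmax(tuple);

    return TransactionIdIsNormal(xmin) &&
           TransactionIdEquals(xmin, xmax) &&
           TransactionIdDidCommit(xmin);
}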
On 2/16/15 2:41 AM, Andres Freund wrote:
> Since you actually don't want cancellations for the long-running
> reporting queries, it very much might be sensible to switch to a more
> complicated version of HeapTupleSatisfiesVacuum() if there are
> long-running queries -- one that can recognize whether rows are
> actually visible to any backend, or whether there are just snapshots
> that see older rows. I've previously wondered how hard this would be,
> but I don't think it's *that* hard. And it'd be a much more general
> solution.

Josh Berkus mentioned on IRC that they're working on improving vacuum, and I think it was something along these lines.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Sun, Feb 15, 2015 at 11:36:50AM -0500, Tom Lane wrote:
> I wonder if we couldn't achieve largely the same positive effects through
> adding a simple transaction-level timeout option. That would be far
> easier for users to understand and manage, it would be trivial to allow
> specific high-value transactions to run with a laxer setting, it does not
> create any implementation-detail-dependent behavior that we'd be having to
> explain to users forevermore, and (not incidentally) it would be a lot
> simpler and more trustworthy to implement. There's no well-defined
> correlation between your setting and the net effect on database bloat,
> but that's true with the "snapshot too old" approach as well.

While we have prided ourselves on not generating query cancel messages due to snapshot-too-old, there is the downside of bloat. I understand the need to allow users to make the trade-off between bloat and query cancellation.

It seems we already have a mechanism in place that allows tuning of query cancel on standbys vs. preventing standby queries from seeing old data, specifically max_standby_streaming_delay/max_standby_archive_delay. We obsessed about how users were going to react to these odd variables, but there has been little negative feedback.

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://enterprisedb.com

+ Everyone has their own god. +
On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
> It seems we already have a mechanism in place that allows tuning of
> query cancel on standbys vs. preventing standby queries from seeing old
> data, specifically
> max_standby_streaming_delay/max_standby_archive_delay. We obsessed
> about how users were going to react to these odd variables, but there
> has been little negative feedback.

FWIW, I think that's a somewhat skewed perception. I think it was right to introduce those, because we didn't really have any alternatives.

But max_standby_streaming_delay, max_standby_archive_delay and hot_standby_feedback are among the most frequent triggers for questions and complaints that I/we see.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
<p dir="ltr"><br /> On Feb 17, 2015 12:26 AM, "Andres Freund" <<a href="mailto:andres@2ndquadrant.com">andres@2ndquadrant.com</a>>wrote:<br /> ><br /> > On 2015-02-16 16:35:46 -0500,Bruce Momjian wrote:<br /> > > It seems we already have a mechanism in place that allows tuning of<br /> >> query cancel on standbys vs. preventing standby queries from seeing old<br /> > > data, specifically<br />> > max_standby_streaming_delay/max_standby_archive_delay. We obsessed<br /> > > about how users were goingto react to these odd variables, but there<br /> > > has been little negative feedback.<br /> ><br /> >FWIW, I think that's a somewhat skewed perception. I think it was right to<br /> > introduce those, because we didn'treally have any alternatives.<br /> ><br /> > But max_standby_streaming_delay, max_standby_archive_delay and<br/> > hot_standby_feedback are among the most frequent triggers for questions<br /> > and complaints that I/wesee.<br /> ><p dir="ltr">Agreed.<p dir="ltr">And a really bad one used to be vacuum_defer_cleanup_age, because ofconfusing units amongst other things. Which in terms seems fairly close to Kevins suggestions, unfortunately. <br /><pdir="ltr">/Magnus <br />
Magnus Hagander <magnus@hagander.net> wrote:
> On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com> wrote:
>> On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
>>> It seems we already have a mechanism in place that allows
>>> tuning of query cancel on standbys vs. preventing standby
>>> queries from seeing old data, specifically
>>> max_standby_streaming_delay/max_standby_archive_delay. We
>>> obsessed about how users were going to react to these odd
>>> variables, but there has been little negative feedback.
>>
>> FWIW, I think that's a somewhat skewed perception. I think it
>> was right to introduce those, because we didn't really have any
>> alternatives.

As far as I can see, the "alternatives" suggested so far are all about causing heap bloat to progress more slowly, but still without limit. I suggest, based on a lot of user feedback (from the customer I've talked about at some length on this thread, as well as numerous others), that unlimited bloat based on the activity of one connection is a serious deficiency in the product; and that there is no real alternative to something like a "snapshot too old" error if we want to fix that deficiency. Enhancements to associate a snapshot with a database and use a separate vacuum xmin per database, to avoid limiting vacuum of a particular object because of snapshots that cannot see that object, etc., are all Very Good Things and I hope those changes are made, but none of that fixes a very fundamental flaw.

>> But max_standby_streaming_delay, max_standby_archive_delay and
>> hot_standby_feedback are among the most frequent triggers for
>> questions and complaints that I/we see.
>
> Agreed.
>
> And a really bad one used to be vacuum_defer_cleanup_age, because
> of confusing units amongst other things. Which in turn seems
> fairly close to Kevin's suggestions, unfortunately.

Particularly my initial suggestion, which was to base snapshot "age" on the number of transaction IDs assigned. Does this look any better to you if it is something that can be set to '20min' or '1h'? Just to restate, that would not automatically cancel the snapshots past that age; it would allow vacuum of any tuples which became "dead" that long ago, and would cause a "snapshot too old" message for any read of a page modified more than that long ago using a snapshot which was older than that.

Unless this idea gets some +1 votes Real Soon Now, I will mark the patch as Rejected and move on.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin,

* Kevin Grittner (kgrittn@ymail.com) wrote:
> Magnus Hagander <magnus@hagander.net> wrote:
>> On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com> wrote:
>>> On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
>>>> It seems we already have a mechanism in place that allows
>>>> tuning of query cancel on standbys vs. preventing standby
>>>> queries from seeing old data, specifically
>>>> max_standby_streaming_delay/max_standby_archive_delay. We
>>>> obsessed about how users were going to react to these odd
>>>> variables, but there has been little negative feedback.
>>>
>>> FWIW, I think that's a somewhat skewed perception. I think it
>>> was right to introduce those, because we didn't really have any
>>> alternatives.
>
> As far as I can see, the "alternatives" suggested so far are all
> about causing heap bloat to progress more slowly, but still without
> limit. I suggest, based on a lot of user feedback (from the
> customer I've talked about at some length on this thread, as well
> as numerous others), that unlimited bloat based on the activity of
> one connection is a serious deficiency in the product; and that
> there is no real alternative to something like a "snapshot too old"
> error if we want to fix that deficiency. Enhancements to associate
> a snapshot with a database and use a separate vacuum xmin per
> database, to avoid limiting vacuum of a particular object because
> of snapshots that cannot see that object, etc., are all Very Good
> Things and I hope those changes are made, but none of that fixes a
> very fundamental flaw.

I also agree with the general idea that it makes sense to provide a way to control bloat, but I think you've missed what Andres was getting at with his suggestion (if I understand correctly, apologies if I don't).

The problem is that we're only looking at the overall xmin / xmax horizon when it comes to deciding if a given tuple is dead. That's not quite right -- the tuple might actually be dead to all *current* transactions by being newer than the oldest transaction but dead for all later transactions. Basically, there exist gaps between our cluster-wide xmin / xmax where we might find actually dead rows. Marking those rows dead and reusable *would* stop the bloat, not just slow it down. In the end, with a single long-running transaction, the worst bloat you would have is double the size of the system at the time the long-running transaction started.

Another way of thinking about it is with this timeline:

  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  /\               /\              /\             /\
  long-running     row created     row deleted    vacuum
  transaction
  starts

Now, if all transactions currently running started either before the row was created or after the row was deleted, then the row is dead and vacuum can reclaim it. Finding those gaps in the transaction space for all of the currently running processes might not be cheap, but for this kind of a use-case, it might be worth the effort. On a high-churn system where the actual set of rows being copied over constantly isn't very high, you could end up with some duplication of rows -- those to satisfy the ancient transaction and the current working set -- but there wouldn't be any further bloat, and it'd almost certainly be a much better situation than what is being seen now.

> Particularly my initial suggestion, which was to base snapshot
> "age" on the number of transaction IDs assigned. Does this look
> any better to you if it is something that can be set to '20min' or
> '1h'? Just to restate, that would not automatically cancel the
> snapshots past that age; it would allow vacuum of any tuples which
> became "dead" that long ago, and would cause a "snapshot too old"
> message for any read of a page modified more than that long ago
> using a snapshot which was older than that.
>
> Unless this idea gets some +1 votes Real Soon Now, I will mark the
> patch as Rejected and move on.

I'm not against having a knob like this, which is defaulted to off, but I do think we'd be a lot better off with a system that could realize when rows are not visible to any currently running transaction and clean them up. If this knob's default is off then I don't think we'd be likely to get the complaints which are being discussed (or, if we did, we could point the individual at the admin who set the knob...).

Thanks!

Stephen
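The gap test Stephen describes might be sketched like this, under heavy simplifying assumptions (each snapshot is reduced to its xmin, used as a proxy for when it was taken; real code would need each snapshot's full xmin/xip/xmax state):

#include "postgres.h"

#include "access/transam.h"

/*
 * Sketch: a committed row version created by 'xmin' and deleted by
 * 'xmax' is dead to everyone if no active snapshot falls inside its
 * lifetime, even when xmax is above the global xmin horizon.
 * 'snap_xmins' holds the xmin of every active snapshot.
 * Illustrative only.
 */
static bool
DeadInAGap(TransactionId xmin, TransactionId xmax,
           const TransactionId *snap_xmins, int nsnaps)
{
    int         i;

    for (i = 0; i < nsnaps; i++)
    {
        /* A snapshot taken during the row's lifetime may still see it. */
        if (TransactionIdFollowsOrEquals(snap_xmins[i], xmin) &&
            TransactionIdPrecedes(snap_xmins[i], xmax))
            return false;
    }

    return true;                /* row lived and died inside a "gap" */
}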
Stephen Frost <sfrost@snowman.net> wrote:
> I also agree with the general idea that it makes sense to provide a way
> to control bloat, but I think you've missed what Andres was getting at
> with his suggestion (if I understand correctly, apologies if I don't).
>
> The problem is that we're only looking at the overall xmin / xmax
> horizon when it comes to deciding if a given tuple is dead. That's
> not quite right -- the tuple might actually be dead to all *current*
> transactions by being newer than the oldest transaction but dead for all
> later transactions. Basically, there exist gaps between our cluster-wide
> xmin / xmax where we might find actually dead rows. Marking those
> rows dead and reusable *would* stop the bloat, not just slow it down.
>
> In the end, with a single long-running transaction, the worst bloat you
> would have is double the size of the system at the time the long-running
> transaction started.

I agree that limiting bloat to one dead tuple for every live one for each old snapshot is a limit that has value, and it was unfair of me to characterize that as not being a limit. Sorry for that.

This possible solution was discussed with the user whose feedback caused me to write this patch, but there were several reasons they dismissed that as a viable total solution for them, two of which I can share:

(1) They have a pool of connections each of which can have several long-running cursors, so the limit from that isn't just doubling the size of their database, it is limiting it to some two-or-three digit multiple of the necessary size.

(2) They are already prepared to deal with "snapshot too old" errors on queries that run more than about 20 minutes and which access tables which are modified. They would rather do that than suffer the bloat beyond that point.

IMO all of these changes people are working on are very valuable, and complement each other. This particular approach is likely to be especially appealing to those moving from Oracle because it is a familiar mechanism, and one which some of them have written their software to be aware of and deal with gracefully.

> I'm not against having a knob like this, which is defaulted to off,

Thanks! I'm not sure that amounts to a +1, but at least it doesn't sound like a -1. :-)

> but I do think we'd be a lot better off with a system that could
> realize when rows are not visible to any currently running transaction
> and clean them up.

+1

But they are not mutually exclusive; I see them as complementary.

> If this knob's default is off then I don't think
> we'd be likely to get the complaints which are being discussed (or, if
> we did, we could point the individual at the admin who set the knob...).

That's how I see it, too. I would not suggest making the default anything other than "off", but there are situations where it would be a nice tool to have in the toolbox.

Thanks for the feedback!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Feb 15, 2015 at 8:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There might be something in that, but again it's not much like this patch.
> The key point I think we're both making is that nondeterministic failures
> are bad, especially when you're talking about long-running, expensive-to-
> retry transactions.
But I think Kevin would agree with you there. That's why he's interested in having the errors not occur if you don't read from the volatile tables. Ie, your reporting query reading from read-only tables would be guaranteed to succeed even if some other table had had some rows vacuumed away.
I'm not sure that's really going to work because of things like hint bit updates or HOT pruning. That is, say your table is read-only now but last week some of the records were updated. You might reasonably expect those rows to be safe, and indeed the rows themselves will still be in your report. But the earlier versions of the rows could still be sitting on some pages, invisible to every query but not vacuumable quite yet, and then suddenly they get vacuumed away and the LSN on the page gets bumped.
Fwiw I would strongly suggest that instead of having a number of transactions we have a time difference. I haven't been following the "commit timestamp" thread, but I'm hoping that patch has the infrastructure needed to look up an LSN and find out the timestamp, and vice versa. So when you take a snapshot you could look up the effective cut-off LSN based on the time difference specified in the GUC. As a DBA I would find it infinitely more comforting specifying "old_snapshot_threshold=4h" than specifying some arbitrary large number of transactions and hoping I calculated the transaction rate reasonably accurately.
greg
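Greg's suggestion might be sketched like this at snapshot-creation time. LookupLSNForTimestamp() is hypothetical and does not exist; the premise is that the commit-timestamp infrastructure could supply the raw material for such a mapping, and old_snapshot_threshold_ms is an assumed reinterpretation of the GUC as a duration:

#include "postgres.h"

#include "access/xlogdefs.h"
#include "utils/timestamp.h"

/* Assumed: the GUC reinterpreted as a duration, in milliseconds. */
extern int old_snapshot_threshold_ms;

/* Hypothetical mapping from a timestamp to a known-safe LSN. */
extern XLogRecPtr LookupLSNForTimestamp(TimestampTz ts);

/* Sketch: derive a snapshot's cutoff LSN from a time-based setting
 * such as old_snapshot_threshold = '4h'. */
static XLogRecPtr
SnapshotCutoffLSN(void)
{
    TimestampTz cutoff =
        TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                    -old_snapshot_threshold_ms);

    return LookupLSNForTimestamp(cutoff);
}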
* Kevin Grittner (kgrittn@ymail.com) wrote:
> Stephen Frost <sfrost@snowman.net> wrote:
>> I also agree with the general idea that it makes sense to provide a way
>> to control bloat, but I think you've missed what Andres was getting at
>> with his suggestion (if I understand correctly, apologies if I don't).
>>
>> The problem is that we're only looking at the overall xmin / xmax
>> horizon when it comes to deciding if a given tuple is dead. That's
>> not quite right -- the tuple might actually be dead to all *current*
>> transactions by being newer than the oldest transaction but dead for all
>> later transactions. Basically, there exist gaps between our cluster-wide
>> xmin / xmax where we might find actually dead rows. Marking those
>> rows dead and reusable *would* stop the bloat, not just slow it down.
>>
>> In the end, with a single long-running transaction, the worst bloat you
>> would have is double the size of the system at the time the long-running
>> transaction started.
>
> I agree that limiting bloat to one dead tuple for every live one
> for each old snapshot is a limit that has value, and it was unfair
> of me to characterize that as not being a limit. Sorry for that.
>
> This possible solution was discussed with the user whose feedback
> caused me to write this patch, but there were several reasons they
> dismissed that as a viable total solution for them, two of which I
> can share:
>
> (1) They have a pool of connections each of which can have several
> long-running cursors, so the limit from that isn't just doubling
> the size of their database, it is limiting it to some two-or-three
> digit multiple of the necessary size.

This strikes me as a bit off-the-cuff; was an analysis done which determined that would be the result? If there is overlap between the long-running cursors then there would be less bloat, and most systems which I'm familiar with don't turn the entire database over in 20 minutes, 20 hours, or even 20 days except in pretty specific cases. Perhaps this is one of those, and if so then I'm all wet, but the feeling I get is that this is a way to dismiss this solution because it's not what's wanted, which is "what Oracle did."

> (2) They are already prepared to deal with "snapshot too old"
> errors on queries that run more than about 20 minutes and which
> access tables which are modified. They would rather do that than
> suffer the bloat beyond that point.

That, really, is the crux here -- they've already got support for dealing with it the way Oracle did and they'd like PG to do that too. Unfortunately, that, by itself, isn't a good reason for a particular capability (we certainly aren't going to be trying to duplicate PL/SQL in PG any time soon). That said, there are capabilities in other RDBMSs which are valuable and which we *do* want, so the fact that Oracle does this also isn't a reason to not include it.

> IMO all of these changes people are working on are very valuable, and
> complement each other. This particular approach is likely to be
> especially appealing to those moving from Oracle because it is a
> familiar mechanism, and one which some of them have written their
> software to be aware of and deal with gracefully.

For my 2c, I'd much rather provide them with a system where they don't have to deal with broken snapshots than give them a way to have them the way Oracle provided them. :)

That said, even the approach Andres outlined will cause bloat, and it may be beyond what's acceptable in some environments, and it's certainly more complicated and unlikely to get done in the short term.

>> I'm not against having a knob like this, which is defaulted to off,
>
> Thanks! I'm not sure that amounts to a +1, but at least it doesn't
> sound like a -1. :-)

So, at the time I wrote that, I wasn't sure if it was a +1 or not myself. I've been thinking about it since then, however, and I'm leaning more towards having the capability than not, so perhaps that's a +1, but it doesn't excuse the need to come up with an implementation that everyone can be happy with, and what you've come up with so far doesn't have a lot of appeal, based on the feedback (I've only glanced through it myself, but I agree with Andres and Tom that it's a larger footprint than we'd want for this, and the number of places having to be touched is concerning as it could lead to future bugs).

A lot of that would go away if there was a way to avoid having to mess with the index AMs, I'd think, but I wonder if we'd actually need more done there -- it's not immediately obvious to me how an index-only scan is safe with this. Whenever an actual page is visited, we can check the LSN, and the relation can't be truncated by vacuum since the transaction will still have a lock on the table which prevents it, but does the visibility-map update check make sure to never mark pages all-visible when one of these old transactions is running around? On what basis?

>> but I do think we'd be a lot better off with a system that could
>> realize when rows are not visible to any currently running transaction
>> and clean them up.
>
> +1
>
> But they are not mutually exclusive; I see them as complementary.

I can see how they would be, provided we can be confident that we're going to actually throw an error when the snapshot is out of date and not end up returning incorrect results. We need to be darn sure of that, both now and in a few years from now when many of us may have forgotten about this knob.. ;)

>> If this knob's default is off then I don't think
>> we'd be likely to get the complaints which are being discussed (or, if
>> we did, we could point the individual at the admin who set the knob...).
>
> That's how I see it, too. I would not suggest making the default
> anything other than "off", but there are situations where it would
> be a nice tool to have in the toolbox.

Agreed. Thanks!

Stephen
On Wed, Feb 18, 2015 at 10:57 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Magnus Hagander <magnus@hagander.net> wrote:
>> On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com> wrote:
>>> On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
>>>
>>> But max_standby_streaming_delay, max_standby_archive_delay and
>>> hot_standby_feedback are among the most frequent triggers for
>>> questions and complaints that I/we see.
>>
>> Agreed.
>>
>> And a really bad one used to be vacuum_defer_cleanup_age, because
>> of confusing units amongst other things. Which in turn seems
>> fairly close to Kevin's suggestions, unfortunately.
>
> Particularly my initial suggestion, which was to base snapshot
> "age" on the number of transaction IDs assigned. Does this look
> any better to you if it is something that can be set to '20min' or
> '1h'? Just to restate, that would not automatically cancel the
> snapshots past that age; it would allow vacuum of any tuples which
> became "dead" that long ago, and would cause a "snapshot too old"
> message for any read of a page modified more than that long ago
> using a snapshot which was older than that.
Yes, it would definitely look much better. My reference per above was exactly that - having a setting in the unit "number of xids" confused a lot of users and made it really hard to tune. Having something in time units is a lot easier to understand and tune for most people.
Greg Stark <stark@mit.edu> wrote:
> On Sun, Feb 15, 2015 at 8:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> There might be something in that, but again it's not much like
>> this patch. The key point I think we're both making is that
>> nondeterministic failures are bad, especially when you're
>> talking about long-running, expensive-to-retry transactions.
>
> But I think Kevin would agree with you there. That's why he's
> interested in having the errors not occur if you don't read from
> the volatile tables. Ie, your reporting query reading from
> read-only tables would be guaranteed to succeed even if some
> other table had had some rows vacuumed away.
>
> I'm not sure that's really going to work because of things like
> hint bit updates or HOT pruning. That is, say your table is
> read-only now but last week some of the records were updated. You
> might reasonably expect those rows to be safe and indeed the rows
> themselves will still be in your report. But the earlier versions
> of the rows could still be sitting on some pages invisible to
> every query but not vacuumable quite yet and then suddenly they
> get vacuumed away and the LSN on the page gets bumped.

That's a very interesting point. In the case that the reporting tables are like that (versus being created right before the run, for the report), it would argue for either very aggressive autovacuum settings or an explicit vacuum on those tables before starting the report.

> Fwiw I would strongly suggest that instead of having a number of
> transactions to have a time difference.

On the 15th I said this:

| What this discussion has made me reconsider is the metric for
| considering a transaction "too old". The number of transaction IDs
| consumed seems inferior to LSN or time as the unit of measure for
| that.
|
| It looks to me to be pretty trivial (on the order of maybe 30 lines
| of code) to specify this GUC in minutes rather than transaction
| IDs. At first glance this seems like it would be vulnerable to the
| usual complaints about mismanaged clocks, but those are easily
| answered by using a cached time in shared memory that we populate
| in such a way that it can never move backward. Done right, this
| could even allow the GUC to be changeable on reload rather than
| only at restart. A badly mismanaged system clock could not cause a
| query to generate incorrect results; the worst that could happen is
| that this feature would fail to control bloat as much as expected,
| or reads of modified data could generate the "snapshot too old"
| error around the time of the clock adjustment.
|
| As before, this would default to a magic value to mean that you
| want the historical PostgreSQL behavior.
|
| If that makes the idea more palatable to anyone, I can submit a
| patch to that effect within the next 24 hours.

Until yesterday I didn't get any feedback suggesting that such a change would make the patch more palatable. Now that I have had some, I'll try to post a patch to that effect today.

Thanks!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Stephen Frost <sfrost@snowman.net> wrote:
> Kevin Grittner (kgrittn@ymail.com) wrote:
>> Stephen Frost <sfrost@snowman.net> wrote:
>>> In the end, with a single long-running transaction, the worst bloat you
>>> would have is double the size of the system at the time the long-running
>>> transaction started.
>>
>> I agree that limiting bloat to one dead tuple for every live one
>> for each old snapshot is a limit that has value, and it was unfair
>> of me to characterize that as not being a limit. Sorry for that.
>>
>> This possible solution was discussed with the user whose feedback
>> caused me to write this patch, but there were several reasons they
>> dismissed that as a viable total solution for them, two of which I
>> can share:
>>
>> (1) They have a pool of connections each of which can have several
>> long-running cursors, so the limit from that isn't just doubling
>> the size of their database, it is limiting it to some two-or-three
>> digit multiple of the necessary size.
>
> This strikes me as a bit off-the-cuff; was an analysis done which
> determined that would be the result?

If it sounds off-the-cuff, it is because I am summarizing the mountain of data which has been accumulated over weeks of discussions and many rounds of multi-day tests using their actual software, driven by a software test driver that simulates users, with "think time" and all. To be usable, we need to run on a particular set of hardware with a simulated load of 3000 users. In a two-day test without the patches I'm proposing, their actual, working production software running against PostgreSQL bloated the database by 37% and was on a linear rate of increase. Unpatched, CPU time was consumed at a linear rate due to the bloat, until in the second day it saturated the CPUs and the driver was unable to get acceptable user performance. Because of that we haven't moved on to test what the bloat rate would be like with 3000 users, but I assume it would be higher than with 300 users.

With the two patches I submitted, bloat stabilized at less than 5%, except for some btree indexes which followed a pattern of inserting at the end and deleting most (but not all) of the entries over time. I've been working on a separate patch for that, but it's not ready to present, so I probably won't post that until the first CF of the next release -- unless someone's interested enough to want to discuss it before then.

>> (2) They are already prepared to deal with "snapshot too old"
>> errors on queries that run more than about 20 minutes and which
>> access tables which are modified. They would rather do that than
>> suffer the bloat beyond that point.
>
> That, really, is the crux here -- they've already got support for dealing
> with it the way Oracle did and they'd like PG to do that too.
> Unfortunately, that, by itself, isn't a good reason for a particular
> capability (we certainly aren't going to be trying to duplicate PL/SQL
> in PG any time soon).

I understand that, and if the community doesn't want this style of limitation on bloat we can make it part of PPAS (which *does* duplicate PL/SQL, among other things). We are offering it to the community first.

> That said, there are capabilities in other RDBMSs which are
> valuable and which we *do* want, so the fact that Oracle does this
> also isn't a reason to not include it.

:-)

>>> I'm not against having a knob like this, which is defaulted to off,
>>
>> Thanks! I'm not sure that amounts to a +1, but at least it doesn't
>> sound like a -1. :-)
>
> So, at the time I wrote that, I wasn't sure if it was a +1 or not
> myself. I've been thinking about it since then, however, and I'm
> leaning more towards having the capability than not, so perhaps that's a
> +1, but it doesn't excuse the need to come up with an implementation
> that everyone can be happy with, and what you've come up with so far
> doesn't have a lot of appeal, based on the feedback (I've only glanced
> through it myself, but I agree with Andres and Tom that it's a larger
> footprint than we'd want for this, and the number of places having to be
> touched is concerning as it could lead to future bugs).

This conversation has gotten a little confusing because some of my posts seem to have been held up in the email systems for some reason, and are arriving out of order (and at least one I don't see yet). But I have been responding to these points. For one thing, I stated four days ago that I now feel that the metric for determining that a snapshot is "old" should be *time*, not transaction IDs consumed, and offered to modify the patch in that direction if people agreed. I now see one person agreeing and several people arguing that I should do just that (as though they had not seen me propose it), so I'll try to get a new version out today like that. In any event, I am sure it is possible and feel that it would be better. (Having posted that now at least 4 times, hopefully people will pick up on it.)

For another thing, as mentioned earlier, this is the btree footprint:

 src/backend/access/nbtree/nbtinsert.c |  7 ++++---
 src/backend/access/nbtree/nbtpage.c   |  2 +-
 src/backend/access/nbtree/nbtsearch.c | 43 ++++++++++++++++++++++++++++++++++---------
 src/include/access/nbtree.h           |  7 ++++---
 4 files changed, 43 insertions(+), 16 deletions(-)

That consists mainly of two things. First, there are 13 places (all in nbtsearch.c) immediately following a call to BufferGetPage() where a line like this was added:

 TestForOldSnapshot(snapshot, rel, page);

Second, some functions that were not previously passed a snapshot had that added to the function signature, and the snapshot was passed from the caller where needed. C files other than nbtsearch.c were only modified to pass a NULL in to calls to the functions with modified signatures -- in a total of four places. That is it. To be sure, I had to initially spend some time figuring out which BufferGetPage() calls needed to be followed with this test, but it's not rocket science.

> A lot of that would go away if there was a way to avoid having to mess
> with the index AMs, I'd think, but I wonder if we'd actually need more
> done there -- it's not immediately obvious to me how an index-only scan is
> safe with this. Whenever an actual page is visited, we can check the
> LSN, and the relation can't be truncated by vacuum since the transaction
> will still have a lock on the table which prevents it, but does the
> visibility-map update check make sure to never mark pages all-visible
> when one of these old transactions is running around? On what basis?

I will review that; it may indeed need some attention. The index entry is removed before the line pointer can be freed, but the question is whether there is a window where pruning or vacuum has removed the tuple and changed the page visibility map bit to say that everything on the page is visible to all before the index entry is removed. There may be a need to, as you say, suppress setting the all-visible flag in some particular cases.

>>> but I do think we'd be a lot better off with a system that could
>>> realize when rows are not visible to any currently running transaction
>>> and clean them up.
>>
>> +1
>>
>> But they are not mutually exclusive; I see them as complementary.
>
> I can see how they would be, provided we can be confident that we're
> going to actually throw an error when the snapshot is out of date and
> not end up returning incorrect results. We need to be darn sure of
> that, both now and in a few years from now when many of us may have
> forgotten about this knob.. ;)

I get that this is not zero-cost to maintain, and that it might not be of interest to the community largely for that reason. If you have any ideas about how to improve that, I'm all ears. But do consider the actual scope of what was needed for the btree coverage (above). (Note that the index-only scan issue doesn't look like anything AM-specific; if something is needed for that it would be in the pruning/vacuum area.)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
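For concreteness, the calling pattern described might look like the following. This is an illustration, not an actual hunk from nbtsearch.c; the example function is invented, and TestForOldSnapshot() is the check the patch adds:

#include "postgres.h"

#include "storage/bufmgr.h"
#include "utils/rel.h"
#include "utils/snapshot.h"

/* Added by the patch; see the sketch earlier in the thread. */
extern void TestForOldSnapshot(Snapshot snapshot, Relation rel, Page page);

/*
 * Each BufferGetPage() in a scan path is immediately followed by the
 * check, which may throw the "snapshot too old" error before any page
 * contents are examined.
 */
static void
example_examine_page(Relation rel, Buffer buf, Snapshot snapshot)
{
    Page        page = BufferGetPage(buf);

    TestForOldSnapshot(snapshot, rel, page);

    /* ... existing page-examination logic continues unchanged ... */
}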
On 02/19/2015 09:44 AM, Kevin Grittner wrote:
> On the 15th I said this:
>
> | What this discussion has made me reconsider is the metric for
> | considering a transaction "too old". The number of transaction IDs
> | consumed seems inferior to LSN or time as the unit of measure for
> | that.
> |
> | It looks to me to be pretty trivial (on the order of maybe 30 lines
> | of code) to specify this GUC in minutes rather than transaction
> | IDs. At first glance this seems like it would be vulnerable to the
> | usual complaints about mismanaged clocks, but those are easily
> | answered by using a cached time in shared memory that we populate
> | in such a way that it can never move backward. Done right, this
> | could even allow the GUC to be changeable on reload rather than
> | only at restart. A badly mismanaged system clock could not cause a
> | query to generate incorrect results; the worst that could happen is
> | that this feature would fail to control bloat as much as expected,
> | or reads of modified data could generate the "snapshot too old"
> | error around the time of the clock adjustment.
> |
> | As before, this would default to a magic value to mean that you
> | want the historical PostgreSQL behavior.
> |
> | If that makes the idea more palatable to anyone, I can submit a
> | patch to that effect within the next 24 hours.
>
> Until yesterday I didn't get any feedback suggesting that such a
> change would make the patch more palatable. Now that I have had
> some, I'll try to post a patch to that effect today.
>
> Thanks!

I understand why this makes people nervous. I wonder if it might be more palatable if there were a per-table setting that could enable it? If we could ensure that this was only applied to high-churn queue tables, say, while tables touched by the report writer would not have it applied, that would calm a lot of my fears.

I'm also interested in handling the case Stephen Frost described, where a tuple is effectively dead but we don't currently have the means of discovering the fact, because there is an older long-running transaction which is never in fact going to be able to see the tuple.

cheers

andrew
On Wed, Feb 18, 2015 at 4:57 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> But max_standby_streaming_delay, max_standby_archive_delay and
>> hot_standby_feedback are among the most frequent triggers for
>> questions and complaints that I/we see.
>>
> Agreed.
> And a really bad one used to be vacuum_defer_cleanup_age, because
> of confusing units amongst other things. Which in terms of units
> seems fairly close to Kevin's suggestions, unfortunately.
Particularly my initial suggestion, which was to base snapshot
"age" on the number of transaction IDs assigned. Does this look
any better to you if it is something that can be set to '20min' or
'1h'? Just to restate, that would not automatically cancel the
snapshots past that age; it would allow vacuum of any tuples which
became "dead" that long ago, and would cause a "snapshot too old"
message for any read of a page modified more than that long ago
using a snapshot which was older than that.
I like this thought. One of the first things I do in a new Pg environment is to set up a cronjob that watches pg_stat_activity and terminates most backends over N minutes in age (about 5x the length of normal work), with an exception for a handful of accounts doing backups and other maintenance operations. This prevents a stuck client from jamming up the database.
Would pg_dump be able to opt-out of such a restriction?
regards,
Rod
On Thu, Feb 19, 2015 at 6:01 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> I can see how they would be, provided we can be confident that we're
>> going to actually throw an error when the snapshot is out of date and
>> not end up returning incorrect results. We need to be darn sure of
>> that, both now and in a few years from now when many of us may have
>> forgotten about this knob.. ;)
>
> I get that this is not zero-cost to maintain, and that it might not
> be of interest to the community largely for that reason. If you
> have any ideas about how to improve that, I'm all ears. But do
> consider the actual scope of what was needed for the btree coverage
> (above). (Note that the index-only scan issue doesn't look like
> anything AM-specific; if something is needed for that it would be
> in the pruning/vacuum area.)

If I understood the issue correctly, you have long-running snapshots
accessing one part of the data, while you have high churn on a
disjoint part of the data. We would need to enable vacuum on the
high-churn data while still being able to run those long queries. The
"snapshot too old" solution limits the maximum age of the global xmin
at the cost of having to abort transactions that are older than this
and happen to touch a page that was modified after they were started.

What about having the long-running snapshots declare their working
set, and then only take them into account for global xmin for
relations that are in the working set? Like a SET TRANSACTION WORKING
SET command. This way the error is deterministic, vacuum on the
high-churn tables doesn't have to wait for the old-transaction delay
to expire, and we avoid a hard-to-tune GUC (what do I need to set
old_snapshot_threshold to, to reduce bloat while not having "normal"
transactions abort?).

Regards,
Ants Aasma

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
Andrew Dunstan <andrew@dunslane.net> wrote:
> On 02/19/2015 09:44 AM, Kevin Grittner wrote:

> I understand why this makes people nervous. I wonder if it might be
> more palatable if there were a per-table setting that could enable
> it? If we could ensure that this was only applied to high-churn
> queue tables, say, while tables touched by the report writer would
> not have it applied, that would calm a lot of my fears.

That's an interesting idea. I think I should switch the unit of
measure for "too old" to a time-based value first, since there seems
to be universal agreement that it would be better than a number of
transactions. Once I've cleared that hurdle I'll see what this would
take.

> I'm also interested in handling the case Stephen Frost described,
> where a tuple is effectively dead but we don't currently have the
> means of discovering the fact, because there is an older long-running
> transaction which is never in fact going to be able to see the tuple.

Absolutely. That's one of several other issues that I've been looking
at over the last few weeks. It sounds like people are already working
on that one, which is great. My personal priority list included that,
but after the two patches I submitted here and a patch to allow
combining near-empty btree pages, so that btree bloat is constrained
without having to reindex periodically for the cases where index
tuples are added in index order (at one or a few insertion points) and
most-but-not-all are deleted. You can currently wind up with a worst
case of one index tuple per page, with no action by vacuum processes
to reduce that bloat.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Feb 19, 2015 at 3:44 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> I'm also interested in handling the case Stephen Frost described,
>> where a tuple is effectively dead but we don't currently have the
>> means of discovering the fact, because there is an older long-running
>> transaction which is never in fact going to be able to see the tuple.
>
> Absolutely. That's one of several other issues that I've been
> looking at over the last few weeks. It sounds like people are
> already working on that one, which is great. My personal priority
> list included that, but after the two patches I submitted here and a
> patch to allow combining near-empty btree pages, so that btree bloat
> is constrained without having to reindex periodically for the cases
> where index tuples are added in index order (at one or a few
> insertion points) and most-but-not-all are deleted. You can
> currently wind up with a worst case of one index tuple per page,
> with no action by vacuum processes to reduce that bloat.

I'd be willing to test that patch. I have a big database that does
that, and a script that fills the disk with said bloat. That's forced
me to do periodic reindexing, which sucks.
Rod Taylor <rod.taylor@gmail.com> wrote:

> Would pg_dump be able to opt-out of such a restriction?

I don't see how, since vacuum would be removing recently dead tuples
that are still visible; the alternative to getting a "snapshot too
old" error when reading a page which could be affected is to return
incorrect results, and nobody wants that. The best you could do if
you wanted to run pg_dump (or similar) and it might take more time
than your old_snapshot_threshold would be to increase the threshold,
reload, dump, set it back to the "normal" setting, and reload again.
Andrew's suggestion of setting this per table would not help here.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Ants Aasma <ants@cybertec.at> wrote:

> If I understood the issue correctly, you have long-running snapshots
> accessing one part of the data, while you have high churn on a
> disjoint part of the data. We would need to enable vacuum on the
> high-churn data while still being able to run those long queries.
> The "snapshot too old" solution limits the maximum age of the global
> xmin at the cost of having to abort transactions that are older than
> this and happen to touch a page that was modified after they were
> started.

Pretty much. It's not quite as "black and white" regarding high-churn
and stable tables, though -- at least for the customer whose feedback
drove my current work on containing bloat. They are willing to
tolerate an occasional "snapshot too old" error on a cursor, and they
can automatically pick up again by restarting the cursor with a new
"lower limit" and a new snapshot.

> What about having the long-running snapshots declare their working
> set, and then only take them into account for global xmin for
> relations that are in the working set? Like a SET TRANSACTION WORKING
> SET command. This way the error is deterministic, vacuum on the
> high-churn tables doesn't have to wait for the old-transaction delay
> to expire, and we avoid a hard-to-tune GUC (what do I need to set
> old_snapshot_threshold to, to reduce bloat while not having "normal"
> transactions abort?).

Let me make sure I understand your suggestion. You are suggesting
that within a transaction you can declare a list of tables which
should get the "snapshot too old" error (or something like it) if a
page is read which was modified after the snapshot was taken?
Snapshots within that transaction would not constrain the effective
global xmin for purposes of vacuuming those particular tables?

If I'm understanding your proposal, that would help some of the cases
the "snapshot too old" approach addresses, but it would not handle the
accidental "idle in transaction" case (e.g., start a repeatable read
transaction, run one query, then sit idle indefinitely). I don't
think it would work quite as well for some of the other cases I've
seen, but perhaps it could be made to fit if we could figure out the
accidental "idle in transaction" case.

I also think it might be hard to develop a correct global xmin for
vacuuming a particular table with that solution. How do you see that
working?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/19/2015 03:31 PM, Kevin Grittner wrote:
>> What about having the long-running snapshots declare their working
>> set, and then only take them into account for global xmin for
>> relations that are in the working set? Like a SET TRANSACTION WORKING
>> SET command. This way the error is deterministic, vacuum on the
>> high-churn tables doesn't have to wait for the old-transaction delay
>> to expire, and we avoid a hard-to-tune GUC (what do I need to set
>> old_snapshot_threshold to, to reduce bloat while not having "normal"
>> transactions abort?).
>
> Let me make sure I understand your suggestion. You are suggesting
> that within a transaction you can declare a list of tables which
> should get the "snapshot too old" error (or something like it) if a
> page is read which was modified after the snapshot was taken?
> Snapshots within that transaction would not constrain the effective
> global xmin for purposes of vacuuming those particular tables?

I thought it meant that the declared tables would only be vacuumed
conservatively, so the transaction would expect not to see "snapshot
too old" from them.

cheers

andrew
<p dir="ltr">On Feb 19, 2015 10:31 PM, "Kevin Grittner" <<a href="mailto:kgrittn@ymail.com">kgrittn@ymail.com</a>>wrote:<br /> > > What about having the long running snapshotsdeclare their working<br /> > > set, and then only take them into account for global xmin for<br /> > >relations that are in the working set? Like a SET TRANSACTION WORKING<br /> > > SET command. This way the erroris deterministic, vacuum on the high<br /> > > churn tables doesn't have to wait for the old transaction delayto<br /> > > expire and we avoid a hard to tune GUC (what do I need to set<br /> > > old_snapshot_thresholdto, to reduce bloat while not having "normal"<br /> > > transactions abort).<br /> ><br />> Let me make sure I understand your suggestion. You are suggesting<br /> > that within a transaction you can declarea list of tables which<br /> > should get the "snapshot too old" error (or something like it) if a<br /> > pageis read which was modified after the snapshot was taken?<br /> > Snapshots within that transaction would not constrainthe effective<br /> > global xmin for purposes of vacuuming those particular tables?<p dir="ltr">Sorry, I shouldhave been clearer. I'm proposing that a transaction can declare what tables it will access. After that the transactionwill constrain xmin for only those tables. Accessing any other table would give an error immediately.<p dir="ltr">>If I'm understanding your proposal, that would help some of the<br /> > cases the "snapshot too old" caseaddresses, but it would not<br /> > handle the accidental "idle in transaction" case (e.g., start a<br /> > repeatableread transaction, run one query, then sit idle<br /> > indefinitely). I don't think it would work quite aswell for some<br /> > of the other cases I've seen, but perhaps it could be made to fit<br /> > if we could figureout the accidental "idle in transaction" case.<p dir="ltr">Accidental idle in transaction seems better handled by justterminating transactions older than some timeout. This is already doable externally, but an integrated solution wouldbe nice if we could figure out how the permissions for setting such a timeout work.<p dir="ltr">> I also think itmight be hard to develop a correct global xmin for<br /> > vacuuming a particular table with that solution. How doyou see<br /> > that working?<p dir="ltr">I'm imagining the long running transaction would register it's xmin in a sharedmap keyed by relation for each referenced relation and remove the xmin from procarray. Vacuum would access this mapby relation, determine the minimum and use that if it's earlier than the global xmin. I'm being deliberately vague hereabout the datastructure in shared memory as I don't have a great idea what to use there. It's somewhat similar to thelock table in that in theory the size is unbounded, but in practice it's expected to be relatively tiny.<p dir="ltr">Regards,<br/> Ants Aasma<br />
Kevin,

* Kevin Grittner (kgrittn@ymail.com) wrote:
> Stephen Frost <sfrost@snowman.net> wrote:
> > Kevin Grittner (kgrittn@ymail.com) wrote:
> >> (1) They have a pool of connections each of which can have several
> >> long-running cursors, so the limit from that isn't just doubling
> >> the size of their database, it is limiting it to some two-or-three
> >> digit multiple of the necessary size.
> >
> > This strikes me as a bit off-the-cuff; was an analysis done which
> > determined that would be the result?
>
> If it sounds like off-the-cuff it is because I am summarizing the
> mountain of data which has been accumulated over weeks of
> discussions and many rounds of multi-day tests using their actual
> software driven by a software test driver that simulates users,
> with "think time" and all.

My apologies for, unintentionally, implying that you hadn't done
sufficient research or testing. That certainly was *not* my intent
and it's absolutely clear to me that you've spent quite a bit of time
analyzing this problem.

What I was trying to get at is that the approach which I outlined was,
perhaps, not closely analyzed. Now, analyzing a problem based on a
solution which doesn't actually exist isn't exactly trivial, and while
it might be possible to do in this case it'd probably be easier to
simply write the code and then review the results than to perform an
independent analysis. I do feel that the solution might have been
dismissed on the basis of "doubling the database for each long-running
transaction is not acceptable" or even "having bloat because of a
long-running transaction is not acceptable," but really analyzing what
would actually happen is a much more difficult exercise.

> With the two patches I submitted, bloat stabilized at less than 5%
> except for some btree indexes which followed a pattern of inserting
> at the end and deleting most (but not all) of the entries over
> time. I've been working on a separate patch for that, but it's not
> ready to present, so I probably won't post that until the first CF
> of the next release -- unless someone's interested enough to want
> to discuss it before then.

Understood. I'd encourage you to post your thoughts on it- I'm
certainly interested in it, even if it isn't something we'll be able
to really work on for 9.5.

> > That, really, is the crux here- they've already got support for
> > dealing with it the way Oracle did and they'd like PG to do that
> > too. Unfortunately, that, by itself, isn't a good reason for a
> > particular capability (we certainly aren't going to be trying to
> > duplicate PL/SQL in PG any time soon).
>
> I understand that, and if the community doesn't want this style of
> limitation on bloat we can make it part of PPAS (which *does*
> duplicate PL/SQL, among other things). We are offering it to the
> community first.

That is certainly appreciated and I'd encourage you to continue to do
so. I do feel that the community, in general, is interested in these
kinds of knobs- it's just a question of implementation and making sure
that it's a carefully considered solution and not a knee-jerk reaction
to match what another RDBMS does.

> >>> I'm not against having a knob like this, which is defaulted to off,
> >>
> >> Thanks! I'm not sure that amounts to a +1, but at least it doesn't
> >> sound like a -1. :-)
> >
> > So, at the time I wrote that, I wasn't sure if it was a +1 or not
> > myself. I've been thinking about it since then, however, and I'm
> > leaning more towards having the capability than not, so perhaps
> > that's a +1, but it doesn't excuse the need to come up with an
> > implementation that everyone can be happy with, and what you've
> > come up with so far doesn't have a lot of appeal, based on the
> > feedback (I've only glanced through it myself, but I agree with
> > Andres and Tom that it's a larger footprint than we'd want for this
> > and the number of places having to be touched is concerning, as it
> > could lead to future bugs).
>
> This conversation has gotten a little confusing because some of my
> posts seem to have been held up in the email systems for some reason,
> and are arriving out of order (and at least one I don't see yet). But
> I have been responding to these points. For one thing, I stated four
> days ago that I now feel that the metric for determining that a
> snapshot is "old" should be *time*, not transaction IDs consumed, and
> offered to modify the patch in that direction if people agreed. I
> now see one person agreeing and several people arguing that I should
> do just that (as though they had not seen me propose it), so I'll try
> to get a new version out today along those lines. In any event, I am
> sure it is possible and feel that it would be better. (Having posted
> that now at least four times, hopefully people will pick up on it.)

I agree with others that having it be time-based would be better. I
did see that, but it's really an independent consideration from the
footprint of the patch.

> For another thing, as mentioned earlier, this is the btree footprint:
>
>  src/backend/access/nbtree/nbtinsert.c |  7 ++++---
>  src/backend/access/nbtree/nbtpage.c   |  2 +-
>  src/backend/access/nbtree/nbtsearch.c | 43 ++++++++++++++++++++++++++++++++++---------
>  src/include/access/nbtree.h           |  7 ++++---
>  4 files changed, 43 insertions(+), 16 deletions(-)

Understood- but that's not really the issue here, it's...

> That consists mainly of two things. First, there are 13 places (all
> in nbtsearch.c) where, immediately following a call to
> BufferGetPage(), a line like this was added:
>
>     TestForOldSnapshot(snapshot, rel, page);

This. That's quite a few different places which had to be updated,
and while they are obvious now, it might not be obvious in the future
which BufferGetPage() calls need to be followed by that
TestForOldSnapshot() call.

> Second, some functions that were not previously passed a snapshot had
> that added to the function signature, and the snapshot was passed
> from the caller where needed. C files other than nbtsearch.c were
> only modified to pass a NULL in to calls to the functions with
> modified signatures -- in a total of four places.

This part doesn't bother me nearly as much as having to add those
tests into a lot of different places in the code, and having to get
the index AMs involved.

> > A lot of that would go away if there was a way to avoid having to
> > mess with the index AMs, I'd think, but I wonder if we'd actually
> > need more done there- it's not immediately obvious to me how an
> > index-only scan is safe with this. Whenever an actual page is
> > visited, we can check the LSN, and the relation can't be truncated
> > by vacuum since the transaction will still have a lock on the
> > table which prevents it, but does the visibility-map update check
> > make sure to never mark pages all-visible when one of these old
> > transactions is running around? On what basis?
>
> I will review that; it may indeed need some attention. The index
> entry is removed before the line pointer can be freed, but the
> question is whether there is a window where pruning or vacuum has
> removed the tuple and changed the page's visibility map bit to say
> that everything on the page is visible to all before the index entry
> is removed. There may be a need to, as you say, suppress setting the
> all-visible flag in some particular cases.

Right, there may be a period of time where the flag is set to
all-visible and the index entry still exists. Isn't there a concern
about the index entry disappearing also, though? If there is a
specific query against a particular value in the index and the index
no longer has that value, then the query would return zero results-
but this could happen without visiting the heap, right? Therefore, how
do you know whether the value was, or wasn't, visible at the time that
the original long-running snapshot was taken?

> >>> but I do think we'd be a lot better off with a system that could
> >>> realize when rows are not visible to any currently running
> >>> transaction and clean them up.
> >>
> >> +1
> >>
> >> But they are not mutually exclusive; I see them as complementary.
> >
> > I can see how they would be, provided we can be confident that
> > we're going to actually throw an error when the snapshot is out of
> > date and not end up returning incorrect results. We need to be
> > darn sure of that, both now and in a few years from now when many
> > of us may have forgotten about this knob.. ;)
>
> I get that this is not zero-cost to maintain, and that it might not
> be of interest to the community largely for that reason. If you have
> any ideas about how to improve that, I'm all ears. But do consider
> the actual scope of what was needed for the btree coverage (above).
> (Note that the index-only scan issue doesn't look like anything
> AM-specific; if something is needed for that it would be in the
> pruning/vacuum area.)

Maintenance is certainly one concern among many, but I'm not sure that
we're quite ready to write this capability off due to the
maintainability of this particular patch- it needs to be considered
against the alternatives proposed and the risk of breaking this
particular capability in the future. I agree that the index issue is
not AM-specific, but it's still a consideration.

Thanks!

Stephen
On 2/19/15 1:54 PM, Kevin Grittner wrote:
> Rod Taylor <rod.taylor@gmail.com> wrote:
>
>> Would pg_dump be able to opt-out of such a restriction?
>
> I don't see how, since vacuum would be removing recently dead
> tuples that are still visible; the alternative to getting a
> "snapshot too old" error when reading a page which could be
> affected is to return incorrect results, and nobody wants that.
> The best you could do if you wanted to run pg_dump (or similar) and
> it might take more time than your old_snapshot_threshold would be
> to increase the threshold, reload, dump, set it back to the
> "normal" setting, and reload again.

While I think pg_dump is a great solution for small to medium
installations, there are a number of excellent file-based backup
options available. Anyone who is seriously worried about bloat (or
locking) should be looking to those solutions.

--
- David Steele
david@pgmasters.net
Stephen Frost <sfrost@snowman.net> wrote:
> Kevin Grittner (kgrittn@ymail.com) wrote:
>> With the two patches I submitted, bloat stabilized at less than 5%
>> except for some btree indexes which followed a pattern of inserting
>> at the end and deleting most (but not all) of the entries over
>> time. I've been working on a separate patch for that, but it's not
>> ready to present, so I probably won't post that until the first CF
>> of the next release -- unless someone's interested enough to want
>> to discuss it before then.
>
> Understood. I'd encourage you to post your thoughts on it- I'm
> certainly interested in it, even if it isn't something we'll be able
> to really work on for 9.5.

OK, if you're up for discussing it even while it's in early
development, I'll start a separate thread to outline the idea. As a
preliminary "teaser" in very abbreviated form, the idea is basically
to use the "page split" logic, but splitting all tuples from multiple
adjacent pages onto a new page to the right. The old tuples maintain
their pointers (to support scans positioned in that range), but all
pointers *to* those pages will be redirected to the new page. Let's
not bifurcate this thread by discussing it further here; please wait
until I start the new thread, which will be after I post the next
version of this "snapshot too old" patch.

> Isn't there a concern about the index entry disappearing also
> though? If there is a specific query against a particular value in
> the index and the index no longer has that value then the query
> would return zero results- but this could happen without visiting
> the heap, right?

Right; that's why I don't see how to avoid touching the index AMs.
The index pages also contain the LSN, and if you are not visiting the
heap it is because some page in the index was changed, and you will
see the "too recent" LSN there.

> Therefore, how do you know that the value was, or wasn't, visible at
> the time that the original long-running snapshot was taken.

Strictly speaking, you don't know anything that specific. You know
that changes were made that *might* have removed tuples that would
have been visible to the snapshot. Once a snapshot passes the
threshold for being considered "old", then viewing any page changed
after the snapshot was built has to trigger the error. That means
some false positives, but as long as you can't get false negatives
there is never a silently incorrect query result.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Feb 16, 2015 at 10:49 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
>
> What this discussion has made me reconsider is the metric for
> considering a transaction "too old". The number of transaction IDs
> consumed seems inferior as the unit of measure for that to LSN or
> time.
>
> It looks to me to be pretty trivial (on the order of maybe 30 lines
> of code) to specify this GUC in minutes rather than transaction
> IDs.
It seems to me that SQL Server also uses a similar mechanism to
avoid bloat in the version store (the place to store previous
versions of records).
The mechanism they use is that during the shrink process, the
longest-running transactions that have not yet generated row
versions are marked as victims. If a transaction is marked as a
victim, it can no longer read the row versions in the version store.
When it attempts to read row versions, an error message is
generated and the transaction is rolled back.
I am not sure how much weight it adds to the usefulness of this
patch; however, I think that if other leading databases provide a
way to control the bloat, it indicates that most customers with
write-intensive workloads would like to see such an option.
Amit Kapila <amit.kapila16@gmail.com> wrote:

> It seems to me that SQL Server also uses a similar mechanism to
> avoid bloat in the version store (the place to store previous
> versions of records).

> I think that if other leading databases provide a way to control
> the bloat, it indicates that most customers with write-intensive
> workloads would like to see such an option.

Yeah, I think many users are surprised by the damage that can be done
to a database by an idle connection or a long-running read-only query,
especially when coming from other databases which have protections
against that.

The bad news is that changing to specifying the limit in time rather
than transaction IDs is far more complicated than I thought.
Basically, we need to associate times in the past with the value of
latestCompletedXid (or derived values like snapshot xmin) at those
times. I was initially thinking that by adding a timestamp to the
snapshot we could derive this from active snapshots. For any one
connection, it's not that hard for it to scan the pairing heap of
snapshots and pick out the right values; but the problem is that it is
process-local memory, and this needs to work while the connection is
sitting idle for long periods of time. I've got a couple of ideas on
how to approach it, but neither seems entirely satisfying:

(1) Use a new timer ID to interrupt the process whenever its oldest
snapshot which hasn't yet crossed the "old snapshot" threshold does
so. The problem is that this seems as though it would need to do an
awful lot more work (scanning the pairing heap for the best match)
than anything else is currently doing in a timeout handler. One
alternative would be to keep a second pairing heap to track snapshots
which have not crossed the threshold, removing items as they get old
-- that would make the work needed in the timeout handler closer to
other handlers.

(2) Use a coarse enough granularity on time and a short enough
maximum for the GUC to just keep a circular buffer of the mappings in
memory. We might be able to make this dense enough that one-minute
resolution for up to 60 days could fit in 338kB. Obviously that could
be reduced with coarser granularity or a shorter maximum.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
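For scale, option (2) pencils out as follows: 60 days of one-minute
buckets is 86,400 entries, and at four bytes per TransactionId that is
337.5kB -- the 338kB figure above. A hypothetical layout, with all
names illustrative and locking omitted:

    #define TIME_MAP_ENTRIES   (60 * 24 * 60)  /* one minute for 60 days */

    typedef struct OldSnapshotTimeMap
    {
        int             head;           /* index of the newest entry */
        int             count;          /* number of valid entries */
        TimestampTz     head_timestamp; /* minute the newest entry covers */
        TransactionId   xid_by_minute[TIME_MAP_ENTRIES];   /* ~337.5kB */
    } OldSnapshotTimeMap;

    /*
     * Translate "N minutes ago" into the newest xid recorded at or
     * before that time, so vacuum can use it as a cutoff.
     */
    static TransactionId
    XidFromTimeMap(OldSnapshotTimeMap *map, TimestampTz when)
    {
        long    minutes_back;

        minutes_back = (map->head_timestamp - when) / USECS_PER_MINUTE;
        if (minutes_back < 0)
            minutes_back = 0;               /* future time: newest entry */
        if (minutes_back >= map->count)
            return InvalidTransactionId;    /* older than anything kept */

        return map->xid_by_minute[(map->head - (int) minutes_back +
                                   TIME_MAP_ENTRIES) % TIME_MAP_ENTRIES];
    }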
On 02/22/2015 11:48 AM, Kevin Grittner wrote:
>
> (2) Use a coarse enough granularity on time and a short enough
> maximum for the GUC to just keep a circular buffer of the mappings
> in memory. We might be able to make this dense enough that
> one-minute resolution for up to 60 days could fit in 338kB.
> Obviously that could be reduced with coarser granularity or a
> shorter maximum.

This doesn't sound too bad to me. Presumably these would be tunable.
I think one-minute granularity would be fine for most purposes, but I
can imagine people who would want it finer, like 10 seconds, say.

cheers

andrew
On Sun, Feb 22, 2015 at 10:18 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> > It seems to me that SQL Server also uses a similar mechanism to
> > avoid bloat in the version store (the place to store previous
> > versions of records).
>
> > I think that if other leading databases provide a way to control
> > the bloat, it indicates that most customers with write-intensive
> > workloads would like to see such an option.
>
> Yeah, I think many users are surprised by the damage that can be
> done to a database by an idle connection or long-running read-only
> query, especially when coming from other databases which have
> protections against that.
>
> The bad news is that changing to specifying the limit in time
> rather than transaction IDs is far more complicated than I thought.
> Basically, we need to associate times in the past with the value of
> latestCompletedXid (or derived values like snapshot xmin) at those
> times. I was initially thinking that by adding a timestamp to the
> snapshot we could derive this from active snapshots. For any one
> connection, it's not that hard for it to scan the pairing heap of
> snapshots and pick out the right values; but the problem is that it
> is process-local memory and this needs to work while the connection
> is sitting idle for long periods of time.
Could you please explain in slightly more detail why it can't work
if we use a timestamp instead of snapshot->xmin in your patch in
the function TestForOldSnapshot()?
> I've got a couple of ideas
> on how to approach it, but neither seems entirely satisfying:
>
> (1) Use a new timer ID to interrupt the process whenever its
> oldest snapshot which hasn't yet crossed the "old snapshot"
> threshold does so. The problem is that this seems as though it
> would need to do an awful lot more work (scanning the pairing heap
> for the best match) than anything else is currently doing in a
> timeout handler. One alternative would be to keep a second pairing
> heap to track snapshots which have not crossed the threshold,
> removing items as they get old -- that would make the work needed
> in the timeout handler closer to other handlers.
>
> (2) Use a coarse enough granularity on time and a short enough
> maximum for the GUC to just keep a circular buffer of the mappings
> in memory. We might be able to make this dense enough that
> one-minute resolution for up to 60 days could fit in 338kB. Obviously
> that could be reduced with coarser granularity or a shorter
> maximum.
>
How exactly will this allow the "snapshot too old" error to be
returned? Do we need a check for the page (its LSN) as in your
current patch?
Amit Kapila <amit.kapila16@gmail.com> wrote:

> Could you please explain in slightly more detail why it can't work
> if we use a timestamp instead of snapshot->xmin in your patch in
> the function TestForOldSnapshot()?

It works fine for the additional visibility checking in scans, but it
doesn't cover the vacuuming -- that needs to use a transaction ID for
its cutoff.

> How exactly will this allow the "snapshot too old" error to be
> returned? Do we need a check for the page (its LSN) as in your
> current patch?

The change to the part where it actually throws the error is very
small: just checking time instead of xmin.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
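In code terms, the page-LSN comparison in TestForOldSnapshot() stays
as it is; only the threshold test changes from an xid comparison to a
timestamp comparison. A minimal sketch of the time-based test,
assuming a whenTaken timestamp saved in the snapshot and a GUC in
minutes (both assumptions for illustration):

    /*
     * Minimal sketch of a time-based "is this snapshot old?" test.
     * The whenTaken field and minute-based GUC units are assumptions;
     * a real version would also use a monotonic cached timestamp, as
     * discussed up-thread, rather than the raw system clock.
     */
    static bool
    SnapshotTooOld(Snapshot snapshot)
    {
        if (old_snapshot_threshold < 0)
            return false;       /* disabled: match unpatched behavior */

        return GetCurrentTimestamp() - snapshot->whenTaken >
            (TimestampTz) old_snapshot_threshold * USECS_PER_MINUTE;
    }

The vacuum side still needs a TransactionId cutoff, which is what the
time-to-xid mapping discussed in the preceding messages is for.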
On 17 February 2015 at 06:35, Magnus Hagander <magnus@hagander.net> wrote:
>
> On Feb 17, 2015 12:26 AM, "Andres Freund" <andres@2ndquadrant.com> wrote:
>>
>> On 2015-02-16 16:35:46 -0500, Bruce Momjian wrote:
>>> It seems we already have a mechanism in place that allows tuning of
>>> query cancel on standbys vs. preventing standby queries from seeing
>>> old data, specifically
>>> max_standby_streaming_delay/max_standby_archive_delay. We obsessed
>>> about how users were going to react to these odd variables, but
>>> there has been little negative feedback.
>>
>> FWIW, I think that's a somewhat skewed perception. I think it was
>> right to introduce those, because we didn't really have any
>> alternatives.
>>
>> But max_standby_streaming_delay, max_standby_archive_delay and
>> hot_standby_feedback are among the most frequent triggers for
>> questions and complaints that I/we see.
>
> Agreed.
>
> And a really bad one used to be vacuum_defer_cleanup_age, because of
> confusing units amongst other things. Which in terms of units seems
> fairly close to Kevin's suggestions, unfortunately.

I agree with Bruce that we already have a mechanism similar to this
for Hot Standby, so it should be OK to have it for normal running when
users desire it.

The key difference here is that in this patch we use the LSN to look
for changed data blocks, allowing queries to proceed for longer than
they would do on Hot Standby, where they currently just get blown away
regardless. I like the proposal in this patch, but it is more extreme
than the mechanism Hot Standby provides. (If we implement this then I
would want to see it work for Hot Standby also, so we can keep the
mechanisms in step; I think it is too late in the day to add that for
9.5.)

I also agree with Andres and Magnus in saying that with the current
definition the tunable parameter would be hard to use.

In response to Tom's perspective that it is the single most annoying
feature of Oracle, I agree. It just happens we have a similar problem:
bloat. Having a parameter to define "maximum acceptable bloat" would
be a very useful thing in PostgreSQL. IIRC it took other DBMSs a lot
of work to make "snapshot too old" occur in controllable
circumstances. So I would call having a very crap parameter like
old_snapshot_threshold a good start, but clearly not the end point of
an initiative to improve our control of bloat.

+1 to include this patch in 9.5, as long as there is agreement to
improve this in the future.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Would a parameter to auto-terminate long-running transactions be a
better solution? Terminating the long-running transaction would allow
the normal vacuuming process to clean up the deleted records, thus
avoiding database bloat without introducing new semantics to
Postgres's MVCC. I would also recommend that the default be disabled.
On 16/03/15 13:04, Rui DeSousa wrote:
> Would a parameter to auto-terminate long-running transactions be a
> better solution? Terminating the long-running transaction would allow
> the normal vacuuming process to clean up the deleted records, thus
> avoiding database bloat without introducing new semantics to
> Postgres's MVCC. I would also recommend that the default be disabled.

Hmm... Better per transaction, or even per statement. If I fire off a
query I expect to complete within 30 seconds, then I probably don't
mind it aborting after 5 minutes. However, if I fire off a query that
I expect to take between an hour and 3 hours, then I definitely don't
want it aborted after 5 minutes!

It could be that 99.99% of transactions will complete within a minute,
but a small minority might reasonably be expected to take much longer.

Cheers,
Gavin
On Sun, Mar 15, 2015 at 8:04 PM, Rui DeSousa <rui@crazybean.net> wrote:
> Would a parameter to auto-terminate long-running transactions be a
> better solution? Terminating the long-running transaction would allow
> the normal vacuuming process to clean up the deleted records, thus
> avoiding database bloat without introducing new semantics to
> Postgres's MVCC. I would also recommend that the default be disabled.

An advantage of Kevin's approach is that you don't have to abort *all*
long-running transactions, only those that touch data which has been
modified since then. If your system is read-mostly, that could
dramatically reduce the number of aborts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
<div dir="ltr"><br /><div class="gmail_extra"><br /><div class="gmail_quote">On Mon, Mar 16, 2015 at 11:46 AM, Robert Haas<span dir="ltr"><<a href="mailto:robertmhaas@gmail.com" target="_blank">robertmhaas@gmail.com</a>></span> wrote:<br/><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">OnSun, Mar 15, 2015 at 8:04 PM, Rui DeSousa <<a href="mailto:rui@crazybean.net">rui@crazybean.net</a>> wrote:<br/> > Would a parameter to auto terminate long running transactions be a better solution? Terminating the longrunning transaction would allow for the normal vacuuming process to cleanup the deleted records thus avoiding databasebloat without introducing new semantics to Postgres's MVCC. I would also recommend that the default be disabled.<br/><br /></span>An advantage of Kevin's approach is that you don't have to abort *all*<br /> long-running transactions,only those that touch data which has been<br /> modified since then. If your system is read-mostly, that could<br/> dramatically reduce the number of aborts.<br /><span class="HOEnZb"><font color="#888888"></font></span><br /></blockquote></div><br/></div><div class="gmail_extra">Indeed. The suggestion of auto-terminating long running transactionsmisses the point, ISTM. Most of the use cases I can see for this involve vacuuming tables that the long runningtransaction will have no interest in at all (which is why I suggested being able to set it on a per table basis).I certainly don't want to be terminating my long running report transaction so that my queue table which is only evertouched by very short-lived transactions can be reasonably vacuumed. But that's what we have to do now.<br /><br /></div><divclass="gmail_extra">cheers<br /><br /></div><div class="gmail_extra">andrew<br /></div></div>
On Sun, Nov 8, 2020 at 5:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jim Nasby <Jim.Nasby@bluetreble.com> writes:
>> On 2/15/15 10:36 AM, Tom Lane wrote:
>>> I wonder if we couldn't achieve largely the same positive effects
>>> through adding a simple transaction-level timeout option.
>
>> A common use-case is long-running reports hitting relatively stable
>> data in a database that also has tables with a high churn rate (ie:
>> queue tables). In those scenarios your only options right now are to
>> suffer huge amounts of bloat in the high-churn tables or not do your
>> reporting. A simple transaction timeout only "solves" this by denying
>> you reporting queries.
>
> Agreed, but Kevin's proposal has exactly the same problem only worse,
> because (a) the reporting transactions might or might not fail (per
> Murphy, they'll fail consistently only when you're on deadline), and
> (b) if they do fail, there's nothing you can do short of increasing
> the slack db-wide.
>
>> An idea that I've had on this would be some way to "lock down" the
>> tables that a long-running transaction could access. That would allow
>> vacuum to ignore any snapshots that transaction had for tables it
>> wasn't accessing. That's something that would be deterministic.
>
> There might be something in that, but again it's not much like this
> patch. The key point I think we're both making is that
> nondeterministic failures are bad,
Based on my experience, long transactions caused by long-running
queries are not that common. A more common case I have met is that a
user starts a transaction but forgets to close it; I am thinking about
whether we can do something for this situation.

Since most users will use the Read Committed isolation level, after a
user completes a query the next query will use a fresh snapshot, so
there is no need to hold back the oldest xmin on its account. Would it
be correct to advance the oldest xmin in this case? If yes, what would
be the blockers in implementing this feature?
Best Regards
Andy Fan