Thread: Turning off HOT/Cleanup sometimes
VACUUM cleans up blocks, which is nice because it happens offline in a lazy manner.

We also make SELECT clean up blocks as it goes. That is useful in OLTP workloads, but it means that large SQL queries and pg_dump effectively do much the same work as VACUUM, generating huge amounts of I/O and WAL on the master, the cost and annoyance of which is experienced directly by the user. That is avoided on standbys.

The effects are that long-running statements often run much longer than we want, increasing bloat as a result. It also produces wildly varying response times, depending upon the extent of cleanup required.

It is a simple task to make that behaviour optional on the master.

I propose a USERSET parameter, prune_cost_limit (<--- insert better name here), which will make the behaviour optional, default -1, in normal user processes. VACUUM will ignore this parameter and so its actions will never be deferred.

In detail, this parameter would disable pruning for any scan larger than the cost limit, so large scans will disable the behaviour. The default, -1, means never disable pruning, which is the current behaviour.

We track the number of pages dirtied by the current statement. When this reaches prune_cost_limit, we will apply these behaviours to all shared_buffer block accesses...

(1) avoid running heap_page_prune_opt()

(2) avoid dirtying the buffer for hints. (This is safe because the hinted changes will either be lost or will be part of the full page image when we make a logged change.)

(i.e. doesn't apply to temp tables)

For example, if we set prune_cost_limit = 4, this behaviour allows small index lookups via bitmap heap scan to continue to clean up, while larger index and seq scans will avoid cleanup.

There would be a postgresql.conf parameter prune_cost_limit, as well as a table-level parameter that would prevent pruning except via VACUUM.
This will help in these ways:
* Reduce write I/O from SELECTs and pg_dump - improving backups and BI queries
* Allow finer-grained control over Hot Standby conflicts
* Potentially allow diagnostic inspection of older data via SeqScan

Prototype patch shows this is possible and simple enough for 9.4.

Major objections? Or should I polish up and submit?

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
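The proposed gate amounts to a small per-statement decision function. A minimal sketch in C of how it might behave -- all names here (prune_cost_limit, the dirty-page counter, pruning_allowed) are assumptions taken from the proposal above, not actual PostgreSQL symbols:

```c
#include <stdbool.h>

/* Hypothetical sketch, not real PostgreSQL code.
 * prune_cost_limit = -1 preserves the current always-prune
 * behaviour; VACUUM is never deferred by the limit. */
static int prune_cost_limit = -1;
static int pages_dirtied_this_stmt = 0;  /* reset at statement start */

static bool
pruning_allowed(bool called_from_vacuum)
{
    if (called_from_vacuum)
        return true;                 /* VACUUM ignores the parameter */
    if (prune_cost_limit < 0)
        return true;                 /* -1: feature disabled */
    /* Once the statement has dirtied prune_cost_limit pages, skip
     * heap_page_prune_opt() and stop dirtying buffers for hint bits. */
    return pages_dirtied_this_stmt < prune_cost_limit;
}
```

With prune_cost_limit = 4, a small bitmap heap scan that dirties a couple of pages keeps pruning, while a large seq scan crosses the limit quickly and stops.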
On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> VACUUM cleans up blocks, which is nice because it happens offline in a
> lazy manner.
>
> We also make SELECT clean up blocks as it goes. That is useful in OLTP
> workloads, but it means that large SQL queries and pg_dump effectively
> do much the same work as VACUUM, generating huge amounts of I/O and
> WAL on the master, the cost and annoyance of which is experienced
> directly by the user. That is avoided on standbys.

On a pgbench workload, though, essentially all page cleanup happens as a result of HOT cleanups, like >99.9%. It might be OK to have that happen for write operations, but it would be a performance disaster if updates didn't try to HOT-prune. Our usual argument for doing HOT pruning even on SELECT cleanups is that not doing so pessimizes repeated scans, but there are clearly cases that end up worse off as a result of that decision.

I'm not entirely wild about adding a parameter in this area because it seems that we're increasingly choosing to further expose what arguably ought to be internal implementation details. The new wal_log_hints parameter is a recent example of this that I'm not thrilled with, but in that case, as in this one, I can see the value of it. Still, I think it'd be loads better to restrict what you're talking about here to the SELECT-only case; I have a strong feeling that this will be a disaster on write workloads.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> We also make SELECT clean up blocks as it goes. That is useful in OLTP >> workloads, but it means that large SQL queries and pg_dump effectively >> do much the same work as VACUUM, generating huge amounts of I/O and >> WAL on the master, the cost and annoyance of which is experienced >> directly by the user. That is avoided on standbys. > On a pgbench workload, though, essentially all page cleanup happens as > a result of HOT cleanups, like >99.9%. It might be OK to have that > happen for write operations, but it would be a performance disaster if > updates didn't try to HOT-prune. Our usual argument for doing HOT > pruning even on SELECT cleanups is that not doing so pessimizes > repeated scans, but there are clearly cases that end up worse off as a > result of that decision. My recollection of the discussion when HOT was developed is that it works that way not because anyone thought it was beneficial, but simply because we didn't see an easy way to know when first fetching a page whether we're going to try to UPDATE some tuple on the page. (And we can't postpone the pruning, because the query will have tuple pointers into the page later.) Maybe we should work a little harder on passing that information down. It seems reasonable to me that SELECTs shouldn't be tasked with doing HOT pruning. > I'm not entirely wild about adding a parameter in this area because it > seems that we're increasingly choosing to further expose what arguably > ought to be internal implementation details. I'm -1 for a parameter as well, but I think that just stopping SELECTs from doing pruning at all might well be a win. It's at least worthy of some investigation. regards, tom lane
On Thu, Jan 9, 2014 at 12:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> We also make SELECT clean up blocks as it goes. That is useful in OLTP >>> workloads, but it means that large SQL queries and pg_dump effectively >>> do much the same work as VACUUM, generating huge amounts of I/O and >>> WAL on the master, the cost and annoyance of which is experienced >>> directly by the user. That is avoided on standbys. > >> On a pgbench workload, though, essentially all page cleanup happens as >> a result of HOT cleanups, like >99.9%. It might be OK to have that >> happen for write operations, but it would be a performance disaster if >> updates didn't try to HOT-prune. Our usual argument for doing HOT >> pruning even on SELECT cleanups is that not doing so pessimizes >> repeated scans, but there are clearly cases that end up worse off as a >> result of that decision. > > My recollection of the discussion when HOT was developed is that it works > that way not because anyone thought it was beneficial, but simply because > we didn't see an easy way to know when first fetching a page whether we're > going to try to UPDATE some tuple on the page. (And we can't postpone the > pruning, because the query will have tuple pointers into the page later.) > Maybe we should work a little harder on passing that information down. > It seems reasonable to me that SELECTs shouldn't be tasked with doing > HOT pruning. > >> I'm not entirely wild about adding a parameter in this area because it >> seems that we're increasingly choosing to further expose what arguably >> ought to be internal implementation details. > > I'm -1 for a parameter as well, but I think that just stopping SELECTs > from doing pruning at all might well be a win. It's at least worthy > of some investigation. Unfortunately, there's no categorical answer. 
You can come up with workloads where HOT pruning on selects is a win; just create a bunch of junk and then read the same pages lots of times in a row. And you can also come up with workloads where it's a loss; create a bunch of junk and then read them just once. I don't know how easy it's going to be to set that parameter in a useful way for some particular environment, and I think that's possibly an argument against having it. But the argument that we don't need a parameter because one behavior is best for everyone is not going to fly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas escribió: > Unfortunately, there's no categorical answer. You can come up with > workloads where HOT pruning on selects is a win; just create a bunch > of junk and then read the same pages lots of times in a row. And you > can also come up with workloads where it's a loss; create a bunch of > junk and then read them just once. I don't know how easy it's going > to be to set that parameter in a useful way for some particular > environment, and I think that's possibly an argument against having > it. But the argument that we don't need a parameter because one > behavior is best for everyone is not going to fly. In the above, there's the underlying assumption that it doesn't matter *what* we do with the page after doing or not doing pruning. But this is not necessarily the case: in the case of an UPDATE, having the space be freed beforehand is beneficial because there's the option of putting the new version of the tuple in the same page, potentially saving lots of I/O (bring up another destination page for the new tuple, write the new tuple there, end up dirtying two pages instead of one). But in a SELECT, the effect is only that you will have to skip less dead tuples, which is not as exciting. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 9 January 2014 17:21, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Jan 8, 2014 at 3:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> We also make SELECT clean up blocks as it goes. That is useful in OLTP >>> workloads, but it means that large SQL queries and pg_dump effectively >>> do much the same work as VACUUM, generating huge amounts of I/O and >>> WAL on the master, the cost and annoyance of which is experienced >>> directly by the user. That is avoided on standbys. > >> On a pgbench workload, though, essentially all page cleanup happens as >> a result of HOT cleanups, like >99.9%. It might be OK to have that >> happen for write operations, but it would be a performance disaster if >> updates didn't try to HOT-prune. Our usual argument for doing HOT >> pruning even on SELECT cleanups is that not doing so pessimizes >> repeated scans, but there are clearly cases that end up worse off as a >> result of that decision. > > My recollection of the discussion when HOT was developed is that it works > that way not because anyone thought it was beneficial, but simply because > we didn't see an easy way to know when first fetching a page whether we're > going to try to UPDATE some tuple on the page. (And we can't postpone the > pruning, because the query will have tuple pointers into the page later.) > Maybe we should work a little harder on passing that information down. > It seems reasonable to me that SELECTs shouldn't be tasked with doing > HOT pruning. > >> I'm not entirely wild about adding a parameter in this area because it >> seems that we're increasingly choosing to further expose what arguably >> ought to be internal implementation details. > > I'm -1 for a parameter as well, but I think that just stopping SELECTs > from doing pruning at all might well be a win. It's at least worthy > of some investigation. 
Turning HOT off completely would be an absolute disaster for OLTP on high update use cases against medium-large tables. That scenario is well represented by pgbench and TPC-C. I am *not* suggesting we recommend that and would look for very large caveats in the docs. (That may not have been clear, I guess I just assumed people would know I was heavily involved in the HOT project and understood its benefits). As stated, I am interested in turning off HOT in isolated, user specified situations, perhaps just for isolated tables. I'm not crazy about exposing magic parameters either but then I'm not crazy about either automatic settings or deferring things because we don't know how to set it. In general, I prefer the idea of having a user settable parameter in one release then automating it in a later release if clear settings emerge from usage. I'll submit a patch with parameter, to allow experimentation, for possible removal at commit or beta. If I had to suggest a value for an internal parameter, I would say that each SELECT statement should clean no more than 4 blocks. That way current OLTP behaviour is mostly preserved while the big queries and pg_dump don't suck in unpredictable ways. I'll submit the patch and we can talk some more. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Robert Haas escribió:
>> But the argument that we don't need a parameter because one
>> behavior is best for everyone is not going to fly.

> In the above, there's the underlying assumption that it doesn't matter
> *what* we do with the page after doing or not doing pruning. But this
> is not necessarily the case: in the case of an UPDATE, having the space
> be freed beforehand is beneficial because there's the option of putting
> the new version of the tuple in the same page, potentially saving lots
> of I/O (bring up another destination page for the new tuple, write the
> new tuple there, end up dirtying two pages instead of one). But in a
> SELECT, the effect is only that you will have to skip less dead tuples,
> which is not as exciting.

Yeah. Once they're hinted dead, it doesn't cost that much to skip over them. Not to mention that you might well never visit them at all, if this is an indexscan that knows which TIDs it needs to look at.

It's possible that it can be shown that different use-cases have sufficiently different behaviors that we really do need a user-visible parameter. I don't want to start from that position though. If we did have a simple GUC parameter, it'd likely end up in the same boat as, say, enable_seqscan, which is way too blunt an instrument for real world use --- so I'm afraid this would soon bloat into a request for per-table settings, planner hints, or god knows what to try to confine the effects to the queries where it's appropriate. Let's not go there without proof that we have to. It's a much better thing if we can get the system's native behavior to be tuned well enough by depending on things it already knows.

regards, tom lane
* Alvaro Herrera (alvherre@2ndquadrant.com) wrote: > But in a > SELECT, the effect is only that you will have to skip less dead tuples, > which is not as exciting. Agreed. There's also the option to have it be done based on some expectation of future work- that is, if we have to traverse X number of dead tuples during a select, then don't bother with HOT pruning, but if we get up to X+Y dead tuples, then do HOT pruning. That said, I'm not entirely convinced that traversing these dead tuples is all *that* painful during SELECT. If there's that many levels then hopefully it's not long til an UPDATE comes along and cleans them up. Thanks, Stephen
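The two-threshold idea above could be sketched as a tiny hysteresis check -- a hypothetical illustration only, where X is the number of dead tuples a SELECT tolerates for free and Y the extra slack before pruning kicks in (neither exists in PostgreSQL):

```c
#include <stdbool.h>

/* Hypothetical hysteresis for SELECT-time pruning: tolerate stepping
 * over up to dead_tuple_floor (X) dead tuples without doing any
 * cleanup, and only do HOT pruning once the scan has stepped over
 * dead_tuple_floor + dead_tuple_slack (X + Y) of them. */
static bool
select_should_prune(int dead_tuples_stepped_over,
                    int dead_tuple_floor,   /* X */
                    int dead_tuple_slack)   /* Y */
{
    return dead_tuples_stepped_over >= dead_tuple_floor + dead_tuple_slack;
}
```

The slack band between X and X+Y is what keeps the behaviour stable: a page with a little junk is left alone, while one that has accumulated a long chain triggers cleanup.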
* Simon Riggs (simon@2ndQuadrant.com) wrote:
> > I'm -1 for a parameter as well, but I think that just stopping SELECTs
> > from doing pruning at all might well be a win. It's at least worthy
> > of some investigation.
>
> Turning HOT off completely would be an absolute disaster for OLTP on
> high update use cases against medium-large tables. That scenario is
> well represented by pgbench and TPC-C. I am *not* suggesting we
> recommend that and would look for very large caveats in the docs.

This is true even if we're only talking about turning it off for the SELECT case...? That's what's under discussion here, after all. Certainly, we wouldn't turn it off completely...

> (That may not have been clear, I guess I just assumed people would
> know I was heavily involved in the HOT project and understood its
> benefits).

I'm certainly aware that you were heavily involved in HOT, but I don't think anyone is arguing to turn it off for everything.

> As stated, I am interested in turning off HOT in isolated, user
> specified situations, perhaps just for isolated tables.

I tend to agree w/ Tom on this point- having this be a per-table configurable doesn't sound very appealing to me, and it wouldn't address the case you mentioned around pg_dump, but I'm sure that'd be the next step for this and a per-session GUC wouldn't be sufficient.

> I'm not crazy about exposing magic parameters either but then I'm not
> crazy about either automatic settings or deferring things because we
> don't know how to set it. In general, I prefer the idea of having a
> user settable parameter in one release then automating it in a later
> release if clear settings emerge from usage. I'll submit a patch with
> parameter, to allow experimentation, for possible removal at commit or
> beta.

Ugh, adding GUCs is bad *because* we end up never being able to remove them.

> If I had to suggest a value for an internal parameter, I would say
> that each SELECT statement should clean no more than 4 blocks. That
> way current OLTP behaviour is mostly preserved while the big queries
> and pg_dump don't suck in unpredictable ways.

Right, this was one idea that I had also, as noted in the other subthread. I'm not convinced that it's a great idea and it'd probably be good to do a bit of testing to see just what the cost is; perhaps even just come up with a "worst-case" example to see the difference between a "clean" table and one with HOT chains as deep as they can go..

> I'll submit the patch and we can talk some more.

Neat. Thanks!

Stephen
Stephen Frost <sfrost@snowman.net> writes: > That said, I'm not entirely convinced that traversing these dead tuples > is all *that* painful during SELECT. If there's that many levels then > hopefully it's not long til an UPDATE comes along and cleans them up. There's always VACUUM ;-) If you take about ten steps back, what's happening here is that maintenance work that we'd originally delegated to VACUUM, precisely so that it wouldn't have to be done by foreground queries, is now being done by foreground queries. And oddly enough, people don't like that. There is a reasonable argument for forcing UPDATE queries to do it anyway, to improve the odds they can do same-page updates (whether HOT or otherwise). And probably an INSERT should do it on a page that it's selected as an insertion target. But I think the argument that the original do-maintenance-in-background-whenever-possible design was wrong is a lot harder to sustain for SELECT or even DELETE queries. As I said upthread, I think the current behavior was *not* chosen for performance reasons but just to limit the scope of what we had to change for HOT. regards, tom lane
On 1/9/14, 12:54 PM, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> That said, I'm not entirely convinced that traversing these dead tuples
>> is all *that* painful during SELECT. If there's that many levels then
>> hopefully it's not long til an UPDATE comes along and cleans them up.
>
> There's always VACUUM ;-)
>
> If you take about ten steps back, what's happening here is that
> maintenance work that we'd originally delegated to VACUUM, precisely so
> that it wouldn't have to be done by foreground queries, is now being done
> by foreground queries. And oddly enough, people don't like that.
>
> There is a reasonable argument for forcing UPDATE queries to do it anyway,
> to improve the odds they can do same-page updates (whether HOT or
> otherwise). And probably an INSERT should do it on a page that it's
> selected as an insertion target. But I think the argument that the
> original do-maintenance-in-background-whenever-possible design was wrong
> is a lot harder to sustain for SELECT or even DELETE queries. As I said
> upthread, I think the current behavior was *not* chosen for performance
> reasons but just to limit the scope of what we had to change for HOT.

Instead of looking at how to avoid this work in SELECTs, maybe it'd be more useful to look at how we can get it done more quickly in the background. The VSM is already a step in the right direction, but it seems the big use case here is when some bulk operation comes through and touches a sizeable number of blocks (but perhaps not enough to hit autovac thresholds).

ISTM it wouldn't be too difficult for a backend to track how many blocks in a relation it's dirtied (keep in mind that count doesn't have to be perfect). If we tracked that info, it could be put into a maintenance queue (LISTEN/NOTIFY?) along with our XID. That gives us a list of relations to vacuum and exactly when to vacuum them. Thanks to the VSM we wouldn't need to track individual pages (though it might be useful to track the minimum and maximum block IDs we hit, per relation).

-- 
Jim C. Nasby, Data Architect   jim@nasby.net
512.569.9461 (cell)   http://jim.nasby.net
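The per-backend tracking described above might look roughly like this -- a self-contained sketch under assumed names (note_block_dirtied, maintenance_candidates); a real implementation would live in backend-local memory and flush qualifying relations to a queue at commit:

```c
#include <stddef.h>

typedef unsigned int Oid;   /* stand-in for PostgreSQL's Oid */

/* Hypothetical per-backend tally of blocks dirtied, per relation.
 * As noted above the count doesn't have to be perfect, so a tiny
 * fixed-size table with no eviction is enough for the sketch. */
#define MAX_TRACKED 16

typedef struct { Oid relid; long blocks_dirtied; } DirtyTally;

static DirtyTally tallies[MAX_TRACKED];
static int ntallies = 0;

static void
note_block_dirtied(Oid relid)
{
    for (int i = 0; i < ntallies; i++)
    {
        if (tallies[i].relid == relid)
        {
            tallies[i].blocks_dirtied++;
            return;
        }
    }
    if (ntallies < MAX_TRACKED)
        tallies[ntallies++] = (DirtyTally){ relid, 1 };
    /* overflow: drop on the floor; the count is allowed to be lossy */
}

/* At commit, relations whose tally meets the threshold would be
 * queued for background maintenance (LISTEN/NOTIFY or similar),
 * along with the backend's XID. Here we just count candidates. */
static int
maintenance_candidates(long threshold)
{
    int n = 0;
    for (int i = 0; i < ntallies; i++)
        if (tallies[i].blocks_dirtied >= threshold)
            n++;
    return n;
}
```

The interesting property is that the queue entries carry exactly when (which XID) the work became possible, so the background worker never vacuums too early or scans relations nobody touched.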
On Thu, Jan 9, 2014 at 1:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Stephen Frost <sfrost@snowman.net> writes: >> That said, I'm not entirely convinced that traversing these dead tuples >> is all *that* painful during SELECT. If there's that many levels then >> hopefully it's not long til an UPDATE comes along and cleans them up. > > There's always VACUUM ;-) > > If you take about ten steps back, what's happening here is that > maintenance work that we'd originally delegated to VACUUM, precisely so > that it wouldn't have to be done by foreground queries, is now being done > by foreground queries. And oddly enough, people don't like that. People *think* they don't like that, because that's the way it works right now. If it worked some other way, there's a good chance people would be complaining about that behavior, too. I submitted a patch a few years back to limit the setting of hint bits by foreground processes to approximately 5% of the buffers they touched in a large scan, so that no single scan would incur all the cost of setting the hint bits; instead, the cost would be amortized over the first 20 or so scans. However, nobody was very enthusiastic about that patch, because while it greatly softened the blow for the first scan, subsequent scans were slower, because now they had to carry part of the burden, too. And you know what? People didn't like *that* either. The problem with saying that we should let VACUUM do this work is the same as the problem with saying that if you're late for your Concorde flight, you should go running across the tarmac and try to catch it. The cost of dead tuples is related in a linear fashion to the rate at which pages are accessed. Not coincidentally, the number of opportunities for HOT pruning is *also* related in a linear fashion to the rate at which pages are accessed. This is why it works so well. 
The rate at which vacuuming happens does not ramp up in the same way; it's limited by autovacuum cost settings (which people tend not to have set correctly, and which don't adjust themselves on the fly) or by their hardware capabilities. If autovacuum can't keep up, foreground activity doesn't slow down to compensate; instead, the system just bloats out of control. While people may not like having this maintenance activity in the foreground, they like not having it at all even less.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
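The amortization Robert describes (capping foreground hint-bit setting at roughly 5% of buffers touched, spreading the cost of a cold table over ~20 scans) can be sketched as a one-line admission check -- a hypothetical function, not the actual patch:

```c
#include <stdbool.h>

/* Hypothetical sketch: a scan may set hint bits on at most ~5% of
 * the buffers it has touched so far. The first scan of a cold table
 * hints only a slice of it; each subsequent scan hints another
 * slice, so the work is amortized over roughly 20 scans. */
static bool
may_set_hints(long buffers_touched, long buffers_hinted)
{
    return buffers_hinted * 20 < buffers_touched;
}
```

The trade-off noted above is visible directly in the check: the first scan is cheaper, but scans 2 through ~20 each carry a share of the cost that today lands entirely on scan 1.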
On 2014-01-09 16:27:23 -0500, Robert Haas wrote:
> People *think* they don't like that, because that's the way it works
> right now. If it worked some other way, there's a good chance people
> would be complaining about that behavior, too.

I think one of the primary reasons it's causing huge slowdowns is that the ring buffer of scan strategies causes dirty buffer writes pretty much immediately, when a buffer is reused. Not that delaying the writeout would work all that effectively right now, with the current bgwriter...

Greetings,

Andres Freund

-- 
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Robert Haas <robertmhaas@gmail.com> writes: > The problem with saying that we should let VACUUM do this work is the > same as the problem with saying that if you're late for your Concorde > flight, you should go running across the tarmac and try to catch it. > The cost of dead tuples is related in a linear fashion to the rate at > which pages are accessed. Not coincidentally, the number of > opportunities for HOT pruning is *also* related in a linear fashion to > the rate at which pages are accessed. This is why it works so well. That seems like a large oversimplification. Some (most?) of the costs of dead tuples are proportional to the rate of dead tuple creation. I grant that there are also some costs proportional to the rate at which scans visit dead tuples, but I really don't believe that the latter are dominant. So I think it's bogus to claim that the current behavior is somehow optimal. One more time: the sole reason it works the way it does now is that that was the path of least resistance back in 2007, and we never yet got around to trying to optimize that. I'm glad to see someone wanting to revisit the issue, but I don't think that we necessarily have to go as far as creating user-visible knobs in order to make it better. > The rate at which vacuuming happens does not ramp up in the same way; > it's limited by autovacuum cost settings (which people tend not have > set correctly, and don't adjust themselves on the fly) True, but that seems like a pretty well-defined improvement project right there (as well as an argument against user-visible knobs in general ;-)). Nasby's speculations just upthread could be useful here, too. regards, tom lane
On Thu, Jan 9, 2014 at 4:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> The problem with saying that we should let VACUUM do this work is the >> same as the problem with saying that if you're late for your Concorde >> flight, you should go running across the tarmac and try to catch it. >> The cost of dead tuples is related in a linear fashion to the rate at >> which pages are accessed. Not coincidentally, the number of >> opportunities for HOT pruning is *also* related in a linear fashion to >> the rate at which pages are accessed. This is why it works so well. > > That seems like a large oversimplification. Some (most?) of the costs of > dead tuples are proportional to the rate of dead tuple creation. I grant > that there are also some costs proportional to the rate at which scans > visit dead tuples, but I really don't believe that the latter are > dominant. So I think it's bogus to claim that the current behavior is > somehow optimal. This doesn't make any sense to me at all. What costs are proportional to the rate of dead tuple creation? I'm referring specifically to the performance penalty that scans incur for having to skip over dead tuples, and those costs aren't incurred when the tuples are created, but rather when you try to access the still-live data afterwards. If anything, our system is a big WIN at the time tuples are created, precisely because we leave the old tuples around to be cleaned up later rather than getting rid of them at once. That's why, for example, we tend to win delete-heavy benchmarks vs. other database systems. > One more time: the sole reason it works the way it does now is that that > was the path of least resistance back in 2007, and we never yet got around > to trying to optimize that. I'm glad to see someone wanting to revisit > the issue, but I don't think that we necessarily have to go as far as > creating user-visible knobs in order to make it better. Sure, I'm not denying that. 
The fact that it was the path of least resistance doesn't mean it was a bad idea. I'm happy to see it improved, too, but I think it's important to understand what happens now. And at least on the pgbench tests I've done, what happens is that VACUUM makes no significant contribution to pruning; IIRC, it would have to visit pages at least 1000 times more often to be relevant. So when somebody says "relying on vacuum instead of doing HOT pruning" what I hear is "flush performance down the toilet"... but of course the real way to resolve this is to test whatever patch Simon or someone else eventually posts, not to speculate without data. >> The rate at which vacuuming happens does not ramp up in the same way; >> it's limited by autovacuum cost settings (which people tend not have >> set correctly, and don't adjust themselves on the fly) > > True, but that seems like a pretty well-defined improvement project right > there (as well as an argument against user-visible knobs in general ;-)). > Nasby's speculations just upthread could be useful here, too. Fair point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
* Robert Haas (robertmhaas@gmail.com) wrote: > So when somebody says "relying on vacuum instead of doing > HOT pruning" what I hear is "flush performance down the toilet"... but > of course the real way to resolve this is to test whatever patch Simon > or someone else eventually posts, not to speculate without data. I don't think anyone was seriously proposing that (certainly not with today's VACUUM). What I've heard speculated about is doing HOT pruning during UPDATE and/or INSERT but specifically not during SELECT. I concur that we need data to really understand the difference, hopefully there'll be a patch posted which we can play with. Thanks, Stephen
On 8 January 2014 08:33, Simon Riggs <simon@2ndquadrant.com> wrote: > VACUUM cleans up blocks, which is nice because it happens offline in a > lazy manner. > > We also make SELECT clean up blocks as it goes. That is useful in OLTP > workloads, but it means that large SQL queries and pg_dump effectively > do much the same work as VACUUM, generating huge amounts of I/O and > WAL on the master, the cost and annoyance of which is experienced > directly by the user. That is avoided on standbys. > > Effects of that are that long running statements often run much longer > than we want, increasing bloat as a result. It also produces wildly > varying response times, depending upon extent of cleanup required. > > It is a simple task to make that behaviour optional on the master. > > I propose a USERSET parameter, prune_cost_limit (<---insert better name here) > which will make the behaviour optional, default -1, in normal user > processes. VACUUM will ignore this parameter and so its actions will > never be deferred. > > In detail, this parameter would disable pruning for any scan larger > than the cost limit. So large scans will disable the behaviour. The > default, -1, means never disable pruning, which is the current > behavour. > > We track the number of pages dirtied by the current statement. When > this reaches prune_cost_limit, we will apply these behaviours to all > shared_buffer block accesses... > > (1) avoid running heap_page_prune_opt() > > (2) avoid dirtying the buffer for hints. (This is safe because the > hinted changes will either be lost or will be part of the full page > image when we make a logged-change). > > (i.e. doesn't apply to temp tables) > > For example, if we set prune_cost_limit = 4 this behaviour allows > small index lookups via bitmapheapscan to continue to cleanup, while > larger index and seq scans will avoid cleanup. 
> > > > There would be a postgresql.conf parameter prune_cost_limit, as well > as a table level parameter that would prevent pruning except via > VACUUM. > > This will help in these ways > * Reduce write I/O from SELECTs and pg_dump - improving backups and BI queries > * Allow finer grained control over Hot Standby conflicts > * Potentially allow diagnostic inspection of older data via SeqScan > > Prototype patch shows this is possible and simple enough for 9.4. > Major objections? Or should I polish up and submit? Patch attached, implemented to reduce writes by SELECTs only. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
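For readers following along, the gating logic Simon proposes above (a per-statement dirty-page counter consulted before heap_page_prune_opt() and before dirtying a buffer for hints) can be sketched as a toy model. This is not the actual patch; the variable and function names are illustrative only, loosely modeled on the names in the proposal.

```c
#include <stdbool.h>

/* Toy model of the proposed gate; names are illustrative, not PostgreSQL code. */
static int prune_cost_limit = -1;        /* -1 = never disable pruning (current behaviour) */
static int pages_dirtied_this_stmt = 0;  /* reset at each top-level statement */

/* Reset the counter when a new top-level statement starts. */
static void statement_start(void)
{
    pages_dirtied_this_stmt = 0;
}

/* Consulted before heap_page_prune_opt() and before dirtying a page for hints. */
static bool pruning_allowed(void)
{
    if (prune_cost_limit < 0)
        return true;                     /* default: current behaviour */
    return pages_dirtied_this_stmt < prune_cost_limit;
}

/* Bookkeeping wherever a scan dirties a page by pruning it. */
static void page_dirtied_by_prune(void)
{
    pages_dirtied_this_stmt++;
}
```

With prune_cost_limit = 4, a statement would prune at most four pages before cleanup switches off for its remainder, which is why small bitmap heap lookups keep cleaning while large index and seq scans stop almost immediately.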
On Tue, Jan 14, 2014 at 4:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 8 January 2014 08:33, Simon Riggs <simon@2ndquadrant.com> wrote: >> VACUUM cleans up blocks, which is nice because it happens offline in a >> lazy manner. >> >> We also make SELECT clean up blocks as it goes. That is useful in OLTP >> workloads, but it means that large SQL queries and pg_dump effectively >> do much the same work as VACUUM, generating huge amounts of I/O and >> WAL on the master, the cost and annoyance of which is experienced >> directly by the user. That is avoided on standbys. >> >> Effects of that are that long running statements often run much longer >> than we want, increasing bloat as a result. It also produces wildly >> varying response times, depending upon extent of cleanup required. >> >> It is a simple task to make that behaviour optional on the master. >> >> I propose a USERSET parameter, prune_cost_limit (<---insert better name here) >> which will make the behaviour optional, default -1, in normal user >> processes. VACUUM will ignore this parameter and so its actions will >> never be deferred. >> >> In detail, this parameter would disable pruning for any scan larger >> than the cost limit. So large scans will disable the behaviour. The >> default, -1, means never disable pruning, which is the current >> behavour. >> >> We track the number of pages dirtied by the current statement. When >> this reaches prune_cost_limit, we will apply these behaviours to all >> shared_buffer block accesses... >> >> (1) avoid running heap_page_prune_opt() >> >> (2) avoid dirtying the buffer for hints. (This is safe because the >> hinted changes will either be lost or will be part of the full page >> image when we make a logged-change). >> >> (i.e. doesn't apply to temp tables) >> >> For example, if we set prune_cost_limit = 4 this behaviour allows >> small index lookups via bitmapheapscan to continue to cleanup, while >> larger index and seq scans will avoid cleanup. 
>> >> >> >> There would be a postgresql.conf parameter prune_cost_limit, as well >> as a table level parameter that would prevent pruning except via >> VACUUM. >> >> This will help in these ways >> * Reduce write I/O from SELECTs and pg_dump - improving backups and BI queries >> * Allow finer grained control over Hot Standby conflicts >> * Potentially allow diagnostic inspection of older data via SeqScan >> >> Prototype patch shows this is possible and simple enough for 9.4. >> Major objections? Or should I polish up and submit? > > Patch attached, implemented to reduce writes by SELECTs only. I am still not sure whether we want this, but I think it's definitely an improvement over the previous version. Assorted comments: - Naming consistency seems to me to dictate that there should be more similarity between the reloption name (allow_buffer_cleanup) and the GUC (prune_page_dirty_limit). - The documentation doesn't describe the use case where suppressing cleanup on a per-table basis would be desirable, and I can't think of one, either. - There are a variety of ways to limit pruning; here, you've chosen to limit it to a particular number of pruning operations per executor invocation. But the flag is global, not part of the executor state, so a query that calls a PL/pgsql function during execution will reset the counter for the parent query also, which doesn't seem very principled. In a patch I posted a few years ago to set hint bits only sometimes, I settled on an algorithm where I dirtied the first 50 pages per scan and then skipped the next 950, or something like that. The idea was that you wanted the pages that did get dirtied to be clustered together to avoid random I/O; and also that you wanted table of arbitrary size to get hinted within a certain number of scans (e.g. 20). The limiting here is much more aggressive, so on large tables it will amount to basically no pruning at all. I dunno whether that's a good idea or not. 
But if the idea of making this an integer rather than a boolean is to allow some pruning to still happen while keeping it checked within reasonable bounds, I'm not sure it will succeed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
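The dirty-then-skip scheme Robert recalls (dirty roughly the first 50 pages of a scan, skip the next 950, so the writes stay clustered and a table of any size gets hinted within a bounded number of scans) reduces to a simple modular test. A sketch with his rough figures; the rotating per-scan offset is an addition here to make the "hinted within ~20 scans" property concrete, and his original patch may have arranged it differently:

```c
#include <stdbool.h>

#define DIRTY_RUN 50    /* pages dirtied at the start of each cycle */
#define SKIP_RUN  950   /* pages then left alone */

/* Allow dirtying only in a DIRTY_RUN-page window of each
 * (DIRTY_RUN + SKIP_RUN)-page cycle, keeping writes clustered.
 * The window rotates with scan_number so successive scans cover
 * different parts of the table. */
static bool allow_hint_dirty(long page_in_scan, int scan_number)
{
    long cycle = DIRTY_RUN + SKIP_RUN;
    long pos = (page_in_scan + (long) scan_number * DIRTY_RUN) % cycle;
    return pos < DIRTY_RUN;
}
```

Each scan dirties 5% of the pages it visits, and every page falls inside the window within cycle/DIRTY_RUN = 20 scans; contrast that with the patch's hard cutoff, which on large tables amounts to no pruning at all.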
On 15 January 2014 16:47, Robert Haas <robertmhaas@gmail.com> wrote: >>> There would be a postgresql.conf parameter prune_cost_limit, as well >>> as a table level parameter that would prevent pruning except via >>> VACUUM. >>> >>> This will help in these ways >>> * Reduce write I/O from SELECTs and pg_dump - improving backups and BI queries >>> * Allow finer grained control over Hot Standby conflicts >>> * Potentially allow diagnostic inspection of older data via SeqScan >> Patch attached, implemented to reduce writes by SELECTs only. > > I am still not sure whether we want this, but I think it's definitely > an improvement over the previous version. Assorted comments: > > - Naming consistency seems to me to dictate that there should be more > similarity between the reloption name (allow_buffer_cleanup) and the > GUC (prune_page_dirty_limit). Now that I've written the patch, I'm seeing those as two different things, but YMMV and I am very open to naming suggestions. > - The documentation doesn't describe the use case where suppressing > cleanup on a per-table basis would be desirable, and I can't think of > one, either. We already know that HOT is ineffective in areas of high contention (previous thread by me). Prior experience was that smaller tables didn't show much apparent benefit from using HOT either; its effectiveness was limited to medium and large tables being updated. The two already stated use cases that would apply are these ones: * Allow finer grained control over Hot Standby conflicts * Potentially allow diagnostic inspection of older data via SeqScan So the use cases for the two parameters seem quite different and we may decide we want one but not the other. > - There are a variety of ways to limit pruning; here, you've chosen to > limit it to a particular number of pruning operations per executor > invocation. 
But the flag is global, not part of the executor state, > so a query that calls a PL/pgsql function during execution will reset > the counter for the parent query also, which doesn't seem very > principled. That is a subtle thing in this patch and I agree that a potential problem exists. The current limit is set according to the currently executing statement, but the current total is not reset until the start of the top-level statement. So the behaviour is not reset during statements executed within a PL/pgSQL function. > In a patch I posted a few years ago to set hint bits only sometimes, I > settled on an algorithm where I dirtied the first 50 pages per scan > and then skipped the next 950, or something like that. The idea was > that you wanted the pages that did get dirtied to be clustered > together to avoid random I/O; and also that you wanted table of > arbitrary size to get hinted within a certain number of scans (e.g. > 20). The limiting here is much more aggressive, so on large tables it > will amount to basically no pruning at all. I dunno whether that's a > good idea or not. But if the idea of making this an integer rather > than a boolean is to allow some pruning to still happen while keeping > it checked within reasonable bounds, I'm not sure it will succeed. It sounds like you're in favour of the overall concept of limiting writes, which is good. The behaviour I think we need, based on listening to everybody so far, is: * OLTP is unaffected * Large SELECTs and pg_dump don't cause lots of write I/O. That is why "prune_page_dirty_limit" offers a change in behaviour at a certain point. Reducing cleanup to "only 5%" just reduces but doesn't remove the problem. If the data is stored on very poor I/O infrastructure, any significant volume of writes can adversely affect performance. As we reduce the percentage, we also reduce the benefit from inducing writes in the first place, and so I would question why bother at all using a percentage. 
For me, a parameter that gives you absolute rather than relative control is more desirable. The current behaviour assumes it is OK for the first/next user to touch the data to be the one that won't mind re-writing everything. In time-critical applications, the first/next user could well have a very urgent need to access the data quickly and doesn't want to have to pay this price. In seldom-accessed data applications, VACUUM has lots of time to run out of hours, so users are OK to defer this work. Some applications exist where we literally want zero I/O. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 15, 2014 at 5:14 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > We already know that HOT is ineffective in areas of high contention > (previous thread by me). Prior experience was that smaller tables > didn't show much apparent benefit from using HOT either; its > effectiveness was limited to medium and large tables being updated. > The two already stated use cases that would apply are these ones Do you have a link to that previous thread? I don't happen to recall that conversation. I've found that HOT can be very important on smaller tables, so I'm skeptical of that as a general conclusion. What I think might be true is that if VACUUM is going to hit the table often enough to make you happy, then you don't really need HOT. In other words, if the update rate is non-zero but low, not too much cruft will accumulate before the table gets vacuumed, and you may be OK. If the update rate is high, though, I think disabling HOT will be painful on a table of any size. There might be exceptions, but I can't think of what they are off-hand. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 15, 2014 at 2:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 8 January 2014 08:33, Simon Riggs <simon@2ndquadrant.com> wrote: > > Patch attached, implemented to reduce writes by SELECTs only. This is really a valuable improvement over current SELECT behaviour w.r.t. writes. While going through the patch, I observed a few points, so thought of sharing them with you: 1. + /* + * If we are tracking pruning in SELECTs then we can only get + * here by heap_page_prune_opt() call that cleans a block, + * so in that case, register it as a pruning operation. + * Make sure we don't double count during VACUUMs. + */ + if (PrunePageDirtyLimit > -1) + PrunePageDirty++; a. As the PrunePageDirtyLimit variable is not initialized for the DDL flow, any statement like Create Function().. will have the value of PrunePageDirtyLimit as 4 (default) and in such cases MarkBufferDirty() will increment the wrong counter. b. For DDL statements like Create Materialized view, it will behave as a Select statement. Ex. Create Materialized view mv1 as select * from t1; Now here I think it might not be a problem, because for t1 anyway there will be no write, so skipping pruning should not be a problem, and for materialized views also there will be no dead rows, so skipping should be okay, but I think it is not strictly adhering to the statement "to reduce writes by SELECTs only" and the purpose of the patch, which is to avoid pruning only when the top-level statement is SELECT. Do you think it's better to consider such cases and optimize for them, or should we avoid it by following the thumb rule that pruning will be avoided only for top-level SELECT? 2. + "Allow cleanup of shared buffers by foreground processes, allowing later cleanup by VACUUM", This line is not clear; what do you mean by "allowing later cleanup by VACUUM"? If the foreground process has already done cleanup, then it should save effort for VACUUM. 
In general, both optimisations used in the patch (allow_buffer_cleanup and prune_page_dirty_limit) are similar in that they are used to avoid pruning, but I still feel they are for different cases (READ ONLY OP and WRITE ON SMALL TABLES), and as there are more people inclined to do this for only SELECT operations, do you think it would be a good idea to make them separate patches? I think there can be some applications or use cases which can benefit from avoiding pruning for WRITE ON SMALL TABLES, but the case for SELECT is more general and more applications can benefit from that optimisation, so it would be better if we first try to accomplish that case. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 3, 2014 at 3:42 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jan 15, 2014 at 2:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 8 January 2014 08:33, Simon Riggs <simon@2ndquadrant.com> wrote: >> >> Patch attached, implemented to reduce writes by SELECTs only. This patch is registered in this CF. It does not apply anymore and needs a rebase. Robert and Amit have provided as well some comments but they have not been addressed. Is it fair to mark it as "returned with feedback" even if it has not been reviewed within the last month? -- Michael
On Fri, Sep 12, 2014 at 3:19 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Mon, Feb 3, 2014 at 3:42 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> On Wed, Jan 15, 2014 at 2:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> On 8 January 2014 08:33, Simon Riggs <simon@2ndquadrant.com> wrote: >>> >>> Patch attached, implemented to reduce writes by SELECTs only. > > This patch is registered in this CF. It does not apply anymore and > needs a rebase. Robert and Amit have provided as well some comments > but they have not been addressed. Is it fair to mark it as "returned > with feedback" even if it has not been reviewed within the last month? For the time being, status has been changed to "waiting on author". -- Michael
Michael Paquier wrote: > On Fri, Sep 12, 2014 at 3:19 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: > > On Mon, Feb 3, 2014 at 3:42 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > >> On Wed, Jan 15, 2014 at 2:43 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > >>> On 8 January 2014 08:33, Simon Riggs <simon@2ndquadrant.com> wrote: > >>> > >>> Patch attached, implemented to reduce writes by SELECTs only. > > > > This patch is registered in this CF. It does not apply anymore and > > needs a rebase. Robert and Amit have provided as well some comments > > but they have not been addressed. Is it fair to mark it as "returned > > with feedback" even if it has not been reviewed within the last month? > For the time being, status has been changed to "waiting on author". As it happens, I was studying this patch yesterday on the flight back home. I gave it a quick look; I noticed it was in the commitfest and hadn't seen any review activity for many months, which seemed odd. Anyway I first read the whole thread to know what to focus on, before going over the patch itself. Once I finished reading the emails, I had a vague idea of how I thought it would work: my thinking was that heap/index scans would either call heap_page_prune_opt, or not, depending on whether they were part of a read-only executor node. So if you have a query that updates a certain table, and while doing so scans another table in read-only mode, then the HOT updates would be enabled for the table being written, but disabled for the one being read. As it turns out, the patch as written is nothing at all like that, and TBH I don't think I like it very much. My idea is that we would have a new executor flag, say EXEC_FLAG_READ_ONLY; we would set it on nodes that are known to be read-only, and reset it on those that aren't, such as LockRows and ModifyTable (obviously we need to pass it down correctly from parent to children). 
Then in ExecInitSeqScan and ExecInitIndexScan, if we see the flag set, we call heap/index_set_allow_prune(false) for the heap scan; same thing in index scans. (I envisioned it as a boolean rather than enabling a certain number of cleanups per scan.) I tried to code this but I think it doesn't work correctly, and no time for debug currently. Anyway let me know what you think of this general idea. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
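Alvaro's proposal, in miniature: a read-only flag computed top-down at executor-init time, cleared by writing nodes, and consulted by scan nodes. A toy model of that flow; EXEC_FLAG_READ_ONLY and its bit value are hypothetical here (the real eflags constants live in the executor headers), and heap/index_set_allow_prune() does not exist outside this proposal:

```c
#include <stdbool.h>

#define EXEC_FLAG_READ_ONLY 0x0100   /* hypothetical flag bit */

typedef enum { NODE_SEQSCAN, NODE_INDEXSCAN, NODE_MODIFYTABLE, NODE_LOCKROWS } NodeKind;

/* Compute the eflags a node passes down to its children: writing nodes
 * (ModifyTable, LockRows) clear the read-only bit, others pass it through. */
static int child_eflags(NodeKind node, int eflags)
{
    if (node == NODE_MODIFYTABLE || node == NODE_LOCKROWS)
        return eflags & ~EXEC_FLAG_READ_ONLY;
    return eflags;
}

/* What ExecInitSeqScan/ExecInitIndexScan would consult to decide
 * whether to allow pruning on the underlying scan. */
static bool scan_allows_prune(int eflags)
{
    return (eflags & EXEC_FLAG_READ_ONLY) == 0;
}
```

The flag is a plain boolean property of the plan tree, which is what Alvaro means by preferring it over enabling a certain number of cleanups per scan.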
On 12 September 2014 14:54, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > My idea is that we would have a new executor flag, say > EXEC_FLAG_READ_ONLY; we would set it on nodes that are known to be > read-only, and reset it on those that aren't, such as LockRows and > ModifyTable (obviously we need to pass it down correctly from parent to > children). Then in ExecInitSeqScan and ExecInitIndexScan, if we see the > flag set, we call heap/index_set_allow_prune(false) for the heap scan; > same thing in index scans. (I envisioned it as a boolean rather than > enabling a certain number of cleanups per scan.) > > I tried to code this but I think it doesn't work correctly, and no time > for debug currently. Anyway let me know what you think of this general > idea. Thanks for looking at this. My concern was to ensure that UPDATEs and DELETEs continue to call heap_page_prune_opt while larger SELECTs do not. This is achieved without a counter, so after some thought I like it better; simple is good. Happy to progress from here, or you can? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs wrote: > On 12 September 2014 14:54, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > My idea is that we would have a new executor flag, say > > EXEC_FLAG_READ_ONLY; we would set it on nodes that are known to be > > read-only, and reset it on those that aren't, such as LockRows and > > ModifyTable (obviously we need to pass it down correctly from parent to > > children). Then in ExecInitSeqScan and ExecInitIndexScan, if we see the > > flag set, we call heap/index_set_allow_prune(false) for the heap scan; > > same thing in index scans. (I envisioned it as a boolean rather than > > enabling a certain number of cleanups per scan.) > > > > I tried to code this but I think it doesn't work correctly, and no time > > for debug currently. Anyway let me know what you think of this general > > idea. > > Thanks for looking at this. > > My concern was to ensure that UPDATEs and DELETEs continue to call > heap_page_prune_opt while larger SELECTs do not. > > This is achieved without a counter, so after some thought like it > better; simple is good. Happy to progress from here, or you can? Please feel free to take over. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndquadrant.com> writes: > On 12 September 2014 14:54, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> My idea is that we would have a new executor flag, say >> EXEC_FLAG_READ_ONLY; we would set it on nodes that are known to be >> read-only, and reset it on those that aren't, such as LockRows and >> ModifyTable (obviously we need to pass it down correctly from parent to >> children). Then in ExecInitSeqScan and ExecInitIndexScan, if we see the >> flag set, we call heap/index_set_allow_prune(false) for the heap scan; >> same thing in index scans. (I envisioned it as a boolean rather than >> enabling a certain number of cleanups per scan.) >> >> I tried to code this but I think it doesn't work correctly, and no time >> for debug currently. Anyway let me know what you think of this general >> idea. > Thanks for looking at this. > My concern was to ensure that UPDATEs and DELETEs continue to call > heap_page_prune_opt while larger SELECTs do not. I think there's another way to think about it: what about saying that the query's target relation(s) are subject to pruning, while others are not? Then you do not need an executor flag, you just need to look at the estate->es_result_relations array (or maybe even only at estate->es_result_relation_info). This would have the advantage of doing what-I-think-is-the-right-thing for updates/deletes involving joins to other tables. The mechanism Alvaro describes would probably have to prune all tables involved in such a query; do we really want that? regards, tom lane
I wrote: > I think there's another way to think about it: what about saying that > the query's target relation(s) are subject to pruning, while others > are not? Then you do not need an executor flag, you just need to > look at the estate->es_result_relations array (or maybe even only at > estate->es_result_relation_info). After a little bit I remembered there was already a function for this. So specifically, I'd suggest using ExecRelationIsTargetRelation() to decide whether to mark the scan as requiring pruning. regards, tom lane
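Tom's alternative needs no new executor flag: at scan initialization, check whether the scanned relation appears in the executor state's result-relation list, which is what ExecRelationIsTargetRelation() does against estate->es_result_relations. A standalone sketch of that membership test (the OID values used below are made up for illustration):

```c
#include <stdbool.h>

typedef unsigned int Oid;

/* Simplified version of the ExecRelationIsTargetRelation() check:
 * scan the query's result-relation OIDs for the scanned relation. */
static bool relation_is_target(Oid relid, const Oid *result_rels, int num_result_rels)
{
    for (int i = 0; i < num_result_rels; i++)
        if (result_rels[i] == relid)
            return true;
    return false;
}
```

In an UPDATE involving a join, only the table being updated is in the result-relation list, so only its scan would prune; the joined tables would not, which is the behaviour Tom prefers over pruning every table the query touches.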
On 12 September 2014 15:30, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wrote: >> I think there's another way to think about it: what about saying that >> the query's target relation(s) are subject to pruning, while others >> are not? Then you do not need an executor flag, you just need to >> look at the estate->es_result_relations array (or maybe even only at >> estate->es_result_relation_info). > > After a little bit I remembered there was already a function for this. > So specifically, I'd suggest using ExecRelationIsTargetRelation() > to decide whether to mark the scan as requiring pruning. Sounds cool. Thanks both, this is sounding like a viable route now. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 12 September 2014 18:19, Simon Riggs <simon@2ndquadrant.com> wrote: > On 12 September 2014 15:30, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> After a little bit I remembered there was already a function for this. >> So specifically, I'd suggest using ExecRelationIsTargetRelation() >> to decide whether to mark the scan as requiring pruning. > > Sounds cool. Thanks both, this is sounding like a viable route now. Yes, this is viable. Patch attached, using Alvaro's idea of use-case specific pruning and Tom's idea of aiming at target relations. The patch uses or extends existing infrastructure, so it's shorter than it might have been, yet with all that bufmgr yuck removed. This is very, very good because while going through this I noticed the dozen or more places where we were pruning blocks in annoying places I didn't even know about, such as in about 4-5 constraint checks. In more than a few DDL commands like ALTER TABLE and CLUSTER we were even pruning the old relation prior to rewrite. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Sun, Sep 14, 2014 at 4:37 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 12 September 2014 18:19, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 12 September 2014 15:30, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >>> After a little bit I remembered there was already a function for this. >>> So specifically, I'd suggest using ExecRelationIsTargetRelation() >>> to decide whether to mark the scan as requiring pruning. >> >> Sounds cool. Thanks both, this is sounding like a viable route now. > > Yes, this is viable. > > Patch attached, using Alvaro's idea of use-case specific pruning and > Tom's idea of aiming at target relations. Patch uses or extends > existing infrastructure, so its shorter than it might have been, yet > with all that bufmgr yuck removed. > > This is very, very good because while going through this I notice the > dozen or more places where we were pruning blocks in annoying places I > didn't even know about such as about 4-5 constraint checks. In more > than a few DDL commands like ALTER TABLE and CLUSTER we were even > pruning the old relation prior to rewrite. Do we really want to disable HOT for all catalog scans? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 15 September 2014 17:09, Robert Haas <robertmhaas@gmail.com> wrote: > Do we really want to disable HOT for all catalog scans? The intention of the patch is that catalog scans are treated identically to non-catalog scans. The idea here is that HOT cleanup only occurs on scans on target relations, so only INSERT, UPDATE and DELETE do HOT cleanup. It's possible that many catalog scans don't follow the normal target relation logic, so we might argue we should use HOT every time. OTOH, since we now have separate catalog xmins we may find that using HOT on catalogs is no longer effective. So I could go either way on how to proceed; it's an easy change either way. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 14/09/14 17:37, Simon Riggs wrote: > On 12 September 2014 18:19, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 12 September 2014 15:30, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> After a little bit I remembered there was already a function for this. >>> So specifically, I'd suggest using ExecRelationIsTargetRelation() >>> to decide whether to mark the scan as requiring pruning. >> Sounds cool. Thanks both, this is sounding like a viable route now. > Yes, this is viable. > > Patch attached, using Alvaro's idea of use-case specific pruning and > Tom's idea of aiming at target relations. Patch uses or extends > existing infrastructure, so it's shorter than it might have been, yet > with all that bufmgr yuck removed. > > This is very, very good because while going through this I noticed the > dozen or more places where we were pruning blocks in annoying places I > didn't even know about such as about 4-5 constraint checks. In more > than a few DDL commands like ALTER TABLE and CLUSTER we were even > pruning the old relation prior to rewrite. > A simple performance test with the following variables: LOOP=50 CONN=60 TXSS=500 SCALE=30 Select only: WITH PATCH Average: 20716.1 tps NO PATCH Average: 19141.7 tps With writes: WITH PATCH Average: 2602.65 NO PATCH Average: 2565.32 TODO: - Consistency check. - ALTER and CLUSTER test.
On 15/09/14 18:13, Simon Riggs wrote: > On 15 September 2014 17:09, Robert Haas <robertmhaas@gmail.com> wrote: > >> Do we really want to disable HOT for all catalog scans? > The intention of the patch is that catalog scans are treated > identically to non-catalog scans. The idea here is that HOT cleanup > only occurs on scans on target relations, so only INSERT, UPDATE and > DELETE do HOT cleanup. > > It's possible that many catalog scans don't follow the normal target > relation logic, so we might argue we should use HOT every time. OTOH, > since we now have separate catalog xmins we may find that using HOT on > catalogs is no longer effective. So I could go either way on how to > proceed; its an easy change either way. > I set up a more concurrent scenario and the difference is quite a bit larger: Without patch: 1st concurrent with writes: tps = 5705.261620 (including connections establishing) tps = 5945.338557 (excluding connections establishing) 2nd no writes being executed: tps = 9988.792575 (including connections establishing) tps = 11059.671351 (excluding connections establishing) Patched version: 1st concurrent with writes: tps = 9476.741707 (including connections establishing) tps = 10274.831185 (excluding connections establishing) 2nd no writes being executed: tps = 12993.644808 (including connections establishing) tps = 15171.214744 (excluding connections establishing) Stats (writes have been run with a time limit, not by tx):

hotcleanup=# select relname, n_live_tup, n_dead_tup, n_tup_hot_upd from pg_stat_user_tables where relname ~ 'pgbench';
     relname      | n_live_tup | n_dead_tup | n_tup_hot_upd
------------------+------------+------------+---------------
 pgbench_tellers  |        500 |          0 |       2044192
 pgbench_accounts |    5109728 |     310842 |       1969264
 pgbench_history  |    2265882 |          0 |             0
 pgbench_branches |         50 |          0 |       2237167
(4 rows)

hotcleanup=# select relname, n_live_tup, n_dead_tup, n_tup_hot_upd from pg_stat_user_tables where relname ~ 'pgbench';
     relname      | n_live_tup | n_dead_tup | n_tup_hot_upd
------------------+------------+------------+---------------
 pgbench_history  |    2148946 |          0 |             0
 pgbench_tellers  |        500 |          0 |       1969675
 pgbench_branches |         50 |          0 |       2150655
 pgbench_accounts |    5098774 |     300123 |       1897484
(4 rows)

I ran the regression tests over the patched version and they passed ok. -- Emanuel Calvo http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 15 September 2014 22:13, Simon Riggs <simon@2ndquadrant.com> wrote: > On 15 September 2014 17:09, Robert Haas <robertmhaas@gmail.com> wrote: > >> Do we really want to disable HOT for all catalog scans? > > The intention of the patch is that catalog scans are treated > identically to non-catalog scans. The idea here is that HOT cleanup > only occurs on scans on target relations, so only INSERT, UPDATE and > DELETE do HOT cleanup. Since INSERT, UPDATE and DELETE can only be called when !RecoveryInProgress(), we can completely avoid making this test at the top of each heap_page_prune_opt() call. I very much like the simplicity of saying "no target, no cleanup". -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Sep 15, 2014 at 5:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 15 September 2014 17:09, Robert Haas <robertmhaas@gmail.com> wrote: >> Do we really want to disable HOT for all catalog scans? > > The intention of the patch is that catalog scans are treated > identically to non-catalog scans. The idea here is that HOT cleanup > only occurs on scans on target relations, so only INSERT, UPDATE and > DELETE do HOT cleanup. > > It's possible that many catalog scans don't follow the normal target > relation logic, so we might argue we should use HOT every time. OTOH, > since we now have separate catalog xmins we may find that using HOT on > catalogs is no longer effective. So I could go either way on how to > proceed; its an easy change either way. What I'm thinking about is that the smarts to enable pruning is all in the executor nodes. So anything that updates the catalog without going through the executor will never be subject to pruning. That includes nearly all catalog-modifying code throughout the backend. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 19 September 2014 13:04, Robert Haas <robertmhaas@gmail.com> wrote: > What I'm thinking about is that the smarts to enable pruning is all in > the executor nodes. So anything that updates the catalog without > going through the executor will never be subject to pruning. That > includes nearly all catalog-modifying code throughout the backend. Are you saying this is a problem or a benefit? (and please explain why). -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On September 19, 2014 10:16:35 PM CEST, Simon Riggs <simon@2ndquadrant.com> wrote: >On 19 September 2014 13:04, Robert Haas <robertmhaas@gmail.com> wrote: > >> What I'm thinking about is that the smarts to enable pruning is all in >> the executor nodes. So anything that updates the catalog without >> going through the executor will never be subject to pruning. That >> includes nearly all catalog-modifying code throughout the backend. > >Are you saying this is a problem or a benefit? (and please explain >why). I have no idea what Robert is thinking of, but I'd imagine it's horrible for workloads with catalog bloat. Like ones involving temp tables. I generally have serious doubts about disabling it generally for read workloads. I imagine it e.g. will significantly penalize workloads where it's likely that a cleanup lock can't be acquired every time... Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone.
Andres Freund <andres@anarazel.de> writes: > On September 19, 2014 10:16:35 PM CEST, Simon Riggs <simon@2ndquadrant.com> wrote: >> Are you saying this is a problem or a benefit? (and please explain >> why). > I have no idea what Robert is thinking of, but I'd imagine its horrible for workloads with catalog bloat. Like ones involvingtemp tables. Yeah. But it's also the case that we know a good deal more about the access patterns for system-driven catalog updates than we do about user queries. ISTM we could probably suppress HOT pruning during catalog *scans* and instead try to do it when a system-driven heap_update occurs. Having said that, this could reasonably be considered outside the scope of a patch that's trying to improve the behavior for user queries. But if the patch author doesn't want to expand the scope like that, ISTM he ought to ensure that the behavior *doesn't* change for system accesses, rather than trying to convince us that disabling HOT for system updates is a good idea. regards, tom lane
On Fri, Sep 19, 2014 at 4:30 PM, Andres Freund <andres@anarazel.de> wrote: > On September 19, 2014 10:16:35 PM CEST, Simon Riggs <simon@2ndquadrant.com> wrote: >>On 19 September 2014 13:04, Robert Haas <robertmhaas@gmail.com> wrote: >> >>> What I'm thinking about is that the smarts to enable pruning is all >>in >>> the executor nodes. So anything that updates the catalog without >>> going through the executor will never be subject to pruning. That >>> includes nearly all catalog-modifying code throughout the backend. >> >>Are you saying this is a problem or a benefit? (and please explain >>why). > > I have no idea what Robert is thinking of, but I'd imagine its horrible for workloads with catalog bloat. Like ones involving temp tables. Right, that's what I was going for. > I generally have serious doubts about disabling it generally for read workloads. I imagine it e.g. will significantly penalize workloads where its likely that a cleanup lock can't be acquired every time... I share that doubt. But I understand why Simon wants to do something, too, because the current situation is not great either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2014-09-19 17:29:08 -0400, Robert Haas wrote: > > I generally have serious doubts about disabling it generally for > > read workloads. I imagine it e.g. will significantly penalize > > workloads where its likely that a cleanup lock can't be acquired > > every time... > > I share that doubt. But I understand why Simon wants to do something, > too, because the current situation is not great either. Right, I totally agree. I doubt a simple approach like this will work in the general case, but I think something needs to be done. I think limiting the amount of HOT cleanup for readonly queries is a good idea, but I think it has to be gradual. Say after a single cleaned up page at least another 500 pages need to have been touched till the next hot cleanup. That way a single query won't be penalized with cleaning up everything, but there'll be some progress. The other thing I think might be quite worthwile would be to abort hot cleanup when the gain is only minimal. If e.g. only 1 small tuple is removed from a half full page it's not worth the cost of the wal logging et al. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 19 September 2014 15:35, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Having said that, this could reasonably be considered outside the scope > of a patch that's trying to improve the behavior for user queries. > But if the patch author doesn't want to expand the scope like that, > ISTM he ought to ensure that the behavior *doesn't* change for system > accesses, rather than trying to convince us that disabling HOT for > system updates is a good idea. As I said, I could make an argument to go either way, so I was unsure. I'm happy to avoid changing behaviour for catalog scans in this patch. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-09-19 16:35:19 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On September 19, 2014 10:16:35 PM CEST, Simon Riggs <simon@2ndquadrant.com> wrote: > >> Are you saying this is a problem or a benefit? (and please explain > >> why). > > > I have no idea what Robert is thinking of, but I'd imagine its horrible for workloads with catalog bloat. Like ones involvingtemp tables. > > Yeah. But it's also the case that we know a good deal more about the > access patterns for system-driven catalog updates than we do about user > queries. ISTM we could probably suppress HOT pruning during catalog > *scans* and instead try to do it when a system-driven heap_update > occurs. > > Having said that, this could reasonably be considered outside the scope > of a patch that's trying to improve the behavior for user queries. > But if the patch author doesn't want to expand the scope like that, > ISTM he ought to ensure that the behavior *doesn't* change for system > accesses, rather than trying to convince us that disabling HOT for > system updates is a good idea. I think it'd have to change for anything not done via the executor. There definitely is user defined code out there doing manual heap_* stuff. I know because i've written some. And I know I'm not the only one. If such paths suddenly stop doing HOT cleanup we'll cause a noticeable amount of pain. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Sep 19, 2014 at 5:42 PM, Andres Freund <andres@anarazel.de> wrote: > On 2014-09-19 17:29:08 -0400, Robert Haas wrote: >> > I generally have serious doubts about disabling it generally for >> > read workloads. I imagine it e.g. will significantly penalize >> > workloads where its likely that a cleanup lock can't be acquired >> > every time... >> >> I share that doubt. But I understand why Simon wants to do something, >> too, because the current situation is not great either. > > Right, I totally agree. I doubt a simple approach like this will work in > the general case, but I think something needs to be done. > > I think limiting the amount of HOT cleanup for readonly queries is a > good idea, but I think it has to be gradual. Say after a single cleaned > up page at least another 500 pages need to have been touched till the > next hot cleanup. That way a single query won't be penalized with > cleaning up everything, but there'll be some progress. I tried this kind of thing several years ago with hint-bit-setting and was unimpressed by the results. http://www.postgresql.org/message-id/AANLkTik5QzR8wTs0MqCWwmNp-qHGrdKY5Av5aOB7W4Dp@mail.gmail.com http://www.postgresql.org/message-id/AANLkTimGKaG7wdu-x77GNV2Gh6_Qo5Ss1u5b6Q1MsPUy@mail.gmail.com Granted, I never tried a ratio as low as 500:1, and HOT pruning is not the same thing as setting hint bits, but I think the basic problems are similar, namely: 1. You can't know how many times the page is going to be referenced in the future before it again gets modified. If that number is small, then you shouldn't bother with hint bits, or HOT-pruning, or freezing. But if it's big, you should do all of those things as soon as possible because the benefits are quite significant. Therefore, any change in this area is guaranteed to lose on some easy-to-construct workload, because I just described two of them that want opposing things. 2. 
Dirtying every N'th page is a great way to generate lots of random I/O that will quite possibly make your disk almost as sad - or even sadder - than dirtying all of them, but without anywhere near as much performance benefit. Variations on this idea have been proposed so many times over the years that I'm tempted to give some credence to the theory that we ought to adopt one of them. But there will certainly be losers, as well as winners. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
This patch has gotten a fair amount of review, and has been rewritten once during the commitfest. I think it's pretty close to being committable, the only remaining question seems to be what to do with system catalogs. I'm marking this as "Returned with feedback", I take it that Simon can proceed from here, outside the commitfest. - Heikki
On 2014-09-27 10:23:33 +0300, Heikki Linnakangas wrote: > This patch has gotten a fair amount of review, and has been rewritten once > during the commitfest. I think it's pretty close to being committable, the > only remaining question seems to be what to do with system catalogs. I'm > marking this as "Returned with feedback", I take it that Simon can proceed > from here, outside the commitfest. FWIW, I don't think it is, even with that. As is it seems very likely that it's going to regress a fair share of workloads. At the very least it needs a fair amount of benchmarking beforehand. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 27 September 2014 09:29, Andres Freund <andres@anarazel.de> wrote: > On 2014-09-27 10:23:33 +0300, Heikki Linnakangas wrote: >> This patch has gotten a fair amount of review, and has been rewritten once >> during the commitfest. I think it's pretty close to being committable, the >> only remaining question seems to be what to do with system catalogs. I'm >> marking this as "Returned with feedback", I take it that Simon can proceed >> from here, outside the commitfest. > > FWIW, I don't think it is, even with that. As is it seems very likely > that it's going to regress a fair share of workloads. At the very least > it needs a fair amount of benchmarking beforehand. There is some doubt there. We've not seen a workload that does actually exhibit a negative behaviour. I'm not saying one doesn't exist, but it does matter how common/likely it is. If anyone can present a performance test case that demonstrates a regression, I think it will make it easier to discuss how wide that case is and what we should do about it. Discussing whether to do various kinds of limited pruning are moot until that is clear. My memory was that it took months for people to understand the frequent update use case, since catching it in flagrante delicto was hard. That may be the case here, or not, but negative-benefit experimental results very welcome. Updated patch attached to address earlier comments. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 2014-09-28 19:51:36 +0100, Simon Riggs wrote: > On 27 September 2014 09:29, Andres Freund <andres@anarazel.de> wrote: > > On 2014-09-27 10:23:33 +0300, Heikki Linnakangas wrote: > >> This patch has gotten a fair amount of review, and has been rewritten once > >> during the commitfest. I think it's pretty close to being committable, the > >> only remaining question seems to be what to do with system catalogs. I'm > >> marking this as "Returned with feedback", I take it that Simon can proceed > >> from here, outside the commitfest. > > > > FWIW, I don't think it is, even with that. As is it seems very likely > > that it's going to regress a fair share of workloads. At the very least > > it needs a fair amount of benchmarking beforehand. > > There is some doubt there. We've not seen a workload that does > actually exhibit a negative behaviour. Neither is there much data about the magnitude of positive effect the patch has... > I'm not saying one doesn't exist, but it does matter how common/likely > it is. If anyone can present a performance test case that demonstrates > a regression, I think it will make it easier to discuss how wide that > case is and what we should do about it. Discussing whether to do > various kinds of limited pruning are moot until that is clear. I doubt it'll be hard to construct a case where it'll show. 
My first try of using a pgbench scale 100, -M prepared, -cj8 with a custom file with 1 write and 5 read transactions yielded the following on my laptop:

Baseline:
relname                | pgbench_tellers
pg_total_relation_size | 458752
relname                | pgbench_accounts
pg_total_relation_size | 1590337536
relname                | pgbench_branches
pg_total_relation_size | 286720
relname                | pgbench_history
pg_total_relation_size | 49979392

Patched:
relname                | pgbench_tellers
pg_total_relation_size | 516096
relname                | pgbench_accounts
pg_total_relation_size | 1590337536
relname                | pgbench_branches
pg_total_relation_size | 360448
relname                | pgbench_history
pg_total_relation_size | 49528832

So, there's a noticeable increase in size. Mostly on the smaller tables, so probably HOT cleanup was sometimes skipped during UPDATEs due to locks.

Baseline was: tps = 9655.486532 (excluding connections establishing)
Patched was: tps = 9466.158701 (including connections establishing)

That's not an unrealistic testcase. I'm pretty sure this could be made quite a bit more pronounced by not using a uniform distribution in the pgbench runs. And selecting a test that's more vulnerable to the change (e.g. using a wider distribution for the read-only statements than the modifying ones) would make the CPU overhead of the additional heap_hot_search_buffer() calls heavier. > My memory was that it took months for people to understand the > frequent update use case, since catching it in flagrante delicto was > hard. That may be the case here, or not, but negative-benefit > experimental results very welcome. > > Updated patch attached to address earlier comments. contrib (at least pgstattuple) doesn't currently compile with this... Easily patched up tho. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2014-09-29 11:13:43 +0200, Andres Freund wrote: > I doubt it'll be hard to construct a case where it'll show. My first try > of using a pgbench scale 100, -M prepared, -cj8 with a custom file with > 1 write and 5 read transactions yielded the following on my laptop: Hm. On second thought that testcase probably *decreased* the bad effects of this because pgbench's readonly statements only touch one table... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-09-29 11:31:11 +0200, Andres Freund wrote: > On 2014-09-29 11:13:43 +0200, Andres Freund wrote: > > I doubt it'll be hard to construct a case where it'll show. My first try > > of using a pgbench scale 100, -M prepared, -cj8 with a custom file with > > 1 write and 5 read transaction yielded the following on my laptop: > > Hm. On second thought that testcase probably *decreased* the bad effects > of this because pgbench's readonly statements only touch one table... I've attached the file nonetheless, for posterities sake. Note it has the scale hardcoded... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
What happened to this patch? I'm going over something that could use the concept of "clean some stuff up when reading this page, but only if we're already writing" or similar. I see some cases were presented that had a performance decrease. Did we get any numbers for the increase in performance in some other interesting cases? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 17 November 2014 21:09, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > What happened to this patch? I'm going over something that could use > the concept of "clean some stuff up when reading this page, but only if > we're already writing" or similar. > > I see some cases were presented that had a performance decrease. Did we > get any numbers for the increase in performance in some other > interesting cases? It's not dead; it just needs more work. Maybe for next CF, or you can now. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 17 November 2014 at 22:08, Simon Riggs <simon@2ndquadrant.com> wrote: > On 17 November 2014 21:09, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> What happened to this patch? I'm going over something that could use >> the concept of "clean some stuff up when reading this page, but only if >> we're already writing" or similar. >> >> I see some cases were presented that had a performance decrease. Did we >> get any numbers for the increase in performance in some other >> interesting cases? > > It's not dead; it just needs more work. Maybe for next CF, or you can now. Latest version attached for next CF -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 15 December 2014 at 20:26, Jeff Janes <jeff.janes@gmail.com> wrote: > I still get the compiler error in contrib: > > pgstattuple.c: In function 'pgstat_heap': > pgstattuple.c:279: error: too few arguments to function > 'heap_beginscan_strat' > > Should it pass false for the always_prune? Yes. New version attached. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Wed, Dec 17, 2014 at 5:39 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 15 December 2014 at 20:26, Jeff Janes <jeff.janes@gmail.com> wrote: > > I still get the compiler error in contrib: > > pgstattuple.c: In function 'pgstat_heap': > > pgstattuple.c:279: error: too few arguments to function > > 'heap_beginscan_strat' > > Should it pass false for the always_prune? > Yes. > New version attached. Moved patch to CF 2015-02 with the same status, "Needs review". It visibly needs more work, as well as numbers showing a performance increase, since so far only cases showing a performance decrease have turned up. -- Michael
On 12/17/14 3:39 AM, Simon Riggs wrote: > On 15 December 2014 at 20:26, Jeff Janes <jeff.janes@gmail.com> wrote: > >> I still get the compiler error in contrib: >> >> pgstattuple.c: In function 'pgstat_heap': >> pgstattuple.c:279: error: too few arguments to function >> 'heap_beginscan_strat' >> >> Should it pass false for the always_prune? > > Yes. > > New version attached. README.HOT section "When can/should we prune or defragment?" needs a major update as a result of this patch.
On 3/9/15 1:36 PM, Jeff Janes wrote: > Did versions 7 and 8 of this patch address Andres' concern about > performance regressions? I don't think so. Andres basically wanted a nontrivial algorithm to determine how much pruning to do during a read-only scan. And Robert basically said, that's not really possible. The presented patch actually has a hardcoded prune limit of 4 per scan, which I don't see mentioned in the discussion anywhere (except in very early versions, where this was exposed as a knob). I think most people were of the opinion that scans on system catalogs should not be affected by this behavior change. Makes sense to me: System catalog bloat is likely a bigger problem than speeding up queries on catalogs with large live data. And then there is still some disagreement whether just turning this on is tolerable for all uses. Andres mentioned workloads that have trouble getting a cleanup lock. README.HOT seems to think that cleaning up during reads is important because skipping over dead tuples is expensive. Nobody seems to like the idea of (implicitly) pushing more responsibility on VACUUM. We have seen some benchmarks that show significant improvements. We have seen some (constructed ones) that show problems. I don't know how to move forward. We could give users a knob: This might make your queries faster or not -- good luck. But of course nobody will like that either.
On 2015-03-11 20:55:18 -0400, Peter Eisentraut wrote: > I don't think so. Andres basically wanted a nontrival algorithm to > determine how much pruning to do during a read-only scan. And Robert > basically said, that's not really possible. I don't think either of us made really strong statements. > We have seen some benchmarks that show significant improvements. We > have seen some (constructed ones) that show problems. FWIW, it's not that constructed. It's just a mixture of read with write load. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 3/12/15 5:41 AM, Andres Freund wrote: > On 2015-03-11 20:55:18 -0400, Peter Eisentraut wrote: >> I don't think so. Andres basically wanted a nontrival algorithm to >> determine how much pruning to do during a read-only scan. And Robert >> basically said, that's not really possible. > > I don't think either of us made really strong statements. I didn't mean to put words in your mouth. I just wanted to summarize the thread as, Andres wanted more fine-tuning on the behavior, Robert expressed serious doubts that that will lead to an acceptable result.
On Thu, Mar 12, 2015 at 3:48 PM, Peter Eisentraut <peter_e@gmx.net> wrote: > On 3/12/15 5:41 AM, Andres Freund wrote: >> On 2015-03-11 20:55:18 -0400, Peter Eisentraut wrote: >>> I don't think so. Andres basically wanted a nontrival algorithm to >>> determine how much pruning to do during a read-only scan. And Robert >>> basically said, that's not really possible. >> >> I don't think either of us made really strong statements. > > I didn't mean to put words in your mouth. I just wanted to summarize > the thread as, Andres wanted more fine-tuning on the behavior, Robert > expressed serious doubts that that will lead to an acceptable result. Or to put that another way, I'm not sure there's one behavior here that will please everybody. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 11 March 2015 at 20:55, Peter Eisentraut <peter_e@gmx.net> wrote: > I don't know how to move forward. We could give users a knob: This > might make your queries faster or not -- good luck. But of course > nobody will like that either. What is clear is that large SELECT queries are doing the work VACUUM should do. We should not be doing large background tasks (block cleanup) during long running foreground tasks. But there is no need for changing behaviour during small SELECTs. So the setting of 4 gives current behaviour for small SELECTs and new behaviour for larger SELECTs. The OP said this... <op> We also make SELECT clean up blocks as it goes. That is useful in OLTP workloads, but it means that large SQL queries and pg_dump effectively do much the same work as VACUUM, generating huge amounts of I/O and WAL on the master, the cost and annoyance of which is experienced directly by the user. That is avoided on standbys. Effects of that are that long running statements often run much longer than we want, increasing bloat as a result. It also produces wildly varying response times, depending upon extent of cleanup required. </op> This is not a performance patch. This is about one user doing the cleanup work for another. People running large SELECTs should not be penalised. The patch has been shown to avoid that and no further discussion should be required. I don't really care whether we have a parameter for this or not. As long as we have the main feature. It's trivial to add/remove a parameter to control this. Currently there isn't one. I'd like to commit this. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Attachment
On 4/14/15 6:07 PM, Simon Riggs wrote: > On 11 March 2015 at 20:55, Peter Eisentraut <peter_e@gmx.net> wrote: > >> I don't know how to move forward. We could give users a knob: This >> might make your queries faster or not -- good luck. But of course >> nobody will like that either. > > What is clear is that large SELECT queries are doing the work VACUUM > should do. We should not be doing large background tasks (block > cleanup) during long running foreground tasks. But there is no need > for changing behaviour during small SELECTs. So the setting of 4 gives > current behaviour for small SELECTs and new behaviour for larger > SELECTs. > > The OP said this... > <op> > We also make SELECT clean up blocks as it goes. That is useful in OLTP > workloads, but it means that large SQL queries and pg_dump effectively > do much the same work as VACUUM, generating huge amounts of I/O and > WAL on the master, the cost and annoyance of which is experienced > directly by the user. That is avoided on standbys. > > Effects of that are that long running statements often run much longer > than we want, increasing bloat as a result. It also produces wildly > varying response times, depending upon extent of cleanup required. > </op> > > This is not a performance patch. This is about one user doing the > cleanup work for another. People running large SELECTs should not be > penalised. The patch has been shown to avoid that and no further > discussion should be required. > > I don't really care whether we have a parameter for this or not. As > long as we have the main feature. > > It's trivial to add/remove a parameter to control this. Currently > there isn't one. > > I'd like to commit this. +1 from me. One of the last databases I worked on had big raw partitions that were written to and then sequentially scanned exactly once before being dropped. It was painful to see all those writes happening for nothing. 
In other cases there were sequential scans that happened directly after the main writes, but then the next read might be days in the future (if ever) and the system was basically idle for a while which would have allowed vacuum to come in and do the job without affecting performance of the main job. I think that in batch-oriented databases this patch will definitely be a boon to performance. -- - David Steele david@pgmasters.net
On Tue, Apr 14, 2015 at 6:07 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 11 March 2015 at 20:55, Peter Eisentraut <peter_e@gmx.net> wrote: >> I don't know how to move forward. We could give users a knob: This >> might make your queries faster or not -- good luck. But of course >> nobody will like that either. > > What is clear is that large SELECT queries are doing the work VACUUM > should do. We should not be doing large background tasks (block > cleanup) during long running foreground tasks. But there is no need > for changing behaviour during small SELECTs. So the setting of 4 gives > current behaviour for small SELECTs and new behaviour for larger > SELECTs. Peter commented previously that README.HOT should get an update. The relevant section seems to be "When can/should we prune or defragment?". I wonder if it would be a useful heuristic to still prune pages if those pages are already dirty. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 14 April 2015 at 21:53, Robert Haas <robertmhaas@gmail.com> wrote: > Peter commented previously that README.HOT should get an update. The > relevant section seems to be "When can/should we prune or > defragment?". That's easy enough to change once we agree to commit. > I wonder if it would be a useful heuristic to still prune pages if > those pages are already dirty. Useful for who? This is about responsibility. Why should someone performing a large SELECT take the responsibility for cleaning pages? -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On Wed, Apr 15, 2015 at 3:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 14 April 2015 at 21:53, Robert Haas <robertmhaas@gmail.com> wrote: >> Peter commented previously that README.HOT should get an update. The >> relevant section seems to be "When can/should we prune or >> defragment?". > > That's easy enough to change once we agree to commit. > >> I wonder if it would be a useful heuristic to still prune pages if >> those pages are already dirty. > > Useful for who? This is about responsibility. Why should someone > performing a large SELECT take the responsibility for cleaning pages? Because it makes subsequent accesses to the page cheaper. Of course, that applies in all cases, but when the page is already dirty, the cost of pruning it is probably quite small - we're going to have to write the page anyway, and pruning it before it gets evicted (perhaps even by our scan) will be cheaper than writing it now and writing it again after it's pruned. When the page is clean, the cost of pruning is significantly higher. I won't take responsibility for paying my neighbor's tax bill, but I might take responsibility for picking up his mail while he's on holiday. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 15 April 2015 at 08:04, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Apr 15, 2015 at 3:37 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 14 April 2015 at 21:53, Robert Haas <robertmhaas@gmail.com> wrote: >>> Peter commented previously that README.HOT should get an update. The >>> relevant section seems to be "When can/should we prune or >>> defragment?". >> >> That's easy enough to change once we agree to commit. >> >>> I wonder if it would be a useful heuristic to still prune pages if >>> those pages are already dirty. >> >> Useful for who? This is about responsibility. Why should someone >> performing a large SELECT take the responsibility for cleaning pages? > > Because it makes subsequent accesses to the page cheaper. Cheaper for whom? > Of > course, that applies in all cases, but when the page is already dirty, > the cost of pruning it is probably quite small - we're going to have > to write the page anyway, and pruning it before it gets evicted > (perhaps even by our scan) will be cheaper than writing it now and > writing it again after it's pruned. When the page is clean, the cost > of pruning is significantly higher. "We" aren't going to have to write the page, but someone will. In a single workload, the mix of actions can be useful. In separate workloads, where some guy just wants to run a report or a backup, it's not right that we slow them down because of someone else's actions. > I won't take responsibility for paying my neighbor's tax bill, but I > might take responsibility for picking up his mail while he's on > holiday. That makes it sound like this is an occasional, non-annoying thing. It's more like, whoever fetches the mail needs to fetch it for everybody. So we are slowing down one person disproportionately, while others fly through without penalty. There is no argument that one workload necessarily needs to perform that on behalf of the other workload. 
The actions you suggest are reasonable and should ideally be the role of a background process. But that doesn't mean in the absence of that we should pay the cost in the foreground. Let me apply this patch. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On 2015-04-15 08:42:33 -0400, Simon Riggs wrote: > > Because it makes it subsequent accesses to the page cheaper. > > Cheaper for whom? Everyone. Including further readers. Following HOT chains in read mostly workloads can be really expensive. If you have workloads with a 'hot' value range that's frequently updated, but that range moves you can easily end up with heavily chained tuples which won't soon be touched by a writer again. And writers will often not yet be able to prune the page because there's still live readers for the older versions (like other updaters). > > Of > > course, that applies in all cases, but when the page is already dirty, > > the cost of pruning it is probably quite small - we're going to have > > to write the page anyway, and pruning it before it gets evicted > > (perhaps even by our scan) will be cheaper than writing it now and > > writing it again after it's pruned. When the page is clean, the cost > > of pruning is significantly higher. > > "We" aren't going to have to write the page, but someone will. If it's already dirty that doesn't change at all. *Not* pruning in that moment actually will often *increase* the total amount of writes to the OS. Because now the pruning will happen on the next write access or vacuum - when the page already might have been undirtied. I don't really see the downside to this suggestion. > The actions you suggest are reasonable and should ideally be the role > of a background process. But that doesn't mean in the absence of that > we should pay the cost in the foreground. I'm not sure that's true. A background process will either cause additional read IO to find worthwhile pages, or it'll not find worthwhile pages because they're already paged out. Greetings, Andres Freund
On 15 April 2015 at 09:10, Andres Freund <andres@anarazel.de> wrote:
> On 2015-04-15 08:42:33 -0400, Simon Riggs wrote:
>> > Because it makes subsequent accesses to the page cheaper.
>>
>> Cheaper for whom?
>
> Everyone.

I think what you mean is "everyone else". It is demonstrably quicker and more consistent for a process to limit the amount of pruning it does, and pruning also generates additional WAL traffic, causing replication lag.

I love it when someone cleans up for me. I just don't think they'll accept the argument that they should clean up for me because it makes their life easier. Certainly doesn't work with my kids.

> I don't really see the downside to this suggestion.

The suggestion makes things better than they are now but is still less than I have proposed. If what you both mean is "IMHO this is an acceptable compromise", I can accept it also, at this point in the CF.

-- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Simon Riggs wrote:
> On 15 April 2015 at 09:10, Andres Freund <andres@anarazel.de> wrote:
> > I don't really see the downside to this suggestion.
>
> The suggestion makes things better than they are now but is still less
> than I have proposed.
>
> If what you both mean is "IMHO this is an acceptable compromise", I
> can accept it also, at this point in the CF.

Let me see if I understand things.

What we have now is: when reading a page, we also HOT-clean it. This runs HOT-cleanup a large number of times, and causes many pages to become dirty.

Your patch is "when reading a page, HOT-clean it, but only 5 times in each scan". This runs HOT-cleanup at most 5 times, and causes at most 5 pages to become dirty.

Robert's proposal is "when reading a page, if dirty HOT-clean it; if not dirty, also HOT-clean it but only 5 times in each scan". This runs HOT-cleanup some number of times (as many as there are dirty pages, plus up to 5 clean ones), and causes at most 5 pages to become dirty.

Am I right in thinking that HOT-cleanup of a dirty page is something that runs completely within CPU cache? If so, it would be damn fast and would have benefits for future readers, for very little cost. Dirtying a page is very different; if buffer reads are common, the system is later bogged down trying to find clean pages into which to read uncached buffers (including by the read-only scan itself, which becomes slower).

If I have understood things correctly, then I stand behind Robert's suggestion.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
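The three policies summarized above can be sketched as a toy model. This is purely illustrative: the page-stream representation, the function name, and the hard-coded limit of 5 are assumptions for the sketch, not the actual heapam code, where the decision would live in heap_page_prune_opt().

```python
# Toy model of the three pruning policies under discussion.
# `pages` is the sequence of prunable pages one scan visits, as booleans:
# True = page is already dirty, False = page is clean.

def pages_pruned(pages, policy, clean_limit=5):
    """Return (prunes_done, pages_newly_dirtied) for one scan."""
    prunes = newly_dirty = clean_pruned = 0
    for dirty in pages:
        if policy == "current":               # today: always prune
            do_prune = True
        elif policy == "simon":               # patch: at most N prunes per scan
            do_prune = prunes < clean_limit
        elif policy == "robert":              # dirty always; clean up to N
            do_prune = dirty or clean_pruned < clean_limit
        else:
            raise ValueError(policy)
        if do_prune:
            prunes += 1
            if not dirty:                     # pruning a clean page dirties it
                newly_dirty += 1
                clean_pruned += 1
    return prunes, newly_dirty
```

On a scan that visits 10 dirty then 10 clean prunable pages, this gives (20, 10) today, (5, 0) with the patch, and (15, 5) under Robert's variant: the same bound on newly dirtied pages, but no pruning forgone where it was already cheap.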
On 04/15/2015 05:44 PM, Alvaro Herrera wrote: > Simon Riggs wrote: >> On 15 April 2015 at 09:10, Andres Freund <andres@anarazel.de> wrote: > >>> I don't really see the downside to this suggestion. >> >> The suggestion makes things better than they are now but is still less >> than I have proposed. >> >> If what you both mean is "IMHO this is an acceptable compromise", I >> can accept it also, at this point in the CF. > > Let me see if I understand things. > > What we have now is: when reading a page, we also HOT-clean it. This > runs HOT-cleanup a large number of times, and causes many pages to > become dirty. > > Your patch is "when reading a page, HOT-clean it, but only 5 times in > each scan". This runs HOT-cleanup at most 5 times, and causes at most 5 > pages to become dirty. > > Robert's proposal is "when reading a page, if dirty HOT-clean it; if not > dirty, also HOT-clean it but only 5 times in each scan". This runs > HOT-cleanup some number of times (as many as there are dirty), and > causes at most 5 pages to become dirty. > > > Am I right in thinking that HOT-clean in a dirty page is something that > runs completely within CPU cache? If so, it would be damn fast and > would have benefits for future readers, for very little cost. If there are many tuples on the page, it takes some CPU effort to scan all the HOT chains and move tuples around. Also, it creates a WAL record, which isn't free. Another question is whether the patch can reliably detect whether it's doing a "read-only" scan or not. I haven't tested, but I suspect it'd not do pruning when you do something like "INSERT INTO foo SELECT * FROM foo WHERE blah". I.e. when the target relation is referenced twice in the same statement: once as the target, and second time as a source. Maybe that's OK, though. - Heikki
Heikki Linnakangas wrote: > On 04/15/2015 05:44 PM, Alvaro Herrera wrote: > >Robert's proposal is "when reading a page, if dirty HOT-clean it; if not > >dirty, also HOT-clean it but only 5 times in each scan". This runs > >HOT-cleanup some number of times (as many as there are dirty), and > >causes at most 5 pages to become dirty. > > > > > >Am I right in thinking that HOT-clean in a dirty page is something that > >runs completely within CPU cache? If so, it would be damn fast and > >would have benefits for future readers, for very little cost. > > If there are many tuples on the page, it takes some CPU effort to scan all > the HOT chains and move tuples around. Also, it creates a WAL record, which > isn't free. But if the page is in CPU cache, the CPU effort shouldn't be all that noticeable, should it? That's my point, but then maybe I'm wrong. Now, the WAL logging is annoying, so let's limit that too -- do it at most for, say, 20 dirty pages and at most 5 clean pages. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/15/2015 07:11 PM, Alvaro Herrera wrote: > Heikki Linnakangas wrote: >> On 04/15/2015 05:44 PM, Alvaro Herrera wrote: > >>> Robert's proposal is "when reading a page, if dirty HOT-clean it; if not >>> dirty, also HOT-clean it but only 5 times in each scan". This runs >>> HOT-cleanup some number of times (as many as there are dirty), and >>> causes at most 5 pages to become dirty. >>> >>> >>> Am I right in thinking that HOT-clean in a dirty page is something that >>> runs completely within CPU cache? If so, it would be damn fast and >>> would have benefits for future readers, for very little cost. >> >> If there are many tuples on the page, it takes some CPU effort to scan all >> the HOT chains and move tuples around. Also, it creates a WAL record, which >> isn't free. > > But if the page is in CPU cache, the CPU effort shouldn't be all that > noticeable, should it? That's my point, but then maybe I'm wrong. Now, > the WAL logging is annoying, so let's limit that too -- do it at most > for, say, 20 dirty pages and at most 5 clean pages. There isn't much difference between that and just doing it on first 5 pages. Both of those numbers were pulled out of thin air, anyway. I'd rather just keep it simple. - Heikki
On Wed, Apr 15, 2015 at 8:42 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> I won't take responsibility for paying my neighbor's tax bill, but I >> might take responsibility for picking up his mail while he's on >> holiday. > > That makes it sound like this is an occasional, non-annoying thing. > > It's more like, whoever fetches the mail needs to fetch it for > everybody. So we are slowing down one person disproportionately, while > others fly through without penalty. There is no argument that one > workload necessarily needs to perform that on behalf of the other > workload. Sure there is. It's called a tragedy of the commons - everybody acts in their own selfish interest (it's not *my* responsibility to limit grazing on public land, or prune this page that I'm not modifying) and as a result some resource that everybody cares about (grass, system-wide I/O) gets trashed to everyone's detriment. Purely selfish behavior can only be justified here if we assume that the selfish actor intends to participate in the system only once: I'm going to run one big reporting query which must run as fast as possible, and then I'm getting on a space ship to Mars. So if my refusal to do any pruning during that reporting query causes lots of extra I/O on the system ten minutes from now, I don't care, because I'll have left the playing field forever at that point. As Heikki points out, any HOT pruning operation generates WAL and has a CPU cost. However, pruning a page that is currently dirty *decreases* the total volume of writes to the data files, whereas pruning a page that is currently clean *increases* the total volume of writes to the data files. In the first case, if we prune the page right now while it's still dirty, we can't possibly cause any additional data-file writes, and we may save one, because it's possible that someone else would later have pruned it when it was clean and there was no other reason to dirty it. 
In the second case, if we prune the page that is currently clean, it will become dirty. That will cost us no additional I/O if the page is again modified before it's written out, but otherwise it costs an additional data file write. I think there's a big difference between those two cases. Sure, from the narrow point of view of how much work it takes this scan to process this page, it's always better not to prune. But if you make the more realistic assumption that you will keep on issuing queries on the system, then what you're doing to the overall system I/O load is pretty important. By the way, was anything ever done about this: http://www.postgresql.org/message-id/20140929091343.GA4716@alap3.anarazel.de That's just a workload that is 5/6th pgbench -S and 1/6th pgbench, which is in no way an unrealistic workload, and showed a significant regression with an earlier version of the patch. You seem very eager to commit this patch after four months of inactivity, but I think this is a pretty massive behavior change that deserves careful scrutiny before it goes in. If we push something that changes longstanding behavior and can't even be turned off, and it regresses behavior for a use case that common, our users are going to come after us with pitchforks. That's not to say some people won't be happy, but in my experience it takes a lot of happy users to make up for getting stabbed with even one pitchfork. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 15 April 2015 at 12:39, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Apr 15, 2015 at 8:42 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >>> I won't take responsibility for paying my neighbor's tax bill, but I >>> might take responsibility for picking up his mail while he's on >>> holiday. >> >> That makes it sound like this is an occasional, non-annoying thing. >> >> It's more like, whoever fetches the mail needs to fetch it for >> everybody. So we are slowing down one person disproportionately, while >> others fly through without penalty. There is no argument that one >> workload necessarily needs to perform that on behalf of the other >> workload. > > Sure there is. It's called a tragedy of the commons - everybody acts > in their own selfish interest (it's not *my* responsibility to limit > grazing on public land, or prune this page that I'm not modifying) and > as a result some resource that everybody cares about (grass, > system-wide I/O) gets trashed to everyone's detriment. Purely selfish > behavior can only be justified here if we assume that the selfish > actor intends to participate in the system only once: I'm going to run > one big reporting query which must run as fast as possible, and then > I'm getting on a space ship to Mars. So if my refusal to do any > pruning during that reporting query causes lots of extra I/O on the > system ten minutes from now, I don't care, because I'll have left the > playing field forever at that point. It all depends upon who is being selfish. Why is a user "selfish" for not wanting to clean every single block they scan, when the people that made the mess do nothing and go faster 10 minutes from now? Randomly and massively penalising large SELECTs makes no sense. Some cleanup is OK, with reasonable limits, which is why that is proposed. 
On 04/15/2015 05:44 PM, Alvaro Herrera wrote: > Robert's proposal is "when reading a page, if dirty HOT-clean it; if not > dirty, also HOT-clean it but only 5 times in each scan". This runs > HOT-cleanup some number of times (as many as there are dirty), and > causes at most 5 pages to become dirty. My understanding of Robert's proposal was "when reading a page, HOT-clean it, but only do this up to 5 times on clean pages, but continue to do this indefinitely when the page is already dirty.". Andres said that was the only way and I have agreed to it. Are you now saying not to commit your proposal at all? -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
Simon Riggs wrote: > On 15 April 2015 at 12:39, Robert Haas <robertmhaas@gmail.com> wrote: > On 04/15/2015 05:44 PM, Alvaro Herrera wrote: > > Robert's proposal is "when reading a page, if dirty HOT-clean it; if not > > dirty, also HOT-clean it but only 5 times in each scan". This runs > > HOT-cleanup some number of times (as many as there are dirty), and > > causes at most 5 pages to become dirty. > > My understanding of Robert's proposal was "when reading a page, > HOT-clean it, but only do this up to 5 times on clean pages, but > continue to do this indefinitely when the page is already dirty.". To me, both statements look identical. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 15 April 2015 at 16:01, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Simon Riggs wrote: >> On 15 April 2015 at 12:39, Robert Haas <robertmhaas@gmail.com> wrote: > >> On 04/15/2015 05:44 PM, Alvaro Herrera wrote: >> > Robert's proposal is "when reading a page, if dirty HOT-clean it; if not >> > dirty, also HOT-clean it but only 5 times in each scan". This runs >> > HOT-cleanup some number of times (as many as there are dirty), and >> > causes at most 5 pages to become dirty. >> >> My understanding of Robert's proposal was "when reading a page, >> HOT-clean it, but only do this up to 5 times on clean pages, but >> continue to do this indefinitely when the page is already dirty.". > > To me, both statements look identical. I didn't read it that way, apologies for any confusion. But that is good news. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services
On Wed, Apr 15, 2015 at 6:10 AM, Andres Freund <andres@anarazel.de> wrote:
>> > Of course, that applies in all cases, but when the page is already dirty,
>> > the cost of pruning it is probably quite small - we're going to have
>> > to write the page anyway, and pruning it before it gets evicted
>> > (perhaps even by our scan) will be cheaper than writing it now and
>> > writing it again after it's pruned. When the page is clean, the cost
>> > of pruning is significantly higher.
>>
>> "We" aren't going to have to write the page, but someone will.
>
> If it's already dirty that doesn't change at all. *Not* pruning in that
> moment actually will often *increase* the total amount of writes to the
> OS. Because now the pruning will happen on the next write access or
> vacuum - when the page already might have been undirtied.
>
> I don't really see the downside to this suggestion.

+1. I think the opportunity cost of not pruning when a page is already dirty is likely to be rather high; in general, pruning there is likely to be worth it.

-- Peter Geoghegan
On 15 Apr 2015 15:43, "Simon Riggs" <simon@2ndquadrant.com> wrote:
>
> It all depends upon who is being selfish. Why is a user "selfish" for
> not wanting to clean every single block they scan, when the people
> that made the mess do nothing and go faster 10 minutes from now?
> Randomly and massively penalising large SELECTs makes no sense. Some
> cleanup is OK, with reasonable limits, which is why that is proposed.

I don't think it's productive to think of a query as a different actor with only an interest in its own performance and no interest in overall system performance.

From a holistic point of view the question is how many times is a given HOT chain going to need to be followed before it's pruned. Or to put it another way, how expensive is creating a HOT chain. Does it cause a single prune? A fixed number of chain readers followed by a prune? Does the amount of work depend on the workload or is it consistent?

My intuition is that a fixed cutoff like "five pages" is dangerous because if you update many pages there's no limit to the number of times they'll be read before they're all pruned. The steady state could easily be that every query is having to read HOT chains forever.

My intuition, again, is that what we need is a percentage, such as "do 10 prunes, then ignore the next 1000 clean pages with HOT chains". That guarantees that after roughly 100 selects the HOT chains will all be pruned, but each select will only prune about 1% of the clean pages it sees.
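Greg's percentage scheme is a small state machine: prune a burst, then skip a long run of candidate pages, then allow a new burst. A rough sketch (the burst/skip numbers are his examples; the closure shape and names are mine, not anything from the patch):

```python
# Sketch of "do `burst` prunes, then ignore the next `skip` clean pages
# with HOT chains". Illustrative only; mirrors no actual heapam code.

def make_ratio_pruner(burst=10, skip=1000):
    state = {"pruned": 0, "skipped": 0}

    def should_prune():
        """Called once per clean candidate page; returns True to prune it."""
        if state["pruned"] < burst:
            state["pruned"] += 1
            return True
        state["skipped"] += 1
        if state["skipped"] >= skip:          # window over: allow a new burst
            state["pruned"] = state["skipped"] = 0
        return False

    return should_prune
```

With the defaults, each scan prunes 10 of every 1010 candidate pages, roughly 1%, so on the order of 100 scans clear everything while no single scan bears much of the cost.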
Pavan Deolasee wrote:
> On Thu, Apr 16, 2015 at 2:47 PM, Greg Stark <stark@mit.edu> wrote:
> > From a holistic point of view the question is how many times is a given
> > HOT chain going to need to be followed before it's pruned. Or to put it
> > another way, how expensive is creating a HOT chain. Does it cause a single
> > prune? a fixed number of chain readers followed by a prune? Does the amount
> > of work depend on the workload or is it consistent?
>
> IMO the size or traversal of the HOT chain is not that expensive compared
> to the cost of pruning too frequently, which generates WAL as well
> as makes buffers dirty. OTOH the cost of less frequent pruning could also be
> very high. It can cause severe table bloat which may just stay for a very
> long time. Even if dead space is recovered within a page, truncating a
> bloated heap is not always possible.

I think you're failing to consider that in the patch there is a distinction between read-only page accesses and page updates. During a page update, HOT cleanup is always done even with the patch, so there won't be any additional bloat that would not be there without the patch. It's only the read-only accesses to the page that skip the HOT pruning.

Of course, as Greg says there will be some additional scans of the HOT chain by read-only processes.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-04-16 10:20:20 -0300, Alvaro Herrera wrote: > I think you're failing to consider that in the patch there is a > distinction between read-only page accesses and page updates. During a > page update, HOT cleanup is always done even with the patch, so there > won't be any additional bloat that would not be there without the > patch. That's not really true (and my benchmark upthread proves it). The fact that hot pruning only happens when we can get a cleanup lock means that we can end up with more pages that are full, if we prune on select less often. Especially if SELECTs are more frequent than write accesses - pretty darn common - the likelihood of SELECTs getting the lock is correspondingly higher. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund wrote: > On 2015-04-16 10:20:20 -0300, Alvaro Herrera wrote: > > I think you're failing to consider that in the patch there is a > > distinction between read-only page accesses and page updates. During a > > page update, HOT cleanup is always done even with the patch, so there > > won't be any additional bloat that would not be there without the > > patch. > > That's not really true (and my benchmark upthread proves it). The fact > that hot pruning only happens when we can get a cleanup lock means that > we can end up with more pages that are full, if we prune on select less > often. Especially if SELECTs are more frequent than write accesses - > pretty darn common - the likelihood of SELECTs getting the lock is > correspondingly higher. Interesting point. Of course, this code should count HOT cleanups against the total limit when they are effectively carried out, and ignore those that are skipped because of inability to acquire the cleanup lock. Not sure whether the submitted code does that. Can we keep stats on how many pages we don't clean in the updating process due to failure to acquire cleanup lock? My intuition says that it should be similar to the number of backends running concurrently, but that might be wrong. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Sep 29, 2014 at 2:13 AM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2014-09-28 19:51:36 +0100, Simon Riggs wrote:
> > On 27 September 2014 09:29, Andres Freund <andres@anarazel.de> wrote:
> > > On 2014-09-27 10:23:33 +0300, Heikki Linnakangas wrote:
> > >> This patch has gotten a fair amount of review, and has been rewritten once
> > >> during the commitfest. I think it's pretty close to being committable, the
> > >> only remaining question seems to be what to do with system catalogs. I'm
> > >> marking this as "Returned with feedback", I take it that Simon can proceed
> > >> from here, outside the commitfest.
> > >
> > > FWIW, I don't think it is, even with that. As is it seems very likely
> > > that it's going to regress a fair share of workloads. At the very least
> > > it needs a fair amount of benchmarking beforehand.
> >
> > There is some doubt there. We've not seen a workload that does
> > actually exhibit a negative behaviour.
>
> Neither is there much data about the magnitude of positive effect the
> patch has...
>
> > I'm not saying one doesn't exist, but it does matter how common/likely
> > it is. If anyone can present a performance test case that demonstrates
> > a regression, I think it will make it easier to discuss how wide that
> > case is and what we should do about it. Discussing whether to do
> > various kinds of limited pruning is moot until that is clear.
>
> I doubt it'll be hard to construct a case where it'll show. My first try
> of using a pgbench scale 100, -M prepared, -cj8 with a custom file with
> 1 write and 5 read transactions yielded the following on my laptop:
>
> Baseline:
> relname                | pgbench_tellers
> pg_total_relation_size | 458752
> relname                | pgbench_accounts
> pg_total_relation_size | 1590337536
> relname                | pgbench_branches
> pg_total_relation_size | 286720
> relname                | pgbench_history
> pg_total_relation_size | 49979392
> Patched:
> relname                | pgbench_tellers
> pg_total_relation_size | 516096
> relname                | pgbench_accounts
> pg_total_relation_size | 1590337536
> relname                | pgbench_branches
> pg_total_relation_size | 360448
> relname                | pgbench_history
> pg_total_relation_size | 49528832
>
> So, there's a noticeable increase in size. Mostly on the smaller tables,
> so probably HOT cleanup was sometimes skipped during UPDATEs due to
> locks.
>
> Baseline was:
> tps = 9655.486532 (excluding connections establishing)
> Patched was:
> tps = 9466.158701 (including connections establishing)

Was this reproducible? I've run your custom sql file with 4 clients (that is how many CPUs I have) on a machine with a BBU. I had wal_level = hot_standby, but the archive_command just returned true without archiving anything. And using the latest patch.

The size of the pgbench_tellers and pgbench_branches relations were surprisingly variable in both patched and unpatched, but there was no reliable difference between them, just within them.

On the TPS front, there was a hint that the patched one was slightly slower, but the within-sample variation was also high, and the p-val for difference was only 0.214 on n of 66.

test case attached.

> That's not an unrealistic testcase.
>
> I'm pretty sure this could be made quite a bit more pronounced by not
> using a uniform distribution in the pgbench runs. And selecting a test
> that's more vulnerable to the change (e.g. using a wider distribution
> for the read only statements than the modifying ones) would make the
> CPU overhead of the additional heap_hot_search_buffer() overhead
> heavier.

Sorry, I don't understand this description. Why would queries selecting data that is not changing have any extra overhead?

Is the idea that the hot part of the table for updates would move around over time, but the hot part for selects would be even throughout? I'm not sure how to put that to the test.

Cheers,

Jeff
On 2015-04-20 01:04:18 -0700, Jeff Janes wrote:
> Was this reproducible?

Yes, at least with an old version of the patch. I don't think you could see a difference using exactly that with the newer versions which have the 5 page limit. After all it'll pretty much never reach it.

> > That's not an unrealistic testcase.
> >
> > I'm pretty sure this could be made quite a bit more pronounced by not
> > using a uniform distribution in the pgbench runs. And selecting a test
> > that's more vulnerable to the change (e.g. using a wider distribution
> > for the read only statements than the modifying ones) would make the
> > CPU overhead of the additional heap_hot_search_buffer() overhead
> > heavier.
>
> Sorry, I don't understand this description. Why would queries selecting
> data that is not changing have any extra overhead?

The idea, I think, was that by having a uniform (or just wider) distribution of the reads they'd be more likely to land on values that have been updated at some point, but not been pruned since (because at that point the patch IIRC didn't prune during reads at all).

> Is the idea that the hot part of the table for updates would move around
> over time, but the hot part for selects would be even throughout?

Pretty much.

> I'm not sure how to put that to the test.

That pretty much was what I'd tried to model, yea. I guess it'd be possible to model this by inserting NOW()/updating values NOW() - 5 and selecting values up to NOW() - 60. That'd roughly model some realistic insert/update/select patterns I've seen. To possibly see any difference with the new patch this would have to be done in a way that regularly a couple of pages would be touched, with not that many selected tuples on each.

Greetings, Andres Freund
On Thu, Apr 16, 2015 at 03:41:54PM +0100, Simon Riggs wrote:
> That is how we arrive at the idea of a cleanup limit, further enhanced by a
> limit that applies only to dirtying clean blocks, which we have 4? recent votes
> in favour of.
>
> I would personally be in favour of a parameter to control the limit, since
> whatever we chose is right/wrong depending upon circumstances. I am however
> comfortable with not having a parameter if people think it is hard to tune
> that, which I agree it would be, hence no parameter in the patch.

I think the limit has to be in terms of a percentage of the table size. For example, if we do one SELECT on a table with all non-dirty pages, it would be good to know that 5% of the pages were pruned --- that tells me that another 19 SELECTs will totally prune the table, assuming no future writes. If there are future writes, they would dirty the pages and cause even more pruning, but the 5% gives me the maximum number of pruning SELECTs. If there aren't another 19 SELECTs, do I care if the table is pruned or not? Probably not.

Measuring in page count doesn't do that, and a large table could receive millions of SELECTs before being fully cleaned.

Also, I am not sure we should be designing features at this stage in our release process.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Bruce Momjian wrote: > I think the limit has to be in terms of a percentage of the table size. > For example, if we do one SELECT on a table with all non-dirty pages, it > would be good to know that 5% of the pages were pruned --- that tells me > that another 19 SELECTs will totally prune the table, assuming no future > writes. This seems simple to implement: keep two counters, where the second one is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE, reset the first counter so that further 5 pages will get HOT pruned. 5% seems a bit high though. (In Simon's design, SOME_MAX_VALUE is essentially +infinity.) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
>
> > I think the limit has to be in terms of a percentage of the table size.
> > For example, if we do one SELECT on a table with all non-dirty pages, it
> > would be good to know that 5% of the pages were pruned --- that tells me
> > that another 19 SELECTs will totally prune the table, assuming no future
> > writes.
>
> This seems simple to implement: keep two counters, where the second one
> is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE,
> reset the first counter so that further 5 pages will get HOT pruned. 5%
> seems a bit high though. (In Simon's design, SOME_MAX_VALUE is
> essentially +infinity.)

Oh, I pulled 5% out of the air. Thinking of a SELECT-only workload, which would be our worst case, I was thinking about how many SELECTs running through HOT update chains it would take to be slower than generating the WAL to prune the page. I see the percentage as something that we could reasonably balance, while a fixed page count couldn't be analyzed in that way.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
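The break-even question above can be made concrete with back-of-envelope numbers (all numbers hypothetical, chosen only to show the shape of the trade-off, not measured costs):

```python
# Pruning a page once costs a WAL record plus a dirty-page write;
# not pruning costs every later SELECT an extra HOT-chain walk.
prune_cost_us = 100.0        # assumed one-time cost of pruning the page
chain_walk_cost_us = 2.0     # assumed extra per-SELECT cost if unpruned

# Pruning pays off once the page will be read more than this many times:
break_even_selects = prune_cost_us / chain_walk_cost_us
```

Under these made-up costs, pruning wins after about 50 reads of the page; the point of the percentage-based limit is that this kind of balance can be reasoned about, while a fixed page count cannot.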
On Mon, Apr 20, 2015 at 12:28:11PM -0700, Jeff Janes wrote:
> But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while
> the user waits, which is fundamentally VACUUM's duty to do in the background?
> If there are a handful of very hot pages, then it makes sense not to wait for
> vacuum to get to them. And that is what a block-count limit does.
>
> But if the entire table is very hot, I think that that is just another way
> of saying that autovacuum is horribly misconfigured. I think the purpose of

Well, we have to assume there are many misconfigured setups --- autovacuum isn't super-easy to configure, so we can't just blame the user if this makes things worse. In fact, page pruning was designed specifically for cases where autovacuum wasn't running or couldn't keep up.

> this patch is to fix something that can't be fixed through configuration
> alone.
>
> > If there are future writes, they would dirty the pages and
> > cause even more pruning, but the 5% gives me the maximum pruning number
> > of SELECTs. If there aren't another 19 SELECTs, do I care if the table
> > is pruned or not?
>
> The use case I see for this is when there is a mixed workload. There is one
> select which reads the entire table, and hundreds of thousands of selects/
> updates/inserts that don't, and of course vacuum comes along every now and
> then and does its thing. Why should the one massive SELECT have horrible
> performance just because it was run right before autovacuum would have
> kicked in instead of right after it finished?

I see your point, but what about the read-only workload after a big update? Do we leave large tables un-pruned for a long time? Also, consider cases where you did a big update, the autovacuum threshold was not met, so autovacuum doesn't run on that table --- again, do we keep those non-pruned rows around for millions of scans?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Mon, Apr 20, 2015 at 3:28 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job, while
> the user waits, which is fundamentally VACUUM's duty to do in the
> background? If there are a handful of very hot pages, then it makes sense
> not to wait for vacuum to get to them. And that is what a block-count limit
> does.

I think that's a fundamental mischaracterization of the problem. As soon as you define this as "vacuum's problem", then of course it makes no sense to prune in the foreground, ever. But if you define the problem as "get the best overall system performance", then it clearly DOES sometimes make sense to prune in the foreground, as benchmark results upthread demonstrate.

The fact is that on a workload like pgbench - and it doesn't have to be exactly pgbench, just any kind of workload where there are lots of changes to the table - vacuum can at any given time be pruning at most one page of the table. That is because only one vacuum process can be running in a given table at one time, and it can't be doing two things at once. But there can be many processes doing inserts, updates, or deletes on that table, as many as whatever you have max_connections set to. There can easily be dozens even on a well-configured system; on a poorly configured system, there could be hundreds. It seems obvious that if you can have dozens or hundreds of processes creating garbage and at most one process cleaning it up, there will be cases where you get further and further behind.

Now, it might well be that the right solution to that problem is to allow multiple vacuum processes in the same database, or add background workers to help with opportunistic HOT-pruning of pages so it doesn't get done in the foreground. Fine. But as of today, on a heavily-modified table, the ONLY way that we can possibly remove junk from the table as fast as we're creating junk is if the backends touching the table do some of the work.

Now, Simon is making the argument that it should be good enough to have people *modifying* the table help with the cleanup rather than imposing that load on the people who are only *reading* it, and that's not a dumb argument, but there are still cases where that strategy loses - specifically, where the table churn has stopped or paused, but autovacuum hasn't run yet. If you're going to do 1 sequential scan of the table and then go home for the day, HOT-pruning is dumb even in that case. If you're going to do 1000 sequential scans of that table in a row, HOT-pruning may very well be smart. There's no guarantee that the table has met the autovacuum threshold, but HOT-pruning it could well be a win anyway. Or it might be a loss. You can make any policy here look smart or dumb by picking a particular workload, and you don't even have to invent crazy things that will never happen in real life to do it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
>
> > I think the limit has to be in terms of a percentage of the table size.
> > For example, if we do one SELECT on a table with all non-dirty pages, it
> > would be good to know that 5% of the pages were pruned --- that tells me
> > that another 19 SELECTs will totally prune the table, assuming no future
> > writes.
>
> This seems simple to implement: keep two counters, where the second one
> is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE,
> reset the first counter so that further 5 pages will get HOT pruned. 5%
> seems a bit high though. (In Simon's design, SOME_MAX_VALUE is
> essentially +infinity.)

This would tend to dirty non-sequential heap pages --- it seems best to just clean as many as we are supposed to, then skip the rest, so we can write sequential dirty pages to storage.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Mon, Apr 20, 2015 at 09:56:20PM +0100, Simon Riggs wrote:
> On 20 April 2015 at 20:28, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> > But why should 1 SELECT or 20 SELECTs or 200 SELECTs have to do a job,
> > while the user waits, which is fundamentally VACUUM's duty to do in the
> > background?
>
> Agreed. I don't see a % as giving us anything at all.
>
> The idea is that we want to turn an O(N) problem for one query into an O(1)
> task.
>
> > The use case I see for this is when there is a mixed workload. There is
> > one select which reads the entire table, and hundreds of thousands of
> > selects/updates/inserts that don't, and of course vacuum comes along every
> > now and then and does its thing. Why should the one massive SELECT have
> > horrible performance just because it was run right before autovacuum would
> > have kicked in instead of right after it finished?
>
> +1

You can +1 all you want, but if you ignore the specific workloads I mentioned, you are not going to get much traction.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
Bruce Momjian wrote:
> On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:
> > Bruce Momjian wrote:
> >
> > This seems simple to implement: keep two counters, where the second one
> > is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE,
> > reset the first counter so that further 5 pages will get HOT pruned. 5%
> > seems a bit high though. (In Simon's design, SOME_MAX_VALUE is
> > essentially +infinity.)
>
> This would tend to dirty non-sequential heap pages --- it seems best to
> just clean as many as we are supposed to, then skip the rest, so we can
> write sequential dirty pages to storage.

Keep in mind there's a disconnect between dirtying a page and writing it to storage. A page could remain dirty for a long time in the buffer cache. This writing of sequential pages would occur at checkpoint time only, which seems the wrong thing to optimize. If some other process needs to evict pages to make room to read some other page in, surely it's going to try one page at a time, not write "many sequential dirty pages."

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 20, 2015 at 07:13:38PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:
> > > Bruce Momjian wrote:
> > >
> > > This seems simple to implement: keep two counters, where the second one
> > > is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE,
> > > reset the first counter so that further 5 pages will get HOT pruned. 5%
> > > seems a bit high though. (In Simon's design, SOME_MAX_VALUE is
> > > essentially +infinity.)
> >
> > This would tend to dirty non-sequential heap pages --- it seems best to
> > just clean as many as we are supposed to, then skip the rest, so we can
> > write sequential dirty pages to storage.
>
> Keep in mind there's a disconnect between dirtying a page and writing it
> to storage. A page could remain dirty for a long time in the buffer
> cache. This writing of sequential pages would occur at checkpoint time
> only, which seems the wrong thing to optimize. If some other process
> needs to evict pages to make room to read some other page in, surely
> it's going to try one page at a time, not write "many sequential dirty
> pages."

Yes, it might be too much optimization to try to get the checkpoint to flush all those pages sequentially, but I was thinking of our current behavior where, after an update of all rows, we effectively write out the entire table because we have dirtied every page. I guess with later prune-based writes, we aren't really writing all the pages, as the pages with prunable content are scattered somewhat randomly. I guess I was just wondering what value there is in your write-then-skip idea versus just writing the first X% of pruneable pages we find. Your idea certainly spreads out the pruning, and doesn't require knowing the size of the table, though I thought that information was easily determined.

One thing to consider is how we handle pruning of index scans that hit multiple heap pages. Do we still prune X% of the pages in the table, or X% of the heap pages we actually access via SELECT? With the write-then-skip approach, we would prune X% of the pages we access, while with the first-X% approach, we would probably prune all of them, as we would not be accessing most of the table. I don't think we can do the first-X% of pages with the percentage based on the number of pages accessed, as we have no way to know how many heap pages we will access from the index. (We would know for bitmap scans, but that complexity doesn't seem worth it.) That would argue, for consistency between sequential and index-based heap access, that your approach is best.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
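The difference between the two policies being compared above can be illustrated with a toy simulation (all numbers are hypothetical; `touched` stands in for the heap pages an index scan happens to visit, all assumed pruneable):

```python
# A 1000-page table; an index scan touches 50 scattered heap pages;
# the pruning budget is 5%.
TABLE_PAGES = 1000
BUDGET_PCT = 0.05
touched = list(range(0, TABLE_PAGES, 20))   # 50 pages hit via the index

# "First X% of the table": prune until X% of TABLE_PAGES have been pruned.
# The 50 touched pages never exhaust the 50-page budget, so all get pruned.
budget = int(TABLE_PAGES * BUDGET_PCT)
first_x = touched[:budget]

# "Write-then-skip": prune X% of the pages we actually access,
# interleaved -- 1 pruned, then 19 skipped, repeating.
cycle = round(1 / BUDGET_PCT)
write_then_skip = [p for i, p in enumerate(touched) if i % cycle == 0]
```

Here the first-X% policy prunes all 50 accessed pages (because the budget is sized against the whole table), while write-then-skip prunes only 3 of them, which is the consistency argument made above.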
On Mon, Apr 20, 2015 at 6:13 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Bruce Momjian wrote:
>> On Mon, Apr 20, 2015 at 04:19:22PM -0300, Alvaro Herrera wrote:
>> > Bruce Momjian wrote:
>> > This seems simple to implement: keep two counters, where the second one
>> > is pages we skipped cleanup in. Once that counter hits SOME_MAX_VALUE,
>> > reset the first counter so that further 5 pages will get HOT pruned. 5%
>> > seems a bit high though. (In Simon's design, SOME_MAX_VALUE is
>> > essentially +infinity.)
>>
>> This would tend to dirty non-sequential heap pages --- it seems best to
>> just clean as many as we are supposed to, then skip the rest, so we can
>> write sequential dirty pages to storage.
>
> Keep in mind there's a disconnect between dirtying a page and writing it
> to storage. A page could remain dirty for a long time in the buffer
> cache. This writing of sequential pages would occur at checkpoint time
> only, which seems the wrong thing to optimize. If some other process
> needs to evict pages to make room to read some other page in, surely
> it's going to try one page at a time, not write "many sequential dirty
> pages."

Well, for a big sequential scan, we use a ring buffer, so we will typically be evicting the pages that we ourselves read in moments before. So in this case we would do a lot of sequential writes of dirty pages.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 4/21/15 10:04 AM, Bruce Momjian wrote:
> One thing to consider is how we handle pruning of index scans that hit
> multiple heap pages. Do we still write X% of the pages in the table, or
> X% of the heap pages we actually access via SELECT? With the
> write-then-skip approach, we would do X% of the pages we access, while
> with the first-X% approach, we would probably prune all of them as we
> would not be accessing most of the table. I don't think we can do the
> first-X% of pages and have the percentage based on the number of
> pages accessed as we have no way to know how many heap pages we will
> access from the index.

This comment made me wonder... has anyone considered handing the pruning work off to a bgworker, at least for SELECTs? That means the selects themselves wouldn't be burdened by the actual prune work, only by notifying the bgworker. While that's not going to be free, presumably it's a lot cheaper...

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Tue, Apr 21, 2015 at 11:04 AM, Bruce Momjian <bruce@momjian.us> wrote:
> Yes, it might be too much optimization to try to get the checkpoint to
> flush all those pages sequentially, but I was thinking of our current
> behavior where, after an update of all rows, we effectively write out
> the entire table because we have dirtied every page. I guess with later
> prune-based writes, we aren't really writing all the pages, as the
> pages with prunable content are scattered somewhat randomly. I guess
> I was just wondering what value there is in your write-then-skip idea,
> vs just writing the first X% of pages we find? Your idea certainly
> spreads out the pruning, and doesn't require knowing the size of the
> table, though I thought that information was easily determined.
>
> One thing to consider is how we handle pruning of index scans that hit
> multiple heap pages. Do we still write X% of the pages in the table, or
> X% of the heap pages we actually access via SELECT? With the
> write-then-skip approach, we would do X% of the pages we access, while
> with the first-X% approach, we would probably prune all of them as we
> would not be accessing most of the table. I don't think we can do the
> first-X% of pages and have the percentage based on the number of
> pages accessed as we have no way to know how many heap pages we will
> access from the index. (We would know for bitmap scans, but that
> complexity doesn't seem worth it.) That would argue, for consistency
> with sequential and index-based heap access, that your approach is best.

I actually implemented something like this for setting hint bits a few years ago:

http://www.postgresql.org/message-id/AANLkTik5QzR8wTs0MqCWwmNp-qHGrdKY5Av5aOB7W4Dp@mail.gmail.com
http://www.postgresql.org/message-id/AANLkTimGKaG7wdu-x77GNV2Gh6_Qo5Ss1u5b6Q1MsPUy@mail.gmail.com

At least in later versions, the patch writes a certain number of hinted pages, then skips writing a run of pages, then writes another run of hinted pages. The basic problem is that, after the fsync queue compaction patch went in, the benefits on my tests were pretty modest. Yeah, it costs something to write out lots of dirty pages, but before the fsync queue compaction stuff, the initial scan of an unhinted table took something like 6x the normal time on the machine I tested on; after it, more like 1.5x. Blunting that spike just wasn't exciting enough.

It strikes me that it would be better to have an integrated strategy for this problem. It doesn't make sense to have one strategy for deciding whether to set hint bits and a separate strategy for deciding whether to HOT-prune. And if we decide to set hint bits and HOT-prune, it might be smart to try to mark the page all-visible, too, if it is and we're not about to update it. I believe we're losing a lot of performance on OLTP workloads by re-dirtying the same pages over and over again. We've probably all hit cases where there is an obvious loss of performance because of this sort of thing, but I'm starting to think it's hurting us in a lot of less-obvious ways.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 4/21/15 4:45 PM, Jim Nasby wrote:
> This comment made me wonder... has anyone considered handing the pruning
> work off to a bgworker, at least for SELECTs? That means the selects
> themselves wouldn't be burdened by the actual prune work, only in
> notifying the bgworker. While that's not going to be free, presumably
> it's a lot cheaper...

The nice thing about having foreground queries do the light cleanup is that they can work in parallel and naturally hit the interesting parts of the table first.

In order for a background worker to keep up with some of the workloads that have been presented as counterexamples, you'd need multiple background workers operating in parallel and preferring to work on certain parts of a table. That would require a lot more sophisticated job management than we currently have for, say, autovacuum.
On Mon, Apr 20, 2015 at 8:48 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> But if the entire table is very hot, I think that that is just another
>> way of saying that autovacuum is horribly misconfigured. I think the
>> purpose of
>
> Well, we have to assume there are many misconfigured setups ---
> autovacuum isn't super-easy to configure, so we can't just blame the
> user if this makes things worse. In fact, page pruning was designed
> specifically for cases where autovacuum wasn't running or couldn't keep
> up.

Well, autovacuum isn't currently considering HOT pruning part of its job at all. It's hard to call it "misconfigured" when there's literally *no* way to configure it "correctly". If you update less than autovacuum_vacuum_scale_factor fraction of the table and then never update another row, autovacuum will never run. Ever. Every SELECT will forevermore need to follow HOT chains on that table, until transaction wraparound eventually forces a vacuum on that table, if that ever happens.

Possibly autovacuum could be adjusted to count how many SELECTs are happening on the table and decide to vacuum it when the cost of the SELECTs following the dead tuples is balanced by the cost of doing a vacuum. But that's not something included in the design of autovacuum today.

The original design of tuple storage was aimed at optimizing the steady state where most tuples were not recently updated. It guaranteed that, except for tuples that were in the process of being updated or were recently updated, a tuple read didn't have to read the CLOG, didn't have to follow any chains, and didn't have to do any I/O or other work beyond reading the bits on the tuple itself. When a tuple is updated, it's put into a state where everyone who comes along has to do extra work, but as soon as practical the hint bits get set and that extra work stops.

We had similar discussions about setting hint bits in the past. I'm not sure why HOT pruning is the focus now, because I actually think hint bit setting is a larger source of I/O in innocent-looking SELECTs even today. And it's a major headache: people are always surprised that their SELECTs cause lots of I/O and slow down dramatically after a big update or data load has finished. It's characterized as "why is the database writing everything twice" (and saying it's actually writing everything three times doesn't make people feel better). In the new age of checksums with hint bit logging, I wonder if it's an even bigger issue.

It occurs to me that generating these dirty pages isn't really that expensive individually. It's only that there's a sudden influx of a large number of dirty pages that causes them to get translated immediately into filesystem I/O. Perhaps we should dirty pages on hint bit updates and do HOT pruning only to the extent it can be done without causing I/O. Of course it's hard to tell that in advance, but maybe something like "if the current buffer had to be fetched and caused a dirty buffer to be evicted, then skip HOT pruning and don't dirty it for any hint bit updates" would at least mean that once the SELECT fills up its share of buffers with dirty buffers, it stops dirtying more. It would dirty pages only as fast as bgwriter or checkpoints manage to write them out.

That sounds a bit weird, but I think the right solution should have that combination of properties: it should guarantee that hint bits get set and HOT chains get pruned within some length of time, but that no one SELECT causes a storm of dirty buffers that then need to be flushed to disk.

--
greg
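Greg's buffer-eviction rule can be sketched as a decision function (the function name, arguments, and exact rule are invented for illustration; in PostgreSQL the real decision would live in the buffer-access path, not a standalone function):

```python
def may_dirty_for_cleanup(buffer_was_in_cache, eviction_was_dirty):
    """Decide whether this page access may HOT-prune / set hint bits.

    buffer_was_in_cache: the page was already in shared buffers.
    eviction_was_dirty:  fetching it forced a dirty buffer to be
                         written out to make room.
    """
    if buffer_was_in_cache:
        # No read I/O was needed to get here, so dirtying the page
        # costs only a deferred write -- cleanup is allowed.
        return True
    # We had to fetch the page.  If that fetch evicted a dirty buffer,
    # this scan is already generating write I/O faster than the
    # bgwriter/checkpointer can absorb it -- stop dirtying more pages.
    return not eviction_was_dirty
```

The self-limiting property Greg describes falls out of this rule: once the scan's share of buffers is full of its own dirty pages, every further fetch evicts a dirty buffer and cleanup pauses until background writes catch up.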
On 4/21/15 4:07 PM, Peter Eisentraut wrote:
> On 4/21/15 4:45 PM, Jim Nasby wrote:
>
> In order for a background worker to keep up with some of the workloads
> that have been presented as counterexamples, you'd need multiple
> background workers operating in parallel and preferring to work on
> certain parts of a table. That would require a lot more sophisticated
> job management than we currently have for, say, autovacuum.

My thought was that the foreground queries would send page IDs to the bgworker via a shm_mq. If the queries have to do much waiting at all on I/O, then I'd expect the bgworker to be able to keep pace with a bunch of them, since it's just grabbing buffers that are already in the pool (and only those in the pool; it wouldn't make sense for it to pull a page back from the kernel, let alone disk).

We'd need to code this so that if a queue fills up the query doesn't block; we just skip that opportunity to prune. I think that'd be fine.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
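The non-blocking hand-off Jim describes could look roughly like this (a Python stand-in for a shared-memory queue; in PostgreSQL this would be a shm_mq feeding a bgworker, and all the names here are invented):

```python
from collections import deque

class PruneQueue:
    """Bounded queue of page numbers awaiting background pruning."""

    def __init__(self, maxlen):
        self.q = deque()
        self.maxlen = maxlen
        self.dropped = 0   # pruning opportunities we skipped

    def offer(self, page):
        """Foreground side: non-blocking enqueue.

        If the queue is full, drop the request -- the query must never
        block on the pruner; we just lose that pruning opportunity.
        """
        if len(self.q) >= self.maxlen:
            self.dropped += 1
            return False
        self.q.append(page)
        return True

    def drain(self):
        """Background-worker side: take everything currently queued."""
        pages, self.q = list(self.q), deque()
        return pages

pq = PruneQueue(maxlen=4)
for page in range(6):
    pq.offer(page)   # pages 4 and 5 are dropped, not waited on
```

The key design point is in `offer`: a full queue degrades gracefully to "no pruning", which is exactly the current behavior of a scan that chooses not to prune.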
On Tue, Apr 21, 2015 at 04:36:53PM -0400, Robert Haas wrote:
> > Keep in mind there's a disconnect between dirtying a page and writing it
> > to storage. A page could remain dirty for a long time in the buffer
> > cache. This writing of sequential pages would occur at checkpoint time
> > only, which seems the wrong thing to optimize. If some other process
> > needs to evict pages to make room to read some other page in, surely
> > it's going to try one page at a time, not write "many sequential dirty
> > pages."
>
> Well, for a big sequential scan, we use a ring buffer, so we will
> typically be evicting the pages that we ourselves read in moments
> before. So in this case we would do a lot of sequential writes of
> dirty pages.

Ah, yes, this again supports the prune-then-skip approach, rather than doing the first X% of pruneable pages seen.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 4/22/15 11:37 AM, Jim Nasby wrote:
> On 4/21/15 4:07 PM, Peter Eisentraut wrote:
>> On 4/21/15 4:45 PM, Jim Nasby wrote:
>> In order for a background worker to keep up with some of the workloads
>> that have been presented as counterexamples, you'd need multiple
>> background workers operating in parallel and preferring to work on
>> certain parts of a table. That would require a lot more sophisticated
>> job management than we currently have for, say, autovacuum.
>
> My thought was that the foreground queries would send page IDs to the
> bgworker via a shmq. If the queries have to do much waiting at all on IO
> then I'd expect the bgworker to be able to keep pace with a bunch of
> them since it's just grabbing buffers that are already in the pool (and
> only those in the pool; it wouldn't make sense for it to pull it back
> from the kernel, let alone disk).
>
> We'd need to code this so that if a queue fills up the query doesn't
> block; we just skip that opportunity to prune. I think that'd be fine.

I think a "to-clean-up map" would work better. But basically we need a way to remember where to clean up later if we're not going to do it in the foreground.
On Tue, Apr 21, 2015 at 05:07:52PM -0400, Peter Eisentraut wrote:
> On 4/21/15 4:45 PM, Jim Nasby wrote:
> > This comment made me wonder... has anyone considered handing the pruning
> > work off to a bgworker, at least for SELECTs? That means the selects
> > themselves wouldn't be burdened by the actual prune work, only in
> > notifying the bgworker. While that's not going to be free, presumably
> > it's a lot cheaper...
>
> The nice thing about having foreground queries do the light cleanup is
> that they can work in parallel and naturally hit the interesting parts
> of the table first.
>
> In order for a background worker to keep up with some of the workloads
> that have been presented as counterexamples, you'd need multiple
> background workers operating in parallel and preferring to work on
> certain parts of a table. That would require a lot more sophisticated
> job management than we currently have for, say, autovacuum.

Well, the visibility map tells us where _not_ to clean up, so using another map to tell us _where_ to clean up might make sense. However, the density of the map might be low enough that a list makes more sense, as you suggested.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
Greg Stark <stark@mit.edu> wrote:
> And it's a major headache, people are always being surprised that
> their selects cause lots of I/O and slow down dramatically after
> a big update or data load has finished. It's characterized as
> "why is the database writing everything twice" (and saying it's
> actually writing everything three times doesn't make people feel
> better).

When I looked at the life-cycle of a heap tuple in a database I was using, I found that (ignoring related index access and ignoring WAL-file copying, etc., for our backups), each tuple that existed long enough to freeze and be eventually deleted caused a lot of writes:

(1) WAL-log the insert.
(2) Write the tuple.
(3) Hint and rewrite the tuple.
(4) WAL-log the freeze of the tuple.
(5) Rewrite the frozen tuple.
(6) WAL-log the delete.
(7) Rewrite the deleted tuple.
(8) Prune and rewrite the page.
(9) Free line pointers and rewrite the page.

If I was lucky, some of the writes could be combined in cache because they happened close enough together. Also, one could hope that not too much of the WAL-logging involved full-page writes to the WAL --- again, keeping steps close together in time helps with that. If all of (1) through (5) are done in quick succession, you save two physical writes of the heap page and one full-page write to WAL. If steps (7) through (9) are done in quick succession, you save two more physical writes to the heap. This is part of what makes the aggressive incremental freezing being discussed on a nearby thread appealing --- at least for some workloads.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
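Kevin's tally can be restated as simple arithmetic (the step numbers refer to his list; the savings figures are his, the arithmetic below just makes them explicit):

```python
# Heap-page writes if every step reaches disk separately:
# steps (2), (3), (5), (7), (8), (9).
heap_writes_spread_out = 6

# If (1)-(5) happen in quick succession, the writes from (2), (3) and
# (5) combine into one heap write, saving two physical writes.
# If (7)-(9) happen in quick succession, (7), (8) and (9) combine too,
# saving two more.
heap_writes_combined = heap_writes_spread_out - 2 - 2
```

So batching the steps collapses six heap-page writes into two, which is the appeal of keeping the freeze and cleanup steps close together in time.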
On Wed, Apr 22, 2015 at 04:36:17PM +0100, Greg Stark wrote: > On Mon, Apr 20, 2015 at 8:48 PM, Bruce Momjian <bruce@momjian.us> wrote: > > Well, we have to assume there are many misconfigured configurations --- > > autovacuum isn't super-easy to configure, so we can't just blame the > > user if this makes things worse. In fact, page pruning was designed > > spefically for cases where autovacuum wasn't running our couldn't keep > > up. > > Well autovacuum isn't currently considering HOT pruning part of its > job at all. It's hard to call it "misconfigured" when there's > literally *no* way to configure it "correctly". Good point, but doesn't vacuum remove the need for pruning as it removes all the old rows? > If you update less than autovacuum_vacuum_scale_factor fraction of the > table and then never update another row autovacuum will never run. > Ever. Every select will forevermore need to follow hot chains on that > table. Until eventually transaction wraparound forces a vacuum on that > table if that ever happens. Yes, that is a very good point, and it matches my concerns. Of course, Simon's concern is to avoid overly-aggressive pruning where the row is being pruned but will soon be modified, making the prune, and its WAL volume, undesirable. We have to consider both cases in any final solution. > Possibly autovacuum could be adjusted to count how many selects are > happening on the table and decide to vacuum it when the cost of the > selects following the dead tuples is balanced by the cost of doing a > vacuum. But that's not something included in the design of autovacuum > today. Well, autovacuum is also going to clean indexes, which seem like overkill for pruning HOT updates. > The original design of tuple storage was aimed at optimizing the > steady state where most tuples were not recently updated. 
> It guaranteed that except for tuples that were in the process of being
> updated or were recently updated, a tuple read didn't have to read the
> CLOG, didn't have to follow any chains, didn't have to do any I/O or
> other work other than to read the bits on the tuple itself. When a
> tuple is updated it's put into a state where everyone who comes along
> has to do extra work, but as soon as practical the hint bits get set
> and that extra work stops.

Yes, Simon is right that doing everything as soon as possible is not optimal. I think the trick is knowing when we should give up waiting for something else to dirty the page and prune it ourselves.

> We had similar discussions about setting hint bits in the past. I'm
> not sure why HOT pruning is the focus now, because I actually think
> hint bit setting is a larger source of I/O in innocent-looking selects
> even today. And it's a major headache; people are always being
> surprised that their selects cause lots of I/O and slow down
> dramatically after a big update or data load has finished. It's
> characterized as "why is the database writing everything twice" (and
> saying it's actually writing everything three times doesn't make
> people feel better). In the new age of checksums with hint bit
> logging, I wonder if it's an even bigger issue.

What would be the downside of only doing pruning during SELECT hint-bit setting? Hinting is delayed by long-running transactions, but so is pruning. I assume you can do more pruning than setting all_visible hints, because the old prunable rows are older by definition, but I am unclear how much older they are.

FYI, while hint-bit setting causes page writes, it does not cause WAL writes unless you have wal_log_hints set or page-level checksums enabled. By doing pruning at the same time as hint-bit setting, you share the same page write, but you generate more WAL.
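[Editorial sketch: the trade-off in the last paragraph can be put into rough numbers. This is illustrative accounting only -- invented units, not measurements or PostgreSQL code. It encodes the two rules stated above: pruning is always WAL-logged, while a hint-only page dirtying is WAL-logged only when wal_log_hints or checksums are enabled.]

```python
# Rough accounting of the hint-vs-prune trade-off described above.
# Piggybacking pruning on hint-bit setting shares the one physical
# page write, but pruning always emits a WAL record, whereas a
# hint-only dirtying is WAL-logged only under wal_log_hints/checksums.
# Illustrative units, not measurements.

def page_costs(do_hint, do_prune, wal_log_hints=False):
    page_writes = 1 if (do_hint or do_prune) else 0  # one flush per dirty page
    wal_records = 0
    if do_prune:
        wal_records += 1       # pruning is always WAL-logged
    elif do_hint and wal_log_hints:
        wal_records += 1       # hint-only change logged only in this mode
    return page_writes, wal_records

print(page_costs(do_hint=True, do_prune=False))  # (1, 0): hint alone, no WAL
print(page_costs(do_hint=True, do_prune=True))   # (1, 1): shared write, more WAL
```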
Of course, if you are setting all-visible, then you are by definition waiting longer to prune than before, and this might be enough to make it a win for all use cases. You wouldn't end up never pruning in a read-only workload, because setting the hint bits would eventually trigger the pruning.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +
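[Editorial sketch: Greg's autovacuum_vacuum_scale_factor point earlier in the thread can be made concrete with the documented trigger formula. The defaults shown are the stock settings; this is a sketch of the rule, not the actual autovacuum code.]

```python
# The rule Greg described: a table is autovacuumed once
#     dead_tuples > autovacuum_vacuum_threshold
#                   + autovacuum_vacuum_scale_factor * reltuples
# Defaults below are the stock settings (50 tuples, 20%).

def needs_autovacuum(dead_tuples, reltuples,
                     threshold=50, scale_factor=0.2):
    return dead_tuples > threshold + scale_factor * reltuples

# Update 10% of a 1M-row table once, then never touch it again: the
# default 20% scale factor is never crossed, so autovacuum never fires
# and every SELECT keeps chasing the dead tuples until wraparound.
print(needs_autovacuum(100_000, 1_000_000))   # False
print(needs_autovacuum(250_000, 1_000_000))   # True
```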
Bruce Momjian wrote:
> On Wed, Apr 22, 2015 at 04:36:17PM +0100, Greg Stark wrote:
> > On Mon, Apr 20, 2015 at 8:48 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > > Well, we have to assume there are many misconfigured configurations ---
> > > autovacuum isn't super-easy to configure, so we can't just blame the
> > > user if this makes things worse. In fact, page pruning was designed
> > > specifically for cases where autovacuum wasn't running or couldn't
> > > keep up.
> >
> > Well autovacuum isn't currently considering HOT pruning part of its
> > job at all. It's hard to call it "misconfigured" when there's
> > literally *no* way to configure it "correctly".
>
> Good point, but doesn't vacuum remove the need for pruning as it removes
> all the old rows?

Sure. The point, I think, is to make autovacuum runs of some sort that don't actually vacuum but only do HOT-pruning. Maybe this is a reasonable solution to the problem that queries don't prune anymore after Simon's patch. If we made autovac HOT-prune periodically, we could have read-only queries prune only already-dirty pages. Of course, that would need further adjustments to the default number of autovac workers, I/O allocation, etc.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 22, 2015 at 06:07:00PM -0300, Alvaro Herrera wrote:
> > Good point, but doesn't vacuum remove the need for pruning as it removes
> > all the old rows?
>
> Sure. The point, I think, is to make autovacuum runs of some sort that
> don't actually vacuum but only do HOT-pruning. Maybe this is a
> reasonable solution to the problem that queries don't prune anymore
> after Simon's patch. If we made autovac HOT-prune periodically, we
> could have read-only queries prune only already-dirty pages. Of course,
> that would need further adjustments to default number of autovac
> workers, I/O allocation, etc.

Do we really want to make vacuum more complex for this? VACUUM does have the delay settings we would need, though.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +
On 4/22/15 1:51 PM, Kevin Grittner wrote:
> (1) WAL-log the insert.
> (2) Write the tuple.
> (3) Hint and rewrite the tuple.
> (4) WAL-log the freeze of the tuple.
> (5) Rewrite the frozen tuple.
> (6) WAL-log the delete.
> (7) Rewrite the deleted tuple.
> (8) Prune and rewrite the page.
> (9) Free line pointers and rewrite the page.
>
> If I was lucky some of the writes could be combined in cache
> because they happened close enough together. Also, one could hope
> that not too much of the WAL-logging involved full page writes to
> the WAL -- again, keeping steps close together in time helps with
> that.

This is why I like the idea of methods that tell us where we need to do cleanup... they provide us with a rough ability to track what tuples are in what part of their lifecycle. The VM helps with this a small amount, but really it only applies after 1 and 6; it doesn't help us with any other portions. Having a way to track recently created tuples would allow us to be much more efficient with 1-3, and with aggressive freezing, 1-5. A way to track recently deleted tuples would help with 6-7, and possibly 6-9 if there are no indexes.

If we doubled the size of the VM, that would let us track 4 states for each page:

- Page has newly inserted tuples
- Page has newly deleted tuples
- Page is all visible
- Page is frozen

though as discussed elsewhere, we could probably combine all-visible and frozen.

The win from doing this would be easily knowing what pages need hinting (newly inserted) and pruning (newly deleted). Unfortunately we still wouldn't know whether we could do real work without visiting the page itself, but I suspect that for many workloads just having newly inserted/deleted would be a serious win.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
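[Editorial sketch: Jim's doubled visibility map comes out to two bits per heap page, four pages to a byte. The state names follow his list; the encoding and packing are illustrative assumptions, and the four states are treated here as mutually exclusive, which a real design would need to justify.]

```python
# Two bits per heap page encode the four page states from the list
# above. The bit layout is an invented illustration, not the real
# visibility map format.

from enum import Enum

class PageState(Enum):
    NEWLY_INSERTED = 0b00   # page has tuples that still need hinting
    NEWLY_DELETED  = 0b01   # page has tuples worth pruning
    ALL_VISIBLE    = 0b10
    FROZEN         = 0b11

class TwoBitVM:
    """Two bits per heap page, packed four pages to a byte."""
    def __init__(self, npages):
        self.map = bytearray((npages * 2 + 7) // 8)

    def set(self, page, state):
        byte, shift = divmod(page * 2, 8)
        self.map[byte] = ((self.map[byte] & ~(0b11 << shift))
                          | (state.value << shift))

    def get(self, page):
        byte, shift = divmod(page * 2, 8)
        return PageState((self.map[byte] >> shift) & 0b11)

vm = TwoBitVM(1000)
vm.set(42, PageState.NEWLY_DELETED)
vm.set(43, PageState.FROZEN)
assert vm.get(42) is PageState.NEWLY_DELETED   # a pruning pass can target it
assert vm.get(43) is PageState.FROZEN          # and skip this one entirely
```

With a map like this, a background pruner could scan only the pages marked newly deleted, which is the "easily knowing what pages need hinting and pruning" win described above.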
On Wed, Apr 22, 2015 at 5:17 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Apr 22, 2015 at 06:07:00PM -0300, Alvaro Herrera wrote:
>> > Good point, but doesn't vacuum remove the need for pruning as it removes
>> > all the old rows?
>>
>> Sure. The point, I think, is to make autovacuum runs of some sort that
>> don't actually vacuum but only do HOT-pruning. Maybe this is a
>> reasonable solution to the problem that queries don't prune anymore
>> after Simon's patch. If we made autovac HOT-prune periodically, we
>> could have read-only queries prune only already-dirty pages. Of course,
>> that would need further adjustments to default number of autovac
>> workers, I/O allocation, etc.
>
> Do we really want to make vacuum more complex for this? vacuum does
> have the delay settings we would need though.

I think it's abundantly clear that, as wonderful as autovacuum is compared with what we had before autovacuum, it's not good enough. This is one area where I think improvement is definitely needed, and I've suggested it before. Discussion began here:

http://www.postgresql.org/message-id/AANLkTimd3ieGCm9pXV39ci6-owy3rX0mzz_N1tL=0ZLm@mail.gmail.com

Some of the things I suggested then seem dumb in hindsight, but I think the basic concept is still valid: if we scan the heap and find only a few dead tuples, the expense of scanning all of the indexes may not be justified. Also, the fact that a relation can currently only be vacuumed by one process at a time is coming to seem like a major limitation. Some users are partitioning tables just so that each partition can be autovac'd separately. That really shouldn't be required.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
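[Editorial sketch: the "few dead tuples" heuristic Robert describes might look like the following. The 2% cutoff and the function shape are invented for illustration; they are not PostgreSQL settings.]

```python
# Sketch of the heuristic described above: after the heap scan, do the
# expensive full index scans only when enough dead tuples accumulated
# to justify them. The 2% cutoff is an invented illustration.

def should_clean_indexes(dead_tuples, reltuples, min_fraction=0.02):
    """Skip index vacuuming when few tuples died; their dead line
    pointers are left for a later, larger cleanup pass instead."""
    return reltuples > 0 and dead_tuples / reltuples >= min_fraction

print(should_clean_indexes(1_000, 1_000_000))    # False: not worth the scans
print(should_clean_indexes(50_000, 1_000_000))   # True
```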
On 4/23/15 8:25 AM, Robert Haas wrote:
> Some users are partitioning tables just so that each
> partition can be autovac'd separately. That really shouldn't be
> required.

Are they doing this for improved heap scan performance? Index scan performance? If the table wasn't partitioned, would they need more than one pass through the indexes due to exhausting maintenance_work_mem?

There's probably some fairly low-hanging fruit for parallelizing vacuum, but it really depends on what problems people are running into.

--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Thu, Apr 23, 2015 at 10:44 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> On 4/23/15 8:25 AM, Robert Haas wrote:
>> Some users are partitioning tables just so that each
>> partition can be autovac'd separately. That really shouldn't be
>> required.
>
> Are they doing this for improved heap scan performance? Index scan
> performance? If the table wasn't partitioned, would they need more than
> one pass through the indexes due to exhausting maintenance_work_mem?

I don't know of anyone with a properly configured system who needs more than one pass through the indexes due to exhausting maintenance_work_mem. The issue is that you have to vacuum a table frequently enough to avoid accumulating bloat. The frequency with which you need to vacuum varies depending on the size of the table and how frequently it's updated. However, a large, heavily updated table can take long enough to vacuum that, by the time you get done, it's already overdue to be vacuumed again. That's a problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-04-23 15:40:36 -0400, Robert Haas wrote:
> The issue is that you have to vacuum a table frequently enough to
> avoid accumulating bloat. The frequency with which you need to vacuum
> varies depending on the size of the table and how frequently it's
> updated. However, a large, heavily-updated table can take long enough
> to vacuum that, by the time you get done, it's already overdue to be
> vacuumed again. That's a problem.

Especially because the indexes are scanned in full. In many cases I've observed the heap scans themselves being fast, but scanning hundreds (yes, hundreds) of gigabytes of indexes taking ages.

Andres
Hi Simon,

On 2015-04-20 19:21:24 +0100, Simon Riggs wrote:
> Since we have many votes in favour of change in this area I'll post a new
> version and look for an early review/commit for next release.

If I see correctly, there's been no new patch version since, right? The patch is nonetheless included in the current commitfest as needing review - possibly because it wasn't possible to mark it "returned with feedback" for a while? I've marked it as RWF for now; if I missed something, we can change it back.

Greetings,

Andres Freund