Thread: Limiting setting of hint bits by read-only queries; vacuum_delay
vacuum_delay is designed to slow down VACUUMs from writing too many blocks. However, SELECTs also dirty data blocks but are NOT slowed down by vacuum_delay.

So the current situation is that a large SELECT operates similarly to a VACUUM, throwing out many dirty blocks and using additional I/O resources, but without constraint or control.

The user issuing the SELECT experiences a noticeable slow-down, which is annoying if they weren't the ones who issued any writes to that data. The dbadmin is also annoyed because the SELECT is uncontrollable in its write behaviour, which has a knock-on effect on replication lag and so reduces high availability. The checksum patch highlights this behaviour, but it's been pretty annoying for years even without that. Yes, it is that which inspires this commentary now, but it's also been the subject of much recent discussion and patch submission, which regrettably has come to nothing.

IMHO it is time to limit the hint bit writes caused by SELECTs, or at least larger SELECTs.

Proposal is to prevent SELECTs from causing more than N buffers from being dirtied by hint bit setting and block cleanup. Smaller SELECTs still clean up, but larger queries don't get swamped by background duties which autovacuum ought to be performing. Write statements (INSERT, UPDATE, DELETE) are not affected, nor are SELECT ... FOR $LOCK queries, i.e. they will clean blocks if they can (because they need to).

query_cleanup_limit = 4 (default), range -1 ... INT_MAX
-1 means "no limit" and is equivalent to current behaviour

Once a query has reached its query_cleanup_limit it will no longer mark *heap* buffers dirty in MarkBufferDirtyHint, nor will it attempt to do optional HOT block cleanup.

Patch to implement is a few hours' work. The only complexity is deciding how to handle SQL in functions... to which I would say, as simply as possible.

Can we do this now?

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
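[For concreteness, a minimal self-contained C sketch of the budget check described above. Only the GUC name query_cleanup_limit comes from the proposal itself; the counter and helper names are assumptions for illustration, not actual PostgreSQL source.]

#include <stdbool.h>

static int query_cleanup_limit = 4;   /* proposed GUC; -1 means "no limit" */
static int query_cleanup_count = 0;   /* hypothetical; reset per query */

/*
 * Hypothetical check made wherever a read-only query would dirty a heap
 * buffer via MarkBufferDirtyHint or attempt optional HOT cleanup.  Once
 * the budget is spent, hint bits can still be set in the in-memory page
 * copy; the buffer simply stops being marked dirty.
 */
static bool
query_cleanup_allowed(void)
{
    if (query_cleanup_limit < 0)
        return true;        /* -1: current, unlimited behaviour */
    if (query_cleanup_count >= query_cleanup_limit)
        return false;       /* budget exhausted for this query */
    query_cleanup_count++;
    return true;
}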
On Sun, Mar 24, 2013 at 11:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Proposal is to prevent SELECTs from causing more than N buffers from
> being dirtied by hint bit setting and block cleanup.

I think you need to clarify what you mean by this. I'm guessing you mean that there would be a global variable somewhere (why not a field in the executor node for the sequential scan?) counting how many blocks were marked dirty by the current plan (why not the whole transaction?), and once it reaches the limit subsequent blocks would be processed normally, including setting hint bits, but wouldn't mark the block dirty? (Or do you mean the hint bits wouldn't even be set, so even if the block was dirty they wouldn't be updated?)

This means that there's no limit to the number of times an in-doubt record would have to have its xmin/xmax looked up in the clog. Every select would have to do the same work over and over again indefinitely.

The whole point of the hint bits is to guarantee that each update causes a bounded amount of work. Each update transaction currently causes one write, followed by a limited period when the transaction status needs to be checked, followed by another write with the hint bit set. And that's the total work caused. After that the record can be read without consulting the clog or anything else.

-- 
greg
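[The bounded-work property Greg describes falls out of a test like the following, made before any clog visit. This is a simplified stand-alone model, not the actual HeapTupleSatisfiesMVCC code; the infomask flag value matches the heap tuple header definitions, but the struct and function names here are invented.]

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define HEAP_XMIN_COMMITTED 0x0100      /* infomask hint bit */

typedef struct
{
    uint16_t      t_infomask;           /* flag bits, including hints */
    TransactionId t_xmin;               /* inserting transaction */
} TupleHeaderModel;                     /* simplified stand-in */

bool TransactionIdDidCommit(TransactionId xid);  /* the clog lookup */

/* After the hint is set once, every later reader takes the first branch. */
static bool
xmin_known_committed(TupleHeaderModel *tup)
{
    if (tup->t_infomask & HEAP_XMIN_COMMITTED)
        return true;                    /* no clog visit needed */
    if (TransactionIdDidCommit(tup->t_xmin))
    {
        tup->t_infomask |= HEAP_XMIN_COMMITTED;  /* the "hint" write */
        return true;
    }
    return false;
}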
Simon Riggs <simon@2ndQuadrant.com> writes:
> Can we do this now?

No.

We are trying to get 9.3 out the door, not undertake design, coding, and testing of entirely new behaviors which will have complex performance tradeoffs. This sounds like a great project for early in the 9.4 development cycle, but it's way too late to try to do it now. (At least if you'd like to ship 9.3 this year. If we're going to re-open 9.3 for this kind of development, I have some planner fixes I need to make ...)

			regards, tom lane
On 24 March 2013 13:29, Greg Stark <stark@mit.edu> wrote:
> On Sun, Mar 24, 2013 at 11:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Proposal is to prevent SELECTs from causing more than N buffers from
>> being dirtied by hint bit setting and block cleanup.

> I'm guessing you mean that there would be a global variable somewhere
> (why not a field in the executor node for the sequential scan?)
> counting how many blocks were marked dirty by the current plan (why
> not the whole transaction?) and once it reaches the limit subsequent
> blocks would be processed normally including setting hint bits but
> wouldn't mark the block dirty?

Yep, that's what I'm proposing.

> This means that there's no limit to the number of times an in-doubt
> record would have to have its xmin/xmax looked up in the clog. Every
> select would have to do the same work over and over again
> indefinitely.

> The whole point of the hint bits is to guarantee that each update
> causes a bounded amount of work. Each update transaction currently
> causes one write, followed by a limited period when the transaction
> status needs to be checked, followed by another write with the hint
> bit set. And that's the total work caused. After that the record can
> be read without consulting the clog or anything else.

...and as a result, the rest of your comments don't apply at all to the proposal. Sorry about that confusion.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 24 March 2013 13:38, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> Can we do this now?
>
> No.
>
> We are trying to get 9.3 out the door, not undertake design, coding,
> and testing of entirely new behaviors which will have complex
> performance tradeoffs. This sounds like a great project for early in
> the 9.4 development cycle, but it's way too late to try to do it now.
> (At least if you'd like to ship 9.3 this year. If we're going to
> re-open 9.3 for this kind of development, I have some planner fixes
> I need to make ...)

Yes, there are tradeoffs, and people would like to control them rather than being stuck with one and only one behaviour that we know has major downsides in certain use cases. I'm proposing a parameter to control this, with the existing behaviour still obtainable. This is not a new automatic behaviour.

If you have a new optimiser behaviour that can be controlled by parameter and takes a few hours to implement, please do it.

I'm greatly concerned that nobody seems to care about replication delay; this is the second time I've raised major issues that affect that and we have no resolution.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Mar 24, 2013 at 9:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> ...and as a result, the rest of your comments don't apply at all to
> the proposal. Sorry about that confusion.

How do you figure that?

-- 
greg
On 3/24/13 7:14 AM, Simon Riggs wrote:
> Patch to implement is a few hours' work. The only complexity is
> deciding how to handle SQL in functions... to which I would say, as
> simply as possible.

Like the page replacement ideas, the throttle on how fast something like this will get done depends not on development time, but on putting together more performance regression tests.

The idea I was thinking about is refactoring the background writer's role in hint bit maintenance. If backends could push "this looks dirty" page numbers toward the BGW and keep going, it might stream those out to disk under its control.

I also have a larger proposal for how to refactor I/O so that both queries and CREATE INDEX have similar cost controls to the ones used to limit vacuum. And two more based on collecting extra data for high cost queries before they run.

But right now I keep biting my tongue about the design on all these, and return to working on one of the patches already in the CF queue instead.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com
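[A rough sketch of the hand-off Greg describes, under stated assumptions: one producer, one consumer, locking omitted, and every name hypothetical. The essential property is that a backend never blocks -- if the queue is full, the entry is simply dropped, since a lost hint-write request is harmless.]

#include <stdbool.h>
#include <stdint.h>

#define HINT_QUEUE_SIZE 1024            /* hypothetical fixed capacity */

typedef struct
{
    uint32_t blkno[HINT_QUEUE_SIZE];    /* "this looks dirty" block numbers */
    unsigned head;                      /* consumed by the bgwriter */
    unsigned tail;                      /* filled by backends */
} HintQueue;

/* Backend side; a real shared-memory version needs a spinlock or atomics. */
static bool
hint_queue_push(HintQueue *q, uint32_t blkno)
{
    if (q->tail - q->head >= HINT_QUEUE_SIZE)
        return false;                   /* full: drop, some later visitor retries */
    q->blkno[q->tail % HINT_QUEUE_SIZE] = blkno;
    q->tail++;
    return true;
}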
On 25 March 2013 02:50, Greg Smith <greg@2ndquadrant.com> wrote:
> On 3/24/13 7:14 AM, Simon Riggs wrote:
>>
>> Patch to implement is a few hours' work. The only complexity is
>> deciding how to handle SQL in functions... to which I would say, as
>> simply as possible.
>
> Like the page replacement ideas, the throttle on how fast something like
> this will get done depends not on development time, but on putting together
> more performance regression tests.

I don't see any role for regression tests there, nor even performance tests. This isn't a vague discussion about page replacement. There's absolutely zero point spending time on how background tasks might work while the foreground tasks continue to do the work right in front of our eyes. Until we agree how to limit what foreground tasks do, we'll never move forwards.

> The idea I was thinking about is
> refactoring the background writer's role in hint bit maintenance. If
> backends could push "this looks dirty" page numbers toward the BGW and keep
> going, it might stream those out to disk under its control.

> I also have a larger proposal for how to refactor I/O so that both queries
> and CREATE INDEX have similar cost controls to the ones used to limit
> vacuum. And two more based on collecting extra data for high cost queries
> before they run.
>
> But right now I keep biting my tongue about the design on all these, and
> return to working on one of the patches already in the CF queue instead.

This is related to one of the patches in this CF, hence why I bring it up now.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Mar 25, 2013 at 2:50 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> The idea I was thinking about is refactoring the background writer's role in
> hint bit maintenance

A good first step might be to separate the "dirty" bit into two bits: "mandatory dirty" and "optional dirty". (Or maybe "hard dirty" and "soft dirty".) Hint bit updates could set the latter, and then some later policy decision could be made about whether to bother writing out the buffer if it's only optionally dirty.

My personal suspicion is that the right policy will look something like writing out 1/nth of the hint-bit-dirtied pages, where n is something like MaxTuplesPerPage or perhaps the actual number of tuples per page. And perhaps that should only happen if there are any mandatory-dirty pages waiting to be written, or something like that.

-- 
greg
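[A sketch of what that policy might look like. BM_DIRTY is a real buffer flag name, but its value here, BM_SOFT_DIRTY, and the policy function are all assumptions made for illustration.]

#include <stdbool.h>

#define BM_DIRTY       (1 << 0)   /* mandatory: a real data change */
#define BM_SOFT_DIRTY  (1 << 1)   /* hypothetical: hint bit changes only */

/*
 * Write every mandatory-dirty page, but only roughly 1/nth of the pages
 * that are dirty solely from hint bits, with n around the number of
 * tuples per page, as suggested above.
 */
static bool
bgwriter_should_write(int flags, unsigned scan_seq, unsigned tuples_per_page)
{
    if (flags & BM_DIRTY)
        return true;                              /* never skip real changes */
    if (flags & BM_SOFT_DIRTY)
        return scan_seq % tuples_per_page == 0;   /* sampled fraction */
    return false;                                 /* clean buffer */
}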
On Sun, Mar 24, 2013 at 6:14 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> vacuum_delay is designed to slow down VACUUMs from writing too many
> blocks. However, SELECTs also dirty data blocks but are NOT slowed
> down by vacuum_delay.
>
> So the current situation is that a large SELECT operates similarly to
> a VACUUM, throwing out many dirty blocks and using additional I/O
> resources, but without constraint or control.
>
> The user issuing the SELECT experiences a noticeable slow-down, which
> is annoying if they weren't the ones who issued any writes to that
> data. The dbadmin is also annoyed because the SELECT is uncontrollable
> in its write behaviour, which has a knock-on effect on replication lag
> and so reduces high availability. The checksum patch highlights this
> behaviour, but it's been pretty annoying for years even without that.
> Yes, it is that which inspires this commentary now, but it's also been
> the subject of much recent discussion and patch submission, which
> regrettably has come to nothing.
>
> IMHO it is time to limit the hint bit writes caused by SELECTs, or at
> least larger SELECTs.
>
> Proposal is to prevent SELECTs from causing more than N buffers from
> being dirtied by hint bit setting and block cleanup. Smaller SELECTs
> still clean up, but larger queries don't get swamped by background
> duties which autovacuum ought to be performing. Write statements
> (INSERT, UPDATE, DELETE) are not affected, nor are SELECT ... FOR
> $LOCK queries, i.e. they will clean blocks if they can (because they
> need to).
>
> query_cleanup_limit = 4 (default), range -1 ... INT_MAX
> -1 means "no limit" and is equivalent to current behaviour
>
> Once a query has reached its query_cleanup_limit it will no longer
> mark *heap* buffers dirty in MarkBufferDirtyHint, nor will it attempt
> to do optional HOT block cleanup.
>
> Patch to implement is a few hours' work. The only complexity is
> deciding how to handle SQL in functions... to which I would say, as
> simply as possible.

This is pretty similar to the proposal Atri and I just recently made. I am 100% in agreement that something must be done here... SELECT has none of the i/o mitigation features that vacuum has. Is your idea better? probably (although you have to give a small penalty for a user facing tunable), but we need testing against real world workloads, or at least a much better synthetic one than pgbench, which per recent discussions is probably the top objective of the project (a performance farm, etc.).

merlin
> This is pretty similar to the proposal Atri and I just recently made.
> I am 100% in agreement that something must be done here... SELECT has
> none of the i/o mitigation features that vacuum has. Is your idea
> better? probably (although you have to give a small penalty for a user
> facing tunable), but we need testing against real world workloads, or
> at least a much better synthetic one than pgbench, which per recent
> discussions is probably the top objective of the project (a
> performance farm, etc.).

I have been working on some tests for improving the performance in the case of bulk INSERTs for our patch. So far, I think it has some relation to the visibility map optimization, which our patch seems to be affecting.

Some more testing is pending; it has been delayed because I have been tied up in other projects. Now that those are complete, I will resume testing in the next week or so.

Regards,
Atri

--
Regards,

Atri
l'apprenant
On 25 March 2013 14:26, Merlin Moncure <mmoncure@gmail.com> wrote:
> This is pretty similar to the proposal Atri and I just recently made.
> I am 100% in agreement that something must be done here... SELECT has
> none of the i/o mitigation features that vacuum has. Is your idea
> better? probably (although you have to give a small penalty for a user
> facing tunable)

I was hoping this was a new idea entirely, since I was focusing on simply limiting foreground work rather than trying to work out how to optimise foreground work or work out how to make background tasks work better.

> but we need testing against real world workloads, or
> at least a much better synthetic one than pgbench, which per recent
> discussions is probably the top objective of the project (a
> performance farm, etc.).

Self-tuning the background workloads needs lots of testing. Limiting foreground work needs very little, or none.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> wrote:
> Merlin Moncure <mmoncure@gmail.com> wrote:
>> we need testing against real world workloads, or at least a much
>> better synthetic one than pgbench, which per recent discussions
>> is probably the top objective of the project (a performance
>> farm, etc.).
>
> Self-tuning the background workloads needs lots of testing.
> Limiting foreground work needs very little, or none.

This is absolutely a real-world problem, but I disagree that the solution you propose is risk-free. It would be trivial to construct a test which would show massive performance degradation. Consider a single largish table which fits in cache and is subject to frequent seqscans, and a workload which burns through transaction IDs fast enough to cause clog thrashing as these xids get old and still lack hint bits.

I think there are ways to deal with that problem, with the foreground select telling the bgwriter or autovacuum to pay attention to the problem tables (or pages), but now is not the time to start designing that.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Mar 25, 2013 at 1:05 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 25 March 2013 14:26, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> This is pretty similar to the proposal Atri and I just recently made.
>> I am 100% in agreement that something must be done here... SELECT has
>> none of the i/o mitigation features that vacuum has. Is your idea
>> better? probably (although you have to give a small penalty for a user
>> facing tunable)
>
> I was hoping this was a new idea entirely, since I was focusing on
> simply limiting foreground work rather than trying to work out how to
> optimise foreground work or work out how to make background tasks work
> better.

They are very similar, in that based on $simple_condition hint bits do not get written out during a scan. Also, the effect in both cases is to push more work into vacuum. Our $simple_condition was a little different and maybe less good than yours, but that should be proven.

A good starting point would be to run the battery of performance tests that Amit and Hari ran against what Atri proposed. After seeing the results, I hedged on pushing the patch further -- it wasn't clear that the results were win-win, and I think your patch idea mostly has the same pros/cons (see: http://postgresql.1045698.n5.nabble.com/WIP-patch-for-hint-bit-i-o-mitigation-td5730963i20.html).

merlin
On 25 March 2013 20:44, Kevin Grittner <kgrittn@ymail.com> wrote:
> Simon Riggs <simon@2ndQuadrant.com> wrote:
>> Merlin Moncure <mmoncure@gmail.com> wrote:
>
>>> we need testing against real world workloads, or at least a much
>>> better synthetic one than pgbench, which per recent discussions
>>> is probably the top objective of the project (a performance
>>> farm, etc.).
>>
>> Self-tuning the background workloads needs lots of testing.
>> Limiting foreground work needs very little, or none.
>
> This is absolutely a real-world problem, but I disagree that the
> solution you propose is risk-free. It would be trivial to
> construct a test which would show massive performance degradation.

It is trivial to show massive performance degradation through the *lack* of this feature, as you know. Since so many people have experienced the pain, doing nothing because it isn't auto-tuned is not sensible. Preventing manual control of problems just doesn't make sense.

> Consider a single largish table which fits in cache and is subject
> to frequent seqscans, and a workload which burns through
> transaction IDs fast enough to cause clog thrashing as these xids
> get old and still lack hint bits.

That wouldn't happen. I suggested setting hint bits but not dirtying the data blocks.

> I think there are ways to deal with that problem, with the
> foreground select telling the bgwriter or autovacuum to pay
> attention to the problem tables (or pages), but now is not the time
> to start designing that.

We do already tell autovacuum to deal with the problem, but it wakes up too late to do anything useful. We've been waiting for a solution along those lines for most of a decade (late 2004). My guess is we'll spend a whole chunk of time and still implement something that doesn't work, just as we did with bgwriter in 8.0.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> wrote:
> On 25 March 2013 20:44, Kevin Grittner <kgrittn@ymail.com> wrote:
>> This is absolutely a real-world problem, but I disagree that the
>> solution you propose is risk-free. It would be trivial to
>> construct a test which would show massive performance degradation.
>
> It is trivial to show massive performance degradation through the
> *lack* of this feature, as you know.

True, but I think we have to be more careful about causing performance *regressions* in a new release than perpetuating *existing* performance problems.

> Since so many people have experienced the pain, doing nothing because
> it isn't auto-tuned is not sensible. Preventing manual control of
> problems just doesn't make sense.

It's a matter of not destabilizing things with brand-new ideas this late in the release cycle.

>> Consider a single largish table which fits in cache and is subject
>> to frequent seqscans, and a workload which burns through
>> transaction IDs fast enough to cause clog thrashing as these xids
>> get old and still lack hint bits.
>
> That wouldn't happen. I suggested setting hint bits but not dirtying
> the data blocks.

[ thinks about the implications of that a bit more ]

That would make it harder to construct a degenerate case; but it's hard to be confident that real-world applications couldn't hit some significant performance regressions. If we had a performance testing farm, this would be a lot less scary.

>> I think there are ways to deal with that problem, with the
>> foreground select telling the bgwriter or autovacuum to pay
>> attention to the problem tables (or pages), but now is not the time
>> to start designing that.
>
> We do already tell autovacuum to deal with the problem, but it wakes
> up too late to do anything useful.

I was thinking about some more direct communication to nudge those into action under certain circumstances. I think we could come up with heuristics to ensure that the worst-case regression was small enough not to be noticed.

> We've been waiting for a solution along those lines for most of a
> decade (late 2004). My guess is we'll spend a whole chunk of time and
> still implement something that doesn't work, just as we did with
> bgwriter in 8.0

So submit this for the first CF in the next release. If we start allowing new features to be designed at this point in the 9.3 cycle, 9.3 won't be released any sooner than 9.4 otherwise would be. Certainly you must have some features you would like to see released before this summer is out. If we open the door to new ideas for 9.3, there will be an avalanche of "this is important, too" patches, and you just won't see the release until *much* later.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Mar 25, 2013 at 9:53 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> That would make it harder to construct a degenerate case

I don't think it's hard at all. It's the same as the case Simon wants to solve, except that the cost is incurred in a different way. Imagine a system where there's a huge data load to a table which is then read-only for an OLTP system. Until vacuum comes along -- and it may never, since the table never sees deletes or updates -- every transaction needs to do a clog lookup for every tuple it sees. That means a significant cpu slowdown for every row lookup forever more, to save a one-time i/o cost.

-- 
greg
On Mon, Mar 25, 2013 at 5:57 PM, Greg Stark <stark@mit.edu> wrote:
> On Mon, Mar 25, 2013 at 9:53 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> That would make it harder to construct a degenerate case
>
> I don't think it's hard at all. It's the same as the case Simon wants
> to solve, except that the cost is incurred in a different way. Imagine
> a system where there's a huge data load to a table which is then
> read-only for an OLTP system. Until vacuum comes along -- and it may
> never, since the table never sees deletes or updates -- every
> transaction needs to do a clog lookup for every tuple it sees. That
> means a significant cpu slowdown for every row lookup forever more,
> to save a one-time i/o cost.

That is simply not the case, unless every tuple was created by a unique (or at least non-sequential) transaction xid. There is a 'last transaction id' guard over clog lookup which is pretty effective. The real cost is actually i/o from writing hint bits.

merlin
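[The guard Merlin refers to is the one-entry cache in transam.c's TransactionLogFetch, roughly modeled below; this is a simplified stand-in, not the actual source.]

#include <stdint.h>

typedef uint32_t TransactionId;
typedef int      XidStatus;

static TransactionId cachedFetchXid = 0;    /* InvalidTransactionId */
static XidStatus     cachedFetchXidStatus;

XidStatus clog_read(TransactionId xid);     /* stand-in for the SLRU fetch */

/*
 * Sequentially loaded tuples share one xid, so after the first miss every
 * subsequent lookup hits the cache and never touches the clog pages.
 */
static XidStatus
transaction_log_fetch_model(TransactionId xid)
{
    if (xid != cachedFetchXid)
    {
        cachedFetchXidStatus = clog_read(xid);
        cachedFetchXid = xid;
    }
    return cachedFetchXidStatus;
}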
Greg Stark <stark@mit.edu> writes:
> On Mon, Mar 25, 2013 at 9:53 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> That would make it harder to construct a degenerate case

> I don't think it's hard at all. It's the same as the case Simon wants
> to solve, except that the cost is incurred in a different way. Imagine
> a system where there's a huge data load to a table which is then
> read-only for an OLTP system. Until vacuum comes along -- and it may
> never, since the table never sees deletes or updates -- every
> transaction needs to do a clog lookup for every tuple it sees. That
> means a significant cpu slowdown for every row lookup forever more,
> to save a one-time i/o cost.

If I read Simon's original mail correctly, he was proposing that individual SELECTs would update some hint bits but eventually "get tired" and stop doing that. So repeated queries would eventually get all the hint bits set in a scenario such as you describe. It's less clear what would happen in a scenario where there's a steady flow of new updates.

This is clearly worth thinking about and trying to find better solutions for. I'm only against trying to solve it in the 9.3 timeframe. It will take a lot longer than that to get something that works tolerably well.

			regards, tom lane
On 25 March 2013 23:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> On Mon, Mar 25, 2013 at 9:53 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>>> That would make it harder to construct a degenerate case
>
>> I don't think it's hard at all. It's the same as the case Simon wants
>> to solve, except that the cost is incurred in a different way. Imagine
>> a system where there's a huge data load to a table which is then
>> read-only for an OLTP system. Until vacuum comes along -- and it may
>> never, since the table never sees deletes or updates -- every
>> transaction needs to do a clog lookup for every tuple it sees. That
>> means a significant cpu slowdown for every row lookup forever more,
>> to save a one-time i/o cost.
>
> If I read Simon's original mail correctly, he was proposing that
> individual SELECTs would update some hint bits but eventually "get
> tired" and stop doing that.

Yes, thank you for understanding that.

> So repeated queries would eventually get
> all the hint bits set in a scenario such as you describe. It's less
> clear what would happen in a scenario where there's a steady flow of
> new updates.

With a steady flow of updates we rely for cleanup upon i) the updates themselves and/or small selects, and ii) VACUUM.

The idea is to avoid penalising users wishing to run larger queries, who are effectively forced to clean up everybody else's mess.

It is important to understand that the steady state is exactly the same between the patched/unpatched cases when there are no larger queries. The difference is that the steady state is maintained with the patch, rather than forcing the large query to perform long and slow *early* cleanup of the database - earlier than would have occurred naturally in the steady state.

So we cannot argue that there is "more I/O", because actually there will be exactly the same I/O required for steady-state updates; it's just that less of that I/O happens earlier, during the large user query. Think about the I/O required with/without the large query... it seems likely that early cleanup could itself result in higher overall I/O.

> This is clearly worth thinking about and trying to find better solutions
> for. I'm only against trying to solve it in the 9.3 timeframe. It will
> take a lot longer than that to get something that works tolerably well.

I'll bet you all a beer at PgCon 2014 that this remains unresolved at that point.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes:
> On 25 March 2013 23:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This is clearly worth thinking about and trying to find better solutions
>> for. I'm only against trying to solve it in the 9.3 timeframe. It will
>> take a lot longer than that to get something that works tolerably well.

> I'll bet you all a beer at PgCon 2014 that this remains unresolved at
> that point.

Well, if so, then either (a) it wasn't actually important, so nobody bothered to work on it, or (b) it's a lot more difficult than you think. This is hardly an argument for pushing a one-tenth-baked fix into 9.3.

			regards, tom lane
On Tue, Mar 26, 2013 at 12:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I'll bet you all a beer at PgCon 2014 that this remains unresolved at
> that point.

Are you saying you're only interested in working on it now? That after 9.3 is released you won't be interested in working on it any more?

As you said, we've been eyeing this particular logic since 2004, so why did it suddenly become more urgent now? Why didn't you work on it 9 months ago at the beginning of the release cycle?

-- 
greg
On Mon, 2013-03-25 at 12:21 +0000, Greg Stark wrote:
> On Mon, Mar 25, 2013 at 2:50 AM, Greg Smith <greg@2ndquadrant.com> wrote:
>> The idea I was thinking about is refactoring the background writer's role in
>> hint bit maintenance
>
> A good first step might be to separate the "dirty" bit into two bits:
> "mandatory dirty" and "optional dirty". (Or maybe "hard dirty" and
> "soft dirty".) Hint bit updates could set the latter, and then some
> later policy decision could be made about whether to bother writing
> out the buffer if it's only optionally dirty.

I like this idea, but I think there's a problem. When checksums are enabled, we may need to write WAL at the time the page is dirtied. It would seem a waste if a full-page image was written but the bgwriter never wrote the page out because it was only a "soft dirty".

One solution is to make it work as you say, unless checksums are enabled and it needs to write WAL (which is only for the first modification after a checkpoint). Unfortunately, it basically means that users of checksums will still see a big impact from doing a SELECT on a large, freshly loaded table. So that doesn't really answer Simon's use case.

Maybe it can still be made to work. Let's say that the bgwriter can't write soft-dirty pages out; it can only ignore them or drop them on the floor. But VACUUM would promote pages from soft-dirty to dirty at a controlled rate, and would write out WAL if necessary for checksums. That should still keep some distance between the WAL writing and the WAL flushing, but still offer better control than writing it at SELECT time. Also, it would solve a complaint I have about Simon's proposal: that we lose the information that we have a changed buffer in shared memory. Maybe that's not a practical problem, but it seems like we should keep track of that and have a way to make sure a page gets cleaned if we want to.

The downside is that it doesn't allow anyone to get the current behavior, unless we provide an additional GUC to say whether SELECT causes a page to be marked soft-dirty or dirty. At least that would be a boolean though, and the finer-grained control can be over VACUUM.

Regards,
	Jeff Davis
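[A sketch of the promotion path Jeff describes, with every name hypothetical; the WAL write stands in for the first-modification-after-checkpoint full-page image that checksums require.]

#include <stdbool.h>

typedef struct PageModel PageModel;             /* opaque stand-in */

bool page_needs_first_mod_wal(PageModel *page); /* hypothetical */
void log_full_page_image(PageModel *page);      /* hypothetical */
void mark_hard_dirty(PageModel *page);          /* hypothetical */

/*
 * VACUUM promotes soft-dirty pages to (hard) dirty at a throttled rate,
 * so the checksum WAL cost is paid under vacuum's cost controls rather
 * than inside a foreground SELECT.
 */
static void
promote_soft_dirty(PageModel *page, int *budget, bool checksums_enabled)
{
    if (*budget <= 0)
        return;                                 /* spread the work out */
    if (checksums_enabled && page_needs_first_mod_wal(page))
        log_full_page_image(page);              /* WAL before dirtying */
    mark_hard_dirty(page);
    (*budget)--;
}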
On 26 March 2013 01:35, Greg Stark <stark@mit.edu> wrote:
> On Tue, Mar 26, 2013 at 12:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> I'll bet you all a beer at PgCon 2014 that this remains unresolved at
>> that point.
>
> Are you saying you're only interested in working on it now? That after
> 9.3 is released you won't be interested in working on it any more?
>
> As you said, we've been eyeing this particular logic since 2004, so why
> did it suddenly become more urgent now? Why didn't you work on it 9
> months ago at the beginning of the release cycle?

I'm not sure why your comments are so confrontational here, but I don't think it helps much. I'm happy to buy you a beer too.

As I explained clearly in my first post, this idea came about trying to improve on the negative aspects of the checksum patch. People were working on ideas 9 months ago to resolve this, but they have come to nothing. I regret that; Merlin and others have worked hard to find a way: respect to them.

My suggestion is to implement a feature that takes 1 day to write and needs little testing to show it works. I'm happy to pursue that path now, or later.

Deciding we need an all-singing, all-dancing solution that will take our best men (another) 6 months of hard arguing and implementation is by far the best way I know of killing anything, and I won't be pursuing that route. If we did have 6 months' funding for any-feature-you-like, I wouldn't spend it all on this. My bet that nobody else will have enough patience, time and skill, let alone unpaid leave, to follow that path, stands.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Mar 26, 2013 at 5:27 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 26 March 2013 01:35, Greg Stark <stark@mit.edu> wrote:
>> On Tue, Mar 26, 2013 at 12:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> I'll bet you all a beer at PgCon 2014 that this remains unresolved at
>>> that point.
>>
>> Are you saying you're only interested in working on it now? That after
>> 9.3 is released you won't be interested in working on it any more?
>>
>> As you said, we've been eyeing this particular logic since 2004, so why
>> did it suddenly become more urgent now? Why didn't you work on it 9
>> months ago at the beginning of the release cycle?
>
> I'm not sure why your comments are so confrontational here, but I
> don't think it helps much. I'm happy to buy you a beer too.
>
> As I explained clearly in my first post, this idea came about trying
> to improve on the negative aspects of the checksum patch. People were
> working on ideas 9 months ago to resolve this, but they have come to
> nothing. I regret that; Merlin and others have worked hard to find a
> way: respect to them.
>
> My suggestion is to implement a feature that takes 1 day to write and
> needs little testing to show it works.

Any patch in this area isn't likely to take much testing to establish whether it improves some particular case. The problem is what happens to all of the other cases - and I don't believe that part needs little testing, hence the objections (with which I agree) to doing anything about this now.

If we want to change something in this area, we might consider resurrecting the patch I worked on for this last year, which had, I believe, a fairly similar mechanism of operation to what you're proposing, and some other nice properties as well:

http://www.postgresql.org/message-id/AANLkTik5QzR8wTs0MqCWwmNp-qHGrdKY5Av5aOB7W4Dp@mail.gmail.com
http://www.postgresql.org/message-id/AANLkTimGKaG7wdu-x77GNV2Gh6_Qo5Ss1u5b6Q1MsPUy@mail.gmail.com

...but I think the main reason why that never went anywhere is because we never really had any confidence that the upsides were worth the downsides. Fundamentally, postponing hint bit setting (or hint bit I/O) increases the total amount of work done by the system. You still end up writing the hint bits eventually, and in the meantime you do more CLOG lookups. Now, as a compensating benefit, you can spread the work of writing the hint-bit-updated pages out over a longer period of time, so that no single query carries too much of the burden of getting the bits set. The worst-case-latency vs. aggregate-throughput tradeoff is one with a long history, and I think it's appropriate to view this problem through that lens also.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 26 March 2013 11:33, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Mar 26, 2013 at 5:27 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 26 March 2013 01:35, Greg Stark <stark@mit.edu> wrote:
>>> On Tue, Mar 26, 2013 at 12:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>> I'll bet you all a beer at PgCon 2014 that this remains unresolved at
>>>> that point.
>>>
>>> Are you saying you're only interested in working on it now? That after
>>> 9.3 is released you won't be interested in working on it any more?
>>>
>>> As you said, we've been eyeing this particular logic since 2004, so why
>>> did it suddenly become more urgent now? Why didn't you work on it 9
>>> months ago at the beginning of the release cycle?
>>
>> I'm not sure why your comments are so confrontational here, but I
>> don't think it helps much. I'm happy to buy you a beer too.
>>
>> As I explained clearly in my first post, this idea came about trying
>> to improve on the negative aspects of the checksum patch. People were
>> working on ideas 9 months ago to resolve this, but they have come to
>> nothing. I regret that; Merlin and others have worked hard to find a
>> way: respect to them.
>>
>> My suggestion is to implement a feature that takes 1 day to write and
>> needs little testing to show it works.
>
> Any patch in this area isn't likely to take much testing to establish
> whether it improves some particular case. The problem is what happens
> to all of the other cases - and I don't believe that part needs little
> testing, hence the objections (with which I agree) to doing anything
> about this now.
>
> If we want to change something in this area, we might consider
> resurrecting the patch I worked on for this last year, which had, I
> believe, a fairly similar mechanism of operation to what you're
> proposing, and some other nice properties as well:
>
> http://www.postgresql.org/message-id/AANLkTik5QzR8wTs0MqCWwmNp-qHGrdKY5Av5aOB7W4Dp@mail.gmail.com
> http://www.postgresql.org/message-id/AANLkTimGKaG7wdu-x77GNV2Gh6_Qo5Ss1u5b6Q1MsPUy@mail.gmail.com
>
> ...but I think the main reason why that never went anywhere is because
> we never really had any confidence that the upsides were worth the
> downsides. Fundamentally, postponing hint bit setting (or hint bit
> I/O) increases the total amount of work done by the system. You still
> end up writing the hint bits eventually, and in the meantime you do
> more CLOG lookups. Now, as a compensating benefit, you can spread the
> work of writing the hint-bit-updated pages out over a longer period of
> time, so that no single query carries too much of the burden of
> getting the bits set. The worst-case-latency vs. aggregate-throughput
> tradeoff is one with a long history, and I think it's appropriate to
> view this problem through that lens also.

I hadn't realised so many patches existed that were similar. Hackers is bigger these days.

Reviewing the patch, I'd say the problem is that it is basically implementing a new automatic heuristic. We simply don't have any evidence that any new heuristic will work for all cases, so we do nothing.

Whether we apply my patch, yours or Merlin's, my main thought now is that we need a user parameter to control it, so it can be adjusted according to need and not touched at all if there is no problem.

My washing machine has a wonderful feature, "15 min wash", and it works great for the times I know I need it; but in general, the auto wash mode works fine, since often you don't care that it takes 90 minutes. It's much easier to see that the additional user option is beneficial, but much harder to start arguing that the default wash cycle should be 85 or 92 minutes. It'd be great if the washing machine could work out that I need my clothes quickly and that on-this-day-only I don't care about the thoroughness of the wash, but it can't. I don't think the washing machine engineers are idiots for not being able to work that out, but if they only offered a single option because they thought they knew better than me, I'd be less than impressed.

In the same way, we need some way to say "big queries shouldn't do cleanup", even if autovacuum ends up doing more I/O over time (though in fact I doubt this is the case; detailed argument in another post).

So please, let's go with a simple solution now that allows users to say what they want.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Mar 26, 2013 at 7:30 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 26 March 2013 11:33, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Mar 26, 2013 at 5:27 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 26 March 2013 01:35, Greg Stark <stark@mit.edu> wrote:
>>>> On Tue, Mar 26, 2013 at 12:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>>> I'll bet you all a beer at PgCon 2014 that this remains unresolved at
>>>>> that point.
>>>>
>>>> Are you saying you're only interested in working on it now? That after
>>>> 9.3 is released you won't be interested in working on it any more?
>>>>
>>>> As you said, we've been eyeing this particular logic since 2004, so why
>>>> did it suddenly become more urgent now? Why didn't you work on it 9
>>>> months ago at the beginning of the release cycle?
>>>
>>> I'm not sure why your comments are so confrontational here, but I
>>> don't think it helps much. I'm happy to buy you a beer too.
>>>
>>> As I explained clearly in my first post, this idea came about trying
>>> to improve on the negative aspects of the checksum patch. People were
>>> working on ideas 9 months ago to resolve this, but they have come to
>>> nothing. I regret that; Merlin and others have worked hard to find a
>>> way: respect to them.
>>>
>>> My suggestion is to implement a feature that takes 1 day to write and
>>> needs little testing to show it works.
>>
>> Any patch in this area isn't likely to take much testing to establish
>> whether it improves some particular case. The problem is what happens
>> to all of the other cases - and I don't believe that part needs little
>> testing, hence the objections (with which I agree) to doing anything
>> about this now.
>>
>> If we want to change something in this area, we might consider
>> resurrecting the patch I worked on for this last year, which had, I
>> believe, a fairly similar mechanism of operation to what you're
>> proposing, and some other nice properties as well:
>>
>> http://www.postgresql.org/message-id/AANLkTik5QzR8wTs0MqCWwmNp-qHGrdKY5Av5aOB7W4Dp@mail.gmail.com
>> http://www.postgresql.org/message-id/AANLkTimGKaG7wdu-x77GNV2Gh6_Qo5Ss1u5b6Q1MsPUy@mail.gmail.com
>>
>> ...but I think the main reason why that never went anywhere is because
>> we never really had any confidence that the upsides were worth the
>> downsides. Fundamentally, postponing hint bit setting (or hint bit
>> I/O) increases the total amount of work done by the system. You still
>> end up writing the hint bits eventually, and in the meantime you do
>> more CLOG lookups. Now, as a compensating benefit, you can spread the
>> work of writing the hint-bit-updated pages out over a longer period of
>> time, so that no single query carries too much of the burden of
>> getting the bits set. The worst-case-latency vs. aggregate-throughput
>> tradeoff is one with a long history, and I think it's appropriate to
>> view this problem through that lens also.
>
> I hadn't realised so many patches existed that were similar. Hackers
> is bigger these days.
>
> Reviewing the patch, I'd say the problem is that it is basically
> implementing a new automatic heuristic. We simply don't have any
> evidence that any new heuristic will work for all cases, so we do
> nothing.
>
> Whether we apply my patch, yours or Merlin's, my main thought now is
> that we need a user parameter to control it, so it can be adjusted
> according to need and not touched at all if there is no problem.

After a night thinking about this, I'd like to make some points:

*) My patch deliberately did not 'set bits without dirty' -- with checksums in mind, as you noted (thanks for that). I think the upside of marking pages in that fashion is overrated anyways.

*) Any strategy that does not approximate hint bit behavior IMNSHO is a non-starter. By that I mean that when your $condition is met so that hint bits are not being written out, scans need to bail out of HeapTupleSatisfiesMVCC processing with a cheap check. If you don't do that and rely on the transam.c guard, you've already missed the boat: even without a clog lookup, the extra processing there will, I can assure you, show up in profiling of repeated scans (until vacuum).

*) The case of sequential tuples with the same xid is far and away the most important one. In OLTP workloads hint bit i/o is minor compared to everything else going on. Also, OLTP workloads are probably better handled with a hint bit check just before eviction via bgwriter vs during scan.

*) The budget for extra work inside HeapTupleSatisfiesMVCC is exceptionally low. For this reason, I think your idea would be better framed at the page level, and the bailout should be measured in the number of pages, not tuples -- that way the scan can pass in a single page-level boolean to control hint bit behavior (see the sketch after this message).

*) The upside of optimizing xmax processing is fairly low for most workloads I've seen.

*) The benchmarking Amit and Hari did needs analysis.

*) For off-cycle release work that would help enable patches with complex performance trade-offs (I'm working up another patch that has even more compelling benefits and risks in the buffer allocator), we desperately need a standard battery of comprehensive performance tests and donor machines.

merlin
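[A sketch of the page-level bailout from Merlin's fourth point, as a simplified stand-alone model; none of these names are actual HeapTupleSatisfiesMVCC source, though the infomask value is the standard hint-bit flag.]

#include <stdbool.h>
#include <stdint.h>

#define HEAP_XMIN_COMMITTED 0x0100   /* infomask hint bit */

typedef struct
{
    uint16_t t_infomask;
} TupleInfomaskModel;                /* simplified stand-in */

typedef struct
{
    bool set_hints;                  /* decided once, before the tuple loop */
} PageScanState;

/*
 * The per-tuple cost of the policy is a single boolean test; the page
 * counting (pages dirtied vs. some limit) happens once per page, outside
 * the visibility check.
 */
static void
maybe_set_xmin_hint(TupleInfomaskModel *tup, const PageScanState *state)
{
    if (!state->set_hints)
        return;                      /* cheap bailout, per the point above */
    tup->t_infomask |= HEAP_XMIN_COMMITTED;
}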
Simon Riggs <simon@2ndQuadrant.com> writes:
> So please, let's go with a simple solution now that allows users to say
> what they want.

Simon, this is just empty posturing, as your arguments have nothing whatsoever to do with whether the above description applies to your patch.

More generally, the fact that a patch has some user-frobbable knob does not mean that it's actually a good or even usable solution. As everybody keeps saying, testing on a wide range of use-cases would be needed to prove that, and we don't have enough time left for such testing in the 9.3 timeframe. This problem needs to be attacked in an organized and deliberate fashion, not by hacking something up under time pressure and shipping it with minimal testing.

			regards, tom lane
On 26 March 2013 14:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> So please, let's go with a simple solution now that allows users to say
>> what they want.
>
> Simon, this is just empty posturing, as your arguments have nothing
> whatsoever to do with whether the above description applies to your
> patch.

Waiting for an auto-tuned solution to *every* problem means we just sit and watch bad things happen, knowing how to fix them for particular cases yet not being able to do anything at all.

> More generally, the fact that a patch has some user-frobbable knob
> does not mean that it's actually a good or even usable solution. As
> everybody keeps saying, testing on a wide range of use-cases would be
> needed to prove that, and we don't have enough time left for such
> testing in the 9.3 timeframe. This problem needs to be attacked in
> an organized and deliberate fashion, not by hacking something up under
> time pressure and shipping it with minimal testing.

Well, it has been tackled like that and we've *all* got nowhere. No worries, I can wait a year for that beer.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Mar 26, 2013 at 03:06:30PM +0000, Simon Riggs wrote:
> > More generally, the fact that a patch has some user-frobbable knob
> > does not mean that it's actually a good or even usable solution. As
> > everybody keeps saying, testing on a wide range of use-cases would be
> > needed to prove that, and we don't have enough time left for such
> > testing in the 9.3 timeframe. This problem needs to be attacked in
> > an organized and deliberate fashion, not by hacking something up under
> > time pressure and shipping it with minimal testing.
>
> Well, it has been tackled like that and we've *all* got nowhere. No
> worries, I can wait a year for that beer.

This was the obvious result of this discussion --- it is a shame we had to discuss this rather than working on more pressing 9.3 issues. I also think someone saying "I would like to apply this now" is more disruptive than casual discussion about things like buffer count locking.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ It's impossible for everything to be true. +
Merlin Moncure <mmoncure@gmail.com> wrote:
> *) For off-cycle release work that would help enable patches with
> complex performance trade-offs (I'm working up another patch that has
> even more compelling benefits and risks in the buffer allocator), we
> desperately need a standard battery of comprehensive performance tests
> and donor machines.

Such a thing would vastly reduce the time needed to work on something like this with confidence that it would not be a disaster for some unidentified workload. Sure, something could still "slip through the cracks" -- but they would *be* cracks, not a wide gaping hole.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company