Thread: Implementing cost limit/delays for insert/delete/update/select
Hello,

I'd like to have a stab at implementing cost delays for regular
INSERT/DELETE/UPDATE/SELECT. The motivation is roughly the same as for
VACUUM and the autovacuum limits; one may have application-specific bulk
operations that need executing without adversely affecting the
latency/throughput of other operations.

I tentatively call this executing statements "nicely". A better naming
scheme might be a good idea...

The idea would be to introduce two GUC variables:

- nice_cost_limit
- nice_cost_delay

which would be equivalent to their vacuum_* counterparts. Upon executing
an INSERT, UPDATE, DELETE or SELECT, one would optionally specify a
"NICELY" modifier to enable nice cost limits for that statement. For
example:

  DELETE NICELY FROM large_table WHERE id < 50000000

In the future I foresee also specifying a nice multiplier of some kind,
thus supporting variable niceness on a per-statement basis.

To avoid complexity, the initial version would not contain anything
similar to the autovacuum balancing of costs between separate vacuum
processes. The cost accounting would be entirely local to each backend.

Firstly, any opinions on whether this is a good idea, or on whether the
above suggestions sound sensible?

Implementation: although the current cost delay support suggests, in
terms of variable naming, that it is specific to vacuuming, I haven't
found anything that really is. As far as I can tell, an implementation
would entail:

* Adding the GUC variables.
* Modifying the parser slightly to support the NICELY "modifier"
  (terminology?).
* Modifying ExecPlan in backend/executor/execMain.c to contain
  accounting initialization and cleanup like
  backend/commands/vacuum.c's vacuum().
* Creating an equivalent of vacuum_delay_point() and calling it in each
  loop iteration in ExecPlan().

And then, cosmetically, probably renaming various things so that
non-vacuum-specific cost accounting is no longer named as if it were.

Does this sound vaguely sensible?
Is there an obvious show-stopper I am missing?

--
/ Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org
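For concreteness, the existing mechanism the proposal mirrors, vacuum_delay_point() in backend/commands/vacuum.c, works by accumulating a cost balance and sleeping whenever it crosses the limit. A minimal self-contained sketch of the proposed nice_* equivalent is below; the names nice_cost_limit/nice_cost_delay/nice_delay_point come from the proposal (they are not existing PostgreSQL symbols), and pg_usleep() and interrupt handling are stubbed out:

```c
#include <assert.h>

static int nice_cost_limit = 200;   /* proposed GUC, analogous to vacuum_cost_limit */
static int nice_cost_delay = 20;    /* proposed GUC, in milliseconds */
static int nice_cost_balance = 0;   /* per-backend accumulated cost */
static int sleeps_taken = 0;        /* stands in for pg_usleep() in this sketch */

/*
 * Charge 'cost' units (the analogue of vacuum_cost_page_hit/miss/dirty)
 * and sleep once the accumulated balance reaches the limit.
 */
static void
nice_delay_point(int cost)
{
    nice_cost_balance += cost;
    if (nice_cost_delay > 0 && nice_cost_balance >= nice_cost_limit)
    {
        sleeps_taken++;             /* real code: pg_usleep(nice_cost_delay * 1000) */
        nice_cost_balance = 0;
    }
}
```

With the defaults above, 25 page misses charged at 10 units each would trigger exactly one sleep (at the 20th miss) and leave a balance of 50.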
"Peter Schuller" <peter.schuller@infidyne.com> writes:

> I'd like to have a stab at implementing cost delays, for regular
> INSERT/DELETE/UPDATE/SELECT. The motivation is roughly the same as for
> VACUUM and the autovacuum limits; one may have application-specific
> bulk operations that need executing without adversely affecting
> latency/throughput of other operations.
>
> I tentatively call this executing statements "nicely". A better naming
> scheme might be a good idea...
>
> The idea would be to introduce two GUC variables:
>
> - nice_cost_limit
> - nice_cost_delay

I think the experience with vacuum was that cost_delay was a mistake.
The only parameter users really ought to be messing with is cost_limit.
Every time a user has posted about vacuum taking interminably long it
was because they set *_cost_delay to something unreasonable. I suppose
this could be selection bias, since we would never hear about users who
didn't set it unreasonably high.

But I think we should consider removing the {auto,}vacuum_cost_delay
parameter, or at least hiding and undocumenting it. It's a foot-gun and
serves no useful purpose that merely lowering the
{auto,}vacuum_cost_limit can't serve equally well.

> Which would be equivalent to their vacuum_* counterparts.
>
> Upon executing an INSERT, UPDATE, DELETE or SELECT, one would
> optionally specify a "NICELY" modifier to enable nice cost limits for
> that statement. For example:
>
>   DELETE NICELY FROM large_table WHERE id < 50000000

Why not just have the GUC and leave it at that?

  SET nice_cost_limit = ...
  DELETE FROM ...
  SET nice_cost_limit = ...
  UPDATE ...
  ...

> In the future I foresee also specifying a nice multiplier of some
> kind, thus supporting variable niceness on a per-statement basis.

Yeah, actually I think both the vacuum parameters and this one need some
further thought. They don't represent any sort of real-world quantity
that the user has any hope of knowing how to set except by
trial-and-error.
I think we would be better off with something like a
vacuum_io_bandwidth_cap or something like that. Then the user has a hope
of understanding what kind of numbers make sense.

> * Adding the GUC variables
> * Modifying the parser slightly to support the NICELY "modifier"
>   (terminology?)

As I mentioned, I don't think this is necessary.

> * Modify ExecPlan in backend/executor/execMain.c to contain accounting
>   initialization and cleanup like backend/commands/vacuum.c's vacuum().
> * Create an equivalent of the vacuum_delay_point() and call it in each
>   loop iteration in ExecPlan().

ExecutePlan? That's not often enough. You can easily construct plans
that do massive sequential scans on the inner side of a join or in a
subquery -- all of which happens before a single record is returned from
ExecutePlan. You would have to test for whether it's time to sleep much
more often. Possibly before every ExecProcNode call would be enough.

Even then you have to worry about the i/o and cpu resources used by
tuplesort. And there are degenerate cases where a single ExecProcNode
could do a lot of i/o, such as a large scan looking for a single
matching record.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
On Mon, 25 Aug 2008 22:39:54 +0100, Gregory Stark
<stark@enterprisedb.com> wrote:

> But I think we should consider removing the {auto,}vacuum_cost_delay
> parameter or at least hiding and undocumenting it. It's a foot-gun
> and serves no useful purpose that merely lowering the
> {auto,}vacuum_cost_limit can't serve equally well.

I thought we were already considering some kind of IO tweak for vacuum,
e.g. you may use up to 5MB/s or something like that?

Sincerely,

Joshua D. Drake

--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
Btw, I forgot to mention in my original post that one interesting use
case that is not application-specific is to apply cost limits to
pg_dump. That would be one of the final goals for me.

> I think the experience with vacuum was that cost_delay was a mistake.
> The only parameter users really ought to be messing with is
> cost_limit. Every time a user has posted about vacuum taking
> interminably long it was because they set *_cost_delay to something
> unreasonable. I suppose this could be selection bias since we would
> never hear about users who didn't set it unreasonably high.
>
> But I think we should consider removing the {auto,}vacuum_cost_delay
> parameter or at least hiding and undocumenting it. It's a foot-gun and
> serves no useful purpose that merely lowering the
> {auto,}vacuum_cost_limit can't serve equally well.

Sounds sensible to me. I included nice_cost_delay in this case to remain
consistent with the others.

> > DELETE NICELY FROM large_table WHERE id < 50000000
>
> Why not just have the GUC and leave it at that?
>
>   SET nice_cost_limit = ...
>   DELETE FROM ...
>   SET nice_cost_limit = ...
>   UPDATE ...
>   ...

Sounds a lot cleaner than introducing new syntax, yes. Leaving it
GUC-only does mean the submitter must choose a value, and cannot just
indicate "whichever the administrator chose to be sensible". Perhaps
have a separate boolean cost_limit flag that would allow just turning it
on, without specifying actual limits?

> I think we would be better off with something like a
> vacuum_io_bandwidth_cap or something like that. Then the user has a
> hope of understanding what kind of numbers make sense.

Another option might be to give page_miss and friends an actual unit
that is meaningful, such as the expected worst-case "device time"
required to perform the I/O operation (tweaked on a per-device basis).
One could then specify the maximum vacuum cost in terms of the
percentage of real time spent on vacuum-related I/O.

> ExecutePlan?
> That's not often enough. You can easily construct plans that do
> massive sequential scans on the inner side of a join or in a subquery
> -- all of which happens before a single record is returned from
> ExecutePlan. You would have to test for whether it's time to sleep
> much more often. Possibly before every ExecProcNode call would be
> enough.
>
> Even then you have to worry about the i/o and cpu resources used by
> tuplesort. And there are degenerate cases where a single ExecProcNode
> could do a lot of i/o, such as a large scan looking for a single
> matching record.

Ok - I obviously need to look at these parts more carefully.

Thanks for the feedback!

--
/ Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org
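Both the vacuum_io_bandwidth_cap suggestion and the "device time" idea amount to converting the abstract cost knobs into a bandwidth figure the user can reason about. As a back-of-envelope sketch (the formula and numbers are illustrative assumptions, not anything in the PostgreSQL source: it assumes every page access is charged as a miss and that the work itself takes negligible time compared to the sleep):

```c
#include <assert.h>

/*
 * Worst-case i/o rate implied by a cost_limit/cost_delay pair:
 * each cycle does cost_limit/cost_page_miss page fetches, then sleeps
 * cost_delay_ms; if the fetches themselves are instantaneous, the
 * sleep alone bounds the cycle rate.
 */
static double
worst_case_bandwidth_mb_per_s(int cost_limit, int cost_page_miss,
                              int cost_delay_ms, int page_size_kb)
{
    double pages_per_cycle = (double) cost_limit / cost_page_miss;
    double cycles_per_s = 1000.0 / cost_delay_ms;

    return pages_per_cycle * cycles_per_s * page_size_kb / 1024.0;
}
```

With the stock-style values cost_limit=200, page_miss=10, delay=20ms and 8 kB pages, that works out to 20 pages per cycle at 50 cycles/s, i.e. roughly 8 MB/s, which is the kind of single number a bandwidth-cap GUC would let the user state directly.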
Gregory Stark <stark@enterprisedb.com> writes:

> You would have to test for whether it's time to sleep much more often.
> Possibly before every ExecProcNode call would be enough.

That would have overhead comparable to EXPLAIN ANALYZE, which is a lot.

I'm fairly dubious about this whole proposal: it's not clear to me that
the vacuum delay stuff works very well at all, and to the extent that it
does work, it's because vacuum has such stylized, predictable behavior.
The same can't be said of general SQL queries. For one thing, it's not
apparent that rate-limiting I/O would be sufficient, because although
vacuum is nearly always I/O bound, general queries often are CPU bound;
or their system impact might be due to other factors like contention for
shared-memory data structures. Priority inversion is a pretty serious
concern as well (ie, a sleeping "low priority" query might be blocking
other queries).

I think this proposal is likely to lead to a large, invasive patch that
isn't actually very useful in the real world.

Oh, and I think the suggestion to modify SQL syntax for this is right
out. A GUC setting might be a sane API.

			regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> You would have to test for whether it's time to sleep much more often.
>> Possibly before every ExecProcNode call would be enough.
>
> That would have overhead comparable to EXPLAIN ANALYZE, which is a lot.

I don't think the overhead would be anywhere near as much. He wouldn't
have to call gettimeofday() (except just after sleeping), which is what
hurts EXPLAIN ANALYZE so much. Just count i/o operations.

In any case, if the point is to call sleep, how much do we care about
overhead? Do you think it would consume so much extra cpu that even
after sleeping for 10ms it would be a big resource drain for other
processes?

The point about the invasiveness remains. If you want to count i/o
operations it means touching a lot of lower-level code. I'm interested
in doing that anyway to get better monitoring and profiling data,
though.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
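The cheap test Stark describes might look something like the sketch below: a bare counter bumped at each low-level buffer access, with the per-ExecProcNode path reduced to a single compare in the common case. All names are illustrative rather than existing PostgreSQL symbols, and the actual pg_usleep() call is replaced by a counter so the sketch stays self-contained:

```c
#include <assert.h>

static unsigned long io_ops_count = 0;          /* bumped at each buffer access */
static unsigned long io_ops_limit = 1000;       /* ops allowed between sleeps */
static unsigned long delays_triggered = 0;      /* stands in for pg_usleep() */

/* Called from low-level buffer access code -- the "invasive" part. */
static inline void
count_io_op(void)
{
    io_ops_count++;
}

/*
 * Called before each ExecProcNode(); no gettimeofday() needed,
 * just an integer compare until the limit is crossed.
 */
static void
maybe_nice_delay(void)
{
    if (io_ops_count >= io_ops_limit)
    {
        delays_triggered++;     /* real code would sleep here, then reset */
        io_ops_count = 0;
    }
}
```

The per-node cost is one increment and one branch, which is why this avoids the EXPLAIN ANALYZE-style timing overhead; the expensive part, as noted above, is plumbing count_io_op() through the lower-level i/o paths.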
On Mon, 2008-08-25 at 22:39 +0200, Peter Schuller wrote:

> Does this sound vaguely sensible? Is there an obvious show-stopper I
> am missing?

This was a well-structured proposal. The main problem is where you put
the delay_point() calls.

If you put them at the top of the executor then you will get a delay
proportional to the number of rows retrieved. For many queries, such as
count(*), this might be just one row, yet the query may have run for
hours. There's no point having a priority scheme if it doesn't apply to
all queries equally.

If you put them at each call of each node then you will get an
unacceptable overhead, as Tom suggests.

Not sure what to suggest, if anything, apart from just writing your own
delay() function and using it in your query.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Hello,

[ I have not yet had the time to look at the code again in response to
some of the points raised by several people, but I wanted to follow up
somewhat on other bits. ]

> > You would have to test for whether it's time to sleep much more
> > often. Possibly before every ExecProcNode call would be enough.
>
> That would have overhead comparable to EXPLAIN ANALYZE, which is a
> lot.
>
> I'm fairly dubious about this whole proposal: it's not clear to me
> that the vacuum delay stuff works very well at all, and to the extent
> that it does work it's because vacuum has such stylized, predictable
> behavior.

Well, it definitely works well enough to make a large difference in my
use cases, in particular with respect to the amount of write activity
generated, which easily causes latency problems. That said, it remains
to be seen how much of an issue heavy write activity will be once we
have upgraded to 8.3 and after tweaking the Linux buffer cache.

Right now, I do not expect the database to be even useful in one of my
use cases if it were not for delay points during vacuuming. So although
I make no argument as to whether it works better due to the limited and
understood nature of vacuuming, it is definitely an appreciated feature
for my use cases.

> The same can't be said of general SQL queries. For one thing, it's
> not apparent that rate-limiting I/O would be sufficient, because
> although vacuum is nearly always I/O bound, general queries often are
> CPU bound; or their system impact might be due to other factors like
> contention for shared-memory data structures.

In my case I mostly care about I/O. I believe this is going to be a
fairly common fact for anyone whose primary concern is latency. The good
part about CPU contention is that it is handled quite well by modern
operating systems/hardware.
Even on a single-core machine, a single CPU-bound query should still
only have a percentage-wise throughput impact on other traffic
(normally; of course you might have some particularly bad contention on
some resource, etc). If your database is very sensitive to latency, you
are likely running it at far below full throughput, meaning that there
should be quite a bit of margin in terms of CPU. This would be
especially true on multi-core machines, where the impact of a single
backend is even less.

The problem I have with I/O is that saturating I/O, in particular with
writes, has all sorts of indirect effects that are difficult to predict,
and that are not at all guaranteed to translate into a simple
percentage-wise slow-down. For example, I've seen stalls lasting several
*minutes* due to a bulk DELETE of a million rows or so. With mixed
random-access writes, streaming writes, and the PG buffer cache, the
operating system buffer cache, and the RAID controller's cache, it is
not at all unlikely that you will have significant latency problems when
saturating the system with writes.

So, recognizing that I am not likely to ever have very good behavior
while saturating the storage system with writes, I instead want to limit
the write activity generated to a sensible amount (preferably such that
individual bursts are small enough to e.g. fit in a RAID controller
cache). This reduces the problem to ensuring good behavior with respect
to short bursts of writes and their interaction with checkpoints, which
is a much easier problem than somehow ensuring "fairness" under
write-saturated load.

So that is where my motivation comes from; in more or less all my use
cases, limiting disk I/O is massively more important than limiting CPU
usage.

On this topic, I have started thinking again about direct I/O. I asked
about this on -performance a while back, in a different context, and it
seemed there was definitely no clear consensus that one "should" have
direct I/O.
That seemed to be mostly due to a perceived lack of increase in
throughput. However, my gut feeling is that by bypassing the OS buffer
cache, one could significantly improve the real-time/latency-sensitive
aspects of PostgreSQL in cases where throughput is not the primary
concern. Perhaps something like that would be a more effective approach.

> Priority inversion is a pretty serious concern as well (ie, a sleeping
> "low priority" query might be blocking other queries).

I presume this is in reference to bulk modifications (rather than
selects) blocking other transactions with conflicting updates? If so,
yes, I see that. The feature would be difficult to use reliably for
writes except in very controlled situations (in my particular use case
that I am tunnel-visioning on, it is guaranteed that there is no
conflict due to the nature of the application).

But is my understanding correct that there is no reason to believe there
are such issues for read-only queries, or queries that do not actually
conflict (at the SQL level) with concurrent transactions? (Ignoring the
impact they might have on old transactions hanging around for a longer
time.)

--
/ Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey@scode.org
E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org