Thread: WAL insert delay settings
Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
a lot of WAL.  A lot of WAL at once can cause delays in replication.
For synchronous replication, this can make seemingly unrelated sessions
hang.  But also for asynchronous replication, it will increase latency.

One idea to address this is to slow down WAL-generating maintenance
operations.  This is similar to the vacuum delay.  Where the vacuum
delay counts notional I/O cost before sleeping, here we would count how
much WAL has been generated and sleep after some amount.

I attach an example patch for this functionality.  It introduces three
settings:

wal_insert_delay_enabled
wal_insert_delay
wal_insert_delay_size

When you turn on wal_insert_delay_enabled, then it will sleep for
wal_insert_delay after the session has produced wal_insert_delay_size of
WAL data.

The idea is that you would tune wal_insert_delay and
wal_insert_delay_size to your required performance characteristics and
then turn on wal_insert_delay_enabled individually in maintenance jobs
or similar.

To test, for example, set up pgbench with synchronous replication and
run an unrelated large index build in a separate session.  With the
settings, you can make it as fast or as slow as you want.

Tuning these settings, however, is quite mysterious, I fear.  You have
to play around a lot to get settings that achieve the right balance.

So, some questions:

Is this useful?

Any other thoughts on how to configure this or do this?

Should we aim for a more general delay system, possibly including vacuum
delay and perhaps something else?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Hi,

On February 13, 2019 1:16:07 PM GMT+01:00, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
>a lot of WAL.  A lot of WAL at once can cause delays in replication.
>For synchronous replication, this can make seemingly unrelated sessions
>hang.  But also for asynchronous replication, it will increase latency.
>
>One idea to address this is to slow down WAL-generating maintenance
>operations.  This is similar to the vacuum delay.  Where the vacuum
>delay counts notional I/O cost before sleeping, here we would count how
>much WAL has been generated and sleep after some amount.
>
>I attach an example patch for this functionality.  It introduces three
>settings:
>
>wal_insert_delay_enabled
>wal_insert_delay
>wal_insert_delay_size
>
>When you turn on wal_insert_delay_enabled, then it will sleep for
>wal_insert_delay after the session has produced wal_insert_delay_size of
>WAL data.
>
>The idea is that you would tune wal_insert_delay and
>wal_insert_delay_size to your required performance characteristics and
>then turn on wal_insert_delay_enabled individually in maintenance jobs
>or similar.
>
>To test, for example, set up pgbench with synchronous replication and
>run an unrelated large index build in a separate session.  With the
>settings, you can make it as fast or as slow as you want.
>
>Tuning these settings, however, is quite mysterious I fear.  You have to
>play around a lot to get settings that achieve the right balance.
>
>So, some questions:
>
>Is this useful?
>
>Any other thoughts on how to configure this or do this?
>
>Should we aim for a more general delay system, possibly including vacuum
>delay and perhaps something else?

Interesting idea, not yet quite sure what to think. But I don't think the way you did it is acceptable - we can't just delay while holding buffer locks, in critical sections, while not interruptible.

Andres
-- 
Sent from my Android device with K-9 Mail.
Please excuse my brevity.
Hi

On 13.02.2019 17:16, Peter Eisentraut wrote:
> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
> a lot of WAL.  A lot of WAL at once can cause delays in replication.
> For synchronous replication, this can make seemingly unrelated sessions
> hang.  But also for asynchronous replication, it will increase latency.
>
> One idea to address this is to slow down WAL-generating maintenance
> operations.  This is similar to the vacuum delay.  Where the vacuum
> delay counts notional I/O cost before sleeping, here we would count how
> much WAL has been generated and sleep after some amount.
>
> I attach an example patch for this functionality.  It introduces three
> settings:
>
> wal_insert_delay_enabled
> wal_insert_delay
> wal_insert_delay_size
>
> When you turn on wal_insert_delay_enabled, then it will sleep for
> wal_insert_delay after the session has produced wal_insert_delay_size of
> WAL data.
>
> The idea is that you would tune wal_insert_delay and
> wal_insert_delay_size to your required performance characteristics and
> then turn on wal_insert_delay_enabled individually in maintenance jobs
> or similar.
>
> To test, for example, set up pgbench with synchronous replication and
> run an unrelated large index build in a separate session.  With the
> settings, you can make it as fast or as slow as you want.
>
> Tuning these settings, however, is quite mysterious I fear.  You have to
> play around a lot to get settings that achieve the right balance.
>
> So, some questions:
>
> Is this useful?
>
> Any other thoughts on how to configure this or do this?
>
> Should we aim for a more general delay system, possibly including vacuum
> delay and perhaps something else?

I think it's better to have more general cost-based settings which allow controlling performance, something like what has already been done for autovacuum.
For example, introduce a vacuum-like mechanism with the following controls:

maintenance_cost_page_hit
maintenance_cost_page_miss
maintenance_cost_page_dirty
maintenance_cost_delay
maintenance_cost_limit

maintenance_cost_delay = 0 (the default) means the feature is disabled, but if a user wants to limit performance, they can set such parameters in a per-session or per-user manner.  This would be especially useful for limiting already-running sessions, such as a mass deletion or pg_dump.

Of course, it's just an idea, because I can't imagine how many things would have to be touched in order to implement this.

Regards,
Alexey Lesovsky
Andres Freund <andres@anarazel.de> writes:
> On February 13, 2019 1:16:07 PM GMT+01:00, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>> One idea to address this is to slow down WAL-generating maintenance
>> operations.  This is similar to the vacuum delay.  Where the vacuum
>> delay counts notional I/O cost before sleeping, here we would count how
>> much WAL has been generated and sleep after some amount.

> Interesting idea, not yet quite sure what to think. But I don't think the way you did it is acceptable - we can't just delay while holding buffer locks, in critical sections, while not interruptible.

Yeah.  Maybe it could be done in a less invasive way by just having the
WAL code keep a running sum of how much WAL this process has created,
and then letting the existing vacuum-delay infrastructure use that as
one of its how-much-IO-have-I-done inputs.

Not sure if that makes the tuning problem easier or harder, but
it seems reasonable on its face to count WAL emission as I/O.

			regards, tom lane
Greetings,

* Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
> a lot of WAL.  A lot of WAL at once can cause delays in replication.

Agreed, though I think VACUUM should certainly be included in this.

I'm all for the idea, though it seems like a different approach is needed based on the down-thread discussion.  Ultimately, having a way to have these activities happen without causing issues for replicas is a great idea and would definitely address a practical issue that a lot of people run into.

Thanks!

Stephen
Attachment
On 13/02/2019 13:18, Andres Freund wrote:
> But I don't think the way you did it is acceptable - we can't just delay while holding buffer locks, in critical sections, while not interruptible.

The code I added to XLogInsertRecord() is not inside the critical section.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On February 13, 2019 4:39:21 PM GMT+01:00, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>On 13/02/2019 13:18, Andres Freund wrote:
>> But I don't think the way you did it is acceptable - we can't just
>> delay while holding buffer locks, in critical sections, while not
>> interruptible.
>
>The code I added to XLogInsertRecord() is not inside the critical
>section.

Most callers do xlog insertions inside crit sections though.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 13.02.2019 19:57, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On February 13, 2019 1:16:07 PM GMT+01:00, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>>> One idea to address this is to slow down WAL-generating maintenance
>>> operations.  This is similar to the vacuum delay.  Where the vacuum
>>> delay counts notional I/O cost before sleeping, here we would count how
>>> much WAL has been generated and sleep after some amount.
>
>> Interesting idea, not yet quite sure what to think. But I don't think the way you did it is acceptable - we can't just delay while holding buffer locks, in critical sections, while not interruptible.
>
> Yeah.  Maybe it could be done in a less invasive way by just having the
> WAL code keep a running sum of how much WAL this process has created,
> and then letting the existing vacuum-delay infrastructure use that as
> one of its how-much-IO-have-I-done inputs.
>
> Not sure if that makes the tuning problem easier or harder, but
> it seems reasonable on its face to count WAL emission as I/O.
>
> 			regards, tom lane

Also, we could add a 'soft' clause to DML queries, as an abstraction for background query execution.  It could contain the WAL write velocity limit parameter (as Tom proposed), and perhaps some others.

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
On Wed, Feb 13, 2019 at 4:16 AM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
> a lot of WAL.  A lot of WAL at once can cause delays in replication.
> For synchronous replication, this can make seemingly unrelated sessions
> hang.  But also for asynchronous replication, it will increase latency.

I think that I suggested a feature like this early during my time at Heroku, about 5 years ago.  There would occasionally be cases where ops would find it useful to throttle WAL writing using their own terrible kludge (it involved sending SIGSTOP to the WAL writer).  I recall that this idea was not well received at the time.

I still think it's a good idea, though.  Provided there is a safe way to get it to work.

-- 
Peter Geoghegan
Hi,

On 2019-02-13 23:21:39 -0800, Peter Geoghegan wrote:
> There would occasionally be cases where ops
> would find it useful to throttle WAL writing using their own terrible
> kludge (it involved sending SIGSTOP to the WAL writer).

That can't have been the workaround - either you'd interrupt it while holding critical locks (in which case nobody could write WAL anymore), or you'd just move all the writing to backends, no?

Greetings,

Andres Freund
On Thu, Feb 14, 2019 at 12:42 AM Andres Freund <andres@anarazel.de> wrote: > That can't have been the workaround - either you'd interrupt it while > holding critical locks (in which case nobody could write WAL anymore), > or you'd just move all the writing to backends, no? I imagine that it held the critical locks briefly. I'm not endorsing that approach, obviously, but apparently it more or less worked. It was something that was used in rare cases, only when there was no application-specific way to throttle writes, and only when the server was in effect destabilized by writing out WAL too quickly. -- Peter Geoghegan
On Thu, Feb 14, 2019 at 12:52 AM Peter Geoghegan <pg@bowt.ie> wrote: > I imagine that it held the critical locks briefly. I didn't mention that the utility they used would send SIGSTOP and SIGCONT in close succession. (Yeah, I know.) -- Peter Geoghegan
On Thu, Feb 14, 2019 at 12:53 AM Peter Geoghegan <pg@bowt.ie> wrote: > I didn't mention that the utility they used would send SIGSTOP and > SIGCONT in close succession. (Yeah, I know.) Actually, it SIGSTOP'd backends, not the WAL writer or background writer. -- Peter Geoghegan
On 2/13/19 4:31 PM, Stephen Frost wrote:
> Greetings,
>
> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
>> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
>> a lot of WAL.  A lot of WAL at once can cause delays in replication.
>
> Agreed, though I think VACUUM should certainly be included in this.

Won't these two throttling criteria interact in undesirable and/or unpredictable ways?  With the regular vacuum throttling (based on hit/miss/dirty) it's possible to compute rough read/write I/O limits.  But with the additional sleeps based on amount-of-WAL, we may sleep for one of two reasons, so we may not reach either limit.  No?

> I'm all for the idea though it seems like a different approach is needed
> based on the down-thread discussion.  Ultimately, having a way to have
> these activities happen without causing issues for replicas is a great
> idea and would definitely address a practical issue that a lot of people
> run into.

+1

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2019-02-14 10:00:38 +0100, Tomas Vondra wrote:
> On 2/13/19 4:31 PM, Stephen Frost wrote:
>> Greetings,
>>
>> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
>>> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
>>> a lot of WAL.  A lot of WAL at once can cause delays in replication.
>>
>> Agreed, though I think VACUUM should certainly be included in this.
>
> Won't these two throttling criteria interact in undesirable and/or
> unpredictable way? With the regular vacuum throttling (based on
> hit/miss/dirty) it's possible to compute rough read/write I/O limits.
> But with the additional sleeps based on amount-of-WAL, we may sleep for
> one of two reasons, so we may not reach either limit. No?

Well, it'd be max rates for either, if done right.  I think we should only start adding delays for WAL logging if we're exceeding the WAL write rate.  That's obviously more complicated than the stuff we do for the current VACUUM throttling, but I can't see two such systems interacting well.  Also, the current logic just doesn't work well when you consider I/O actually taking time, and/or process scheduling effects on busy systems.

Greetings,

Andres Freund
On 2/14/19 10:06 AM, Andres Freund wrote:
> Hi,
>
> On 2019-02-14 10:00:38 +0100, Tomas Vondra wrote:
>> On 2/13/19 4:31 PM, Stephen Frost wrote:
>>> Greetings,
>>>
>>> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
>>>> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
>>>> a lot of WAL.  A lot of WAL at once can cause delays in replication.
>>>
>>> Agreed, though I think VACUUM should certainly be included in this.
>>
>> Won't these two throttling criteria interact in undesirable and/or
>> unpredictable way? With the regular vacuum throttling (based on
>> hit/miss/dirty) it's possible to compute rough read/write I/O limits.
>> But with the additional sleeps based on amount-of-WAL, we may sleep for
>> one of two reasons, so we may not reach either limit. No?
>
> Well, it'd be max rates for either, if done right. I think we only
> should start adding delays for WAL logging if we're exceeding the WAL
> write rate.

Not really, I think.  If you add additional sleep() calls somewhere, that may affect the limits in vacuum, making it throttle before reaching the derived throughput limits.

> That's obviously more complicated than the stuff we do for
> the current VACUUM throttling, but I can't see two such systems
> interacting well. Also, the current logic just doesn't work well when
> you consider IO actually taking time, and/or process scheduling effects
> on busy systems.

True, but making it even less predictable is hardly an improvement.

cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On February 14, 2019 10:31:57 AM GMT+01:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>On 2/14/19 10:06 AM, Andres Freund wrote:
>> On 2019-02-14 10:00:38 +0100, Tomas Vondra wrote:
>>> On 2/13/19 4:31 PM, Stephen Frost wrote:
>>>> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
>>>>> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
>>>>> a lot of WAL.  A lot of WAL at once can cause delays in replication.
>>>>
>>>> Agreed, though I think VACUUM should certainly be included in this.
>>>
>>> Won't these two throttling criteria interact in undesirable and/or
>>> unpredictable way? With the regular vacuum throttling (based on
>>> hit/miss/dirty) it's possible to compute rough read/write I/O limits.
>>> But with the additional sleeps based on amount-of-WAL, we may sleep for
>>> one of two reasons, so we may not reach either limit. No?
>>
>> Well, it'd be max rates for either, if done right. I think we only
>> should start adding delays for WAL logging if we're exceeding the WAL
>> write rate.
>
>Not really, I think. If you add additional sleep() calls somewhere, that
>may affect the limits in vacuum, making it throttle before reaching the
>derived throughput limits.

I don't understand.  Obviously, if you have two limits, the scarcer resource can limit full use of the other resource.  That seems OK?  The thing I think we need to be careful about is not to limit in a way, e.g. by adding sleeps even when below the limit, that a WAL limit causes throttling of normal IO before the WAL limit is reached.

>> That's obviously more complicated than the stuff we do for
>> the current VACUUM throttling, but I can't see two such systems
>> interacting well. Also, the current logic just doesn't work well when
>> you consider IO actually taking time, and/or process scheduling effects
>> on busy systems.
>
>True, but making it even less predictable is hardly an improvement.

I don't quite see the problem here.  Could you expand?

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 2/14/19 10:36 AM, Andres Freund wrote:
> On February 14, 2019 10:31:57 AM GMT+01:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> On 2/14/19 10:06 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2019-02-14 10:00:38 +0100, Tomas Vondra wrote:
>>>> On 2/13/19 4:31 PM, Stephen Frost wrote:
>>>>> Greetings,
>>>>>
>>>>> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
>>>>>> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
>>>>>> a lot of WAL.  A lot of WAL at once can cause delays in replication.
>>>>>
>>>>> Agreed, though I think VACUUM should certainly be included in this.
>>>>
>>>> Won't these two throttling criteria interact in undesirable and/or
>>>> unpredictable way? With the regular vacuum throttling (based on
>>>> hit/miss/dirty) it's possible to compute rough read/write I/O limits.
>>>> But with the additional sleeps based on amount-of-WAL, we may sleep for
>>>> one of two reasons, so we may not reach either limit. No?
>>>
>>> Well, it'd be max rates for either, if done right. I think we only
>>> should start adding delays for WAL logging if we're exceeding the WAL
>>> write rate.
>>
>> Not really, I think. If you add additional sleep() calls somewhere, that
>> may affect the limits in vacuum, making it throttle before reaching the
>> derived throughput limits.
>
> I don't understand. Obviously, if you have two limits, the scarcer
> resource can limit full use of the other resource. That seems OK? The
> thing I think we need to be careful about is not to limit in a way,
> e.g. by adding sleeps even when below the limit, that a WAL limit
> causes throttling of normal IO before the WAL limit is reached.

With the vacuum throttling, rough I/O throughput maximums can be computed by counting the number of pages you can read/write between sleeps.  For example, with the defaults (200 credits, 20ms sleeps, miss cost 10 credits) this means 20 writes/round, with 50 rounds/second, so 8MB/s.

But this is based on the assumption that the work between sleeps takes almost no time - that's not perfect, but it generally works.  If you add extra sleep() calls somewhere (say because there's also a limit on WAL throughput), it will affect how fast VACUUM works in general.  It'll continue with the cost-based throttling, but it will never reach the limits.  Say you do another 20ms sleep somewhere.  Suddenly it means it only does 25 rounds/second, and the actual write limit drops to 4 MB/s.

>>> That's obviously more complicated than the stuff we do for
>>> the current VACUUM throttling, but I can't see two such systems
>>> interacting well. Also, the current logic just doesn't work well when
>>> you consider IO actually taking time, and/or process scheduling effects
>>> on busy systems.
>>
>> True, but making it even less predictable is hardly an improvement.
>
> I don't quite see the problem here. Could you expand?

All I'm saying is that you can currently estimate how many reads/writes vacuum does.  With the extra sleeps (due to the additional throttling mechanism) it will be harder.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 14/02/2019 11:03, Tomas Vondra wrote:
> But if you add extra sleep() calls somewhere (say because there's also
> limit on WAL throughput), it will affect how fast VACUUM works in
> general. Yet it'll continue with the cost-based throttling, but it will
> never reach the limits. Say you do another 20ms sleep somewhere.
> Suddenly it means it only does 25 rounds/second, and the actual write
> limit drops to 4 MB/s.

I think at a first approximation, you probably don't want to add WAL delays to vacuum jobs, since they are already slowed down, so the rate of WAL they produce might not be your first problem.  The problem is more things like CREATE INDEX CONCURRENTLY that run at full speed.

That leads to an alternative idea of expanding the existing cost-based vacuum delay system to other commands.

We could even enhance the cost system by taking WAL into account as an additional factor.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 13/02/2019 16:40, Andres Freund wrote:
> On February 13, 2019 4:39:21 PM GMT+01:00, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>> On 13/02/2019 13:18, Andres Freund wrote:
>>> But I don't think the way you did it is acceptable - we can't just
>>> delay while holding buffer locks, in critical sections, while not
>>> interruptible.
>>
>> The code I added to XLogInsertRecord() is not inside the critical
>> section.
>
> Most callers do xlog insertions inside crit sections though.

Is it a problem that pg_usleep(CommitDelay) is inside a critical section?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings,
On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 14/02/2019 11:03, Tomas Vondra wrote:
> But if you add extra sleep() calls somewhere (say because there's also
> limit on WAL throughput), it will affect how fast VACUUM works in
> general. Yet it'll continue with the cost-based throttling, but it will
> never reach the limits. Say you do another 20ms sleep somewhere.
> Suddenly it means it only does 25 rounds/second, and the actual write
> limit drops to 4 MB/s.
I think at a first approximation, you probably don't want to add WAL
delays to vacuum jobs, since they are already slowed down, so the rate
of WAL they produce might not be your first problem. The problem is
more things like CREATE INDEX CONCURRENTLY that run at full speed.
That leads to an alternative idea of expanding the existing cost-based
vacuum delay system to other commands.
We could even enhance the cost system by taking WAL into account as an
additional factor.
This is really what I was thinking- let's not have multiple independent ways of slowing down maintenance and similar jobs to reduce their impact on I/O to the heap and to WAL.
Thanks!
Stephen
Hi,

On 2019-02-14 16:16:05 +0100, Peter Eisentraut wrote:
> On 13/02/2019 16:40, Andres Freund wrote:
>> On February 13, 2019 4:39:21 PM GMT+01:00, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>>> On 13/02/2019 13:18, Andres Freund wrote:
>>>> But I don't think the way you did it is acceptable - we can't just
>>>> delay while holding buffer locks, in critical sections, while not
>>>> interruptible.
>>>
>>> The code I added to XLogInsertRecord() is not inside the critical
>>> section.
>>
>> Most callers do xlog insertions inside crit sections though.
>
> Is it a problem that pg_usleep(CommitDelay) is inside a critical section?

Well: We can't make things sleep for considerable time while holding crucial locks and not make such sleeps interruptible.  Holding lwlocks makes a sleep non-interruptible (though it could still throw an error), but with crit sections, we can't even error out if we somehow got that error.  Consider throttled code writing to a popular index or btree page - it'd suddenly be stalled and everyone else would also queue up in an uninterruptible manner via lwlocks.  You'd throttle the whole system.

Greetings,

Andres Freund
On 2019-02-14 11:02:24 -0500, Stephen Frost wrote:
> Greetings,
>
> On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> > On 14/02/2019 11:03, Tomas Vondra wrote:
> > > But if you add extra sleep() calls somewhere (say because there's also
> > > limit on WAL throughput), it will affect how fast VACUUM works in
> > > general. Yet it'll continue with the cost-based throttling, but it will
> > > never reach the limits. Say you do another 20ms sleep somewhere.
> > > Suddenly it means it only does 25 rounds/second, and the actual write
> > > limit drops to 4 MB/s.
> >
> > I think at a first approximation, you probably don't want to add WAL
> > delays to vacuum jobs, since they are already slowed down, so the rate
> > of WAL they produce might not be your first problem.  The problem is
> > more things like CREATE INDEX CONCURRENTLY that run at full speed.
> >
> > That leads to an alternative idea of expanding the existing cost-based
> > vacuum delay system to other commands.
> >
> > We could even enhance the cost system by taking WAL into account as an
> > additional factor.
>
> This is really what I was thinking- let's not have multiple independent
> ways of slowing down maintenance and similar jobs to reduce their impact
> on I/O to the heap and to WAL.

I think that's a bad idea.  Both because the current vacuum code is *terrible* if you desire higher rates, because neither CPU nor I/O time is taken into account.  And it's extremely hard to control.  And it seems entirely valuable to be able to limit the amount of WAL generated for replication, but still try to get the rest of the work done as quickly as reasonably possible wrt local I/O.

Greetings,

Andres Freund
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2019-02-14 11:02:24 -0500, Stephen Frost wrote:
> > On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> > > On 14/02/2019 11:03, Tomas Vondra wrote:
> > > > But if you add extra sleep() calls somewhere (say because there's also
> > > > limit on WAL throughput), it will affect how fast VACUUM works in
> > > > general. Yet it'll continue with the cost-based throttling, but it will
> > > > never reach the limits. Say you do another 20ms sleep somewhere.
> > > > Suddenly it means it only does 25 rounds/second, and the actual write
> > > > limit drops to 4 MB/s.
> > >
> > > I think at a first approximation, you probably don't want to add WAL
> > > delays to vacuum jobs, since they are already slowed down, so the rate
> > > of WAL they produce might not be your first problem.  The problem is
> > > more things like CREATE INDEX CONCURRENTLY that run at full speed.
> > >
> > > That leads to an alternative idea of expanding the existing cost-based
> > > vacuum delay system to other commands.
> > >
> > > We could even enhance the cost system by taking WAL into account as an
> > > additional factor.
> >
> > This is really what I was thinking- let's not have multiple independent
> > ways of slowing down maintenance and similar jobs to reduce their impact
> > on I/O to the heap and to WAL.
>
> I think that's a bad idea.  Both because the current vacuum code is
> *terrible* if you desire higher rates because both CPU and IO time
> aren't taken into account.  And it's extremely hard to control.  And it
> seems entirely valuable to be able to limit the amount of WAL generated
> for replication, but still try to get the rest of the work done as
> quickly as reasonably possible wrt local IO.

I'm all for making improvements to the vacuum code and making it easier to control.

I don't buy the argument that there is some way to segregate the local I/O question from the WAL when we're talking about these kinds of operations (VACUUM, CREATE INDEX, CLUSTER, etc.) on logged relations, nor do I think we do our users a service by giving them independent knobs for both that will undoubtedly end up making it more difficult to understand and control what's going on overall.

Even here, it seems, you're arguing that the existing approach for VACUUM is hard to control; wouldn't adding another set of knobs for controlling the amount of WAL generated by VACUUM make that worse?  I have a hard time seeing how it wouldn't.

Thanks!

Stephen
Attachment
Hi, On 2019-02-15 08:50:03 -0500, Stephen Frost wrote: > * Andres Freund (andres@anarazel.de) wrote: > > On 2019-02-14 11:02:24 -0500, Stephen Frost wrote: > > > On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut < > > > peter.eisentraut@2ndquadrant.com> wrote: > > > > On 14/02/2019 11:03, Tomas Vondra wrote: > > > > > But if you add extra sleep() calls somewhere (say because there's also > > > > > limit on WAL throughput), it will affect how fast VACUUM works in > > > > > general. Yet it'll continue with the cost-based throttling, but it will > > > > > never reach the limits. Say you do another 20ms sleep somewhere. > > > > > Suddenly it means it only does 25 rounds/second, and the actual write > > > > > limit drops to 4 MB/s. > > > > > > > > I think at a first approximation, you probably don't want to add WAL > > > > delays to vacuum jobs, since they are already slowed down, so the rate > > > > of WAL they produce might not be your first problem. The problem is > > > > more things like CREATE INDEX CONCURRENTLY that run at full speed. > > > > > > > > That leads to an alternative idea of expanding the existing cost-based > > > > vacuum delay system to other commands. > > > > > > > > We could even enhance the cost system by taking WAL into account as an > > > > additional factor. > > > > > > This is really what I was thinking- let’s not have multiple independent > > > ways of slowing down maintenance and similar jobs to reduce their impact on > > > I/o to the heap and to WAL. > > > > I think that's a bad idea. Both because the current vacuum code is > > *terrible* if you desire higher rates because both CPU and IO time > > aren't taken into account. And it's extremely hard to control. And it > > seems entirely valuable to be able to limit the amount of WAL generated > > for replication, but still try go get the rest of the work done as > > quickly as reasonably possible wrt local IO. 
> > I'm all for making improvements to the vacuum code and making it easier > to control. > > I don't buy off on the argument that there is some way to segregate the > local I/O question from the WAL when we're talking about these kinds of > operations (VACUUM, CREATE INDEX, CLUSTER, etc) on logged relations, nor > do I think we do our users a service by giving them independent knobs > for both that will undoubtably end up making it more difficult to > understand and control what's going on overall. > > Even here, it seems, you're arguing that the existing approach for > VACUUM is hard to control; wouldn't adding another set of knobs for > controlling the amount of WAL generated by VACUUM make that worse? I > have a hard time seeing how it wouldn't. I think it's because I see them as, often, having two largely independent use cases. If your goal is to avoid swamping replication with WAL, you don't necessarily care about also throttling VACUUM (or REINDEX, or CLUSTER, or ...)'s local IO. By forcing to combine the two you just make the whole feature less usable. I think it'd not be insane to add two things: - WAL write rate limiting, independent of the vacuum stuff. It'd also be used by lots of other bulk commands (CREATE INDEX, ALTER TABLE rewrites, ...) - Account for WAL writes in the current vacuum costing logic, by accounting for it using a new cost parameter Then VACUUM would be throttled by the *minimum* of the two, which seems to make plenty sense to me, given the usecases. Greetings, Andres Freund
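For illustration, Andres's "throttled by the *minimum* of the two" behaviour can be sketched numerically. All names below are invented for this sketch, not PostgreSQL code: for work that emits some ratio of WAL bytes per byte of local I/O, whichever limit is reached first caps the effective rate.

```c
#include <assert.h>

/* Two independent limits, as proposed: a vacuum-style local I/O limit and
 * a WAL generation limit. A backend that respects both ends up running at
 * whichever rate binds first, i.e. the minimum of the two. */
typedef struct
{
    double io_limit;   /* local I/O budget, bytes/sec */
    double wal_limit;  /* WAL budget, bytes/sec */
} ThrottleLimits;

/* wal_per_io: WAL bytes emitted per byte of local I/O for this workload */
static double
effective_io_rate(const ThrottleLimits *lim, double wal_per_io)
{
    /* the local I/O rate at which the WAL limit would be hit */
    double capped_by_wal = lim->wal_limit / wal_per_io;

    return capped_by_wal < lim->io_limit ? capped_by_wal : lim->io_limit;
}
```

For a WAL-heavy rewrite the WAL limit binds; for read-mostly work the local I/O limit binds, which is the use-case split Andres describes.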
On 2/15/19 7:41 PM, Andres Freund wrote: > Hi, > > On 2019-02-15 08:50:03 -0500, Stephen Frost wrote: >> * Andres Freund (andres@anarazel.de) wrote: >>> On 2019-02-14 11:02:24 -0500, Stephen Frost wrote: >>>> On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut < >>>> peter.eisentraut@2ndquadrant.com> wrote: >>>>> On 14/02/2019 11:03, Tomas Vondra wrote: >>>>>> But if you add extra sleep() calls somewhere (say because there's also >>>>>> limit on WAL throughput), it will affect how fast VACUUM works in >>>>>> general. Yet it'll continue with the cost-based throttling, but it will >>>>>> never reach the limits. Say you do another 20ms sleep somewhere. >>>>>> Suddenly it means it only does 25 rounds/second, and the actual write >>>>>> limit drops to 4 MB/s. >>>>> >>>>> I think at a first approximation, you probably don't want to add WAL >>>>> delays to vacuum jobs, since they are already slowed down, so the rate >>>>> of WAL they produce might not be your first problem. The problem is >>>>> more things like CREATE INDEX CONCURRENTLY that run at full speed. >>>>> >>>>> That leads to an alternative idea of expanding the existing cost-based >>>>> vacuum delay system to other commands. >>>>> >>>>> We could even enhance the cost system by taking WAL into account as an >>>>> additional factor. >>>> >>>> This is really what I was thinking- let’s not have multiple independent >>>> ways of slowing down maintenance and similar jobs to reduce their impact on >>>> I/o to the heap and to WAL. >>> >>> I think that's a bad idea. Both because the current vacuum code is >>> *terrible* if you desire higher rates because both CPU and IO time >>> aren't taken into account. And it's extremely hard to control. And it >>> seems entirely valuable to be able to limit the amount of WAL generated >>> for replication, but still try go get the rest of the work done as >>> quickly as reasonably possible wrt local IO. 
>> >> I'm all for making improvements to the vacuum code and making it easier >> to control. >> >> I don't buy off on the argument that there is some way to segregate the >> local I/O question from the WAL when we're talking about these kinds of >> operations (VACUUM, CREATE INDEX, CLUSTER, etc) on logged relations, nor >> do I think we do our users a service by giving them independent knobs >> for both that will undoubtably end up making it more difficult to >> understand and control what's going on overall. >> >> Even here, it seems, you're arguing that the existing approach for >> VACUUM is hard to control; wouldn't adding another set of knobs for >> controlling the amount of WAL generated by VACUUM make that worse? I >> have a hard time seeing how it wouldn't. > > I think it's because I see them as, often, having two largely > independent use cases. If your goal is to avoid swamping replication > with WAL, you don't necessarily care about also throttling VACUUM > (or REINDEX, or CLUSTER, or ...)'s local IO. By forcing to combine > the two you just make the whole feature less usable. > I agree with that. > I think it'd not be insane to add two things: > - WAL write rate limiting, independent of the vacuum stuff. It'd also be > used by lots of other bulk commands (CREATE INDEX, ALTER TABLE > rewrites, ...) > - Account for WAL writes in the current vacuum costing logic, by > accounting for it using a new cost parameter > > Then VACUUM would be throttled by the *minimum* of the two, which seems > to make plenty sense to me, given the usecases. > Is it really minimum? If you add another cost parameter to the vacuum model, then there's almost no chance of actually reaching the limit because the budget (cost_limit) is shared with other stuff (local I/O). FWIW I do think the ability to throttle WAL is a useful feature, I just don't want to shoot myself in the foot by making other things worse. 
As you note, the existing VACUUM throttling is already hard to control, this seems to make it even harder. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On 2/15/19 7:41 PM, Andres Freund wrote: > > On 2019-02-15 08:50:03 -0500, Stephen Frost wrote: > >> * Andres Freund (andres@anarazel.de) wrote: > >>> On 2019-02-14 11:02:24 -0500, Stephen Frost wrote: > >>>> On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut < > >>>> peter.eisentraut@2ndquadrant.com> wrote: > >>>>> On 14/02/2019 11:03, Tomas Vondra wrote: > >>>>>> But if you add extra sleep() calls somewhere (say because there's also > >>>>>> limit on WAL throughput), it will affect how fast VACUUM works in > >>>>>> general. Yet it'll continue with the cost-based throttling, but it will > >>>>>> never reach the limits. Say you do another 20ms sleep somewhere. > >>>>>> Suddenly it means it only does 25 rounds/second, and the actual write > >>>>>> limit drops to 4 MB/s. > >>>>> > >>>>> I think at a first approximation, you probably don't want to add WAL > >>>>> delays to vacuum jobs, since they are already slowed down, so the rate > >>>>> of WAL they produce might not be your first problem. The problem is > >>>>> more things like CREATE INDEX CONCURRENTLY that run at full speed. > >>>>> > >>>>> That leads to an alternative idea of expanding the existing cost-based > >>>>> vacuum delay system to other commands. > >>>>> > >>>>> We could even enhance the cost system by taking WAL into account as an > >>>>> additional factor. > >>>> > >>>> This is really what I was thinking- let’s not have multiple independent > >>>> ways of slowing down maintenance and similar jobs to reduce their impact on > >>>> I/o to the heap and to WAL. > >>> > >>> I think that's a bad idea. Both because the current vacuum code is > >>> *terrible* if you desire higher rates because both CPU and IO time > >>> aren't taken into account. And it's extremely hard to control. 
And it > >>> seems entirely valuable to be able to limit the amount of WAL generated > >>> for replication, but still try go get the rest of the work done as > >>> quickly as reasonably possible wrt local IO. > >> > >> I'm all for making improvements to the vacuum code and making it easier > >> to control. > >> > >> I don't buy off on the argument that there is some way to segregate the > >> local I/O question from the WAL when we're talking about these kinds of > >> operations (VACUUM, CREATE INDEX, CLUSTER, etc) on logged relations, nor > >> do I think we do our users a service by giving them independent knobs > >> for both that will undoubtably end up making it more difficult to > >> understand and control what's going on overall. > >> > >> Even here, it seems, you're arguing that the existing approach for > >> VACUUM is hard to control; wouldn't adding another set of knobs for > >> controlling the amount of WAL generated by VACUUM make that worse? I > >> have a hard time seeing how it wouldn't. > > > > I think it's because I see them as, often, having two largely > > independent use cases. If your goal is to avoid swamping replication > > with WAL, you don't necessarily care about also throttling VACUUM > > (or REINDEX, or CLUSTER, or ...)'s local IO. By forcing to combine > > the two you just make the whole feature less usable. > > I agree with that. I can agree that they're different use-cases but one does end up impacting the other and that's what I had been thinking about from the perspective of "if we could proivde just one knob for this." VACUUM is a pretty good example- if we're dirtying a page with VACUUM then we're also writing that page into the WAL (at least, if the relation isn't unlogged). Now, VACUUM does do other things (such as read pages), as does REINDEX or CLUSTER, so maybe there's a way to think about this feature in those terms- cost for doing local read i/o, vs. cost for doing write i/o (to both heap and WAL) vs. 
cost for doing "local" write i/o (just to heap, ie: unlogged tables). What I was trying to say I didn't like previously was the idea of having a "local i/o write" cost for VACUUM, to a logged table, *and* a "WAL write" cost for VACUUM, since those are very tightly correlated. The current costing mechanism in VACUUM only provides the single hammer of "if we hit the limit, go to sleep for a while" which seems a bit unfortunate- if we haven't hit the "read i/o" limit, it'd be nice if we could keep going and then come back to writing out pages when enough time has passed that we're below our "write i/o" limit. That would end up requiring quite a bit of change to how we do things though, I expect, so probably not something to tie into this particular feature but I wanted to express the thought in case others found it interesting. > > I think it'd not be insane to add two things: > > - WAL write rate limiting, independent of the vacuum stuff. It'd also be > > used by lots of other bulk commands (CREATE INDEX, ALTER TABLE > > rewrites, ...) > > - Account for WAL writes in the current vacuum costing logic, by > > accounting for it using a new cost parameter > > > > Then VACUUM would be throttled by the *minimum* of the two, which seems > > to make plenty sense to me, given the usecases. > > Is it really minimum? If you add another cost parameter to the vacuum > model, then there's almost no chance of actually reaching the limit > because the budget (cost_limit) is shared with other stuff (local I/O). Yeah, that does seem like it'd be an issue. > FWIW I do think the ability to throttle WAL is a useful feature, I just > don't want to shoot myself in the foot by making other things worse. > > As you note, the existing VACUUM throttling is already hard to control, > this seems to make it even harder. Agreed. Thanks! Stephen
On Fri, Feb 15, 2019 at 1:42 PM Andres Freund <andres@anarazel.de> wrote: > I think it'd not be insane to add two things: > - WAL write rate limiting, independent of the vacuum stuff. It'd also be > used by lots of other bulk commands (CREATE INDEX, ALTER TABLE > rewrites, ...) > - Account for WAL writes in the current vacuum costing logic, by > accounting for it using a new cost parameter > > Then VACUUM would be throttled by the *minimum* of the two, which seems > to make plenty sense to me, given the usecases. Or maybe we should just blow up the current vacuum cost delay stuff and replace it with something that is easier to tune. For example, we could just have one parameter that sets the maximum read rate in kB/s and another that sets the maximum dirty-page rate in kB/s. Whichever limit is tighter binds. If we also have the thing that is the topic of this thread, that's a third possible upper limit. I really don't see much point in doubling down on the current vacuum cost delay logic. The overall idea is good, but the specific way that you have to set the parameters is pretty inscrutable, and I think we should just fix it so that it can be, uh, scruted. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2019-02-19 13:28:00 -0500, Robert Haas wrote: > On Fri, Feb 15, 2019 at 1:42 PM Andres Freund <andres@anarazel.de> wrote: > > I think it'd not be insane to add two things: > > - WAL write rate limiting, independent of the vacuum stuff. It'd also be > > used by lots of other bulk commands (CREATE INDEX, ALTER TABLE > > rewrites, ...) > > - Account for WAL writes in the current vacuum costing logic, by > > accounting for it using a new cost parameter > > > > Then VACUUM would be throttled by the *minimum* of the two, which seems > > to make plenty sense to me, given the usecases. > > Or maybe we should just blow up the current vacuum cost delay stuff > and replace it with something that is easier to tune. For example, we > could just have one parameter that sets the maximum read rate in kB/s > and another that sets the maximum dirty-page rate in kB/s. Whichever > limit is tighter binds. If we also have the thing that is the topic > of this thread, that's a third possible upper limit. > I really don't see much point in doubling down on the current vacuum > cost delay logic. The overall idea is good, but the specific way that > you have to set the parameters is pretty inscrutable, and I think we > should just fix it so that it can be, uh, scruted. I agree that that's something worthwhile to do, but given that the proposal in this thread is *NOT* just about VACUUM, I don't see why it'd be useful to tie a general WAL rate limiting to rewriting cost limiting of vacuum. It seems better to write the WAL rate limiting logic with an eye towards structuring it in a way that'd potentially allow reusing some of the code for a better VACUUM cost limiting. I still don't *AT ALL* buy Stephen and Tomas' argument that it'd be confusing that when both VACUUM and WAL cost limiting are active, the lower limit takes effect. Greetings, Andres Freund
On 2/19/19 7:35 PM, Andres Freund wrote: > Hi, > > On 2019-02-19 13:28:00 -0500, Robert Haas wrote: >> On Fri, Feb 15, 2019 at 1:42 PM Andres Freund <andres@anarazel.de> wrote: >>> I think it'd not be insane to add two things: >>> - WAL write rate limiting, independent of the vacuum stuff. It'd also be >>> used by lots of other bulk commands (CREATE INDEX, ALTER TABLE >>> rewrites, ...) >>> - Account for WAL writes in the current vacuum costing logic, by >>> accounting for it using a new cost parameter >>> >>> Then VACUUM would be throttled by the *minimum* of the two, which seems >>> to make plenty sense to me, given the usecases. >> >> Or maybe we should just blow up the current vacuum cost delay stuff >> and replace it with something that is easier to tune. For example, we >> could just have one parameter that sets the maximum read rate in kB/s >> and another that sets the maximum dirty-page rate in kB/s. Whichever >> limit is tighter binds. If we also have the thing that is the topic >> of this thread, that's a third possible upper limit. > >> I really don't see much point in doubling down on the current vacuum >> cost delay logic. The overall idea is good, but the specific way that >> you have to set the parameters is pretty inscrutable, and I think we >> should just fix it so that it can be, uh, scruted. > > I agree that that's something worthwhile to do, but given that the > proposal in this thread is *NOT* just about VACUUM, I don't see why it'd > be useful to tie a general WAL rate limiting to rewriting cost limiting > of vacuum. It seems better to write the WAL rate limiting logic with an > eye towards structuring it in a way that'd potentially allow reusing > some of the code for a better VACUUM cost limiting. > > I still don't *AT ALL* buy Stephen and Tomas' argument that it'd be > confusing that when both VACUUM and WAL cost limiting are active, the > lower limit takes effect. > Except that's not my argument. 
I'm not arguing against throttling once we hit the minimum of limits.

The problem I have with implementing a separate throttling logic is that it also changes the other limits (which are already kinda fuzzy). If you add sleeps somewhere, those will affect the throttling built into autovacuum (lowering them in some unknown way).

From this POV it would be better to include this into the vacuum cost limit, because then it's at least subject to the same budget.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 19, 2019 at 1:35 PM Andres Freund <andres@anarazel.de> wrote: > I still don't *AT ALL* buy Stephen and Tomas' argument that it'd be > confusing that when both VACUUM and WAL cost limiting are active, the > lower limit takes effect. I think you guys may all be in vigorous -- not to say mortal -- agreement. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2019-02-19 19:43:14 +0100, Tomas Vondra wrote: > > > On 2/19/19 7:35 PM, Andres Freund wrote: > > Hi, > > > > On 2019-02-19 13:28:00 -0500, Robert Haas wrote: > >> On Fri, Feb 15, 2019 at 1:42 PM Andres Freund <andres@anarazel.de> wrote: > >>> I think it'd not be insane to add two things: > >>> - WAL write rate limiting, independent of the vacuum stuff. It'd also be > >>> used by lots of other bulk commands (CREATE INDEX, ALTER TABLE > >>> rewrites, ...) > >>> - Account for WAL writes in the current vacuum costing logic, by > >>> accounting for it using a new cost parameter > >>> > >>> Then VACUUM would be throttled by the *minimum* of the two, which seems > >>> to make plenty sense to me, given the usecases. > >> > >> Or maybe we should just blow up the current vacuum cost delay stuff > >> and replace it with something that is easier to tune. For example, we > >> could just have one parameter that sets the maximum read rate in kB/s > >> and another that sets the maximum dirty-page rate in kB/s. Whichever > >> limit is tighter binds. If we also have the thing that is the topic > >> of this thread, that's a third possible upper limit. > > > >> I really don't see much point in doubling down on the current vacuum > >> cost delay logic. The overall idea is good, but the specific way that > >> you have to set the parameters is pretty inscrutable, and I think we > >> should just fix it so that it can be, uh, scruted. > > > > I agree that that's something worthwhile to do, but given that the > > proposal in this thread is *NOT* just about VACUUM, I don't see why it'd > > be useful to tie a general WAL rate limiting to rewriting cost limiting > > of vacuum. It seems better to write the WAL rate limiting logic with an > > eye towards structuring it in a way that'd potentially allow reusing > > some of the code for a better VACUUM cost limiting. 
> > > > I still don't *AT ALL* buy Stephen and Tomas' argument that it'd be > > confusing that when both VACUUM and WAL cost limiting are active, the > > lower limit takes effect. > > > > Except that's not my argument. I'm not arguing against throttling once > we hit the minimum of limits. > > The problem I have with implementing a separate throttling logic is that > it also changes the other limits (which are already kinda fuzzy). If you > add sleeps somewhere, those will affects the throttling built into > autovacuum (lowering them in some unknown way). Those two paragraphs, to me, flat out contradict each other. If you throttle according to the lower of two limits, *of course* the throttling for the higher limit is affected in the sense that the limit won't be reached. So yea, if you have a WAL rate limit, and you reach it, you won't be able to predict the IO rate for vacuum. Duh? I must be missing something here. Greetings, Andres Freund
On 2/19/19 7:50 PM, Andres Freund wrote: > On 2019-02-19 19:43:14 +0100, Tomas Vondra wrote: >> >> >> On 2/19/19 7:35 PM, Andres Freund wrote: >>> Hi, >>> >>> On 2019-02-19 13:28:00 -0500, Robert Haas wrote: >>>> On Fri, Feb 15, 2019 at 1:42 PM Andres Freund <andres@anarazel.de> wrote: >>>>> I think it'd not be insane to add two things: >>>>> - WAL write rate limiting, independent of the vacuum stuff. It'd also be >>>>> used by lots of other bulk commands (CREATE INDEX, ALTER TABLE >>>>> rewrites, ...) >>>>> - Account for WAL writes in the current vacuum costing logic, by >>>>> accounting for it using a new cost parameter >>>>> >>>>> Then VACUUM would be throttled by the *minimum* of the two, which seems >>>>> to make plenty sense to me, given the usecases. >>>> >>>> Or maybe we should just blow up the current vacuum cost delay stuff >>>> and replace it with something that is easier to tune. For example, we >>>> could just have one parameter that sets the maximum read rate in kB/s >>>> and another that sets the maximum dirty-page rate in kB/s. Whichever >>>> limit is tighter binds. If we also have the thing that is the topic >>>> of this thread, that's a third possible upper limit. >>> >>>> I really don't see much point in doubling down on the current vacuum >>>> cost delay logic. The overall idea is good, but the specific way that >>>> you have to set the parameters is pretty inscrutable, and I think we >>>> should just fix it so that it can be, uh, scruted. >>> >>> I agree that that's something worthwhile to do, but given that the >>> proposal in this thread is *NOT* just about VACUUM, I don't see why it'd >>> be useful to tie a general WAL rate limiting to rewriting cost limiting >>> of vacuum. It seems better to write the WAL rate limiting logic with an >>> eye towards structuring it in a way that'd potentially allow reusing >>> some of the code for a better VACUUM cost limiting. 
>>> I still don't *AT ALL* buy Stephen and Tomas' argument that it'd be
>>> confusing that when both VACUUM and WAL cost limiting are active, the
>>> lower limit takes effect.
>>
>> Except that's not my argument. I'm not arguing against throttling once
>> we hit the minimum of limits.
>>
>> The problem I have with implementing a separate throttling logic is that
>> it also changes the other limits (which are already kinda fuzzy). If you
>> add sleeps somewhere, those will affects the throttling built into
>> autovacuum (lowering them in some unknown way).
>
> Those two paragraphs, to me, flat out contradict each other. If you
> throttle according to the lower of two limits, *of course* the
> throttling for the higher limit is affected in the sense that the limit
> won't be reached. So yea, if you have a WAL rate limit, and you reach
> it, you won't be able to predict the IO rate for vacuum. Duh? I must be
> missing something here.

Let's do a short example. Assume the default vacuum costing parameters

    vacuum_cost_limit = 200
    vacuum_cost_delay = 20ms
    cost_page_dirty = 20

and for simplicity we only do writes. So vacuum can do ~8MB/s of writes.

Now, let's also throttle based on WAL - once in a while, after producing some amount of WAL we sleep for a while. Again, for simplicity let's assume the sleeps perfectly interleave and are also 20ms. So we have something like:

    sleep(20ms); -- vacuum
    sleep(20ms); -- WAL
    sleep(20ms); -- vacuum
    sleep(20ms); -- WAL
    sleep(20ms); -- vacuum
    sleep(20ms); -- WAL
    sleep(20ms); -- vacuum
    sleep(20ms); -- WAL

Suddenly, we only reach 4MB/s of writes from vacuum. But we also reach only 1/2 the WAL throughput, because it's affected exactly the same way by the sleeps from vacuum throttling.

We've not reached either of the limits. How exactly is this "lower limit takes effect"?

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
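Tomas's back-of-the-envelope numbers can be checked with a toy model. Everything here is illustrative only: the bytes-per-round constant is chosen to match the ~8MB/s figure in the example, not derived from real vacuum behaviour. Each cost round does a fixed amount of work and then sleeps, so a second, interleaved 20ms sleep per round doubles the round time and halves the achieved rate without either configured limit ever being reached.

```c
#include <assert.h>

/* One vacuum "cost round" writes a fixed amount and then sleeps 20ms.
 * With the example's parameters that's ~160kB per round, i.e. ~8MB/s
 * (work time treated as negligible). If a naive WAL throttle adds its
 * own 20ms sleep per round, each round takes 40ms and the rate halves. */
#define ROUND_BYTES (160LL * 1024)

static long long
bytes_per_sec(long long sleep_us_per_round)
{
    return ROUND_BYTES * 1000000LL / sleep_us_per_round;
}
```

With only the vacuum sleep the model gives ~8MB/s; with the interleaved WAL sleep it drops to ~4MB/s, matching the example.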
On 2/19/19 7:28 PM, Robert Haas wrote: > On Fri, Feb 15, 2019 at 1:42 PM Andres Freund <andres@anarazel.de> wrote: >> I think it'd not be insane to add two things: >> - WAL write rate limiting, independent of the vacuum stuff. It'd also be >> used by lots of other bulk commands (CREATE INDEX, ALTER TABLE >> rewrites, ...) >> - Account for WAL writes in the current vacuum costing logic, by >> accounting for it using a new cost parameter >> >> Then VACUUM would be throttled by the *minimum* of the two, which seems >> to make plenty sense to me, given the usecases. > > Or maybe we should just blow up the current vacuum cost delay stuff > and replace it with something that is easier to tune. For example, we > could just have one parameter that sets the maximum read rate in kB/s > and another that sets the maximum dirty-page rate in kB/s. Whichever > limit is tighter binds. If we also have the thing that is the topic > of this thread, that's a third possible upper limit. > > I really don't see much point in doubling down on the current vacuum > cost delay logic. The overall idea is good, but the specific way that > you have to set the parameters is pretty inscrutable, and I think we > should just fix it so that it can be, uh, scruted. > I think changing the vacuum throttling so that it uses actual I/O amounts (in kB/s) instead of the cost limit would be a step in the right directly. It's clearer, and it also works with arbitrary page sizes. Then, instead of sleeping for a fixed amount of time after reaching the cost limit, we should track progress and compute the amount of time we actually need to sleep. AFAICS that's what spread checkpoints do. I'm sure it's bound to be trickier in practice, of course. FWIW I'm not entirely sure we want to fully separate the limits. I'd argue using the same vacuum budget for both reads and writes is actually the right thing to do - the I/O likely hits the same device. So it makes sense to "balance" those two in some way. 
But that may or may not be the case for WAL, which is often moved to a different device. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
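Tomas's "track progress and compute the amount of time we actually need to sleep" idea (as spread checkpoints do) might look roughly like this. The function and its interface are invented for illustration, not taken from PostgreSQL:

```c
#include <assert.h>

/* Given how much we have written since starting and how much wall-clock
 * time has elapsed, sleep only as long as needed to fall back onto the
 * target rate -- no fixed-length sleeps. */
static double
sleep_needed(double bytes_written, double elapsed_sec,
             double target_bytes_per_sec)
{
    /* the time by which this much work "should" have completed */
    double due = bytes_written / target_bytes_per_sec;

    return due > elapsed_sec ? due - elapsed_sec : 0.0;
}
```

If the backend is ahead of the target rate it sleeps exactly the difference; if it is behind (e.g. because some other throttle already slowed it down), it doesn't sleep at all, which avoids the compounding-sleeps problem from the earlier example.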
On 2019-02-19 20:02:32 +0100, Tomas Vondra wrote: > Let's do a short example. Assume the default vacuum costing parameters > > vacuum_cost_limit = 200 > vacuum_cost_delay = 20ms > cost_page_dirty = 20 > > and for simplicity we only do writes. So vacuum can do ~8MB/s of writes. > > Now, let's also throttle based on WAL - once in a while, after producing > some amount of WAL we sleep for a while. Again, for simplicity let's > assume the sleeps perfectly interleave and are also 20ms. So we have > something like: > sleep(20ms); -- vacuum > sleep(20ms); -- WAL > sleep(20ms); -- vacuum > sleep(20ms); -- WAL > sleep(20ms); -- vacuum > sleep(20ms); -- WAL > sleep(20ms); -- vacuum > sleep(20ms); -- WAL > > Suddenly, we only reach 4MB/s of writes from vacuum. But we also reach > only 1/2 the WAL throughput, because it's affected exactly the same way > by the sleeps from vacuum throttling. > > We've not reached either of the limits. How exactly is this "lower limit > takes effect"? Because I upthread said that that's not how I think a sane implementation of WAL throttling would work. I think the whole cost budgeting approach is BAD, and it'd be serious mistake to copy it for a WAL rate limit (it disregards the time taken to execute IO and CPU costs etc, and in this case the cost of other bandwidth limitations). What I'm saying is that we ought to instead specify an WAL rate in bytes/sec and *only* sleep once we've exceeded it for a time period (with some optimizations, so we don't gettimeofday after every XLogInsert(), but instead compute how many bytes later need to re-determine the time to see if we're still in the same 'granule'). 
Now, a non-toy implementation would probably want to have a sliding window to avoid being overly bursty, and reduce the number of gettimeofday calls as mentioned above, but for explanation's sake basically imagine that the "main loop" of a bulk xlog-emitting command would invoke a helper with a computation in pseudocode like:

    current_time = gettimeofday();
    if (same_second(current_time, last_time))
    {
        wal_written_in_second += new_wal_written;
        if (wal_written_in_second >= wal_write_limit_per_second)
        {
            double too_much = (wal_written_in_second - wal_write_limit_per_second);

            sleep_fractional_seconds(too_much / wal_written_in_second);
            last_time = current_time;
        }
    }
    else
    {
        last_time = current_time;
    }

which'd mean that, in contrast to your example, we'd not continually sleep for WAL; we'd only do so if we actually exceeded (or are projected to exceed, in a smarter implementation) the specified WAL write rate. As the 20ms sleeps from vacuum effectively reduce the WAL write rate, we'd correspondingly sleep less.

And my main point is that even if you implement a proper bytes/sec limit ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get meaningfully more confusing than right now.

Greetings,

Andres Freund
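Fleshed out, the pseudocode in Andres's message might behave like the sketch below. All names are invented, the clock is caller-supplied so the logic is testable, and the budget reset after a sleep is an assumption the pseudocode leaves implicit:

```c
#include <assert.h>

typedef struct
{
    long long limit_per_sec;     /* wal_write_limit_per_second, in bytes */
    long long written_this_sec;  /* WAL bytes seen in the current granule */
    long long last_sec;          /* granule identifier (whole seconds) */
    double    slept;             /* total simulated sleep, in seconds */
} WalRateLimit;

/* Account for new WAL; once the current one-second granule exceeds the
 * limit, "sleep" for the over-budget fraction of the second and start a
 * fresh granule (the reset is our assumption, not in the pseudocode). */
static void
wal_limit_account(WalRateLimit *rl, long long now_sec,
                  long long new_wal_written)
{
    if (now_sec == rl->last_sec)
    {
        rl->written_this_sec += new_wal_written;
        if (rl->written_this_sec >= rl->limit_per_sec)
        {
            double too_much =
                (double) (rl->written_this_sec - rl->limit_per_sec);

            rl->slept += too_much / (double) rl->written_this_sec;
            rl->last_sec = now_sec;
            rl->written_this_sec = 0;
        }
    }
    else
    {
        /* new granule: restart the budget */
        rl->last_sec = now_sec;
        rl->written_this_sec = new_wal_written;
    }
}
```

Note the key property Andres describes: a backend that writes under the limit in a given second (for instance because vacuum sleeps already slowed it down) never sleeps here at all.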
On 2/19/19 8:22 PM, Andres Freund wrote: > On 2019-02-19 20:02:32 +0100, Tomas Vondra wrote: >> Let's do a short example. Assume the default vacuum costing parameters >> >> vacuum_cost_limit = 200 >> vacuum_cost_delay = 20ms >> cost_page_dirty = 20 >> >> and for simplicity we only do writes. So vacuum can do ~8MB/s of writes. >> >> Now, let's also throttle based on WAL - once in a while, after producing >> some amount of WAL we sleep for a while. Again, for simplicity let's >> assume the sleeps perfectly interleave and are also 20ms. So we have >> something like: > >> sleep(20ms); -- vacuum >> sleep(20ms); -- WAL >> sleep(20ms); -- vacuum >> sleep(20ms); -- WAL >> sleep(20ms); -- vacuum >> sleep(20ms); -- WAL >> sleep(20ms); -- vacuum >> sleep(20ms); -- WAL >> >> Suddenly, we only reach 4MB/s of writes from vacuum. But we also reach >> only 1/2 the WAL throughput, because it's affected exactly the same way >> by the sleeps from vacuum throttling. >> >> We've not reached either of the limits. How exactly is this "lower limit >> takes effect"? > > Because I upthread said that that's not how I think a sane > implementation of WAL throttling would work. I think the whole cost > budgeting approach is BAD, and it'd be serious mistake to copy it for a > WAL rate limit (it disregards the time taken to execute IO and CPU costs > etc, and in this case the cost of other bandwidth limitations). What > I'm saying is that we ought to instead specify an WAL rate in bytes/sec > and *only* sleep once we've exceeded it for a time period (with some > optimizations, so we don't gettimeofday after every XLogInsert(), but > instead compute how many bytes later need to re-determine the time to > see if we're still in the same 'granule'). > OK, I agree with that. That's mostly what I described in response to Robert a while ago, I think. (If you've described that earlier in the thread, I missed it.) 
> Now, a non-toy implementation would probably want to have a sliding
> window to avoid being overly bursty, and reduce the number of
> gettimeofday calls as mentioned above, but for explanation's sake
> basically imagine that the "main loop" of a bulk xlog-emitting command
> would invoke a helper with a computation in pseudocode like:
>
> current_time = gettimeofday();
> if (same_second(current_time, last_time))
> {
>     wal_written_in_second += new_wal_written;
>     if (wal_written_in_second >= wal_write_limit_per_second)
>     {
>         double too_much = (wal_written_in_second - wal_write_limit_per_second);
>
>         sleep_fractional_seconds(too_much / wal_written_in_second);
>         last_time = current_time;
>     }
> }
> else
> {
>     last_time = current_time;
> }
>
> which'd mean that, in contrast to your example, we'd not continually
> sleep for WAL; we'd only do so if we actually exceeded (or are projected
> to exceed, in a smarter implementation) the specified WAL write rate. As
> the 20ms sleeps from vacuum effectively reduce the WAL write rate, we'd
> correspondingly sleep less.

Yes, that makes sense.

> And my main point is that even if you implement a proper bytes/sec limit
> ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get
> meaningfully more confusing than right now.

So, why not modify autovacuum to also use this approach? I wonder if the situation there is more complicated because of multiple workers sharing the same budget ...

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2019-02-19 20:34:25 +0100, Tomas Vondra wrote:
> On 2/19/19 8:22 PM, Andres Freund wrote:
> > And my main point is that even if you implement a proper bytes/sec limit
> > ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get
> > meaningfully more confusing than right now.
>
> So, why not modify autovacuum to also use this approach? I wonder if
> the situation there is more complicated because of multiple workers
> sharing the same budget ...

I think the main reason is that implementing a scheme like this for WAL
rate limiting isn't a small task, but it'd be aided by the fact that
it'd probably not be on by default, and that there'd not be any
regressions because the behaviour didn't exist before. In contrast,
people are extremely sensitive to autovacuum behaviour changes, even if
it's to improve autovacuum. I think it makes more sense to build the
logic in a lower-profile case first, and then migrate autovacuum over to
it. Even leaving the maturity issue aside, reducing the scope of the
project into more bite-sized chunks seems to increase the likelihood of
getting anything done.

Greetings,

Andres Freund
On Wed, 20 Feb 2019 at 07:28, Robert Haas <robertmhaas@gmail.com> wrote:
> Or maybe we should just blow up the current vacuum cost delay stuff
> and replace it with something that is easier to tune. For example, we
> could just have one parameter that sets the maximum read rate in kB/s
> and another that sets the maximum dirty-page rate in kB/s. Whichever
> limit is tighter binds. If we also have the thing that is the topic
> of this thread, that's a third possible upper limit.

I had similar thoughts when I saw that Peter's proposal didn't seem all
that compatible with how the vacuum cost delays work today. I agree the
cost limit would have to turn into something time-based rather than
points-based.

To me, it seems just too crude to have a per-backend limit. I think
global "soft" limits would be better. Let's say, for example, the DBA
would like to CREATE INDEX CONCURRENTLY on a 6TB table. They think this
is going to take about 36 hours, so they start the operation at the
start of off-peak, which is expected to last 12 hours. This means the
create index is going to run for 2 off-peaks and 1 on-peak. Must they
really configure the create index to run at a speed that is suitable
for running at peak load? That's pretty wasteful, as surely it could
run much more quickly during the off-peak.

I know there's debate as to whether we can rate limit WAL, but, if we
can find a way to do that, then it seems to me some settings like:

max_soft_global_wal_rate (MB/sec)
min_hard_local_wal_rate (MB/sec)

That way the rate-limited process would slow down to
min_hard_local_wal_rate when the WAL rate of all processes is exceeding
max_soft_global_wal_rate. The min_hard_local_wal_rate is just there to
ensure the process never stops completely. It can simply tick along at
that rate until the global WAL rate slows down again.
It's likely going to be easier to do something like that in WAL than with buffer read/write/dirties since we already can easily see how much WAL has been written by looking at the current LSN. (Of course, these GUC names are not very good. I just picked them out the air quickly to try and get their meaning across) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
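To make the interaction of the two hypothetical GUCs above concrete, a toy arbitration might look like this: a throttled backend may use whatever headroom is left under the global soft limit, but never less than its hard floor. The function name and the way "rate of others" is obtained are assumptions for illustration only:

```c
#include <assert.h>

/*
 * Toy arbitration for the soft-global / hard-local idea above
 * (the GUC names in the mail are explicitly hypothetical).  Given
 * the combined WAL write rate of all *other* backends, return the
 * rate this throttled backend may use: the headroom left under the
 * global soft limit, but never less than its local hard floor.
 */
double
allowed_wal_rate(double soft_global_limit, double hard_local_floor,
                 double rate_of_others)
{
    double headroom = soft_global_limit - rate_of_others;

    /* off-peak: use the spare capacity; peak: fall back to the floor */
    return (headroom > hard_local_floor) ? headroom : hard_local_floor;
}
```

This matches the behaviour described above: the CREATE INDEX speeds up automatically during the off-peak window and drops to a slow tick when the system is busy, without the DBA sizing it for peak load.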
On 2/19/19 8:40 PM, Andres Freund wrote:
> Hi,
>
> On 2019-02-19 20:34:25 +0100, Tomas Vondra wrote:
>> On 2/19/19 8:22 PM, Andres Freund wrote:
>>> And my main point is that even if you implement a proper bytes/sec limit
>>> ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get
>>> meaningfully more confusing than right now.
>>
>> So, why not modify autovacuum to also use this approach? I wonder if
>> the situation there is more complicated because of multiple workers
>> sharing the same budget ...
>
> I think the main reason is that implementing a scheme like this for WAL
> rate limiting isn't a small task, but it'd be aided by the fact that
> it'd probably not be on by default, and that there'd not be any
> regressions because the behaviour didn't exist before. In contrast,
> people are extremely sensitive to autovacuum behaviour changes, even if
> it's to improve autovacuum. I think it makes more sense to build the
> logic in a lower-profile case first, and then migrate autovacuum over to
> it. Even leaving the maturity issue aside, reducing the scope of the
> project into more bite-sized chunks seems to increase the likelihood of
> getting anything done.

Maybe. I guess the main thing I'm advocating for here is to aim for a
unified throttling approach, not multiple disparate approaches
interacting in ways that are hard to understand/predict.

The time-based approach you described looks fine, and it's kinda what I
was imagining (and not unlike the checkpoint throttling). I don't think
it'd be that hard to tweak autovacuum to use it too, but I admit I have
not thought about it particularly hard and there's stuff like per-table
settings which might make it more complex.

So maybe doing it for WAL first makes sense. But I still think we need
to have at least a rough idea how it interacts with the existing
throttling and how it will work in the end.
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On 2/19/19 8:40 PM, Andres Freund wrote: > > On 2019-02-19 20:34:25 +0100, Tomas Vondra wrote: > >> On 2/19/19 8:22 PM, Andres Freund wrote: > >>> And my main point is that even if you implement a proper bytes/sec limit > >>> ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get > >>> meaningfully more confusing than right now. > >> > >> So, why not to modify autovacuum to also use this approach? I wonder if > >> the situation there is more complicated because of multiple workers > >> sharing the same budget ... > > > > I think the main reason is that implementing a scheme like this for WAL > > rate limiting isn't a small task, but it'd be aided by the fact that > > it'd probably not on by default, and that there'd not be any regressions > > because the behaviour didn't exist before. I contrast, people are > > extremely sensitive to autovacuum behaviour changes, even if it's to > > improve autovacuum. I think it makes more sense to build the logic in a > > lower profile case first, and then migrate autovacuum over it. Even > > leaving the maturity issue aside, reducing the scope of the project into > > more bite sized chunks seems to increase the likelihood of getting > > anything substantially. > > Maybe. I concur with that 'maybe'. :) > I guess the main thing I'm advocating for here is to aim for a unified > throttling approach, not multiple disparate approaches interacting in > ways that are hard to understand/predict. Yes, agreed. > The time-based approach you described looks fine, an it's kinda what I > was imagining (and not unlike the checkpoint throttling). I don't think > it'd be that hard to tweak autovacuum to use it too, but I admit I have > not thought about it particularly hard and there's stuff like per-table > settings which might make it more complex. 
When reading Andres' proposal, I was heavily reminded of how checkpoint
throttling is handled and wondered if there might be some way to reuse
or generalize that existing code/technique/etc and make it available to
be used for WAL, and more-or-less any/every other bulk operation
(CREATE INDEX, REINDEX, CLUSTER, VACUUM...).

> So maybe doing it for WAL first makes sense. But I still think we need
> to have at least a rough idea how it interacts with the existing
> throttling and how it will work in the end.

Well, it seems like Andres explained how it'll work with the existing
throttling, no? As for how all of this will work in the end, that's a
good question but also a rather difficult one to answer, I suspect.

Just to share a few additional thoughts after pondering this for a
while: the comment Andres made up-thread really struck a chord- we
don't necessarily want to throttle anything, what we'd really rather do
is *prioritize* things, whereby foreground work (regular queries and
such) has a higher priority than background/bulk work (VACUUM, REINDEX,
etc) but otherwise we use the system to its full capacity. We don't
actually want to throttle a VACUUM run any more than a CREATE INDEX, we
just don't want those to hurt the performance of regular queries that
are happening.

The other thought I had was that doing things on a per-table basis, in
particular, isn't really addressing the resource question appropriately.
WAL is relatively straightforward and independent as a resource from
the IO for the heap/indexes, so getting an idea from the admin of how
much capacity they have for WAL makes sense. When it comes to the
capacity for the heap/indexes, in terms of IO, that really goes to the
underlying storage system/channel, which would actually be a tablespace
in properly set up environments (imv anyway).
Wrapping this up- it seems unlikely that we're going to get a priority-based system in place any time particularly soon but I do think it's worthy of some serious consideration and discussion about how we might be able to get there. On the other hand, if we can provide a way for the admin to say "these are my IO channels (per-tablespace values, plus a value for WAL), here's what their capacity is, and here's how much buffer for foreground work I want to have (again, per IO channel), so, PG, please arrange to not use more than 'capacity-buffer' amount of resources for background/bulk tasks (per IO channel)" then we can at least help them address the issue that foreground tasks are being stalled or delayed due to background/bulk work. This approach means that they won't be utilizing the system to its full capacity, but they'll know that and they'll know that it's because, for them, it's more important that they have that low latency for foreground tasks. Thanks! Stephen
On 2/20/19 10:43 PM, Stephen Frost wrote: > Greetings, > ... >> >> So maybe doing it for WAL first makes sense. But I still think we need >> to have at least a rough idea how it interacts with the existing >> throttling and how it will work in the end. > > Well, it seems like Andres explained how it'll work with the existing > throttling, no? As for how all of this will work in the end, that's a > good question but also a rather difficult one to answer, I suspect. > Well ... he explained how to do WAL throttling, and I agree what he proposed seems entirely sane to me. But when it comes to interactions with current vacuum cost-based throttling, he claims it does not get meaningfully more confusing due to interactions with WAL throttling. I don't quite agree with that, but I'm not going to beat this horse any longer ... > Just to share a few additional thoughts after pondering this for a > while, but the comment Andres made up-thread really struck a chord- we > don't necessairly want to throttle anything, what we'd really rather do > is *prioritize* things, whereby foreground work (regular queries and > such) have a higher priority than background/bulk work (VACUUM, REINDEX, > etc) but otherwise we use the system to its full capacity. We don't > actually want to throttle a VACUUM run any more than a CREATE INDEX, we > just don't want those to hurt the performance of regular queries that > are happening. > I think you're forgetting the motivation of this very patch was to prevent replication lag caused by a command generating large amounts of WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to do with prioritization or foreground/background split. I'm not arguing against ability to prioritize stuff, but I disagree it somehow replaces throttling. > The other thought I had was that doing things on a per-table basis, in > particular, isn't really addressing the resource question appropriately. 
> WAL is relatively straight-forward and independent of a resource from
> the IO for the heap/indexes, so getting an idea from the admin of how
> much capacity they have for WAL makes sense. When it comes to the
> capacity for the heap/indexes, in terms of IO, that really goes to the
> underlying storage system/channel, which would actually be a tablespace
> in properly set up environments (imv anyway).
>
> Wrapping this up- it seems unlikely that we're going to get a
> priority-based system in place any time particularly soon but I do think
> it's worthy of some serious consideration and discussion about how we
> might be able to get there. On the other hand, if we can provide a way
> for the admin to say "these are my IO channels (per-tablespace values,
> plus a value for WAL), here's what their capacity is, and here's how
> much buffer for foreground work I want to have (again, per IO channel),
> so, PG, please arrange to not use more than 'capacity-buffer' amount of
> resources for background/bulk tasks (per IO channel)" then we can at
> least help them address the issue that foreground tasks are being
> stalled or delayed due to background/bulk work. This approach means
> that they won't be utilizing the system to its full capacity, but
> they'll know that and they'll know that it's because, for them, it's
> more important that they have that low latency for foreground tasks.

I think it's a mostly orthogonal feature to throttling.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings, * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > On 2/20/19 10:43 PM, Stephen Frost wrote: > > Just to share a few additional thoughts after pondering this for a > > while, but the comment Andres made up-thread really struck a chord- we > > don't necessairly want to throttle anything, what we'd really rather do > > is *prioritize* things, whereby foreground work (regular queries and > > such) have a higher priority than background/bulk work (VACUUM, REINDEX, > > etc) but otherwise we use the system to its full capacity. We don't > > actually want to throttle a VACUUM run any more than a CREATE INDEX, we > > just don't want those to hurt the performance of regular queries that > > are happening. > > I think you're forgetting the motivation of this very patch was to > prevent replication lag caused by a command generating large amounts of > WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to > do with prioritization or foreground/background split. > > I'm not arguing against ability to prioritize stuff, but I disagree it > somehow replaces throttling. Why is replication lag an issue though? I would contend it's an issue because with sync replication, it makes foreground processes wait, and with async replication, it makes the actions of foreground processes show up late on the replicas. If the actual WAL records for the foreground processes got priority and were pushed out earlier than the background ones, that would eliminate both of those issues with replication lag. Perhaps there's other issues that replication lag cause but which aren't solved by prioritizing the actual WAL records that you care about getting to the replicas faster, but if so, I'd like to hear what those are. > > The other thought I had was that doing things on a per-table basis, in > > particular, isn't really addressing the resource question appropriately. 
> > WAL is relatively straight-forward and independent of a resource from > > the IO for the heap/indexes, so getting an idea from the admin of how > > much capacity they have for WAL makes sense. When it comes to the > > capacity for the heap/indexes, in terms of IO, that really goes to the > > underlying storage system/channel, which would actually be a tablespace > > in properly set up environments (imv anyway). > > > > Wrapping this up- it seems unlikely that we're going to get a > > priority-based system in place any time particularly soon but I do think > > it's worthy of some serious consideration and discussion about how we > > might be able to get there. On the other hand, if we can provide a way > > for the admin to say "these are my IO channels (per-tablespace values, > > plus a value for WAL), here's what their capacity is, and here's how > > much buffer for foreground work I want to have (again, per IO channel), > > so, PG, please arrange to not use more than 'capacity-buffer' amount of > > resources for background/bulk tasks (per IO channel)" then we can at > > least help them address the issue that foreground tasks are being > > stalled or delayed due to background/bulk work. This approach means > > that they won't be utilizing the system to its full capacity, but > > they'll know that and they'll know that it's because, for them, it's > > more important that they have that low latency for foreground tasks. > > I think it's mostly orthogonal feature to throttling. I'm... not sure that what I was getting at above really got across. What I was saying above, in a nutshell, is that if we're going to provide throttling then we should give users a way to configure the throttling on a per-IO-channel basis, which means at the tablespace level, plus an independent configuration option for WAL since we allow that to be placed elsewhere too. 
Ideally, the configuration parameter would be in the same units as the actual resource is too- which would probably be IOPS+bandwidth, really. Just doing it in terms of bandwidth ends up being a bit of a mismatch as compared to reality, and would mean that users would have to tune it down farther than they might otherwise and therefore give up that much more in terms of system capability. Thanks! Stephen
Hi, On 2019-02-20 18:46:09 -0500, Stephen Frost wrote: > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > > On 2/20/19 10:43 PM, Stephen Frost wrote: > > > Just to share a few additional thoughts after pondering this for a > > > while, but the comment Andres made up-thread really struck a chord- we > > > don't necessairly want to throttle anything, what we'd really rather do > > > is *prioritize* things, whereby foreground work (regular queries and > > > such) have a higher priority than background/bulk work (VACUUM, REINDEX, > > > etc) but otherwise we use the system to its full capacity. We don't > > > actually want to throttle a VACUUM run any more than a CREATE INDEX, we > > > just don't want those to hurt the performance of regular queries that > > > are happening. > > > > I think you're forgetting the motivation of this very patch was to > > prevent replication lag caused by a command generating large amounts of > > WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to > > do with prioritization or foreground/background split. > > > > I'm not arguing against ability to prioritize stuff, but I disagree it > > somehow replaces throttling. > > Why is replication lag an issue though? I would contend it's an issue > because with sync replication, it makes foreground processes wait, and > with async replication, it makes the actions of foreground processes > show up late on the replicas. I think reaching the bandwidth limit of either the replication stream, or of the startup process is actually more common than these. And for that prioritization doesn't help, unless it somehow reduces the total amount of WAL. > If the actual WAL records for the foreground processes got priority and > were pushed out earlier than the background ones, that would eliminate > both of those issues with replication lag. 
Perhaps there's other issues > that replication lag cause but which aren't solved by prioritizing the > actual WAL records that you care about getting to the replicas faster, > but if so, I'd like to hear what those are. Wait, what? Are you actually suggesting that different sources of WAL records should be streamed out in different order? You're blowing a somewhat reasonably doable project up into "let's rewrite a large chunk of all of the durability layer in postgres". Stephen, we gotta stop blowing up projects into something that can't ever realistically be finished. Greetings, Andres Freund
Greetings, * Andres Freund (andres@anarazel.de) wrote: > On 2019-02-20 18:46:09 -0500, Stephen Frost wrote: > > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: > > > On 2/20/19 10:43 PM, Stephen Frost wrote: > > > > Just to share a few additional thoughts after pondering this for a > > > > while, but the comment Andres made up-thread really struck a chord- we > > > > don't necessairly want to throttle anything, what we'd really rather do > > > > is *prioritize* things, whereby foreground work (regular queries and > > > > such) have a higher priority than background/bulk work (VACUUM, REINDEX, > > > > etc) but otherwise we use the system to its full capacity. We don't > > > > actually want to throttle a VACUUM run any more than a CREATE INDEX, we > > > > just don't want those to hurt the performance of regular queries that > > > > are happening. > > > > > > I think you're forgetting the motivation of this very patch was to > > > prevent replication lag caused by a command generating large amounts of > > > WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to > > > do with prioritization or foreground/background split. > > > > > > I'm not arguing against ability to prioritize stuff, but I disagree it > > > somehow replaces throttling. > > > > Why is replication lag an issue though? I would contend it's an issue > > because with sync replication, it makes foreground processes wait, and > > with async replication, it makes the actions of foreground processes > > show up late on the replicas. > > I think reaching the bandwidth limit of either the replication stream, > or of the startup process is actually more common than these. And for > that prioritization doesn't help, unless it somehow reduces the total > amount of WAL. The issue with hitting those bandwidth limits is that you end up with queues outside of your control and therefore are unable to prioritize the data going through them. 
I agree, that's an issue and it might be necessary to ask the admin to provide what the bandwidth limit is, so that we could then avoid running into issues with downstream queues that are outside of our control causing unexpected/unacceptable lag. > > If the actual WAL records for the foreground processes got priority and > > were pushed out earlier than the background ones, that would eliminate > > both of those issues with replication lag. Perhaps there's other issues > > that replication lag cause but which aren't solved by prioritizing the > > actual WAL records that you care about getting to the replicas faster, > > but if so, I'd like to hear what those are. > > Wait, what? Are you actually suggesting that different sources of WAL > records should be streamed out in different order? You're blowing a > somewhat reasonably doable project up into "let's rewrite a large chunk > of all of the durability layer in postgres". > > Stephen, we gotta stop blowing up projects into something that can't > ever realistically be finished. I started this sub-thread specifically the way I did because I was trying to make it clear that these were just ideas for possible discussion- I'm *not* suggesting, nor saying, that we have to go implement this right now instead of implementing the throttling that started this thread. I'm also, to be clear, not objecting to implementing the throttling discussed (though, as mentioned but seemingly ignored, I'd see it maybe configurable in different ways than originally suggested). 
If there's a way I can get that across more clearly than saying "Just to share a few additional thoughts", I'm happy to try and do so, but I don't agree that I should be required to simply keep such thoughts to myself; indeed, I'll admit that I don't know how large a project this would actually be and while I figured it'd be *huge*, I wanted to share the thought in case someone might see a way that we could implement it with much less work and have a better solution as a result. Thanks! Stephen
On Thu, Feb 21, 2019 at 2:20 AM Stephen Frost <sfrost@snowman.net> wrote:
* Andres Freund (andres@anarazel.de) wrote:
> On 2019-02-20 18:46:09 -0500, Stephen Frost wrote:
> > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> > > On 2/20/19 10:43 PM, Stephen Frost wrote:
> > > > Just to share a few additional thoughts after pondering this for a
> > > > while, but the comment Andres made up-thread really struck a chord- we
> > > > don't necessairly want to throttle anything, what we'd really rather do
> > > > is *prioritize* things, whereby foreground work (regular queries and
> > > > such) have a higher priority than background/bulk work (VACUUM, REINDEX,
> > > > etc) but otherwise we use the system to its full capacity. We don't
> > > > actually want to throttle a VACUUM run any more than a CREATE INDEX, we
> > > > just don't want those to hurt the performance of regular queries that
> > > > are happening.
> > >
> > > I think you're forgetting the motivation of this very patch was to
> > > prevent replication lag caused by a command generating large amounts of
> > > WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to
> > > do with prioritization or foreground/background split.
> > >
> > > I'm not arguing against ability to prioritize stuff, but I disagree it
> > > somehow replaces throttling.
> >
> > Why is replication lag an issue though? I would contend it's an issue
> > because with sync replication, it makes foreground processes wait, and
> > with async replication, it makes the actions of foreground processes
> > show up late on the replicas.
>
> I think reaching the bandwidth limit of either the replication stream,
> or of the startup process is actually more common than these. And for
> that prioritization doesn't help, unless it somehow reduces the total
> amount of WAL.
The issue with hitting those bandwidth limits is that you end up with
queues outside of your control and therefore are unable to prioritize
the data going through them. I agree, that's an issue and it might be
necessary to ask the admin to provide what the bandwidth limit is, so
that we could then avoid running into issues with downstream queues that
are outside of our control causing unexpected/unacceptable lag.
If there is a global rate limit on WAL throughput it could be adjusted by a control loop, measuring replication queue length and/or apply delay. I don't see any sane way how one would tune a per command rate limit, or even worse, a cost-delay parameter. It would have the same problems as work_mem settings.
Rate limit in front of WAL insertion would allow for allocating the throughput between foreground and background tasks, and even allow for priority inheritance to alleviate priority inversion due to locks.
There is also an implicit assumption here that a maintenance command is a background task and a normal DML query is a foreground task. This is not true for all cases, users may want to throttle transactions doing lots of DML to keep synchronous commit latencies for smaller transactions within reasonable limits.
As a wild idea for how to handle the throttling, what if when all our wal insertion credits are used up XLogInsert() sets InterruptPending and the actual sleep is done inside ProcessInterrupts()?
Regards,
Ants Aasma
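A crude model of the control loop suggested above might step a global rate limit down while the standby's measured apply delay exceeds a target, and back up once it recovers. Everything here is an assumption made for illustration: the function name, the 25% step size, and the clamping bounds are all invented, and a real controller would likely be proportional rather than fixed-step:

```c
#include <assert.h>

/*
 * Toy step controller for a global WAL rate limit, per the feedback
 * loop suggested above.  Called periodically with the measured standby
 * apply delay; returns the adjusted limit, clamped to sane bounds.
 * (Step size of 25% and all names are invented for this sketch.)
 */
double
adjust_wal_rate_limit(double current_limit, double apply_delay_s,
                      double target_delay_s,
                      double min_rate, double max_rate)
{
    /* standby behind target: back off; caught up: probe upward */
    double factor = (apply_delay_s > target_delay_s) ? 0.75 : 1.25;
    double next = current_limit * factor;

    if (next < min_rate)
        next = min_rate;
    if (next > max_rate)
        next = max_rate;
    return next;
}
```

The appeal of this shape is exactly the point made above: the admin tunes an observable quantity (acceptable apply delay) instead of guessing at a per-command byte rate.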
Greetings, * Ants Aasma (ants.aasma@eesti.ee) wrote: > On Thu, Feb 21, 2019 at 2:20 AM Stephen Frost <sfrost@snowman.net> wrote: > > The issue with hitting those bandwidth limits is that you end up with > > queues outside of your control and therefore are unable to prioritize > > the data going through them. I agree, that's an issue and it might be > > necessary to ask the admin to provide what the bandwidth limit is, so > > that we could then avoid running into issues with downstream queues that > > are outside of our control causing unexpected/unacceptable lag. > > If there is a global rate limit on WAL throughput it could be adjusted by a > control loop, measuring replication queue length and/or apply delay. I > don't see any sane way how one would tune a per command rate limit, or even > worse, a cost-delay parameter. It would have the same problems as work_mem > settings. Yeah, having some kind of feedback loop would be interesting. I agree that a per-command rate limit would have similar problems to work_mem, and that's definitely one problem we have with the way VACUUM is tuned today but the ship has more-or-less sailed on that- I don't think we're going to be able to simply remove the VACUUM settings. Avoiding adding new settings that are per-command would be good though, if we can sort out a way how. > Rate limit in front of WAL insertion would allow for allocating the > throughput between foreground and background tasks, and even allow for > priority inheritance to alleviate priority inversion due to locks. I'm not sure how much we have to worry about priority inversion here as you need to have conflicts for that and if there's actually a conflict, then it seems like we should just press on. 
That is, a non-concurrent REINDEX is going to prevent an UPDATE from modifying anything in the table, which if the UPDATE is a higher priority than the REINDEX would be priority inversion, but that doesn't mean we should slow down the REINDEX to allow the UPDATE to happen because the UPDATE simply can't happen until the REINDEX is complete. Now, we might slow down the REINDEX because there's UPDATEs against *other* tables that aren't conflicting and we want those UPDATEs to be prioritized over the REINDEX but then that isn't priority inversion. Basically, I'm not sure that there's anything we can do, or need to do, differently from what we do today when it comes to priority inversion risk, at least as it relates to this discussion. There's an interesting discussion to be had about if we should delay the REINDEX taking the lock at all when there's an UPDATE pending, but you run the risk of starving the REINDEX from ever getting the lock and being able to run in that case. A better approach is what we're already working on- arrange for the REINDEX to not require a conflicting lock, so that both can run concurrently. > There is also an implicit assumption here that a maintenance command is a > background task and a normal DML query is a foreground task. This is not > true for all cases, users may want to throttle transactions doing lots of > DML to keep synchronous commit latencies for smaller transactions within > reasonable limits. Agreed, that was something that I was contemplating too- and one could possibly argue in the other direction as well (maybe that REINDEX is on a small table but has a very high priority and we're willing to accept that some regular DML is delayed a bit to allow that REINDEX to finish). 
Still, I would think we'd basically want to use the heuristic that DDL is bulk and DML is a higher priority for a baseline/default position, but then provide users with a way to change the priority on a per-session level, presumably with a GUC or similar, if they have a case where that heuristic is wrong. Again, just to be clear, this is all really 'food for thought' and interesting discussion and shouldn't keep us from doing something simple now, if we can, to help alleviate the immediate practical issue that bulk commands can cause serious WAL lag. I think it's good to have these discussions since they may help us to craft the simple solution in a way that could later be extended (or at least won't get in the way) for these much larger changes, but even if that's not possible, we should be open to accepting a simpler, short-term, improvement, as these larger changes would very likely be multiple major releases away if they're able to be done at all. > As a wild idea for how to handle the throttling, what if when all our wal > insertion credits are used up XLogInsert() sets InterruptPending and the > actual sleep is done inside ProcessInterrupts()? This comment might be better if it was made higher up in the thread, closer to where the discussion was happening about the issues with critical sections and the current patch's approach for throttle-based rate limiting. I'm afraid that it might get lost in this sub-thread about these much larger and loftier ideas around where we might want to go in the future. Thanks! Stephen
On Thu, Feb 21, 2019 at 12:50 PM Stephen Frost <sfrost@snowman.net> wrote:
> > Rate limit in front of WAL insertion would allow for allocating the
> > throughput between foreground and background tasks, and even allow for
> > priority inheritance to alleviate priority inversion due to locks.
>
> I'm not sure how much we have to worry about priority inversion here as
> you need to have conflicts for that and if there's actually a conflict,
> then it seems like we should just press on.
>
> That is, a non-concurrent REINDEX is going to prevent an UPDATE from
> modifying anything in the table, which if the UPDATE is a higher
> priority than the REINDEX would be priority inversion, but that doesn't
> mean we should slow down the REINDEX to allow the UPDATE to happen
> because the UPDATE simply can't happen until the REINDEX is complete.
> Now, we might slow down the REINDEX because there's UPDATEs against
> *other* tables that aren't conflicting and we want those UPDATEs to be
> prioritized over the REINDEX but then that isn't priority inversion.
I was thinking along the lines that each backend gets a budget of WAL insertion credits per time interval, and when the credits run out the process sleeps. With this type of scheme it would be reasonably straightforward to let UPDATEs being blocked by REINDEX to transfer their WAL insertion budgets to the REINDEX, making it get a larger piece of the total throughput pie.
Regards,
Ants Aasma
Greetings,

* Ants Aasma (ants.aasma@eesti.ee) wrote:
> On Thu, Feb 21, 2019 at 12:50 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > Rate limit in front of WAL insertion would allow for allocating the
> > > throughput between foreground and background tasks, and even allow for
> > > priority inheritance to alleviate priority inversion due to locks.
> >
> > I'm not sure how much we have to worry about priority inversion here as
> > you need to have conflicts for that and if there's actually a conflict,
> > then it seems like we should just press on.
> >
> > That is, a non-concurrent REINDEX is going to prevent an UPDATE from
> > modifying anything in the table, which if the UPDATE is a higher
> > priority than the REINDEX would be priority inversion, but that doesn't
> > mean we should slow down the REINDEX to allow the UPDATE to happen
> > because the UPDATE simply can't happen until the REINDEX is complete.
> > Now, we might slow down the REINDEX because there's UPDATEs against
> > *other* tables that aren't conflicting and we want those UPDATEs to be
> > prioritized over the REINDEX but then that isn't priority inversion.
>
> I was thinking along the lines that each backend gets a budget of WAL
> insertion credits per time interval, and when the credits run out the
> process sleeps. With this type of scheme it would be reasonably
> straightforward to let UPDATEs being blocked by REINDEX transfer their
> WAL insertion budgets to the REINDEX, making it get a larger piece of the
> total throughput pie.

Sure, that could possibly be done if it's a per-backend throttle mechanism, but that's got more-or-less the same issue as a per-command mechanism and work_mem, as discussed up-thread.
Also, it seems like if we've solved for a way to do this transferring and delay and such, then we could come up with a way to prioritize (or 'give more credits to') foreground tasks and fewer to background ones. (There was another point made elsewhere in the thread that background processes should still be given *some* amount of credits, always, so that they don't end up starving completely, and I agree with that.)

There's actually a lot of similarity, or at least parallels, between this and basic network traffic shaping, it seems to me anyway, where you have a pipe of a certain size and you want to prioritize some things (like interactive SSH) while de-prioritizing other things (bulk SCP) but also using the pipe fully.

Thanks!

Stephen