WAL insert delay settings

From
Peter Eisentraut
Date:
Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
a lot of WAL.  A lot of WAL at once can cause delays in replication.
For synchronous replication, this can make seemingly unrelated sessions
hang.  But also for asynchronous replication, it will increase latency.

One idea to address this is to slow down WAL-generating maintenance
operations.  This is similar to the vacuum delay.  Where the vacuum
delay counts notional I/O cost before sleeping, here we would count how
much WAL has been generated and sleep after some amount.

I attach an example patch for this functionality.  It introduces three
settings:

wal_insert_delay_enabled
wal_insert_delay
wal_insert_delay_size

When you turn on wal_insert_delay_enabled, then it will sleep for
wal_insert_delay after the session has produced wal_insert_delay_size of
WAL data.

The idea is that you would tune wal_insert_delay and
wal_insert_delay_size to your required performance characteristics and
then turn on wal_insert_delay_enabled individually in maintenance jobs
or similar.

To test, for example, set up pgbench with synchronous replication and
run an unrelated large index build in a separate session.  With the
settings, you can make it as fast or as slow as you want.

Tuning these settings, however, is quite mysterious I fear.  You have to
play around a lot to get settings that achieve the right balance.

So, some questions:

Is this useful?

Any other thoughts on how to configure this or do this?

Should we aim for a more general delay system, possibly including vacuum
delay and perhaps something else?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: WAL insert delay settings

From
Andres Freund
Date:
Hi,

On February 13, 2019 1:16:07 PM GMT+01:00, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can
>create
>a lot of WAL.  A lot of WAL at once can cause delays in replication.
>For synchronous replication, this can make seemingly unrelated sessions
>hang.  But also for asynchronous replication, it will increase latency.
>
>One idea to address this is to slow down WAL-generating maintenance
>operations.  This is similar to the vacuum delay.  Where the vacuum
>delay counts notional I/O cost before sleeping, here we would count how
>much WAL has been generated and sleep after some amount.
>
>I attach an example patch for this functionality.  It introduces three
>settings:
>
>wal_insert_delay_enabled
>wal_insert_delay
>wal_insert_delay_size
>
>When you turn on wal_insert_delay_enabled, then it will sleep for
>wal_insert_delay after the session has produced wal_insert_delay_size
>of
>WAL data.
>
>The idea is that you would tune wal_insert_delay and
>wal_insert_delay_size to your required performance characteristics and
>then turn on wal_insert_delay_enabled individually in maintenance jobs
>or similar.
>
>To test, for example, set up pgbench with synchronous replication and
>run an unrelated large index build in a separate session.  With the
>settings, you can make it as fast or as slow as you want.
>
>Tuning these settings, however, is quite mysterious I fear.  You have
>to
>play around a lot to get settings that achieve the right balance.
>
>So, some questions:
>
>Is this useful?
>
>Any other thoughts on how to configure this or do this?
>
>Should we aim for a more general delay system, possibly including
>vacuum
>delay and perhaps something else?

Interesting idea, not yet quite sure what to think. But I don't think the way you did it is acceptable - we can't just
delay while holding buffer locks, in critical sections, while not interruptible. 

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: WAL insert delay settings

From
dataegret
Date:
Hi

On 13.02.2019 17:16, Peter Eisentraut wrote:
> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
> a lot of WAL.  A lot of WAL at once can cause delays in replication.
> For synchronous replication, this can make seemingly unrelated sessions
> hang.  But also for asynchronous replication, it will increase latency.
>
> One idea to address this is to slow down WAL-generating maintenance
> operations.  This is similar to the vacuum delay.  Where the vacuum
> delay counts notional I/O cost before sleeping, here we would count how
> much WAL has been generated and sleep after some amount.
>
> I attach an example patch for this functionality.  It introduces three
> settings:
>
> wal_insert_delay_enabled
> wal_insert_delay
> wal_insert_delay_size
>
> When you turn on wal_insert_delay_enabled, then it will sleep for
> wal_insert_delay after the session has produced wal_insert_delay_size of
> WAL data.
>
> The idea is that you would tune wal_insert_delay and
> wal_insert_delay_size to your required performance characteristics and
> then turn on wal_insert_delay_enabled individually in maintenance jobs
> or similar.
>
> To test, for example, set up pgbench with synchronous replication and
> run an unrelated large index build in a separate session.  With the
> settings, you can make it as fast or as slow as you want.
>
> Tuning these settings, however, is quite mysterious I fear.  You have to
> play around a lot to get settings that achieve the right balance.
>
> So, some questions:
>
> Is this useful?
>
> Any other thoughts on how to configure this or do this?
>
> Should we aim for a more general delay system, possibly including vacuum
> delay and perhaps something else?
>
I think it's better to have more general cost-based settings that allow 
controlling performance, something like what has already been done for 
autovacuum.

For example, introduce a vacuum-like mechanism with the following 
knobs:
maintenance_cost_page_hit
maintenance_cost_page_miss
maintenance_cost_page_dirty
maintenance_cost_delay
maintenance_cost_limit

maintenance_cost_delay=0 (the default) means the feature is disabled, but 
a user who wants to limit performance can set these parameters per 
session or per user. This can be especially useful for limiting 
already-running sessions, such as a mass deletion or pg_dump.

Of course, this is just an idea; I can't imagine how many things would 
have to be touched in order to implement it.
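As a concrete sketch of how a session might use such settings (all of them hypothetical; none exist today, and the names just mirror the existing vacuum_cost_* style):

```sql
-- Hypothetical per-session throttling with the proposed settings.
SET maintenance_cost_delay = 20;   -- ms; nonzero enables throttling
SET maintenance_cost_limit = 200;
DELETE FROM big_table WHERE created_at < now() - interval '1 year';
```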

Regards, Alexey Lesovsky



Re: WAL insert delay settings

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On February 13, 2019 1:16:07 PM GMT+01:00, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>> One idea to address this is to slow down WAL-generating maintenance
>> operations.  This is similar to the vacuum delay.  Where the vacuum
>> delay counts notional I/O cost before sleeping, here we would count how
>> much WAL has been generated and sleep after some amount.

> Interesting idea, not yet quite sure what to think. But I don't think the way you did it is acceptable - we can't
> just delay while holding buffer locks, in critical sections, while not interruptible. 

Yeah.  Maybe it could be done in a less invasive way by just having the
WAL code keep a running sum of how much WAL this process has created,
and then letting the existing vacuum-delay infrastructure use that as
one of its how-much-IO-have-I-done inputs.

Not sure if that makes the tuning problem easier or harder, but
it seems reasonable on its face to count WAL emission as I/O.
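In rough terms (an illustrative Python model, not PostgreSQL's actual C code; the WAL cost factor and all names here are hypothetical), the idea is to fold WAL volume into the same cost balance the vacuum delay already maintains:

```python
# Sketch: counting WAL emission as one more input to the existing
# vacuum-delay cost balance. All names and factors are illustrative.
vacuum_cost_limit = 200      # credits accumulated before sleeping
vacuum_cost_page_miss = 10   # existing notional I/O cost
wal_cost_per_kb = 1          # hypothetical new cost factor for WAL

cost_balance = 0

def charge_page_miss() -> None:
    global cost_balance
    cost_balance += vacuum_cost_page_miss

def charge_wal(nbytes: int) -> None:
    """The new input: charge credits proportional to WAL generated."""
    global cost_balance
    cost_balance += (nbytes // 1024) * wal_cost_per_kb

def should_sleep() -> bool:
    """Mirrors the delay check: sleep once the balance hits the limit."""
    global cost_balance
    if cost_balance >= vacuum_cost_limit:
        cost_balance = 0
        return True
    return False
```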

            regards, tom lane


Re: WAL insert delay settings

From
Stephen Frost
Date:
Greetings,

* Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
> a lot of WAL.  A lot of WAL at once can cause delays in replication.

Agreed, though I think VACUUM should certainly be included in this.

I'm all for the idea though it seems like a different approach is needed
based on the down-thread discussion.  Ultimately, having a way to have
these activities happen without causing issues for replicas is a great
idea and would definitely address a practical issue that a lot of people
run into.

Thanks!

Stephen

Attachment

Re: WAL insert delay settings

From
Peter Eisentraut
Date:
On 13/02/2019 13:18, Andres Freund wrote:
> But I don't think the way you did it is acceptable - we can't just delay while holding buffer locks, in critical
> sections, while not interruptible.

The code I added to XLogInsertRecord() is not inside the critical section.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WAL insert delay settings

From
Andres Freund
Date:

On February 13, 2019 4:39:21 PM GMT+01:00, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>On 13/02/2019 13:18, Andres Freund wrote:
>> But I don't think the way you did it is acceptable - we can't just
>delay while holding buffer locks, in critical sections, while not
>interruptible.
>
>The code I added to XLogInsertRecord() is not inside the critical
>section.

Most callers do xlog insertions inside crit sections though.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: WAL insert delay settings

From
Andrey Lepikhov
Date:

On 13.02.2019 19:57, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On February 13, 2019 1:16:07 PM GMT+01:00, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>>> One idea to address this is to slow down WAL-generating maintenance
>>> operations.  This is similar to the vacuum delay.  Where the vacuum
>>> delay counts notional I/O cost before sleeping, here we would count how
>>> much WAL has been generated and sleep after some amount.
> 
>> Interesting idea, not yet quite sure what to think. But I don't think the way you did it is acceptable - we can't
>> just delay while holding buffer locks, in critical sections, while not interruptible.
> 
> Yeah.  Maybe it could be done in a less invasive way by just having the
> WAL code keep a running sum of how much WAL this process has created,
> and then letting the existing vacuum-delay infrastructure use that as
> one of its how-much-IO-have-I-done inputs.
> 
> Not sure if that makes the tuning problem easier or harder, but
> it seems reasonable on its face to count WAL emission as I/O.
> 
>             regards, tom lane

We could also add a 'soft' clause to DML queries, as an abstraction for 
background query execution. It could contain a WAL write velocity limit 
parameter (as Tom proposed), and perhaps others.

-- 
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company


Re: WAL insert delay settings

From
Peter Geoghegan
Date:
On Wed, Feb 13, 2019 at 4:16 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
> a lot of WAL.  A lot of WAL at once can cause delays in replication.
> For synchronous replication, this can make seemingly unrelated sessions
> hang.  But also for asynchronous replication, it will increase latency.

I think that I suggested a feature like this early during my time at
Heroku, about 5 years ago. There would occasionally be cases where ops
would find it useful to throttle WAL writing using their own terrible
kludge (it involved sending SIGSTOP to the WAL writer).

I recall that this idea was not well received at the time. I still
think it's a good idea, though. Provided there is a safe way to get it
to work.

-- 
Peter Geoghegan


Re: WAL insert delay settings

From
Andres Freund
Date:
Hi,

On 2019-02-13 23:21:39 -0800, Peter Geoghegan wrote:
> There would occasionally be cases where ops
> would find it useful to throttle WAL writing using their own terrible
> kludge (it involved sending SIGSTOP to the WAL writer).

That can't have been the workaround - either you'd interrupt it while
holding critical locks (in which case nobody could write WAL anymore),
or you'd just move all the writing to backends, no?

Greetings,

Andres Freund


Re: WAL insert delay settings

From
Peter Geoghegan
Date:
On Thu, Feb 14, 2019 at 12:42 AM Andres Freund <andres@anarazel.de> wrote:
> That can't have been the workaround - either you'd interrupt it while
> holding critical locks (in which case nobody could write WAL anymore),
> or you'd just move all the writing to backends, no?

I imagine that it held the critical locks briefly. I'm not endorsing
that approach, obviously, but apparently it more or less worked. It
was something that was used in rare cases, only when there was no
application-specific way to throttle writes, and only when the server
was in effect destabilized by writing out WAL too quickly.

-- 
Peter Geoghegan


Re: WAL insert delay settings

From
Peter Geoghegan
Date:
On Thu, Feb 14, 2019 at 12:52 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I imagine that it held the critical locks briefly.

I didn't mention that the utility they used would send SIGSTOP and
SIGCONT in close succession. (Yeah, I know.)

-- 
Peter Geoghegan


Re: WAL insert delay settings

From
Peter Geoghegan
Date:
On Thu, Feb 14, 2019 at 12:53 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I didn't mention that the utility they used would send SIGSTOP and
> SIGCONT in close succession. (Yeah, I know.)

Actually, it SIGSTOP'd backends, not the WAL writer or background writer.


-- 
Peter Geoghegan


Re: WAL insert delay settings

From
Tomas Vondra
Date:

On 2/13/19 4:31 PM, Stephen Frost wrote:
> Greetings,
> 
> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
>> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
>> a lot of WAL.  A lot of WAL at once can cause delays in replication.
> 
> Agreed, though I think VACUUM should certainly be included in this.
> 

Won't these two throttling criteria interact in undesirable and/or
unpredictable ways? With the regular vacuum throttling (based on
hit/miss/dirty) it's possible to compute rough read/write I/O limits.
But with the additional sleeps based on amount-of-WAL, we may sleep for
one of two reasons, so we may not reach either limit. No?

> I'm all for the idea though it seems like a different approach is needed
> based on the down-thread discussion.  Ultimately, having a way to have
> these activities happen without causing issues for replicas is a great
> idea and would definitely address a practical issue that a lot of people
> run into.
> 

+1


-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WAL insert delay settings

From
Andres Freund
Date:
Hi,

On 2019-02-14 10:00:38 +0100, Tomas Vondra wrote:
> On 2/13/19 4:31 PM, Stephen Frost wrote:
> > Greetings,
> > 
> > * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
> >> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
> >> a lot of WAL.  A lot of WAL at once can cause delays in replication.
> > 
> > Agreed, though I think VACUUM should certainly be included in this.
> > 
> 
> Won't these two throttling criteria interact in undesirable and/or
> unpredictable way? With the regular vacuum throttling (based on
> hit/miss/dirty) it's possible to compute rough read/write I/O limits.
> But with the additional sleeps based on amount-of-WAL, we may sleep for
> one of two reasons, so we may not reach either limit. No?

Well, it'd be max rates for either, if done right. I think we only
should start adding delays for WAL logging if we're exceeding the WAL
write rate. That's obviously more complicated than the stuff we do for
the current VACUUM throttling, but I can't see two such systems
interacting well. Also, the current logic just doesn't work well when
you consider IO actually taking time, and/or process scheduling effects
on busy systems.

Greetings,

Andres Freund


Re: WAL insert delay settings

From
Tomas Vondra
Date:

On 2/14/19 10:06 AM, Andres Freund wrote:
> Hi,
> 
> On 2019-02-14 10:00:38 +0100, Tomas Vondra wrote:
>> On 2/13/19 4:31 PM, Stephen Frost wrote:
>>> Greetings,
>>>
>>> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
>>>> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can create
>>>> a lot of WAL.  A lot of WAL at once can cause delays in replication.
>>>
>>> Agreed, though I think VACUUM should certainly be included in this.
>>>
>>
>> Won't these two throttling criteria interact in undesirable and/or
>> unpredictable way? With the regular vacuum throttling (based on
>> hit/miss/dirty) it's possible to compute rough read/write I/O limits.
>> But with the additional sleeps based on amount-of-WAL, we may sleep for
>> one of two reasons, so we may not reach either limit. No?
> 
> Well, it'd be max rates for either, if done right. I think we only
> should start adding delays for WAL logging if we're exceeding the WAL
> write rate.

Not really, I think. If you add additional sleep() calls somewhere, that
may affect the limits in vacuum, making it throttle before reaching the
derived throughput limits.

> That's obviously more complicated than the stuff we do for
> the current VACUUM throttling, but I can't see two such systems
> interacting well. Also, the current logic just doesn't work well when
> you consider IO actually taking time, and/or process scheduling effects
> on busy systems.
> 

True, but making it even less predictable is hardly an improvement.

cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WAL insert delay settings

From
Andres Freund
Date:

On February 14, 2019 10:31:57 AM GMT+01:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>
>On 2/14/19 10:06 AM, Andres Freund wrote:
>> Hi,
>>
>> On 2019-02-14 10:00:38 +0100, Tomas Vondra wrote:
>>> On 2/13/19 4:31 PM, Stephen Frost wrote:
>>>> Greetings,
>>>>
>>>> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
>>>>> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can
>create
>>>>> a lot of WAL.  A lot of WAL at once can cause delays in
>replication.
>>>>
>>>> Agreed, though I think VACUUM should certainly be included in this.
>>>>
>>>
>>> Won't these two throttling criteria interact in undesirable and/or
>>> unpredictable way? With the regular vacuum throttling (based on
>>> hit/miss/dirty) it's possible to compute rough read/write I/O
>limits.
>>> But with the additional sleeps based on amount-of-WAL, we may sleep
>for
>>> one of two reasons, so we may not reach either limit. No?
>>
>> Well, it'd be max rates for either, if done right. I think we only
>> should start adding delays for WAL logging if we're exceeding the WAL
>> write rate.
>
>Not really, I think. If you add additional sleep() calls somewhere,
>that
>may affect the limits in vacuum, making it throttle before reaching the
>derived throughput limits.

I don't understand. Obviously, if you have two limits, the scarcer resource can limit full use of the other resource.
That seems OK? The thing I think we need to be careful about is not to limit in a way (e.g. by adding sleeps even when
below the limit) such that a WAL limit causes throttling of normal IO before the WAL limit is reached. 


>> That's obviously more complicated than the stuff we do for
>> the current VACUUM throttling, but I can't see two such systems
>> interacting well. Also, the current logic just doesn't work well when
>> you consider IO actually taking time, and/or process scheduling
>effects
>> on busy systems.
>>
>
>True, but making it even less predictable is hardly an improvement.

I don't quite see the problem here. Could you expand?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: WAL insert delay settings

From
Tomas Vondra
Date:

On 2/14/19 10:36 AM, Andres Freund wrote:
> 
> 
> On February 14, 2019 10:31:57 AM GMT+01:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>>
>> On 2/14/19 10:06 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2019-02-14 10:00:38 +0100, Tomas Vondra wrote:
>>>> On 2/13/19 4:31 PM, Stephen Frost wrote:
>>>>> Greetings,
>>>>>
>>>>> * Peter Eisentraut (peter.eisentraut@2ndquadrant.com) wrote:
>>>>>> Bulk operations like CREATE INDEX, ALTER TABLE, or bulk loads can
>> create
>>>>>> a lot of WAL.  A lot of WAL at once can cause delays in
>> replication.
>>>>>
>>>>> Agreed, though I think VACUUM should certainly be included in this.
>>>>>
>>>>
>>>> Won't these two throttling criteria interact in undesirable and/or
>>>> unpredictable way? With the regular vacuum throttling (based on
>>>> hit/miss/dirty) it's possible to compute rough read/write I/O
>> limits.
>>>> But with the additional sleeps based on amount-of-WAL, we may sleep
>> for
>>>> one of two reasons, so we may not reach either limit. No?
>>>
>>> Well, it'd be max rates for either, if done right. I think we only
>>> should start adding delays for WAL logging if we're exceeding the WAL
>>> write rate.
>>
>> Not really, I think. If you add additional sleep() calls somewhere,
>> that
>> may affect the limits in vacuum, making it throttle before reaching the
>> derived throughput limits.
> 
> I don't understand. Obviously, if you have two limits, the scarcer
> resource can limit full use of the other resource. That seems OK? The
> thing I think we need to be careful about is not to limit in a way,
> e.g. by adding sleeps even when below the limit, that a WAL limit
> causes throttling of normal IO before the WAL limit is reached.
> 

With the vacuum throttling, rough I/O throughput maximums can be
computed by counting the number of pages you can read/write between
sleeps. For example, with the defaults (200 credits, 20ms sleeps, miss
cost 10 credits) this means 20 writes/round, with 50 rounds/second, so
8MB/s. But this is based on the assumption that the work between sleeps
takes almost no time - that's not perfect, but generally works.

But if you add extra sleep() calls somewhere (say because there's also
a limit on WAL throughput), it will affect how fast VACUUM works in
general. It will continue with the cost-based throttling, but it will
never reach the limits. Say you do another 20ms sleep somewhere.
Suddenly it only does 25 rounds/second, and the actual write
limit drops to 4 MB/s.
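The arithmetic above can be checked directly (a back-of-the-envelope model: 8kB pages, and 1MB taken as 1000kB to match the round figures):

```python
# Back-of-the-envelope check of the cost-based throughput figures above.
PAGE_KB = 8  # default 8kB page size

def write_limit_mb_per_s(cost_limit, page_miss_cost, sleep_ms, extra_sleep_ms=0):
    """Approximate max page-miss throughput under cost-based throttling."""
    pages_per_round = cost_limit // page_miss_cost   # pages between sleeps
    rounds_per_s = 1000 / (sleep_ms + extra_sleep_ms)
    return pages_per_round * rounds_per_s * PAGE_KB / 1000  # ~MB/s

# Defaults (200 credits, 10/miss, 20ms sleep) give 8 MB/s; an extra
# 20ms sleep per round halves the rounds/second and thus the limit.
```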

> 
>>> That's obviously more complicated than the stuff we do for
>>> the current VACUUM throttling, but I can't see two such systems
>>> interacting well. Also, the current logic just doesn't work well when
>>> you consider IO actually taking time, and/or process scheduling
>> effects
>>> on busy systems.
>>>
>>
>> True, but making it even less predictable is hardly an improvement.
> 
> I don't quite see the problem here. Could you expand?
> 

All I'm saying is that you can now estimate how many reads/writes vacuum
does. With the extra sleeps (due to an additional throttling mechanism) it
will be harder.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WAL insert delay settings

From
Peter Eisentraut
Date:
On 14/02/2019 11:03, Tomas Vondra wrote:
> But if you add extra sleep() calls somewhere (say because there's also
> limit on WAL throughput), it will affect how fast VACUUM works in
> general. Yet it'll continue with the cost-based throttling, but it will
> never reach the limits. Say you do another 20ms sleep somewhere.
> Suddenly it means it only does 25 rounds/second, and the actual write
> limit drops to 4 MB/s.

I think at a first approximation, you probably don't want to add WAL
delays to vacuum jobs, since they are already slowed down, so the rate
of WAL they produce might not be your first problem.  The problem is
more things like CREATE INDEX CONCURRENTLY that run at full speed.

That leads to an alternative idea of expanding the existing cost-based
vacuum delay system to other commands.

We could even enhance the cost system by taking WAL into account as an
additional factor.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WAL insert delay settings

From
Peter Eisentraut
Date:
On 13/02/2019 16:40, Andres Freund wrote:
> On February 13, 2019 4:39:21 PM GMT+01:00, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>> On 13/02/2019 13:18, Andres Freund wrote:
>>> But I don't think the way you did it is acceptable - we can't just
>> delay while holding buffer locks, in critical sections, while not
>> interruptible.
>>
>> The code I added to XLogInsertRecord() is not inside the critical
>> section.
> 
> Most callers do xlog insertions inside crit sections though.

Is it a problem that pg_usleep(CommitDelay) is inside a critical section?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WAL insert delay settings

From
Stephen Frost
Date:
Greetings,

On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 14/02/2019 11:03, Tomas Vondra wrote:
> But if you add extra sleep() calls somewhere (say because there's also
> limit on WAL throughput), it will affect how fast VACUUM works in
> general. Yet it'll continue with the cost-based throttling, but it will
> never reach the limits. Say you do another 20ms sleep somewhere.
> Suddenly it means it only does 25 rounds/second, and the actual write
> limit drops to 4 MB/s.

I think at a first approximation, you probably don't want to add WAL
delays to vacuum jobs, since they are already slowed down, so the rate
of WAL they produce might not be your first problem.  The problem is
more things like CREATE INDEX CONCURRENTLY that run at full speed.

That leads to an alternative idea of expanding the existing cost-based
vacuum delay system to other commands.

We could even enhance the cost system by taking WAL into account as an
additional factor.

This is really what I was thinking: let’s not have multiple independent ways of slowing down maintenance and similar jobs to reduce their impact on I/O to the heap and to WAL. 

Thanks!

Stephen

Re: WAL insert delay settings

From
Andres Freund
Date:
Hi,

On 2019-02-14 16:16:05 +0100, Peter Eisentraut wrote:
> On 13/02/2019 16:40, Andres Freund wrote:
> > On February 13, 2019 4:39:21 PM GMT+01:00, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> >> On 13/02/2019 13:18, Andres Freund wrote:
> >>> But I don't think the way you did it is acceptable - we can't just
> >> delay while holding buffer locks, in critical sections, while not
> >> interruptible.
> >>
> >> The code I added to XLogInsertRecord() is not inside the critical
> >> section.
> > 
> > Most callers do xlog insertions inside crit sections though.
> 
> Is it a problem that pg_usleep(CommitDelay) is inside a critical section?

Well: We can't make things sleep for considerable time while holding
crucial locks and not make such sleeps interruptible. And holding
lwlocks will make it noninterruptible (but still it could throw an
error), but with crit sections, we can't even error out if we somehow
got that error.

Consider throttled code writing to a popular index or btree page - they'd
suddenly be stalled, and everyone else would also queue up in an
uninterruptible manner via lwlocks. You'd throttle the whole system.

Greetings,

Andres Freund


Re: WAL insert delay settings

From
Andres Freund
Date:
On 2019-02-14 11:02:24 -0500, Stephen Frost wrote:
> Greetings,
> 
> On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <
> peter.eisentraut@2ndquadrant.com> wrote:
> 
> > On 14/02/2019 11:03, Tomas Vondra wrote:
> > > But if you add extra sleep() calls somewhere (say because there's also
> > > limit on WAL throughput), it will affect how fast VACUUM works in
> > > general. Yet it'll continue with the cost-based throttling, but it will
> > > never reach the limits. Say you do another 20ms sleep somewhere.
> > > Suddenly it means it only does 25 rounds/second, and the actual write
> > > limit drops to 4 MB/s.
> >
> > I think at a first approximation, you probably don't want to add WAL
> > delays to vacuum jobs, since they are already slowed down, so the rate
> > of WAL they produce might not be your first problem.  The problem is
> > more things like CREATE INDEX CONCURRENTLY that run at full speed.
> >
> > That leads to an alternative idea of expanding the existing cost-based
> > vacuum delay system to other commands.
> >
> > We could even enhance the cost system by taking WAL into account as an
> > additional factor.
> 
> 
> This is really what I was thinking- let’s not have multiple independent
> ways of slowing down maintenance and similar jobs to reduce their impact on
> I/o to the heap and to WAL.

I think that's a bad idea. Both because the current vacuum code is
*terrible* if you desire higher rates because both CPU and IO time
aren't taken into account. And it's extremely hard to control.  And it
seems entirely valuable to be able to limit the amount of WAL generated
for replication, but still try to get the rest of the work done as
quickly as reasonably possible wrt local IO.

Greetings,

Andres Freund


Re: WAL insert delay settings

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2019-02-14 11:02:24 -0500, Stephen Frost wrote:
> > On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <
> > peter.eisentraut@2ndquadrant.com> wrote:
> > > On 14/02/2019 11:03, Tomas Vondra wrote:
> > > > But if you add extra sleep() calls somewhere (say because there's also
> > > > limit on WAL throughput), it will affect how fast VACUUM works in
> > > > general. Yet it'll continue with the cost-based throttling, but it will
> > > > never reach the limits. Say you do another 20ms sleep somewhere.
> > > > Suddenly it means it only does 25 rounds/second, and the actual write
> > > > limit drops to 4 MB/s.
> > >
> > > I think at a first approximation, you probably don't want to add WAL
> > > delays to vacuum jobs, since they are already slowed down, so the rate
> > > of WAL they produce might not be your first problem.  The problem is
> > > more things like CREATE INDEX CONCURRENTLY that run at full speed.
> > >
> > > That leads to an alternative idea of expanding the existing cost-based
> > > vacuum delay system to other commands.
> > >
> > > We could even enhance the cost system by taking WAL into account as an
> > > additional factor.
> >
> > This is really what I was thinking- let’s not have multiple independent
> > ways of slowing down maintenance and similar jobs to reduce their impact on
> > I/o to the heap and to WAL.
>
> I think that's a bad idea. Both because the current vacuum code is
> *terrible* if you desire higher rates because both CPU and IO time
> aren't taken into account. And it's extremely hard to control.  And it
> seems entirely valuable to be able to limit the amount of WAL generated
> for replication, but still try go get the rest of the work done as
> quickly as reasonably possible wrt local IO.

I'm all for making improvements to the vacuum code and making it easier
to control.

I don't buy off on the argument that there is some way to segregate the
local I/O question from the WAL when we're talking about these kinds of
operations (VACUUM, CREATE INDEX, CLUSTER, etc) on logged relations, nor
do I think we do our users a service by giving them independent knobs
for both that will undoubtedly end up making it more difficult to
understand and control what's going on overall.

Even here, it seems, you're arguing that the existing approach for
VACUUM is hard to control; wouldn't adding another set of knobs for
controlling the amount of WAL generated by VACUUM make that worse?  I
have a hard time seeing how it wouldn't.

Thanks!

Stephen

Attachment

Re: WAL insert delay settings

From
Andres Freund
Date:
Hi,

On 2019-02-15 08:50:03 -0500, Stephen Frost wrote:
> * Andres Freund (andres@anarazel.de) wrote:
> > On 2019-02-14 11:02:24 -0500, Stephen Frost wrote:
> > > On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <
> > > peter.eisentraut@2ndquadrant.com> wrote:
> > > > On 14/02/2019 11:03, Tomas Vondra wrote:
> > > > > But if you add extra sleep() calls somewhere (say because there's also
> > > > > limit on WAL throughput), it will affect how fast VACUUM works in
> > > > > general. Yet it'll continue with the cost-based throttling, but it will
> > > > > never reach the limits. Say you do another 20ms sleep somewhere.
> > > > > Suddenly it means it only does 25 rounds/second, and the actual write
> > > > > limit drops to 4 MB/s.
> > > >
> > > > I think at a first approximation, you probably don't want to add WAL
> > > > delays to vacuum jobs, since they are already slowed down, so the rate
> > > > of WAL they produce might not be your first problem.  The problem is
> > > > more things like CREATE INDEX CONCURRENTLY that run at full speed.
> > > >
> > > > That leads to an alternative idea of expanding the existing cost-based
> > > > vacuum delay system to other commands.
> > > >
> > > > We could even enhance the cost system by taking WAL into account as an
> > > > additional factor.
> > > 
> > > This is really what I was thinking- let’s not have multiple independent
> > > ways of slowing down maintenance and similar jobs to reduce their impact on
> > > I/o to the heap and to WAL.
> > 
> > I think that's a bad idea. Both because the current vacuum code is
> > *terrible* if you desire higher rates because both CPU and IO time
> > aren't taken into account. And it's extremely hard to control.  And it
> > seems entirely valuable to be able to limit the amount of WAL generated
> > for replication, but still try go get the rest of the work done as
> > quickly as reasonably possible wrt local IO.
> 
> I'm all for making improvements to the vacuum code and making it easier
> to control.
> 
> I don't buy off on the argument that there is some way to segregate the
> local I/O question from the WAL when we're talking about these kinds of
> operations (VACUUM, CREATE INDEX, CLUSTER, etc) on logged relations, nor
> do I think we do our users a service by giving them independent knobs
> for both that will undoubtably end up making it more difficult to
> understand and control what's going on overall.
> 
> Even here, it seems, you're arguing that the existing approach for
> VACUUM is hard to control; wouldn't adding another set of knobs for
> controlling the amount of WAL generated by VACUUM make that worse?  I
> have a hard time seeing how it wouldn't.

I think it's because I see them as, often, having two largely
independent use cases. If your goal is to avoid swamping replication
with WAL, you don't necessarily care about also throttling VACUUM (or
REINDEX, or CLUSTER, or ...)'s local IO.  By forcing the two to be
combined, you just make the whole feature less usable.

I think it'd not be insane to add two things:
- WAL write rate limiting, independent of the vacuum stuff. It'd also be
  used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
  rewrites, ...)
- Account for WAL writes in the current vacuum costing logic, by
  accounting for it using a new cost parameter

Then VACUUM would be throttled by the *minimum* of the two, which seems
to make plenty of sense to me, given the use cases.

Greetings,

Andres Freund


Re: WAL insert delay settings

From
Tomas Vondra
Date:
On 2/15/19 7:41 PM, Andres Freund wrote:
> Hi,
> 
> On 2019-02-15 08:50:03 -0500, Stephen Frost wrote:
>> * Andres Freund (andres@anarazel.de) wrote:
>>> On 2019-02-14 11:02:24 -0500, Stephen Frost wrote:
>>>> On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <
>>>> peter.eisentraut@2ndquadrant.com> wrote:
>>>>> On 14/02/2019 11:03, Tomas Vondra wrote:
>>>>>> But if you add extra sleep() calls somewhere (say because there's also
>>>>>> limit on WAL throughput), it will affect how fast VACUUM works in
>>>>>> general. Yet it'll continue with the cost-based throttling, but it will
>>>>>> never reach the limits. Say you do another 20ms sleep somewhere.
>>>>>> Suddenly it means it only does 25 rounds/second, and the actual write
>>>>>> limit drops to 4 MB/s.
>>>>>
>>>>> I think at a first approximation, you probably don't want to add WAL
>>>>> delays to vacuum jobs, since they are already slowed down, so the rate
>>>>> of WAL they produce might not be your first problem.  The problem is
>>>>> more things like CREATE INDEX CONCURRENTLY that run at full speed.
>>>>>
>>>>> That leads to an alternative idea of expanding the existing cost-based
>>>>> vacuum delay system to other commands.
>>>>>
>>>>> We could even enhance the cost system by taking WAL into account as an
>>>>> additional factor.
>>>>
>>>> This is really what I was thinking- let’s not have multiple independent
>>>> ways of slowing down maintenance and similar jobs to reduce their impact on
>>>> I/o to the heap and to WAL.
>>>
>>> I think that's a bad idea. Both because the current vacuum code is
>>> *terrible* if you desire higher rates because both CPU and IO time
>>> aren't taken into account. And it's extremely hard to control.  And it
>>> seems entirely valuable to be able to limit the amount of WAL generated
>>> for replication, but still try go get the rest of the work done as
>>> quickly as reasonably possible wrt local IO.
>>
>> I'm all for making improvements to the vacuum code and making it easier
>> to control.
>>
>> I don't buy off on the argument that there is some way to segregate the
>> local I/O question from the WAL when we're talking about these kinds of
>> operations (VACUUM, CREATE INDEX, CLUSTER, etc) on logged relations, nor
>> do I think we do our users a service by giving them independent knobs
>> for both that will undoubtably end up making it more difficult to
>> understand and control what's going on overall.
>>
>> Even here, it seems, you're arguing that the existing approach for
>> VACUUM is hard to control; wouldn't adding another set of knobs for
>> controlling the amount of WAL generated by VACUUM make that worse?  I
>> have a hard time seeing how it wouldn't.
> 
> I think it's because I see them as, often, having two largely 
> independent use cases. If your goal is to avoid swamping replication 
> with WAL, you don't necessarily care about also throttling VACUUM
> (or REINDEX, or CLUSTER, or ...)'s local IO.  By forcing to combine
> the two you just make the whole feature less usable.
> 

I agree with that.

> I think it'd not be insane to add two things:
> - WAL write rate limiting, independent of the vacuum stuff. It'd also be
>   used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
>   rewrites, ...)
> - Account for WAL writes in the current vacuum costing logic, by
>   accounting for it using a new cost parameter
> 
> Then VACUUM would be throttled by the *minimum* of the two, which seems
> to make plenty sense to me, given the usecases.
> 

Is it really the minimum? If you add another cost parameter to the vacuum
model, then there's almost no chance of actually reaching the limit
because the budget (cost_limit) is shared with other stuff (local I/O).

FWIW I do think the ability to throttle WAL is a useful feature, I just
don't want to shoot myself in the foot by making other things worse.

As you note, the existing VACUUM throttling is already hard to control;
this seems to make it even harder.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WAL insert delay settings

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 2/15/19 7:41 PM, Andres Freund wrote:
> > On 2019-02-15 08:50:03 -0500, Stephen Frost wrote:
> >> * Andres Freund (andres@anarazel.de) wrote:
> >>> On 2019-02-14 11:02:24 -0500, Stephen Frost wrote:
> >>>> On Thu, Feb 14, 2019 at 10:15 Peter Eisentraut <
> >>>> peter.eisentraut@2ndquadrant.com> wrote:
> >>>>> On 14/02/2019 11:03, Tomas Vondra wrote:
> >>>>>> But if you add extra sleep() calls somewhere (say because there's also
> >>>>>> limit on WAL throughput), it will affect how fast VACUUM works in
> >>>>>> general. Yet it'll continue with the cost-based throttling, but it will
> >>>>>> never reach the limits. Say you do another 20ms sleep somewhere.
> >>>>>> Suddenly it means it only does 25 rounds/second, and the actual write
> >>>>>> limit drops to 4 MB/s.
> >>>>>
> >>>>> I think at a first approximation, you probably don't want to add WAL
> >>>>> delays to vacuum jobs, since they are already slowed down, so the rate
> >>>>> of WAL they produce might not be your first problem.  The problem is
> >>>>> more things like CREATE INDEX CONCURRENTLY that run at full speed.
> >>>>>
> >>>>> That leads to an alternative idea of expanding the existing cost-based
> >>>>> vacuum delay system to other commands.
> >>>>>
> >>>>> We could even enhance the cost system by taking WAL into account as an
> >>>>> additional factor.
> >>>>
> >>>> This is really what I was thinking- let’s not have multiple independent
> >>>> ways of slowing down maintenance and similar jobs to reduce their impact on
> >>>> I/o to the heap and to WAL.
> >>>
> >>> I think that's a bad idea. Both because the current vacuum code is
> >>> *terrible* if you desire higher rates because both CPU and IO time
> >>> aren't taken into account. And it's extremely hard to control.  And it
> >>> seems entirely valuable to be able to limit the amount of WAL generated
> >>> for replication, but still try go get the rest of the work done as
> >>> quickly as reasonably possible wrt local IO.
> >>
> >> I'm all for making improvements to the vacuum code and making it easier
> >> to control.
> >>
> >> I don't buy off on the argument that there is some way to segregate the
> >> local I/O question from the WAL when we're talking about these kinds of
> >> operations (VACUUM, CREATE INDEX, CLUSTER, etc) on logged relations, nor
> >> do I think we do our users a service by giving them independent knobs
> >> for both that will undoubtably end up making it more difficult to
> >> understand and control what's going on overall.
> >>
> >> Even here, it seems, you're arguing that the existing approach for
> >> VACUUM is hard to control; wouldn't adding another set of knobs for
> >> controlling the amount of WAL generated by VACUUM make that worse?  I
> >> have a hard time seeing how it wouldn't.
> >
> > I think it's because I see them as, often, having two largely
> > independent use cases. If your goal is to avoid swamping replication
> > with WAL, you don't necessarily care about also throttling VACUUM
> > (or REINDEX, or CLUSTER, or ...)'s local IO.  By forcing to combine
> > the two you just make the whole feature less usable.
>
> I agree with that.

I can agree that they're different use-cases but one does end up
impacting the other and that's what I had been thinking about from the
perspective of "if we could provide just one knob for this."

VACUUM is a pretty good example- if we're dirtying a page with VACUUM
then we're also writing that page into the WAL (at least, if the
relation isn't unlogged).  Now, VACUUM does do other things (such as
read pages), as does REINDEX or CLUSTER, so maybe there's a way to think
about this feature in those terms- cost for doing local read i/o, vs.
cost for doing write i/o (to both heap and WAL) vs. cost for doing
"local" write i/o (just to heap, ie: unlogged tables).

What I was trying to say I didn't like previously was the idea of having
a "local i/o write" cost for VACUUM, to a logged table, *and* a "WAL
write" cost for VACUUM, since those are very tightly correlated.

The current costing mechanism in VACUUM only provides the single hammer
of "if we hit the limit, go to sleep for a while" which seems a bit
unfortunate- if we haven't hit the "read i/o" limit, it'd be nice if we
could keep going and then come back to writing out pages when enough
time has passed that we're below our "write i/o" limit.  That would end
up requiring quite a bit of change to how we do things though, I expect,
so probably not something to tie into this particular feature but I
wanted to express the thought in case others found it interesting.
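The three-way cost split suggested above could be sketched with per-category page costs, in the spirit of the existing vacuum_cost_* parameters (all names and numbers here are hypothetical, for illustration only):

```python
# Hypothetical cost split: reads, logged writes (heap page *and* WAL), and
# local-only writes (unlogged heap), each charged against the shared budget.
COST_PAGE_READ = 1
COST_PAGE_WRITE_LOGGED = 20   # dirties the heap and generates WAL
COST_PAGE_WRITE_LOCAL = 10    # unlogged relation: heap only, no WAL

def round_cost(reads, logged_writes, local_writes):
    """Notional cost of one round of work, to compare against a cost limit."""
    return (reads * COST_PAGE_READ
            + logged_writes * COST_PAGE_WRITE_LOGGED
            + local_writes * COST_PAGE_WRITE_LOCAL)

# With a cost limit of 200, ten logged page writes alone exhaust the budget,
# while the same ten writes to an unlogged table leave half of it free.
```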

> > I think it'd not be insane to add two things:
> > - WAL write rate limiting, independent of the vacuum stuff. It'd also be
> >   used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
> >   rewrites, ...)
> > - Account for WAL writes in the current vacuum costing logic, by
> >   accounting for it using a new cost parameter
> >
> > Then VACUUM would be throttled by the *minimum* of the two, which seems
> > to make plenty sense to me, given the usecases.
>
> Is it really minimum? If you add another cost parameter to the vacuum
> model, then there's almost no chance of actually reaching the limit
> because the budget (cost_limit) is shared with other stuff (local I/O).

Yeah, that does seem like it'd be an issue.

> FWIW I do think the ability to throttle WAL is a useful feature, I just
> don't want to shoot myself in the foot by making other things worse.
>
> As you note, the existing VACUUM throttling is already hard to control,
> this seems to make it even harder.

Agreed.

Thanks!

Stephen


Re: WAL insert delay settings

From
Robert Haas
Date:
On Fri, Feb 15, 2019 at 1:42 PM Andres Freund <andres@anarazel.de> wrote:
> I think it'd not be insane to add two things:
> - WAL write rate limiting, independent of the vacuum stuff. It'd also be
>   used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
>   rewrites, ...)
> - Account for WAL writes in the current vacuum costing logic, by
>   accounting for it using a new cost parameter
>
> Then VACUUM would be throttled by the *minimum* of the two, which seems
> to make plenty sense to me, given the usecases.

Or maybe we should just blow up the current vacuum cost delay stuff
and replace it with something that is easier to tune.  For example, we
could just have one parameter that sets the maximum read rate in kB/s
and another that sets the maximum dirty-page rate in kB/s.  Whichever
limit is tighter binds.  If we also have the thing that is the topic
of this thread, that's a third possible upper limit.

I really don't see much point in doubling down on the current vacuum
cost delay logic.  The overall idea is good, but the specific way that
you have to set the parameters is pretty inscrutable, and I think we
should just fix it so that it can be, uh, scruted.
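The "whichever limit is tighter binds" arithmetic could be sketched like this (an invented helper, not actual PostgreSQL code): each kB/s cap implies a minimum wall-clock time for the work done so far, and we sleep only if the most demanding cap says we are ahead of schedule.

```python
def required_sleep(read_kb, dirty_kb, elapsed_s, max_read_kbps, max_dirty_kbps):
    """Seconds to sleep so that neither kB/s cap is exceeded on average."""
    t_read = read_kb / max_read_kbps     # minimum time the reads may take
    t_dirty = dirty_kb / max_dirty_kbps  # minimum time the dirtying may take
    # The cap that demands more wall-clock time is the one that binds.
    return max(0.0, max(t_read, t_dirty) - elapsed_s)

# 4000 kB read and 2000 kB dirtied in 0.5 s, with caps of 8000 kB/s read and
# 2000 kB/s dirty: the dirty cap demands a full second, so sleep 0.5 s more.
```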

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WAL insert delay settings

From
Andres Freund
Date:
Hi,

On 2019-02-19 13:28:00 -0500, Robert Haas wrote:
> On Fri, Feb 15, 2019 at 1:42 PM Andres Freund <andres@anarazel.de> wrote:
> > I think it'd not be insane to add two things:
> > - WAL write rate limiting, independent of the vacuum stuff. It'd also be
> >   used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
> >   rewrites, ...)
> > - Account for WAL writes in the current vacuum costing logic, by
> >   accounting for it using a new cost parameter
> >
> > Then VACUUM would be throttled by the *minimum* of the two, which seems
> > to make plenty sense to me, given the usecases.
> 
> Or maybe we should just blow up the current vacuum cost delay stuff
> and replace it with something that is easier to tune.  For example, we
> could just have one parameter that sets the maximum read rate in kB/s
> and another that sets the maximum dirty-page rate in kB/s.  Whichever
> limit is tighter binds.  If we also have the thing that is the topic
> of this thread, that's a third possible upper limit.

> I really don't see much point in doubling down on the current vacuum
> cost delay logic.  The overall idea is good, but the specific way that
> you have to set the parameters is pretty inscrutable, and I think we
> should just fix it so that it can be, uh, scruted.

I agree that that's something worthwhile to do, but given that the
proposal in this thread is *NOT* just about VACUUM, I don't see why it'd
be useful to tie a general WAL rate limit to rewriting the cost limiting
of vacuum.  It seems better to write the WAL rate limiting logic with an
eye towards structuring it in a way that'd potentially allow reusing
some of the code for a better VACUUM cost limiting.

I still don't *AT ALL* buy Stephen and Tomas' argument that it'd be
confusing that when both VACUUM and WAL cost limiting are active, the
lower limit takes effect.

Greetings,

Andres Freund


Re: WAL insert delay settings

From
Tomas Vondra
Date:

On 2/19/19 7:35 PM, Andres Freund wrote:
> Hi,
> 
> On 2019-02-19 13:28:00 -0500, Robert Haas wrote:
>> On Fri, Feb 15, 2019 at 1:42 PM Andres Freund <andres@anarazel.de> wrote:
>>> I think it'd not be insane to add two things:
>>> - WAL write rate limiting, independent of the vacuum stuff. It'd also be
>>>   used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
>>>   rewrites, ...)
>>> - Account for WAL writes in the current vacuum costing logic, by
>>>   accounting for it using a new cost parameter
>>>
>>> Then VACUUM would be throttled by the *minimum* of the two, which seems
>>> to make plenty sense to me, given the usecases.
>>
>> Or maybe we should just blow up the current vacuum cost delay stuff
>> and replace it with something that is easier to tune.  For example, we
>> could just have one parameter that sets the maximum read rate in kB/s
>> and another that sets the maximum dirty-page rate in kB/s.  Whichever
>> limit is tighter binds.  If we also have the thing that is the topic
>> of this thread, that's a third possible upper limit.
> 
>> I really don't see much point in doubling down on the current vacuum
>> cost delay logic.  The overall idea is good, but the specific way that
>> you have to set the parameters is pretty inscrutable, and I think we
>> should just fix it so that it can be, uh, scruted.
> 
> I agree that that's something worthwhile to do, but given that the
> proposal in this thread is *NOT* just about VACUUM, I don't see why it'd
> be useful to tie a general WAL rate limiting to rewriting cost limiting
> of vacuum.  It seems better to write the WAL rate limiting logic with an
> eye towards structuring it in a way that'd potentially allow reusing
> some of the code for a better VACUUM cost limiting.
> 
> I still don't *AT ALL* buy Stephen and Tomas' argument that it'd be
> confusing that when both VACUUM and WAL cost limiting are active, the
> lower limit takes effect.
> 

Except that's not my argument. I'm not arguing against throttling once
we hit the minimum of limits.

The problem I have with implementing a separate throttling logic is that
it also changes the other limits (which are already kinda fuzzy). If you
add sleeps somewhere, those will affect the throttling built into
autovacuum (lowering them in some unknown way).

From this POV it would be better to include this in the vacuum cost
limit, because then it's at least subject to the same budget.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WAL insert delay settings

From
Robert Haas
Date:
On Tue, Feb 19, 2019 at 1:35 PM Andres Freund <andres@anarazel.de> wrote:
> I still don't *AT ALL* buy Stephen and Tomas' argument that it'd be
> confusing that when both VACUUM and WAL cost limiting are active, the
> lower limit takes effect.

I think you guys may all be in vigorous -- not to say mortal -- agreement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WAL insert delay settings

From
Andres Freund
Date:
On 2019-02-19 19:43:14 +0100, Tomas Vondra wrote:
> 
> 
> On 2/19/19 7:35 PM, Andres Freund wrote:
> > Hi,
> > 
> > On 2019-02-19 13:28:00 -0500, Robert Haas wrote:
> >> On Fri, Feb 15, 2019 at 1:42 PM Andres Freund <andres@anarazel.de> wrote:
> >>> I think it'd not be insane to add two things:
> >>> - WAL write rate limiting, independent of the vacuum stuff. It'd also be
> >>>   used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
> >>>   rewrites, ...)
> >>> - Account for WAL writes in the current vacuum costing logic, by
> >>>   accounting for it using a new cost parameter
> >>>
> >>> Then VACUUM would be throttled by the *minimum* of the two, which seems
> >>> to make plenty sense to me, given the usecases.
> >>
> >> Or maybe we should just blow up the current vacuum cost delay stuff
> >> and replace it with something that is easier to tune.  For example, we
> >> could just have one parameter that sets the maximum read rate in kB/s
> >> and another that sets the maximum dirty-page rate in kB/s.  Whichever
> >> limit is tighter binds.  If we also have the thing that is the topic
> >> of this thread, that's a third possible upper limit.
> > 
> >> I really don't see much point in doubling down on the current vacuum
> >> cost delay logic.  The overall idea is good, but the specific way that
> >> you have to set the parameters is pretty inscrutable, and I think we
> >> should just fix it so that it can be, uh, scruted.
> > 
> > I agree that that's something worthwhile to do, but given that the
> > proposal in this thread is *NOT* just about VACUUM, I don't see why it'd
> > be useful to tie a general WAL rate limiting to rewriting cost limiting
> > of vacuum.  It seems better to write the WAL rate limiting logic with an
> > eye towards structuring it in a way that'd potentially allow reusing
> > some of the code for a better VACUUM cost limiting.
> > 
> > I still don't *AT ALL* buy Stephen and Tomas' argument that it'd be
> > confusing that when both VACUUM and WAL cost limiting are active, the
> > lower limit takes effect.
> > 
> 
> Except that's not my argument. I'm not arguing against throttling once
> we hit the minimum of limits.
> 
> The problem I have with implementing a separate throttling logic is that
> it also changes the other limits (which are already kinda fuzzy). If you
> add sleeps somewhere, those will affects the throttling built into
> autovacuum (lowering them in some unknown way).

Those two paragraphs, to me, flat out contradict each other. If you
throttle according to the lower of two limits, *of course* the
throttling for the higher limit is affected in the sense that the limit
won't be reached.  So yea, if you have a WAL rate limit, and you reach
it, you won't be able to predict the IO rate for vacuum. Duh?  I must be
missing something here.

Greetings,

Andres Freund


Re: WAL insert delay settings

From
Tomas Vondra
Date:
On 2/19/19 7:50 PM, Andres Freund wrote:
> On 2019-02-19 19:43:14 +0100, Tomas Vondra wrote:
>>
>>
>> On 2/19/19 7:35 PM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2019-02-19 13:28:00 -0500, Robert Haas wrote:
>>>> On Fri, Feb 15, 2019 at 1:42 PM Andres Freund <andres@anarazel.de> wrote:
>>>>> I think it'd not be insane to add two things:
>>>>> - WAL write rate limiting, independent of the vacuum stuff. It'd also be
>>>>>   used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
>>>>>   rewrites, ...)
>>>>> - Account for WAL writes in the current vacuum costing logic, by
>>>>>   accounting for it using a new cost parameter
>>>>>
>>>>> Then VACUUM would be throttled by the *minimum* of the two, which seems
>>>>> to make plenty sense to me, given the usecases.
>>>>
>>>> Or maybe we should just blow up the current vacuum cost delay stuff
>>>> and replace it with something that is easier to tune.  For example, we
>>>> could just have one parameter that sets the maximum read rate in kB/s
>>>> and another that sets the maximum dirty-page rate in kB/s.  Whichever
>>>> limit is tighter binds.  If we also have the thing that is the topic
>>>> of this thread, that's a third possible upper limit.
>>>
>>>> I really don't see much point in doubling down on the current vacuum
>>>> cost delay logic.  The overall idea is good, but the specific way that
>>>> you have to set the parameters is pretty inscrutable, and I think we
>>>> should just fix it so that it can be, uh, scruted.
>>>
>>> I agree that that's something worthwhile to do, but given that the
>>> proposal in this thread is *NOT* just about VACUUM, I don't see why it'd
>>> be useful to tie a general WAL rate limiting to rewriting cost limiting
>>> of vacuum.  It seems better to write the WAL rate limiting logic with an
>>> eye towards structuring it in a way that'd potentially allow reusing
>>> some of the code for a better VACUUM cost limiting.
>>>
>>> I still don't *AT ALL* buy Stephen and Tomas' argument that it'd be
>>> confusing that when both VACUUM and WAL cost limiting are active, the
>>> lower limit takes effect.
>>>
>>
>> Except that's not my argument. I'm not arguing against throttling once
>> we hit the minimum of limits.
>>
>> The problem I have with implementing a separate throttling logic is that
>> it also changes the other limits (which are already kinda fuzzy). If you
>> add sleeps somewhere, those will affects the throttling built into
>> autovacuum (lowering them in some unknown way).
> 
> Those two paragraphs, to me, flat out contradict each other. If you
> throttle according to the lower of two limits, *of course* the
> throttling for the higher limit is affected in the sense that the limit
> won't be reached.  So yea, if you have a WAL rate limit, and you reach
> it, you won't be able to predict the IO rate for vacuum. Duh?  I must be
> missing something here.
> 

Let's do a short example. Assume the default vacuum costing parameters

    vacuum_cost_limit = 200
    vacuum_cost_delay = 20ms
    cost_page_dirty = 20

and for simplicity we only do writes. So vacuum can do ~8MB/s of writes.

Now, let's also throttle based on WAL - once in a while, after producing
some amount of WAL we sleep for a while. Again, for simplicity let's
assume the sleeps perfectly interleave and are also 20ms. So we have
something like:

    sleep(20ms); -- vacuum
    sleep(20ms); -- WAL
    sleep(20ms); -- vacuum
    sleep(20ms); -- WAL
    sleep(20ms); -- vacuum
    sleep(20ms); -- WAL
    sleep(20ms); -- vacuum
    sleep(20ms); -- WAL

Suddenly, we only reach 4MB/s of writes from vacuum. But we also reach
only 1/2 the WAL throughput, because it's affected exactly the same way
by the sleeps from vacuum throttling.

We've not reached either of the limits. How exactly is this "lower limit
takes effect"?
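The interaction is easy to model. In this toy calculation (zero work time, perfectly interleaved 20ms sleeps, invented helper) both mechanisms end up at half their configured rate, without either limit ever being the binding one:

```python
def throughput(bytes_per_round, sleeps_per_round, sleep_s=0.020):
    # Toy model: the work itself is instantaneous, only sleeps consume wall clock.
    return bytes_per_round / (sleeps_per_round * sleep_s)

PAGE = 8192
pages_per_round = 200 // 20        # vacuum_cost_limit / cost_page_dirty

alone = throughput(pages_per_round * PAGE, 1)        # vacuum's own sleeps only
interleaved = throughput(pages_per_round * PAGE, 2)  # vacuum + WAL sleeps

assert interleaved == alone / 2    # both limits undershot by half
```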


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WAL insert delay settings

From
Tomas Vondra
Date:

On 2/19/19 7:28 PM, Robert Haas wrote:
> On Fri, Feb 15, 2019 at 1:42 PM Andres Freund <andres@anarazel.de> wrote:
>> I think it'd not be insane to add two things:
>> - WAL write rate limiting, independent of the vacuum stuff. It'd also be
>>   used by lots of other bulk commands (CREATE INDEX, ALTER TABLE
>>   rewrites, ...)
>> - Account for WAL writes in the current vacuum costing logic, by
>>   accounting for it using a new cost parameter
>>
>> Then VACUUM would be throttled by the *minimum* of the two, which seems
>> to make plenty sense to me, given the usecases.
> 
> Or maybe we should just blow up the current vacuum cost delay stuff
> and replace it with something that is easier to tune.  For example, we
> could just have one parameter that sets the maximum read rate in kB/s
> and another that sets the maximum dirty-page rate in kB/s.  Whichever
> limit is tighter binds.  If we also have the thing that is the topic
> of this thread, that's a third possible upper limit.
> 
> I really don't see much point in doubling down on the current vacuum
> cost delay logic.  The overall idea is good, but the specific way that
> you have to set the parameters is pretty inscrutable, and I think we
> should just fix it so that it can be, uh, scruted.
> 

I think changing the vacuum throttling so that it uses actual I/O
amounts (in kB/s) instead of the cost limit would be a step in the right
direction. It's clearer, and it also works with arbitrary page sizes.

Then, instead of sleeping for a fixed amount of time after reaching the
cost limit, we should track progress and compute the amount of time we
actually need to sleep. AFAICS that's what spread checkpoints do.

I'm sure it's bound to be trickier in practice, of course.

FWIW I'm not entirely sure we want to fully separate the limits. I'd
argue using the same vacuum budget for both reads and writes is actually
the right thing to do - the I/O likely hits the same device. So it makes
sense to "balance" those two in some way. But that may or may not be the
case for WAL, which is often moved to a different device.
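A progress-based sleep in the spread-checkpoint style might look like this toy loop (simulated clock, invented names): rather than sleeping a fixed quantum when a budget is hit, track how far ahead of the target rate we are and sleep exactly that long.

```python
def paced_run(chunks, chunk_bytes, target_bytes_per_s, work_s_per_chunk=0.001):
    """Process fixed-size chunks, sleeping only as much as needed so that the
    average write rate never exceeds the target (spread-checkpoint style)."""
    clock = 0.0          # simulated wall clock, seconds
    done = 0             # bytes written so far
    slept = 0.0
    for _ in range(chunks):
        done += chunk_bytes
        clock += work_s_per_chunk          # the work itself takes time too
        behind = done / target_bytes_per_s - clock
        if behind > 0:                     # ahead of schedule: pause to catch down
            clock += behind                # a real version would pg_usleep() here
            slept += behind
    return clock, slept

# Writing 100 x 8192 B at a 1 MB/s target takes ~0.82 s of simulated wall
# clock, of which ~0.72 s is sleep (the work itself only takes 0.1 s).
```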

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WAL insert delay settings

From
Andres Freund
Date:
On 2019-02-19 20:02:32 +0100, Tomas Vondra wrote:
> Let's do a short example. Assume the default vacuum costing parameters
>
>     vacuum_cost_limit = 200
>     vacuum_cost_delay = 20ms
>     cost_page_dirty = 20
>
> and for simplicity we only do writes. So vacuum can do ~8MB/s of writes.
>
> Now, let's also throttle based on WAL - once in a while, after producing
> some amount of WAL we sleep for a while. Again, for simplicity let's
> assume the sleeps perfectly interleave and are also 20ms. So we have
> something like:

>     sleep(20ms); -- vacuum
>     sleep(20ms); -- WAL
>     sleep(20ms); -- vacuum
>     sleep(20ms); -- WAL
>     sleep(20ms); -- vacuum
>     sleep(20ms); -- WAL
>     sleep(20ms); -- vacuum
>     sleep(20ms); -- WAL
>
> Suddenly, we only reach 4MB/s of writes from vacuum. But we also reach
> only 1/2 the WAL throughput, because it's affected exactly the same way
> by the sleeps from vacuum throttling.
>
> We've not reached either of the limits. How exactly is this "lower limit
> takes effect"?

Because I said upthread that that's not how I think a sane
implementation of WAL throttling would work. I think the whole cost
budgeting approach is BAD, and it'd be a serious mistake to copy it for a
WAL rate limit (it disregards the time taken to execute IO and CPU costs
etc, and in this case the cost of other bandwidth limitations).  What
I'm saying is that we ought to instead specify a WAL rate in bytes/sec
and *only* sleep once we've exceeded it for a time period (with some
optimizations, so we don't gettimeofday() after every XLogInsert(), but
instead compute after how many bytes we need to re-determine the time to
see if we're still in the same 'granule').

Now, a non-toy implementation would probably want to have a
sliding window to avoid being overly bursty, and reduce the number of
gettimeofday() calls as mentioned above, but for explanation's sake basically
imagine that the "main loop" of a bulk xlog-emitting command would
invoke a helper with a computation in pseudocode like:

    current_time = gettimeofday();
    if (same_second(current_time, last_time))
    {
        wal_written_in_second += new_wal_written;
        if (wal_written_in_second >= wal_write_limit_per_second)
        {
           double too_much = (wal_written_in_second - wal_write_limit_per_second);
           sleep_fractional_seconds(too_much / wal_written_in_second);

           last_time = current_time;
        }
    }
    else
    {
        last_time = current_time;
    }

which'd mean that, in contrast to your example, we'd not continually sleep
for WAL; we'd only do so if we actually exceeded (or are projected to
exceed, in a smarter implementation) the specified WAL write rate. As
the 20ms sleeps from vacuum effectively reduce the WAL write rate, we'd
correspondingly sleep less.
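Filled out a little (with the per-granule counter reset that the pseudocode glosses over, and with the clock injected so the behaviour is deterministic; all names here are invented), the helper could look like:

```python
class WalRateLimiter:
    """Per-second 'granule' limiter: sleep only once the byte budget for the
    current second has been exceeded, as in the pseudocode above."""

    def __init__(self, limit_per_second, clock):
        self.limit = limit_per_second
        self.clock = clock           # time source in seconds; injectable for tests
        self.granule = int(clock())
        self.written = 0
        self.slept = 0.0             # total simulated sleep, for inspection

    def account(self, new_bytes):
        now = self.clock()
        if int(now) != self.granule:
            # New second: start a fresh byte budget.
            self.granule = int(now)
            self.written = 0
        self.written += new_bytes
        if self.written >= self.limit:
            # Sleep in proportion to how far past the budget we went; a real
            # implementation would pg_usleep() here instead of just recording.
            self.slept += (self.written - self.limit) / self.limit
            self.written = 0

now = [0.0]
lim = WalRateLimiter(1000, lambda: now[0])
lim.account(600)          # within budget: no sleep
lim.account(600)          # 1200 >= 1000: sleep 0.2 "seconds"
now[0] = 1.5
lim.account(600)          # new second, fresh budget: no sleep
assert lim.slept == 0.2
```

Because the counter only ever accrues sleep when the configured rate is actually exceeded, sleeps imposed elsewhere (e.g. vacuum's) simply mean this limiter triggers less often, which is the point about the lower limit taking effect.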


And my main point is that even if you implement a proper bytes/sec limit
ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get
meaningfully more confusing than right now.

Greetings,

Andres Freund


Re: WAL insert delay settings

From
Tomas Vondra
Date:

On 2/19/19 8:22 PM, Andres Freund wrote:
> On 2019-02-19 20:02:32 +0100, Tomas Vondra wrote:
>> Let's do a short example. Assume the default vacuum costing parameters
>>
>>     vacuum_cost_limit = 200
>>     vacuum_cost_delay = 20ms
>>     cost_page_dirty = 20
>>
>> and for simplicity we only do writes. So vacuum can do ~8MB/s of writes.
>>
>> Now, let's also throttle based on WAL - once in a while, after producing
>> some amount of WAL we sleep for a while. Again, for simplicity let's
>> assume the sleeps perfectly interleave and are also 20ms. So we have
>> something like:
> 
>>     sleep(20ms); -- vacuum
>>     sleep(20ms); -- WAL
>>     sleep(20ms); -- vacuum
>>     sleep(20ms); -- WAL
>>     sleep(20ms); -- vacuum
>>     sleep(20ms); -- WAL
>>     sleep(20ms); -- vacuum
>>     sleep(20ms); -- WAL
>>
>> Suddenly, we only reach 4MB/s of writes from vacuum. But we also reach
>> only 1/2 the WAL throughput, because it's affected exactly the same way
>> by the sleeps from vacuum throttling.
>>
>> We've not reached either of the limits. How exactly is this "lower limit
>> takes effect"?
> 
> Because I upthread said that that's not how I think a sane
> implementation of WAL throttling would work. I think the whole cost
> budgeting approach is BAD, and it'd be a serious mistake to copy it for a
> WAL rate limit (it disregards the time taken to execute IO and CPU costs
> etc, and in this case the cost of other bandwidth limitations).  What
> I'm saying is that we ought to instead specify a WAL rate in bytes/sec
> and *only* sleep once we've exceeded it for a time period (with some
> optimizations, so we don't gettimeofday after every XLogInsert(), but
> instead compute after how many bytes we need to re-determine the time to
> see if we're still in the same 'granule').
> 

OK, I agree with that. That's mostly what I described in response to
Robert a while ago, I think. (If you've described that earlier in the
thread, I missed it.)

> Now, a non-toy implementation would probably want to have a
> sliding window to avoid being overly bursty, and to reduce the number of
> gettimeofday calls as mentioned above, but for explanation's sake basically
> imagine that the "main loop" of a bulk xlog-emitting command would
> invoke a helper with a computation in pseudocode like:
> 
>     current_time = gettimeofday();
>     if (same_second(current_time, last_time))
>     {
>         wal_written_in_second += new_wal_written;
>         if (wal_written_in_second >= wal_write_limit_per_second)
>         {
>            double too_much = (wal_written_in_second - wal_write_limit_per_second);
>            sleep_fractional_seconds(too_much / wal_written_in_second);
> 
>            wal_written_in_second = 0;
>            last_time = current_time;
>         }
>     }
>     else
>     {
>         wal_written_in_second = new_wal_written;
>         last_time = current_time;
>     }
> 
> which'd mean that in contrast to your example we'd not continually sleep
> for WAL, we'd only do so if we actually exceeded (or are projected to
> exceed in a smarter implementation), the specified WAL write rate. As
> the 20ms sleeps from vacuum effectively reduce the WAL write rate, we'd
> correspondingly sleep less.
> 

Yes, that makes sense.

> 
> And my main point is that even if you implement a proper bytes/sec limit
> ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get
> meaningfully more confusing than right now.
> 

So, why not modify autovacuum to also use this approach? I wonder if
the situation there is more complicated because of multiple workers
sharing the same budget ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WAL insert delay settings

From
Andres Freund
Date:
Hi,

On 2019-02-19 20:34:25 +0100, Tomas Vondra wrote:
> On 2/19/19 8:22 PM, Andres Freund wrote:
> > And my main point is that even if you implement a proper bytes/sec limit
> > ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get
> > meaningfully more confusing than right now.
> > 
> 
> So, why not modify autovacuum to also use this approach? I wonder if
> the situation there is more complicated because of multiple workers
> sharing the same budget ...

I think the main reason is that implementing a scheme like this for WAL
rate limiting isn't a small task, but it'd be aided by the fact that
it'd probably not be on by default, and that there'd not be any regressions
because the behaviour didn't exist before. In contrast, people are
extremely sensitive to autovacuum behaviour changes, even if it's to
improve autovacuum. I think it makes more sense to build the logic in a
lower-profile case first, and then migrate autovacuum over to it. Even
leaving the maturity issue aside, reducing the scope of the project into
more bite-sized chunks seems to increase the likelihood of getting
anything substantial done.

Greetings,

Andres Freund


Re: WAL insert delay settings

From
David Rowley
Date:
On Wed, 20 Feb 2019 at 07:28, Robert Haas <robertmhaas@gmail.com> wrote:
> Or maybe we should just blow up the current vacuum cost delay stuff
> and replace it with something that is easier to tune.  For example, we
> could just have one parameter that sets the maximum read rate in kB/s
> and another that sets the maximum dirty-page rate in kB/s.  Whichever
> limit is tighter binds.  If we also have the thing that is the topic
> of this thread, that's a third possible upper limit.

I had similar thoughts when I saw that Peter's proposal didn't seem
all that compatible with how the vacuum cost delays work today. I
agree the cost limit would have to turn into something time-based
rather than points-based.

To me, it seems just too crude to have a per-backend limit.  I think
global "soft" limits would be better. Let's say, for example, the DBA
would like to CREATE INDEX CONCURRENTLY on a 6TB table.  They think
this is going to take about 36 hours, so they start the operation at
the start of off-peak, which is expected to last 12 hours. This means
the create index is going to run for 2 off-peaks and 1 on-peak.   Must
they really configure the create index to run at a speed that is
suitable for running at peak-load?  That's pretty wasteful as surely
it could run much more quickly during the off-peak.

I know there's debate as to whether this can rate limit WAL, but, if we
can find a way to do that, then it seems to me some settings like:

max_soft_global_wal_rate (MB/sec)
min_hard_local_wal_rate (MB/sec)

That way the rate limited process would slow down to
min_hard_local_wal_rate when the WAL rate of all processes is
exceeding max_soft_global_wal_rate.   The min_hard_local_wal_rate is
just there to ensure the process never stops completely. It can simply
tick along at that rate until the global WAL rate slows down again.

It's likely going to be easier to do something like that in WAL than
with buffer read/write/dirties since we already can easily see how
much WAL has been written by looking at the current LSN.

(Of course, these GUC names are not very good. I just picked them out
the air quickly to try and get their meaning across)

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: WAL insert delay settings

From
Tomas Vondra
Date:

On 2/19/19 8:40 PM, Andres Freund wrote:
> Hi,
> 
> On 2019-02-19 20:34:25 +0100, Tomas Vondra wrote:
>> On 2/19/19 8:22 PM, Andres Freund wrote:
>>> And my main point is that even if you implement a proper bytes/sec limit
>>> ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get
>>> meaningfully more confusing than right now.
>>>
>>
>> So, why not modify autovacuum to also use this approach? I wonder if
>> the situation there is more complicated because of multiple workers
>> sharing the same budget ...
> 
> I think the main reason is that implementing a scheme like this for WAL
> rate limiting isn't a small task, but it'd be aided by the fact that
> it'd probably not be on by default, and that there'd not be any regressions
> because the behaviour didn't exist before. In contrast, people are
> extremely sensitive to autovacuum behaviour changes, even if it's to
> improve autovacuum. I think it makes more sense to build the logic in a
> lower-profile case first, and then migrate autovacuum over to it. Even
> leaving the maturity issue aside, reducing the scope of the project into
> more bite-sized chunks seems to increase the likelihood of getting
> anything substantial done.
> 

Maybe.

I guess the main thing I'm advocating for here is to aim for a unified
throttling approach, not multiple disparate approaches interacting in
ways that are hard to understand/predict.

The time-based approach you described looks fine, and it's kinda what I
was imagining (and not unlike the checkpoint throttling). I don't think
it'd be that hard to tweak autovacuum to use it too, but I admit I have
not thought about it particularly hard and there's stuff like per-table
settings which might make it more complex.

So maybe doing it for WAL first makes sense. But I still think we need
to have at least a rough idea how it interacts with the existing
throttling and how it will work in the end.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WAL insert delay settings

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 2/19/19 8:40 PM, Andres Freund wrote:
> > On 2019-02-19 20:34:25 +0100, Tomas Vondra wrote:
> >> On 2/19/19 8:22 PM, Andres Freund wrote:
> >>> And my main point is that even if you implement a proper bytes/sec limit
> >>> ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get
> >>> meaningfully more confusing than right now.
> >>
> >> So, why not modify autovacuum to also use this approach? I wonder if
> >> the situation there is more complicated because of multiple workers
> >> sharing the same budget ...
> >
> > I think the main reason is that implementing a scheme like this for WAL
> > rate limiting isn't a small task, but it'd be aided by the fact that
> > it'd probably not be on by default, and that there'd not be any regressions
> > because the behaviour didn't exist before. In contrast, people are
> > extremely sensitive to autovacuum behaviour changes, even if it's to
> > improve autovacuum. I think it makes more sense to build the logic in a
> > lower-profile case first, and then migrate autovacuum over to it. Even
> > leaving the maturity issue aside, reducing the scope of the project into
> > more bite-sized chunks seems to increase the likelihood of getting
> > anything substantial done.
>
> Maybe.

I concur with that 'maybe'. :)

> I guess the main thing I'm advocating for here is to aim for a unified
> throttling approach, not multiple disparate approaches interacting in
> ways that are hard to understand/predict.

Yes, agreed.

> The time-based approach you described looks fine, and it's kinda what I
> was imagining (and not unlike the checkpoint throttling). I don't think
> it'd be that hard to tweak autovacuum to use it too, but I admit I have
> not thought about it particularly hard and there's stuff like per-table
> settings which might make it more complex.

When reading Andres' proposal, I was heavily reminded of how checkpoint
throttling is handled and wondered if there might be some way to reuse
or generalize that existing code/technique/etc and make it available to
be used for WAL, and more-or-less any/every other bulk operation (CREATE
INDEX, REINDEX, CLUSTER, VACUUM...).

> So maybe doing it for WAL first makes sense. But I still think we need
> to have at least a rough idea how it interacts with the existing
> throttling and how it will work in the end.

Well, it seems like Andres explained how it'll work with the existing
throttling, no?  As for how all of this will work in the end, that's a
good question but also a rather difficult one to answer, I suspect.

Just to share a few additional thoughts after pondering this for a
while, but the comment Andres made up-thread really struck a chord- we
don't necessarily want to throttle anything, what we'd really rather do
is *prioritize* things, whereby foreground work (regular queries and
such) has a higher priority than background/bulk work (VACUUM, REINDEX,
etc) but otherwise we use the system to its full capacity.  We don't
actually want to throttle a VACUUM run any more than a CREATE INDEX, we
just don't want those to hurt the performance of regular queries that
are happening.

The other thought I had was that doing things on a per-table basis, in
particular, isn't really addressing the resource question appropriately.
WAL is a relatively straightforward resource, independent from the IO
for the heap/indexes, so getting an idea from the admin of how
much capacity they have for WAL makes sense.  When it comes to the
capacity for the heap/indexes, in terms of IO, that really goes to the
underlying storage system/channel, which would actually be a tablespace
in properly set up environments (imv anyway).

Wrapping this up- it seems unlikely that we're going to get a
priority-based system in place any time particularly soon but I do think
it's worthy of some serious consideration and discussion about how we
might be able to get there.  On the other hand, if we can provide a way
for the admin to say "these are my IO channels (per-tablespace values,
plus a value for WAL), here's what their capacity is, and here's how
much buffer for foreground work I want to have (again, per IO channel),
so, PG, please arrange to not use more than 'capacity-buffer' amount of
resources for background/bulk tasks (per IO channel)" then we can at
least help them address the issue that foreground tasks are being
stalled or delayed due to background/bulk work.  This approach means
that they won't be utilizing the system to its full capacity, but
they'll know that and they'll know that it's because, for them, it's
more important that they have that low latency for foreground tasks.

Thanks!

Stephen


Re: WAL insert delay settings

From
Tomas Vondra
Date:
On 2/20/19 10:43 PM, Stephen Frost wrote:
> Greetings,
> ...
>>
>> So maybe doing it for WAL first makes sense. But I still think we need
>> to have at least a rough idea how it interacts with the existing
>> throttling and how it will work in the end.
> 
> Well, it seems like Andres explained how it'll work with the existing
> throttling, no?  As for how all of this will work in the end, that's a
> good question but also a rather difficult one to answer, I suspect.
> 

Well ... he explained how to do WAL throttling, and I agree what he
proposed seems entirely sane to me.

But when it comes to interactions with current vacuum cost-based
throttling, he claims it does not get meaningfully more confusing due to
interactions with WAL throttling. I don't quite agree with that, but I'm
not going to beat this horse any longer ...

> Just to share a few additional thoughts after pondering this for a
> while, but the comment Andres made up-thread really struck a chord- we
> don't necessarily want to throttle anything, what we'd really rather do
> is *prioritize* things, whereby foreground work (regular queries and
> such) have a higher priority than background/bulk work (VACUUM, REINDEX,
> etc) but otherwise we use the system to its full capacity.  We don't
> actually want to throttle a VACUUM run any more than a CREATE INDEX, we
> just don't want those to hurt the performance of regular queries that
> are happening.
> 

I think you're forgetting the motivation of this very patch was to
prevent replication lag caused by a command generating large amounts of
WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to
do with prioritization or foreground/background split.

I'm not arguing against ability to prioritize stuff, but I disagree it
somehow replaces throttling.

> The other thought I had was that doing things on a per-table basis, in
> particular, isn't really addressing the resource question appropriately.
> WAL is a relatively straightforward resource, independent from the IO
> for the heap/indexes, so getting an idea from the admin of how
> much capacity they have for WAL makes sense.  When it comes to the
> capacity for the heap/indexes, in terms of IO, that really goes to the
> underlying storage system/channel, which would actually be a tablespace
> in properly set up environments (imv anyway).
> 
> Wrapping this up- it seems unlikely that we're going to get a
> priority-based system in place any time particularly soon but I do think
> it's worthy of some serious consideration and discussion about how we
> might be able to get there.  On the other hand, if we can provide a way
> for the admin to say "these are my IO channels (per-tablespace values,
> plus a value for WAL), here's what their capacity is, and here's how
> much buffer for foreground work I want to have (again, per IO channel),
> so, PG, please arrange to not use more than 'capacity-buffer' amount of
> resources for background/bulk tasks (per IO channel)" then we can at
> least help them address the issue that foreground tasks are being
> stalled or delayed due to background/bulk work.  This approach means
> that they won't be utilizing the system to its full capacity, but
> they'll know that and they'll know that it's because, for them, it's
> more important that they have that low latency for foreground tasks.
> 

I think it's a mostly orthogonal feature to throttling.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WAL insert delay settings

From
Stephen Frost
Date:
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 2/20/19 10:43 PM, Stephen Frost wrote:
> > Just to share a few additional thoughts after pondering this for a
> > while, but the comment Andres made up-thread really struck a chord- we
> > don't necessarily want to throttle anything, what we'd really rather do
> > is *prioritize* things, whereby foreground work (regular queries and
> > such) have a higher priority than background/bulk work (VACUUM, REINDEX,
> > etc) but otherwise we use the system to its full capacity.  We don't
> > actually want to throttle a VACUUM run any more than a CREATE INDEX, we
> > just don't want those to hurt the performance of regular queries that
> > are happening.
>
> I think you're forgetting the motivation of this very patch was to
> prevent replication lag caused by a command generating large amounts of
> WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to
> do with prioritization or foreground/background split.
>
> I'm not arguing against ability to prioritize stuff, but I disagree it
> somehow replaces throttling.

Why is replication lag an issue though?  I would contend it's an issue
because with sync replication, it makes foreground processes wait, and
with async replication, it makes the actions of foreground processes
show up late on the replicas.

If the actual WAL records for the foreground processes got priority and
were pushed out earlier than the background ones, that would eliminate
both of those issues with replication lag.  Perhaps there's other issues
that replication lag cause but which aren't solved by prioritizing the
actual WAL records that you care about getting to the replicas faster,
but if so, I'd like to hear what those are.

> > The other thought I had was that doing things on a per-table basis, in
> > particular, isn't really addressing the resource question appropriately.
> > WAL is a relatively straightforward resource, independent from the IO
> > for the heap/indexes, so getting an idea from the admin of how
> > much capacity they have for WAL makes sense.  When it comes to the
> > capacity for the heap/indexes, in terms of IO, that really goes to the
> > underlying storage system/channel, which would actually be a tablespace
> > in properly set up environments (imv anyway).
> >
> > Wrapping this up- it seems unlikely that we're going to get a
> > priority-based system in place any time particularly soon but I do think
> > it's worthy of some serious consideration and discussion about how we
> > might be able to get there.  On the other hand, if we can provide a way
> > for the admin to say "these are my IO channels (per-tablespace values,
> > plus a value for WAL), here's what their capacity is, and here's how
> > much buffer for foreground work I want to have (again, per IO channel),
> > so, PG, please arrange to not use more than 'capacity-buffer' amount of
> > resources for background/bulk tasks (per IO channel)" then we can at
> > least help them address the issue that foreground tasks are being
> > stalled or delayed due to background/bulk work.  This approach means
> > that they won't be utilizing the system to its full capacity, but
> > they'll know that and they'll know that it's because, for them, it's
> > more important that they have that low latency for foreground tasks.
>
> I think it's a mostly orthogonal feature to throttling.

I'm... not sure that what I was getting at above really got across.

What I was saying above, in a nutshell, is that if we're going to
provide throttling then we should give users a way to configure the
throttling on a per-IO-channel basis, which means at the tablespace
level, plus an independent configuration option for WAL since we allow
that to be placed elsewhere too.

Ideally, the configuration parameter would be in the same units as the
actual resource is too- which would probably be IOPS+bandwidth, really.
Just doing it in terms of bandwidth ends up being a bit of a mismatch
as compared to reality, and would mean that users would have to tune it
down farther than they might otherwise and therefore give up that much
more in terms of system capability.

Thanks!

Stephen


Re: WAL insert delay settings

From
Andres Freund
Date:
Hi,

On 2019-02-20 18:46:09 -0500, Stephen Frost wrote:
> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> > On 2/20/19 10:43 PM, Stephen Frost wrote:
> > > Just to share a few additional thoughts after pondering this for a
> > > while, but the comment Andres made up-thread really struck a chord- we
> > > don't necessarily want to throttle anything, what we'd really rather do
> > > is *prioritize* things, whereby foreground work (regular queries and
> > > such) have a higher priority than background/bulk work (VACUUM, REINDEX,
> > > etc) but otherwise we use the system to its full capacity.  We don't
> > > actually want to throttle a VACUUM run any more than a CREATE INDEX, we
> > > just don't want those to hurt the performance of regular queries that
> > > are happening.
> > 
> > I think you're forgetting the motivation of this very patch was to
> > prevent replication lag caused by a command generating large amounts of
> > WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to
> > do with prioritization or foreground/background split.
> > 
> > I'm not arguing against ability to prioritize stuff, but I disagree it
> > somehow replaces throttling.
> 
> Why is replication lag an issue though?  I would contend it's an issue
> because with sync replication, it makes foreground processes wait, and
> with async replication, it makes the actions of foreground processes
> show up late on the replicas.

I think reaching the bandwidth limit of either the replication stream,
or of the startup process is actually more common than these. And for
that prioritization doesn't help, unless it somehow reduces the total
amount of WAL.


> If the actual WAL records for the foreground processes got priority and
> were pushed out earlier than the background ones, that would eliminate
> both of those issues with replication lag.  Perhaps there's other issues
> that replication lag cause but which aren't solved by prioritizing the
> actual WAL records that you care about getting to the replicas faster,
> but if so, I'd like to hear what those are.

Wait, what? Are you actually suggesting that different sources of WAL
records should be streamed out in different order? You're blowing a
somewhat reasonably doable project up into "let's rewrite a large chunk
of all of the durability layer in postgres".


Stephen, we gotta stop blowing up projects into something that can't
ever realistically be finished.

Greetings,

Andres Freund


Re: WAL insert delay settings

From
Stephen Frost
Date:
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On 2019-02-20 18:46:09 -0500, Stephen Frost wrote:
> > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> > > On 2/20/19 10:43 PM, Stephen Frost wrote:
> > > > Just to share a few additional thoughts after pondering this for a
> > > > while, but the comment Andres made up-thread really struck a chord- we
> > > > don't necessarily want to throttle anything, what we'd really rather do
> > > > is *prioritize* things, whereby foreground work (regular queries and
> > > > such) have a higher priority than background/bulk work (VACUUM, REINDEX,
> > > > etc) but otherwise we use the system to its full capacity.  We don't
> > > > actually want to throttle a VACUUM run any more than a CREATE INDEX, we
> > > > just don't want those to hurt the performance of regular queries that
> > > > are happening.
> > >
> > > I think you're forgetting the motivation of this very patch was to
> > > prevent replication lag caused by a command generating large amounts of
> > > WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to
> > > do with prioritization or foreground/background split.
> > >
> > > I'm not arguing against ability to prioritize stuff, but I disagree it
> > > somehow replaces throttling.
> >
> > Why is replication lag an issue though?  I would contend it's an issue
> > because with sync replication, it makes foreground processes wait, and
> > with async replication, it makes the actions of foreground processes
> > show up late on the replicas.
>
> I think reaching the bandwidth limit of either the replication stream,
> or of the startup process is actually more common than these. And for
> that prioritization doesn't help, unless it somehow reduces the total
> amount of WAL.

The issue with hitting those bandwidth limits is that you end up with
queues outside of your control and therefore are unable to prioritize
the data going through them.  I agree, that's an issue and it might be
necessary to ask the admin to provide what the bandwidth limit is, so
that we could then avoid running into issues with downstream queues that
are outside of our control causing unexpected/unacceptable lag.

> > If the actual WAL records for the foreground processes got priority and
> > were pushed out earlier than the background ones, that would eliminate
> > both of those issues with replication lag.  Perhaps there's other issues
> > that replication lag cause but which aren't solved by prioritizing the
> > actual WAL records that you care about getting to the replicas faster,
> > but if so, I'd like to hear what those are.
>
> Wait, what? Are you actually suggesting that different sources of WAL
> records should be streamed out in different order? You're blowing a
> somewhat reasonably doable project up into "let's rewrite a large chunk
> of all of the durability layer in postgres".
>
> Stephen, we gotta stop blowing up projects into something that can't
> ever realistically be finished.

I started this sub-thread specifically the way I did because I was
trying to make it clear that these were just ideas for possible
discussion- I'm *not* suggesting, nor saying, that we have to go
implement this right now instead of implementing the throttling that
started this thread.  I'm also, to be clear, not objecting to
implementing the throttling discussed (though, as mentioned but
seemingly ignored, I'd see it maybe configurable in different ways than
originally suggested).

If there's a way I can get that across more clearly than saying "Just to
share a few additional thoughts", I'm happy to try and do so, but I
don't agree that I should be required to simply keep such thoughts to
myself; indeed, I'll admit that I don't know how large a project this
would actually be and while I figured it'd be *huge*, I wanted to share
the thought in case someone might see a way that we could implement it
with much less work and have a better solution as a result.

Thanks!

Stephen


Re: WAL insert delay settings

From
Ants Aasma
Date:
On Thu, Feb 21, 2019 at 2:20 AM Stephen Frost <sfrost@snowman.net> wrote:
* Andres Freund (andres@anarazel.de) wrote:
> On 2019-02-20 18:46:09 -0500, Stephen Frost wrote:
> > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> > > On 2/20/19 10:43 PM, Stephen Frost wrote:
> > > > Just to share a few additional thoughts after pondering this for a
> > > > while, but the comment Andres made up-thread really struck a chord- we
> > > > don't necessarily want to throttle anything, what we'd really rather do
> > > > is *prioritize* things, whereby foreground work (regular queries and
> > > > such) have a higher priority than background/bulk work (VACUUM, REINDEX,
> > > > etc) but otherwise we use the system to its full capacity.  We don't
> > > > actually want to throttle a VACUUM run any more than a CREATE INDEX, we
> > > > just don't want those to hurt the performance of regular queries that
> > > > are happening.
> > >
> > > I think you're forgetting the motivation of this very patch was to
> > > prevent replication lag caused by a command generating large amounts of
> > > WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to
> > > do with prioritization or foreground/background split.
> > >
> > > I'm not arguing against ability to prioritize stuff, but I disagree it
> > > somehow replaces throttling.
> >
> > Why is replication lag an issue though?  I would contend it's an issue
> > because with sync replication, it makes foreground processes wait, and
> > with async replication, it makes the actions of foreground processes
> > show up late on the replicas.
>
> I think reaching the bandwidth limit of either the replication stream,
> or of the startup process is actually more common than these. And for
> that prioritization doesn't help, unless it somehow reduces the total
> amount of WAL.

The issue with hitting those bandwidth limits is that you end up with
queues outside of your control and therefore are unable to prioritize
the data going through them.  I agree, that's an issue and it might be
necessary to ask the admin to provide what the bandwidth limit is, so
that we could then avoid running into issues with downstream queues that
are outside of our control causing unexpected/unacceptable lag.

If there is a global rate limit on WAL throughput it could be adjusted by a control loop, measuring replication queue length and/or apply delay. I don't see any sane way how one would tune a per command rate limit, or even worse, a cost-delay parameter. It would have the same problems as work_mem settings.
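As a sketch of what such a feedback loop might look like, here is a simple
proportional controller in C that nudges a global WAL rate limit based on
measured apply lag. The function name, the 0.1 gain, and the clamping range
are all made up for illustration; a real controller would want careful tuning
and probably an integral term:

```c
/*
 * Hypothetical proportional controller for a global WAL rate limit.
 * Called periodically (e.g. once per second): the limit shrinks while
 * the measured standby apply lag exceeds the target and grows back once
 * the standby has caught up, clamped to a configured range.
 */
double
adjust_wal_rate_limit(double cur_limit, double apply_lag_sec,
                      double target_lag_sec,
                      double min_limit, double max_limit)
{
    /* positive error means we have headroom; negative means we're behind */
    double error = target_lag_sec - apply_lag_sec;
    double next = cur_limit * (1.0 + 0.1 * error);

    if (next < min_limit)
        next = min_limit;
    if (next > max_limit)
        next = max_limit;
    return next;
}
```

The appeal of closing the loop this way is that the admin specifies an
observable target (acceptable lag) instead of guessing at a bytes/sec number,
and the limit adapts as replication bandwidth or standby apply speed changes.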

Rate limit in front of WAL insertion would allow for allocating the throughput between foreground and background tasks, and even allow for priority inheritance to alleviate priority inversion due to locks.

There is also an implicit assumption here that a maintenance command is a background task and a normal DML query is a foreground task. This is not true for all cases, users may want to throttle transactions doing lots of DML to keep synchronous commit latencies for smaller transactions within reasonable limits.

As a wild idea for how to handle the throttling, what if when all our wal insertion credits are used up XLogInsert() sets InterruptPending and the actual sleep is done inside ProcessInterrupts()?

Regards,
Ants Aasma

Re: WAL insert delay settings

From
Stephen Frost
Date:
Greetings,

* Ants Aasma (ants.aasma@eesti.ee) wrote:
> On Thu, Feb 21, 2019 at 2:20 AM Stephen Frost <sfrost@snowman.net> wrote:
> > The issue with hitting those bandwidth limits is that you end up with
> > queues outside of your control and therefore are unable to prioritize
> > the data going through them.  I agree, that's an issue and it might be
> > necessary to ask the admin to provide what the bandwidth limit is, so
> > that we could then avoid running into issues with downstream queues that
> > are outside of our control causing unexpected/unacceptable lag.
>
> If there is a global rate limit on WAL throughput it could be adjusted by a
> control loop, measuring replication queue length and/or apply delay. I
> don't see any sane way how one would tune a per command rate limit, or even
> worse, a cost-delay parameter. It would have the same problems as work_mem
> settings.

Yeah, having some kind of feedback loop would be interesting.  I agree
that a per-command rate limit would have similar problems to work_mem,
and that's definitely one problem we have with the way VACUUM is tuned
today, but the ship has more-or-less sailed on that; I don't think we're
going to be able to simply remove the VACUUM settings.  Avoiding adding
new per-command settings would be good though, if we can work out a way
to do it.
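For illustration, such a feedback loop could be as crude as a proportional controller that nudges a global WAL rate limit toward an apply-lag target (a sketch with assumed units and invented names, not a worked-out design):

```python
def adjust_rate_limit(rate_mb_s, lag_s, target_lag_s,
                      gain=0.5, min_rate=1.0, max_rate=1000.0):
    """Proportional controller: shrink the global WAL rate limit when
    measured apply lag exceeds the target, grow it when there is slack."""
    error = target_lag_s - lag_s          # positive -> we have headroom
    new_rate = rate_mb_s * (1.0 + gain * error / target_lag_s)
    return max(min_rate, min(max_rate, new_rate))

# Run once per measurement interval, e.g. feeding in pg_stat_replication lag.
rate = 100.0
rate = adjust_rate_limit(rate, lag_s=4.0, target_lag_s=2.0)  # lag too high -> throttle
```

The admin would then only need to state the goal (acceptable lag) rather than the means (a rate), which avoids the work_mem-style tuning problem.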

> Rate limit in front of WAL insertion would allow for allocating the
> throughput between foreground and background tasks, and even allow for
> priority inheritance to alleviate priority inversion due to locks.

I'm not sure how much we have to worry about priority inversion here as
you need to have conflicts for that and if there's actually a conflict,
then it seems like we should just press on.

That is, a non-concurrent REINDEX is going to prevent an UPDATE from
modifying anything in the table, which if the UPDATE is a higher
priority than the REINDEX would be priority inversion, but that doesn't
mean we should slow down the REINDEX to allow the UPDATE to happen
because the UPDATE simply can't happen until the REINDEX is complete.
Now, we might slow down the REINDEX because there's UPDATEs against
*other* tables that aren't conflicting and we want those UPDATEs to be
prioritized over the REINDEX but then that isn't priority inversion.

Basically, I'm not sure that there's anything we can do, or need to do,
differently from what we do today when it comes to priority inversion
risk, at least as it relates to this discussion.  There's an interesting
discussion to be had about whether we should delay the REINDEX taking the
lock at all when there's an UPDATE pending, but you run the risk of
starving the REINDEX from ever getting the lock and being able to run in
that case.  A better approach is what we're already working on- arrange
for the REINDEX to not require a conflicting lock, so that both can run
concurrently.

> There is also an implicit assumption here that a maintenance command is a
> background task and a normal DML query is a foreground task. This is not
> true for all cases, users may want to throttle transactions doing lots of
> DML to keep synchronous commit latencies for smaller transactions within
> reasonable limits.

Agreed, that was something that I was contemplating too, and one could
possibly argue in the other direction as well (maybe that REINDEX is on
a small table but has a very high priority and we're willing to accept
that some regular DML is delayed a bit to allow that REINDEX to finish).
Still, I would think we'd basically want to use the heuristic that DDL
is bulk and DML is a higher priority for a baseline/default position,
but then provide users with a way to change the priority on a
per-session level, presumably with a GUC or similar, if they have a case
where that heuristic is wrong.

Again, just to be clear, this is all really 'food for thought' and
interesting discussion and shouldn't keep us from doing something simple
now, if we can, to help alleviate the immediate practical issue that
bulk commands can cause serious WAL lag.  I think it's good to have
these discussions since they may help us to craft the simple solution in
a way that could later be extended (or at least won't get in the way)
for these much larger changes, but even if that's not possible, we
should be open to accepting a simpler, short-term, improvement, as these
larger changes would very likely be multiple major releases away if
they're able to be done at all.

> As a wild idea for how to handle the throttling, what if when all our wal
> insertion credits are used up XLogInsert() sets InterruptPending and the
> actual sleep is done inside ProcessInterrupts()?

This comment might be better if it was made higher up in the thread,
closer to where the discussion was happening about the issues with
critical sections and the current patch's approach for throttle-based
rate limiting.  I'm afraid that it might get lost in this sub-thread
about these much larger and loftier ideas around where we might want to
go in the future.

Thanks!

Stephen


Re: WAL insert delay settings

From
Ants Aasma
Date:
On Thu, Feb 21, 2019 at 12:50 PM Stephen Frost <sfrost@snowman.net> wrote:
> > Rate limit in front of WAL insertion would allow for allocating the
> > throughput between foreground and background tasks, and even allow for
> > priority inheritance to alleviate priority inversion due to locks.
>
> I'm not sure how much we have to worry about priority inversion here as
> you need to have conflicts for that and if there's actually a conflict,
> then it seems like we should just press on.
>
> That is, a non-concurrent REINDEX is going to prevent an UPDATE from
> modifying anything in the table, which if the UPDATE is a higher
> priority than the REINDEX would be priority inversion, but that doesn't
> mean we should slow down the REINDEX to allow the UPDATE to happen
> because the UPDATE simply can't happen until the REINDEX is complete.
> Now, we might slow down the REINDEX because there's UPDATEs against
> *other* tables that aren't conflicting and we want those UPDATEs to be
> prioritized over the REINDEX but then that isn't priority inversion.

I was thinking along the lines that each backend gets a budget of WAL insertion credits per time interval, and when the credits run out the process sleeps. With this type of scheme it would be reasonably straightforward to let UPDATEs being blocked by REINDEX to transfer their WAL insertion budgets to the REINDEX, making it get a larger piece of the total throughput pie.
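That budget-transfer idea can be sketched in a few lines (a toy model, not PostgreSQL code; `WalBudget` and the credit numbers are invented for illustration):

```python
class WalBudget:
    """Toy per-backend WAL insertion budget with transfer on lock wait.

    When a backend blocks on a lock, it donates its per-interval credits
    to the lock holder, so e.g. a REINDEX being waited on by UPDATEs
    inherits their share of WAL throughput."""

    def __init__(self, credits_per_interval):
        self.base = credits_per_interval   # own allocation
        self.donated = 0                   # credits inherited from waiters

    def interval_budget(self):
        return self.base + self.donated

    def donate_to(self, lock_holder):
        # Called when this backend starts waiting on lock_holder.
        lock_holder.donated += self.base

    def reclaim_from(self, lock_holder):
        # Called when the lock is granted and the wait ends.
        lock_holder.donated -= self.base

reindex = WalBudget(credits_per_interval=1024)
update1 = WalBudget(credits_per_interval=1024)
update2 = WalBudget(credits_per_interval=1024)
update1.donate_to(reindex)
update2.donate_to(reindex)
# While both UPDATEs wait, the REINDEX runs with triple the throughput share.
```

Attaching budgets to the lock queue like this means a backlog of waiters automatically raises the holder's throughput share, rather than letting a throttled REINDEX crawl along while everyone queues behind it.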

Regards,
Ants Aasma

Re: WAL insert delay settings

From
Stephen Frost
Date:
Greetings,

* Ants Aasma (ants.aasma@eesti.ee) wrote:
> On Thu, Feb 21, 2019 at 12:50 PM Stephen Frost <sfrost@snowman.net> wrote:
>
> > > Rate limit in front of WAL insertion would allow for allocating the
> > > throughput between foreground and background tasks, and even allow for
> > > priority inheritance to alleviate priority inversion due to locks.
> >
> > I'm not sure how much we have to worry about priority inversion here as
> > you need to have conflicts for that and if there's actually a conflict,
> > then it seems like we should just press on.
> >
> > That is, a non-concurrent REINDEX is going to prevent an UPDATE from
> > modifying anything in the table, which if the UPDATE is a higher
> > priority than the REINDEX would be priority inversion, but that doesn't
> > mean we should slow down the REINDEX to allow the UPDATE to happen
> > because the UPDATE simply can't happen until the REINDEX is complete.
> > Now, we might slow down the REINDEX because there's UPDATEs against
> > *other* tables that aren't conflicting and we want those UPDATEs to be
> > prioritized over the REINDEX but then that isn't priority inversion.
>
> I was thinking along the lines that each backend gets a budget of WAL
> insertion credits per time interval, and when the credits run out the
> process sleeps. With this type of scheme it would be reasonably
> straightforward to let UPDATEs being blocked by REINDEX to transfer their
> WAL insertion budgets to the REINDEX, making it get a larger piece of the
> total throughput pie.

Sure, that could possibly be done if it's a per-backend throttle
mechanism, but that's got more-or-less the same issue as a per-command
mechanism and work_mem as discussed up-thread.

Also, it seems like if we've worked out a way to do this transferring
and delaying, then we could come up with a way to prioritize foreground
work (give it more credits) over background work (there was another
point made elsewhere in the thread that background processes should
still always be given *some* credits, so that they don't end up starving
completely, and I agree with that).

There's actually a lot of similarity or parallels between this and basic
network traffic shaping, it seems to me anyway, where you have a pipe of
a certain size and you want to prioritize some things (like interactive
SSH) while de-prioritizing other things (bulk SCP) but also using the
pipe fully.
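Pushing the traffic-shaping parallel a bit further, the "prioritize but keep the pipe full" behaviour is essentially weighted max-min fair sharing; a sketch (the class names, weights, and units are invented for illustration):

```python
def allocate_bandwidth(total, demands, weights):
    """Weighted max-min fair share: give each class its weighted share of
    the pipe, and redistribute any share a class cannot use to the rest,
    so the full pipe is used whenever there is demand for it."""
    alloc = {c: 0.0 for c in demands}
    active = {c for c in demands if demands[c] > 0}
    remaining = total
    while active and remaining > 1e-9:
        total_w = sum(weights[c] for c in active)
        satisfied = set()
        for c in active:
            share = remaining * weights[c] / total_w
            grant = min(share, demands[c] - alloc[c])
            alloc[c] += grant
            if alloc[c] >= demands[c] - 1e-9:
                satisfied.add(c)
        remaining = total - sum(alloc.values())
        if not satisfied:
            break
        active -= satisfied
    return alloc

# Foreground DML gets weight 3, bulk REINDEX weight 1, but the bulk class
# soaks up whatever share the lightly loaded foreground leaves unused.
a = allocate_bandwidth(100.0, {"dml": 20.0, "reindex": 200.0},
                       {"dml": 3, "reindex": 1})
```

With demands of 20 and 200 against a pipe of 100, the DML class gets its full 20 and the REINDEX soaks up the remaining 80, exactly the interactive-SSH-vs-bulk-SCP behaviour described above.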

Thanks!

Stephen
