Thread: Re: bgwrite process is too lazy
On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
> Whenever I check the checkpoint information in a log, most dirty pages are written by the checkpoint process

That's exactly how it should be!

Yours,
Laurenz Albe
On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
> Whenever I check the checkpoint information in a log, most dirty pages are written by the checkpoint process
That's exactly how it should be!
is it because if bgwriter frequently flushes, the disk io will be more?🤔
On 10/2/24 17:02, Tony Wayne wrote:
>
>
> On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe <laurenz.albe@cybertec.at
> <mailto:laurenz.albe@cybertec.at>> wrote:
>
> On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
> > Whenever I check the checkpoint information in a log, most dirty
> pages are written by the checkpoint process
>
> That's exactly how it should be!
>
> is it because if bgwriter frequently flushes, the disk io will be more?🤔

Yes, pretty much. But it's also about where the writes happen.

Checkpoint flushes dirty buffers only once per checkpoint interval,
which is the lowest amount of write I/O that needs to happen.

Every other way of flushing buffers is less efficient, and is mostly a
sign of memory pressure (shared buffers not large enough for active part
of the data).

The other aspect is where the writes happen: the checkpointer does that
in the background, not as part of regular query execution. What we
don't want is for the user backends to flush buffers, because it's
expensive and can result in much higher latency.

The bgwriter is somewhere in between - it happens in the background,
but may not be as efficient as doing it in the checkpointer. Still much
better than having to do this in regular backends.

regards

--
Tomas Vondra
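For reference, the per-process split that shows up in the log can also be read from the statistics views; a minimal query, assuming PostgreSQL 16 or older column names (on 17 the checkpointer counters live in pg_stat_checkpointer and backend writes are reported in pg_stat_io):

SELECT buffers_checkpoint,   -- pages written by the checkpointer
       buffers_clean,        -- pages written by the bgwriter
       buffers_backend       -- pages written by regular backends (the expensive case)
FROM pg_stat_bgwriter;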
Hi Tomas
Thank you for explaining. If we do not change this behaviour itself, could we, taking autovacuum_vacuum_cost_delay as a reference, lower the minimum value of the bgwriter_delay parameter to 2ms? That would let the bgwriter write more dirty pages and reduce the impact on performance when a checkpoint occurs. After all, checkpoint spacing and crash recovery time have to be balanced against each other, and checkpoint_timeout / max_wal_size cannot be increased indefinitely.
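For reference, the current lower bound on bgwriter_delay (10ms in recent releases) is visible in pg_settings; a simple lookup:

SELECT name, setting, unit, min_val, max_val
FROM pg_settings
WHERE name LIKE 'bgwriter_%';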
Thanks
On Thu, Oct 3, 2024 at 00:36, Tomas Vondra <tomas@vondra.me> wrote:
On 10/2/24 17:02, Tony Wayne wrote:
>
>
> On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe <laurenz.albe@cybertec.at
> <mailto:laurenz.albe@cybertec.at>> wrote:
>
> On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
> > Whenever I check the checkpoint information in a log, most dirty
> pages are written by the checkpoint process
>
> That's exactly how it should be!
>
> is it because if bgwriter frequently flushes, the disk io will be more?🤔
Yes, pretty much. But it's also about where the writes happen.
Checkpoint flushes dirty buffers only once per checkpoint interval,
which is the lowest amount of write I/O that needs to happen.
Every other way of flushing buffers is less efficient, and is mostly a
sign of memory pressure (shared buffers not large enough for active part
of the data).
The other aspect is where the writes happen: the checkpointer does that
in the background, not as part of regular query execution. What we
don't want is for the user backends to flush buffers, because it's
expensive and can result in much higher latency.
The bgwriter is somewhere in between - it happens in the background,
but may not be as efficient as doing it in the checkpointer. Still much
better than having to do this in regular backends.
regards
--
Tomas Vondra
Hi,

On 2024-10-02 18:36:44 +0200, Tomas Vondra wrote:
> On 10/2/24 17:02, Tony Wayne wrote:
> >
> >
> > On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe <laurenz.albe@cybertec.at
> > <mailto:laurenz.albe@cybertec.at>> wrote:
> >
> > On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
> > > Whenever I check the checkpoint information in a log, most dirty
> > pages are written by the checkpoint process
> >
> > That's exactly how it should be!
> >
> > is it because if bgwriter frequently flushes, the disk io will be more?🤔
>
> Yes, pretty much. But it's also about where the writes happen.
>
> Checkpoint flushes dirty buffers only once per checkpoint interval,
> which is the lowest amount of write I/O that needs to happen.
>
> Every other way of flushing buffers is less efficient, and is mostly a
> sign of memory pressure (shared buffers not large enough for active part
> of the data).

It's implied, but to make it more explicit: One big efficiency advantage of
writes by checkpointer is that they are sorted and can often be combined into
larger writes. That's often a lot more efficient: For network attached storage
it saves you iops, for local SSDs it's much friendlier to wear leveling.

> The other aspect is where the writes happen: the checkpointer does that
> in the background, not as part of regular query execution. What we
> don't want is for the user backends to flush buffers, because it's
> expensive and can result in much higher latency.
>
> The bgwriter is somewhere in between - it happens in the background,
> but may not be as efficient as doing it in the checkpointer. Still much
> better than having to do this in regular backends.

Another aspect is that checkpointer's writes are much easier to pace over time
than e.g. bgwriters, because bgwriter is triggered by a fairly short term
signal. Eventually we'll want to combine writes by bgwriter too, but that's
always going to be more expensive than doing it in a large batched fashion
like checkpointer does.

I think we could improve checkpointer's pacing further, fwiw, by taking into
account that the WAL volume at the start of a spread-out checkpoint typically
is bigger than at the end.

Greetings,

Andres Freund
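For context, the settings that already control how writes from each source are spread out and flushed can be listed like this (just a lookup of existing GUCs, not part of the pacing change sketched above):

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_completion_target', 'checkpoint_flush_after',
               'bgwriter_flush_after', 'backend_flush_after');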
Hi Andres
> It's implied, but to make it more explicit: One big efficiency advantage of
> writes by checkpointer is that they are sorted and can often be combined into
> larger writes. That's often a lot more efficient: For network attached storage
> it saves you iops, for local SSDs it's much friendlier to wear leveling.
Thank you for the explanation. I think the bgwriter can also merge I/O: it writes asynchronously to the file system cache, and the OS schedules the actual writes.
> Another aspect is that checkpointer's writes are much easier to pace over time
> than e.g. bgwriters, because bgwriter is triggered by a fairly short term
> signal. Eventually we'll want to combine writes by bgwriter too, but that's
> always going to be more expensive than doing it in a large batched fashion
> like checkpointer does.
> I think we could improve checkpointer's pacing further, fwiw, by taking into
> account that the WAL volume at the start of a spread-out checkpoint typically
> is bigger than at the end.
I'm also very keen to improve checkpoints. Whenever I run a stress test, the bgwriter does not write dirty pages when the data set is smaller than shared_buffers. Before the checkpoint, the benchmark TPS is stable and at its highest for the whole run. Other databases flush dirty pages at a certain frequency, at intervals, and at dirty-page watermarks, and they see a much smaller impact on performance when checkpoints occur.
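For what it's worth, whether the bgwriter wrote anything at all during such a run can be confirmed from pg_stat_bgwriter (these columns exist in current releases):

SELECT buffers_clean,       -- pages written by the bgwriter
       maxwritten_clean,    -- cleaning scans stopped at bgwriter_lru_maxpages
       buffers_alloc,       -- buffer allocations, i.e. read-side demand
       stats_reset
FROM pg_stat_bgwriter;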
Thanks
On Fri, Oct 4, 2024 at 03:40, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2024-10-02 18:36:44 +0200, Tomas Vondra wrote:
> On 10/2/24 17:02, Tony Wayne wrote:
> >
> >
> > On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe <laurenz.albe@cybertec.at
> > <mailto:laurenz.albe@cybertec.at>> wrote:
> >
> > On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
> > > Whenever I check the checkpoint information in a log, most dirty
> > pages are written by the checkpoint process
> >
> > That's exactly how it should be!
> >
> > is it because if bgwriter frequently flushes, the disk io will be more?🤔
>
> Yes, pretty much. But it's also about where the writes happen.
>
> Checkpoint flushes dirty buffers only once per checkpoint interval,
> which is the lowest amount of write I/O that needs to happen.
>
> Every other way of flushing buffers is less efficient, and is mostly a
> sign of memory pressure (shared buffers not large enough for active part
> of the data).
It's implied, but to make it more explicit: One big efficiency advantage of
writes by checkpointer is that they are sorted and can often be combined into
larger writes. That's often a lot more efficient: For network attached storage
it saves you iops, for local SSDs it's much friendlier to wear leveling.
> The other aspect is where the writes happen: the checkpointer does that
> in the background, not as part of regular query execution. What we
> don't want is for the user backends to flush buffers, because it's
> expensive and can result in much higher latency.
>
> The bgwriter is somewhere in between - it happens in the background,
> but may not be as efficient as doing it in the checkpointer. Still much
> better than having to do this in regular backends.
Another aspect is that checkpointer's writes are much easier to pace over time
than e.g. bgwriters, because bgwriter is triggered by a fairly short term
signal. Eventually we'll want to combine writes by bgwriter too, but that's
always going to be more expensive than doing it in a large batched fashion
like checkpointer does.
I think we could improve checkpointer's pacing further, fwiw, by taking into
account that the WAL volume at the start of a spread-out checkpoint typically
is bigger than at the end.
Greetings,
Andres Freund
Hi,

On 2024-10-04 09:31:45 +0800, wenhui qiu wrote:
> > It's implied, but to make it more explicit: One big efficiency advantage of
> > writes by checkpointer is that they are sorted and can often be combined into
> > larger writes. That's often a lot more efficient: For network attached storage
> > it saves you iops, for local SSDs it's much friendlier to wear leveling.
>
> Thank you for the explanation. I think the bgwriter can also merge I/O: it
> writes asynchronously to the file system cache, and the OS schedules the
> actual writes.

Because bgwriter writes are just ordered by their buffer id (further made less
sequential due to only writing out not-recently-used buffers), they are often
effectively random. The OS can't do much about that.

> > Another aspect is that checkpointer's writes are much easier to pace over time
> > than e.g. bgwriters, because bgwriter is triggered by a fairly short term
> > signal. Eventually we'll want to combine writes by bgwriter too, but that's
> > always going to be more expensive than doing it in a large batched fashion
> > like checkpointer does.
>
> > I think we could improve checkpointer's pacing further, fwiw, by taking into
> > account that the WAL volume at the start of a spread-out checkpoint typically
> > is bigger than at the end.
>
> I'm also very keen to improve checkpoints. Whenever I run a stress test, the
> bgwriter does not write dirty pages when the data set is smaller than
> shared_buffers.

It *SHOULD NOT* do anything in that situation. There's absolutely nothing to
be gained by bgwriter writing in that case.

> Before the checkpoint, the benchmark TPS is stable and at its highest for
> the whole run. Other databases flush dirty pages at a certain frequency, at
> intervals, and at dirty-page watermarks, and they see a much smaller impact
> on performance when checkpoints occur.

I doubt that slowdown is caused by bgwriter not being active enough. I suspect
what you're seeing is one or more of:

a) The overhead of doing full page writes (due to increasing the WAL
   volume). You could verify whether that's the case by turning
   full_page_writes off (but note that that's not generally safe!) or see if
   the overhead shrinks if you set wal_compression=zstd or wal_compression=lz4
   (don't use pglz, it's too slow).

b) The overhead of renaming WAL segments during recycling. You could see if
   this is related by specifying --wal-segsize 512 or such during initdb.

Greetings,

Andres
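As a concrete illustration of the two suggestions above (wal_compression can be changed at runtime on a server built with lz4/zstd support, PostgreSQL 15 or later; the WAL segment size can only be chosen when the cluster is created):

ALTER SYSTEM SET wal_compression = 'lz4';   -- or 'zstd'
SELECT pg_reload_conf();
-- the segment size has to be set at initdb time, e.g.: initdb --wal-segsize 512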
Hi Andres Freund
Thank you for the explanation.
> I doubt that slowdown is caused by bgwriter not being active enough. I suspect
> what you're seeing is one or more of:
> a) The overhead of doing full page writes (due to increasing the WAL
> volume). You could verify whether that's the case by turning
> full_page_writes off (but note that that's not generally safe!) or see if
> the overhead shrinks if you set wal_compression=zstd or wal_compression=lz4
> (don't use pglz, it's too slow).
> b) The overhead of renaming WAL segments during recycling. You could see if
> this is related by specifying --wal-segsize 512 or such during initdb.
I am aware of these optimizations, but they only mitigate the impact. I deliberately did not turn on WAL compression during the stress test. These were my settings:
shared_buffers = '32GB'
bgwriter_delay = '10ms'
bgwriter_lru_maxpages = '8192'
bgwriter_lru_multiplier = '10.0'
wal_buffers = '64MB'
checkpoint_completion_target = '0.999'
checkpoint_timeout = '600'
max_wal_size = '32GB'
min_wal_size = '16GB'
I think in business scenarios with many reads and few writes it is indeed desirable to keep as many dirty pages in memory as possible. However, in scenarios such as push systems and task scheduling systems, which have heavy reads and writes, the impact of checkpoints is more obvious. An adaptive bgwriter, or a bgwriter triggered when dirty pages reach a certain watermark, would reduce the performance jitter caused by checkpoints. From what I understand, quite a few commercial databases based on PostgreSQL have added an adaptive dirty-page flushing feature, and according to their internal reports the whole stress test runs very smoothly. Since it's a trade secret, I don't know how they implemented this feature.
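For anyone who wants to watch that dirty-page water level by hand, a rough sample can be taken with the pg_buffercache extension, assuming it is installed:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;
SELECT count(*) FILTER (WHERE isdirty) AS dirty_buffers,
       count(*)                        AS total_buffers,
       round(100.0 * count(*) FILTER (WHERE isdirty) / count(*), 1) AS dirty_pct
FROM pg_buffercache;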