Re: bgwrite process is too lazy - Mailing list pgsql-hackers
From | wenhui qiu |
---|---|
Subject | Re: bgwrite process is too lazy |
Date | |
Msg-id | CAGjGUA+7sTOCROOuYq=tmS1=himmEPGKauXfhcrU6yoL0ZobHQ@mail.gmail.com |
In response to | Re: bgwrite process is too lazy (Andres Freund <andres@anarazel.de>) |
Responses |
Re: bgwrite process is too lazy
|
List | pgsql-hackers |
Hi Andres
> It's implied, but to make it more explicit: One big efficiency advantage of
> writes by checkpointer is that they are sorted and can often be combined into
> larger writes. That's often a lot more efficient: For network attached storage
> it saves you iops, for local SSDs it's much friendlier to wear leveling.
Thank you for the explanation. I think bgwriter can also merge I/O: it writes asynchronously to the file system cache, and the OS schedules the actual writes.
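To make the merging point concrete, here is a rough sketch of why sorted writes combine well. It is plain Python, not PostgreSQL code; the function name `coalesce_writes` and the block numbers are made up for illustration, and it assumes all blocks belong to one file.

```python
def coalesce_writes(dirty_blocks):
    """Group dirty block numbers into contiguous (start, length) runs.

    Hypothetical helper for illustration only: each run stands for one
    larger write instead of several single-block writes.
    """
    runs = []
    for block in sorted(set(dirty_blocks)):
        if runs and block == runs[-1][0] + runs[-1][1]:
            # Block directly follows the previous run, so extend that run.
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            # Gap in block numbers: start a new run.
            runs.append((block, 1))
    return runs


# Seven dirty blocks in arbitrary order, as a backend might dirty them.
dirty = [7, 3, 4, 9, 5, 8, 20]
print(coalesce_writes(dirty))
```

Written in arrival order these would be seven separate writes; after sorting they collapse into three. Whether the OS page cache achieves the same merging for bgwriter's writes depends on the kernel and the storage, so this only shows the idea.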
> Another aspect is that checkpointer's writes are much easier to pace over time
> than e.g. bgwriters, because bgwriter is triggered by a fairly short term
> signal. Eventually we'll want to combine writes by bgwriter too, but that's
> always going to be more expensive than doing it in a large batched fashion
> like checkpointer does.
> I think we could improve checkpointer's pacing further, fwiw, by taking into
> account that the WAL volume at the start of a spread-out checkpoint typically
> is bigger than at the end.
I'm also very keen to improve checkpoints. Whenever I run a stress test, bgwriter does not write dirty pages when the data set is smaller than shared_buffers. Before the checkpoint, TPS was stable and at its highest for the whole test run. Other databases flush dirty pages at a certain frequency, at intervals, and at dirty-page watermarks, so checkpoints have a much smaller impact on their performance.
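As a rough sketch of what I mean by dirty-page watermarks: the policy below is hypothetical and does not describe PostgreSQL or any specific database. The name `pages_to_flush`, the 10%/30% thresholds and the batch size are assumptions chosen only for illustration.

```python
def pages_to_flush(dirty_pages, total_pages, low_pct=10, high_pct=30, batch=100):
    """Return how many dirty pages a background flusher writes this round.

    Hypothetical watermark policy (all thresholds are assumptions):
    - at or below the low watermark: write nothing
    - between the watermarks: write a small fixed batch
    - at or above the high watermark: write down to the low watermark
    """
    if dirty_pages * 100 <= low_pct * total_pages:
        return 0
    if dirty_pages * 100 >= high_pct * total_pages:
        low_target = total_pages * low_pct // 100
        return dirty_pages - low_target
    return min(batch, dirty_pages)


# Buffer pool of 1000 pages at three different dirty levels.
for dirty in (50, 200, 400):
    print(dirty, "dirty ->", pages_to_flush(dirty, 1000), "pages to flush")
```

With a policy like this, dirty pages are written continuously in the background, so fewer are left for the checkpoint itself. The cost is that a page dirtied repeatedly may be written more than once per checkpoint interval.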
Thanks
On Fri, Oct 4, 2024 at 03:40, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2024-10-02 18:36:44 +0200, Tomas Vondra wrote:
> On 10/2/24 17:02, Tony Wayne wrote:
> >
> >
> > On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe <laurenz.albe@cybertec.at
> > <mailto:laurenz.albe@cybertec.at>> wrote:
> >
> > On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
> > > Whenever I check the checkpoint information in a log, most dirty
> > pages are written by the checkpoint process
> >
> > That's exactly how it should be!
> >
> > is it because if bgwriter frequently flushes, the disk io will be more?🤔
>
> Yes, pretty much. But it's also about where the writes happen.
>
> Checkpoint flushes dirty buffers only once per checkpoint interval,
> which is the lowest amount of write I/O that needs to happen.
>
> Every other way of flushing buffers is less efficient, and is mostly a
> sign of memory pressure (shared buffers not large enough for active part
> of the data).
It's implied, but to make it more explicit: One big efficiency advantage of
writes by checkpointer is that they are sorted and can often be combined into
larger writes. That's often a lot more efficient: For network attached storage
it saves you iops, for local SSDs it's much friendlier to wear leveling.
> But it's also about where the writes happen. Checkpoint does
> that in the background, not as part of regular query execution. What we
> don't want is for the user backends to flush buffers, because it's
> expensive and can result in much higher latency.
>
> The bgwriter is somewhere in between - it happens in the background,
> but may not be as efficient as doing it in the checkpointer. Still much
> better than having to do this in regular backends.
Another aspect is that checkpointer's writes are much easier to pace over time
than e.g. bgwriters, because bgwriter is triggered by a fairly short term
signal. Eventually we'll want to combine writes by bgwriter too, but that's
always going to be more expensive than doing it in a large batched fashion
like checkpointer does.
I think we could improve checkpointer's pacing further, fwiw, by taking into
account that the WAL volume at the start of a spread-out checkpoint typically
is bigger than at the end.
Greetings,
Andres Freund