Hi,
On 2024-10-04 09:31:45 +0800, wenhui qiu wrote:
> > It's implied, but to make it more explicit: One big efficiency advantage
> > of writes by checkpointer is that they are sorted and can often be
> > combined into larger writes. That's often a lot more efficient: For
> > network attached storage it saves you iops, for local SSDs it's much
> > friendlier to wear leveling.
>
> thank you for the explanation. I think bgwriter can also merge I/O: it
> writes asynchronously to the file system cache, and the OS schedules the
> actual writes.
Because bgwriter writes are just ordered by their buffer id (further made less
sequential due to only writing out not-recently-used buffers), they are often
effectively random. The OS can't do much about that.
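As a rough illustration (a sketch, assuming the pg_buffercache extension is
installed): checkpointer sorts dirty buffers by approximately (tablespace,
relation, fork, block) before writing, whereas bgwriter encounters them in
bufferid order.

    -- dirty buffers in (roughly) the order checkpointer would write them;
    -- compare with ORDER BY bufferid, the order bgwriter scans them
    SELECT bufferid, relfilenode, relblocknumber
    FROM pg_buffercache
    WHERE isdirty
    ORDER BY reltablespace, relfilenode, relforknumber, relblocknumber;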
> > Another aspect is that checkpointer's writes are much easier to pace over
> > time than e.g. bgwriter's, because bgwriter is triggered by a fairly
> > short term signal. Eventually we'll want to combine writes by bgwriter
> > too, but that's always going to be more expensive than doing it in a
> > large batched fashion like checkpointer does.
>
> > I think we could improve checkpointer's pacing further, fwiw, by taking
> > into account that the WAL volume at the start of a spread-out checkpoint
> > typically is bigger than at the end.
>
> I'm also very keen to improve checkpoints. Whenever I run a stress test,
> bgwriter does not write dirty pages when the data set is smaller than
> shared_buffers,
It *SHOULD NOT* do anything in that situation. There's absolutely nothing to
be gained by bgwriter writing in that case.
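You can verify that from the cumulative stats while the benchmark runs (a
sketch; note that on v17+ the checkpoint-related counters moved from
pg_stat_bgwriter to pg_stat_checkpointer):

    -- buffers_clean is the number of buffers written by bgwriter since the
    -- last stats reset; it should stay at ~0 while the data set fits in
    -- shared_buffers
    SELECT buffers_clean, maxwritten_clean, buffers_alloc
    FROM pg_stat_bgwriter;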
> Before the checkpoint, the stress test tps was stable and the highest of
> the entire run. Other databases flush dirty pages at a certain frequency,
> at intervals, and at dirty-page watermarks; they see a much smaller
> performance impact when checkpoints occur.
I doubt that slowdown is caused by bgwriter not being active enough. I suspect
what you're seeing is one or more of:
a) The overhead of doing full page writes (due to increasing the WAL
   volume). You could verify whether that's the case by turning
   full_page_writes off (but note that that's not generally safe!), or by
   checking whether the overhead shrinks with wal_compression=zstd or
   wal_compression=lz4 (don't use pglz, it's too slow); see the example
   below.
b) The overhead of renaming WAL segments during recycling. You could see if
   this is related by specifying --wal-segsize 512 or such during initdb;
   see the example below.
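For example (a sketch; wal_compression=zstd requires a server built with
zstd support, and the initdb path below is just a placeholder; only try
full_page_writes=off on a throwaway test system):

    -- (a) compress full page images in WAL; a config reload is enough
    ALTER SYSTEM SET wal_compression = 'zstd';  -- or 'lz4'
    SELECT pg_reload_conf();

    -- (b) create a scratch cluster with 512MB WAL segments (default is
    -- 16MB), run from a shell:
    --   initdb --wal-segsize=512 -D /path/to/test-cluster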
Greetings,
Andres