Re: bgwrite process is too lazy - Mailing list pgsql-hackers

From Andres Freund
Subject Re: bgwrite process is too lazy
Msg-id cixso3buqeddrsqh3cf4svus3dakho2jwvohstwz64aqttg647@pqd4kwtdcso7
In response to Re: bgwrite process is too lazy  (wenhui qiu <qiuwenhuifx@gmail.com>)
Responses Re: bgwrite process is too lazy
List pgsql-hackers
Hi,

On 2024-10-04 09:31:45 +0800, wenhui qiu wrote:
> > It's implied, but to make it more explicit: One big efficiency advantage of
> > writes by checkpointer is that they are sorted and can often be combined into
> > larger writes. That's often a lot more efficient: For network attached storage
> > it saves you iops, for local SSDs it's much friendlier to wear leveling.
>
> Thank you for the explanation. I think bgwriter can also merge I/O: it writes
> asynchronously to the file system cache, and the OS schedules the writeback.

Because bgwriter writes are just ordered by their buffer id (further made less
sequential due to only writing out not-recently-used buffers), they are often
effectively random. The OS can't do much about that.
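
As a toy illustration of why ordering matters (this is not PostgreSQL code: the
block numbers, the function name, and the use of a shuffle as a stand-in for
buffer-id order are all assumptions made for the example), one can count how
many separate writes are needed when only directly adjacent blocks can be
combined:

```python
import random


def count_write_runs(blocks):
    """Count contiguous runs in the order the blocks are issued.

    A run continues only while each block is exactly one past the previous
    one, which is roughly when neighbouring writes could be merged into a
    single larger I/O.
    """
    runs = 0
    prev = None
    for block in blocks:
        if prev is None or block != prev + 1:
            runs += 1
        prev = block
    return runs


# Hypothetical dirty blocks of one relation file: four contiguous ranges.
dirty = (
    list(range(0, 8))
    + list(range(100, 104))
    + list(range(200, 216))
    + list(range(300, 304))
)

# Stand-in for buffer-id order, which has no relation to on-disk position.
rng = random.Random(0)
shuffled = dirty[:]
rng.shuffle(shuffled)

sorted_runs = count_write_runs(sorted(dirty))   # checkpointer-style: sorted first
unsorted_runs = count_write_runs(shuffled)      # bgwriter-style: arbitrary order

print("sorted:", sorted_runs, "unsorted:", unsorted_runs)
```

Issued in sorted order, the 32 blocks collapse into 4 runs. In an arbitrary
order the count can never be lower than that and is usually far higher, and
the OS only has a limited window in which to reorder what it was handed.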



> > Another aspect is that checkpointer's writes are much easier to pace over time
> > than e.g. bgwriter's, because bgwriter is triggered by a fairly short term
> > signal.  Eventually we'll want to combine writes by bgwriter too, but that's
> > always going to be more expensive than doing it in a large batched fashion
> > like checkpointer does.
>
> > I think we could improve checkpointer's pacing further, fwiw, by taking into
> > account that the WAL volume at the start of a spread-out checkpoint typically
> > is bigger than at the end.
>
> I'm also very keen to improve checkpoints. Whenever I run a stress test,
> bgwriter does not write dirty pages when the data set is smaller than
> shared_buffers.

It *SHOULD NOT* do anything in that situation. There's absolutely nothing to
be gained by bgwriter writing in that case.


> Before the checkpoint, the stress-test TPS was stable and at its highest for
> the entire test. Other databases flush dirty pages at a certain frequency, at
> intervals, and at dirty-page watermarks. They see a much smaller performance
> impact when checkpoints occur.

I doubt that slowdown is caused by bgwriter not being active enough. I suspect
what you're seeing is one or more of:

a) The overhead of doing full page writes (due to increasing the WAL
   volume). You could verify whether that's the case by turning
   full_page_writes off (but note that that's not generally safe!) or see if
   the overhead shrinks if you set wal_compression=zstd or wal_compression=lz4
   (don't use pglz, it's too slow).

b) The overhead of renaming WAL segments during recycling. You could see if
   this is related by specifying --wal-segsize 512 or such during initdb.

Greetings,

Andres


