Thread: incremental-checkopints
Hi Hackers: I found this page: https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints

PostgreSQL has implemented no incremental checkpoints so far. When a checkpoint is triggered, the performance jitter of PostgreSQL is very noticeable. I think incremental checkpoints should be implemented as soon as possible.
Best wishes
Thomas wen
On 7/26/23 09:21, Thomas wen wrote:
> Hi Hackers: I found this page:
> https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints
> PostgreSQL has implemented no incremental checkpoints so far. When a
> checkpoint is triggered, the performance jitter of PostgreSQL is very
> noticeable. I think incremental checkpoints should be implemented as
> soon as possible

Well, that thread is 12 years old, and no one followed up on that proposal. So it seems people have different priorities, working on other stuff that they consider more valuable ...

You can either work on this yourself and write a patch, or try to convince others it's worth working on. But you didn't provide any information that would demonstrate the jitter, or that incremental checkpoints would improve it.

For the record, the thread in our archives is:
https://www.postgresql.org/message-id/8a867f1ffea72091bf3cd6a49ba68a97.squirrel%40mail.go-link.net

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello

On 2023-Jul-26, Thomas wen wrote:
> Hi Hackers: I found this page:
> https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints
> no incremental checkpoints have been implemented so far. When a
> checkpoint is triggered, the performance jitter of PostgreSQL is very
> noticeable. I think incremental checkpoints should be implemented as
> soon as possible

I think my first question is why do you think that is necessary; there are probably other tools to achieve better performance. For example, you may want to try making checkpoint_completion_target closer to 1, and the checkpoint interval longer (both checkpoint_timeout and max_wal_size). Also, changing shared_buffers may improve things. You can try adding more RAM to the machine.

Tuning the overall performance of a Postgres server is still black magic to some extent, but there are a few well-known things to play with, without having to write any patches.

--
Álvaro Herrera        PostgreSQL Developer — https://www.EnterpriseDB.com/
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude."                              (Brian Kernighan)
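[The GUCs mentioned above can be illustrated as a postgresql.conf fragment. The values are purely illustrative, not recommendations; suitable settings depend on workload, hardware, and recovery-time requirements.]

```
# Spread checkpoint writes over most of the interval between checkpoints
checkpoint_completion_target = 0.9   # default since v14; values closer to 1 spread I/O further

# Checkpoint less often, so FPI bursts happen less frequently
checkpoint_timeout = 30min           # default is 5min
max_wal_size = 16GB                  # default is 1GB; size it to the write rate

# A larger buffer pool means fewer dirty-page evictions between checkpoints
shared_buffers = 8GB                 # a common starting point is ~25% of RAM
```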
On Wed, 26 Jul 2023 at 14:41, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I think my first question is why do you think that is necessary; there
> are probably other tools to achieve better performance. For example,
> you may want to try making checkpoint_completion_target closer to 1, and
> the checkpoint interval longer (both checkpoint_timeout and
> max_wal_size). Also, changing shared_buffers may improve things. You
> can try adding more RAM to the machine.

Even with all those tuning options, a significant portion of a checkpoint's IO (up to 50%) originates from FPIs in the WAL, which (in general) will most often appear at the start of each checkpoint, because the first update to each page after a checkpoint needs an FPI.

If instead we WAL-logged only the pages we are about to write to disk (like MySQL's double-write buffer, but in WAL instead of a separate cyclical buffer file), then a checkpoint_completion_target close to 1 would probably solve the issue; but with "WAL-logged torn page protection at first update after checkpoint" we'll probably always have higher-than-average FPI load just after a new checkpoint.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)
On 7/26/23 15:16, Matthias van de Meent wrote:
> Even with all those tuning options, a significant portion of a
> checkpoint's IO (up to 50%) originates from FPIs in the WAL, which (in
> general) will most often appear at the start of each checkpoint,
> because the first update to each page after a checkpoint needs an FPI.

Yeah, FPIs are certainly expensive and can represent a huge part of the WAL produced. But how would incremental checkpoints make that step unnecessary?

> If instead we WAL-logged only the pages we are about to write to disk
> (like MySQL's double-write buffer, but in WAL instead of a separate
> cyclical buffer file), then a checkpoint_completion_target close to 1
> would probably solve the issue, but with "WAL-logged torn page
> protection at first update after checkpoint" we'll probably always
> have higher-than-average FPI load just after a new checkpoint.

So essentially, instead of WAL-logging the FPI on the first change, we'd only do that later, when actually writing out the page (either during a checkpoint or because of memory pressure)? How would you make sure there's enough WAL space until the next checkpoint? I mean, FPIs are a huge source of write amplification ...

Imagine the system has max_wal_size set to 1GB, and does 1M updates before writing 512MB of WAL and thus triggering a checkpoint. Now it needs to write FPIs for 1M updates - easily 8GB of WAL, maybe more with indexes. What then?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
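[The back-of-envelope arithmetic above can be checked directly, assuming the default 8 kB block size and, pessimistically, one deferred FPI per updated heap page:]

```python
# Back-of-envelope check of the write amplification described above.
# Assumes PostgreSQL's default 8 kB block size and the worst case of
# one full-page image per update (every update hits a distinct page);
# index pages would add roughly one more FPI each on top of this.

PAGE_SIZE = 8 * 1024          # bytes, the default block size
updates = 1_000_000

heap_fpi_bytes = updates * PAGE_SIZE
print(heap_fpi_bytes)             # 8,192,000,000 bytes of heap FPIs
print(heap_fpi_bytes / 2**30)     # ~7.6 GiB, i.e. "easily 8GB of WAL"
```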
Starting from incremental checkpoints is approaching the problem from the wrong end. What you actually want is atomic disk writes, which would allow turning off full_page_writes. Without this you really cannot do incremental checkpoints efficiently, as checkpoints are currently what is used to determine when "the first write to a page after checkpoint" happens, and thereby when a full page write is needed.

On Wed, Jul 26, 2023 at 8:58 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> Yeah, FPIs are certainly expensive and can represent a huge part of the
> WAL produced. But how would incremental checkpoints make that step
> unnecessary?
>
> Imagine the system has max_wal_size set to 1GB, and does 1M updates
> before writing 512MB of WAL and thus triggering a checkpoint. Now it
> needs to write FPIs for 1M updates - easily 8GB of WAL, maybe more with
> indexes. What then?
On Wed, 26 Jul 2023 at 20:58, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> So essentially instead of WAL-logging the FPI on the first change, we'd
> only do that later when actually writing out the page (either during a
> checkpoint or because of memory pressure)? How would you make sure
> there's enough WAL space until the next checkpoint? I mean, FPIs are a
> huge source of write amplification ...

You don't make sure that there's enough space for the modifications, but does it matter from a durability point of view? As long as the page isn't written to disk before the FPI, we can replay non-FPI (but fsynced) WAL on top of the old version of the page that we read from disk, instead of only trusting FPIs from WAL.

> Imagine the system has max_wal_size set to 1GB, and does 1M updates
> before writing 512MB of WAL and thus triggering a checkpoint. Now it
> needs to write FPIs for 1M updates - easily 8GB of WAL, maybe more with
> indexes. What then?

Then you ignore the max_wal_size GUC, as PostgreSQL so often already does. At least, it doesn't do what I expect it to do at face value - limit the size of the WAL directory to the given size.

But more reasonably, you'd keep track of the count of modified pages that are yet to be fully WAL-logged, and take that into account as a debt against the current WAL insert pointer when considering checkpoint distances and max_wal_size.

---

The main issue that I see with "WAL-logging the FPI only when you write the dirty page to disk" is that dirty page flushing also happens with buffer eviction in ReadBuffer(). This change in behaviour would add a WAL-insertion penalty to that write, and make it a very common occurrence that we'd have to write WAL + fsync the WAL before we can write the dirty page. It would thus add significant latency to the dirty-write mechanism, which is probably an unpopular change.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
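[The debt-accounting idea above could be sketched roughly as follows. This is a toy model, not PostgreSQL code; all names and the trigger threshold are made up for illustration.]

```python
# Toy model of the "FPI debt" idea: when deciding whether the next
# checkpoint is due, count not only the WAL already inserted since the
# last checkpoint, but also the full-page images still owed for pages
# that are dirty but not yet fully WAL-logged.

PAGE_SIZE = 8 * 1024  # bytes, PostgreSQL's default block size

def effective_wal_distance(wal_inserted: int, unlogged_dirty_pages: int) -> int:
    """WAL consumed since the last checkpoint plus the FPI debt that
    would still have to be paid before that checkpoint can complete."""
    fpi_debt = unlogged_dirty_pages * PAGE_SIZE
    return wal_inserted + fpi_debt

def checkpoint_due(wal_inserted: int, unlogged_dirty_pages: int,
                   max_wal_size: int) -> bool:
    # Trigger early enough that paying off the debt still fits in the
    # budget (the half-way trigger point here is purely illustrative).
    distance = effective_wal_distance(wal_inserted, unlogged_dirty_pages)
    return distance >= max_wal_size // 2

# 512 MB of ordinary WAL plus 100k pages of deferred FPIs pushes the
# effective distance past half of a 2 GiB max_wal_size:
print(checkpoint_due(512 * 2**20, 100_000, 2 * 2**30))   # True
print(checkpoint_due(512 * 2**20, 0, 2 * 2**30))         # False
```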
On Wed, Jul 26, 2023 at 9:54 PM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> Then you ignore the max_wal_size GUC, as PostgreSQL so often already
> does. At least, it doesn't do what I expect it to do at face value -
> limit the size of the WAL directory to the given size.

That would require stopping any new writes once WAL size == max_wal_size until the checkpoint is completed. I don't think anybody would want that.

> But more reasonably, you'd keep track of the count of modified pages
> that are yet to be fully WAL-logged, and take that into account as a
> debt against the current WAL insert pointer when considering
> checkpoint distances and max_wal_size.

I think Peter Geoghegan has worked on a somewhat similar approach to accounting for "accumulated work needed until some desired outcome", though I think it was on the VACUUM side of things.
On 7/26/23 21:53, Matthias van de Meent wrote:
> You don't make sure that there's enough space for the modifications,
> but does it matter from a durability point of view? As long as the
> page isn't written to disk before the FPI, we can replay non-FPI (but
> fsynced) WAL on top of the old version of the page that we read from
> disk, instead of only trusting FPIs from WAL.

It does not matter from a durability point of view, I think. But I was thinking more about how this affects the scheduling of checkpoints - how would you know when the next checkpoint is likely to happen, when you don't know how many FPIs you're going to write?

> Then you ignore the max_wal_size GUC, as PostgreSQL so often already
> does. At least, it doesn't do what I expect it to do at face value -
> limit the size of the WAL directory to the given size.

I agree the soft-limit nature of max_wal_size (i.e. best effort, not a strict limit) is not great. But just ignoring the limit altogether seems like a step in the wrong direction - we should try not to exceed it.

I wonder if we'd actually need / want to write the FPIs into WAL at all. AFAICS we only need the FPI until the page is written and flushed - from that moment on it shouldn't be possible to tear the page. So a small cyclic buffer separate from WAL would be better ...

> But more reasonably, you'd keep track of the count of modified pages
> that are yet to be fully WAL-logged, and take that into account as a
> debt against the current WAL insert pointer when considering
> checkpoint distances and max_wal_size.

Yeah, that might work. It'd likely be just estimates, but probably good enough for pacing the writes.

> The main issue that I see with "WAL-logging the FPI only when you
> write the dirty page to disk" is that dirty page flushing also happens
> with buffer eviction in ReadBuffer(). This change in behaviour would
> add a WAL-insertion penalty to that write, and make it a very common
> occurrence that we'd have to write WAL + fsync the WAL before we can
> write the dirty page. It would thus add significant latency to the
> dirty-write mechanism, which is probably an unpopular change.

Yeah, it certainly moves the latencies from one place to another.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
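[The small cyclic buffer floated above would behave roughly like MySQL's double-write buffer: a torn-page-safe write first stashes a full copy of the page in the ring, flushes it, and only then overwrites the page in place. A toy sketch, purely illustrative and not a proposal for an actual on-disk format:]

```python
# Toy sketch of a cyclic double-write buffer: before overwriting a page
# in place, stash a full copy in a small ring and flush the ring. After
# a crash, a page torn mid-write can be restored from its ring copy, so
# no FPI ever needs to enter the WAL stream.

from collections import deque

class DoubleWriteRing:
    def __init__(self, slots: int):
        self.ring = deque(maxlen=slots)   # oldest entries fall off the end

    def checkpoint_write(self, page_no: int, image: bytes, datafile: dict):
        self.ring.append((page_no, image))  # 1. copy the page into the ring
        # 2. (a real system would fsync the ring here, before step 3)
        datafile[page_no] = image           # 3. overwrite the page in place

    def recover(self, page_no: int):
        # After a crash, repair a torn page from its most recent ring copy.
        for no, image in reversed(self.ring):
            if no == page_no:
                return image
        return None

datafile = {}
ring = DoubleWriteRing(slots=4)
ring.checkpoint_write(7, b"v1" * 4096, datafile)
print(ring.recover(7) == b"v1" * 4096)   # the ring copy survives for repair
```

The entry is only needed until the in-place write is durably on disk, which is why a small, fixed-size ring suffices and old entries can simply be recycled.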