Thread: Flush SLRU counters in checkpointer process

Flush SLRU counters in checkpointer process

From
Anthonin Bonnefoy
Date:
Hello hackers, 

Currently, the Checkpointer process only reports SLRU statistics at server shutdown, leading to delayed statistics for SLRU flushes. This patch adds a flush of SLRU stats to the end of checkpoints. 

Best regards,
Anthonin
Attachment

Re: Flush SLRU counters in checkpointer process

From
Aleksander Alekseev
Date:
Hi Anthonin,

> This patch adds a flush of SLRU stats to the end of checkpoints.

The patch looks good to me and passes the tests but let's see if
anyone else has any feedback.

Also I added a CF entry: https://commitfest.postgresql.org/42/4120/

-- 
Best regards,
Aleksander Alekseev



Re: Flush SLRU counters in checkpointer process

From
Andres Freund
Date:
Hi,

On 2023-01-11 10:29:06 +0100, Anthonin Bonnefoy wrote:
> Currently, the Checkpointer process only reports SLRU statistics at server
> shutdown, leading to delayed statistics for SLRU flushes. This patch adds a
> flush of SLRU stats to the end of checkpoints.

Hm. I wonder if we should do this even earlier, by the
pgstat_report_checkpointer() calls in CheckpointWriteDelay().

I'm inclined to move the pgstat_report_wal() and pgstat_report_slru() calls
into pgstat_report_checkpointer() to avoid needing to care about all the
individual places.


> @@ -505,6 +505,7 @@ CheckpointerMain(void)
>          /* Report pending statistics to the cumulative stats system */
>          pgstat_report_checkpointer();
>          pgstat_report_wal(true);
> +        pgstat_report_slru(true);

Why do we need a force parameter if all callers use it?

Greetings,

Andres Freund



Re: Flush SLRU counters in checkpointer process

From
Anthonin Bonnefoy
Date:
On Wed, Jan 11, 2023 at 5:33 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2023-01-11 10:29:06 +0100, Anthonin Bonnefoy wrote:
> Currently, the Checkpointer process only reports SLRU statistics at server
> shutdown, leading to delayed statistics for SLRU flushes. This patch adds a
> flush of SLRU stats to the end of checkpoints.

Hm. I wonder if we should do this even earlier, by the
pgstat_report_checkpointer() calls in CheckpointWriteDelay().

I'm inclined to move the pgstat_report_wal() and pgstat_report_slru() calls
into pgstat_report_checkpointer() to avoid needing to care about all the
individual places.
That would make sense. I've created a new patch with everything moved in pgstat_report_checkpointer().
I did split the checkpointer flush in a pgstat_flush_checkpointer() function as it seemed more readable. Thought? 
 
> @@ -505,6 +505,7 @@ CheckpointerMain(void)
>               /* Report pending statistics to the cumulative stats system */
>               pgstat_report_checkpointer();
>               pgstat_report_wal(true);
> +             pgstat_report_slru(true);

Why do we need a force parameter if all callers use it?
Good point. I've written the same signature as pgstat_report_wal but there's no need for the nowait parameter. 
 
Attachment

Re: Flush SLRU counters in checkpointer process

From
"Gregory Stark (as CFM)"
Date:
On Thu, 12 Jan 2023 at 03:46, Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
>
>
> That would make sense. I've created a new patch with everything moved in pgstat_report_checkpointer().
> I did split the checkpointer flush in a pgstat_flush_checkpointer() function as it seemed more readable. Thought?

This patch appears to need a rebase. Is there really any feedback
needed or is it ready for committer once it's rebased?



-- 
Gregory Stark
As Commitfest Manager



Re: Flush SLRU counters in checkpointer process

From
Anthonin Bonnefoy
Date:
Here's the patch rebased with Andres' suggestions. 
Happy to update it if there's any additionalj change required.


On Wed, Mar 1, 2023 at 8:46 PM Gregory Stark (as CFM) <stark.cfm@gmail.com> wrote:
On Thu, 12 Jan 2023 at 03:46, Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:
>
>
> That would make sense. I've created a new patch with everything moved in pgstat_report_checkpointer().
> I did split the checkpointer flush in a pgstat_flush_checkpointer() function as it seemed more readable. Thought?

This patch appears to need a rebase. Is there really any feedback
needed or is it ready for committer once it's rebased?



--
Gregory Stark
As Commitfest Manager
Attachment

Re: Flush SLRU counters in checkpointer process

From
Daniel Gustafsson
Date:
> On 3 Mar 2023, at 09:06, Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> wrote:
>
> Here's the patch rebased with Andres' suggestions.
> Happy to update it if there's any additionalj change required.

This patch crashes 031_recovery_conflict with a SIGInvalid on Windows, can you
please investigate and see what might be going on there?  The test passed about
4 days ago on Windows so unless it's the CI being flaky it should be due to a
recent change.

If you don't have access to a Windows environment you can run your own
instrumented builds in your Github account with the CI files in the postgres
repo.

--
Daniel Gustafsson




Re: Flush SLRU counters in checkpointer process

From
Anthonin Bonnefoy
Date:
I think I've managed to reproduce the issue. The test I've added to check slru flush was the one failing in the regression suite.

SELECT SUM(flushes) > :slru_flushes_before FROM pg_stat_slru;
 ?column?
----------
 t

The origin seems to be a race condition on have_slrustats (https://github.com/postgres/postgres/blob/c8e1ba736b2b9e8c98d37a5b77c4ed31baf94147/src/backend/utils/activity/pgstat_slru.c#L161-L162).
I will try to get a new patch with improved test stability. 


On Mon, Jul 3, 2023 at 3:18 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> On 3 Mar 2023, at 09:06, Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> wrote:
>
> Here's the patch rebased with Andres' suggestions.
> Happy to update it if there's any additionalj change required.

This patch crashes 031_recovery_conflict with a SIGInvalid on Windows, can you
please investigate and see what might be going on there?  The test passed about
4 days ago on Windows so unless it's the CI being flaky it should be due to a
recent change.

If you don't have access to a Windows environment you can run your own
instrumented builds in your Github account with the CI files in the postgres
repo.

--
Daniel Gustafsson

Re: Flush SLRU counters in checkpointer process

From
Aleksander Alekseev
Date:
Hi,

> I think I've managed to reproduce the issue. The test I've added to check slru flush was the one failing in the
regressionsuite.
 

A consensus was reached [1] to mark this patch as RwF for now. There
are many patches to be reviewed and this one doesn't seem to be in the
best shape, so we have to prioritise. Please feel free re-submitting
the patch for the next commitfest.

[1]: https://postgr.es/m/0737f444-59bb-ac1d-2753-873c40da0840%40eisentraut.org
-- 
Best regards,
Aleksander Alekseev