Thread: should we enable log_checkpoints out of the box?
Hi,

It seems the checkpoint stats that are emitted to the server logs when the GUC log_checkpoints is enabled are so important that a postgres database provider would never want to disable the GUC. Right now, the GUC is disabled by default and a postgres database provider needs to enable it explicitly. How about we enable it out of the box?

One disadvantage is that it may fill up the server logs if checkpoints are very frequent. However, the checkpoint stats are accumulated even when the GUC is disabled [1]. Thoughts?

[1]
	/*
	 * Prepare to accumulate statistics.
	 *
	 * Note: because it is possible for log_checkpoints to change while a
	 * checkpoint proceeds, we always accumulate stats, even if
	 * log_checkpoints is currently off.
	 */
	MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));

Regards,
Bharath Rupireddy.
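For reference, explicitly enabling the GUC on an existing cluster is a small step; a minimal sketch (log_checkpoints is a reload-time setting, so no restart is needed):

    -- Minimal sketch: enabling checkpoint logging explicitly on an existing
    -- cluster; a configuration reload is enough.
    ALTER SYSTEM SET log_checkpoints = on;
    SELECT pg_reload_conf();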
Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> writes:
> How about we enable it out of the box?

No. The general policy at the moment is that a normally-functioning server should emit *no* log traffic by default (other than a few messages at startup and shutdown). log_checkpoints is a particularly poor candidate for an exception to that policy, because it would produce so much traffic. No DBA would be likely to consider it as anything but log spam.

> It seems the checkpoint stats that are emitted to the server logs when
> the GUC log_checkpoints is enabled are so important that a postgres
> database provider would never want to disable the GUC.

This statement seems ridiculous on its face. If users need to wait with bated breath for a checkpoint report, we have something else we need to fix.

			regards, tom lane
Hi,

On 2021-10-31 10:59:19 -0400, Tom Lane wrote:
> The general policy at the moment is that a normally-functioning server
> should emit *no* log traffic by default (other than a few messages
> at startup and shutdown). log_checkpoints is a particularly poor
> candidate for an exception to that policy, because it would produce so
> much traffic.

It shouldn't cause that much traffic - if the database is idle / read-only, we don't trigger checkpoints. And these days that appears to be reliable.

> No DBA would be likely to consider it as anything but log spam.

I don't agree at all. No postgres instance should be run without log_checkpoints enabled. Performance is poor if checkpoints are triggered by anything but time, and that can only be diagnosed if log_checkpoints is on.

> > It seems the checkpoint stats that are emitted to the server logs when
> > the GUC log_checkpoints is enabled are so important that a postgres
> > database provider would never want to disable the GUC.
>
> This statement seems ridiculous on its face.

It matches my experience. I think it's negligent not to turn it on on anything but an absolutely toy database. And there it has practically no cost.

The typical course is that the database has suddenly slowed down to a crawl because checkpoints happen on an accelerated schedule due to max_wal_size. Without log_checkpoints one cannot diagnose this, and even just turning it on going forward isn't quite sufficient, because that doesn't tell you whether checkpoints were always triggered by "xlog".

> If users need to wait with bated breath for a checkpoint report, we
> have something else we need to fix.

Well, we probably do. But it's not that easy to fundamentally improve the situation.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2021-10-31 10:59:19 -0400, Tom Lane wrote:
>> No DBA would be likely to consider it as anything but log spam.

> I don't agree at all. No postgres instance should be run without
> log_checkpoints enabled. Performance is poor if checkpoints are
> triggered by anything but time, and that can only be diagnosed if
> log_checkpoints is on.

This is complete nonsense. If we think that's a generic problem, we should be fixing the problem (ie, making the checkpointer smarter); not expecting that DBAs will monitor their logs for an undocumented issue. The number of installations where that would actually happen is epsilon compared to the number where it won't.

			regards, tom lane
Hi,

On 2021-10-31 15:43:57 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2021-10-31 10:59:19 -0400, Tom Lane wrote:
> >> No DBA would be likely to consider it as anything but log spam.
>
> > I don't agree at all. No postgres instance should be run without
> > log_checkpoints enabled. Performance is poor if checkpoints are
> > triggered by anything but time, and that can only be diagnosed if
> > log_checkpoints is on.
>
> This is complete nonsense.

Shrug. It's based on many years of doing or being around people doing postgres support escalation shifts. And it's not like log_checkpoints incurs meaningful overhead or causes that much log volume.

> If we think that's a generic problem, we should be fixing the problem
> (ie, making the checkpointer smarter);

We've made it less bad (checkpoint_segments -> max_wal_size, sorting IO for checkpoints, forcing the OS to flush writes earlier). But it's still a significant issue. It's not that easy to make it better.

The largest issues are the spikes in data write and WAL volumes. Of course, tuning max_wal_size up helps, but that has its own set of issues (we e.g. still PANIC on WAL ENOSPC). One motivation for working on AIO/DIO is precisely to make checkpoints less painful, FWIW. We are going to have to do something to reduce the impact of FPWs at some point, but it's hard.

Greetings,

Andres Freund
On 10/31/21 16:16, Andres Freund wrote:
> Hi,
>
> On 2021-10-31 15:43:57 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>> > On 2021-10-31 10:59:19 -0400, Tom Lane wrote:
>> >> No DBA would be likely to consider it as anything but log spam.
>>
>> > I don't agree at all. No postgres instance should be run without
>> > log_checkpoints enabled. Performance is poor if checkpoints are
>> > triggered by anything but time, and that can only be diagnosed if
>> > log_checkpoints is on.
>>
>> This is complete nonsense.
>
> Shrug. It's based on many years of doing or being around people doing
> postgres support escalation shifts. And it's not like log_checkpoints
> incurs meaningful overhead or causes that much log volume.

I agree with Andres 100%. Whenever called in to diagnose any type of problem, this is on the usual checklist, and very few customers have it turned on. The usefulness of this information very much outweighs the tiny amount of extra log created.

>> If we think that's a generic problem, we should be fixing the problem
>> (ie, making the checkpointer smarter);
>
> We've made it less bad (checkpoint_segments -> max_wal_size, sorting IO
> for checkpoints, forcing the OS to flush writes earlier). But it's still
> a significant issue. It's not that easy to make it better.

And we kept the default for max_wal_size at 1GB. While it is a "soft" limit, it is the main reason why instances are running full bore with a huge percentage of full page writes, because it is way too small for their throughput, and nothing in the logs warns them about it.

I can run a certain TPC-C workload on an 8-core machine quite comfortably when max_wal_size is configured at 100GB. The exact same TPC-C configuration will spiral the machine down if left with the default max_wal_size, and there is zero hint in the logs as to why.

-- 
Jan Wieck
On 10/31/21 21:16, Andres Freund wrote:
> Hi,
>
> On 2021-10-31 15:43:57 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> On 2021-10-31 10:59:19 -0400, Tom Lane wrote:
>>>> No DBA would be likely to consider it as anything but log spam.
>>
>>> I don't agree at all. No postgres instance should be run without
>>> log_checkpoints enabled. Performance is poor if checkpoints are
>>> triggered by anything but time, and that can only be diagnosed if
>>> log_checkpoints is on.
>>
>> This is complete nonsense.
>
> Shrug. It's based on many years of doing or being around people doing
> postgres support escalation shifts. And it's not like log_checkpoints
> incurs meaningful overhead or causes that much log volume.
>

Yeah. In tuned instances the checkpoints happen fairly infrequently most of the time (occasional batch loads being an exception, etc.), so the extra log traffic should be fairly negligible. And it's not like we can make the checkpointer infinitely smart - sometimes the cause is a change in the workload etc.

OTOH most of this data (# of timed/xlog checkpoints, buffers written by the checkpointer etc.) is available in the pg_stat_bgwriter view, and people generally have monitoring these days.

I don't have a strong opinion on changing the default - we generally recommend changing it, and we'll keep doing so, but it's usually part of various other recommendations, so this one tweak does not eliminate that.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
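As a rough illustration of the counters mentioned above, a sketch against pg_stat_bgwriter as it exists in current releases (cumulative totals only; none of the per-checkpoint timing detail that log_checkpoints provides):

    -- Sketch: checkpoint counters available without log_checkpoints.
    -- checkpoints_req includes xlog-triggered ("requested") checkpoints.
    SELECT checkpoints_timed,
           checkpoints_req,
           checkpoint_write_time,  -- cumulative ms spent writing buffers
           checkpoint_sync_time,   -- cumulative ms spent in fsync
           buffers_checkpoint,
           stats_reset
    FROM pg_stat_bgwriter;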
Hi,

On Sun, Oct 31, 2021 at 01:16:33PM -0700, Andres Freund wrote:
> On 2021-10-31 15:43:57 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2021-10-31 10:59:19 -0400, Tom Lane wrote:
> > >> No DBA would be likely to consider it as anything but log spam.
> >
> > > I don't agree at all. No postgres instance should be run without
> > > log_checkpoints enabled. Performance is poor if checkpoints are
> > > triggered by anything but time, and that can only be diagnosed if
> > > log_checkpoints is on.
> >
> > This is complete nonsense.
>
> Shrug. It's based on many years of doing or being around people doing
> postgres support escalation shifts. And it's not like log_checkpoints
> incurs meaningful overhead or causes that much log volume.

It could be a bit of reverse survivorship bias, i.e., you're only seeing the pathological cases, while 99% of the Postgres installations out there just hum along fine without anybody ever having to look at the checkpoint entries in the log.

But yeah, for serious production instances, it makes sense to turn that option on, but I'm not convinced it should be the default.

To put another option on the table: maybe a compromise could be to log xlog checkpoints unconditionally, and the (checkpoint_timeout) timed ones only if log_checkpoints is set (maybe with some exponential backoff to avoid log spam)?


Michael

-- 
Michael Banck
Team Lead PostgreSQL
Project Manager
Tel.: +49 2166 9901-171
Mail: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Geoff Richardson, Peter Lilley

Our handling of personal data is subject to:
https://www.credativ.de/en/contact/privacy/
On Sun, Oct 31, 2021 at 9:05 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> On 10/31/21 21:16, Andres Freund wrote:
> > Hi,
> >
> > On 2021-10-31 15:43:57 -0400, Tom Lane wrote:
> >> Andres Freund <andres@anarazel.de> writes:
> >>> On 2021-10-31 10:59:19 -0400, Tom Lane wrote:
> >>>> No DBA would be likely to consider it as anything but log spam.
> >>
> >>> I don't agree at all. No postgres instance should be run without
> >>> log_checkpoints enabled. Performance is poor if checkpoints are
> >>> triggered by anything but time, and that can only be diagnosed if
> >>> log_checkpoints is on.
> >>
> >> This is complete nonsense.
> >
> > Shrug. It's based on many years of doing or being around people doing
> > postgres support escalation shifts. And it's not like log_checkpoints
> > incurs meaningful overhead or causes that much log volume.
> >
> Yeah. In tuned instances the checkpoints happen fairly infrequently most
> of the time (occasional batch loads being an exception, etc.), so the
> extra log traffic should be fairly negligible. And it's not like we can
> make the checkpointer infinitely smart - sometimes the cause is a change
> in the workload etc.
>
> OTOH most of this data (# of timed/xlog checkpoints, buffers written by
> the checkpointer etc.) is available in the pg_stat_bgwriter view, and
> people generally have monitoring these days.
Yeah, I think you can get much of the data you need in pg_stat_bgwriter. There is still some data that log_checkpoints gives you that the statistics don't -- but maybe we should instead look at exposing that information in pg_stat_bgwriter, rather than changing the default.
> On 1 Nov 2021, at 14:02, Magnus Hagander <magnus@hagander.net> wrote:
> There is still some data that log_checkpoints gives you that the statistics
> don't -- but maybe we should instead look at exposing that information in
> pg_stat_bgwriter, rather than changing the default.

I don't have strong opinions on changing the default, but +1 on investigating exposing the information in the view for monitoring.

-- 
Daniel Gustafsson		https://vmware.com/
On Sun, Oct 31, 2021 at 10:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The general policy at the moment is that a normally-functioning server
> should emit *no* log traffic by default (other than a few messages
> at startup and shutdown). log_checkpoints is a particularly poor
> candidate for an exception to that policy, because it would produce so
> much traffic. No DBA would be likely to consider it as anything but
> log spam.

That's absolutely false. On any system where there's anything actually happening, there's going to be tons of stuff in the log, because there are going to be failed connection attempts, queries that result in errors, and all kinds of other things like that. By any reasonable metric, the log volume of log_checkpoints=on is tiny. It can't log anything more than once per checkpoint interval, which means you're not even talking about 1 message per minute. You can easily have hundreds or thousands of errors or other log messages from user activity *per second*, and even on a relatively quiet system that stuff is going to completely dwarf what you get from log_checkpoints. If log_checkpoints=on is your biggest source of log output, what you need isn't so much 'log_checkpoints=off' as 'pg_ctl stop'.

> > It seems the checkpoint stats that are emitted to the server logs when
> > the GUC log_checkpoints is enabled are so important that a postgres
> > database provider would never want to disable the GUC.
>
> This statement seems ridiculous on its face. If users need to wait
> with bated breath for a checkpoint report, we have something else
> we need to fix.

Besides appearing to be unwarranted mockery of what Bharath wrote, this statement also seems to reflect a complete lack of understanding of what is involved in maintaining a production system. Most of the time, things just work, so you don't need to look at the logs at all. But when things do go wrong, then you need some way to figure out what the problem is. System-wide performance problems not linked to an individual query are most often caused by one of two things: checkpoints, and big autovacuum jobs. If you've set up logging for those things in advance, then if and when the problem happens, you will have a chance of being able to understand it and solve it quickly. If you have not, the very first piece of advice you're going to get from me is to (1) check whether there are any autovacuum workers still running and, if so, figure out what they're doing, and (2) set log_checkpoints=on and log_autovacuum_min_duration to the smallest value that isn't going to fill up your logs with too much garbage. I can tell you from long experience with this sort of situation that users do not love it. It means that they often cannot troubleshoot problems of this type that happened in the past, because there's simply no information in the log file that is of any use, and they have to wait for the problem to recur... and I don't think it is overstating anything to say that some of them probably DO wait with bated breath, because they'd like whatever the issue is to get fixed!

I think shipping with log_checkpoints=on and log_autovacuum_min_duration=10m or so would be one of the best things we could possibly do to allow ex-post-facto troubleshooting of system-wide performance issues. The idea that users care more about the inconvenience of a handful of extra log messages than they do about being able to find problems when they have them matches no part of my experience. I suspect that many users would be willing to incur *way more* useless log messages than those settings would ever generate if it meant that they could actually find problems when and if they have them. And these messages would in fact be the most valuable thing in the log for a lot of users. What reasonable DBA cares more about the fact that the application attempted an insert that violated a foreign key constraint than they do about a checkpoint that took 20 minutes to fsync everything? The former is expected; if you thought that foreign key violations would never occur, you wouldn't need to incur the expense of having the system enforce them. The latter is unexpected and basically undiscoverable with default settings.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
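For concreteness, a sketch of that two-step first response (assuming a release that has pg_stat_progress_vacuum; the 10-minute threshold simply mirrors the proposal above):

    -- (1) Check for autovacuum workers that are still running, and what
    --     they are doing.
    SELECT a.pid, a.xact_start, p.relid::regclass AS relation, p.phase,
           p.heap_blks_scanned, p.heap_blks_total, p.index_vacuum_count
    FROM pg_stat_activity a
    LEFT JOIN pg_stat_progress_vacuum p USING (pid)
    WHERE a.backend_type = 'autovacuum worker';

    -- (2) Enable the logging for next time; both settings take effect on a
    --     configuration reload.
    ALTER SYSTEM SET log_checkpoints = on;
    ALTER SYSTEM SET log_autovacuum_min_duration = '10min';
    SELECT pg_reload_conf();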
On Tue, Nov 2, 2021 at 8:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think shipping with log_checkpoints=on and
> log_autovacuum_min_duration=10m or so would be one of the best things
> we could possibly do to allow ex-post-facto troubleshooting of
> system-wide performance issues. The idea that users care more about
> the inconvenience of a handful of extra log messages than they do
> about being able to find problems when they have them matches no part
> of my experience. I suspect that many users would be willing to incur
> *way more* useless log messages than those settings would ever
> generate if it meant that they could actually find problems when and
> if they have them.

I fully agree.

-- 
Peter Geoghegan
On 11/2/21 12:09, Peter Geoghegan wrote:
> On Tue, Nov 2, 2021 at 8:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> I think shipping with log_checkpoints=on and
>> log_autovacuum_min_duration=10m or so would be one of the best things
>> we could possibly do to allow ex-post-facto troubleshooting of
>> system-wide performance issues. The idea that users care more about
>> the inconvenience of a handful of extra log messages than they do
>> about being able to find problems when they have them matches no part
>> of my experience. I suspect that many users would be willing to incur
>> *way more* useless log messages than those settings would ever
>> generate if it meant that they could actually find problems when and
>> if they have them.
> I fully agree.

/metoo

cheers

andrew

-- 
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, Nov 3, 2021 at 12:18 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> On 11/2/21 12:09, Peter Geoghegan wrote:
> > On Tue, Nov 2, 2021 at 8:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >> I think shipping with log_checkpoints=on and
> >> log_autovacuum_min_duration=10m or so would be one of the best things
> >> we could possibly do to allow ex-post-facto troubleshooting of
> >> system-wide performance issues. The idea that users care more about
> >> the inconvenience of a handful of extra log messages than they do
> >> about being able to find problems when they have them matches no part
> >> of my experience. I suspect that many users would be willing to incur
> >> *way more* useless log messages than those settings would ever
> >> generate if it meant that they could actually find problems when and
> >> if they have them.
> > I fully agree.
>
> /metoo
same here
On 11/2/21 12:35 PM, Julien Rouhaud wrote:
> On Wed, Nov 3, 2021 at 12:18 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> > On 11/2/21 12:09, Peter Geoghegan wrote:
> > > On Tue, Nov 2, 2021 at 8:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > >> I think shipping with log_checkpoints=on and
> > >> log_autovacuum_min_duration=10m or so would be one of the best things
> > >> we could possibly do to allow ex-post-facto troubleshooting of
> > >> system-wide performance issues. The idea that users care more about
> > >> the inconvenience of a handful of extra log messages than they do
> > >> about being able to find problems when they have them matches no part
> > >> of my experience. I suspect that many users would be willing to incur
> > >> *way more* useless log messages than those settings would ever
> > >> generate if it meant that they could actually find problems when and
> > >> if they have them.
> > > I fully agree.
> >
> > /metoo
>
> same here

+1

-- 
-David
david@pgmasters.net
On 11/2/21 12:09, Peter Geoghegan wrote:
> On Tue, Nov 2, 2021 at 8:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> I think shipping with log_checkpoints=on and
>> log_autovacuum_min_duration=10m or so would be one of the best things
>> we could possibly do to allow ex-post-facto troubleshooting of
>> system-wide performance issues. The idea that users care more about
>> the inconvenience of a handful of extra log messages than they do
>> about being able to find problems when they have them matches no part
>> of my experience. I suspect that many users would be willing to incur
>> *way more* useless log messages than those settings would ever
>> generate if it meant that they could actually find problems when and
>> if they have them.
>
> I fully agree.
>

+1

-- 
Jan Wieck
On Tue, Nov 2, 2021 at 8:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think shipping with log_checkpoints=on and
> log_autovacuum_min_duration=10m or so would be one of the best things
> we could possibly do to allow ex-post-facto troubleshooting of
> system-wide performance issues.
Fully agree, it would be really great. Glad that you added autovacuum logging
into the picture.
Maybe 1m would be better – I recently observed a high-TPS system where
autoANALYZE took very long, causing a huge slowdown on standbys that
started a few minutes after the ANALYZE was launched.
Correlation and then causation were confirmed only because
log_autovacuum_min_duration was configured -- without it, no one would
have been able to understand what happened.
Back to checkpoint logging. With log_checkpoints = off and high write activity
combined with a low max_wal_size, we're already "spamming" the logs with lots of
"checkpoints are occurring too frequently" messages – and this happens very often;
any DBA who has run a restore process on Postgres with the default max_wal_size
(1GB, very low for modern DBs) has seen it.
Without details, this definitely looks like "spam" – and it's already happening
here and there. The details provided by log_checkpoints = on make it possible to
base the decision to reconfigure max_wal_size on data, not guesswork.
+1 for log_checkpoints = on
and +1 for log_autovacuum_min_duration = 1m or so.
On 2021-Nov-02, Robert Haas wrote:
> I think shipping with log_checkpoints=on and
> log_autovacuum_min_duration=10m or so would be one of the best things
> we could possibly do to allow ex-post-facto troubleshooting of
> system-wide performance issues. The idea that users care more about
> the inconvenience of a handful of extra log messages than they do
> about being able to find problems when they have them matches no part
> of my experience.

I agree. There are things that are much more likely to be unhelpful and irritating -- say, enabling log_connections by default. Such messages would be decidedly useless for a large fraction of users and a burden. That's not something you can claim about checkpoint and large-autovacuum messages, though; not only because they are much less frequent, but also because each line concisely represents a large amount of work.

-- 
Álvaro Herrera              39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"En las profundidades de nuestro inconsciente hay una obsesiva necesidad
de un universo lógico y coherente. Pero el universo real se halla siempre
un paso más allá de la lógica" (Irulan)
On 2021-Nov-02, Nikolay Samokhvalov wrote:
> Back to checkpoint logging. With log_checkpoints = off and high write
> activity combined with a low max_wal_size, we're already "spamming" the
> logs with lots of "checkpoints are occurring too frequently" messages –
> and this happens very often; any DBA who has run a restore process on
> Postgres with the default max_wal_size (1GB, very low for modern DBs)
> has seen it.
>
> Without details, this definitely looks like "spam"

Speaking of which -- I think we could easily remove checkpoint_warning without any loss of useful functionality. Or, if we have to keep it, we could change the way it works: when that condition triggers, then cause regular "checkpoint started/complete" messages to occur (if they are disabled), rather than the current "checkpoints are occurring too frequently" message.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"E pur si muove" (Galileo Galilei)
On Tue, Nov 2, 2021 at 2:39 PM Nikolay Samokhvalov <samokhvalov@gmail.com> wrote:
> +1 for log_checkpoints = on
> and +1 for log_autovacuum_min_duration = 1m or so.

I almost proposed 1m rather than 10m, but then I thought the better of it. I think it's unlikely that an autovacuum that takes 1 minute is really the cause of some big problem you're having on your system. Typical problem cases I see are hours or days long, so even 10 minutes is pretty short compared to what's likely to cause you real issues. And at the same time 10 minutes is also high enough that you won't get frequent log messages. 1 minute might not be: with 3 autovacuum workers by default, that could print out a message every 20 seconds, which does not feel worthwhile. I think a 10 minute threshold is much more likely to only capture events that you actually care about.

Now I do think that a lot of people would benefit from a lower setting than 10 minutes, just to get more information about what's happening. But IMHO, making the default as low as 1 minute is a bit much.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Nov 2, 2021 at 11:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I almost proposed 1m rather than 10m, but then I thought the better of
> it. I think it's unlikely that an autovacuum that takes 1 minute is
> really the cause of some big problem you're having on your system.
> Typical problem cases I see are hours or days long, so even 10 minutes
> is pretty short.
I'm talking about the autoANALYZE part, not VACUUM. In my case, it was a few tables
~100GB-1TB in size, with 1-2 GIN indexes (with fastupdate and the default 4MB pending
list size limit), 10 workers with quite a high bar in terms of throttling, and
default_statistics_target = 1000.
The observed autoANALYZE durations reached dozens of minutes, sometimes ~1 hour for a table.
The problem is that, it seems, ANALYZE (unlike VACUUM) holds a snapshot and takes an XID
-- and if it runs that long, this leads to issues on standbys. I'm going to post the
findings in a separate thread, but the point is that an autoANALYZE running for minutes
*may* cause big performance issues.
That's why 1m seems a good threshold to me, even if it leads to 3 log entries per minute
from 3 workers. That's quite low log traffic, and the data there is really useful for
retrospective analysis.
On Tue, Nov 2, 2021 at 11:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I almost proposed 1m rather than 10m, but then I thought the better of
> it. I think it's unlikely that an autovacuum that takes 1 minute is
> really the cause of some big problem you're having on your system.
> Typical problem cases I see are hours or days long, so even 10 minutes
> is pretty short compared to what's likely to cause you real issues.
> And at the same time 10 minutes is also high enough that you won't get
> frequent log messages.

I think that 10 minutes is fine. But if it has to be 30 minutes, then that's also probably fine.

I think that the case for enabling autovacuum logging is particularly good. The really big problems with autovacuum often involve anti-wraparound VACUUM, where you really want to have every possible opportunity to learn why VACUUM took much longer than expected. Going from doing index vacuuming in one single pass to requiring more than one pass can very significantly delay things, quite suddenly. Even when you have 95% of the maintenance_work_mem required to process indexes in a single pass, it might not be that much better than 50% or less. It's way too unpredictable in cases where users actually run into real problems -- cases where it actually matters.

-- 
Peter Geoghegan
Hi,

On Tue, Nov 02, 2021 at 11:55:23AM -0400, Robert Haas wrote:
> I think shipping with log_checkpoints=on and
> log_autovacuum_min_duration=10m or so would be one of the best things
> we could possibly do to allow ex-post-facto troubleshooting of
> system-wide performance issues.

I don't disagree, but while we're at it, I'd like to throw log_lock_waits into the ring as well. IME, once you get to a situation where it pops up, the overall log volume usually dwarfs it, but it's pretty important for answering the "why was that query slow 5 days ago" question.


Michael

-- 
Michael Banck
Team Lead PostgreSQL
Project Manager
Tel.: +49 2166 9901-171
Mail: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Geoff Richardson, Peter Lilley

Our handling of personal data is subject to:
https://www.credativ.de/en/contact/privacy/
On Tue, Nov 2, 2021 at 3:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I think that 10 minutes is fine. But if it has to be 30 minutes, then
> that's also probably fine.

+1.

> I think that the case for enabling autovacuum logging is particularly
> good. The really big problems with autovacuum often involve
> anti-wraparound VACUUM, where you really want to have every possible
> opportunity to learn why VACUUM took much longer than expected.

+1.

> Going from doing index vacuuming in one single pass to requiring more
> than one pass can very significantly delay things, quite suddenly.
> Even when you have 95% of the maintenance_work_mem required to process
> indexes in a single pass, it might not be that much better than 50% or
> less. It's way too unpredictable in cases where users actually run
> into real problems -- cases where it actually matters.

Yeah. I have only very rarely run into cases where people actually end up needing multiple passes, but it's always something I need to rule out as part of the troubleshooting process, and it's hard to do that without the log_autovacuum_min_duration output. It's pretty common for me to see cases where, for example, the I/O performed by autovacuum read a bunch of data off disk, which shoved a bunch of hot data out of cache, and then performance tanked really hard. Or where the vacuum cost limit is ridiculously low relative to the table size and the vacuums take an unreasonably long time. In those kinds of cases what you really need to know is that there was a vacuum on a certain table, and how long it took, and when that happened.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On 10/31/21 10:24 PM, Michael Banck wrote:
> To put another option on the table: maybe a compromise could be to log
> xlog checkpoints unconditionally, and the (checkpoint_timeout) timed ones
> only if log_checkpoints is set (maybe with some exponential backoff to
> avoid log spam)?

If we're going to do something like that, we should convert it from a boolean to an enum.

    log_checkpoints = wal

(I'm being very careful not to slip on that slope.)

-- 
Vik Fearing
On Tue, Nov 2, 2021 at 1:47 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah. I have only very rarely run into cases where people actually end
> up needing multiple passes, but it's always something I need to rule
> out as part of the troubleshooting process, and it's hard to do that
> without the log_autovacuum_min_duration output. It's pretty common for
> me to see cases where, for example, the I/O performed by autovacuum
> read a bunch of data off disk, which shoved a bunch of hot data out of
> cache, and then performance tanked really hard.

Right. Indexes are a big source of the variability IME. I agree that having to do more than one pass isn't all that common. More often it's something to do with the fact that there are 20 indexes, or the fact that they use UUID indexes. VACUUM can very easily end up dirtying far more pages than might be expected, just because of these kinds of variations.

Variability in the duration of VACUUM due to issues on the heapam side tends to be due to things like the complicated relationship between work done and XIDs consumed, the fact that autovacuum scheduling makes VACUUM kick in at geometric intervals/in geometric series (the scale factor is applied to the size of the table at the end of the last autovacuum), and problems with maintaining accurate information about the number of dead tuples in the table.

> Or where the vacuum
> cost limit is ridiculously low relative to the table size and the
> vacuums take an unreasonably long time. In those kinds of cases what
> you really need to know is that there was a vacuum on a certain table,
> and how long it took, and when that happened.

In short: very often everything is just fine, but when things do go wrong, the number of plausible sources of trouble is vast. There is no substitute for already having reasonably good instrumentation of VACUUM in place, to show how things have changed over time. The time dimension is often very important, for experts and non-experts alike.

The more success we have with improving VACUUM (e.g., the VM/freeze map stuff), the more likely it is that the notable remaining problems will be complicated and kind of weird. That's what I see.

-- 
Peter Geoghegan
On Sun, Oct 31, 2021 at 10:24:38PM +0100, Michael Banck wrote:
> On Sun, Oct 31, 2021 at 01:16:33PM -0700, Andres Freund wrote:
> > Shrug. It's based on many years of doing or being around people doing
> > postgres support escalation shifts. And it's not like log_checkpoints
> > incurs meaningful overhead or causes that much log volume.
>
> It could be a bit of reverse survivorship bias, i.e., you're only seeing
> the pathological cases, while 99% of the Postgres installations out
> there just hum along fine without anybody ever having to look at the
> checkpoint entries in the log.
>
> But yeah, for serious production instances, it makes sense to turn that
> option on, but I'm not convinced it should be the default.

Yes, I agree with this analysis. There is a sense that people who regularly diagnose Postgres problems want this information in the logs by default, while the majority of sites might want silent logs for normal query activity.

I realize in this thread that Robert Haas mentioned foreign key violations that would likely also appear in logs, but those are ERROR/WARNING etc. messages that can be filtered out with log_min_messages. I think if we want to offer a more verbose set of analytic output, by default or not, we might want to relabel them as something other than LOG messages so they can be filtered out with log_min_messages.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.
Bruce Momjian <bruce@momjian.us> writes:
> On Sun, Oct 31, 2021 at 10:24:38PM +0100, Michael Banck wrote:
>> It could be a bit of reverse survivorship bias, i.e., you're only seeing
>> the pathological cases, while 99% of the Postgres installations out
>> there just hum along fine without anybody ever having to look at the
>> checkpoint entries in the log.
>>
>> But yeah, for serious production instances, it makes sense to turn that
>> option on, but I'm not convinced it should be the default.

> Yes, I agree with this analysis. There is a sense that people who
> regularly diagnose Postgres problems want this information in the logs
> by default, while the majority of sites might want silent logs for
> normal query activity.

I'm still of the position that the default ought to be that a normally-functioning server generates no ongoing log output. Only people who have got Nagios watching their logs, or some such setup, are going to want anything different. And that is a minority use-case. There are going to be way more people bitching because their postmaster log overflowed their disk than there will be people who are happier because you made such output the default. (Don't forget that our default logging setup does not rotate the logs.)

However, I'm prepared to accept a tight definition of what "normally-functioning" means. For instance, I don't have a problem with the idea of having log_autovacuum_min_duration defaulting to something positive, as long as it's fairly large. Then it's only going to emit anything if there is a situation that really needs attention.

My objection to log_checkpoints=on is that that's going to produce a constant stream of messages even when *nothing at all* is wrong. Worse yet, a novice DBA would likely have a hard time understanding from those messages whether there was anything to worry about or not. If we could design a variant of log_checkpoints that would produce output only when the situation really needs attention, I'd be fine with enabling that by default.

			regards, tom lane
On Tue, Nov 2, 2021 at 5:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm still of the position that the default ought to be that a
> normally-functioning server generates no ongoing log output.
> Only people who have got Nagios watching their logs, or some
> such setup, are going to want anything different. And that is
> a minority use-case. There are going to be way more people
> bitching because their postmaster log overflowed their disk
> than there will be people who are happier because you made
> such output the default. (Don't forget that our default
> logging setup does not rotate the logs.)
Is it known how many new Postgres installations are from
popular packages (that have log rotation enabled) compared
to custom-built and managed in their own way?
If people do not use packages these days, they have to take
care of such things themselves – that includes log rotation and, for
example, autostart. The same people who might complain of overflowed
disks should already be complaining about Postgres not surviving
machine restarts, right?
Nik
On 11/2/21 20:02, Tom Lane wrote:
> My objection to log_checkpoints=on is that that's going to produce
> a constant stream of messages even when *nothing at all* is wrong.
> Worse yet, a novice DBA would likely have a hard time understanding
> from those messages whether there was anything to worry about or not.
> If we could design a variant of log_checkpoints that would produce
> output only when the situation really needs attention, I'd be fine
> with enabling that by default.

Making log_checkpoints an enum sort of thing, as already suggested, might do that. Or (as also suggested) elevating checkpoint logging for a while once a checkpoint happened because of WAL.

The thing I don't want to see us doing is *nothing at all*, when pretty much everyone with some customer experience in the field is saying "this is the information we want to see post incident, and nobody has it, so we sit there waiting for the next time it happens."

Regards

-- 
Jan Wieck
On Tue, Nov 2, 2021 at 11:18 PM Jan Wieck <jan@wi3ck.info> wrote:
> The thing I don't want to see us doing is *nothing at all*, when pretty
> much everyone with some customer experience in the field is saying "this
> is the information we want to see post incident, and nobody has it, so
> we sit there waiting for the next time it happens."

Quite so.

I'm not convinced that the proposal to log checkpoints only when they're triggered by WAL rather than by time is really solving anything. It isn't as if a time-based checkpoint couldn't have caused a problem. What you're going to be looking for is something much more complicated than that. Were the fsyncs slow? Did the checkpoint around the time the user reported a problem write significantly more data than the other checkpoints?

I guess if a checkpoint wrote 1MB of data and took 0.1 seconds to complete the fsyncs, I don't much care whether it shows up in the log or not. If it wrote 4GB of data, or if it took 15 seconds to complete the fsyncs, I care. That's easily enough to account for some problem that somebody had. I'm not sure whether there are any other interesting criteria.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Nov 2, 2021 at 8:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm still of the position that the default ought to be that a
> normally-functioning server generates no ongoing log output.
> Only people who have got Nagios watching their logs, or some
> such setup, are going to want anything different. And that is
> a minority use-case. There are going to be way more people
> bitching because their postmaster log overflowed their disk
> than there will be people who are happier because you made
> such output the default. (Don't forget that our default
> logging setup does not rotate the logs.)

I mean, you still seem to be postulating a setup where none of the user activity ever causes any logging messages to get generated, which is just not the way real systems behave. And as to log volume, if you log a checkpoint every 5 minutes, generating roughly 1/4 kB of log output each time, that's ~3kB/hour, or ~27MB/year.

Just for fun, I went and had a look at the least-used PostgreSQL instance that I know about, which is the one running on my Raspberry Pi. It powers a little web site that I wrote, and people sometimes use that web site, but not that often. It's running PostgreSQL 9.4.5 because it's not an important enough system that I can be bothered to upgrade it. The OS that came preinstalled is some Debian derivative that arranges to rotate the logs automatically, so I only have logs going back to May. For the most part, it generates pretty much nothing in the logs, although back at the end of May/early June it kicked out about 56MB of logs. I guess I must have temporarily enabled log_statement=all to debug something or other.

Anyway, there are three points I'd like to make about this machine. The first is that the root partition, where pretty much all the files are located on this tiny box, is 6GB. It currently has 825MB of free space. If you enabled log_checkpoints=on on this machine, first of all I suppose most of the time it wouldn't log anything, because there's usually no activity and thus the checkpoint could just be skipped. Second, even if it did checkpoint every 5 minutes and generate 27MB/year of logs, that would take roughly 30 years to use up the 825MB of free space I have on the root partition. I doubt this machine will be around in 30 years; it's uncertain that *I'll* be around in 30 years. Third, this machine came with PostgreSQL log compression and rotation set up by the operating system, which I think is pretty good evidence that configurations including those features are common. The compression ratio on the logs this machine is generating seems to be about 98.5%. After compression, the overhead of the log_checkpoints output for the three months of logs that it keeps - assuming it had enough activity to checkpoint every 5 minutes - would be perhaps 100kB, which is not only little enough not to be a problem on this machine, but also little enough not to be a problem on the first IBM PC that I ever used, back in the 1980s.

If log_checkpoints=on wouldn't fill up the disk on my 5+-year-old Raspberry Pi in less time than it takes to raise multiple children to adulthood, even after disabling the OS-provided log rotation and compression, then it seems more than fair to say that for the vast majority of users, this isn't a real problem. And for those few for whom it *is* a real problem, they can still shut off log_checkpoints. It's not like anybody is proposing to remove the option.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Nov 2, 2021 at 08:02:41PM -0400, Tom Lane wrote:
> I'm still of the position that the default ought to be that a
> normally-functioning server generates no ongoing log output.
> Only people who have got Nagios watching their logs, or some
> such setup, are going to want anything different. And that is
> a minority use-case. There are going to be way more people
> bitching because their postmaster log overflowed their disk
> than there will be people who are happier because you made
> such output the default. (Don't forget that our default
> logging setup does not rotate the logs.)
>
> However, I'm prepared to accept a tight definition of what
> "normally-functioning" means. For instance, I don't have a
> problem with the idea of having log_autovacuum_min_duration
> defaulting to something positive, as long as it's fairly large.
> Then it's only going to emit anything if there is a situation
> that really needs attention.
>
> My objection to log_checkpoints=on is that that's going to produce
> a constant stream of messages even when *nothing at all* is wrong.
> Worse yet, a novice DBA would likely have a hard time understanding
> from those messages whether there was anything to worry about or not.
> If we could design a variant of log_checkpoints that would produce
> output only when the situation really needs attention, I'd be fine
> with enabling that by default.

I wonder if we need to follow the Unix model on this by having kernel messages logged to a file via syslog (and potentially filtered), and then have all recent activity logged into a ring buffer and visible via something like dmesg.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.
On Wed, Nov 3, 2021 at 08:45:46AM -0400, Robert Haas wrote:
> If log_checkpoints=on wouldn't fill up the disk on my 5+-year-old
> Raspberry Pi in less time than it takes to raise multiple children to
> adulthood, even after disabling the OS-provided log rotation and
> compression, then it seems more than fair to say that for the vast
> majority of users, this isn't a real problem. And for those few for
> whom it *is* a real problem, they can still shut off log_checkpoints.
> It's not like anybody is proposing to remove the option.

Well, another issue is that if something unusual does happen, it appears very visibly if you are looking just for LOG messages, while if you have many checkpoint log messages, it might get lost. If we want to log more by default, I think we are looking at several issues:

*  enabling log rotation and log file reuse in the default install
*  changing the labels of some of the normal-operation log messages
*  changing the way some of these log messages are controlled
*  perhaps using a ring buffer for common log messages

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.
On Tue, Nov 2, 2021 at 7:46 PM Bruce Momjian <bruce@momjian.us> wrote:
> I realize in this thread that Robert Haas mentioned foreign key
> violations that would likely also appear in logs, but those are
> ERROR/WARNING etc. messages that can be filtered out with
> log_min_messages.

I think that's pretty impractical. In order to get rid of all of the harmless application errors, you'd have to set log_min_messages to FATAL. That's a bad idea for two reasons. First, it wouldn't even filter out all the garbage that you don't care about, and second, it would filter out a lot of really important things that you do care about. For example, you would still get messages in the logs like "FATAL: connection to client lost" and "FATAL: canceling authentication due to timeout" that probably nobody really cares about, but you would not get messages like "ERROR: downlink or sibling link points to deleted block in index <whatever>" which are indicative of data corruption and thus important. For better or for worse, the distinction between ERROR, FATAL, and PANIC is entirely based on what we do after printing the message, and NOT on how serious the message is.

Now maybe we want to improve that at some point, but it shouldn't stand in the way of this proposal. If people are living with the gigantic volume of ERROR and FATAL messages that typically end up in the logs today, they can certainly live with the absolutely tiny volume of additional information that would be generated by log_checkpoints=on.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Nov 3, 2021 at 09:09:18AM -0400, Robert Haas wrote:
> I think that's pretty impractical. In order to get rid of all of the
> harmless application errors, you'd have to set log_min_messages to
> FATAL. That's a bad idea for two reasons. First, it wouldn't even
> filter out all the garbage that you don't care about, and second, it
> would filter out a lot of really important things that you do care
> about. For example, you would still get messages in the logs like
> "FATAL: connection to client lost" and "FATAL: canceling
> authentication due to timeout" that probably nobody really cares
> about, but you would not get messages like "ERROR: downlink or sibling
> link points to deleted block in index <whatever>" which are indicative
> of data corruption and thus important. For better or for worse, the
> distinction between ERROR, FATAL, and PANIC is entirely based on what
> we do after printing the message, and NOT on how serious the message
> is.
>
> Now maybe we want to improve that at some point, but it shouldn't
> stand in the way of this proposal. If people are living with the
> gigantic volume of ERROR and FATAL messages that typically end up in
> the logs today, they can certainly live with the absolutely tiny
> volume of additional information that would be generated by
> log_checkpoints=on.

See my later email --- I think we need to make a few changes for this to work well.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.
On Wed, Nov 3, 2021 at 9:04 AM Bruce Momjian <bruce@momjian.us> wrote:
> Well, another issue is that if something unusual does happen, it appears
> very visibly if you are looking just for LOG messages, while if you have
> many checkpoint log messages, it might get lost.

That's a theoretical risk, but in practice the problem is actually exactly the opposite of that, at least in my experience. If somebody sends me a log file from a machine with log_checkpoints=on, the file is typically hundreds of megabytes in size, and in that file someplace there are a handful of log lines generated by log_checkpoints mixed in with millions of log lines about other things. So it's not that log_checkpoints=on would keep you from noticing other things happening on the machine. It's that other things happening on the machine would keep you from noticing the log_checkpoints=on output. In fact this problem is true for pretty much anything important that shows up in the log: it's likely to be completely swamped by other messages logged *at the exact same log level* that are not important at all. The ERROR that could have alerted you to data corruption is likely mixed in with millions of innocuous ERROR messages, for example.

> If we want to log more
> by default, I think we are looking at several issues:
>
> *  enabling log rotation and log file reuse in the default install
> *  changing the labels of some of the normal-operation log messages
> *  changing the way some of these log messages are controlled
> *  perhaps using a ring buffer for common log messages

If we were talking about increasing the log volume by an amount that was actually meaningful, we might need to think about these things, but that's not what is being proposed. The only systems where this is going to lead to a significant percentage increase in log volume are the ones that are pretty much idle now. On real systems, this is going to lead to a change that is less than 1%, and maybe less than 0.001%. We don't have to rearchitect anything as a result of decisions that have so little practical impact - especially considering that at least some packagers are already putting log rotation in place automatically, in a way consistent with distro policies.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On 11/3/21 09:09, Robert Haas wrote:
> For better or for worse, the
> distinction between ERROR, FATAL, and PANIC is entirely based on what
> we do after printing the message, and NOT on how serious the message
> is.

THAT is a real problem with our error handling and logging system. Often, using RAISE in a user-defined function or procedure is part of the communication between the application's code in the backend and the parts in the middleware. The information that a function rejected user input after a sanity check doesn't belong in the system log. It should produce an error message on the user's end, and that is it. However, currently there is no way for a UDF to ensure the transaction gets rolled back without raising an ERROR and bloating the log.

For example, the BenchmarkSQL UDF implementation raises such an ERROR for the transactions that TPC-C requires to contain an input error (a non-existing ITEM), which happens in 1% of all NEW-ORDER transactions. Running an 8-core system with plenty of RAM close to saturation will generate about 10MB of log per hour just for that. That is a quarter GB per day of useless garbage that no DBA is ever going to be interested in.

If anybody is worried about producing too much log output, this should be their primary focus.

Regards, Jan
On Tue, Nov 02, 2021 at 11:55:23AM -0400, Robert Haas wrote:
> I think shipping with log_checkpoints=on and
> log_autovacuum_min_duration=10m or so would be one of the best things
> we could possibly do to allow ex-post-facto troubleshooting of
> system-wide performance issues. The idea that users care more about
> the inconvenience of a handful of extra log messages than they do
> about being able to find problems when they have them matches no part
> of my experience. I suspect that many users would be willing to incur
> *way more* useless log messages than those settings would ever
> generate if it meant that they could actually find problems when and
> if they have them. And these messages would in fact be the most
> valuable thing in the log for a lot of users. What reasonable DBA
> cares more about the fact that the application attempted an insert
> that violated a foreign key constraint than they do about a checkpoint
> that took 20 minutes to fsync everything? The former is expected; if
> you thought that foreign key violations would never occur, you
> wouldn't need to incur the expense of having the system enforce them.
> The latter is unexpected and basically undiscoverable with default
> settings.

+1.
--
Michael
On Thu, Nov 4, 2021 at 1:11 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Nov 02, 2021 at 11:55:23AM -0400, Robert Haas wrote:
> > I think shipping with log_checkpoints=on and
> > log_autovacuum_min_duration=10m or so would be one of the best things
> > we could possibly do to allow ex-post-facto troubleshooting of
> > system-wide performance issues. The idea that users care more about
> > the inconvenience of a handful of extra log messages than they do
> > about being able to find problems when they have them matches no part
> > of my experience. I suspect that many users would be willing to incur
> > *way more* useless log messages than those settings would ever
> > generate if it meant that they could actually find problems when and
> > if they have them. And these messages would in fact be the most
> > valuable thing in the log for a lot of users. What reasonable DBA
> > cares more about the fact that the application attempted an insert
> > that violated a foreign key constraint than they do about a checkpoint
> > that took 20 minutes to fsync everything? The former is expected; if
> > you thought that foreign key violations would never occur, you
> > wouldn't need to incur the expense of having the system enforce them.
> > The latter is unexpected and basically undiscoverable with default
> > settings.
>
> +1.

Thanks all for your inputs. Here's a v1 patch doing the above. Please review it.

With log_checkpoints=on, "./initdb -D data" generates a few checkpoint log entries [1]. I hope this is okay, as it's a one-time thing per database cluster. Thoughts?

[1]
creating configuration files ... ok
running bootstrap script ... 2021-11-04 14:50:34.163 UTC [865574] LOG: checkpoint starting: shutdown immediate
2021-11-04 14:50:34.166 UTC [865574] LOG: checkpoint complete: wrote 222 buffers (1.4%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.001 s, total=0.003 s; sync files=0, longest=0.000 s, average=0.000 s; distance=698 kB, estimate=698 kB
ok
performing post-bootstrap initialization ... 2021-11-04 14:50:35.069 UTC [865576] LOG: checkpoint starting: immediate force wait flush-all
2021-11-04 14:50:35.069 UTC [865576] STATEMENT: CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false;
2021-11-04 14:50:35.076 UTC [865576] LOG: checkpoint complete: wrote 731 buffers (4.5%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.008 s, sync=0.001 s, total=0.008 s; sync files=0, longest=0.000 s, average=0.000 s; distance=3889 kB, estimate=3889 kB
2021-11-04 14:50:35.076 UTC [865576] STATEMENT: CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false;
2021-11-04 14:50:35.094 UTC [865576] LOG: checkpoint starting: immediate force wait
2021-11-04 14:50:35.094 UTC [865576] STATEMENT: CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false;
2021-11-04 14:50:35.095 UTC [865576] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=3500 kB
2021-11-04 14:50:35.095 UTC [865576] STATEMENT: CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false;
2021-11-04 14:50:35.101 UTC [865576] LOG: checkpoint starting: immediate force wait flush-all
2021-11-04 14:50:35.101 UTC [865576] STATEMENT: CREATE DATABASE postgres;
2021-11-04 14:50:35.102 UTC [865576] LOG: checkpoint complete: wrote 12 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s; distance=18 kB, estimate=3152 kB
2021-11-04 14:50:35.102 UTC [865576] STATEMENT: CREATE DATABASE postgres;
2021-11-04 14:50:35.120 UTC [865576] LOG: checkpoint starting: immediate force wait
2021-11-04 14:50:35.120 UTC [865576] STATEMENT: CREATE DATABASE postgres;
2021-11-04 14:50:35.122 UTC [865576] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=2836 kB
2021-11-04 14:50:35.122 UTC [865576] STATEMENT: CREATE DATABASE postgres;
2021-11-04 14:50:35.123 UTC [865576] LOG: checkpoint starting: shutdown immediate
2021-11-04 14:50:35.124 UTC [865576] LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=2553 kB
ok
syncing data to disk ... ok

initdb: warning: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

    ./pg_ctl -D data -l logfile start

Regards,
Bharath Rupireddy.
On Thu, Nov 4, 2021 at 10:59 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> With log_checkpoints=on, "./initdb -D data" generates a few checkpoint
> logs [1]. I hope this is okay as it's a one-time thing per database
> cluster. Thoughts?

I think you should arrange to suppress that output. There's no reason
why initdb can't pass -c log_checkpoints=off. See backend_options in
initdb.c.

I didn't spot any other problems on a quick read-through.

--
Robert Haas
EDB: http://www.enterprisedb.com
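For readers following along, Robert's suggestion amounts to a one-line addition. A minimal sketch, assuming backend_options in src/bin/initdb/initdb.c is a plain option string as it was around this time; the existing options are elided here and the change that was eventually committed may look different:

    /* src/bin/initdb/initdb.c -- sketch only */
    static const char *backend_options =
        "--single "
        /* ... existing bootstrap options elided ... */
        "-c log_checkpoints=false";  /* keep bootstrap checkpoints out of initdb's output */

With something like that in place, the LOG lines shown in [1] above would no longer appear during "initdb -D data", while log_checkpoints=on would remain the default for a normally started server.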
On 2021/11/05 0:04, Robert Haas wrote:
> On Thu, Nov 4, 2021 at 10:59 AM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
>> With log_checkpoints=on, "./initdb -D data" generates a few checkpoint
>> logs [1]. I hope this is okay as it's a one-time thing per database
>> cluster. Thoughts?
>
> I think you should arrange to suppress that output.

+1

> I didn't spot any other problems on a quick read-through.

-bool log_checkpoints = false;
+bool log_checkpoints = true;

Shouldn't the global variable Log_autovacuum_min_duration also be
initialized to 600000, like the above change?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Thu, Nov 4, 2021 at 10:50 PM Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> On 2021/11/05 0:04, Robert Haas wrote:
> > On Thu, Nov 4, 2021 at 10:59 AM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> >> With log_checkpoints=on, "./initdb -D data" generates a few checkpoint
> >> logs [1]. I hope this is okay as it's a one-time thing per database
> >> cluster. Thoughts?
> >
> > I think you should arrange to suppress that output.
>
> +1

Thanks. I did that.

> > I didn't spot any other problems on a quick read-through.
>
> -bool log_checkpoints = false;
> +bool log_checkpoints = true;
>
> Shouldn't the global variable Log_autovacuum_min_duration also be
> initialized to 600000, like the above change?

I missed it. Added now. Please review the attached v2 patch.

Regards,
Bharath Rupireddy.
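Since the attachments themselves are not reproduced here, the following is an illustrative sketch of the two kinds of changes a patch like v2 would typically contain, namely flipping the compiled-in defaults and the matching GUC boot values. This is not the attached patch; file locations and surrounding code are from memory and may differ:

    /* Illustrative sketch only, not the attached v2 patch. */

    /* Where the variables are defined (xlog.c and autovacuum.c at the time): */
    bool    log_checkpoints = true;                 /* previously false */
    int     Log_autovacuum_min_duration = 600000;   /* in ms, i.e. 10min; previously -1 (disabled) */

    /*
     * Plus the matching boot_val fields in the guc.c tables and the
     * commented-out defaults in postgresql.conf.sample, e.g.:
     *   #log_checkpoints = on
     *   #log_autovacuum_min_duration = 10min
     */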
Hi,

On 2021-11-02 11:55:23 -0400, Robert Haas wrote:
> On Sun, Oct 31, 2021 at 10:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > The general policy at the moment is that a normally-functioning server
> > should emit *no* log traffic by default (other than a few messages
> > at startup and shutdown). log_checkpoints is a particularly poor
> > candidate for an exception to that policy, because it would produce so
> > much traffic. No DBA would be likely to consider it as anything but
> > log spam.
>
> That's absolutely false. On any system where there's anything actually
> happening, there's going to be tons of stuff in the log because there
> are going to be failed connection attempts, queries that result in
> errors, and all kinds of other things like that. By any reasonable
> metric, the log volume of log_checkpoints=on is tiny.

Yes. I think we do have significant issues with noisy messages drowning
out all signal in the log, but that's largely stuff that's logged by
default based on client actions, at a high rate, rather than something
occasional like log_checkpoints.

It's *hard* to find relevant stuff in postgres log files. Most instances
with some amount of traffic will have non-graceful disconnects (each
logging two messages, one "LOG: could not send data to client: Broken
pipe" and one "FATAL: connection to client lost"). It's normal to have
some amount of constraint violations. Etc. One cannot realistically spot
the LOG or ERROR messages indicating real trouble, because we do not
provide a realistic way to separate such "normal" log messages from the
ones that fairly reliably indicate a problem.

> Besides appearing to be unwarranted mockery of what Bharath wrote,

Indeed.

Greetings,

Andres Freund
On 2021-Nov-03, Jan Wieck wrote:
> On 11/3/21 09:09, Robert Haas wrote:
> > For better or for worse, the distinction between ERROR, FATAL, and
> > PANIC is entirely based on what we do after printing the message,
> > and NOT on how serious the message is.
>
> THAT is a real problem with our error handling and logging system.

Agreed.

--
Álvaro Herrera         39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"No es bueno caminar con un hombre muerto"
On Friday, 2021-11-05 at 11:27 -0300, Alvaro Herrera wrote:
> On 2021-Nov-03, Jan Wieck wrote:
> > On 11/3/21 09:09, Robert Haas wrote:
> > > For better or for worse, the distinction between ERROR, FATAL, and
> > > PANIC is entirely based on what we do after printing the message,
> > > and NOT on how serious the message is.
> >
> > THAT is a real problem with our error handling and logging system.
>
> Agreed.

Well, that, and the fact that those distinctions are only made for
user-facing events, whereas it seems to me we only distinguish between
LOG and PANIC for server-facing events; maybe we need one or more
additional levels here in order to make it easier for admins to see the
really bad things that are happening?


Michael

--
Michael Banck
Team Lead PostgreSQL, credativ GmbH
Tel.: +49 2166 9901-171
E-Mail: michael.banck@credativ.de
On 2021-Nov-05, Michael Banck wrote:
> Well, that, and the fact that those distinctions are only made for
> user-facing events, whereas it seems to me we only distinguish between
> LOG and PANIC for server-facing events; maybe we need one or more
> additional levels here in order to make it easier for admins to see the
> really bad things that are happening?

I think what we need is an orthogonal classification. "This FATAL here
is routine; that ERROR there denotes a severe problem in the underlying
OS". Additional levels won't help with that. Maybe adding the concept
of "severity" or "criticality" to some messages would be useful to
decide what to keep and what to discard.

--
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Uno puede defenderse de los ataques; contra los elogios se esta indefenso"
On Fri, Nov 5, 2021 at 10:50 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I think what we need is an orthogonal classification. "This FATAL here
> is routine; that ERROR there denotes a severe problem in the underlying
> OS". Additional levels won't help with that. Maybe adding the concept
> of "severity" or "criticality" to some messages would be useful to
> decide what to keep and what to discard.

It would be really nice if we could do that in some kind of sensible
way. I think it's hard to get agreement, though, because different
people are going to care about different things. In fact, that
consideration weighs heavily against trying to adopt any sort of
project policy at all: somebody's bound to think it sucks. Even people
who mostly like it will demand the ability to customize it. On the
other hand, I also can't endorse the current policy of generating a
paper trail long enough to reach Alpha Centauri, so maybe we ought to
try harder to reach some sort of consensus.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Nov 5, 2021 at 03:29:37PM +0100, Michael Banck wrote:
> On Friday, 2021-11-05 at 11:27 -0300, Alvaro Herrera wrote:
> > On 2021-Nov-03, Jan Wieck wrote:
> > > On 11/3/21 09:09, Robert Haas wrote:
> > > > For better or for worse, the distinction between ERROR, FATAL, and
> > > > PANIC is entirely based on what we do after printing the message,
> > > > and NOT on how serious the message is.
> > >
> > > THAT is a real problem with our error handling and logging system.
> >
> > Agreed.
>
> Well, that, and the fact that those distinctions are only made for
> user-facing events, whereas it seems to me we only distinguish between
> LOG and PANIC for server-facing events; maybe we need one or more
> additional levels here in order to make it easier for admins to see the
> really bad things that are happening?

Agreed. We have three levels of non-error message for the client (info,
notice, warning), but only one level for errors that stop the current
query: ERROR. It seems we need another level that also stops the current
query but indicates that the message contains information useful to
database server administrators.

--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.
On 11/5/21 10:50, Alvaro Herrera wrote:
> On 2021-Nov-05, Michael Banck wrote:
>
>> Well, that, and the fact that those distinctions are only made for
>> user-facing events, whereas it seems to me we only distinguish between
>> LOG and PANIC for server-facing events; maybe we need one or more
>> additional levels here in order to make it easier for admins to see the
>> really bad things that are happening?
>
> I think what we need is an orthogonal classification. "This FATAL here
> is routine; that ERROR there denotes a severe problem in the underlying
> OS". Additional levels won't help with that. Maybe adding the concept
> of "severity" or "criticality" to some messages would be useful to
> decide what to keep and what to discard.

That would go a long way. I would add a third classification, "area",
indicating whether a message is, for example, resource-related or
application-logic related. An FK violation is application logic; running
checkpoints too often is a resource problem. Allow the DBA to create
filters based on combinations of these and it should work well enough.

Regards, Jan Wieck
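To make the idea concrete, here is a purely hypothetical sketch of the kind of orthogonal tagging Alvaro and Jan are describing. None of these names exist in PostgreSQL; this is an illustration of the concept, not a proposal or existing code:

    /* Hypothetical, for illustration only. */
    typedef enum MessageCriticality
    {
        MSGCRIT_ROUTINE,    /* expected in normal operation: FK violations, client disconnects */
        MSGCRIT_DEGRADED,   /* service continues but needs attention: xlog-triggered checkpoints */
        MSGCRIT_SEVERE      /* likely OS, hardware, or configuration trouble */
    } MessageCriticality;

    typedef enum MessageArea
    {
        MSGAREA_APPLICATION,    /* driven by client/application behavior */
        MSGAREA_RESOURCE,       /* WAL, checkpoints, memory, storage */
        MSGAREA_SECURITY        /* authentication and authorization */
    } MessageArea;

A DBA-side filter could then combine the existing elevel with these tags, for example "show everything tagged MSGCRIT_DEGRADED or worse in MSGAREA_RESOURCE, whether it was logged as LOG or as ERROR", which is exactly the "routine FATAL" versus "severe ERROR" distinction being discussed.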
On Thu, Nov 4, 2021 at 1:49 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> Please review the attached v2 patch.

It looks OK to me.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Nov 10, 2021 at 8:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Nov 4, 2021 at 1:49 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> > Please review the attached v2 patch.
>
> It looks OK to me.

May I know if the v2 patch (attached at [1] in this thread) is still of
interest?

CF entry is here - https://commitfest.postgresql.org/36/3401/

[1] - https://www.postgresql.org/message-id/CALj2ACU9cK4pCzcqvey71F57PTPsdxtUGmfUnQ7-GR4pTUgmeA%40mail.gmail.com

Regards,
Bharath Rupireddy.
On Thu, Dec 9, 2021 at 11:02 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> May I know if the v2 patch (attached at [1] in this thread) is still
> of interest?
>
> CF entry is here - https://commitfest.postgresql.org/36/3401/
>
> [1] - https://www.postgresql.org/message-id/CALj2ACU9cK4pCzcqvey71F57PTPsdxtUGmfUnQ7-GR4pTUgmeA%40mail.gmail.com

Committed.

--
Robert Haas
EDB: http://www.enterprisedb.com