Thread: RFC: pg_stat_logmsg

RFC: pg_stat_logmsg

From

Joe Conway

Date:

30 June 2023, 23:57:09

Greetings,

Attached please find a tarball (rather than a patch) for a proposed new 
contrib extension, pg_stat_logmsg.

The basic idea is to mirror how pg_stat_statements works, except the 
logged messages keyed by filename, lineno, and elevel are saved with a 
aggregate count. The format string is displayed (similar to a query 
jumble) for context, along with function name and sqlerrcode.

I threw this together rather quickly over the past couple of days 
between meetings, so not claiming that it is committable (and lacks 
documentation and regression tests as well), but I would love to get 
feedback on:

1/ the general concept
2/ the pg_stat_statement-like implementation
3/ contrib vs core vs external project

Some samples and data:

`make installcheck` with the extension loaded:
8<------------------
# All 215 tests passed.


real    2m24.854s
user    0m0.086s
sys     0m0.283s
8<------------------

`make installcheck` without the extension loaded:
8<------------------

# All 215 tests passed.

real    2m26.765s
user    0m0.076s
sys     0m0.293s
8<------------------

Sample output after running make installcheck a couple times (plus a few 
manually generated ERRORs):

8<------------------
test=# select sum(count) from pg_stat_logmsg where elevel > 20;
   sum
-------
  10554
(1 row)

test=# \x
Expanded display is on.
test=# select * from pg_stat_logmsg where elevel > 20 order by count desc;
-[ RECORD 1 ]-------------------------------
filename   | aclchk.c
lineno     | 2811
elevel     | 21
funcname   | aclcheck_error
sqlerrcode | 42501
message    | permission denied for schema %s
count      | 578
-[ RECORD 2 ]-------------------------------
filename   | scan.l
lineno     | 1241
elevel     | 21
funcname   | scanner_yyerror
sqlerrcode | 42601
message    | %s at or near "%s"
count      | 265
...

test=# select * from pg_stat_logmsg where elevel > 20 and sqlerrcode = 
'XX000';
-[ RECORD 1 ]---------------------------------------
filename   | tid.c
lineno     | 352
elevel     | 21
funcname   | currtid_for_view
sqlerrcode | XX000
message    | ctid isn't of type TID
count      | 2
-[ RECORD 2 ]---------------------------------------
filename   | pg_locale.c
lineno     | 2493
elevel     | 21
funcname   | pg_ucol_open
sqlerrcode | XX000
message    | could not open collator for locale "%s": %s
count      | 2
...

8<------------------

Part of the thinking is that people with fleets of postgres instances 
can use this to scan for various errors that they care about. 
Additionally it would be useful to look for "should not happen" errors.

I will register this in the July CF and will appreciate feedback.

Thanks!

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

pg_stat_logmsg-000.tgz

Re: RFC: pg_stat_logmsg

From

Pavel Stehule

Date:

01 July 2023, 03:20:08

so 1. 7. 2023 v 1:57 odesílatel Joe Conway <mail@joeconway.com> napsal:

Greetings,

Attached please find a tarball (rather than a patch) for a proposed new
contrib extension, pg_stat_logmsg.

The basic idea is to mirror how pg_stat_statements works, except the
logged messages keyed by filename, lineno, and elevel are saved with a
aggregate count. The format string is displayed (similar to a query
jumble) for context, along with function name and sqlerrcode.

I threw this together rather quickly over the past couple of days
between meetings, so not claiming that it is committable (and lacks
documentation and regression tests as well), but I would love to get
feedback on:

1/ the general concept
2/ the pg_stat_statement-like implementation
3/ contrib vs core vs external project

Some samples and data:

`make installcheck` with the extension loaded:
8<------------------
# All 215 tests passed.

real 2m24.854s
user 0m0.086s
sys 0m0.283s
8<------------------

`make installcheck` without the extension loaded:
8<------------------

# All 215 tests passed.

real 2m26.765s
user 0m0.076s
sys 0m0.293s
8<------------------

Sample output after running make installcheck a couple times (plus a few
manually generated ERRORs):

8<------------------
test=# select sum(count) from pg_stat_logmsg where elevel > 20;
sum
-------
10554
(1 row)

test=# \x
Expanded display is on.
test=# select * from pg_stat_logmsg where elevel > 20 order by count desc;
-[ RECORD 1 ]-------------------------------
filename | aclchk.c
lineno | 2811
elevel | 21
funcname | aclcheck_error
sqlerrcode | 42501
message | permission denied for schema %s
count | 578
-[ RECORD 2 ]-------------------------------
filename | scan.l
lineno | 1241
elevel | 21
funcname | scanner_yyerror
sqlerrcode | 42601
message | %s at or near "%s"
count | 265
...

test=# select * from pg_stat_logmsg where elevel > 20 and sqlerrcode =
'XX000';
-[ RECORD 1 ]---------------------------------------
filename | tid.c
lineno | 352
elevel | 21
funcname | currtid_for_view
sqlerrcode | XX000
message | ctid isn't of type TID
count | 2
-[ RECORD 2 ]---------------------------------------
filename | pg_locale.c
lineno | 2493
elevel | 21
funcname | pg_ucol_open
sqlerrcode | XX000
message | could not open collator for locale "%s": %s
count | 2
...

8<------------------

Part of the thinking is that people with fleets of postgres instances
can use this to scan for various errors that they care about.
Additionally it would be useful to look for "should not happen" errors.

I will register this in the July CF and will appreciate feedback.

This can be a very interesting feature. I like it.

Regards

Pavel

Thanks!

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: RFC: pg_stat_logmsg

From

Joe Conway

Date:

01 July 2023, 19:52:50

On 6/30/23 23:20, Pavel Stehule wrote:
> so 1. 7. 2023 v 1:57 odesílatel Joe Conway <mail@joeconway.com 
> <mailto:mail@joeconway.com>> napsal:
>     Part of the thinking is that people with fleets of postgres instances
>     can use this to scan for various errors that they care about.
>     Additionally it would be useful to look for "should not happen" errors.
> 
>     I will register this in the July CF and will appreciate feedback.
> 
> This can be a very interesting feature. I like it.

Thanks!

FWIW, I just modified it to provide the localized text of the elevel 
rather than the internal number. I also localized the message format string:

8<------------------------------
psql (16beta2)
Type "help" for help.

test=# \x
Expanded display is on.
test=# select * from pg_stat_logmsg where elevel = 'ERROR' and 
sqlerrcode = 'XX000' and count > 1;
-[ RECORD 1 ]---------------------------------------------
filename   | tablecmds.c
lineno     | 10908
elevel     | ERROR
funcname   | ATExecAlterConstraint
sqlerrcode | XX000
message    | cannot alter constraint "%s" on relation "%s"
count      | 2
-[ RECORD 2 ]---------------------------------------------
filename   | user.c
lineno     | 2130
elevel     | ERROR
funcname   | check_role_membership_authorization
sqlerrcode | XX000
message    | role "%s" cannot have explicit members
count      | 2

test=# set lc_messages ='sv_SE.UTF8';
SET
test=# select * from pg_stat_logmsg where elevel = 'FEL' and sqlerrcode 
= 'XX000' and count > 1;
-[ RECORD 1 ]---------------------------------------------
filename   | tablecmds.c
lineno     | 10908
elevel     | FEL
funcname   | ATExecAlterConstraint
sqlerrcode | XX000
message    | kan inte ändra villkoret "%s" i relation "%s"
count      | 2
-[ RECORD 2 ]---------------------------------------------
filename   | user.c
lineno     | 2130
elevel     | FEL
funcname   | check_role_membership_authorization
sqlerrcode | XX000
message    | rollen "%s" kan inte ha explicita medlemmar
count      | 2
8<------------------------------

New tarball attached.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

pg_stat_logmsg-001.tgz

Re: RFC: pg_stat_logmsg

From

Masahiko Sawada

Date:

06 July 2023, 07:36:13

Hi,

On Sat, Jul 1, 2023 at 8:57 AM Joe Conway <mail@joeconway.com> wrote:
>
> Greetings,
>
> Attached please find a tarball (rather than a patch) for a proposed new
> contrib extension, pg_stat_logmsg.
>
> The basic idea is to mirror how pg_stat_statements works, except the
> logged messages keyed by filename, lineno, and elevel are saved with a
> aggregate count. The format string is displayed (similar to a query
> jumble) for context, along with function name and sqlerrcode.
>
>
> Part of the thinking is that people with fleets of postgres instances
> can use this to scan for various errors that they care about.
> Additionally it would be useful to look for "should not happen" errors.

Interesting idea and use cases.

I'm concerned that displaying the format string could not be helpful
in some cases. For example, when raising an ERROR while reading WAL
records, we typically write the error message stored in
record->errormsg_buf:

in ReadRecord():
            if (errormsg)
                ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
                        (errmsg_internal("%s", errormsg) /* already
translated */ ));

In this case, the error message stored in pg_stat_logmsg is just '%s'.
The filename and line number columns might help identify the actual
error but it requires users to read the source code and cannot know
the actual error message.

A similar case is where we construct the error message on the fly. For
example, in LogRecoveryConflict() the string of the recovery conflict
description comes from get_recovery_conflict_desc():

in LogRecoveryConflict():
        ereport(LOG,
                errmsg("recovery still waiting after %ld.%03d ms: %s",
                       msecs, usecs, get_recovery_conflict_desc(reason)),
                nprocs > 0 ? errdetail_log_plural("Conflicting process: %s.",
                                                  "Conflicting processes: %s.",
                                                  nprocs, buf.data) : 0);

The user might want to search the error message by the actual conflict
reason, but cannot. In this case, I'd like to see the actual error
message (I'd like to normalize the number part, though).

That being said, using the format string for the error messages like
"ERROR:  relation "nonexist_table" does not exist" makes sense to me
since we can avoid having too many similar entries.

So I feel that we might need to figure out what part of the log
message should be normalized like pg_stat_statement does for query
strings.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: RFC: pg_stat_logmsg

From

Gurjeet Singh

Date:

07 July 2023, 05:38:33

On Thu, Jul 6, 2023 at 12:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, Jul 1, 2023 at 8:57 AM Joe Conway <mail@joeconway.com> wrote:
> >
> > The basic idea is to mirror how pg_stat_statements works, except the
> > logged messages keyed by filename, lineno, and elevel are saved with a
> > aggregate count. The format string is displayed (similar to a query
> > jumble) for context, along with function name and sqlerrcode.
> >
> >
> > Part of the thinking is that people with fleets of postgres instances
> > can use this to scan for various errors that they care about.
> > Additionally it would be useful to look for "should not happen" errors.

> I'm concerned that displaying the format string could not be helpful
> in some cases. For example, when raising an ERROR while reading WAL
> records, we typically write the error message stored in
> record->errormsg_buf:
>
> in ReadRecord():
>             if (errormsg)
>                 ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
>                         (errmsg_internal("%s", errormsg) /* already
> translated */ ));
>
> In this case, the error message stored in pg_stat_logmsg is just '%s'.
> The filename and line number columns might help identify the actual
> error but it requires users to read the source code and cannot know
> the actual error message.

I believe that the number of such error messages, the ones with very
little descriptive content, is very low in Postgres code. These kinds
of messages are not prevalent, and must be used when code hits obscure
conditions, like seen in your example above. These are the kinds of
errors that Joe is referring to as "should not happen". In these
cases, even if the error message was descriptive, the user will very
likely have to dive deep into code to find out the real cause.

I feel that the unique combination of file name and line number is
useful information, even in cases where the format string not very
descriptive. So I believe the extension's behaviour in this regard is
acceptable.

In cases where the extension's output is not descriptive enough, the
user can use the filename:lineno pair to look for fully formed error
messages in the actual log files; they may have to make appropriate
changes to log_* parameters, though.

If we still feel strongly about getting the actual message for these
cases, perhaps we can develop a heuristic to identify such messages,
and use either full or a prefix of the 'message' field, instead of
'message_id' field. The heuristic could be: strlen(message_id) is too
short, or that message_id is all/overwhelmingly format specifiers
(e.g. '%s: %d').

The core benefit of this extension is to make it easy for the user to
discover error messages affecting their workload. The user may be
interested in knowing the most common log messages in their server
logs, so that they can work on reducing those errors or warnings. Or
they may be interested in knowing when their workload hits
unexpected/serious error messages, even if they're infrequent. The
data exposed by pg_stat_logmsg view would serve as a starting point,
but I'm guessing it would not be sufficient for their investigation.

On Fri, Jun 30, 2023 at 4:57 PM Joe Conway <mail@joeconway.com> wrote:
> I would love to get
> feedback on:
>
> 1/ the general concept
> 2/ the pg_stat_statement-like implementation
> 3/ contrib vs core vs external project

+1 for the idea; a monitoring system trained at this view can bubble
up problems to users that may otherwise go unnoticed. Piggybacking on,
or using pg_stat_statments implementation as a model seems fine. I
believe making this available as a contrib module hits the right
balance; not all installations may want this, hence in-core, always
collecting data, seems undesirable; if it's an external project, it
won't be discoverable, I see that as a very high bar for someone to
trust it and begin to use it.

Best regards,
Gurjeet
http://Gurje.et

Re: RFC: pg_stat_logmsg

From

Joe Conway

Date:

09 July 2023, 18:13:09

On 7/7/23 01:38, Gurjeet Singh wrote:
> On Thu, Jul 6, 2023 at 12:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Sat, Jul 1, 2023 at 8:57 AM Joe Conway <mail@joeconway.com> wrote:
>> >
>> > The basic idea is to mirror how pg_stat_statements works, except the
>> > logged messages keyed by filename, lineno, and elevel are saved with a
>> > aggregate count. The format string is displayed (similar to a query
>> > jumble) for context, along with function name and sqlerrcode.
>> >
>> >
>> > Part of the thinking is that people with fleets of postgres instances
>> > can use this to scan for various errors that they care about.
>> > Additionally it would be useful to look for "should not happen" errors.
> 
>> I'm concerned that displaying the format string could not be helpful
>> in some cases. For example, when raising an ERROR while reading WAL
>> records, we typically write the error message stored in
>> record->errormsg_buf:
>>
>> in ReadRecord():
>>             if (errormsg)
>>                 ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
>>                         (errmsg_internal("%s", errormsg) /* already
>> translated */ ));
>>
>> In this case, the error message stored in pg_stat_logmsg is just '%s'.
>> The filename and line number columns might help identify the actual
>> error but it requires users to read the source code and cannot know
>> the actual error message.
> 
> I believe that the number of such error messages, the ones with very
> little descriptive content, is very low in Postgres code. These kinds
> of messages are not prevalent, and must be used when code hits obscure
> conditions, like seen in your example above. These are the kinds of
> errors that Joe is referring to as "should not happen". In these
> cases, even if the error message was descriptive, the user will very
> likely have to dive deep into code to find out the real cause.

Agreed. As an example, after running `make installcheck`

8<-----------------
select sum(count) from pg_stat_logmsg;
  sum
------
  6005
(1 row)

select message,sum(count)
from pg_stat_logmsg
where length(message) < 30
   and elevel in ('ERROR','FATAL','PANIC')
   and message like '%\%s%' escape '\'
group by message
order by length(message);
             message            | sum
-------------------------------+-----
  %s                            | 107
  "%s" is a view                |   9
  "%s" is a table               |   4
  %s is a procedure             |   1
  invalid size: "%s"            |  13
  %s at or near "%s"            | 131
  %s at end of input            |   3
...
8<-----------------

I only see three message formats there that are generic enough to be of 
concern (the first and last two shown -- FWIW I did not see any more of 
them as the fmt string gets longer)

So out of 6005 log events, 241 fit this concern.

Perhaps given the small number of message format strings affected, it 
would make sense to special case those, but I am not sure it is worth 
the effort, at least for version 1.

> I feel that the unique combination of file name and line number is
> useful information, even in cases where the format string not very
> descriptive. So I believe the extension's behaviour in this regard is
> acceptable.
> 
> In cases where the extension's output is not descriptive enough, the
> user can use the filename:lineno pair to look for fully formed error
> messages in the actual log files; they may have to make appropriate
> changes to log_* parameters, though.

Right

> If we still feel strongly about getting the actual message for these
> cases, perhaps we can develop a heuristic to identify such messages,
> and use either full or a prefix of the 'message' field, instead of
> 'message_id' field. The heuristic could be: strlen(message_id) is too
> short, or that message_id is all/overwhelmingly format specifiers
> (e.g. '%s: %d').

Based on the above analysis (though granted, not all inclusive), it 
seems like just special casing the specific message format strings of 
interest would work.

> The core benefit of this extension is to make it easy for the user to
> discover error messages affecting their workload. The user may be
> interested in knowing the most common log messages in their server
> logs, so that they can work on reducing those errors or warnings. Or
> they may be interested in knowing when their workload hits
> unexpected/serious error messages, even if they're infrequent. The
> data exposed by pg_stat_logmsg view would serve as a starting point,
> but I'm guessing it would not be sufficient for their investigation.

Yes, exactly.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: RFC: pg_stat_logmsg

From

Joe Conway

Date:

13 September 2023, 19:30:45

On 7/9/23 14:13, Joe Conway wrote:
> On 7/7/23 01:38, Gurjeet Singh wrote:
>>> In this case, the error message stored in pg_stat_logmsg is just '%s'.
>>> The filename and line number columns might help identify the actual
>>> error but it requires users to read the source code and cannot know
>>> the actual error message.
>> 
>> I believe that the number of such error messages, the ones with very
>> little descriptive content, is very low in Postgres code. These kinds
>> of messages are not prevalent, and must be used when code hits obscure
>> conditions, like seen in your example above. These are the kinds of
>> errors that Joe is referring to as "should not happen". In these
>> cases, even if the error message was descriptive, the user will very
>> likely have to dive deep into code to find out the real cause.
> 
> Agreed. As an example, after running `make installcheck`
> 
> 8<-----------------
> select sum(count) from pg_stat_logmsg;
>    sum
> ------
>    6005
> (1 row)
> 
> select message,sum(count)
> from pg_stat_logmsg
> where length(message) < 30
>     and elevel in ('ERROR','FATAL','PANIC')
>     and message like '%\%s%' escape '\'
> group by message
> order by length(message);
>               message            | sum
> -------------------------------+-----
>    %s                            | 107
>    "%s" is a view                |   9
>    "%s" is a table               |   4
>    %s is a procedure             |   1
>    invalid size: "%s"            |  13
>    %s at or near "%s"            | 131
>    %s at end of input            |   3
> ...
> 8<-----------------
> 
> I only see three message formats there that are generic enough to be of
> concern (the first and last two shown -- FWIW I did not see any more of
> them as the fmt string gets longer)
> 
> So out of 6005 log events, 241 fit this concern.
> 
> Perhaps given the small number of message format strings affected, it
> would make sense to special case those, but I am not sure it is worth
> the effort, at least for version 1.

Attached is an update, this time as a patch against 17devel. It is not 
quite complete, but getting close IMHO.

Changes:
--------
1. Now is integrated into contrib as a "Additional Supplied Extension"

2. Documentation added

3. I had a verbal conversation with Andres and he reminded me that the 
original idea for this was to collect data across fleets of pg hosts so 
we could look for how often "should never happen" errors actually 
happen. As such, we need to think in terms of aggregating the info from 
many hosts, potentially with many localized languages for the messages. 
So I converted the "message" column back to an untranslated string, and 
added a "translated_message" column which is localized.

An alternative approach might be to provide a separate function that 
accepts the message string and returns the translation. Thoughts on that?

4. In the same vein, I added a pgversion column since the filename and 
line number for the same log message could vary across major or even 
minor releases.

Not done:
---------
1. The main thing lacking at this point is a regression test.

2. No special casing for message == "%s". I am still not convinced it is 
worthwhile to do so.

Comments gratefully welcomed.

Thanks,

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

20230912-000.diff

Re: RFC: pg_stat_logmsg

From

Tomas Vondra

Date:

16 July 2024, 22:14:36

Hi,

I noticed this patch hasn't moved since September 2023, so I wonder
what's the main blocker / what is needed to move this?

As for the feature, I've never done a fleet-wide analysis, so if this is
one of the main use cases, I'm not really sure I can judge if this is a
good tool for that. It seems like it might be a convenient way to do
that, but does that require we add the module to contrib?

As for the code, I wonder if the instability of line numbers could be a
problem - these can change (a little bit) between minor releases, so
after an upgrade we'll read the dump file with line numbers from the old
release, and then start adding entries with new line numbers. Do we need
to handle this in some way?

This might be partially solved by eviction of entries from the old
release - we apply decay, so after a while their usage will be 0. But
what if there's no pressure for space, we'll not actually evict them.
And it'll be confusing to have a mix of old/new line numbers.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: RFC: pg_stat_logmsg

From

Michael Paquier

Date:

16 July 2024, 23:08:48

On Wed, Jul 17, 2024 at 12:14:36AM +0200, Tomas Vondra wrote:
> I noticed this patch hasn't moved since September 2023, so I wonder
> what's the main blocker / what is needed to move this?

+ /* Location of permanent stats file (valid when database is shut down) */
+ #define PGLM_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY
"/pg_stat_logmsg.stat

Perhaps this does not count as a valid reason, but does it really make
sense to implement things this way, knowing that this could be changed
to rely on a potential pluggable pgstats?  I mean this one I've
proposed:
https://www.postgresql.org/message-id/Zmqm9j5EO0I4W8dx%40paquier.xyz

One potential implementation is to stick that to be fixed-numbered,
because there is a maximum cap to the number of entries proposed by
the patch, while keeping the whole in memory.

+ logmsg_store(ErrorData *edata, Size *logmsg_offset,
+              int *logmsg_len, int *gc_count)

The patch shares a lot of perks with pg_stat_statements that don't
scale well.  I'm wondering if it is a good idea to duplicate these
properties in a second, different, module, like the external file can
could be written out on a periodic basis depending on the workload.
I am not saying that the other thread is a magic solution for
everything (not looked yet at how this would plug with the cap in
entries that pg_stat_statements wants), just one option on the table.

> As for the code, I wonder if the instability of line numbers could be a
> problem - these can change (a little bit) between minor releases, so
> after an upgrade we'll read the dump file with line numbers from the old
> release, and then start adding entries with new line numbers. Do we need
> to handle this in some way?

Indeed.  Perhaps a PostgreSQL version number assigned to each entry to
know from which binary an entry comes from?  This would cost a couple
of extra bytes for each entry still that would be the best information
possible to match that with a correct code tree.  If it comes to that,
even getting down to a commit SHA1 could be useful to provide the
lowest level of granularity.  Another thing would be to give up on the
line number, stick to the uniqueness in the stats entries with the
errcode and the file name, but that won't help for things like
tablecmds.c.

> This might be partially solved by eviction of entries from the old
> release - we apply decay, so after a while their usage will be 0. But
> what if there's no pressure for space, we'll not actually evict them.
> And it'll be confusing to have a mix of old/new line numbers.

Once we know that these stats not going to be relevant anymore as of a
minor upgrade flow, resetting them could be the move that makes the
most sense, leaving the reset to the provider doing the upgrades,
while taking a snapshot of the past data before the reset?  I find the
whole problem tricky to define, TBH.
--
Michael

Attachment

signature.asc

Re: RFC: pg_stat_logmsg

From

Joe Conway

Date:

17 July 2024, 11:43:13

On 7/16/24 18:14, Tomas Vondra wrote:
> I noticed this patch hasn't moved since September 2023, so I wonder
> what's the main blocker / what is needed to move this?

Mainly me finding time I'm afraid.

> As for the feature, I've never done a fleet-wide analysis, so if this is
> one of the main use cases, I'm not really sure I can judge if this is a
> good tool for that. It seems like it might be a convenient way to do
> that, but does that require we add the module to contrib?

I had an offlist chat with Andres about this IIRC and he suggested he 
thought it ought to be just built in to the backend as part of the 
statistics subsystem. Lately though I have been toying with the idea of 
keeping it as an extension and basing it off Michael Paquier's work for 
Pluggable cumulative statistics.

> As for the code, I wonder if the instability of line numbers could be a
> problem - these can change (a little bit) between minor releases, so
> after an upgrade we'll read the dump file with line numbers from the old
> release, and then start adding entries with new line numbers. Do we need
> to handle this in some way?

Hmm, yeah, I had been planning to include postgres version as part of 
the output, but maybe it would need to be part of the key.

> This might be partially solved by eviction of entries from the old
> release - we apply decay, so after a while their usage will be 0. But
> what if there's no pressure for space, we'll not actually evict them.
> And it'll be confusing to have a mix of old/new line numbers.

Agreed.

I am going to try hard to get back to this sooner rather than later, but 
realistically that might be in time for the September commitfest rather 
than during this one.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: RFC: pg_stat_logmsg

From

Joe Conway

Date:

17 July 2024, 11:48:15

On 7/16/24 19:08, Michael Paquier wrote:
> On Wed, Jul 17, 2024 at 12:14:36AM +0200, Tomas Vondra wrote:
>> I noticed this patch hasn't moved since September 2023, so I wonder
>> what's the main blocker / what is needed to move this?
> 
> + /* Location of permanent stats file (valid when database is shut down) */
> + #define PGLM_DUMP_FILE    PGSTAT_STAT_PERMANENT_DIRECTORY
> "/pg_stat_logmsg.stat
> 
> Perhaps this does not count as a valid reason, but does it really make
> sense to implement things this way, knowing that this could be changed
> to rely on a potential pluggable pgstats?  I mean this one I've
> proposed:
> https://www.postgresql.org/message-id/Zmqm9j5EO0I4W8dx%40paquier.xyz

Yep, see my adjacent reply to Tomas.

> One potential implementation is to stick that to be fixed-numbered,
> because there is a maximum cap to the number of entries proposed by
> the patch, while keeping the whole in memory.
> 
> + logmsg_store(ErrorData *edata, Size *logmsg_offset,
> +              int *logmsg_len, int *gc_count)
> 
> The patch shares a lot of perks with pg_stat_statements that don't
> scale well.  I'm wondering if it is a good idea to duplicate these
> properties in a second, different, module, like the external file can
> could be written out on a periodic basis depending on the workload.
> I am not saying that the other thread is a magic solution for
> everything (not looked yet at how this would plug with the cap in
> entries that pg_stat_statements wants), just one option on the table.
> 
>> As for the code, I wonder if the instability of line numbers could be a
>> problem - these can change (a little bit) between minor releases, so
>> after an upgrade we'll read the dump file with line numbers from the old
>> release, and then start adding entries with new line numbers. Do we need
>> to handle this in some way?
> 
> Indeed.  Perhaps a PostgreSQL version number assigned to each entry to
> know from which binary an entry comes from?  This would cost a couple
> of extra bytes for each entry still that would be the best information
> possible to match that with a correct code tree.  If it comes to that,
> even getting down to a commit SHA1 could be useful to provide the
> lowest level of granularity.  Another thing would be to give up on the
> line number, stick to the uniqueness in the stats entries with the
> errcode and the file name, but that won't help for things like
> tablecmds.c.

I think including version in the key makes most sense. Also do we even 
have a mechanism to grab the commit sha in running code?


-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: RFC: pg_stat_logmsg

From

Michael Paquier

Date:

18 July 2024, 04:26:29

On Wed, Jul 17, 2024 at 07:48:15AM -0400, Joe Conway wrote:
> I think including version in the key makes most sense. Also do we even have
> a mechanism to grab the commit sha in running code?

Not directly, still that's doable.

The closest thing I would consider here is to get the output of
something like 'git rev-parse --short HEAD` and attach it to
PG_VERSION with --with-extra-version.  I do that in my local builds
because I always want to know from which commit I am building
something.  Then, PG_VERSION could be stored with the entries while
hashing the stats key with the version string, the error code, the
source file name and/or the line number for uniqueness.  32 bytes of
room would be most likely enough when it comes to the PG_VERSION data
stored in the stats entries.
--
Michael

Attachment

signature.asc

Re: RFC: pg_stat_logmsg

From

Michael Paquier

Date:

18 July 2024, 04:32:48

On Wed, Jul 17, 2024 at 07:43:13AM -0400, Joe Conway wrote:
> On 7/16/24 18:14, Tomas Vondra wrote:
>> As for the feature, I've never done a fleet-wide analysis, so if this is
>> one of the main use cases, I'm not really sure I can judge if this is a
>> good tool for that. It seems like it might be a convenient way to do
>> that, but does that require we add the module to contrib?
>
> I had an offlist chat with Andres about this IIRC and he suggested he
> thought it ought to be just built in to the backend as part of the
> statistics subsystem. Lately though I have been toying with the idea of
> keeping it as an extension and basing it off Michael Paquier's work for
> Pluggable cumulative statistics.

This may live better as a contrib/ module, serving as well as an extra
template for what can be done with the pluggable stats.  Adding that
in core is of course OK for me if that's the consensus.  The APIs for
pluggable stats are really the same as what you would store in core,
minus the system functions you'd want to add in the catalog .dat
files, of course.

I'd like to get it this part done by the end of this commit fest to
have room with pg_stat_statements for this release, but well, we'll
see.  As far as I can see everybody who commented on the thread seems
kind of OK with the idea to fix the stats kinds IDs in time, like
custom RMGRs.  That's just simpler implementation-wise, but I'm also
looking for more opinions.

> Hmm, yeah, I had been planning to include postgres version as part of the
> output, but maybe it would need to be part of the key.

Seems to me that you should do both, then: add PG_VERSION to the
entries, and hash the keys with it for uniqueness.  You could also
have a reset function that performs a removal of the stats for
anything else than the current PG_VERSION, for example.
--
Michael

Attachment

signature.asc