Thread: Operation log for major operations

Operation log for major operations

From

Dmitry Koval

Date:

21 November 2022, 11:41:22

Hi, hackers!

It is important for customer support to know what system operations 
(pg_resetwal, pg_rewind, pg_upgrade, ...) have been executed on the 
database.  A variant of implementation of the log for system operations 
(operation log) is attached to this email.

Introduction.
-------------
Operation log is designed to store information about critical system 
events (like pg_upgrade, pg_resetwal, pg_resetwal, etc.).
This information is not interesting to the ordinary user, but it is very 
important for the vendor's technical support.
An example: client complains about DB damage to technical support 
(damage was caused by pg_resetwal which was "silent" executed by one of 
administrators).

Concepts.
---------
* operation log is placed in the file 'global/pg_control', starting from 
position 4097 (log size is 4kB);
* user can not modify the operation log;  log can be changed  by 
function call only (from postgres or from postgres utilities);
* operation log is a ring buffer (with CRC-32 protection), deleting 
entries from the operation log is possible only when the buffer is 
overflowed;
* SQL-function is used to read data of operation log.

Example of operation log data.
------------------------------

 >select * from pg_operation_log();
    event    |edition|version|   lsn   |         last         |count
------------+-------+-------+---------+----------------------+------
  startup    |vanilla|10.22.0|0/8000028|2022-11-18 23:06:27+03|    1
  pg_upgrade |vanilla|15.0.0 |0/8000028|2022-11-18 23:06:27+03|    1
  startup    |vanilla|15.0.0 |0/80001F8|2022-11-18 23:11:53+03|    3
  pg_resetwal|vanilla|15.0.0 |0/80001F8|2022-11-18 23:09:53+03|    2
(4 rows)

Some details about inserting data to operation log.
---------------------------------------------------
There are two modes of inserting information about events in the 
operation log.

* MERGE mode (events "startup", "pg_resetwal", "pg_rewind").
We searches in ring buffer of operation log an event with the same type 
("startup" for example) with the same version number.
If event was found, we will increment event counter by 1 and update the 
date/time of event ("last" field) with the current value.
If event was not found, we will add this event to the ring buffer (see 
INSERT mode).
* INSERT mode (events "bootstrap", "pg_upgrade", "promoted").
We will add an event to the ring buffer (without searching).


P.S. File 'global/pg_control' was chosen as operation log storage 
because data of this file cannot be removed or modified in a simple way 
and no need to change any extensions and utilities to support this file.

I attached the patch (v1-0001-Operation-log.patch) and extended 
description of operation log (Operation-log.txt).


With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com

Attachment

Re: Operation log for major operations

From

Justin Pryzby

Date:

21 November 2022, 14:03:15

See also prior discussions:
https://www.postgresql.org/message-id/flat/62750df5b126e1d8ee039a79ef3cc64ac3d47cd5.camel%40j-davis.com
https://www.postgresql.org/message-id/flat/20180228214311.jclah37cwh572t2c%40alap3.anarazel.de

-- 
Justin

Re: Operation log for major operations

From

Dmitry Koval

Date:

21 November 2022, 16:55:36

Thanks for references, Justin!

Couple comments about these references.

1) "Make unlogged table resets detectable".
https://www.postgresql.org/message-id/flat/62750df5b126e1d8ee039a79ef3cc64ac3d47cd5.camel%40j-davis.com

This conversation is about specific problem (unlogged table repairing). 
Operation log has another use - it is primary a helper for diagnostics.

2) "RFC: Add 'taint' field to pg_control."
https://www.postgresql.org/message-id/flat/20180228214311.jclah37cwh572t2c%40alap3.anarazel.de

This is almost the same problem that we want to solve with operation 
log. Differences between the operation log and what is discussed in the 
thread:
* there suggested to store operation log in pg_control file - but 
separate from pg_control main data (and write data separately too);
* operation log data can be represented in relational form (not flags), 
this is more usable for RDBMS;
* number of registered event types can be increased easy (without 
changes of storage).

-- 
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com

Re: Operation log for major operations

From

Alvaro Herrera

Date:

23 November 2022, 12:00:21

On 2022-Nov-21, Dmitry Koval wrote:

> Concepts.
> ---------
> * operation log is placed in the file 'global/pg_control', starting from
> position 4097 (log size is 4kB);

I think storing this in pg_control is a bad idea.  That file is
extremely critical and if you break it, you're pretty much SOL on
recovering your data.  I suggest that this should use a separate file.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"All rings of power are equal,
But some rings of power are more equal than others."
                                 (George Orwell's The Lord of the Rings)

Re: Operation log for major operations

From

Dmitry Koval

Date:

05 December 2022, 11:11:51

Hi!

 >I think storing this in pg_control is a bad idea.  That file is
 >extremely critical and if you break it, you're pretty much SOL on
 >recovering your data.  I suggest that this should use a separate file.

Thanks. Operation log location changed to 'global/pg_control_log' (if 
the name is not pretty, it can be changed).

I attached the patch (v2-0001-Operation-log.patch) and description of 
operation log (Operation-log.txt).


With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com

Attachment

Re: Operation log for major operations

From

vignesh C

Date:

14 January 2023, 09:53:55

On Mon, 5 Dec 2022 at 13:42, Dmitry Koval <d.koval@postgrespro.ru> wrote:
>
> Hi!
>
>  >I think storing this in pg_control is a bad idea.  That file is
>  >extremely critical and if you break it, you're pretty much SOL on
>  >recovering your data.  I suggest that this should use a separate file.
>
> Thanks. Operation log location changed to 'global/pg_control_log' (if
> the name is not pretty, it can be changed).
>
> I attached the patch (v2-0001-Operation-log.patch) and description of
> operation log (Operation-log.txt).

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
ff23b592ad6621563d3128b26860bcb41daf9542 ===
=== applying patch ./v2-0001-Operation-log.patch
....
patching file src/tools/msvc/Mkvcbuild.pm
Hunk #1 FAILED at 134.
1 out of 1 hunk FAILED -- saving rejects to file src/tools/msvc/Mkvcbuild.pm.rej

[1] - http://cfbot.cputube.org/patch_41_4018.log

Regards,
Vignesh

Re: Operation log for major operations

From

Dmitry Koval

Date:

14 January 2023, 13:17:07

Hi!

 >The patch does not apply on top of HEAD ...

Here is a fixed version.
Small additional fixes:
1) added CRC calculation for empty 'pg_control_log' file;
2) added saving 'errno' before calling LWLockRelease and restoring after 
that;
3) corrected pg_upgrade for case old cluster does not have 
'pg_control_log' file.

-- 
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com

Attachment

v3-0001-Operation-log.patch

Re: Operation log for major operations

From

vignesh C

Date:

19 January 2023, 14:22:58

On Sat, 14 Jan 2023 at 15:47, Dmitry Koval <d.koval@postgrespro.ru> wrote:
>
> Hi!
>
>  >The patch does not apply on top of HEAD ...
>
> Here is a fixed version.
> Small additional fixes:
> 1) added CRC calculation for empty 'pg_control_log' file;
> 2) added saving 'errno' before calling LWLockRelease and restoring after
> that;
> 3) corrected pg_upgrade for case old cluster does not have
> 'pg_control_log' file.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
14bdb3f13de16523609d838b725540af5e23ddd3 ===
=== applying patch ./v3-0001-Operation-log.patch
...
patching file src/tools/msvc/Mkvcbuild.pm
Hunk #1 FAILED at 134.
1 out of 1 hunk FAILED -- saving rejects to file src/tools/msvc/Mkvcbuild.pm.rej

[1] - http://cfbot.cputube.org/patch_41_4018.log

Regards,
Vignesh

Re: Operation log for major operations

From

Dmitry Koval

Date:

20 January 2023, 00:11:58

 >The patch does not apply on top of HEAD ...

Thanks!
Here is a fixed version.

Additional changes:
1) get_operation_log() function doesn't create empty operation log file;
2) removed extra unlink() call.

-- 
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com

Attachment

v4-0001-Operation-log.patch

Re: Operation log for major operations

From

Ted Yu

Date:

20 January 2023, 00:36:24

On Thu, Jan 19, 2023 at 1:12 PM Dmitry Koval <d.koval@postgrespro.ru> wrote:

>The patch does not apply on top of HEAD ...

Thanks!
Here is a fixed version.

Additional changes:
1) get_operation_log() function doesn't create empty operation log file;
2) removed extra unlink() call.

--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com

Hi,

Please update year for the license in pg_controllog.c

+check_file_exists(const char *datadir, const char *path)

There is existing helper function such as:

src/backend/utils/fmgr/dfmgr.c:static bool file_exists(const char *name);

Is it possible to reuse that code ?

Cheers

Re: Operation log for major operations

From

Dmitry Koval

Date:

20 January 2023, 12:19:23

Thanks, Ted Yu!

 > Please update year for the license in pg_controllog.c

License updated for files pg_controllog.c, controllog_utils.c, 
pg_controllog.h, controllog_utils.h.

 > +check_file_exists(const char *datadir, const char *path)
 > There is existing helper function such as:
 > src/backend/utils/fmgr/dfmgr.c:static bool file_exists(const char *name);
 > Is it possible to reuse that code ?

There are a lot of functions that check the file existence:

1) src/backend/utils/fmgr/dfmgr.c:static bool file_exists(const char *name);
2) src/backend/jit/jit.c:static bool file_exists(const char *name);
3) src/test/regress/pg_regress.c:bool file_exists(const char *file);
4) src/bin/pg_upgrade/exec.c:bool pid_lock_file_exists(const char *datadir);
5) src/backend/commands/extension.c:bool extension_file_exists(const 
char *extensionName);

But there is no unified function: different components use their own 
function with their own specific.
Probably we can not reuse code of dfmgr.c:file_exists() because this 
function skip "errno == EACCES" (this error is critical for us).
I copied for src/bin/pg_rewind/file_ops.c:check_file_exists() code of 
function jit.c:file_exists() (with adaptation for the utility).

-- 
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com

Attachment

v5-0001-Operation-log.patch

Re: Operation log for major operations

From

Ted Yu

Date:

20 January 2023, 16:23:16

On Fri, Jan 20, 2023 at 1:19 AM Dmitry Koval <d.koval@postgrespro.ru> wrote:

Thanks, Ted Yu!

> Please update year for the license in pg_controllog.c

License updated for files pg_controllog.c, controllog_utils.c,
pg_controllog.h, controllog_utils.h.

> +check_file_exists(const char *datadir, const char *path)
> There is existing helper function such as:
> src/backend/utils/fmgr/dfmgr.c:static bool file_exists(const char *name);
> Is it possible to reuse that code ?

There are a lot of functions that check the file existence:

1) src/backend/utils/fmgr/dfmgr.c:static bool file_exists(const char *name);
2) src/backend/jit/jit.c:static bool file_exists(const char *name);
3) src/test/regress/pg_regress.c:bool file_exists(const char *file);
4) src/bin/pg_upgrade/exec.c:bool pid_lock_file_exists(const char *datadir);
5) src/backend/commands/extension.c:bool extension_file_exists(const
char *extensionName);

But there is no unified function: different components use their own
function with their own specific.
Probably we can not reuse code of dfmgr.c:file_exists() because this
function skip "errno == EACCES" (this error is critical for us).
I copied for src/bin/pg_rewind/file_ops.c:check_file_exists() code of
function jit.c:file_exists() (with adaptation for the utility).

--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com

Hi,

Maybe another discussion thread can be created for the consolidation of XX_file_exists functions.

Cheers

Re: Operation log for major operations

From

Dmitry Koval

Date:

25 January 2023, 11:35:24

Hi!

> Maybe another discussion thread can be created for the consolidation of 
> XX_file_exists functions.

Usually XX_file_exists functions are simple. They contain single call 
stat() or open() and specific error processing after this call.

Likely the unified function will be too complex to cover all the uses of 
XX_file_exists functions.

-- 
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com

Re: Operation log for major operations

From

"Gregory Stark (as CFM)"

Date:

01 March 2023, 23:59:23

On Thu, 19 Jan 2023 at 16:12, Dmitry Koval <d.koval@postgrespro.ru> wrote:
>
>  >The patch does not apply on top of HEAD ...
>
> Thanks!
> Here is a fixed version.

Sorry to say, but this needs a rebase again... Setting to Waiting on Author...

Are there specific feedback needed to make progress? Once it's rebased
if you think it's ready set it to Ready for Committer or if you still
need feedback then Needs Review -- but it's usually more helpful to do
that with an email expressing what questions you're blocked on.

-- 
Gregory Stark
As Commitfest Manager

Re: Operation log for major operations

From

Dmitry Koval

Date:

02 March 2023, 20:57:43

Hi!

These changes did not interest the community. It was expected (topic is 
very specifiс: vendor's technical support). So no sense to distract 
developers ...

I'll move patch to Withdrawn.

-- 
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com

Re: Operation log for major operations

From

Justin Pryzby

Date:

02 March 2023, 21:37:03

On Thu, Mar 02, 2023 at 08:57:43PM +0300, Dmitry Koval wrote:
> These changes did not interest the community. It was expected (topic is very
> specifiс: vendor's technical support). So no sense to distract developers

Actually, I think there is interest, but it has to be phrased in a
limited sense to go into the control file.

In November, I referenced 2 threads, but I think you misunderstood one
of them.  If you skim the first couple mails, you'll find a discussion
about recording crash information in the control file.  

https://www.postgresql.org/message-id/666c2599a07addea00ae2d0af96192def8441974.camel%40j-davis.com

It's come up several times now, and there seems to be ample support for
adding some limited information.

But a "log" which might exceed a few dozen bytes (now or later), that's
inconsistent with the pre-existing purpose served by pg_control.

-- 
Justin

Re: Operation log for major operations

From

Tom Lane

Date:

02 March 2023, 22:36:43

Justin Pryzby <pryzby@telsasoft.com> writes:
> On Thu, Mar 02, 2023 at 08:57:43PM +0300, Dmitry Koval wrote:
>> These changes did not interest the community. It was expected (topic is very
>> specifiс: vendor's technical support). So no sense to distract developers

> Actually, I think there is interest, but it has to be phrased in a
> limited sense to go into the control file.

> In November, I referenced 2 threads, but I think you misunderstood one
> of them.  If you skim the first couple mails, you'll find a discussion
> about recording crash information in the control file.

> https://www.postgresql.org/message-id/666c2599a07addea00ae2d0af96192def8441974.camel%40j-davis.com

> It's come up several times now, and there seems to be ample support for
> adding some limited information.

> But a "log" which might exceed a few dozen bytes (now or later), that's
> inconsistent with the pre-existing purpose served by pg_control.

I'm pretty dubious about recording *anything* in the control file.
Every time we write to that, we risk the entire database on completing
the write successfully.  I don't want to do that more often than once
per checkpoint.  If you want to put crash info in some less-critical
place, maybe we could talk.

            regards, tom lane

Re: Operation log for major operations

From

Dmitry Koval

Date:

03 March 2023, 00:09:38

I'll try to expand my explanation.
I fully understand and accept the arguments about "limited sense to go 
into the control file" and "about recording *anything* in the control 
file". This is totally correct for vanilla.
But vendors have forks of PostgreSQL with custom features and extensions.
Sometimes (especially at the first releases) these custom components 
have bugs which can causes rare problems in data.
These problems can migrate with using pg_upgrade and "lazy" upgrade of 
pages to higher versions of PostgreSQL fork.

So in error cases "recording crash information" etc. is not the only 
important information.
Very important is history of this database (pg_upgrades, promotions, 
pg_resets, pg_rewinds etc.).
Often these "history" allows you to determine from which version of the 
PostgreSQL fork the error came from and what causes of errors we can 
discard immediately.

This "history" is the information that our technical support wants (and 
reason of this patch), but this information is not needed for vanilla...

Another important condition is that the user should not have easy ways 
to delete information about "history" (about reason to use pg_control 
file as "history" storage, but write into it from position 4kB, 8kB,...).

-- 
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com

Re: Operation log for major operations

From

Kirk Wolak

Date:

03 March 2023, 00:49:43

On Thu, Mar 2, 2023 at 4:09 PM Dmitry Koval <d.koval@postgrespro.ru> wrote:

I'll try to expand my explanation.
I fully understand and accept the arguments about "limited sense to go
into the control file" and "about recording *anything* in the control
file". This is totally correct for vanilla.
But vendors have forks of PostgreSQL with custom features and extensions.
Sometimes (especially at the first releases) these custom components
have bugs which can causes rare problems in data.
These problems can migrate with using pg_upgrade and "lazy" upgrade of
pages to higher versions of PostgreSQL fork.

So in error cases "recording crash information" etc. is not the only
important information.
Very important is history of this database (pg_upgrades, promotions,
pg_resets, pg_rewinds etc.).
Often these "history" allows you to determine from which version of the
PostgreSQL fork the error came from and what causes of errors we can
discard immediately.

This "history" is the information that our technical support wants (and
reason of this patch), but this information is not needed for vanilla...

Another important condition is that the user should not have easy ways
to delete information about "history" (about reason to use pg_control
file as "history" storage, but write into it from position 4kB, 8kB,...).

--
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com

Dmitry, this is a great explanation. Thinking outside the box, it feels like:

We need some kind of semaphore flag that tells us something awkward happened.

When it happened, and a little bit of extra information.

You also make the point that if such things have happened, it would probably be a good idea to NOT

allow pg_upgrade to run. It might even be a reason to constantly bother someone until the issue is repaired.

To that point, this feels like a "postgresql_panic.log" file (within the postgresql files?)... Something that would prevent pg_upgrade,

etc. That everyone recognizes is serious. Especially 3rd party vendors.

I see the need for such a thing. I have to agree with others about questioning the proper place to write this.

Are there other places that make sense, that you could use, especially if knowing it exists means there was a serious issue?

Kirk

Re: Operation log for major operations

From

Dmitry Koval

Date:

14 March 2023, 00:36:14

Kirk, I'm sorry about the long pause in my reply.

 >We need some kind of semaphore flag that tells us something awkward
 >happened. When it happened, and a little bit of extra information.

I agree that we do not have this kind of information.
Additionally, legal events like start of pg_rewind, pg_reset, ...  are 
interesting.


 >You also make the point that if such things have happened, it would
 >probably be a good idea to NOT allow pg_upgrade to run.
 >It might even be a reason to constantly bother someone until
 >the issue is repaired.

I think no reason to forbid the run of pg_upgrade for the user 
(especially in automatic mode).
If we automatically do NOT allow pg_upgrade, what should the user do for 
allow pg_upgrade?
Unfortunately, PostgreSQL does not have the utilities to correct errors 
in the database (in case of errors users uses copies of the DB or 
corrects errors manually).
An ordinary user cannot correct errors on his own ...
So we cannot REQUIRE the user to correct database errors, we can only 
INFORM about them.


 >To that point, this feels like a "postgresql_panic.log" file (within
 >the postgresql files?)...  Something that would prevent pg_upgrade,
 >etc. That everyone recognizes is serious. Especially 3rd party vendors.
 >I see the need for such a thing. I have to agree with others about
 >questioning the proper place to write this.
 >Are there other places that make sense, that you could use, especially
 >if knowing it exists means there was a serious issue?

The location of the operation log (like a "postgresql_panic.log") is not 
easy question.
Our technical support is sure that the large number of failures are 
caused by "human factor" (actions of database administrators).
It is not difficult for database administrators to delete the 
"postgresql_panic.log" file or edit it (for example, replacing it with 
an old version; CRC will not save you from such an action).

Therefore, our technical support decided to place the operation log at 
the end of the pg_control file, at an offset of 8192 bytes (and protect 
this log with CRC).
About writing to the pg_control file what worries Tom Lane: information 
in pg_control is written once at system startup (twice in case of 
"promote").
Also, some utilities write information to the operation log too - 
pg_resetwal, pg_rewind, pg_upgrade (these utilities also modify the 
pg_control file without the operation log).

If you are interested, I can attach the current patch (for info - I 
think it makes no sense to offer this patch at commitfest).

-- 
With best regards,
Dmitry Koval

Postgres Professional: http://postgrespro.com