Re: Design of pg_stat_subscription_workers vs pgstats - Mailing list pgsql-hackers

From David G. Johnston
Subject Re: Design of pg_stat_subscription_workers vs pgstats
Date
Msg-id CAKFQuwYHFkW8fP_a62wk-YBb4o+n9UXG4Ji3E4O9DwZrv0jgQQ@mail.gmail.com
In response to Re: Design of pg_stat_subscription_workers vs pgstats  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Design of pg_stat_subscription_workers vs pgstats  (Amit Kapila <amit.kapila16@gmail.com>)
Re: Design of pg_stat_subscription_workers vs pgstats  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
On Tue, Feb 1, 2022 at 11:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Feb 2, 2022 at 9:41 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Tue, Feb 1, 2022 at 8:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> >
>> > I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
>> > feature to pass error-XID or error-LSN information to the worker
>> > whereas I'm also not sure of the advantages in storing all error
>> > information in a system catalog. Since what we need to do for this
>> > purpose is only error-XID/LSN, we can store only error-XID/LSN in the
>> > catalog? That is, the worker stores error-XID/LSN in the catalog on an
>> > error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
>> > the transaction in question. The worker clears the error-XID/LSN after
>> > successfully applying or skipping the first non-empty transaction.
>> >
>>
>> Where do you propose to store this information?
>
>
> pg_subscription_worker
>
> The error message and context is very important.  Just make sure it is only non-null when the worker state is "syncing failed" (or whatever term we use).
>
>

Sure, but is this the reason you want to store all the error info in
the system catalog? I agree that providing more error info could be
useful and also possibly the previously failed (apply) xacts info as
well but I am not able to see why you want to have that sort of info
in the catalog. I could see storing info like err_lsn/err_xid that can
allow to proceed to apply worker automatically or to slow down the
launch of errored apply worker but not all sort of other error info
(like err_cnt, err_code, err_message, err_time, etc.). I want to know
why you are insisting to make all the error info persistent via the
system catalog?

I look at the catalog and am informed that the worker has stopped because of an error.  I'd rather simply read the error message right then instead of having to go look at the log file.  And if I am going to take an action in order to overcome the error I would have to know what that error is; so the error message is not something I can ignore.  The error is an attribute of system state, and the catalog stores the current state of the (workers) system.

I already explained that the concept of err_cnt is not useful.  The fact that you include it here makes me think you still see all of this as somehow meant to keep track of history.  It is not.  The workers are state machines and "error" is one of the states, with relevant attributes to display to the user, and the system, while in that state.  The state machine reporting does not care about historical states, nor does it report on them.

There is some uncertainty if we continue with the automatic re-launch.  Now that I write this, I can see that what you call err_cnt is effectively a count of how many times the worker re-launched without the underlying problem being resolved and thus encountered the same error.  If we persist with the re-launch behavior then maybe err_cnt should be left in place, with its description basically being the ah-ha! comment I just made.  In a world where we do not routinely re-launch and re-try without being informed that something has changed, such a count remains of minimal value.
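
To make the state-machine view concrete, here is a minimal sketch in Python.  It is illustrative only and is not PostgreSQL code: the names WorkerState, WorkerStatus, report_error and report_success are hypothetical, and counting the first occurrence of an error as 1 is an assumption.  It shows error attributes that are non-null only while the worker is in the error state, and an err_cnt that counts repeated encounters of the same failing transaction rather than recording history.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class WorkerState(Enum):
    RUNNING = "running"
    ERROR = "error"


@dataclass
class WorkerStatus:
    """Current state of one worker; no history is kept (hypothetical model)."""
    state: WorkerState = WorkerState.RUNNING
    err_xid: Optional[int] = None
    err_message: Optional[str] = None
    err_cnt: int = 0  # assumption: times the same failing xid has been hit

    def report_error(self, xid: int, message: str) -> None:
        # Hitting the same xid again means a re-launch did not resolve it.
        if self.state is WorkerState.ERROR and self.err_xid == xid:
            self.err_cnt += 1
        else:
            self.err_cnt = 1
        self.state = WorkerState.ERROR
        self.err_xid = xid
        self.err_message = message

    def report_success(self) -> None:
        # Leaving the error state clears every error attribute.
        self.state = WorkerState.RUNNING
        self.err_xid = None
        self.err_message = None
        self.err_cnt = 0
```

With this shape, a reader of the status sees the error message exactly as long as the worker is stuck, and nothing about past failures once it recovers.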

I don't really understand the confusion here though.  This error data already exists in the pg_stat_subscription_workers stats collector view, so the fact that I want to keep it around (just changing the reset behavior) doesn't seem like it should be controversial.  I, thinking as a user, really don't care about all of these implementation details.  Whether it is a pg_stat_* view (collector or shmem IPC) or a pg_* catalog is immaterial to me.  The behavior I observe is what matters.  As a developer I don't want to use the statistics collector because these are not statistics and the collector is unreliable.  I don't know enough about the relevant differences between shared memory IPC and catalog tables to decide between them.  But catalog tables seem like a lower bar to meet, and they seem able to implement the user-facing requirements as I envision them.

David J.
