Re: Design of pg_stat_subscription_workers vs pgstats - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Design of pg_stat_subscription_workers vs pgstats
Date
Msg-id CAD21AoCKxcVB9xh5o_Zm8-q0qukuQncNfBD6LVkY=my8ZJbqkQ@mail.gmail.com
In response to Re: Design of pg_stat_subscription_workers vs pgstats  ("David G. Johnston" <david.g.johnston@gmail.com>)
Responses Re: Design of pg_stat_subscription_workers vs pgstats
Re: Design of pg_stat_subscription_workers vs pgstats
List pgsql-hackers
On Wed, Feb 2, 2022 at 4:36 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Tue, Feb 1, 2022 at 11:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Wed, Feb 2, 2022 at 9:41 AM David G. Johnston
>> <david.g.johnston@gmail.com> wrote:
>> >
>> > On Tue, Feb 1, 2022 at 8:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >>
>> >> On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> >>
>> >> >
>> >> > I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
>> >> > feature to pass error-XID or error-LSN information to the worker
>> >> > whereas I'm also not sure of the advantages in storing all error
>> >> > information in a system catalog. Since what we need to do for this
>> >> > purpose is only error-XID/LSN, we can store only error-XID/LSN in the
>> >> > catalog? That is, the worker stores error-XID/LSN in the catalog on an
>> >> > error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
>> >> > the transaction in question. The worker clears the error-XID/LSN after
>> >> > successfully applying or skipping the first non-empty transaction.
>> >> >
>> >>
>> >> Where do you propose to store this information?
>> >
>> >
>> > pg_subscription_worker
>> >
>> > The error message and context is very important.  Just make sure it is only non-null when the worker state is "syncingfailed" (or whatever term we use).
>> >
>> >
>>
>> Sure, but is this the reason you want to store all the error info in
>> the system catalog? I agree that providing more error info could be
>> useful and also possibly the previously failed (apply) xacts info as
>> well but I am not able to see why you want to have that sort of info
>> in the catalog. I could see storing info like err_lsn/err_xid that can
>> allow to proceed to apply worker automatically or to slow down the
>> launch of errored apply worker but not all sort of other error info
>> (like err_cnt, err_code, err_message, err_time, etc.). I want to know
>> why you are insisting to make all the error info persistent via the
>> system catalog?
>
>
> I look at the catalog and am informed that the worker has stopped
> because of an error.  I'd rather simply read the error message right
> then instead of having to go look at the log file.  And if I am going
> to take an action in order to overcome the error I would have to know
> what that error is; so the error message is not something I can
> ignore.  The error is an attribute of system state, and the catalog
> stores the current state of the (workers) system.
>
> I already explained that the concept of err_cnt is not useful.  The
> fact that you include it here makes me think you are still thinking
> that this all somehow is meant to keep track of history.  It is not.
> The workers are state machines and "error" is one of the states - with
> relevant attributes to display to the user, and system, while in that
> state.  The state machine reporting does not care about historical
> states nor does it report on them.  There is some uncertainty if we
> continue with the automatic re-launch; which, now that I write this, I
> can see where what you call err_cnt is effectively a count of how many
> times the worker re-launched without the underlying problem being
> resolved and thus encountered the same error.  If we persist with the
> re-launch behavior then maybe err_cnt should be left in place - with
> the description for it basically being the ah-ha! comment I just made.
> In a world where we do not typically re-launch and simply re-try
> without being informed there is a change - such a count remains of
> minimal value.
>
> I don't really understand the confusion here though - this error data
> already exists in the pg_stat_subscription_workers stat collector view
> - the fact that I want to keep it around (just changing the reset
> behavior) - doesn't seem like it should be controversial.  I, thinking
> as a user, really don't care about all of these implementation
> details.  Whether it is a pg_stat_* view (collector or shmem IPC) or a
> pg_* catalog is immaterial to me.  The behavior I observe is what
> matters.  As a developer I don't want to use the statistics collector
> because these are not statistics and the collector is unreliable.  I
> don't know enough about the relevant differences between shared memory
> IPC and catalog tables to decide between them.  But catalog tables
> seem like a lower bar to meet and seem like they can implement the
> user-facing requirements as I envision them.

I see that important information such as error-XID that can be used
for ALTER SUBSCRIPTION SKIP needs to be stored in a reliable way, and
using system catalogs is a reasonable way for this purpose. But it's
still unclear to me why all the error information that is currently
shown in the pg_stat_subscription_workers view (the error message,
relation OID, action, etc., in addition to error-XID) needs to be
stored in the catalog. Information other than error-XID doesn't need
the same level of reliability as error-XID. I think we can have
error-XID/LSN in the pg_subscription catalog and have the other error
information in the pg_stat_subscription_workers view. After the user
checks the current status of logical replication by checking
error-XID/LSN, they can check pg_stat_subscription_workers for
details.
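
To make the split concrete, here is a rough sketch of the lifecycle I
have in mind, written as a small Python simulation rather than actual
backend code. The class and field names (CatalogRow, WorkerStats,
error_xid, skip_requested) are made up for illustration and are not
existing catalog columns or a proposed patch; the skip_requested flag
in particular is only an assumption about how ALTER SUBSCRIPTION SKIP
could tell the worker to skip.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class CatalogRow:
    """Durable state; a stand-in for a pg_subscription row (hypothetical fields)."""
    error_xid: Optional[int] = None   # written by the worker when apply fails
    skip_requested: bool = False      # assumed effect of ALTER SUBSCRIPTION SKIP


@dataclass
class WorkerStats:
    """Best-effort details; a stand-in for pg_stat_subscription_workers."""
    error_message: Optional[str] = None
    relid: Optional[int] = None
    command: Optional[str] = None


class ApplyError(Exception):
    """Illustrative apply failure carrying the details shown in the view."""

    def __init__(self, message: str, relid: int, command: str) -> None:
        super().__init__(message)
        self.message = message
        self.relid = relid
        self.command = command


def alter_subscription_skip(row: CatalogRow) -> None:
    """Ask the worker to skip the transaction recorded in error_xid."""
    if row.error_xid is None:
        raise ValueError("no failed transaction recorded")
    row.skip_requested = True


def apply_transaction(
    row: CatalogRow,
    stats: WorkerStats,
    xid: int,
    changes: Sequence[str],
    apply_change: Callable[[str], None],
) -> str:
    """Return 'empty', 'skipped', 'applied' or 'failed'."""
    if not changes:
        return "empty"  # empty transactions leave the error state untouched
    if row.skip_requested and row.error_xid == xid:
        outcome = "skipped"
    else:
        try:
            for change in changes:
                apply_change(change)
        except ApplyError as exc:
            row.error_xid = xid                # reliable part: catalog
            stats.error_message = exc.message  # details: stats view only
            stats.relid = exc.relid
            stats.command = exc.command
            return "failed"
        outcome = "applied"
    # First non-empty transaction applied or skipped: clear the error state.
    row.error_xid = None
    row.skip_requested = False
    return outcome
```

In this sketch, losing WorkerStats (for example after a stats reset)
only loses the descriptive details; the error_xid needed for the skip
is still in the catalog row.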

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


