Re: Design of pg_stat_subscription_workers vs pgstats - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Design of pg_stat_subscription_workers vs pgstats
Date
Msg-id CAA4eK1K3KHAT9gc85dM0HVHbPYzuSsXKRmNQmGB9ANdUVAEbgA@mail.gmail.com
Whole thread Raw
In response to Re: Design of pg_stat_subscription_workers vs pgstats  ("David G. Johnston" <david.g.johnston@gmail.com>)
Responses Re: Design of pg_stat_subscription_workers vs pgstats  ("David G. Johnston" <david.g.johnston@gmail.com>)
List pgsql-hackers
On Wed, Feb 2, 2022 at 1:06 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Tue, Feb 1, 2022 at 11:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Wed, Feb 2, 2022 at 9:41 AM David G. Johnston
>> <david.g.johnston@gmail.com> wrote:
>> >
>> > On Tue, Feb 1, 2022 at 8:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >>
>> >> On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> >>
>> >> >
>> >> > I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
>> >> > feature to pass error-XID or error-LSN information to the worker
>> >> > whereas I'm also not sure of the advantages in storing all error
>> >> > information in a system catalog. Since what we need to do for this
>> >> > purpose is only error-XID/LSN, we can store only error-XID/LSN in the
>> >> > catalog? That is, the worker stores error-XID/LSN in the catalog on an
>> >> > error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
>> >> > the transaction in question. The worker clears the error-XID/LSN after
>> >> > successfully applying or skipping the first non-empty transaction.
>> >> >
>> >>
>> >> Where do you propose to store this information?
>> >
>> >
>> > pg_subscription_worker
>> >
>> > The error message and context is very important.  Just make sure it is only non-null when the worker state is
"syncingfailed" (or whatever term we use). 
>> >
>> >
>>
>> Sure, but is this the reason you want to store all the error info in
>> the system catalog? I agree that providing more error info could be
>> useful and also possibly the previously failed (apply) xacts info as
>> well but I am not able to see why you want to have that sort of info
>> in the catalog. I could see storing info like err_lsn/err_xid that can
>> allow to proceed to apply worker automatically or to slow down the
>> launch of errored apply worker but not all sort of other error info
>> (like err_cnt, err_code, err_message, err_time, etc.). I want to know
>> why you are insisting to make all the error info persistent via the
>> system catalog?
>
>
...
...
>
> I already explained that the concept of err_cnt is not useful.  The fact that you include it here makes me think you
arestill thinking that this all somehow is meant to keep track of history.  It is not.  The workers are state machines
and"error" is one of the states - with relevant attributes to display to the user, and system, while in that state.
Thestate machine reporting does not care about historical states nor does it report on them.  There is some uncertainty
ifwe continue with the automatic re-launch; 
>

I think automatic retry will help to allow some transient errors say
like network glitches that can be resolved on retry and will keep the
behavior transparent. This is also consistent with what we do in
standby mode where if there is an error on primary due to which
standby is not able to fetch some data it will just retry. We can't
fix any error that occurred on the server-side, so the way is to retry
which is true for both standby and subscribers. Basically, I don't
think every kind of error demands user intervention. We can allow to
control it via some parameter say disable_on_error as is discussed in
CF entry [1] but don't think that should be the default.

[1] - https://commitfest.postgresql.org/36/3407/

--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Bharath Rupireddy
Date:
Subject: Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
Next
From: "kuroda.hayato@fujitsu.com"
Date:
Subject: RE: [Proposal] Add foreign-server health checks infrastructure