Re: Design of pg_stat_subscription_workers vs pgstats - Mailing list pgsql-hackers

From David G. Johnston
Subject Re: Design of pg_stat_subscription_workers vs pgstats
Msg-id CAKFQuwaTr6wszUiBjf+0u-nhPx3w1j=gRiXLWH6oGJZ93O1bCQ@mail.gmail.com
In response to Re: Design of pg_stat_subscription_workers vs pgstats  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Wed, Feb 2, 2022 at 5:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Feb 2, 2022 at 1:06 PM David G. Johnston
> <david.g.johnston@gmail.com> wrote:
>
> ...
> >
> > I already explained that the concept of err_cnt is not useful.  The fact that you include it here makes me think you are still thinking that this all somehow is meant to keep track of history.  It is not.  The workers are state machines and "error" is one of the states - with relevant attributes to display to the user, and system, while in that state.  The state machine reporting does not care about historical states nor does it report on them.  There is some uncertainty if we continue with the automatic re-launch;
> >
>
> I think automatic retry will help with transient errors, such as
> network glitches, that can be resolved on retry, and will keep the
> behavior transparent. This is also consistent with what we do in
> standby mode, where if an error on the primary prevents the standby
> from fetching some data, it will just retry. We can't fix any error
> that occurred on the server side, so the only option is to retry,
> which is true for both standbys and subscribers.

Good points.  In short, there are two subsets of problems to deal with here.  We should address them separately, though the pg_subscription_worker table should provide relevant information for both cases.  In a retry situation, relevant information such as next_scheduled_retry (estimated) should be provided, if there is some kind of delay involved.  In a situation like "unique constraint violation", next_scheduled_retry would be null; alternatively, make the field a text field and print "Manual Intervention Required".  Likewise, the XID/LSN would be null in a retry situation, since we haven't received a wholly intact transaction from the publisher.  We may know of such an ID, but if the final COMMIT message is never seen before the feed dies, we should not expose that incomplete information to the user.
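
To make the two cases concrete, here is a minimal Python sketch of the per-worker error state described above.  Only next_scheduled_retry and the XID/LSN pair come from this message; the class, enum, and function names are hypothetical and used for illustration only, and nothing here reflects actual PostgreSQL code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class ErrorKind(Enum):
    """Hypothetical split into the two subsets of problems."""
    TRANSIENT = "transient"            # e.g. network glitch; a retry may succeed
    APPLY_CONFLICT = "apply_conflict"  # e.g. unique constraint violation


@dataclass(frozen=True)
class WorkerErrorState:
    """Current error state of one worker: no history and no error count."""
    kind: ErrorKind
    message: str
    next_scheduled_retry: Optional[datetime]  # None: no retry is scheduled
    failed_xid: Optional[int]                 # None while merely retrying
    failed_lsn: Optional[str]                 # None while merely retrying


def transient_error(message: str, now: datetime,
                    retry_delay: timedelta) -> WorkerErrorState:
    # No wholly intact transaction was received, so XID/LSN stay unset.
    return WorkerErrorState(ErrorKind.TRANSIENT, message,
                            now + retry_delay, None, None)


def apply_conflict(message: str, xid: int, lsn: str) -> WorkerErrorState:
    # The complete transaction is known, but retrying cannot fix it.
    return WorkerErrorState(ErrorKind.APPLY_CONFLICT, message,
                            None, xid, lsn)


def describe_retry(state: WorkerErrorState) -> str:
    # Text-field variant: show a fixed message instead of a null timestamp.
    if state.next_scheduled_retry is None:
        return "Manual Intervention Required"
    return state.next_scheduled_retry.isoformat()
```

In this sketch a transient failure carries an estimated retry time and no XID/LSN, while an apply conflict carries the XID/LSN and no retry time, so the two cases never populate the same fields.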

A standby is not expected to encounter any user data constraint problems, so even a system that requires manual intervention for them will work for standbys, because they will never hit that code path.  And you cannot simply skip applying the failed transaction and move on to the next one - that data also never came over.

David J.
