Re: Design of pg_stat_subscription_workers vs pgstats - Mailing list pgsql-hackers

From David G. Johnston
Subject Re: Design of pg_stat_subscription_workers vs pgstats
Date
Msg-id CAKFQuwb8yaWxxH-gSt4NG9HhVnmKK_GnCEotVtjG1JQohOb0Qw@mail.gmail.com
In response to Re: Design of pg_stat_subscription_workers vs pgstats  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Design of pg_stat_subscription_workers vs pgstats  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Tue, Feb 1, 2022 at 8:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
> > feature to pass error-XID or error-LSN information to the worker
> > whereas I'm also not sure of the advantages in storing all error
> > information in a system catalog. Since what we need to do for this
> > purpose is only error-XID/LSN, we can store only error-XID/LSN in the
> > catalog? That is, the worker stores error-XID/LSN in the catalog on an
> > error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
> > the transaction in question. The worker clears the error-XID/LSN after
> > successfully applying or skipping the first non-empty transaction.
> >
>
> Where do you propose to store this information?

pg_subscription_worker

The error message and context are very important.  Just make sure they are non-null only when the worker state is "syncing failed" (or whatever term we use).

Records are removed upon server restart (the launcher can handle this).  Consider recording a last activity timestamp (some protection/visibility against bugs or, say, a worker ending without reporting that fact).  Records stay around even when the worker goes away (the user can filter the state field to omit inactive rows).  I'd consider just removing them when done and/or having a reset function that the DBA could run (it should never be wrong to clear the table).
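
To make the record lifecycle above concrete, here is a minimal in-memory sketch in Python. It is not PostgreSQL code and not a proposed catalog definition: the class names, the state strings, and the (subscription OID, relation OID) key are all hypothetical, chosen only to illustrate the rules described (error text allowed only in the failed state, a last-activity timestamp, filtering out inactive rows, and a reset that is always safe).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple

FAILED = "syncing failed"   # placeholder term, as in the message
INACTIVE = "inactive"       # hypothetical state for a worker that has gone away

# Hypothetical key: (subscription OID, relation OID or None for the apply worker)
Key = Tuple[int, Optional[int]]


@dataclass
class WorkerRecord:
    state: str
    error_message: Optional[str]
    last_activity: datetime


class WorkerCatalog:
    """Toy stand-in for the proposed pg_subscription_worker contents."""

    def __init__(self) -> None:
        self.rows: Dict[Key, WorkerRecord] = {}

    def report(self, key: Key, state: str,
               error_message: Optional[str] = None) -> None:
        # Error text may be non-null only when the worker is in the failed state.
        if error_message is not None and state != FAILED:
            raise ValueError("error_message is only allowed in the failed state")
        # Every report refreshes the last-activity timestamp.
        self.rows[key] = WorkerRecord(state, error_message,
                                      datetime.now(timezone.utc))

    def visible(self) -> Dict[Key, WorkerRecord]:
        # Rows stay around after a worker exits; users filter on state.
        return {k: r for k, r in self.rows.items() if r.state != INACTIVE}

    def reset(self) -> None:
        # Clearing the table must never be wrong; a restart would do the same.
        self.rows.clear()
```

In this model a server restart and a DBA-invoked reset are the same operation, which matches the point that the contents are state information rather than durable data.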

Re: XID and/or LSN, I don't know enough yet to really judge this...

> The other possibility
> could be to invent a new catalog for this info but I guess it will
> then have to have some duplicate info from pg_subscription/_rel.
>
> The other point is after this, do we want an interface where the user
> can also be allowed to specify error_lsn or error_xid?

...but whatever is decided, tell me, the user, what my options are, what their limitations are, and what info to copy from this catalog into the command(s) I issue to the server to make the errors go away.  This is generic: it is not specific to the skip-a-commit command or the skip-to-LSN functions, and it also covers performing DML on the relevant table(s) to avoid the error.

I don't think the fields would be duplicated.  While some of the fields seem similar, aside from the key fields the data we would show would be state info for a given worker.  None of the v14 fields do this at the worker scope.

That all makes the new catalog a generally useful monitoring source and a standalone patch.  I'd personally start a new thread, with a functioning patch as the first message, and a recap of what and why this rework is being done.  In order for Andres to make progress on the shared memory statistics patch I would suggest reverting this and building the new patch as if this statistics collector approach never happened.

I'd still like to get some clarity regarding the observation that our error-die-restart process seems problematic.  Since that process needs to talk to the new catalog anyway I'd rather commit the changes to the process (if any, but I hope we can either all agree on the status quo or get something better in for v15), and the new catalog that provides insight into that process, as part of this first commit.  That includes a probable user function to restart a halted worker instead of doing so continually (even with the suggested back-off protocol).

Then the SKIP commit can go in, leveraging the state information exposed in the catalog.  That discussion and work should be restarted on a new thread with an intro recap message.  The existing patch should be adapted to leverage the new pg_subscription_worker catalog before starting the new thread.
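
For reference, the skip flow quoted at the top of this message (the worker records the failing XID, the user asks to skip that XID, and the worker clears the stored XID after applying or skipping the first non-empty transaction) can be modeled as a small state machine. This is only an illustrative Python sketch: the class, its method names, and the `fails` flag that simulates an apply error are assumptions, not the behavior of any existing patch.

```python
from typing import List, Optional


class ApplyWorker:
    """Toy model of an apply worker with a stored error XID (hypothetical)."""

    def __init__(self) -> None:
        self.error_xid: Optional[int] = None  # recorded by the worker on failure
        self.skip_xid: Optional[int] = None   # set by the user's skip request
        self.applied: List[int] = []

    def request_skip(self, xid: int) -> None:
        # The user copies the XID from the recorded error into the skip request.
        if xid != self.error_xid:
            raise ValueError("can only skip the transaction that failed")
        self.skip_xid = xid

    def handle(self, xid: int, changes: List[str], fails: bool = False) -> str:
        if not changes:
            # Empty transactions leave the stored error information untouched.
            return "empty"
        if xid == self.skip_xid:
            # Skipping a non-empty transaction clears the stored state.
            self.skip_xid = None
            self.error_xid = None
            return "skipped"
        if fails:
            # `fails` simulates an apply error such as a constraint violation.
            self.error_xid = xid
            return "failed"
        # A successful non-empty apply also clears the stored state.
        self.applied.append(xid)
        self.error_xid = None
        self.skip_xid = None
        return "applied"
```

A failing transaction keeps returning "failed" on every retry until `request_skip` is called with the recorded XID; after that the same transaction returns "skipped" and later transactions apply normally.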

David J.
