Re: Design of pg_stat_subscription_workers vs pgstats - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Design of pg_stat_subscription_workers vs pgstats
Date
Msg-id 20220127211518.qil7cmfvtxgfhmh3@alap3.anarazel.de
Whole thread Raw
In response to Re: Design of pg_stat_subscription_workers vs pgstats  ("David G. Johnston" <david.g.johnston@gmail.com>)
Responses Re: Design of pg_stat_subscription_workers vs pgstats
List pgsql-hackers
Hi,

On 2022-01-27 13:18:51 -0700, David G. Johnston wrote:
> Repeating myself here to try and keep complaints regarding
> pg_stat_subscription_worker in one place.

Thanks!


> This is my specific email with respect to the pg_stat_scription_workers
> design.
>
> https://www.postgresql.org/message-id/CAKFQuwZbFuPSV1WLiNFuODst1sUZon2Qwbj8d9tT%3D38hMhJfvw%40mail.gmail.com
>
> Specifically,
>
> pg_stat_subscription_workers is defined as showing:
> "will contain one row per subscription
> worker on which errors have occurred, for workers applying logical
> replication changes and workers handling the initial data copy of the
> subscribed tables."
>
> The fact that these errors remain (last_error_*) even after they are no
> longer relevant is my main gripe regarding this feature.  The information
> itself is generally useful though last_error_count is not.  These fields
> should auto-clear and be named current_error_* as they exist for purposes
> of describing the current state of any error-encountering logical
> replication worker so that ALTER SUBSCRIPTION SKIP, or someone other manual
> intervention, can be done with that knowledge without having to scan the
> subscriber's server logs.

Indeed.

Another related thing is that using a 32bit xid for allowing skipping is a bad
idea anyway - we shouldn't adding new interfaces with xid wraparound dangers -
it's getting more and more common to have multiple wraparounds a day.  An
easily better alternative would be the LSN at which a transaction starts.


> This is my email trying to understand reality better in order to figure out
> what exactly is causing the limitations that are negatively impacting the
> design of this feature.
>
> https://www.postgresql.org/message-id/CAKFQuwYJ7dsW%2BStsw5%2BZVoY3nwQ9j6pPt-7oYjGddH-h7uVb%2Bg%40mail.gmail.com
>
> In short, it was convenient to use the statistics collector here even if
> doing so resulted in a non-user friendly (IMO) design.

And importantly, the whole justification for the scheme, namely the inability
to change actual tables in that state, just doesn't hold up. It's a few lines
to abort the failed transaction and log the error after.


Just retrying over and over at full pace doesn't seem like a good thing. We
should start to back off retries - the retries themselves can very well
contribute to making it harder to fix the problem, by holding locks etc. For
that the launcher (or workers) should check whether there's no worker because
it's errored out. With pgstats such a check would need this full sequence:

1) worker sends failure stats message
2) pgstats receive stats message
3) launcher sends ping to pgstats to request file to be written out
4) pgstats writes out the whole database's stats
5) launcher reads the whole stats file

That's a big and expensive cannon for a check whether we should delay the
launcher of a worker.


> Given all of the
> limitations to the statistics collection infrastructure, and the fact that
> this data is not statistical in the usual usage of the term, I find that to
> be less than satisfying.  To the point that I'd be inclined to revert this
> feature and hold up the ALTER SUBSCRIPTION SET patch until a more
> user-friendly design can be done using proper IPC techniques.

Same.


> In my second email I did some tracing and ended up at the PG_CATCH() block
> in src/backend/replication/logical/worker.c:L3629.  When mentioning trying
> to get rid of the PG_RE_THROW() there apparently doing so completely is
> unwarranted due to fatal/panic errors.  I am curious that the addition of
> the statistic reporting logic doesn't seem to care about the same.

We shouldn't even think about doing stuff like stats updates when we've
PANICed. You could argue its safe to do that in the FATAL case - but where
would such a FATAL validly come from? It'd be something like a user calling
pg_terminate_backend(), which isn't transaction specific, so it'd not make
sense to record details like xid in pg_stat_subscription_workers.

But the argument of needing to do something in PG_CATCH in the fatal/panic
case is bogus, because FATAL/PANIC doesn't reach PG_CATCH. errfinish() only
throws when elevel == ERROR, in the FATAL case we end with proc_exit(1), with
PANIC we abort().


> Andres, I do not know how to be more precise than your comment "But: You
> don't need to. Just abort the current transaction, start a new one, and
> update the state.".  When I suggested that idea it didn't seem to resonate
> with anyone on the other thread.  Starting at the main PG_TRY() loop in
> worker.c noted above, could you maybe please explain in a bit more detail
> whether, and how hard, it would be to go from "just PG_RE_THROW();" to
> "abort and start a new transaction"?

It's pretty easy from the POV of getting into a new transaction.

PG_CATCH():

    /* get us out of the failed transaction */
    AbortOutOfAnyTransaction();

    StartTransactionCommand();
    /* do something to remember the error we just got */
    CommitTransactionCommand();


It may be a bit harder to afterwards to to not just error out the whole
worker, because we'd need to know what to do instead.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: "Imseih (AWS), Sami"
Date:
Subject: Re: Add index scan progress to pg_stat_progress_vacuum
Next
From: Thomas Munro
Date:
Subject: Re: A test for replay of regression tests