Re: Design of pg_stat_subscription_workers vs pgstats - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Design of pg_stat_subscription_workers vs pgstats
Date
Msg-id 20220215181742.372brts5t7q3gkpr@alap3.anarazel.de
In response to Re: Design of pg_stat_subscription_workers vs pgstats  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Design of pg_stat_subscription_workers vs pgstats  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Hi,

On 2022-02-04 09:23:06 +0530, Amit Kapila wrote:
> On Thu, Feb 3, 2022 at 3:25 PM Peter Eisentraut
> <peter.eisentraut@enterprisedb.com> wrote:
> >
> > On 02.02.22 07:54, Amit Kapila wrote:
> >
> > > Sure, but is this the reason you want to store all the error info in
> > > the system catalog? I agree that providing more error info could be
> > > useful and also possibly the previously failed (apply) xacts info as
> > > well but I am not able to see why you want to have that sort of info
> > > in the catalog. I could see storing info like err_lsn/err_xid that can
> > > allow to proceed to apply worker automatically or to slow down the
> > > launch of errored apply worker but not all sort of other error info
> > > (like err_cnt, err_code, err_message, err_time, etc.). I want to know
> > > why you are insisting to make all the error info persistent via the
> > > system catalog?
> >
> > Let's flip this around and ask, why not?
> >
> 
> Because we don't necessarily need all this information after the crash
> and neither is this information about any system object which we
> require for performing operations on objects.

I find this not particularly convincing. IMO data that leads the user to
compromise "replication integrity" is pretty crucial.

And skipped data needs to be logged somewhere persistent, so that there's a
chance to analyze / recover.

We should also use more detailed knowledge about errors to decide the interval
at which replication is retried. Serialization error: retry soon. Other
errors: retry with increasing backoff.
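
To make the retry idea concrete, here is a minimal sketch of such a policy. It is not taken from any patch in this thread: the function name, the base delay, and the cap are hypothetical, and the only fact it relies on is that PostgreSQL reports serialization failures with SQLSTATE 40001.

```python
# Illustrative sketch only: the names and default values are assumptions,
# not part of PostgreSQL or of any proposed patch.

# SQLSTATE for serialization_failure, expected to succeed on a quick retry.
SERIALIZATION_FAILURE = "40001"


def retry_delay_seconds(sqlstate, consecutive_failures, base=1.0, cap=300.0):
    """Return how long to wait before restarting the apply worker.

    sqlstate: SQLSTATE of the error that stopped the worker.
    consecutive_failures: number of failures in a row, starting at 1.
    base, cap: hypothetical minimum and maximum delays in seconds.
    """
    if sqlstate == SERIALIZATION_FAILURE:
        # Serialization error: retry soon, regardless of failure count.
        return base
    # Other errors: double the delay on each consecutive failure, up to cap.
    exponent = max(consecutive_failures - 1, 0)
    return min(cap, base * 2 ** exponent)
```

With these assumed defaults, a unique violation (SQLSTATE 23505) that keeps recurring would be retried after 1, 2, 4, ... seconds until the delay reaches the cap, while a serialization failure is always retried after the base delay.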


> In walreceiver (for standby), we don't store the errors/conflicts in any
> table, they are either reported in logs or shared via stats.

That's imo quite different - they're fundamentally time-limited problems. And
they aren't leading the user / DBA to skip transactions etc.

Greetings,

Andres Freund


