Re: Design of pg_stat_subscription_workers vs pgstats - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Design of pg_stat_subscription_workers vs pgstats |
Date | |
Msg-id | CAD21AoCKxcVB9xh5o_Zm8-q0qukuQncNfBD6LVkY=my8ZJbqkQ@mail.gmail.com |
In response to | Re: Design of pg_stat_subscription_workers vs pgstats ("David G. Johnston" <david.g.johnston@gmail.com>) |
Responses |
Re: Design of pg_stat_subscription_workers vs pgstats
Re: Design of pg_stat_subscription_workers vs pgstats |
List | pgsql-hackers |
On Wed, Feb 2, 2022 at 4:36 PM David G. Johnston <david.g.johnston@gmail.com> wrote:
>
> On Tue, Feb 1, 2022 at 11:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Wed, Feb 2, 2022 at 9:41 AM David G. Johnston
>> <david.g.johnston@gmail.com> wrote:
>> >
>> > On Tue, Feb 1, 2022 at 8:07 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >>
>> >> On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> >>
>> >> >
>> >> > I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
>> >> > feature to pass error-XID or error-LSN information to the worker
>> >> > whereas I'm also not sure of the advantages in storing all error
>> >> > information in a system catalog. Since what we need to do for this
>> >> > purpose is only error-XID/LSN, we can store only error-XID/LSN in the
>> >> > catalog? That is, the worker stores error-XID/LSN in the catalog on an
>> >> > error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
>> >> > the transaction in question. The worker clears the error-XID/LSN after
>> >> > successfully applying or skipping the first non-empty transaction.
>> >> >
>> >>
>> >> Where do you propose to store this information?
>> >
>> >
>> > pg_subscription_worker
>> >
>> > The error message and context is very important. Just make sure it is only non-null when the worker state is "syncingfailed" (or whatever term we use).
>> >
>> >
>>
>> Sure, but is this the reason you want to store all the error info in
>> the system catalog? I agree that providing more error info could be
>> useful and also possibly the previously failed (apply) xacts info as
>> well but I am not able to see why you want to have that sort of info
>> in the catalog. I could see storing info like err_lsn/err_xid that can
>> allow to proceed to apply worker automatically or to slow down the
>> launch of errored apply worker but not all sort of other error info
>> (like err_cnt, err_code, err_message, err_time, etc.).
>> I want to know why you are insisting to make all the error info persistent
>> via the system catalog?
>
>
> I look at the catalog and am informed that the worker has stopped because of an error. I'd rather simply read the error message right then instead of having to go look at the log file. And if I am going to take an action in order to overcome the error I would have to know what that error is; so the error message is not something I can ignore. The error is an attribute of system state, and the catalog stores the current state of the (workers) system.
>
> I already explained that the concept of err_cnt is not useful. The fact that you include it here makes me think you are still thinking that this all somehow is meant to keep track of history. It is not. The workers are state machines and "error" is one of the states - with relevant attributes to display to the user, and system, while in that state. The state machine reporting does not care about historical states nor does it report on them. There is some uncertainty if we continue with the automatic re-launch; which, now that I write this, I can see where what you call err_cnt is effectively a count of how many times the worker re-launched without the underlying problem being resolved and thus encountered the same error. If we persist with the re-launch behavior then maybe err_cnt should be left in place - with the description for it basically being the ah-ha! comment I just made. In a world where we do not typically re-launch and simply re-try without being informed there is a change - such a count remains of minimal value.
>
> I don't really understand the confusion here though - this error data already exists in the pg_stat_subscription_workers stat collector view - the fact that I want to keep it around (just changing the reset behavior) - doesn't seem like it should be controversial. I, thinking as a user, really don't care about all of these implementation details.
> Whether it is a pg_stat_* view (collector or shmem IPC) or a pg_* catalog is immaterial to me. The behavior I observe is what matters. As a developer I don't want to use the statistics collector because these are not statistics and the collector is unreliable. I don't know enough about the relevant differences between shared memory IPC and catalog tables to decide between them. But catalog tables seem like a lower bar to meet and seem like they can implement the user-facing requirements as I envision them.

I see that important information such as error-XID that can be used for
ALTER SUBSCRIPTION SKIP needs to be stored in a reliable way, and using
system catalogs is a reasonable way for this purpose. But it's still
unclear to me why all the error information that is currently shown in the
pg_stat_subscription_workers view, including error-XID and the error
message, relation OID, action, etc., needs to be stored in the catalog.
The information other than error-XID doesn't necessarily need to be as
reliable as error-XID.

I think we can have error-XID/LSN in the pg_subscription catalog and
have the other error information in the pg_stat_subscription_workers view.
After the user checks the current status of logical replication by
checking error-XID/LSN, they can check pg_stat_subscription_workers for
details.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
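To make the proposed split concrete, here is a minimal Python sketch of the behavior discussed in this thread. It is a model only, not PostgreSQL code. `CatalogRow` stands in for the durable error-XID kept in pg_subscription, and `StatsEntry` for the best-effort details shown in pg_stat_subscription_workers. All class, field, and function names, and the XID 740, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CatalogRow:
    """Hypothetical durable state, standing in for a pg_subscription row."""
    error_xid: Optional[int] = None  # XID of the transaction that failed to apply
    skip_enabled: bool = False       # set by the modelled SKIP command


@dataclass
class StatsEntry:
    """Hypothetical best-effort state, standing in for the stats view."""
    error_message: Optional[str] = None


def alter_subscription_skip(row: CatalogRow) -> None:
    """Model of ALTER SUBSCRIPTION SKIP: allow the worker to skip the recorded XID."""
    if row.error_xid is not None:
        row.skip_enabled = True


def apply_transaction(row: CatalogRow, stats: StatsEntry, xid: int,
                      changes: List[str],
                      apply_change: Callable[[str], None]) -> str:
    """Apply one remote transaction and maintain the error state."""
    if not changes:
        return "empty"  # empty transactions do not clear the error state
    if row.skip_enabled and row.error_xid == xid:
        outcome = "skipped"
    else:
        try:
            for change in changes:
                apply_change(change)
        except Exception as exc:
            row.error_xid = xid             # reliable: needed for SKIP
            stats.error_message = str(exc)  # details: best effort only
            return "failed"
        outcome = "applied"
    # First non-empty transaction applied or skipped: clear the error state.
    row.error_xid = None
    row.skip_enabled = False
    return outcome


def failing_change(change: str) -> None:
    raise ValueError("unique constraint violated")


row, stats = CatalogRow(), StatsEntry()
first = apply_transaction(row, stats, 740, ["insert"], failing_change)
alter_subscription_skip(row)
second = apply_transaction(row, stats, 740, ["insert"], failing_change)
print(first, second)
```

In this model the user first reads `error_xid` from the catalog-like row to see that the worker is stuck, then consults the stats-like entry for the message. Losing the stats entry would not prevent the skip from working.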