Re: Design of pg_stat_subscription_workers vs pgstats - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Design of pg_stat_subscription_workers vs pgstats |
Date | |
Msg-id | CAD21AoA-MVZcMA1iYwbtvTCOHx1SwG8amQUTT8j2ssh2zuc6OQ@mail.gmail.com Whole thread Raw |
In response to | Re: Design of pg_stat_subscription_workers vs pgstats (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: Design of pg_stat_subscription_workers vs pgstats
Re: Design of pg_stat_subscription_workers vs pgstats |
List | pgsql-hackers |
On Wed, Feb 16, 2022 at 8:36 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Feb 16, 2022 at 5:49 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Feb 15, 2022 at 11:56 PM Andres Freund <andres@anarazel.de> wrote: > > > > > > On 2022-02-03 13:33:08 +0900, Masahiko Sawada wrote: > > > > I see that important information such as error-XID that can be used > > > > for ALTER SUBSCRIPTION SKIP needs to be stored in a reliable way, and > > > > using system catalogs is a reasonable way for this purpose. But it's > > > > still unclear to me why all error information that is currently shown > > > > in pg_stat_subscription_workers view, including error-XID and the > > > > error message, relation OID, action, etc., need to be stored in the > > > > catalog. The information other than error-XID doesn't necessarily need > > > > to be reliable compared to error-XID. I think we can have > > > > error-XID/LSN in the pg_subscription catalog and have other error > > > > information in pg_stat_subscription_workers view. After the user > > > > checks the current status of logical replication by checking > > > > error-XID/LSN, they can check pg_stat_subscription_workers for > > > > details. > > > > > > The stats system isn't geared towards storing error messages and > > > such. Generally it is about storing counts of events etc, not about > > > information about a single event. Obviously there are a few cases where that > > > boundary isn't that clear... > > > > > > > True, in the beginning, we discussed this information to be stored in > > system catalog [1] (See .... Also, I am thinking that instead of a > > stat view, do we need to consider having a system table .. ) but later > > discussion led us to store this as stats. > > > > > IOW, storing information like: > > > - subscription oid > > > - retryable error count > > > - hard error count > > > - #replicated inserts > > > - #replicated updates > > > - #replicated deletes > > > > > > is what pgstats is for. > > > > > > > Some of this and similar ((like error count, last_error_time)) is > > present in stats and something on the lines of the other information > > is proposed in [2] (xacts successfully replicated (committed), > > aborted, etc) to be stored along with it. > > > > > But not > > > - subscription oid > > > - error message > > > - xid of error > > > - ... > > > > > > > I think from the current set of things we are capturing, the other > > thing in this list will be error_command (insert/update/delete..) and > > or probably error_code. So, we can keep this information in a system > > table. > > Agreed. Also, we can have also commit-LSN or replace error-XID with error-LSN? > > > > > Based on this discussion, it appears to me what we want here is to > > store the error info in some persistent way (system table) and the > > counters (both error and success) of logical replication in stats. If > > we can't achieve this work (separation) in the next few weeks (before > > the feature freeze) then I'll revert the work related to stats. > > There was an idea to use shmem to store error info but it seems to be > better to use a system catalog to persist them. > > I'll summarize changes we discussed and make a plan of changes and > feature designs toward the feature freeze (and v16). I think that once > we get a consensus on them we can start implementation and move it > forward. > Here is the summary of the discussion, changes, and plan. 1. Move some error information such as the error message to a new system catalog, pg_subscription_error. The pg_subscription_error table would have the following columns: * sesubid : subscription Oid. * serelid : relation Oid (NULL for apply worker). * seerrlsn : commit-LSN or the error transaction. * seerrcmd : command (INSERT, UPDATE, etc.) of the error transaction. * seerrmsg : error message The tuple is inserted or updated when an apply worker or a tablesync worker raises an error. If the same error occurs in a row, the update is skipped. The tuple is removed in the following cases: * the subscription is dropped. * the table is dropped. * the table is removed from the subscription. * the worker successfully committed a non-empty transaction. With this change, pg_stat_subscription_workers will be like: * subid * subname * subrelid * error_count * last_error_timestamp This view will be extended by adding transaction statistics proposed on another thread[1]. 2. Fix a bug in pg_stat_subscription_workers. As pointed out by Andres, there is a bug in pg_stat_subscription_workers; it doesn't drop entries for already-dropped tables. We need to fix it. 3. Show commit-LSN of the error transaction in errcontext. Currently, we show XID and commit timestamp in the errcontext. But given that we use LSN in pg_subscriptoin_error, it's better to show commit-LSN as well (or instead of error-XID). 4. Skipping the particular conflicted transaction. There are two proposals: 4-1. Use the existing replication_origin_advance() SQL function. We don't need to add any additional syntax, instead use replication_origin_advance() with the error-LSN reported in either pg_subscription_error or server logs to skip the particular transaction. 4-2. Introduce a new syntax like ALTER SUBSCRIPTION ... SKIP. This proposal further has two options: (1) the user specifies error-LSN manually and (2) the user just enables skipping behavior and error-LSN is automatically fetched from pg_subscription_error. In any way, the command raises an error when there is no error entry in pg_subscription_error. We can discuss this item for details on the original thread. 5. Record skipped data to the system catalog, say pg_subscription_conflict_history so that there is a chance to analyze and recover. The pg_subscription_conflict_history I'm thinking is that we record the following of all skipped changes: * command (INSERT, UPDATE etc.) * commit-LSN * before row (in json format?) * after row (in json format?) Given that we end up writing a huge amount of history if the transaction is very large and I think there are peoples who want to check what changes will be skipped together before enabling skipping behavior, I think it could be optional. Therefore I think we can provide an option for ALTER SUBSCRIPTION ... SKIP to whether the skipped data is recorded or not and to the pg_subscription_conflict_history or server logs. We can discuss the details in a new thread. 4 and 5 might be better introduced together but I think since the user is able to check what changes will be skipped on the publisher side we can do 5 for v16. Feedback and comments are very welcome. Regards, [1] https://www.postgresql.org/message-id/flat/OSBPR01MB48887CA8F40C8D984A6DC00CED199@OSBPR01MB4888.jpnprd01.prod.outlook.com -- Masahiko Sawada EDB: https://www.enterprisedb.com/
pgsql-hackers by date: