Re: Skipping logical replication transactions on subscriber side - Mailing list pgsql-hackers
From | David G. Johnston |
---|---|
Subject | Re: Skipping logical replication transactions on subscriber side |
Date | |
Msg-id | CAKFQuwZbFuPSV1WLiNFuODst1sUZon2Qwbj8d9tT=38hMhJfvw@mail.gmail.com Whole thread Raw |
In response to | Re: Skipping logical replication transactions on subscriber side (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: Skipping logical replication transactions on subscriber side
|
List | pgsql-hackers |
On Tue, Jan 25, 2022 at 8:33 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Given that we cannot use rely on the pg_stat_subscription_workers view
for this purpose, we would need either a new sub-system that tracks
each logical replication status so the system can set the error XID to
subskipxid, or to wait for shared-memory based stats collector.
I'm reading over the monitoring-stats page to try and get my head around all of this. First of all, it defines two kinds of views:
1. PostgreSQL's statistics collector is a subsystem that supports collection and reporting of information about server activity.
2. PostgreSQL also supports reporting dynamic information ... This facility is independent of the collector process.
In then has two tables:
28.1 Dynamic Statistics Views (describing #2 above)
28.2 Collected Statistics Views (describing #1 above)
Apparently the "collector process" is UDP-like, not reliable. The documentation fails to mention this fact. I'd argue that this is a documentation bug.
I do see that the pg_stat_subscription_workers view is correctly placed in Table 28.2
Reviewing the other views listed in that table only pg_stat_archiver abuses the statistics collector in a similar fashion. All of the others are actually metric oriented.
I don't care for the specification: "will contain one row per subscription worker on which errors have occurred, for workers applying logical replication changes and workers handling the initial data copy of the subscribed tables."
I would much rather have this behave similar to pg_stat_activity (which, of course, is a Dynamic Statistics View...) in that it shows only and all workers that are presently working. The tablesync workers should go away when they have finished synchronizing. I should not have to manually intervene to get rid of unreliable expired data. The log file feels like a superior solution to this monitoring view.
Alternatively, if the tablesync workers are done but we've been accumulating real statistics for them, then by all means keep them included in the view - but regardless of whether they encountered an error. But maybe the view can right join in pg_stat_subscription as show a column for "(pid is not null) AS is_active".
Maybe we need to add a track_finished_tablesync_workers GUC so the DBA can decide whether to devote storage and processing resources to that historical information.
If you had kept the original view name, "pg_stat_subscription_error", this whole issue goes away. But you decided to make it more generic and call it "pg_stat_subscription_workers" - which means you need to get rid of the error-specific condition in the WHERE clause for the view. Show all workers - I can filter on is_active. Showing only active workers is also acceptable. You won't get to change your mind so decide whether this wants to show only current and running state or whether historical statistics for now defunct tablesync workers are desired. Personally, I would just show active workers and if someone wants to add the feature they can add a track_tablesync_worker_stats GUC and a matching view.
From that, every apply worker should be sending a statistics message to the collector periodically. If error info is not present and the state is "all is well", clear out any existing error info from the view. The attempt to include an actual statistic field here doesn't seem useful nor redeeming. I would add a "state" field in its place (well, after subrelid). And I would still rename the columns to current_error_* and note that these should be null unless the status field shows error (there may be some additional complexity here). Just get rid of last_error_count.
David J.
P.S. I saw the discussion regarding pg_dump'ing the subskipid field. I didn't notice any discussion around creating and restoring a basebackup. It seems like during server startup subskipid should just be cleared out. Then it doesn't matter what one does during backup.
pgsql-hackers by date: