I've been playing with logical replication (currently on PG14), specifically in an AWS RDS Postgres context, but NOT using AWS' own replication tooling. I'm generally familiar with the challenges of distributed systems (such causality, time synchronisation etc), but not especially familiar with PG.
In looking at how to tell how a given subscriber has caught up with its publisher, there is plenty of advice around the Web, for example https://dba.stackexchange.com/questions/314324/monitor-logical-replication-using-lsn. Like this example, much advice ends up talking about using separate queries on the publisher and the subscriber to compare LSNs. First, (I think) I understand the core difficulty that comparing LSNs is inherently racy, but given that, I'm a bit unclear as to why a single query on the publisher is not enough...IIUC:
Changes sent from the publisher to the subscriber are identified by LSN.
The publisher knows it's own current latest LSN (pg_current_wal_lsn()), but this seems not to be exposed at the subscriber.
The subscriber knows what it has applied locally and even tells the publisher (pg_stat_subscription.latest_end_lsn), but it does not seem to be exposed at the publisher.
Have I missed something? Is there a way to track the LSN delta (given that this is known to be racy) just by querying one end?
Second, how do folk "know" when replication is "done". For example, if the two LSNs continued to match for 1 * replication lag? Or N * replication lag? What would be a plausible N?
Third, as we know when logical replication is started, the initial table state is captured in a snapshot, and sent across using COPY TABLE under the covers. Now, let's say that the publisher is idle (i.e. no SQL writes to the user's schema...obviously pg_catalog might change as replication is configured and enabled) and that the replication starts with the publisher as LSN_start. How could one know when the copying is done:
I initially assumed that the publisher's LSN would not change from LSN_start, but as the copying proceeds, I see that it DOES change (presumably because there are updates happening to pg_catalog, such as the temporary slots coming and going).
Is there some kind of singleton state on either publisher or subscriber that could be checked to know? (At the moment, I am counting the records in all copied tables).
I realise that the knowledge that the publisher is "idle" is a special case, but right now, my test for being "done" is:
Number of records in copied tables matches AND the publisher's pg_stat_subscription matches the subscriber's pg_stat_subscription.latest_end_lsn.
Plus or minus the bit about replication lag, is there a better way?