Re: Monitoring logical replication - Mailing list pgsql-general

From Shaheed Haque
Subject Re: Monitoring logical replication
Date
Msg-id CAHAc2jeWZ1y4S+ibkGJ8uh7QXyv8Z-O=AAO5FKz2ZX3ZysmM7w@mail.gmail.com
Whole thread Raw
In response to Re: Monitoring logical replication  (Ron Johnson <ronljohnsonjr@gmail.com>)
List pgsql-general
Mostly to close the loop on this, now that I have things going seemingly reliably...

On Tue, 18 Jun 2024 at 14:33, Ron Johnson <ronljohnsonjr@gmail.com> wrote:
On Tue, Jun 18, 2024 at 5:03 AM Shaheed Haque <shaheedhaque@gmail.com> wrote:
Hi all,

Is there an "official" pairing of LSN values on the publication and subscription sides that should be used to track the delta between the two systems? I ask because Google is full of different pairs being used. I tried to identify the highest level interface points exposed, i.e. what is documented on https://www.postgresql.org/docs/current/replication-origins.html, the pg_stat_subscription table, the pg_stat_publication table and the pg_current_wal_lsn() function on the publisher, but these seem to be barely used.

The attached scripts (whose guts I took from a Stack Exchange post) might be a good starting point.  It certainly works for physical replication!
 
P.S. On a related note, I see a (stalled?) discussion on providing LSN -> timestamp conversion, I'd just like to say that something like that would be very useful.
 
Out of curiosity, how does that work?  Is an instance's initial LSN really based on Epoch?

According to the docs at https://www.postgresql.org/docs/current/datatype-pg-lsn.html, the LSN is "a 64-bit integer, representing a byte position in the write-ahead log stream", so I guess some black magic is required to turn that into a timestamp.

My use case might not be all that common, as I am NOT interested in a long-term replica. What I am doing is making a copy of an "old" Django deployment to a "new" deployment such that when the two ends are in close sync, I can freeze traffic to the old deployment, pause for any final catchup, and then run a Django migration on the new, before switching on the new (thereby minimising the down time for the app). That being said:
  • It turns out that one cannot use the LSN alone to check for sync since, during the initial "full table copying" phase of the replication, the LSNs at the two ends seem to be the same.
  • Instead, I track three metrics on each end:
    • On the publication end:
      • "count(*)" for each table
      • pg_current_wal_lsn()
      • tuples_processed from pg_stat_progress_copy for each table
    • On the subscription end:
      • "count(*)" for each table
      • latest_end_lsn from pg_stat_subscription
      • tuples_processed from pg_stat_progress_copy for each table
  • The sync is considered complete when all three metrics are aligned.
  • I then freeze activity on the "old" deployment, wait for any in-flight metric changes to come through, and then run the migration on the "new" deployment.
As always perhaps that will be of use to somebody, but YMMV.

Thanks all,

Shaheed

pgsql-general by date:

Previous
From: Heikki Linnakangas
Date:
Subject: Re: PG17 optimizations to vacuum
Next
From: Dominique Devienne
Date:
Subject: Re: How to grant role to other user