PG16 and replication, ensure a clean switchover after a stop of the primary server - Mailing list pgsql-general
From | François Lafont |
---|---|
Subject | PG16 and replication, ensure a clean switchover after a stop of the primary server |
Date | |
Msg-id | 8f745b7e-24a4-41e4-b331-6344a33ac89a@gmail.com Whole thread Raw |
List | pgsql-general |
Hi, First, I have put all details of this post in this gist https://gist.github.com/flaf/ccedf78d0290d231e79581077fd92dbc (with a little video/demo to see with your eyes ;)). I have 2 PostgreSQL servers version 16.8 on RedHat 8.10, with physical streaming async replication: * postgres-1 the primary server, * postgres-2 the warm standby server. It's a detail but on these servers, PGDATA=/pg_data/pginc and the UniX account and superuser is "pginc". There is continuous WAL archiving via the /pg_archives/pginc/ NFS share mounted on postgres-1 and postgres-2. My goal is to stop nicely the PostgreSQL daemon on the primary server and then ensure that the standby server is well synchronized (ie it contains *exactly* the same data as the stopped primary server). To check that, I use this command on the primary and on the standby: ~$ pg_controldata /pg_data/pginc/ | grep REDO -C 6 Here an example of output on the primary when PostgreSQL has been stopped: ------------------------------------------------------------------- Catalog version number: 202307071 Database system identifier: 7470764453209630447 Database cluster state: shut down pg_control last modified: Tue 25 Feb 2025 03:51:34 PM CET Latest checkpoint location: 0/6F000028 Latest checkpoint's REDO location: 0/6F000028 Latest checkpoint's REDO WAL file: 00000001000000000000006F Latest checkpoint's TimeLineID: 1 Latest checkpoint's PrevTimeLineID: 1 Latest checkpoint's full_page_writes: on Latest checkpoint's NextXID: 0:4349 Latest checkpoint's NextOID: 18418 ------------------------------------------------------------------- To check if the standby server is well synchronized I compare the value of "Latest checkpoint's REDO location" on primary and on standby (but is it a reliable check? <= it's my question 4). Here are my questions. 1. When I stop nicely PostgreSQL on the primary server (postgres-1) with `systemctl stop postgresql.service` and then I run the `CHECKPOINT;` query on the secondary server (postgres-2), the latest checkpoint's REDO location are not the same on postgres-1 and postgres-2, except from time to time (rather rarely) in a rather random way. Why? I thought that walsender on the primary server would delay the PostgreSQL shutdown request until confirming all sent data are flushed on remote side? Is it false? This page https://postgrespro.com/list/thread-id/2630031 seems to confirm it's well true. Note: servers are just VM for test on my desktop. There is no trafic (read/write) on the primary except some INSERTs I do manually to simulate a little traffic (see my video on my gist). 2. In fact, concerning this point (walsenders delay the shutdown request until confirming all sent data are flushed on remote side), is there official documentation? I have not found. 3. Now, I add this line in the postgresql.conf of primary server: wal_keep_size = '512MB' The value seems to be without any importance, I just set the parameter to a non zero value. I reload the configuration. Now, I'm doing exactly the same test as 1. and, at the end, I have **systematically** the same the latest checkpoint's REDO location on postgres-1 and postgres-2. Why? Is it normal? Or is the case 1. abnormal? 4. When the primary server is stopped, if have exactly the same latest checkpoint location and the same latest checkpoint's REDO location, on the primary server and the secondary server, can I be sure that the data on the secondary server is exactly the same as the data on the primary server, with the last changes made to the primary server before its PostgreSQL daemon was shutdown? Thanks for your help. Note: here is the common postgresql.conf of the two servers (but all is in my gist https://gist.github.com/flaf/ccedf78d0290d231e79581077fd92dbc): ------------------------------------------------------------------- listen_addresses = 'localhost,<IP-address-of-postgres-X>' port = '4900' unix_socket_directories = '/var/run/postgresql' unix_socket_permissions = 0700 max_connections = 810 superuser_reserved_connections = 10 timezone = 'Europe/Paris' lc_messages = 'en_US.UTF8' lc_monetary = 'fr_FR.UTF8' lc_numeric = 'fr_FR.UTF8' lc_time = 'fr_FR.UTF8' default_text_search_config = 'pg_catalog.french' datestyle = 'iso,dmy' log_directory = '/pg_logs/pginc' logging_collector = on log_filename = 'postgresql-%a.log' log_rotation_age = 1d log_rotation_size = 0 log_truncate_on_rotation = on log_timezone = 'Europe/Paris' log_min_duration_statement = 5000 # 5 seconds log_autovacuum_min_duration = 0 password_encryption = scram-sha-256 wal_level = replica archive_mode = on archive_command = 'cp %p /pg_archives/pginc/%f' restore_command = 'cp /pg_archives/pginc/%f %p' #wal_keep_size = '512MB' dynamic_shared_memory_type = posix max_worker_processes = 2 max_parallel_workers = 2 shared_buffers = 1420MB # ~0.25 x RAM effective_cache_size = 4260MB # ~0.75 x RAM work_mem = 10MB maintenance_work_mem = 512MB # 512 or 1024 MiB wal_buffers = 16MB min_wal_size = 1GB max_wal_size = 4GB max_standby_streaming_delay = 20min # To allow long readonly queries on the standby server. effective_io_concurrency = 200 random_page_cost = 1.1 track_activity_query_size = 8096 shared_preload_libraries = 'pg_stat_statements,auto_explain' pg_stat_statements.max = 10000 auto_explain.log_min_duration = -1 # disable auto_explain extension currently ------------------------------------------------------------------- -- François Lafont
pgsql-general by date: