Thread: very long secondary->primary switch time
Hello all,

I maintain a PostgreSQL cluster that does failover via Patroni. The problem is that after a failover happens it takes the secondary too long (about 35 min) to come up and answer queries. The log of the secondary looks like this:

04:00:29.777 [9679] LOG: received promote request
04:00:29.780 [9693] FATAL: terminating walreceiver process due to administrator command
04:00:29.780 [9679] LOG: invalid record length at 320/B95A1EE0: wanted 24, got 0
04:00:29.783 [9679] LOG: redo done at 320/B95A1EA8
04:00:29.783 [9679] LOG: last completed transaction was at log time 2021-03-03 03:57:46.466342+01
04:35:00.982 [9679] LOG: selected new timeline ID: 15
04:35:01.404 [9679] LOG: archive recovery complete
04:35:02.337 [9662] LOG: database system is ready to accept connections

The cluster is "fairly large", with thousands of DBs (sic!) and ~1 TB of data. I would like to shorten the failover/startup time drastically.

Why does it take the secondary that much time to switch to the primary state? There are no log entries between 04:00 and 04:35. What is PostgreSQL doing during those 35 minutes?

I am *guessing* that PostgreSQL *might* be doing some consistency check or replaying the WAL (max_wal_size: 16 GB, wal_keep_segments: 100). I am also *guessing* that the startup time *might* have to do with the size of the data (~1 TB) and/or with the number of DBs (thousands). If that were the case, would splitting the cluster into multiple clusters allow for faster startup times?

I have tried to duckduck why the secondary takes that much time to switch to primary mode, but have failed to find information that would enlighten me. So any pointers to information, hints or help are very welcome.

Thanks & greets,
*t
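[Editor's note: the silent window in the log above can be quantified directly from the two timestamps that bracket it; a quick sketch, with both times copied from the log:]

```shell
# Gap between "redo done" (04:00:29) and "selected new timeline ID" (04:35:00),
# i.e. the window during which the secondary logs nothing at all.
t1=$(date -u -d "1970-01-01 04:00:29" +%s)
t2=$(date -u -d "1970-01-01 04:35:00" +%s)
echo "silent window: $(( (t2 - t1) / 60 )) min"
# → silent window: 34 min
```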
Tomas Pospisek <tpo2@sourcepole.ch> writes:
> I maintain a postgresql cluster that does failover via patroni. The
> problem is that after a failover happens it takes the secondary too long
> (that is about 35min) to come up and answer queries. The log of the
> secondary looks like this:

> 04:00:29.777 [9679] LOG: received promote request
> 04:00:29.780 [9693] FATAL: terminating walreceiver process due to
> administrator command
> 04:00:29.780 [9679] LOG: invalid record length at 320/B95A1EE0: wanted
> 24, got 0
> 04:00:29.783 [9679] LOG: redo done at 320/B95A1EA8
> 04:00:29.783 [9679] LOG: last completed transaction was at log time
> 2021-03-03 03:57:46.466342+01

> 04:35:00.982 [9679] LOG: selected new timeline ID: 15
> 04:35:01.404 [9679] LOG: archive recovery complete
> 04:35:02.337 [9662] LOG: database system is ready to accept connections

> The cluster is "fairly large" with thousands of DBs (sic!) and ~1TB of data.

Hm. WAL replay is already done at the "redo done" entry. There is a
checkpoint after that, I believe, and there may be some effort to search
for dead files as well. Still, if your I/O subsystem is better than
a wet noodle, 35 minutes is a long time to finish that.

One thing I'm not sure about is whether we try to do the checkpoint
at maximum speed. If you have set GUC options to throttle checkpoint
I/O hard, that could perhaps explain this.

You could possibly learn more by strace'ing the startup process to
see what it's doing.

Also, what PG version is that exactly?

			regards, tom lane
Hi Tom, hi list participants,

thanks a lot for replying Tom.

On 27.04.21 22:16, Tom Lane wrote:
> Tomas Pospisek <tpo2@sourcepole.ch> writes:
>> I maintain a postgresql cluster that does failover via patroni. The
>> problem is that after a failover happens it takes the secondary too long
>> (that is about 35min) to come up and answer queries. The log of the
>> secondary looks like this:
>
>> 04:00:29.777 [9679] LOG: received promote request
>> 04:00:29.780 [9693] FATAL: terminating walreceiver process due to
>> administrator command
>> 04:00:29.780 [9679] LOG: invalid record length at 320/B95A1EE0: wanted
>> 24, got 0
>> 04:00:29.783 [9679] LOG: redo done at 320/B95A1EA8
>> 04:00:29.783 [9679] LOG: last completed transaction was at log time
>> 2021-03-03 03:57:46.466342+01
>
>> 04:35:00.982 [9679] LOG: selected new timeline ID: 15
>> 04:35:01.404 [9679] LOG: archive recovery complete
>> 04:35:02.337 [9662] LOG: database system is ready to accept connections
>
>> The cluster is "fairly large" with thousands of DBs (sic!) and ~1TB of data.
>
> Hm. WAL replay is already done at the "redo done" entry. There is a
> checkpoint after that, I believe, and there may be some effort to search
> for dead files as well. Still, if your I/O subsystem is better than
> a wet noodle, 35 minutes is a long time to finish that.
>
> One thing I'm not sure about is whether we try to do the checkpoint
> at maximum speed. If you have set GUC options to throttle checkpoint
> I/O hard, that could perhaps explain this.

I didn't do much customization of postgresql settings and am mostly running what was provided by the upstreams.
These are my explicit settings (all the rest is defaults):

allow_system_table_mods: 'off'
autovacuum: 'off'
checkpoint_completion_target: '0.7'
effective_cache_size: 16 GB
maintenance_work_mem: 1 GB
max_connections: 200
max_wal_size: 16 GB
shared_buffers: 8 GB
shared_preload_libraries: pg_stat_statements
track_activities: 'on'
track_activity_query_size: 32 kB
track_functions: all
wal_keep_segments: 100
work_mem: 64 MB

> You could possibly learn more by strace'ing the startup process to
> see what it's doing.

Will do, thanks, however I'm dreading the next failover downtime :-(

> Also, what PG version is that exactly?

12.6-1.pgdg20.04+1

Thankful for any help or pointers regarding the long promotion time,
*t
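[Editor's note: given these settings, a back-of-envelope check of the checkpoint theory is possible; the write rate below is an assumption, not a measurement. Even in the worst case, with every page of the 8 GB shared_buffers dirty at promotion, flushing it at a modest sustained rate takes under three minutes, so if the end-of-recovery checkpoint really accounts for the 35 minutes, something must be throttling or stalling it far below raw disk speed:]

```shell
# Worst case: all of shared_buffers (8 GB, from the settings above) is dirty
# when the end-of-recovery checkpoint starts.
shared_buffers_mb=8192
write_rate_mb_s=50   # assumed modest sustained write rate
secs=$(( shared_buffers_mb / write_rate_mb_s ))
echo "full flush at ${write_rate_mb_s} MB/s: ~${secs} s"
# → full flush at 50 MB/s: ~163 s
```

That is roughly 2.5 minutes even under pessimistic assumptions, which is why the strace Tom suggests would be informative: it would show whether the startup process is instead stuck in fsync calls or file enumeration.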