Thread: [GENERAL] Unexpected WAL-archive restore behaviour
Hello everyone. I'm running PostgreSQL 9.6.1 on Debian 8 (master/slave hot standby with streaming replication), and I've run into unexpected WAL-archive restore behaviour.

master (postgresql.conf):

    wal_level = replica
    archive_mode = on
    archive_command = 'gzip < %p > /var/lib/postgres/archives/%f'

slave recovery.conf:

    standby_mode = 'on'
    primary_conninfo = 'host=192.168.50.96 port=5432 user=replication password=password'
    primary_slot_name = 'myslot'
    restore_command = 'ssh postgres@192.168.50.96 \'cat /var/lib/postgres/archives/%f\' | gzip -d -c > %p'
    recovery_target_timeline = 'latest'

I needed to add a replication delay (recovery_min_apply_delay = 30min). The hot standby server was started with the new recovery.conf, reached a consistent recovery state and resumed restoring WAL segments from the archive. When a couple of the next WAL segments became unavailable, the slave server just "skipped" them instead of starting to stream WAL from the primary:

2017-02-16 10:54:22 MSK [5956-1] LOG: database system was shut down in recovery at 2017-02-16 10:19:01 MSK
2017-02-16 10:54:22 MSK [5957-1] [unknown]@[unknown] LOG: incomplete startup packet
cat: /var/lib/postgres/archives/00000003.history: No such file or directory
gzip: stdin: unexpected end of file
2017-02-16 10:54:22 MSK [5956-2] LOG: entering standby mode
cat: /var/lib/postgres/archives/00000002.history: No such file or directory
gzip: stdin: unexpected end of file
2017-02-16 10:54:22 MSK [5969-1] postgres@postgres FATAL: the database system is starting up
2017-02-16 10:54:23 MSK [5956-3] LOG: restored log file "00000002000012A2000000D0" from archive
2017-02-16 10:54:23 MSK [5977-1] postgres@postgres FATAL: the database system is starting up
2017-02-16 10:54:23 MSK [5956-4] LOG: restored log file "00000002000012A2000000B5" from archive
2017-02-16 10:54:23 MSK [5956-5] LOG: redo starts at 12A2/B5ECC5C0
2017-02-16 10:54:23 MSK [5956-6] LOG: restored log file "00000002000012A2000000B6" from archive
...
2017-02-16 10:54:48 MSK [5956-55] LOG: restored log file "00000002000012A2000000E7" from archive
2017-02-16 10:54:48 MSK [5956-56] LOG: restored log file "00000002000012A2000000E8" from archive
2017-02-16 10:54:48 MSK [5956-57] LOG: consistent recovery state reached at 12A2/E81D86C8
2017-02-16 10:54:48 MSK [5954-3] LOG: database system is ready to accept read only connections
2017-02-16 10:55:07 MSK [6233-1] [unknown]@[unknown] LOG: incomplete startup packet
...
2017-02-16 11:27:37 MSK [5956-115] LOG: restored log file "00000002000012A300000022" from archive
2017-02-16 11:27:38 MSK [5956-116] LOG: restored log file "00000002000012A300000023" from archive
2017-02-16 11:27:59 MSK [5956-117] LOG: restored log file "00000002000012A300000024" from archive
cat: /var/lib/postgres/archives/00000002000012A300000025: No such file or directory
gzip: stdin: unexpected end of file
cat: /var/lib/postgres/archives/00000002000012A300000026: No such file or directory
gzip: stdin: unexpected end of file
cat: /var/lib/postgres/archives/00000002000012A300000027: No such file or directory
gzip: stdin: unexpected end of file
2017-02-16 11:28:37 MSK [5956-118] LOG: restored log file "00000002000012A300000028" from archive
2017-02-16 11:28:39 MSK [5956-119] LOG: restored log file "00000002000012A300000029" from archive
2017-02-16 11:28:59 MSK [5956-120] LOG: restored log file "00000002000012A30000002A" from archive

The "skipped" WAL segments look like normal WAL archives:

postgres@myhost:/var/lib/postgres/archives/tmpwal$ /usr/lib/postgresql/9.6/bin/pg_xlogdump -z ./00000002000012A300000026

Type                   N    (%)  Record size     (%)  FPI size      (%)  Combined size      (%)
----                   -    ---  -----------     ---  --------      ---  -------------      ---
XLOG                   0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
Transaction           34  (0.43)        1472  (0.56)         0   (0.00)          1472   (0.01)
Storage                0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
CLOG                   0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
Database               0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
Tablespace             0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
MultiXact              0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
RelMap                 0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
Standby                2  (0.03)         184  (0.07)         0   (0.00)           184   (0.00)
Heap2                673  (8.51)       21544  (8.25)   2057620  (12.90)       2079164  (12.83)
Heap                4854 (61.35)      171361 (65.59)   7642192  (47.92)       7813553  (48.20)
Btree               2330 (29.45)       63248 (24.21)   6248364  (39.18)       6311612  (38.94)
Hash                   0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
Gin                    0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
Gist                   0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
Sequence              19  (0.24)        3458  (1.32)         0   (0.00)          3458   (0.02)
SPGist                 0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
BRIN                   0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
CommitTs               0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
ReplicationOrigin      0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
Generic                0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
LogicalMessage         0  (0.00)           0  (0.00)         0   (0.00)             0   (0.00)
                --------         --------            --------           --------
Total               7912           261267 [1.61%]  15948176 [98.39%]    16209443 [100%]

Is this normal behaviour? Can it damage the standby's consistency?

Thank you.
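A side note on the log's error pairs: PostgreSQL treats any non-zero exit status from restore_command as "segment not available", and in a `cat ... | gzip -d -c` pipeline the exit status is gzip's, which fails with "unexpected end of file" whenever cat produces no data. The following is a minimal local sketch of both cases (made-up file names, no ssh involved) to show why a missing archive file surfaces as a gzip error:

```shell
# Stand-in archive directory and output file (hypothetical paths).
ARCHIVE=$(mktemp -d)
OUT=$(mktemp)

# Case 1: the segment is missing. cat fails, gzip reads empty input
# and exits non-zero -- the same "unexpected end of file" as the log.
if cat "$ARCHIVE/00000002000012A300000026" 2>/dev/null | gzip -d -c > "$OUT" 2>/dev/null
then MISSING=ok; else MISSING=fail; fi
echo "missing segment: $MISSING"

# Case 2: the segment exists as a complete gzip archive -- pipeline succeeds.
echo "fake WAL payload" | gzip > "$ARCHIVE/00000002000012A300000027"
if cat "$ARCHIVE/00000002000012A300000027" | gzip -d -c > "$OUT" 2>/dev/null
then PRESENT=ok; else PRESENT=fail; fi
CONTENT=$(cat "$OUT")
echo "present segment: $PRESENT, content: $CONTENT"

rm -rf "$ARCHIVE" "$OUT"
```

So the "No such file or directory" / "unexpected end of file" pairs are just the restore_command reporting "not found" in a roundabout way.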
On 02/18/2017 10:01 PM, Nikolay Petrov wrote:
> Hot standby server was started with new recovery.conf, reached
> consistent recovery state and resumed to restore WAL segments from
> archive. When a couple of the next WAL segments became unavailable,
> the slave server just "skipped" them, instead of starting to stream
> WAL from the primary:

What do you mean by "became unavailable"? The restore_command may be called for segments that do not exist - that's expected, and I suspect the "gzip: stdin: unexpected end of file" error messages are caused by that. Then, on the next try, that WAL segment probably existed, so it got restored.

It's difficult to say what actually happened, because you've only posted a doctored log - for example, nowhere in the log is there a message saying that the standby started streaming from the primary, i.e. something like this:

LOG: started streaming WAL from primary at 18/DE000000 on timeline 1

But maybe not - it's possible this is happening because the archive and restore commands are not atomic. E.g. let's say the restore_command gets executed while the WAL segment is still being archived. In that case it'll find an incomplete gzip archive, will try to decompress it, and will fail with exactly the same ('unexpected end of file') error.

So I think you should read the docs about archive/restore command requirements [1] and maybe also the wiki page about setting up a warm standby [2]:

[1] https://www.postgresql.org/docs/9.6/static/continuous-archiving.html
[2] https://wiki.postgresql.org/wiki/Warm_Standby

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 02/18/2017 02:06, Tomas Vondra wrote:
> What do you mean by "became unavailable"? The restore_command may be
> called for segments that do not exist - that's expected, and I suspect
> the "gzip: stdin: unexpected end of file" error messages are caused by
> that.

Sorry for my unclear description. I've attached the full slave server log file. By "became unavailable" I mean that another shell script on the master removed the necessary WAL segments from the archive (segments which had not yet been applied on the slave server). Here is a detailed timeline:

10:18 Replication is working normally; the physical replication slot is active and shipping WAL records.
10:19 I stopped the slave to make changes in recovery.conf. The master server continued to archive WAL segments; the replication slot became inactive.
10:54 I started the slave server; it reached a consistent state and resumed restoring WAL segments from the archive.
11:25 The replication slot on the master was still inactive, waiting for a connection. The slave was still reading WAL archive segments and applying them according to recovery_min_apply_delay = 30min.
11:27 A bash script was started on the master to remove old, unnecessary WAL archive segments. Unfortunately, in this case the script removed 3 necessary WAL segments which had not yet been applied on the slave. Usually, when WAL archive segments are unavailable, the slave starts streaming WAL from the primary, but in this case it just continued restoring WAL from the archive.

According to the documentation it is normal for the standby to look for WAL segments that do not exist. In my case, though, the WAL segments were created normally by the master but were not shipped to the slave correctly. I am worried about possible damage to the slave's consistency.

Thank you for the suggestions. Regards.
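For what it's worth, a safer cleanup policy than deleting by age is to delete only segments strictly older than the oldest one the standby still needs (PostgreSQL ships pg_archivecleanup for exactly this purpose). The sketch below only illustrates the comparison in plain shell with made-up file names, assuming the "oldest still needed" segment is known (e.g. from pg_controldata on the standby); WAL segment names on a single timeline sort lexicographically, though timeline switches complicate this, which is why the stock tool is preferable:

```shell
# Hypothetical cleanup sketch: keep KEEP and everything newer,
# delete anything lexicographically older.
set -eu
ARCHIVE=$(mktemp -d)
KEEP=00000002000012A300000025   # oldest segment still required (assumed)

# Fake archive contents for the demonstration:
for s in 00000002000012A300000023 00000002000012A300000024 \
         00000002000012A300000025 00000002000012A300000026; do
    touch "$ARCHIVE/$s"
done

# Delete only segments strictly older than KEEP (string comparison;
# names contain hex letters, so expr compares lexicographically).
for f in "$ARCHIVE"/*; do
    seg=$(basename "$f")
    if expr "$seg" \< "$KEEP" >/dev/null; then
        rm "$f"
    fi
done

REMAINING=$(ls "$ARCHIVE")
echo "$REMAINING"
rm -rf "$ARCHIVE"
```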