Thread: [GENERAL] Unexpected WAL-archive restore behaviour

[GENERAL] Unexpected WAL-archive restore behaviour

From: Nikolay Petrov
Hello everyone.
I'm running PostgreSQL 9.6.1 on Debian 8 (master-slave hot standby with streaming replication).
I've run into some unexpected WAL-archive restore behaviour.

master:
    wal_level = replica
    archive_mode = on
    archive_command = 'gzip < %p > /var/lib/postgres/archives/%f'

slave recovery.conf:
    standby_mode = 'on'
    primary_conninfo = 'host=192.168.50.96 port=5432 user=replication password=password'
    primary_slot_name = 'myslot'
    restore_command = 'ssh postgres@192.168.50.96 \'cat /var/lib/postgres/archives/%f\' | gzip -d -c > %p'
    recovery_target_timeline = 'latest'

It was necessary to add a replication delay (recovery_min_apply_delay = 30min).
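For reference, that's a single extra line appended to the slave's recovery.conf shown above:

    recovery_min_apply_delay = '30min'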

The hot standby server was started with the new recovery.conf, reached a consistent recovery state, and resumed
restoring WAL segments from the archive. When a couple of subsequent WAL segments became unavailable, the slave
server just "skipped" them instead of starting to stream WAL from the primary:

2017-02-16 10:54:22 MSK [5956-1] LOG:  database system was shut down in recovery at 2017-02-16 10:19:01 MSK
2017-02-16 10:54:22 MSK [5957-1] [unknown]@[unknown] LOG:  incomplete startup packet
cat: /var/lib/postgres/archives/00000003.history: No such file or directory

gzip: stdin: unexpected end of file
2017-02-16 10:54:22 MSK [5956-2] LOG:  entering standby mode
cat: /var/lib/postgres/archives/00000002.history: No such file or directory

gzip: stdin: unexpected end of file
2017-02-16 10:54:22 MSK [5969-1] postgres@postgres FATAL:  the database system is starting up
2017-02-16 10:54:23 MSK [5956-3] LOG:  restored log file "00000002000012A2000000D0" from archive
2017-02-16 10:54:23 MSK [5977-1] postgres@postgres FATAL:  the database system is starting up
2017-02-16 10:54:23 MSK [5956-4] LOG:  restored log file "00000002000012A2000000B5" from archive
2017-02-16 10:54:23 MSK [5956-5] LOG:  redo starts at 12A2/B5ECC5C0
2017-02-16 10:54:23 MSK [5956-6] LOG:  restored log file "00000002000012A2000000B6" from archive

...

2017-02-16 10:54:48 MSK [5956-55] LOG:  restored log file "00000002000012A2000000E7" from archive
2017-02-16 10:54:48 MSK [5956-56] LOG:  restored log file "00000002000012A2000000E8" from archive
2017-02-16 10:54:48 MSK [5956-57] LOG:  consistent recovery state reached at 12A2/E81D86C8
2017-02-16 10:54:48 MSK [5954-3] LOG:  database system is ready to accept read only connections
2017-02-16 10:55:07 MSK [6233-1] [unknown]@[unknown] LOG:  incomplete startup packet

...

2017-02-16 11:27:37 MSK [5956-115] LOG:  restored log file "00000002000012A300000022" from archive
2017-02-16 11:27:38 MSK [5956-116] LOG:  restored log file "00000002000012A300000023" from archive
2017-02-16 11:27:59 MSK [5956-117] LOG:  restored log file "00000002000012A300000024" from archive
cat: /var/lib/postgres/archives/00000002000012A300000025: No such file or directory

gzip: stdin: unexpected end of file
cat: /var/lib/postgres/archives/00000002000012A300000026: No such file or directory

gzip: stdin: unexpected end of file
cat: /var/lib/postgres/archives/00000002000012A300000027: No such file or directory

gzip: stdin: unexpected end of file
2017-02-16 11:28:37 MSK [5956-118] LOG:  restored log file "00000002000012A300000028" from archive
2017-02-16 11:28:39 MSK [5956-119] LOG:  restored log file "00000002000012A300000029" from archive
2017-02-16 11:28:59 MSK [5956-120] LOG:  restored log file "00000002000012A30000002A" from archive

"Skipped" WAL segments looks like normal WAL archives:

postgres@myhost:/var/lib/postgres/archives/tmpwal$ /usr/lib/postgresql/9.6/bin/pg_xlogdump -z ./00000002000012A300000026
Type                    N  (%)      Record size (%)     FPI size    (%)    Combined size   (%)
----                    -  ---      ----------- ---     --------    ---    -------------   ---
XLOG                    0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
Transaction            34 (0.43)          1472 (0.56)            0 (0.00)           1472 (0.01)
Storage                 0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
CLOG                    0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
Database                0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
Tablespace              0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
MultiXact               0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
RelMap                  0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
Standby                 2 (0.03)           184 (0.07)            0 (0.00)            184 (0.00)
Heap2                 673 (8.51)         21544 (8.25)      2057620 (12.90)       2079164 (12.83)
Heap                 4854 (61.35)       171361 (65.59)     7642192 (47.92)       7813553 (48.20)
Btree                2330 (29.45)        63248 (24.21)     6248364 (39.18)       6311612 (38.94)
Hash                    0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
Gin                     0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
Gist                    0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
Sequence               19 (0.24)          3458 (1.32)            0 (0.00)           3458 (0.02)
SPGist                  0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
BRIN                    0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
CommitTs                0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
ReplicationOrigin       0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
Generic                 0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
LogicalMessage          0 (0.00)             0 (0.00)            0 (0.00)              0 (0.00)
                 --------             --------           --------               --------
Total                7912              261267 [1.61%]    15948176 [98.39%]      16209443 [100%]

Is this normal behaviour? Can it damage the standby's consistency?
Thank you.


Re: [GENERAL] Unexpected WAL-archive restore behaviour

From: Tomas Vondra
On 02/18/2017 10:01 PM, Nikolay Petrov wrote:
> The hot standby server was started with the new recovery.conf,
> reached a consistent recovery state, and resumed restoring WAL
> segments from the archive. When a couple of subsequent WAL segments
> became unavailable, the slave server just "skipped" them instead of
> starting to stream WAL from the primary:

What do you mean by "became unavailable"? The restore_command may be
called for segments that do not exist - that's expected, and I suspect
the "gzip: stdin: unexpected end of file" error messages are caused by that.

Then, on the next try, that WAL segment probably existed, so it got
restored. It's difficult to say what actually happened, because you've
only posted a doctored log - for example, nowhere in the log is there a
message saying that the standby started streaming from the primary, i.e.
something like this:

LOG:  started streaming WAL from primary at 18/DE000000 on timeline 1

But maybe not - it's possible this is happening because the archive and
restore commands are not atomic. E.g. let's say the restore_command gets
executed while the WAL segment is still being archived. In that case
it'll find an incomplete gzip archive, will try to decompress it, and
will fail with exactly the same ('unexpected end of file') error.
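For instance, one common way to make the archiving side atomic is to
compress into a temporary file and rename it into place only once it's
complete - a sketch (untested):

    archive_command = 'gzip < %p > /var/lib/postgres/archives/%f.tmp && mv /var/lib/postgres/archives/%f.tmp /var/lib/postgres/archives/%f'

Since rename is atomic within a filesystem, restore_command can then never
observe a half-written archive file.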

So I think you should read the docs about archive/restore command
requirements [1] and maybe also the wiki page about setting up a warm
standby [2]:

[1] https://www.postgresql.org/docs/9.6/static/continuous-archiving.html

[2] https://wiki.postgresql.org/wiki/Warm_Standby

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [GENERAL] Unexpected WAL-archive restore behaviour

From: Nikolay Petrov
> On 02/18/2017 02:06, Tomas Vondra wrote:
>
> What do you mean by "became unavailable"? The restore_command may be
> called for segments that do not exist - that's expected, and I suspect
> the "gzip: stdin: unexpected end of file" error messages are caused by
> that.
>

Sorry for my unclear description. I've attached the full slave server log file.
By "became unavailable" I mean that another shell script on the master removed necessary WAL segments from the
archive (segments which hadn't been applied on the slave server yet).

Here is a detailed timeline:
 10:18 Replication was working normally; the physical replication slot was active and shipping WAL records.
 10:19 I stopped the slave to make changes in recovery.conf. The master server continued to archive WAL segments,
and the replication slot became inactive.
 10:54 I started the slave server; it reached a consistent state and resumed restoring WAL segments from the archive.
 11:25 The replication slot on the master was still inactive, waiting for a connection. The slave was still reading
WAL archive segments and applying them according to recovery_min_apply_delay = 30min.
 11:27 A bash script was started on the master to remove old, unnecessary WAL archive segments. Unfortunately, in
this case the script removed 3 necessary WAL segments which had not yet been applied on the slave (a safer-cleanup
sketch follows this list). Usually, when WAL archive segments are unavailable, the slave server starts streaming
WAL from the primary, but in this case the slave just continued restoring WAL from the archive.
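In hindsight, the cleanup script should have removed only segments the slave
had already applied. Something like pg_archivecleanup would have been safer -
a sketch, where the cutoff segment is purely illustrative and would have to
come from what the slave has actually replayed:

    # removes only archive files logically older than the named segment;
    # our archive files keep plain segment names, so no extension handling is needed
    /usr/lib/postgresql/9.6/bin/pg_archivecleanup /var/lib/postgres/archives 00000002000012A300000024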

According to the documentation it's normal for a standby server to look for WAL segments that do not exist. In my
case, though, the WAL segments were created by the master server normally but weren't shipped to the slave
correctly. I am worried about possible damage to the slave's consistency.
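For what it's worth, to see whether the slave has actually switched to
streaming, I can look for an active walsender on the master (the exact psql
invocation here is illustrative):

    psql -U postgres -c "SELECT application_name, state, sent_location, replay_location FROM pg_stat_replication;"

A row with state 'streaming' would mean the standby is connected and
receiving WAL over the replication connection.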

Thank you for the suggestions.
Regards.

