Thread: BUG #15638: pg_basebackup with --wal-method=stream incorrectly generates WAL segment created during backup

The following bug has been logged on the website:

Bug reference:      15638
Logged by:          Andre Piwoni
Email address:      apiwoni@webmd.net
PostgreSQL version: 10.5
Operating system:   CentOS 7.3
Description:

When a new slave is created by taking a base backup from the primary using pg_basebackup with the --wal-method=stream option, the WAL file generated during the backup is different (as compared with diff or cmp) from the one on the master and in the WAL archive directory. Furthermore, this file does not exist with a .done extension in pg_wal/archive_status on the new slave, though it exists in the pg_wal directory, resulting in a failed attempt to archive this file when the slave node is promoted to master.
2019-02-15 14:15:58.872 PST [5369] DETAIL:  The failed archive command was: test ! -f /mnt/pgsql/archive/000000010000000000000002 && cp pg_wal/000000010000000000000002 /mnt/pgsql/archive/000000010000000000000002

pg_basebackup with the --wal-method=fetch option correctly creates this WAL segment on the slave, both in the pg_wal directory and with a .done entry in pg_wal/archive_status, and its contents are identical to those on the old primary and in the WAL archive directory.

Here are contents of old_primary, new_primary and WAL archive:
OLD PRIMARY
/var/lib/pgsql/10/data/pg_wal/:
-rw-------. 1 postgres postgres 16777216 Feb 15 13:39 000000010000000000000001
-rw-------. 1 postgres postgres 16777216 Feb 15 13:40 000000010000000000000002
-rw-------. 1 postgres postgres      302 Feb 15 13:40 000000010000000000000002.00000028.backup
-rw-------. 1 postgres postgres 16777216 Feb 15 13:40 000000010000000000000003
drwx------. 2 postgres postgres      133 Feb 15 13:40 archive_status

/var/lib/pgsql/10/data/pg_wal/archive_status:
-rw-------. 1 postgres postgres 0 Feb 15 13:39 000000010000000000000001.done
-rw-------. 1 postgres postgres 0 Feb 15 13:40 000000010000000000000002.00000028.backup.done
-rw-------. 1 postgres postgres 0 Feb 15 13:40 000000010000000000000002.done

WAL ARCHIVE
-rw-rw-rw-. 1 root root 16777216 Feb 15 13:39 000000010000000000000001
-rw-rw-rw-. 1 root root 16777216 Feb 15 13:40 000000010000000000000002
-rw-rw-rw-. 1 root root      302 Feb 15 13:40 000000010000000000000002.00000028.backup

NEW PRIMARY (created with --wal-method=stream)
/var/lib/pgsql/10/data/pg_wal/:
-rw-------. 1 postgres root     16777216 Feb 15 13:40 000000010000000000000002
-rw-------. 1 postgres postgres 16777216 Feb 15 13:40 000000010000000000000003
drwx------. 2 postgres root            6 Feb 15 13:40 archive_status

/var/lib/pgsql/10/data/pg_wal/archive_status:
total 0

Please note the lack of a 000000010000000000000002.done file in /var/lib/pgsql/10/data/pg_wal/archive_status.
When using a backup taken with --wal-method=fetch, this file is created in /var/lib/pgsql/10/data/pg_wal/archive_status, and NEW_PRIMARY won't try to archive the segment, which would fail because it was already archived when NEW_PRIMARY was created as a slave from OLD_PRIMARY.

The consequence of all this is that failover does not work correctly.


On Sat, Feb 16, 2019 at 12:26:13AM +0000, PG Bug reporting form wrote:
> When new slave is created by taking base backup from the primary using
> pg_basebackup with --wal-method=stream option the WAL file generated during
> the backup is different (as compared with diff or cmp command) than that on
> the master and in WAL archive directory. Furthermore, this file does not
> exist in pg_wal/archive_status with .done extension on new slave, though it
> exists in pg_wal directory, resulting in failed attempt to archive this file
> when slave node is promoted as master node.
> 2019-02-15 14:15:58.872 PST [5369] DETAIL:  The failed archive command was:
> test ! -f /mnt/pgsql/archive/000000010000000000000002 && cp
> pg_wal/000000010000000000000002
> /mnt/pgsql/archive/000000010000000000000002

How do you promote your standby?  In Postgres 10, the last, partial WAL segment of a past timeline generated at promotion is renamed with a .partial suffix to avoid any conflicts, so this should normally not happen if you do not use archive_mode = always.

Please note that your archive command is not safe.  For one, it does not sync the archived segment before archive_command returns to the backend.
--
Michael

I did a quick test using nearly empty databases. I did not do the promotion, only the base backup with the two different methods.

When I did it on macOS and PostgreSQL 11.1, the .done file existed only under the data directory I created with -X fetch. The files were the same.

When I did it on FreeBSD and PostgreSQL 10.6, the .done file existed only under the -X fetch directory, and the WAL files were also different. I don't know whether it is a problem or not, but I could reproduce it at the first attempt.

This was after the two basebackups:

$ pg_basebackup -p 5433 -v -R -P -D 1 -X fetch
$ pg_basebackup -p 5433 -v -R -P -D 2 -X stream

$ diff -ur 1/pg_wal/ 2/pg_wal/
Only in 1/pg_wal/: 00000001000000000000000C
Only in 1/pg_wal/: 00000001000000000000000D
Files 1/pg_wal/00000001000000000000000E and 2/pg_wal/00000001000000000000000E differ
Only in 1/pg_wal/archive_status: 00000001000000000000000C.done
Only in 1/pg_wal/archive_status: 00000001000000000000000D.done
Only in 1/pg_wal/archive_status: 00000001000000000000000E.done


$ less log/1/2019-02-16_19-48-29.log
2019-02-16 19:48:29 CET  LOG:  database system was interrupted; last known up at 2019-02-16 19:44:45 CET
2019-02-16 19:48:29 CET  LOG:  entering standby mode
2019-02-16 19:48:29 CET  LOG:  redo starts at 0/C000028
2019-02-16 19:48:29 CET  LOG:  consistent recovery state reached at 0/C000130
2019-02-16 19:48:29 CET  LOG:  database system is ready to accept read only connections
2019-02-16 19:48:29 CET  LOG:  started streaming WAL from primary at 0/D000000 on timeline 1

$ less log/2/2019-02-16_19-48-34.log
2019-02-16 19:48:34 CET  LOG:  database system was interrupted; last known up at 2019-02-16 19:45:15 CET
2019-02-16 19:48:34 CET  LOG:  entering standby mode
2019-02-16 19:48:34 CET  LOG:  redo starts at 0/E000028
2019-02-16 19:48:34 CET  LOG:  consistent recovery state reached at 0/E000130
2019-02-16 19:48:34 CET  LOG:  database system is ready to accept read only connections
2019-02-16 19:48:34 CET  LOG:  started streaming WAL from primary at 0/F000000 on timeline 1

$ diff -ur 1/base/ 2/base/
Files 1/base/16386/pg_internal.init and 2/base/16386/pg_internal.init differ


I did nothing except for starting the two clusters. There was no activity on the master. I did not promote.

M.


On Sat, Feb 16, 2019 at 4:25 PM Michael Paquier <michael@paquier.xyz> wrote:
On Sat, Feb 16, 2019 at 12:26:13AM +0000, PG Bug reporting form wrote:
> When new slave is created by taking base backup from the primary using
> pg_basebackup with --wal-method=stream option the WAL file generated during
> the backup is different (as compared with diff or cmp command) than that on
> the master and in WAL archive directory. Furthermore, this file does not
> exist in pg_wal/archive_status with .done extension on new slave, though it
> exists in pg_wal directory, resulting in failed attempt to archive this file
> when slave node is promoted as master node.
> 2019-02-15 14:15:58.872 PST [5369] DETAIL:  The failed archive command was:
> test ! -f /mnt/pgsql/archive/000000010000000000000002 && cp
> pg_wal/000000010000000000000002
> /mnt/pgsql/archive/000000010000000000000002

How do you promote your standby?  In Postgres 10, the last, partial
WAL segment of a past timeline generated at promotion is renamed
.partial to avoid any conflicts, so as this should normally not
happen if you do not use archive_mode = always.

Please note that your archive command is not safe.  For one, it does
not sync the archived segment before archive_command returns to the
backend..
--
Michael

On 2/16/19 6:25 PM, Michael Paquier wrote:

On Sat, Feb 16, 2019 at 12:26:13AM +0000, PG Bug reporting form wrote:
When new slave is created by taking base backup from the primary using
pg_basebackup with --wal-method=stream option the WAL file generated during
the backup is different (as compared with diff or cmp command) than that on
the master and in WAL archive directory. Furthermore, this file does not
exist in pg_wal/archive_status with .done extension on new slave, though it
exists in pg_wal directory, resulting in failed attempt to archive this file
when slave node is promoted as master node.
2019-02-15 14:15:58.872 PST [5369] DETAIL:  The failed archive command was:
test ! -f /mnt/pgsql/archive/000000010000000000000002 && cp
pg_wal/000000010000000000000002
/mnt/pgsql/archive/000000010000000000000002
How do you promote your standby?  In Postgres 10, the last, partial
WAL segment of a past timeline generated at promotion is renamed
.partial to avoid any conflicts, so as this should normally not
happen if you do not use archive_mode = always.


This issue concerns only the segment that contains WAL records written during pg_basebackup. For example, I have reproduced this issue on the master branch, getting the following content of the archive directory after standby promotion:

000000010000000000000001.node1
000000010000000000000002.00000028.backup.node1
000000010000000000000002.node1
000000010000000000000002.node2
000000010000000000000003.node1
000000010000000000000004.node1
000000010000000000000005.partial.node2
000000020000000000000005.node2
00000002.history.node2

Each WAL segment was appended with a .<node> suffix to distinguish its source. Initially node1 was the master and node2 the standby. The segment 000000010000000000000002 has the following content:

rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/02000028, prev 0/015DB060, desc: RUNNING_XACTS nextXid 474 latestCompletedXid 473 oldestRunningXid 474
rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: 0/02000060, prev 0/02000028, desc: CHECKPOINT_ONLINE redo 0/2000028; tli 1; prev tli 1; fpw true; xid 0:474; oid 12690; multi 1; offset 0; oldest xid 467 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 474; online
rmgr: XLOG        len (rec/tot):     34/    34, tx:          0, lsn: 0/020000D0, prev 0/02000060, desc: BACKUP_END 0/2000028
rmgr: XLOG        len (rec/tot):     24/    24, tx:          0, lsn: 0/020000F8, prev 0/020000D0, desc: SWITCH


On the current master branch, the contents of the two segments are the same:

$ md5sum 000000010000000000000002.node1 000000010000000000000002.node2

252e3e8cf3a85f218bb467005c565e3c  000000010000000000000002.node1
252e3e8cf3a85f218bb467005c565e3c  000000010000000000000002.node2

But on PG 10.6 the contents differ at the tails:

$ cmp 000000010000000000000002.node1 000000010000000000000002.node2

000000010000000000000002.node1 000000010000000000000002.node2 differ: byte 131073, line 1

while the logical part (the prefix) is the same - the lsn of the last record (SWITCH) in this segment is 0/02000130 and its length is 24 bytes.


Is this the correct behavior? And what would be the canonical form of the archive command, taking this issue into account?


-- 
Regards,
Maksim Milyutin
I call pg_ctl -D /var/lib/pgsql/10/data promote to upgrade the slave to master when failover happens. archive_mode is set to "on", not "always".
I repoint the slave to the master by stopping it, updating recovery.conf and restarting it. Let me know if I'm doing it wrong.

I think this problem is created before promotion, when the new slave is created using pg_basebackup with --wal-method=stream, and it manifests when the actual promotion happens.
What I'm trying to say is that the .partial extension does not seem to be the issue here, but rather the lack of a .done extension.

Wouldn't you agree that, given the directory contents below, when failover happens and the new slave is promoted to primary, the master created with the fetch method should not archive the 000000010000000000000002 segment because it is marked as .done, whereas with the stream method the segment is not marked as .done, so archiving would be attempted again? This obviously fails because the segment has already been archived by the initial master. If you disagree, can you explain the purpose behind the .done extension in the pg_wal/archive_status directory?

Below are the contents of the directories after the base backup but before promoting the new slave to master.

PRIMARY
/var/lib/pgsql/10/data/pg_wal/:
total 49156
-rw-------. 1 postgres postgres 16777216 Feb 15 13:39 000000010000000000000001
-rw-------. 1 postgres postgres 16777216 Feb 15 13:40 000000010000000000000002
-rw-------. 1 postgres postgres      302 Feb 15 13:40 000000010000000000000002.00000028.backup
-rw-------. 1 postgres postgres 16777216 Feb 15 13:40 000000010000000000000003
drwx------. 2 postgres postgres      133 Feb 15 13:40 archive_status

/var/lib/pgsql/10/data/pg_wal/archive_status:
total 0
-rw-------. 1 postgres postgres 0 Feb 15 13:39 000000010000000000000001.done
-rw-------. 1 postgres postgres 0 Feb 15 13:40 000000010000000000000002.00000028.backup.done
-rw-------. 1 postgres postgres 0 Feb 15 13:40 000000010000000000000002.done

WAL ARCHIVE
-rw-rw-rw-. 1 root root 16777216 Feb 15 13:39 000000010000000000000001
-rw-rw-rw-. 1 root root 16777216 Feb 15 13:40 000000010000000000000002
-rw-rw-rw-. 1 root root      302 Feb 15 13:40 000000010000000000000002.00000028.backup

NEW SLAVE (stream method)
-rw-------. 1 postgres root     16777216 Feb 15 13:40 000000010000000000000002
-rw-------. 1 postgres postgres 16777216 Feb 15 13:40 000000010000000000000003
drwx------. 2 postgres root            6 Feb 15 13:40 archive_status

/var/lib/pgsql/10/data/pg_wal/archive_status:
total 0


NEW SLAVE (fetch method)
-rw-------. 1 postgres root     16777216 Feb 15 13:24 000000010000000000000002
-rw-------. 1 postgres postgres 16777216 Feb 15 13:24 000000010000000000000003
drwx------. 2 postgres root           43 Feb 15 13:24 archive_status

/var/lib/pgsql/10/data/pg_wal/archive_status:
total 0
-rw-------. 1 postgres root 0 Feb 15 13:24 000000010000000000000002.done

Eventual failover(s) produced this structure on master:
/var/lib/pgsql/10/data/pg_wal/archive_status:
total 0
-rw-------. 1 postgres postgres 0 Feb 15 14:15 000000010000000000000002.ready

Hence the attempt to archive it again:
2019-02-15 14:15:58.872 PST [5369] DETAIL:  The failed archive command was: test ! -f /mnt/pgsql/archive/000000010000000000000002 && cp pg_wal/000000010000000000000002 /mnt/pgsql/archive/000000010000000000000002
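The role of the .done marker in this failure can be sketched in shell (illustrative only, not the server's actual code; the function name and the simplistic file matching are my assumptions): at promotion, a completed segment sitting in pg_wal with no .done marker in archive_status gets flagged .ready, i.e. handed to archive_command again.

```shell
# Sketch (not PostgreSQL source): segments in pg_wal that carry no
# .done marker in archive_status are flagged .ready ("archive me").
mark_unarchived_segments() {
    waldir="$1"
    for seg in "$waldir"/[0-9A-F]*; do
        [ -f "$seg" ] || continue
        name=$(basename "$seg")
        case "$name" in
            *.*) continue ;;            # skip .backup, .history, .partial
        esac
        if [ ! -e "$waldir/archive_status/$name.done" ]; then
            # No .done: the server believes this segment was never
            # archived, so archive_command will be run for it.
            touch "$waldir/archive_status/$name.ready"
        fi
    done
}
```

Run against the stream-method slave's pg_wal shown here, this would flag 000000010000000000000002 for re-archiving (hence the failing cp), while on the fetch-method slave the existing .done marker suppresses it.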

On Mon, Feb 18, 2019 at 9:03 AM Maksim Milyutin <milyutinma@gmail.com> wrote:

On 2/16/19 6:25 PM, Michael Paquier wrote:

On Sat, Feb 16, 2019 at 12:26:13AM +0000, PG Bug reporting form wrote:
When new slave is created by taking base backup from the primary using
pg_basebackup with --wal-method=stream option the WAL file generated during
the backup is different (as compared with diff or cmp command) than that on
the master and in WAL archive directory. Furthermore, this file does not
exist in pg_wal/archive_status with .done extension on new slave, though it
exists in pg_wal directory, resulting in failed attempt to archive this file
when slave node is promoted as master node.
2019-02-15 14:15:58.872 PST [5369] DETAIL:  The failed archive command was:
test ! -f /mnt/pgsql/archive/000000010000000000000002 && cp
pg_wal/000000010000000000000002
/mnt/pgsql/archive/000000010000000000000002
How do you promote your standby?  In Postgres 10, the last, partial
WAL segment of a past timeline generated at promotion is renamed
.partial to avoid any conflicts, so as this should normally not
happen if you do not use archive_mode = always.


This issue concerns only segment that contains wal records from pg_basebackup. For example, I have reproduced this issue on master branch getting the following content of archive directory after standby promoting:

000000010000000000000001.node1
000000010000000000000002.00000028.backup.node1
000000010000000000000002.node1
000000010000000000000002.node2
000000010000000000000003.node1
000000010000000000000004.node1
000000010000000000000005.partial.node2
000000020000000000000005.node2
00000002.history.node2

Each wal segment specifically was appended with .<node> suffix to distinguish its source. Initially the node1 was master, node2 - standby. The segment 000000010000000000000002 has the following content:

rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/02000028, prev 0/015DB060, desc: RUNNING_XACTS nextXid 474 latestCompletedXid 473 oldestRunningXid 474
rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: 0/02000060, prev 0/02000028, desc: CHECKPOINT_ONLINE redo 0/2000028; tli 1; prev tli 1; fpw true; xid 0:474; oid 12690; multi 1; offset 0; oldest xid 467 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 474; online
rmgr: XLOG        len (rec/tot):     34/    34, tx:          0, lsn: 0/020000D0, prev 0/02000060, desc: BACKUP_END 0/2000028
rmgr: XLOG        len (rec/tot):     24/    24, tx:          0, lsn: 0/020000F8, prev 0/020000D0, desc: SWITCH


On the current state of master branch the content of two segments are the same:

$ md5sum 000000010000000000000002.node1 000000010000000000000002.node2

252e3e8cf3a85f218bb467005c565e3c  000000010000000000000002.node1
252e3e8cf3a85f218bb467005c565e3c  000000010000000000000002.node2

But on PG 10.6 the contents are differentiated on the tails:

$ cmp 000000010000000000000002.node1 000000010000000000000002.node2

000000010000000000000002.node1 000000010000000000000002.node2 differ: byte 131073, line 1

while logical part (prefix) are the same - the lsn of last record (SWITCH) in this segment is 0/02000130 and have length 24 bytes.


Is it correct behavior? What would be the correct canonical representation of archive command taking into account this issue?


-- 
Regards,
Maksim Milyutin


--

Andre Piwoni

Sr. Software Developer, BI/Database

WebMD Health Services

Mobile: 801.541.4722

www.webmdhealthservices.com

By the way, you can verify this structure, including the now-explained diff in the WAL segment taken during backup, for the fetch vs. stream WAL methods by creating a master and then creating a new slave with a base backup taken using each method:

pg_basebackup -h new_slave -D /var/lib/pgsql/10/data -v -P -R -U replication --wal-method=stream
pg_basebackup -h new_slave -D /var/lib/pgsql/10/data -v -P -R -U replication --wal-method=fetch

Please note that to see these differences I don't even have to perform a failover; however, these structural (.done extension) differences seem to affect subsequent failovers. It seems that the promoted slave uses its own pg_wal directory with archive_status to determine what should be archived next.

On Tue, Feb 19, 2019 at 9:25 AM Andre Piwoni <apiwoni@webmd.net> wrote:
I call pg_ctl -D /var/lib/pgsql/10/data promote to upgrade slave to master when failover happens. archive_mode is set to "on" and not "always".
I repoint slave to the master by stopping it, updating recovery.conf and restarting it. Let me know if I'm doing it wrong.

I think this problem is created before promotion when new slave is created using pg_basebackup with --wal-method=stream and manifests when actual promotion happens.
What I'm trying to say it does not seem that .partial extension is the issue here but lack of .done extension.

Wouldn't you agree that given directory contents below, when failover happens and new slave is promoted to the primary then with fetch mode master would should not archive 000000010000000000000002 segment because it is marked as .done but in stream method segment is not marked as .done so it would be attempted to be archived again? This obviously fails because segment has been archived by initial master. If you disagree, can you explain the purpose behind .done extension in pg_wal/archive_status directory?

Below are contents of directories after base backup but before promoting new slave to master status..

PRIMARY
/var/lib/pgsql/10/data/pg_wal/:
total 49156
-rw-------. 1 postgres postgres 16777216 Feb 15 13:39 000000010000000000000001
-rw-------. 1 postgres postgres 16777216 Feb 15 13:40 000000010000000000000002
-rw-------. 1 postgres postgres      302 Feb 15 13:40 000000010000000000000002.00000028.backup
-rw-------. 1 postgres postgres 16777216 Feb 15 13:40 000000010000000000000003
drwx------. 2 postgres postgres      133 Feb 15 13:40 archive_status

/var/lib/pgsql/10/data/pg_wal/archive_status:
total 0
-rw-------. 1 postgres postgres 0 Feb 15 13:39 000000010000000000000001.done
-rw-------. 1 postgres postgres 0 Feb 15 13:40 000000010000000000000002.00000028.backup.done
-rw-------. 1 postgres postgres 0 Feb 15 13:40 000000010000000000000002.done

WAL ARCHIVE
-rw-rw-rw-. 1 root root 16777216 Feb 15 13:39 000000010000000000000001
-rw-rw-rw-. 1 root root 16777216 Feb 15 13:40 000000010000000000000002
-rw-rw-rw-. 1 root root      302 Feb 15 13:40 000000010000000000000002.00000028.backup

NEW SLAVE (stream method)
-rw-------. 1 postgres root     16777216 Feb 15 13:40 000000010000000000000002
-rw-------. 1 postgres postgres 16777216 Feb 15 13:40 000000010000000000000003
drwx------. 2 postgres root            6 Feb 15 13:40 archive_status

/var/lib/pgsql/10/data/pg_wal/archive_status:
total 0


NEW SLAVE (fetch method)
-rw-------. 1 postgres root     16777216 Feb 15 13:24 000000010000000000000002
-rw-------. 1 postgres postgres 16777216 Feb 15 13:24 000000010000000000000003
drwx------. 2 postgres root           43 Feb 15 13:24 archive_status

/var/lib/pgsql/10/data/pg_wal/archive_status:
total 0
-rw-------. 1 postgres root 0 Feb 15 13:24 000000010000000000000002.done

Eventual failover(s) produced this structure on master:
/var/lib/pgsql/10/data/pg_wal/archive_status:
total 0
-rw-------. 1 postgres postgres 0 Feb 15 14:15 000000010000000000000002.ready

Hence attempt to archive again:
2019-02-15 14:15:58.872 PST [5369] DETAIL:  The failed archive command was: test ! -f /mnt/pgsql/archive/000000010000000000000002 && cp pg_wal/000000010000000000000002 /mnt/pgsql/archive/000000010000000000000002

On Mon, Feb 18, 2019 at 9:03 AM Maksim Milyutin <milyutinma@gmail.com> wrote:

On 2/16/19 6:25 PM, Michael Paquier wrote:

On Sat, Feb 16, 2019 at 12:26:13AM +0000, PG Bug reporting form wrote:
When new slave is created by taking base backup from the primary using
pg_basebackup with --wal-method=stream option the WAL file generated during
the backup is different (as compared with diff or cmp command) than that on
the master and in WAL archive directory. Furthermore, this file does not
exist in pg_wal/archive_status with .done extension on new slave, though it
exists in pg_wal directory, resulting in failed attempt to archive this file
when slave node is promoted as master node.
2019-02-15 14:15:58.872 PST [5369] DETAIL:  The failed archive command was:
test ! -f /mnt/pgsql/archive/000000010000000000000002 && cp
pg_wal/000000010000000000000002
/mnt/pgsql/archive/000000010000000000000002
How do you promote your standby?  In Postgres 10, the last, partial
WAL segment of a past timeline generated at promotion is renamed
.partial to avoid any conflicts, so as this should normally not
happen if you do not use archive_mode = always.


This issue concerns only segment that contains wal records from pg_basebackup. For example, I have reproduced this issue on master branch getting the following content of archive directory after standby promoting:

000000010000000000000001.node1
000000010000000000000002.00000028.backup.node1
000000010000000000000002.node1
000000010000000000000002.node2
000000010000000000000003.node1
000000010000000000000004.node1
000000010000000000000005.partial.node2
000000020000000000000005.node2
00000002.history.node2

Each wal segment specifically was appended with .<node> suffix to distinguish its source. Initially the node1 was master, node2 - standby. The segment 000000010000000000000002 has the following content:

rmgr: Standby     len (rec/tot):     50/    50, tx:          0, lsn: 0/02000028, prev 0/015DB060, desc: RUNNING_XACTS nextXid 474 latestCompletedXid 473 oldestRunningXid 474
rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: 0/02000060, prev 0/02000028, desc: CHECKPOINT_ONLINE redo 0/2000028; tli 1; prev tli 1; fpw true; xid 0:474; oid 12690; multi 1; offset 0; oldest xid 467 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 474; online
rmgr: XLOG        len (rec/tot):     34/    34, tx:          0, lsn: 0/020000D0, prev 0/02000060, desc: BACKUP_END 0/2000028
rmgr: XLOG        len (rec/tot):     24/    24, tx:          0, lsn: 0/020000F8, prev 0/020000D0, desc: SWITCH


On the current state of master branch the content of two segments are the same:

$ md5sum 000000010000000000000002.node1 000000010000000000000002.node2

252e3e8cf3a85f218bb467005c565e3c  000000010000000000000002.node1
252e3e8cf3a85f218bb467005c565e3c  000000010000000000000002.node2

But on PG 10.6 the contents are differentiated on the tails:

$ cmp 000000010000000000000002.node1 000000010000000000000002.node2

000000010000000000000002.node1 000000010000000000000002.node2 differ: byte 131073, line 1

while logical part (prefix) are the same - the lsn of last record (SWITCH) in this segment is 0/02000130 and have length 24 bytes.


Is it correct behavior? What would be the correct canonical representation of archive command taking into account this issue?


-- 
Regards,
Maksim Milyutin


--

Andre Piwoni

Sr. Software Developer, BI/Database

WebMD Health Services

Mobile: 801.541.4722

www.webmdhealthservices.com



--

Andre Piwoni

Sr. Software Developer, BI/Database

WebMD Health Services

Mobile: 801.541.4722

www.webmdhealthservices.com

On Tue, Feb 19, 2019 at 09:25:57AM -0800, Andre Piwoni wrote:
> I call pg_ctl -D /var/lib/pgsql/10/data promote to upgrade slave to master
> when failover happens. archive_mode is set to "on" and not "always".
> I repoint slave to the master by stopping it, updating recovery.conf and
> restarting it. Let me know if I'm doing it wrong.

As long as you stop the primary cleanly (a clean stop, in smart or fast mode), so that the primary has the opportunity to send its shutdown checkpoint record to the standby, and you make sure that the standby has flushed the record, that's safe.

> I think this problem is created before promotion when new slave is created
> using pg_basebackup with --wal-method=stream and manifests when actual
> promotion happens.
> What I'm trying to say it does not seem that .partial extension is the
> issue here but lack of .done extension.

Well, sure.  If you begin by reusing an old backup, you risk archiving the same segment multiple times if you use the same archive location for all your servers.  Since 9.5 this can get even more complicated, as archive_mode has gained an "always" mode which makes standbys archive segments while in recovery as well, giving users a switch for more archiving redundancy, which is useful when working with async standbys across multiple sites.  My point is that this stuff has always worked this way.  And people around do not actually complain about the difference made for archive_status/ when using the stream or fetch methods with pg_basebackup.  From what I can see as well, your archive_command is actually unsafe on many points, so my take is that you should design it more carefully, or rely on an existing backup solution developed by experts in PostgreSQL backups.  And no, it is not safe to change a behavior that other people may rely on heavily for their solutions since pg_basebackup got smarter with its stream mode.
--
Michael

The reason you may not see a lot of people complaining about fetch vs. stream mode while using pg_basebackup seems to be largely related to the way many people implement failover. Namely, if you look at the majority of failover scripts online, when failover happens these scripts take a pg_basebackup (directly or via repmgr) on the slave(s), as opposed to using WAL archiving plus streaming for recovery, which is extremely inefficient for large databases. See my point here: https://www.pgpool.net/pipermail/pgpool-general/2019-February/006464.html This is ridiculous for a WAL-based system, but I suspect a lot of people are bumping into the same issue, given that you mentioned this behavior has not changed.

Also, what do you mean that my archive command is not safe? It is straight from the PostgreSQL docs and other resources:
https://www.postgresql.org/docs/10/continuous-archiving.html
https://www.opsdash.com/blog/postgresql-wal-archiving-backup.html

The PostgreSQL docs do not mandate a separate WAL archive location for each server. There is a mention of making sure, when cleaning up the WAL archive location, that all slaves have processed the segments, which makes sense. I'm not going to use replication slots either, because they require very sophisticated monitoring and maintenance to ensure database space does not blow up.

So far I have had no problems doing multiple failovers when creating slaves with a base backup in fetch mode, but your response is not reassuring. The only downside that I see from the docs is that the primary may have removed a WAL segment during heavy load (default max_wal_size is 1GB), but then pg_basebackup would fail.

It seems you agree that a segment created during pg_basebackup with the stream method, which is not marked with a .done extension, would be archived again. I also understand that the stream method is the default during backup, so changing this behavior may have an impact. If there's any free solution you could recommend for a backup strategy, please let me know.

Thanks,
Andre Piwoni

On Tue, Feb 19, 2019 at 7:50 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Feb 19, 2019 at 09:25:57AM -0800, Andre Piwoni wrote:
> I call pg_ctl -D /var/lib/pgsql/10/data promote to upgrade slave to master
> when failover happens. archive_mode is set to "on" and not "always".
> I repoint slave to the master by stopping it, updating recovery.conf and
> restarting it. Let me know if I'm doing it wrong.

As long as you stop the primary cleanly (stop or smart mode) so as the
primary has the possibility to send its shutdown checkpoint record to
the standby and makes sure that the standby has flushed the record,
that's safe.

> I think this problem is created before promotion when new slave is created
> using pg_basebackup with --wal-method=stream and manifests when actual
> promotion happens.
> What I'm trying to say it does not seem that .partial extension is the
> issue here but lack of .done extension.

Well, sure.  If you begin by reusing an old backup, you risk
archiving the same segment multiple times if you use the same archive
location for all your servers.  Since 9.5 this can get even more
complicated, as archive_mode has gained an "always" mode which makes
standbys archive segments while in recovery as well, giving users a
switch for more archiving redundancy; that is useful when working with
async standbys across multiple sites.  My point is that this stuff has
always worked this way.  And people around do not actually complain
about the difference made for archive_status/ when using the stream or
fetch methods with pg_basebackup.  From what I can see as well, your
archive_command is actually unsafe on many points, so my take is that
you should design it more carefully, or rely on an existing backup
solution developed by experts in PostgreSQL backups.  And no, it is
not safe to change a behavior that other people may rely on heavily
for their solutions since pg_basebackup got smarter with its stream
mode.
--
Michael


--

Andre Piwoni

Sr. Software Developer, BI/Database

WebMD Health Services

Mobile: 801.541.4722

www.webmdhealthservices.com

On Wed, Feb 20, 2019 at 09:58:12AM -0800, Andre Piwoni wrote:
> Also, what do you mean that my archive command is not safe? It is straight
> from the PostgreSQL docs and other resources:
> https://www.postgresql.org/docs/10/continuous-archiving.html
> https://www.opsdash.com/blog/postgresql-wal-archiving-backup.html

The docs are not the best reference on this point...  One immediate
problem to note is that cp does not flush the segment.  So if your
host crashes before the OS flushes the copied data, the WAL segment
data is lost.  The flush needs to happen before the archive command
returns its status to the server.
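
As a sketch of what a safer command could look like, here is a hypothetical helper (the name, the destination layout, and the file-argument form of sync, which needs GNU coreutils 8.24 or newer, are all assumptions): it copies under a temporary name, flushes, and only then renames into place.

```shell
# archive_wal SRC FNAME DESTDIR: durably copy WAL file SRC to DESTDIR/FNAME.
# Returns non-zero on any failure so the server retries the segment later.
archive_wal() {
    src=$1; fname=$2; destdir=$3
    dest="$destdir/$fname"
    [ ! -e "$dest" ] || return 1      # never overwrite an archived segment
    cp "$src" "$dest.tmp" || return 1
    sync "$dest.tmp" || return 1      # flush data before claiming success
    mv "$dest.tmp" "$dest"            # rename into the final name
}
```

A small wrapper script calling this function could then be set as archive_command, with %p and %f passed through as its arguments.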

> It seems you agree that segment created during pg_basebackup with stream
> mode, which is not marked with .done extension, would be attempted to be
> archived again. I also understand stream method during backup is default so
> changing this behavior may have an impact. If there's any free solution you
> could recommend for backup strategy please let me know.

pgBackRest and repmgr are tools mentioned on the lists.
--
Michael


On 2/19/19 9:08 PM, Andre Piwoni wrote:

By the way, you can verify this structure, including the now-explained diff in the WAL segment taken during backup, for the fetch vs. stream WAL methods by creating a master and then creating a new slave by taking a base backup using both methods:

pg_basebackup -h new_slave -D /var/lib/pgsql/10/data -v -P -R -U replication --wal-method=stream
pg_basebackup -h new_slave -D /var/lib/pgsql/10/data -v -P -R -U replication --wal-method=fetch


I have verified on different versions of PostgreSQL whether the conflicting segments received from the master node and from pg_basebackup --wal-method=stream are identical. On PG11+ they are binary identical. However, on PG10 and lower the segments are binary different: the logical content is the same, while the meaningless tails differ.

In the case of pg_basebackup --wal-method=fetch the segments are entirely the same, and this case doesn't produce any problem in archiving.
I think backup tools, as well as custom archive scripts, have to take account of these points.
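
One way to see that such differences sit near the end of the file is to locate the first differing byte; a minimal sketch (the `cmp -l` output format, where the first column is the 1-based offset, is assumed):

```shell
# first_diff FILE_A FILE_B: print the 1-based offset of the first differing
# byte, or nothing when the files are byte-for-byte identical.
first_diff() {
    cmp -l "$1" "$2" 2>/dev/null | head -n 1 | awk '{ print $1 }'
}
```

For WAL segments, comparing the pg_waldump output of the two files is a more direct way to check that the logical contents match.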

-- 
Regards,
Maksim Milyutin
Thank you for checking. The binary diffs I had were small and at the tail even though there were no changes made to the database, so it makes sense that these diffs were logically meaningless.

Were you able to confirm that a slave created with --wal-method=fetch had a record in the pg_wal/archive_status directory with the .done extension after running pg_basebackup, and no such record with --wal-method=stream? I'm also wondering if PG11 has the same behavior.
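
A quick way to check this on any node is to list the status markers directly; a small sketch (standard data-directory layout assumed, helper name is made up):

```shell
# wal_status DATADIR: for each WAL file with a status marker, print its
# name and whether it is still pending archiving (.ready) or done (.done).
wal_status() {
    for f in "$1"/pg_wal/archive_status/*.ready \
             "$1"/pg_wal/archive_status/*.done; do
        [ -e "$f" ] || continue          # skip unmatched glob patterns
        case $f in
            *.ready) echo "$(basename "$f" .ready) pending" ;;
            *.done)  echo "$(basename "$f" .done) done" ;;
        esac
    done
}
```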

On Thu, Feb 21, 2019 at 10:28 AM Maksim Milyutin <milyutinma@gmail.com> wrote:




On 2/21/19 9:57 PM, Andre Piwoni wrote:

> Were you able to confirm that a slave created with --wal-method=fetch
> had a record in the pg_wal/archive_status directory with the .done
> extension after running pg_basebackup, and no such record with
> --wal-method=stream? I'm also wondering if PG11 has the same behavior.


Yes, that's so. The behavior is the same on PG11 and on the master branch.


-- 
Regards,
Maksim Milyutin