Thread: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

From

francesco.canovai@2ndquadrant.it

Date:

06 July 2016, 18:08:06

The following bug has been logged on the website:

Bug reference:      14230
Logged by:          Francesco Canovai
Email address:      francesco.canovai@2ndquadrant.it
PostgreSQL version: 9.6beta2
Operating system:   Linux
Description:

I'm taking a concurrent backup from a standby in PostgreSQL beta2 and I get
the wrong timeline from pg_stop_backup(false).

This is what I'm doing:

1) I set up an environment with a primary server and a replica in streaming
replication.

2) On the replica, I run

postgres=# SELECT pg_start_backup('test_backup', true, false);
 pg_start_backup
-----------------
 0/3000A00
(1 row)

3) When I run pg_stop_backup, it returns a start wal location belonging to a
file with timeline 0.

postgres=# SELECT pg_stop_backup(false);
                              pg_stop_backup

---------------------------------------------------------------------------
 (0/3000AE0,"START WAL LOCATION: 0/3000A00 (file
000000000000000000000003)+
 CHECKPOINT LOCATION: 0/3000A38
+
 BACKUP METHOD: streamed
+
 BACKUP FROM: standby
+
 START TIME: 2016-07-06 16:44:31 CEST
+
 LABEL: test_backup
+
 ","")
(1 row)

The timeline returned is fine (is 1) when running the same commands on the
master.

An incorrect backup label doesn't prevent PostgreSQL from starting up, but
it affects the tools using that information.

Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

From

Marco Nenciarini

Date:

06 July 2016, 18:38:04

Hi,

On 06/07/16 17:07, francesco.canovai@2ndquadrant.it wrote:
> The following bug has been logged on the website:
>
> Bug reference:      14230
> Logged by:          Francesco Canovai
> Email address:      francesco.canovai@2ndquadrant.it
> PostgreSQL version: 9.6beta2
> Operating system:   Linux
> Description:
>
> I'm taking a concurrent backup from a standby in PostgreSQL beta2 and I get
> the wrong timeline from pg_stop_backup(false).
>
> This is what I'm doing:
>
> 1) I set up an environment with a primary server and a replica in streaming
> replication.
>
> 2) On the replica, I run
>
> postgres=# SELECT pg_start_backup('test_backup', true, false);
>  pg_start_backup
> -----------------
>  0/3000A00
> (1 row)
>
> 3) When I run pg_stop_backup, it returns a start wal location belonging to a
> file with timeline 0.
>
> postgres=# SELECT pg_stop_backup(false);
>                               pg_stop_backup
>
> ---------------------------------------------------------------------------
>  (0/3000AE0,"START WAL LOCATION: 0/3000A00 (file
> 000000000000000000000003)+
>  CHECKPOINT LOCATION: 0/3000A38
> +
>  BACKUP METHOD: streamed
> +
>  BACKUP FROM: standby
> +
>  START TIME: 2016-07-06 16:44:31 CEST
> +
>  LABEL: test_backup
> +
>  ","")
> (1 row)
>
> The timeline returned is fine (is 1) when running the same commands on the
> master.
>
> An incorrect backup label doesn't prevent PostgreSQL from starting up, but
> it affects the tools using that information.
>
>

The issue here is that the do_pg_stop_backup function uses the
ThisTimeLineID variable that is not valid on standbys.

I think that it should read it from
ControlFile->checkPointCopy.ThisTimeLineID as we do in do_pg_start_backup.

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciarini@2ndQuadrant.it | www.2ndQuadrant.it
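
[Editor's sketch] The report says an incorrect backup label affects the tools
that read it. As a rough illustration of why, here is a minimal Python sketch
of how a backup tool might parse the START WAL LOCATION line and notice the
problem. The function names parse_start_line and has_valid_timeline are
hypothetical, not part of PostgreSQL or of any tool mentioned in this thread.
It assumes the label line is unwrapped (the psql output quoted above is
wrapped by the mail client) and relies on the first 8 hex digits of a WAL
file name being the timeline ID, which starts at 1 on a real cluster.

```python
import re

# First line of a backup_label, in the form shown in the report.
START_RE = re.compile(
    r"START WAL LOCATION: ([0-9A-F]+/[0-9A-F]+) \(file ([0-9A-F]{24})\)"
)


def parse_start_line(label):
    """Return (start_lsn, wal_file, timeline) parsed from backup_label text."""
    match = START_RE.search(label)
    if match is None:
        raise ValueError("no START WAL LOCATION line found")
    lsn, wal_file = match.group(1), match.group(2)
    # The first 8 hex digits of a WAL segment file name are the timeline ID.
    timeline = int(wal_file[:8], 16)
    return lsn, wal_file, timeline


def has_valid_timeline(label):
    """Hypothetical sanity check: timeline 0 never names a real WAL file."""
    return parse_start_line(label)[2] >= 1


if __name__ == "__main__":
    reported = (
        "START WAL LOCATION: 0/3000A00 (file 000000000000000000000003)\n"
        "CHECKPOINT LOCATION: 0/3000A38\n"
    )
    print(parse_start_line(reported))
    print(has_valid_timeline(reported))
```

A tool that trusts the file name as-is would go looking for a segment on
timeline 0, which does not exist in the archive.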

Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

From

Marco Nenciarini

Date:

06 July 2016, 18:42:06

On 06/07/16 17:37, Marco Nenciarini wrote:
> Hi,
>
> On 06/07/16 17:07, francesco.canovai@2ndquadrant.it wrote:
>> The following bug has been logged on the website:
>>
>> Bug reference:      14230
>> Logged by:          Francesco Canovai
>> Email address:      francesco.canovai@2ndquadrant.it
>> PostgreSQL version: 9.6beta2
>> Operating system:   Linux
>> Description:
>>
>> I'm taking a concurrent backup from a standby in PostgreSQL beta2 and I get
>> the wrong timeline from pg_stop_backup(false).
>>
>> This is what I'm doing:
>>
>> 1) I set up an environment with a primary server and a replica in streaming
>> replication.
>>
>> 2) On the replica, I run
>>
>> postgres=# SELECT pg_start_backup('test_backup', true, false);
>>  pg_start_backup
>> -----------------
>>  0/3000A00
>> (1 row)
>>
>> 3) When I run pg_stop_backup, it returns a start wal location belonging to a
>> file with timeline 0.
>>
>> postgres=# SELECT pg_stop_backup(false);
>>                               pg_stop_backup
>>
>> ---------------------------------------------------------------------------
>>  (0/3000AE0,"START WAL LOCATION: 0/3000A00 (file
>> 000000000000000000000003)+
>>  CHECKPOINT LOCATION: 0/3000A38
>> +
>>  BACKUP METHOD: streamed
>> +
>>  BACKUP FROM: standby
>> +
>>  START TIME: 2016-07-06 16:44:31 CEST
>> +
>>  LABEL: test_backup
>> +
>>  ","")
>> (1 row)
>>
>> The timeline returned is fine (is 1) when running the same commands on the
>> master.
>>
>> An incorrect backup label doesn't prevent PostgreSQL from starting up, but
>> it affects the tools using that information.
>>
>>
>
> The issue here is that the do_pg_stop_backup function uses the
> ThisTimeLineID variable that is not valid on standbys.
>
> I think that it should read it from
> ControlFile->checkPointCopy.ThisTimeLineID as we do in do_pg_start_backup.
>

No, that's not the solution.

The backup_label is generated during the do_pg_start_backup call, so
the copy in ControlFile->checkPointCopy.ThisTimeLineID is also
uninitialized.

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciarini@2ndQuadrant.it | www.2ndQuadrant.it

Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

From

Marco Nenciarini

Date:

06 July 2016, 18:57:46

On 06/07/16 17:41, Marco Nenciarini wrote:
> On 06/07/16 17:37, Marco Nenciarini wrote:
>> Hi,
>>
>> On 06/07/16 17:07, francesco.canovai@2ndquadrant.it wrote:
>>> The following bug has been logged on the website:
>>>
>>> Bug reference:      14230
>>> Logged by:          Francesco Canovai
>>> Email address:      francesco.canovai@2ndquadrant.it
>>> PostgreSQL version: 9.6beta2
>>> Operating system:   Linux
>>> Description:
>>>
>>> I'm taking a concurrent backup from a standby in PostgreSQL beta2 and I get
>>> the wrong timeline from pg_stop_backup(false).
>>>
>>> This is what I'm doing:
>>>
>>> 1) I set up an environment with a primary server and a replica in streaming
>>> replication.
>>>
>>> 2) On the replica, I run
>>>
>>> postgres=# SELECT pg_start_backup('test_backup', true, false);
>>>  pg_start_backup
>>> -----------------
>>>  0/3000A00
>>> (1 row)
>>>
>>> 3) When I run pg_stop_backup, it returns a start wal location belonging to a
>>> file with timeline 0.
>>>
>>> postgres=# SELECT pg_stop_backup(false);
>>>                               pg_stop_backup
>>>
>>> ---------------------------------------------------------------------------
>>>  (0/3000AE0,"START WAL LOCATION: 0/3000A00 (file
>>> 000000000000000000000003)+
>>>  CHECKPOINT LOCATION: 0/3000A38
>>> +
>>>  BACKUP METHOD: streamed
>>> +
>>>  BACKUP FROM: standby
>>> +
>>>  START TIME: 2016-07-06 16:44:31 CEST
>>> +
>>>  LABEL: test_backup
>>> +
>>>  ","")
>>> (1 row)
>>>
>>> The timeline returned is fine (is 1) when running the same commands on the
>>> master.
>>>
>>> An incorrect backup label doesn't prevent PostgreSQL from starting up, but
>>> it affects the tools using that information.
>>>
>>>
>>
>> The issue here is that the do_pg_stop_backup function uses the
>> ThisTimeLineID variable that is not valid on standbys.
>>
>> I think that it should read it from
>> ControlFile->checkPointCopy.ThisTimeLineID as we do in do_pg_start_backup.
>>
>
> No, that's not the solution.
>
> The backup_label is generated during the do_pg_start_backup call, so
> also the copy in  ControlFile->checkPointCopy.ThisTimeLineID is
> uninitialized.
>

After further analysis, the issue is that we retrieve the starttli from
the ControlFile structure, but it was using ThisTimeLineID when writing
the backup label.

I've attached a very simple patch that fixes it.

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciarini@2ndQuadrant.it | www.2ndQuadrant.it

Attachment
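
[Editor's sketch] The message above attributes the wrong file name to the
timeline value used when the backup label is written. The following Python
sketch shows how a WAL segment file name can be derived from a timeline ID
and an LSN, and how a timeline of 0 produces the name seen in the report. It
assumes the default 16 MB WAL segment size; parse_lsn and wal_file_name are
illustrative helper names, not PostgreSQL functions.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default segment size


def parse_lsn(text):
    """Convert an LSN such as '0/3000A00' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(timeline, lsn, seg_size=WAL_SEGMENT_SIZE):
    """Build the 24-hex-digit WAL file name for the segment containing lsn."""
    segno = lsn // seg_size
    segs_per_xlogid = 0x100000000 // seg_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


if __name__ == "__main__":
    start = parse_lsn("0/3000A00")
    # Timeline 0, as ThisTimeLineID is on a standby: the name from the report.
    print(wal_file_name(0, start))
    # Timeline 1, the value the report sees on the master.
    print(wal_file_name(1, start))
```

Only the first 8 digits differ between the two names, which is why the
defect shows up in the backup label's file name and not in the LSN itself.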

Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

From

Michael Paquier

Date:

07 July 2016, 09:38:35

On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini
<marco.nenciarini@2ndquadrant.it> wrote:
> After further analysis, the issue is that we retrieve the starttli from
> the ControlFile structure, but it was using ThisTimeLineID when writing
> the backup label.
>
> I've attached a very simple patch that fixes it.

ThisTimeLineID is always set at 0 on purpose on a standby, so we
cannot rely on it (well it is set temporarily when recycling old
segments). At recovery when parsing the backup_label file there is no
actual use of the start segment name, so that's only a cosmetic
change. But surely it would be better to get that fixed, because
that's useful for debugging.

While looking at your patch, I thought that it would have been
tempting to use GetXLogReplayRecPtr() to get the timeline ID when in
recovery, but what we really want to know here is the timeline of the
last REDO pointer, which is starttli, and that's more consistent with
the fact that we use startpoint when writing the backup_label file. In
short, +1 for this fix.

I am adding that in the list of open items, adding Magnus in CC whose
commit for non-exclusive backups is at the origin of this defect.
-- 
Michael

Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

From

Marco Nenciarini

Date:

08 July 2016, 12:40:51

On 07/07/16 08:38, Michael Paquier wrote:
> On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini
> <marco.nenciarini@2ndquadrant.it> wrote:
>> After further analysis, the issue is that we retrieve the starttli from
>> the ControlFile structure, but it was using ThisTimeLineID when writing
>> the backup label.
>>
>> I've attached a very simple patch that fixes it.
>
> ThisTimeLineID is always set at 0 on purpose on a standby, so we
> cannot rely on it (well it is set temporarily when recycling old
> segments). At recovery when parsing the backup_label file there is no
> actual use of the start segment name, so that's only a cosmetic
> change. But surely it would be better to get that fixed, because
> that's useful for debugging.
>
> While looking at your patch, I thought that it would have been
> tempting to use GetXLogReplayRecPtr() to get the timeline ID when in
> recovery, but what we really want to know here is the timeline of the
> last REDO pointer, which is starttli, and that's more consistent with
> the fact that we use startpoint when writing the backup_label file. In
> short, +1 for this fix.
>
> I am adding that in the list of open items, adding Magnus in CC whose
> commit for non-exclusive backups is at the origin of this defect.
>

While we were testing the patch we noticed another behavior that is not
strictly a bug, but can confuse backup tools:

To quickly produce some WAL files we were executing a series of
pg_switch_xlog+CHECKPOINT, and we noticed that doing a backup from a
standby after that results in a startpoint higher than the stoppoint.

Let me show it on a brand new master/replica cluster (master is port
5496, replica is 6496). The script is attached.

-------------------------------------------------------------------
You are now connected to database "postgres" as user "postgres" via
socket in "/tmp" at port "5496".
SELECT pg_is_in_recovery();
-[ RECORD 1 ]-----+--
pg_is_in_recovery | f

CHECKPOINT;
CHECKPOINT
SELECT pg_switch_xlog();
-[ RECORD 1 ]--+----------
pg_switch_xlog | 0/30000E8

CHECKPOINT;
CHECKPOINT
SELECT pg_switch_xlog();
-[ RECORD 1 ]--+----------
pg_switch_xlog | 0/40000E8

You are now connected to database "postgres" as user "postgres" via
socket in "/tmp" at port "6496".
SELECT pg_is_in_recovery();
-[ RECORD 1 ]-----+--
pg_is_in_recovery | t

SELECT pg_start_backup('tst backup',TRUE,FALSE);
-[ RECORD 1 ]---+----------
pg_start_backup | 0/4000028

SELECT * FROM pg_stop_backup(FALSE);
-[ RECORD 1 ]-------------------------------------------------------------
lsn        | 0/20000F8
labelfile  | START WAL LOCATION: 0/4000028 (file 000000000000000000000004)+
           | CHECKPOINT LOCATION: 0/4000060                               +
           | BACKUP METHOD: streamed                                      +
           | BACKUP FROM: standby                                         +
           | START TIME: 2016-07-08 10:46:55 CEST                         +
           | LABEL: tst backup                                            +
           |
spcmapfile |

SELECT * FROM pg_control_checkpoint();
-[ RECORD 1 ]--------+-------------------------
checkpoint_location  | 0/4000060
prior_location       | 0/2000060
redo_location        | 0/4000028
redo_wal_file        | 000000010000000000000004
timeline_id          | 1
prev_timeline_id     | 1
full_page_writes     | t
next_xid             | 0:865
next_oid             | 12670
next_multixact_id    | 1
next_multi_offset    | 0
oldest_xid           | 858
oldest_xid_dbid      | 1
oldest_active_xid    | 865
oldest_multi_xid     | 1
oldest_multi_dbid    | 1
oldest_commit_ts_xid | 865
newest_commit_ts_xid | 865
checkpoint_time      | 2016-07-08 10:46:55+02

SELECT * FROM pg_control_recovery();
-[ RECORD 1 ]-----------------+----------
min_recovery_end_location     | 0/20000F8
min_recovery_end_timeline     | 1
backup_start_location         | 0/0
backup_end_location           | 0/0
end_of_backup_record_required | f

-------------------------------------------------------------------

In particular, the pg_start_backup LSN is 0/4000028 and the
pg_stop_backup LSN is 0/20000F8.


The same issue is present when you do a backup using pg_basebackup:

-------------------------------------------------------------------
transaction log start point: 0/8000028 on timeline 1
pg_basebackup: starting background WAL receiver
22244/22244 kB (100%), 1/1 tablespace
transaction log end point: 0/20000F8
pg_basebackup: waiting for background process to finish streaming ...
pg_basebackup: base backup completed
-------------------------------------------------------------------

The resulting backup is working perfectly, because Postgres has no use
for pg_stop_backup LSN, but this can confuse any tool that uses the stop
LSN to figure out which WAL files are needed by the backup (in this case
the only file needed is the one containing the start checkpoint).

After some discussion with Álvaro, my proposal is to avoid that by
returning the stoppoint as the maximum between the startpoint and the
min_recovery_end_location, in case of backup from the standby.

The patch is once again a very simple one line diff.

I have attached both patches to this email, as in my opinion they should
go together, because the subject is the same: avoid giving misleading
information to backup tools.

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciarini@2ndQuadrant.it | www.2ndQuadrant.it

Attachment
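
[Editor's sketch] To illustrate how a stop LSN lower than the start LSN can
confuse a tool, here is a minimal Python sketch that computes the WAL
segment numbers between the two, using the values from the session above. It
also shows the clamping described in the proposal (stop point taken as the
maximum of the start point and the reported stop point). This is a sketch of
the idea as stated in the message, not the attached patch; the helper names
and the 16 MB segment size are assumptions.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default segment size


def parse_lsn(text):
    """Convert an LSN such as '0/4000028' into a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def segments_needed(start, stop, seg_size=WAL_SEGMENT_SIZE):
    """Return the segment numbers a naive tool would fetch for [start, stop]."""
    first = start // seg_size
    last = stop // seg_size
    return list(range(first, last + 1))


def clamped_stop(start, stop):
    """The proposal as described: never report a stop point below the start."""
    return max(start, stop)


if __name__ == "__main__":
    start = parse_lsn("0/4000028")  # pg_start_backup result on the standby
    stop = parse_lsn("0/20000F8")   # lsn returned by pg_stop_backup
    print(segments_needed(start, stop))
    print(segments_needed(start, clamped_stop(start, stop)))
```

With the raw values the naive range is empty, so such a tool would conclude
that no WAL is required; with the clamped stop point it selects the single
segment containing the start checkpoint.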

Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

From

Michael Paquier

Date:

08 July 2016, 14:10:45

On Fri, Jul 8, 2016 at 6:40 PM, Marco Nenciarini
<marco.nenciarini@2ndquadrant.it> wrote:
> The resulting backup is working perfectly, because Postgres has no use
> for pg_stop_backup LSN, but this can confuse any tool that uses the stop
> LSN to figure out which WAL files are needed by the backup (in this case
> the only file needed is the one containing the start checkpoint).
>
> After some discussion with Álvaro, my proposal is to avoid that by
> returning the stoppoint as the maximum between the startpoint and the
> min_recovery_end_location, in case of backup from the standby.

You are facing a pattern similar to the problem reported already on
this thread by Horiguchi-san:
http://www.postgresql.org/message-id/20160609.215558.118976703.horiguchi.kyotaro@lab.ntt.co.jp
And it seems to me that you are jumping to an incorrect conclusion:
what we'd want to do is update the minimum recovery point a bit more
aggressively on a node in recovery, in the case where no buffers are
flushed by other backends.
--
Michael

Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

From

Marco Nenciarini

Date:

08 July 2016, 15:22:46

On 08/07/16 13:10, Michael Paquier wrote:
> On Fri, Jul 8, 2016 at 6:40 PM, Marco Nenciarini
> <marco.nenciarini@2ndquadrant.it> wrote:
>> The resulting backup is working perfectly, because Postgres has no use
>> for pg_stop_backup LSN, but this can confuse any tool that uses the stop
>> LSN to figure out which WAL files are needed by the backup (in this case
>> the only file needed is the one containing the start checkpoint).
>>
>> After some discussion with Álvaro, my proposal is to avoid that by
>> returning the stoppoint as the maximum between the startpoint and the
>> min_recovery_end_location, in case of backup from the standby.
>
> You are facing a pattern similar to the problem reported already on
> this thread by Horiguchi-san:
> http://www.postgresql.org/message-id/20160609.215558.118976703.horiguchi.kyotaro@lab.ntt.co.jp
> And it seems to me that you are jumping to an incorrect conclusion,
> what we'd want to do is to update a bit more aggressively the minimum
> recovery point in cases on a node in recovery in the case where no
> buffers are flushed by other backends.
>

Yes, it is exactly the same bug. My proposal was based on the assumption
that it was only a cosmetic issue, but given that it can trigger
errors, I agree that the right solution is to advance the minimum
recovery point in that case.

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciarini@2ndQuadrant.it | www.2ndQuadrant.it

Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

From

Noah Misch

Date:

09 July 2016, 04:52:14

On Thu, Jul 07, 2016 at 03:38:26PM +0900, Michael Paquier wrote:
> On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini
> <marco.nenciarini@2ndquadrant.it> wrote:
> > After further analysis, the issue is that we retrieve the starttli from
> > the ControlFile structure, but it was using ThisTimeLineID when writing
> > the backup label.
> >
> > I've attached a very simple patch that fixes it.
> 
> ThisTimeLineID is always set at 0 on purpose on a standby, so we
> cannot rely on it (well it is set temporarily when recycling old
> segments). At recovery when parsing the backup_label file there is no
> actual use of the start segment name, so that's only a cosmetic
> change. But surely it would be better to get that fixed, because
> that's useful for debugging.
> 
> While looking at your patch, I thought that it would have been
> tempting to use GetXLogReplayRecPtr() to get the timeline ID when in
> recovery, but what we really want to know here is the timeline of the
> last REDO pointer, which is starttli, and that's more consistent with
> the fact that we use startpoint when writing the backup_label file. In
> short, +1 for this fix.
> 
> I am adding that in the list of open items, adding Magnus in CC whose
> commit for non-exclusive backups is at the origin of this defect.

[Action required within 72 hours.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item.  Magnus,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
9.6 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within 72 hours of this
message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
efforts toward speedy resolution.  Thanks.

[1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com

Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

From

Amit Kapila

Date:

09 July 2016, 09:30:23

On Thu, Jul 7, 2016 at 12:08 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini
> <marco.nenciarini@2ndquadrant.it> wrote:
>> After further analysis, the issue is that we retrieve the starttli from
>> the ControlFile structure, but it was using ThisTimeLineID when writing
>> the backup label.
>>
>> I've attached a very simple patch that fixes it.
>
> ThisTimeLineID is always set at 0 on purpose on a standby, so we
> cannot rely on it (well it is set temporarily when recycling old
> segments). At recovery when parsing the backup_label file there is no
> actual use of the start segment name, so that's only a cosmetic
> change. But surely it would be better to get that fixed, because
> that's useful for debugging.
>
> While looking at your patch, I thought that it would have been
> tempting to use GetXLogReplayRecPtr() to get the timeline ID when in
> recovery, but what we really want to know here is the timeline of the
> last REDO pointer, which is starttli, and that's more consistent with
> the fact that we use startpoint when writing the backup_label file. In
> short, +1 for this fix.
>

+1, the fix looks right to me as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

From

Magnus Hagander

Date:

09 July 2016, 19:54:47

On Jul 9, 2016 4:52 AM, "Noah Misch" <noah@leadboat.com> wrote:
>
> On Thu, Jul 07, 2016 at 03:38:26PM +0900, Michael Paquier wrote:
> > On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini
> > <marco.nenciarini@2ndquadrant.it> wrote:
> > > After further analysis, the issue is that we retrieve the starttli from
> > > the ControlFile structure, but it was using ThisTimeLineID when writing
> > > the backup label.
> > >
> > > I've attached a very simple patch that fixes it.
> >
> > ThisTimeLineID is always set at 0 on purpose on a standby, so we
> > cannot rely on it (well it is set temporarily when recycling old
> > segments). At recovery when parsing the backup_label file there is no
> > actual use of the start segment name, so that's only a cosmetic
> > change. But surely it would be better to get that fixed, because
> > that's useful for debugging.
> >
> > While looking at your patch, I thought that it would have been
> > tempting to use GetXLogReplayRecPtr() to get the timeline ID when in
> > recovery, but what we really want to know here is the timeline of the
> > last REDO pointer, which is starttli, and that's more consistent with
> > the fact that we use startpoint when writing the backup_label file. In
> > short, +1 for this fix.
> >
> > I am adding that in the list of open items, adding Magnus in CC whose
> > commit for non-exclusive backups is at the origin of this defect.
>
> [Action required within 72 hours.  This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 9.6 open item.  Magnus,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> 9.6 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within 72 hours of this
> message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping 9.6rc1.  Consequently, I will appreciate your
> efforts toward speedy resolution.  Thanks.
>
> [1] http://www.postgresql.org/message-id/20160527025039.GA447393@tornado.leadboat.com

I'll take a look at this on Monday when I'm back home from Russia. It looks
like people have it under control, so hopefully that just means committing
the available solution in which case it'll be finished by then.

/Magnus

Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

From

Magnus Hagander

Date:

11 July 2016, 13:01:48

On Thu, Jul 7, 2016 at 8:38 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini
> <marco.nenciarini@2ndquadrant.it> wrote:
> > After further analysis, the issue is that we retrieve the starttli from
> > the ControlFile structure, but it was using ThisTimeLineID when writing
> > the backup label.
> >
> > I've attached a very simple patch that fixes it.
>
> ThisTimeLineID is always set at 0 on purpose on a standby, so we
> cannot rely on it (well it is set temporarily when recycling old
> segments). At recovery when parsing the backup_label file there is no
> actual use of the start segment name, so that's only a cosmetic
> change. But surely it would be better to get that fixed, because
> that's useful for debugging.
>
> While looking at your patch, I thought that it would have been
> tempting to use GetXLogReplayRecPtr() to get the timeline ID when in
> recovery, but what we really want to know here is the timeline of the
> last REDO pointer, which is starttli, and that's more consistent with
> the fact that we use startpoint when writing the backup_label file. In
> short, +1 for this fix.
>
> I am adding that in the list of open items, adding Magnus in CC whose
> commit for non-exclusive backups is at the origin of this defect.

I agree this looks correct.

But isn't this also a pre-existing bug in 9.5? Or did we change something else that suddenly made it visible?

Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

From

Amit Kapila

Date:

11 July 2016, 14:27:07

On Mon, Jul 11, 2016 at 3:31 PM, Magnus Hagander <magnus@hagander.net> wrote:
>
>
> On Thu, Jul 7, 2016 at 8:38 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>> On Thu, Jul 7, 2016 at 12:57 AM, Marco Nenciarini
>> <marco.nenciarini@2ndquadrant.it> wrote:
>> > After further analysis, the issue is that we retrieve the starttli from
>> > the ControlFile structure, but it was using ThisTimeLineID when writing
>> > the backup label.
>> >
>> > I've attached a very simple patch that fixes it.
>>
>> ThisTimeLineID is always set at 0 on purpose on a standby, so we
>> cannot rely on it (well it is set temporarily when recycling old
>> segments). At recovery when parsing the backup_label file there is no
>> actual use of the start segment name, so that's only a cosmetic
>> change. But surely it would be better to get that fixed, because
>> that's useful for debugging.
>>
>> While looking at your patch, I thought that it would have been
>> tempting to use GetXLogReplayRecPtr() to get the timeline ID when in
>> recovery, but what we really want to know here is the timeline of the
>> last REDO pointer, which is starttli, and that's more consistent with
>> the fact that we use startpoint when writing the backup_label file. In
>> short, +1 for this fix.
>>
>> I am adding that in the list of open items, adding Magnus in CC whose
>> commit for non-exclusive backups is at the origin of this defect.
>
>
> I agree this looks correct.
>
> But isn't this also a pre-existing bug in 9.5? Or did we change something
> else that suddenly made it visible?
>

I think the bug is pre-existing, but it becomes visible to user now by new API.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

From

Michael Paquier

Date:

11 July 2016, 16:06:03

On Mon, Jul 11, 2016 at 7:01 PM, Magnus Hagander <magnus@hagander.net> wrote:
> But isn't this also a pre-existing bug in 9.5? Or did we change something
> else that suddenly made it visible?

What has been patched here is a defect caused by pg_start_backup(),
and not pg_basebackup. In the case of the latter, ThisTimeLineID gets
set by GetStandbyFlushRecPtr() in the context of the WAL sender used
to send the base backup. In short, this is only a defect of 9.6, where
pg_start_backup() can be used on standbys for the first time for
non-exclusive backups.

So the issue does not actually pre-exist, GetStandbyFlushRecPtr()
playing its role to set up the timeline ID.
-- 
Michael

Re: BUG #14230: Wrong timeline returned by pg_stop_backup on a standby

From

Magnus Hagander

Date:

11 July 2016, 16:15:20

On Mon, Jul 11, 2016 at 3:05 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Mon, Jul 11, 2016 at 7:01 PM, Magnus Hagander <magnus@hagander.net> wrote:
> > But isn't this also a pre-existing bug in 9.5? Or did we change something
> > else that suddenly made it visible?
>
> What has been patched here is a defect caused by pg_start_backup(),
> and not pg_basebackup. In the case of the latter, ThisTimelineID gets
> set by GetStandbyFlushRecPtr() in the context of the WAL sender used
> to send the base backup. In short, this is only a defect of 9.6, where
> pg_start_backup() can be used on standbys for the first time for
> non-exclusive backups.
>
> So the issue does not actually pre-exist, GetStandbyFlushRecPtr()
> playing its role to set up the timeline ID.

Ah, that's where we get it from. Gotcha, makes sense. Thanks for confirming!

Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/