Re: patch proposal - Mailing list pgsql-hackers

From Venkata B Nagothi
Subject Re: patch proposal
Date
Msg-id CAEyp7J9u8AqLSHK5Ce9bOvz51yG0-R0TJtWfLf353OSLDeb1+w@mail.gmail.com
In response to Re: [HACKERS] patch proposal  (David Steele <david@pgmasters.net>)
List pgsql-hackers
Hi David, 

On Thu, Mar 23, 2017 at 4:21 AM, David Steele <david@pgmasters.net> wrote:
On 3/21/17 8:45 PM, Venkata B Nagothi wrote:
On Tue, Mar 21, 2017 at 8:46 AM, David Steele <david@pgmasters.net> wrote:

    Unfortunately, I don't think the first patch (recoveryStartPoint)
    will work as currently implemented.  The problem I see is that the
    new function recoveryStartsHere() depends on pg_control containing a
    checkpoint right at the end of the backup.  There's no guarantee
    that this is true, even if pg_control is copied last.  That means a
    time, lsn, or xid that occurs somewhere in the middle of the backup
    can be selected without complaint from this code depending on timing.


Yes, that is true.  The latest checkpoint position, XID and timestamp of the
restored backup are what show up in pg_controldata, which means that is the
position from which the recovery would start.

Backup recovery starts from the checkpoint in the backup_label, not from the checkpoint in pg_control.  The original checkpoint that started the backup is generally overwritten in pg_control by the end of the backup.

Yes, I totally agree. My initial intention was to compare the recovery target position(s) with the contents of the backup_label, but then the checks would fail if the recovery is performed without the backup_label file. So I decided to compare the recovery target positions with the contents of the pg_control file instead.
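
To make that concrete, here is a minimal standalone sketch of the kind of comparison I mean for an LSN target (this is not the patch itself and does not use the server headers; the function name and example values are only illustrative). The checkpoint REDO location from the restored pg_control is treated as the earliest position replay can start from:

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/*
 * Stand-in for an LSN (XLogRecPtr is a uint64 in PostgreSQL).  In the real
 * server the left-hand value would be the "Latest checkpoint's REDO
 * location" read from the restored pg_control; here both values are
 * hard-coded for illustration.
 */
typedef uint64_t lsn_t;

/*
 * Hypothetical check: is the requested recovery target LSN at or after the
 * position the restored cluster would start replaying WAL from?
 */
static bool
recovery_target_lsn_is_reachable(lsn_t checkpoint_redo, lsn_t target_lsn)
{
    return target_lsn >= checkpoint_redo;
}

int
main(void)
{
    lsn_t redo   = 0x9C000028;  /* 0/9C000028, as in the example below            */
    lsn_t target = 0x8A0000F0;  /* an example recovery_target_lsn before the redo */

    if (!recovery_target_lsn_is_reachable(redo, target))
        printf("recovery target %X/%X precedes backup start %X/%X: "
               "refusing to start WAL replay\n",
               (unsigned) (target >> 32), (unsigned) target,
               (unsigned) (redo >> 32), (unsigned) redo);
    else
        printf("recovery target is reachable, replay may start\n");

    return 0;
}

Whether that REDO location is a safe lower bound when pg_control can be overwritten during the backup is, as you point out, the open question.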


Which in turn means that WAL replay starts from that position towards the
minimum recovery position (which is the end of the backup, which again means
applying the WAL generated between the start and the end of the backup) and
continues all the way through to the recovery target position.

minRecoveryPoint is only used when recovering a backup that was made from a standby.  For backups taken on the master, the backup end WAL record is used.

The best start position to check against would be the position shown in the
pg_control file, which is much better than the current PostgreSQL behaviour.

I don't agree, for the reasons given previously.

As explained above, my intention was to ensure that the recovery start position checks are performed successfully irrespective of the presence of the backup_label file.

I did some analysis before deciding to use pg_controldata's output instead of the backup_label file contents.

Comparing the output of pg_controldata with the contents of the backup_label file:

Recovery Target LSN

START WAL LOCATION (which is 0/9C000028) in the backup_label = pg_controldata's latest checkpoint's REDO location (Latest checkpoint's REDO location:  0/9C000028)

Recovery Target TIME

backup start time in the backup_label (START TIME: 2017-03-26 11:55:46 AEDT) = pg_controldata's latest checkpoint time (Time of latest checkpoint :  Sun 26 Mar 2017 11:55:46 AM AEDT)

Recovery Target XID

To begin with, backup_label does not contain any start XID, so the only option is to depend on pg_controldata's output.
After a few quick tests and careful observation, I noticed that the pg_control file information is copied as-is to the backup location at pg_start_backup. I also ran a few transactions between pg_start_backup and pg_stop_backup as part of those tests. So I believe this is the ideal start point for WAL replay.

Am I missing anything here?


    The tests pass (or rather fail as expected) because they are written
    using values before the start of the backup.  It's actually the end
    of the backup (where the database becomes consistent on recovery)
    that defines where PITR can start and this distinction definitely
    matters for very long backups.  A better test would be to start the
    backup, get a time/lsn/xid after writing some data, and then make
    sure that time/lsn/xid is flagged as invalid on recovery.

The current behaviour is that if the XID shown by pg_controldata is, say, 1400
and the specified recovery_target_xid is, say, 200, then PostgreSQL just
completes the recovery and promotes at the immediately consistent recovery
target, which means the DBA has to start the restoration and recovery process
all over again to promote the database at the intended position.

I'm not saying that the current behavior is good, only that the proposed behavior is incorrect as far as I can tell.

Please consider my explanation above and share your thoughts.
 

    It is also problematic to assume that transaction IDs commit in
    order. If checkPoint.latestCompletedXid contains 5 then a recovery
    to xid 4 may or may not be successful depending on the commit
    order.  However, it appears in this case the patch would disallow
    recovery to 4.

If txid 4 is the recovery target xid, then an appropriate backup taken prior
to txid 4 must be used, or an alternative recovery target such as
recovery_target_timestamp must be used to proceed to the respective recovery
target.

I'm not sure I follow you here, but I do know that using the last committed xid says little about which xids precede or follow it.

Yes, I agree. latestCompletedXid only tells us the latest completed XID and has nothing to do with the order in which XIDs were committed. It just shows the latest completed one, and that is the natural and expected behaviour, isn't it?

Sure, I will try to explain. The point I wanted to make is that it is difficult to identify the order in which XIDs were committed; the only option we have is to look at the pg_controldata information, and the backup_label file does not contain any such information about XIDs.

Generally, in real-world situations, DBAs who intend to recover to a particular point in time choose a recovery target in the form of an XID, a TIMESTAMP or an LSN, depending on whatever information they have or whatever target they feel is appropriate to use. If an XID seems inappropriate to use (perhaps because of the order in which transactions were committed), then an equivalent target such as a timestamp would be chosen to perform the recovery. I hope I am clear here.

Having said that, the only other options I see here are to use oldestXID, oldestActiveXID or NextXID, and none of them seems appropriate to use.
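
For what it is worth, even deciding whether one XID "precedes" another is not a plain integer comparison, since XIDs wrap around; PostgreSQL compares normal XIDs modulo 2^32. The standalone sketch below (not part of the patch; the type name and example values are mine, and it ignores the permanent XIDs) shows that kind of comparison and also why it only establishes assignment order, not commit order:

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

typedef uint32_t xid_t;     /* TransactionId is a uint32 in PostgreSQL */

/*
 * Wraparound-aware "precedes" test, modeled on PostgreSQL's
 * TransactionIdPrecedes(): normal XIDs are compared modulo 2^32, so a
 * plain integer comparison is not enough once the XID counter has
 * wrapped.  (The real function also special-cases the permanent XIDs;
 * this simplified sketch ignores that.)
 */
static bool
xid_precedes(xid_t id1, xid_t id2)
{
    return (int32_t) (id1 - id2) < 0;
}

int
main(void)
{
    xid_t control_xid = 1400;   /* e.g. NextXID from the restored pg_control */
    xid_t target_xid  = 200;    /* requested recovery_target_xid             */

    /*
     * This only tells us the target XID was assigned before the backup's
     * XID counter reached 1400; it says nothing about whether the target
     * transaction committed before or after any other XID, which is the
     * limitation discussed above.
     */
    if (xid_precedes(target_xid, control_xid))
        printf("xid %u was assigned before xid %u\n", target_xid, control_xid);
    else
        printf("xid %u was assigned at or after xid %u\n", target_xid, control_xid);

    return 0;
}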


    I think this logic would need to be baked into recoveryStopsAfter()
    and/or recoveryStopsBefore() in order to work.  It's not clear to me
    what that would look like, however, or if the xid check is even
    practical.


The above two functions determine whether the recovery process has to stop at
a particular position or not, and the patch has nothing to do with that.

To summarise, this patch only determines whether WAL replay should start at
all. In other words, if the specified recovery target falls prior to the
restored database position, then WAL replay should not start, which prevents
the database from being promoted at an unintended recovery target.

I understand what you are trying to do.  My point is that the information you are working from (whatever checkpoint happens to be in pg_control when recovery starts) is not sufficient for the task.  This method is inaccurate and would disallow valid recovery scenarios.

Please consider my explanations above.
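
To make that summary concrete, the decision the patch is trying to make looks roughly like the standalone sketch below (a simplified illustration only, not the code in the attached patches; the function name, types and example values are mine):

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>
#include <time.h>

typedef uint64_t lsn_t;   /* stand-in for XLogRecPtr    */
typedef uint32_t xid_t;   /* stand-in for TransactionId */

/* Positions describing where the restored backup would start replay from. */
typedef struct
{
    lsn_t  start_lsn;     /* checkpoint REDO location         */
    xid_t  start_xid;     /* XID recorded for that checkpoint */
    time_t start_time;    /* time of that checkpoint          */
} restored_position_t;

/*
 * Hypothetical pre-replay check: return false when the requested target
 * (only one of the three is normally set) clearly falls before the position
 * the restored backup starts from, so that replay is refused up front
 * instead of finishing at an unintended consistent point.
 */
static bool
recovery_should_start(const restored_position_t *pos,
                      bool have_xid, xid_t target_xid,
                      bool have_time, time_t target_time,
                      bool have_lsn, lsn_t target_lsn)
{
    if (have_lsn && target_lsn < pos->start_lsn)
        return false;
    if (have_time && target_time < pos->start_time)
        return false;
    if (have_xid && target_xid < pos->start_xid)  /* naive; see the XID caveat above */
        return false;
    return true;
}

int
main(void)
{
    /* Example values only: LSN 0/9C000028, XID 1400, arbitrary timestamp. */
    restored_position_t pos = { 0x9C000028, 1400, (time_t) 1490000000 };

    /* The scenario discussed earlier: recovery_target_xid = 200, backup at XID 1400. */
    if (!recovery_should_start(&pos, true, 200, false, 0, false, 0))
        printf("recovery target precedes the restored backup position; "
               "WAL replay is not started\n");
    else
        printf("WAL replay may start\n");

    return 0;
}

Whether pg_control is the right source for those start positions is, of course, the point still under discussion above.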
 

I suggest doing a thorough read-through of StartupXLOG(), particularly the section where the backup_label is loaded.

As explained above, not using the backup_label file contents for the recovery start point check was intentional. I will be happy to hear any comments or objections on that.

Also, the patch no longer applies to the latest master, so attached is the updated patch rebased against the latest master. I have also attached the rebased second patch.

Thank you again for your comments on my patch.

Regards,

Venkata B N
Database Consultant
