On Mon, Mar 30, 2020 at 2:59 PM Andres Freund <andres@anarazel.de> wrote:
> I think it wouldn't be too hard to compute that information while taking
> the base backup. We know the end timeline (ThisTimeLineID), so we can
> just call readTimeLineHistory(ThisTimeLineID). Which should then allow
> for something pretty trivial along the lines of
>
>     timelines = readTimeLineHistory(ThisTimeLineID);
>     last_start = InvalidXLogRecPtr;
>     foreach(lc, timelines)
>     {
>         TimeLineHistoryEntry *he = lfirst(lc);
>
>         if (he->end < startptr)
>             continue;
>
>         //
>         manifest_emit_wal_range(Min(he->begin, startptr), he->end);
>         last_start = he->end;
>     }
>
>     if (last_start == InvalidXLogRecPtr)
>         start = startptr;
>     else
>         start = last_start;
>
>     manifest_emit_wal_range(start, endptr);

I made an attempt to implement this. In the attached patch set, 0001
and 0002 are (I think) unmodified from the last version. 0003 is a
slightly-rejiggered version of your new pg_waldump option. 0004 whacks
0002 around so that the WAL ranges are included in the manifest and
pg_validatebackup tries to run pg_waldump for each WAL range. It
appears to work in light testing, but I haven't yet (1) tested it
extensively, (2) written good regression tests for it above and beyond
what pg_validatebackup had already, or (3) updated the documentation.
I'm going to work on those things. I would appreciate *very timely*
feedback on anything people do or do not like about this, because I
want to commit this patch set by the end of the work week and that
isn't very far away. I would also appreciate it if people would bear in
mind the principle that half a loaf is better than none, and further
improvements can be made in future releases.
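
To make the "run pg_waldump for each WAL range" step concrete, here is a
rough sketch of how one range could be turned into a command line. This
is not the code in 0004: the struct, the function name, and the naive
double-quoting of the path are all invented for illustration. It uses
only pg_waldump's existing --path, --timeline, --start and --end
options and leaves out the new option from 0003.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for one WAL range read from the manifest. */
typedef struct
{
    uint32_t    tli;        /* timeline ID */
    uint64_t    start_lsn;  /* first byte of WAL that must be readable */
    uint64_t    end_lsn;    /* end of the WAL that must be readable */
} WalDumpRange;

/*
 * Format a pg_waldump command line for one range.  LSNs are printed in
 * the usual high/low hexadecimal notation, e.g. 0/3000000.
 *
 * Returns 0 on success, or -1 if the buffer is too small.  The path is
 * wrapped in double quotes without any escaping, which is only good
 * enough for a sketch.
 */
static int
build_waldump_command(char *buf, size_t buflen,
                      const char *wal_dir, const WalDumpRange *range)
{
    int         n;

    n = snprintf(buf, buflen,
                 "pg_waldump --path=\"%s\" --timeline=%" PRIu32
                 " --start=%" PRIX32 "/%" PRIX32
                 " --end=%" PRIX32 "/%" PRIX32,
                 wal_dir, range->tli,
                 (uint32_t) (range->start_lsn >> 32),
                 (uint32_t) range->start_lsn,
                 (uint32_t) (range->end_lsn >> 32),
                 (uint32_t) range->end_lsn);

    return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}
```

A caller would loop over the ranges in the manifest, build one command
per range, and treat a nonzero exit status from any of them as a
verification failure.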

As part of my light testing, I tried promoting a standby that was
running pg_basebackup, and found that pg_basebackup failed like this:

pg_basebackup: error: could not get COPY data stream: ERROR: the standby was promoted during online backup
HINT: This means that the backup being taken is corrupt and should not be used. Try taking another online backup.
pg_basebackup: removing data directory "/Users/rhaas/pgslave2"

My first thought was that this error message is hard to reconcile with
this comment:

    /*
     * Send timeline history files too. Only the latest timeline history
     * file is required for recovery, and even that only if there happens
     * to be a timeline switch in the first WAL segment that contains the
     * checkpoint record, or if we're taking a base backup from a standby
     * server and the target timeline changes while the backup is taken.
     * But they are small and highly useful for debugging purposes, so
     * better include them all, always.
     */

But then it occurred to me that this might be a cascading standby.
Maybe the original master died and this machine's master got promoted,
so it has to follow a timeline switch but doesn't itself get promoted.
I think I might try to test out that scenario and see what happens,
but I haven't done so as of this writing. Regardless, it seems like a
really good idea to store a list of WAL ranges rather than a single
start/end/timeline, because even if it's impossible today it might
become possible in the future. Still, unless there's an easy way to
set up a test scenario where multiple WAL ranges need to be verified,
it may be hard to test that this code actually behaves properly.
Thoughts?
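
To illustrate what "multiple WAL ranges" would mean for a backup that
spans a timeline switch, here is a small standalone sketch. It does not
use the backend data structures: HistoryEntry and WalRange are
hypothetical stand-ins, the history array is assumed to be ordered
oldest timeline first (which may not match what readTimeLineHistory()
returns), and each range is clamped to [startptr, endptr], which is one
reading of the intent rather than something the quoted pseudocode
spells out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;               /* stand-in for XLogRecPtr */
#define INVALID_LSN ((Lsn) 0)       /* stand-in for InvalidXLogRecPtr */

/* Hypothetical, simplified stand-in for TimeLineHistoryEntry. */
typedef struct
{
    uint32_t    tli;
    Lsn         begin;  /* INVALID_LSN for the oldest timeline */
    Lsn         end;    /* INVALID_LSN for the newest, still-open timeline */
} HistoryEntry;

/* One range of WAL that the backup needs, on a single timeline. */
typedef struct
{
    uint32_t    tli;
    Lsn         start;
    Lsn         end;
} WalRange;

/*
 * Compute the WAL ranges covering [startptr, endptr), one per timeline
 * that overlaps that interval.  Returns the number of ranges written to
 * "out", or -1 if "out" is too small.
 */
static int
compute_wal_ranges(const HistoryEntry *history, size_t nhistory,
                   Lsn startptr, Lsn endptr,
                   WalRange *out, size_t max_out)
{
    size_t      n = 0;

    assert(startptr != INVALID_LSN && endptr != INVALID_LSN);

    for (size_t i = 0; i < nhistory; i++)
    {
        const HistoryEntry *he = &history[i];
        Lsn         tl_begin;
        Lsn         tl_end;

        /* An open-ended timeline runs up to the end of the backup. */
        tl_end = (he->end == INVALID_LSN || he->end > endptr) ? endptr : he->end;

        /* Never go back further than the start of the backup. */
        tl_begin = (he->begin > startptr) ? he->begin : startptr;

        /* Skip timelines that do not overlap the backup at all. */
        if (tl_begin >= tl_end)
            continue;

        if (n == max_out)
            return -1;

        out[n].tli = he->tli;
        out[n].start = tl_begin;
        out[n].end = tl_end;
        n++;
    }

    return (int) n;
}
```

With a history in which timeline 1 ends at 0/3000000 and timeline 2
begins there, a backup running from 0/2000028 to 0/4000100 yields two
ranges: timeline 1 from 0/2000028 to 0/3000000, and timeline 2 from
0/3000000 to 0/4000100. A backup that never crosses a switch point
yields a single range.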
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company