Re: Standby trying "restore_command" before local WAL - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: Standby trying "restore_command" before local WAL
Date
Msg-id 20180807154204.GV27724@tamriel.snowman.net
Whole thread Raw
In response to Re: Standby trying "restore_command" before local WAL  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Standby trying "restore_command" before local WAL
Re: Standby trying "restore_command" before local WAL
List pgsql-hackers
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 08/06/2018 09:32 PM, Stephen Frost wrote:
> >* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> >>On 08/06/2018 06:11 PM, Stephen Frost wrote:
> >WAL checksums are per WAL record, not across the whole file...  And,
> >yes, this seems like something we could probably write code to ensure
> >we're doing correctly, possibly even without changing the existing WAL
> >format or maybe we would have to, but either way, seems like additional
> >code that would need to be written and some serious thought put into it
> >to make sure it's correct.  I'm all for that, just to be clear, but what
> >I don't think we can do is move forward with a change to just prefer
> >pg_wal over restore command.
>
> While the checksums are per-record, the record also contains xl_prev, so
> it's effectively a chain of checksums covering the whole file. The other
> thing you can do is look at the first record of the next segment and use the
> xl_prev to detect if the previous segment was not partial.
>
> But I do agree doing this properly may require writing some new code and
> checking that those cases are detected correctly.

Right, might be possible but isn't something we necessairly guarantee
will always work correctly today in this situation where we've got old
WAL files that were pulled over a period of time (imagine that we copy
WAL file ABC before PG was done with it, but we don't get around to
copying WAL file DEF until much later after ABC has been completed and
DEF is only partial, or other such scenarios).  Remember, the scenario
we're talking about here is where someone did a pg_start_backup, took
their time copying all the files in PGDATA, including pg_wal, and then
at some later point ran pg_stop_backup.  We have no idea when those WAL
files were copied and they could have been anywhere between the
pg_start_backup point and the pg_stop_backup point.

> (Note: There was a discussion about replacing xl_prev with LSN of the
> current record, and IMHO that would work just as well, although it might be
> a bit more expensive for some operations.)

I haven't thought very much about this so don't have much of an opinion
on it one way or the other at this point.

> >CRC's are per WAL record, and possibly some WAL records might not be ok
> >to replay, or at least we need to make sure that we replay the right set
> >of WAL in the right order even when there are partial WAL files being
> >given to PG (that aren't named that way...).  The more I think about
> >this, I think we really need to avoid partial WAL files entirely- what
> >are we going to do when we get to the end of one?  We'd need to request
> >the full one from the restore command anyway, so seems like we should
> >just go ahead and get it from the archive, the question is if there's an
> >easy/cheap way to detect partial WAL files in pg_wal.
>
> As explained above, I don't think this is actually a problem. The checksums
> do cover the whole file thanks to chaining, and there are ways to detect
> partial segments. IMHO it's fine if we replay a segment and then find out it
> was partial and that we need to fetch it from archive anyway and re-apply it
> - it should not be very common case, except when the user does something
> silly.

As long as we *do* go off and try to fetch that WAL file and replay it,
and don't assume that the end of that partial WAL file means the end of
WAL replay, then I think you may be right and that it'd be fine, but it
does seem a bit risky to me.

> >As I mention above, seems like what we should really be doing is having
> >a way to know when a WAL file is full but still in pg_wal for some
> >reason (hasn't been replayed yet due to intential lag on the replica, or
> >unintentional lag on the replica, etc), and perhaps we even only do that
> >on replicas or have tools like pg_basebackup and pgbackrest do that when
> >they're doing a restore, so that it's clear to PG that all these files
> >are full WAL and they can replay them completely.
>
> IMHO we can deduce if from first xl_prev of the next segment (assuming we
> have the next segment available locally, which we likely have except for the
> last one).

See above for why I don't think it's actually that simple..

> >I was actually just thinking about having pgbackrest do exactly that. :)
> >We already have the full sha256 checksum of every WAL file in the
> >pgbackrest repository, it seems like it wouldn't be too hard to
> >calculate the checksum of the files in pg_wal and ask the repo what the
> >checksums are for those WAL files and then use the files in pg_wal if
> >they checksums match.  I can already imagine the argument coming back
> >though- that'd require additional testing to ensure it's done correctly
> >and I'm really not sure that it'd end up being all that much better
> >except in some very special circumstances where there's a low bandwidth
> >connection to the repo and often a lot of WAL left over in pg_wal.
>
> I don't think it can be done in an external tool entirely, at least not
> using restore_command alone. We remove the local segment before even
> invoking restore_command, so it's too late to check checksums etc.

Wasn't this entire discussion started because we were calling
restore_command first instead of reading from pg_wal..?  Or do we
actively go wipe out pg_wal before doing that (I didn't think we did,
but I could certainly be wrong on that point).

> We'd need invoking some other command before restore_command, which would do
> this comparison and decide whether to use local or remote WAL.

Ugh, that sounds like a lot of additional complication that I'm not sure
would be all that useful or sensible for this particular case, if that's
what it would require.  I'd rather we try to figure out some way to get
rid of restore_command entirely instead of bolting on more things around
it.

Thanks!

Stephen

Attachment

pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: [PATCH] Include application_name in "connection authorized" logmessage
Next
From: Tomas Vondra
Date:
Subject: Re: Standby trying "restore_command" before local WAL