Home > mailing lists

Re: Standby trying "restore_command" before local WAL - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: Standby trying "restore_command" before local WAL
Date	August 7, 2018 18:57:07
Msg-id	4e22f312-50bb-3103-e114-10498afc9342@2ndquadrant.com Whole thread Raw
In response to	Re: Standby trying "restore_command" before local WAL (Stephen Frost <sfrost@snowman.net>)
Responses	Re: Standby trying "restore_command" before local WAL
List	pgsql-hackers

Tree view


On 08/07/2018 05:42 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> On 08/06/2018 09:32 PM, Stephen Frost wrote:
>>> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>>>> On 08/06/2018 06:11 PM, Stephen Frost wrote:
>>> WAL checksums are per WAL record, not across the whole file...  And,
>>> yes, this seems like something we could probably write code to ensure
>>> we're doing correctly, possibly even without changing the existing WAL
>>> format or maybe we would have to, but either way, seems like additional
>>> code that would need to be written and some serious thought put into it
>>> to make sure it's correct.  I'm all for that, just to be clear, but what
>>> I don't think we can do is move forward with a change to just prefer
>>> pg_wal over restore command.
>>
>> While the checksums are per-record, the record also contains xl_prev, so
>> it's effectively a chain of checksums covering the whole file. The other
>> thing you can do is look at the first record of the next segment and use the
>> xl_prev to detect if the previous segment was not partial.
>>
>> But I do agree doing this properly may require writing some new code and
>> checking that those cases are detected correctly.
> 
> Right, might be possible but isn't something we necessairly guarantee
> will always work correctly today in this situation where we've got old
> WAL files that were pulled over a period of time (imagine that we copy
> WAL file ABC before PG was done with it, but we don't get around to
> copying WAL file DEF until much later after ABC has been completed and
> DEF is only partial, or other such scenarios).  Remember, the scenario
> we're talking about here is where someone did a pg_start_backup, took
> their time copying all the files in PGDATA, including pg_wal, and then
> at some later point ran pg_stop_backup.  We have no idea when those WAL
> files were copied and they could have been anywhere between the
> pg_start_backup point and the pg_stop_backup point.
> 
>> (Note: There was a discussion about replacing xl_prev with LSN of the
>> current record, and IMHO that would work just as well, although it might be
>> a bit more expensive for some operations.)
> 
> I haven't thought very much about this so don't have much of an opinion
> on it one way or the other at this point.
> 
>>> CRC's are per WAL record, and possibly some WAL records might not be ok
>>> to replay, or at least we need to make sure that we replay the right set
>>> of WAL in the right order even when there are partial WAL files being
>>> given to PG (that aren't named that way...).  The more I think about
>>> this, I think we really need to avoid partial WAL files entirely- what
>>> are we going to do when we get to the end of one?  We'd need to request
>>> the full one from the restore command anyway, so seems like we should
>>> just go ahead and get it from the archive, the question is if there's an
>>> easy/cheap way to detect partial WAL files in pg_wal.
>>
>> As explained above, I don't think this is actually a problem. The checksums
>> do cover the whole file thanks to chaining, and there are ways to detect
>> partial segments. IMHO it's fine if we replay a segment and then find out it
>> was partial and that we need to fetch it from archive anyway and re-apply it
>> - it should not be very common case, except when the user does something
>> silly.
> 
> As long as we *do* go off and try to fetch that WAL file and replay it,
> and don't assume that the end of that partial WAL file means the end of
> WAL replay, then I think you may be right and that it'd be fine, but it
> does seem a bit risky to me.
> 
>>> As I mention above, seems like what we should really be doing is having
>>> a way to know when a WAL file is full but still in pg_wal for some
>>> reason (hasn't been replayed yet due to intential lag on the replica, or
>>> unintentional lag on the replica, etc), and perhaps we even only do that
>>> on replicas or have tools like pg_basebackup and pgbackrest do that when
>>> they're doing a restore, so that it's clear to PG that all these files
>>> are full WAL and they can replay them completely.
>>
>> IMHO we can deduce if from first xl_prev of the next segment (assuming we
>> have the next segment available locally, which we likely have except for the
>> last one).
> 
> See above for why I don't think it's actually that simple..
> 
>>> I was actually just thinking about having pgbackrest do exactly that. :)
>>> We already have the full sha256 checksum of every WAL file in the
>>> pgbackrest repository, it seems like it wouldn't be too hard to
>>> calculate the checksum of the files in pg_wal and ask the repo what the
>>> checksums are for those WAL files and then use the files in pg_wal if
>>> they checksums match.  I can already imagine the argument coming back
>>> though- that'd require additional testing to ensure it's done correctly
>>> and I'm really not sure that it'd end up being all that much better
>>> except in some very special circumstances where there's a low bandwidth
>>> connection to the repo and often a lot of WAL left over in pg_wal.
>>
>> I don't think it can be done in an external tool entirely, at least not
>> using restore_command alone. We remove the local segment before even
>> invoking restore_command, so it's too late to check checksums etc.
> 
> Wasn't this entire discussion started because we were calling
> restore_command first instead of reading from pg_wal..?  Or do we
> actively go wipe out pg_wal before doing that (I didn't think we did,
> but I could certainly be wrong on that point).
> 

That's how I read this part of RestoreArchivedFile:

https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlogarchive.c#L110

The very first thing it does is checking if the local file exists, and 
if it does then calling unlink().

>> We'd need invoking some other command before restore_command, which would do
>> this comparison and decide whether to use local or remote WAL.
> 
> Ugh, that sounds like a lot of additional complication that I'm not sure
> would be all that useful or sensible for this particular case, if that's
> what it would require.  I'd rather we try to figure out some way to get
> rid of restore_command entirely instead of bolting on more things around
> it.
> 

The simpler the better of course.

All I'm saying is that (assuming my understanding of RestoreArchivedFile 
is correct) we can't just do that in the current restore_command. We do 
need a way to ask the archive for some metadata/checksums, and 
restore_command is too late.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pgsql-hackers by date:

From: Stephen Frost
Date: 07 August 2018, 18:42:04
Subject: Re: Standby trying "restore_command" before local WAL

From: Stephen Frost
Date: 07 August 2018, 19:05:08
Subject: Re: Standby trying "restore_command" before local WAL

Re: Standby trying "restore_command" before local WAL - Mailing list pgsql-hackers

Previous

Next