Home > mailing lists
Re: Standby trying "restore_command" before local WAL - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: Standby trying "restore_command" before local WAL
Date	August 7, 2018 15:25:18
Msg-id	1c73846e-272f-b98a-329d-0a385f747182@2ndquadrant.com Whole thread Raw
In response to	Re: Standby trying "restore_command" before local WAL (Stephen Frost <sfrost@snowman.net>)
Responses	Re: Standby trying "restore_command" before local WAL
List	pgsql-hackers
Tree view

On 08/06/2018 09:32 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> On 08/06/2018 06:11 PM, Stephen Frost wrote:
>>> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>>>> On 08/06/2018 05:19 PM, Stephen Frost wrote:
>>>>> * David Steele (david@pgmasters.net) wrote:
>>>>>> I think for the stated scenario (known good standby that has been
>>>>>> shutdown gracefully) it makes perfect sense to trust the contents of
>>>>>> pg_wal.  Call this scenario #1.
>>>>>>
>>>>>> An alternate scenario (#2) is that the data directory was copied using a
>>>>>> basic copy tool and the pg_wal directory was not excluded from the copy.
>>>>>>   This means the contents of pg_wal will be in an inconsistent state.
>>>>>> The files that are there might be partials (not with the extension,
>>>>>> though) and you can easily have multiple partials.  You will almost
>>>>>> certainly not have everything you need to get to consistency.
>>>>
>>>> Yeah. But as Simon said, we do have fairly strong protections about applying
>>>> corrupted WAL - every record is CRC-checked. So why not to fall-back to the
>>>> restore_command only if the locally available WAL is not fully consistent?
>>>
>>> "Corrupted" doesn't necessairly only mean "the data file was munged by
>>> the storage somehow."  In this case, corrupted could be an old and only
>>> partial WAL file, in which case we'd possibly be missing WAL that needs
>>> to be replayed to bring the cluster back to a valid state, no?
>>
>> Why wouldn't that be detected by checksums? Also, why wouldn't this be
>> handled correctly by the logic I proposed, i.e. falling-back to remote WAL
>> segment if the file is incomplete/broken in some sense?
> 
> WAL checksums are per WAL record, not across the whole file...  And,
> yes, this seems like something we could probably write code to ensure
> we're doing correctly, possibly even without changing the existing WAL
> format or maybe we would have to, but either way, seems like additional
> code that would need to be written and some serious thought put into it
> to make sure it's correct.  I'm all for that, just to be clear, but what
> I don't think we can do is move forward with a change to just prefer
> pg_wal over restore command.
> 

While the checksums are per-record, the record also contains xl_prev, so 
it's effectively a chain of checksums covering the whole file. The other 
thing you can do is look at the first record of the next segment and use 
the xl_prev to detect if the previous segment was not partial.

But I do agree doing this properly may require writing some new code and 
checking that those cases are detected correctly.

(Note: There was a discussion about replacing xl_prev with LSN of the 
current record, and IMHO that would work just as well, although it might 
be a bit more expensive for some operations.)

>>>>>> But there's another good scenario (#3): where the pg_wal directory was
>>>>>> preloaded with all the WAL required to make the cluster consistent or
>>>>>> all the WAL that was available at restore time.  In this case, it would
>>>>>> be make sense to prefer the contents of pg_wal and only switch to
>>>>>> restore_command after that has been exhausted.
>>>>>>
>>>>>> So, the choice of whether to prefer locally-stored or
>>>>>> restore_command-fetched WAL is context-dependent, in my mind.
>>>>>
>>>>> Agreed.
>>>>
>>>> Maybe, not sure.
>>>
>>> The argument that David makes above in scenario #2 certainly looks
>>> entirely likely to me and I don't think we've got any real protections
>>> against that.  The current common use-cases happen to work around the
>>> risk because tools like pg_basebackup ignore the existing pg_wal
>>> directory when doing the backup and instead populate it with exactly the
>>> correct WAL that's needed, and in cases where a restore command is
>>> specified will always pull back only valid WAL, but I don't think we can
>>> decide that this scenario (#2 from above):
>>
>> I'm probably missing something, but why couldn't we detect that using CRC.
>> Or make sure we can detect that, e.g. by adding some additional info into
>> each WAL segment?
> 
> CRC's are per WAL record, and possibly some WAL records might not be ok
> to replay, or at least we need to make sure that we replay the right set
> of WAL in the right order even when there are partial WAL files being
> given to PG (that aren't named that way...).  The more I think about
> this, I think we really need to avoid partial WAL files entirely- what
> are we going to do when we get to the end of one?  We'd need to request
> the full one from the restore command anyway, so seems like we should
> just go ahead and get it from the archive, the question is if there's an
> easy/cheap way to detect partial WAL files in pg_wal.
> 

As explained above, I don't think this is actually a problem. The 
checksums do cover the whole file thanks to chaining, and there are ways 
to detect partial segments. IMHO it's fine if we replay a segment and 
then find out it was partial and that we need to fetch it from archive 
anyway and re-apply it - it should not be very common case, except when 
the user does something silly.

>>>>>> Ideally we could have a default that is safe in each scenario with
>>>>>> perhaps an override if the user knows better.  Scenario #1 would allow
>>>>>> WAL to be read from pg_wal by default, scenario #2 would prefer fetched
>>>>>> WAL, and scenario #3 could use a GUC to override the default fetch behavior.
>>>>>
>>>>> Not sure how we'd be able to automatically realize which scenario we're
>>>>> in though..?
>>>>
>>>> But do we need to know it? I mean, can't we try the local WAL first, use it
>>>> if it passes the CRC checks (and possibly some other checks), and only
>>>> fallback to the remote WAL if it's identified as broken?
>>>
>>> Maybe- but I think we need to be quite sure about that and I don't
>>> believe that just checking the CRCs is enough.
>>
>> Sure. So what else would we need to add to each WAL segment to make that
>> possible? It needs to be cheap enough not to cause perf regression, but e.g.
>> a CRC of each segment + amount of data actually stored in it should be
>> sufficient.
> 
> As I mention above, seems like what we should really be doing is having
> a way to know when a WAL file is full but still in pg_wal for some
> reason (hasn't been replayed yet due to intential lag on the replica, or
> unintentional lag on the replica, etc), and perhaps we even only do that
> on replicas or have tools like pg_basebackup and pgbackrest do that when
> they're doing a restore, so that it's clear to PG that all these files
> are full WAL and they can replay them completely.
> 

IMHO we can deduce if from first xl_prev of the next segment (assuming 
we have the next segment available locally, which we likely have except 
for the last one).

>> Another idea - what if we instead allow fetching additional information of
>> the archived WAL, and make decision based on that? For example, before
>> restore_command we could request md5 checksum from the archive, compare it
>> to the local WAL, and if they match then use the local one. Otherwise
>> request the remote one. That'd still be a win.
> 
> I was actually just thinking about having pgbackrest do exactly that. :)
> We already have the full sha256 checksum of every WAL file in the
> pgbackrest repository, it seems like it wouldn't be too hard to
> calculate the checksum of the files in pg_wal and ask the repo what the
> checksums are for those WAL files and then use the files in pg_wal if
> they checksums match.  I can already imagine the argument coming back
> though- that'd require additional testing to ensure it's done correctly
> and I'm really not sure that it'd end up being all that much better
> except in some very special circumstances where there's a low bandwidth
> connection to the repo and often a lot of WAL left over in pg_wal.
> 

I don't think it can be done in an external tool entirely, at least not 
using restore_command alone. We remove the local segment before even
invoking restore_command, so it's too late to check checksums etc.

We'd need invoking some other command before restore_command, which 
would do this comparison and decide whether to use local or remote WAL.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date:
From: Jesper Pedersen
Date: 07 August 2018, 15:14:08
Subject: Re: partition tree inspection functions
From: Lætitia Avrot
Date: 07 August 2018, 15:29:43
Subject: Re: Constraint documentation
Re: Standby trying "restore_command" before local WAL - Mailing list pgsql-hackers

Previous

Next