Re: Standby trying "restore_command" before local WAL - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Standby trying "restore_command" before local WAL |
Date | |
Msg-id | 1c73846e-272f-b98a-329d-0a385f747182@2ndquadrant.com Whole thread Raw |
In response to | Re: Standby trying "restore_command" before local WAL (Stephen Frost <sfrost@snowman.net>) |
Responses |
Re: Standby trying "restore_command" before local WAL
|
List | pgsql-hackers |
On 08/06/2018 09:32 PM, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: >> On 08/06/2018 06:11 PM, Stephen Frost wrote: >>> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: >>>> On 08/06/2018 05:19 PM, Stephen Frost wrote: >>>>> * David Steele (david@pgmasters.net) wrote: >>>>>> I think for the stated scenario (known good standby that has been >>>>>> shutdown gracefully) it makes perfect sense to trust the contents of >>>>>> pg_wal. Call this scenario #1. >>>>>> >>>>>> An alternate scenario (#2) is that the data directory was copied using a >>>>>> basic copy tool and the pg_wal directory was not excluded from the copy. >>>>>> This means the contents of pg_wal will be in an inconsistent state. >>>>>> The files that are there might be partials (not with the extension, >>>>>> though) and you can easily have multiple partials. You will almost >>>>>> certainly not have everything you need to get to consistency. >>>> >>>> Yeah. But as Simon said, we do have fairly strong protections about applying >>>> corrupted WAL - every record is CRC-checked. So why not to fall-back to the >>>> restore_command only if the locally available WAL is not fully consistent? >>> >>> "Corrupted" doesn't necessairly only mean "the data file was munged by >>> the storage somehow." In this case, corrupted could be an old and only >>> partial WAL file, in which case we'd possibly be missing WAL that needs >>> to be replayed to bring the cluster back to a valid state, no? >> >> Why wouldn't that be detected by checksums? Also, why wouldn't this be >> handled correctly by the logic I proposed, i.e. falling-back to remote WAL >> segment if the file is incomplete/broken in some sense? > > WAL checksums are per WAL record, not across the whole file... And, > yes, this seems like something we could probably write code to ensure > we're doing correctly, possibly even without changing the existing WAL > format or maybe we would have to, but either way, seems like additional > code that would need to be written and some serious thought put into it > to make sure it's correct. I'm all for that, just to be clear, but what > I don't think we can do is move forward with a change to just prefer > pg_wal over restore command. > While the checksums are per-record, the record also contains xl_prev, so it's effectively a chain of checksums covering the whole file. The other thing you can do is look at the first record of the next segment and use the xl_prev to detect if the previous segment was not partial. But I do agree doing this properly may require writing some new code and checking that those cases are detected correctly. (Note: There was a discussion about replacing xl_prev with LSN of the current record, and IMHO that would work just as well, although it might be a bit more expensive for some operations.) >>>>>> But there's another good scenario (#3): where the pg_wal directory was >>>>>> preloaded with all the WAL required to make the cluster consistent or >>>>>> all the WAL that was available at restore time. In this case, it would >>>>>> be make sense to prefer the contents of pg_wal and only switch to >>>>>> restore_command after that has been exhausted. >>>>>> >>>>>> So, the choice of whether to prefer locally-stored or >>>>>> restore_command-fetched WAL is context-dependent, in my mind. >>>>> >>>>> Agreed. >>>> >>>> Maybe, not sure. >>> >>> The argument that David makes above in scenario #2 certainly looks >>> entirely likely to me and I don't think we've got any real protections >>> against that. The current common use-cases happen to work around the >>> risk because tools like pg_basebackup ignore the existing pg_wal >>> directory when doing the backup and instead populate it with exactly the >>> correct WAL that's needed, and in cases where a restore command is >>> specified will always pull back only valid WAL, but I don't think we can >>> decide that this scenario (#2 from above): >> >> I'm probably missing something, but why couldn't we detect that using CRC. >> Or make sure we can detect that, e.g. by adding some additional info into >> each WAL segment? > > CRC's are per WAL record, and possibly some WAL records might not be ok > to replay, or at least we need to make sure that we replay the right set > of WAL in the right order even when there are partial WAL files being > given to PG (that aren't named that way...). The more I think about > this, I think we really need to avoid partial WAL files entirely- what > are we going to do when we get to the end of one? We'd need to request > the full one from the restore command anyway, so seems like we should > just go ahead and get it from the archive, the question is if there's an > easy/cheap way to detect partial WAL files in pg_wal. > As explained above, I don't think this is actually a problem. The checksums do cover the whole file thanks to chaining, and there are ways to detect partial segments. IMHO it's fine if we replay a segment and then find out it was partial and that we need to fetch it from archive anyway and re-apply it - it should not be very common case, except when the user does something silly. >>>>>> Ideally we could have a default that is safe in each scenario with >>>>>> perhaps an override if the user knows better. Scenario #1 would allow >>>>>> WAL to be read from pg_wal by default, scenario #2 would prefer fetched >>>>>> WAL, and scenario #3 could use a GUC to override the default fetch behavior. >>>>> >>>>> Not sure how we'd be able to automatically realize which scenario we're >>>>> in though..? >>>> >>>> But do we need to know it? I mean, can't we try the local WAL first, use it >>>> if it passes the CRC checks (and possibly some other checks), and only >>>> fallback to the remote WAL if it's identified as broken? >>> >>> Maybe- but I think we need to be quite sure about that and I don't >>> believe that just checking the CRCs is enough. >> >> Sure. So what else would we need to add to each WAL segment to make that >> possible? It needs to be cheap enough not to cause perf regression, but e.g. >> a CRC of each segment + amount of data actually stored in it should be >> sufficient. > > As I mention above, seems like what we should really be doing is having > a way to know when a WAL file is full but still in pg_wal for some > reason (hasn't been replayed yet due to intential lag on the replica, or > unintentional lag on the replica, etc), and perhaps we even only do that > on replicas or have tools like pg_basebackup and pgbackrest do that when > they're doing a restore, so that it's clear to PG that all these files > are full WAL and they can replay them completely. > IMHO we can deduce if from first xl_prev of the next segment (assuming we have the next segment available locally, which we likely have except for the last one). >> Another idea - what if we instead allow fetching additional information of >> the archived WAL, and make decision based on that? For example, before >> restore_command we could request md5 checksum from the archive, compare it >> to the local WAL, and if they match then use the local one. Otherwise >> request the remote one. That'd still be a win. > > I was actually just thinking about having pgbackrest do exactly that. :) > We already have the full sha256 checksum of every WAL file in the > pgbackrest repository, it seems like it wouldn't be too hard to > calculate the checksum of the files in pg_wal and ask the repo what the > checksums are for those WAL files and then use the files in pg_wal if > they checksums match. I can already imagine the argument coming back > though- that'd require additional testing to ensure it's done correctly > and I'm really not sure that it'd end up being all that much better > except in some very special circumstances where there's a low bandwidth > connection to the repo and often a lot of WAL left over in pg_wal. > I don't think it can be done in an external tool entirely, at least not using restore_command alone. We remove the local segment before even invoking restore_command, so it's too late to check checksums etc. We'd need invoking some other command before restore_command, which would do this comparison and decide whether to use local or remote WAL. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: