Re: Standby trying "restore_command" before local WAL - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Standby trying "restore_command" before local WAL |
Date | |
Msg-id | 4e22f312-50bb-3103-e114-10498afc9342@2ndquadrant.com Whole thread Raw |
In response to | Re: Standby trying "restore_command" before local WAL (Stephen Frost <sfrost@snowman.net>) |
Responses |
Re: Standby trying "restore_command" before local WAL
|
List | pgsql-hackers |
On 08/07/2018 05:42 PM, Stephen Frost wrote: > Greetings, > > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: >> On 08/06/2018 09:32 PM, Stephen Frost wrote: >>> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote: >>>> On 08/06/2018 06:11 PM, Stephen Frost wrote: >>> WAL checksums are per WAL record, not across the whole file... And, >>> yes, this seems like something we could probably write code to ensure >>> we're doing correctly, possibly even without changing the existing WAL >>> format or maybe we would have to, but either way, seems like additional >>> code that would need to be written and some serious thought put into it >>> to make sure it's correct. I'm all for that, just to be clear, but what >>> I don't think we can do is move forward with a change to just prefer >>> pg_wal over restore command. >> >> While the checksums are per-record, the record also contains xl_prev, so >> it's effectively a chain of checksums covering the whole file. The other >> thing you can do is look at the first record of the next segment and use the >> xl_prev to detect if the previous segment was not partial. >> >> But I do agree doing this properly may require writing some new code and >> checking that those cases are detected correctly. > > Right, might be possible but isn't something we necessairly guarantee > will always work correctly today in this situation where we've got old > WAL files that were pulled over a period of time (imagine that we copy > WAL file ABC before PG was done with it, but we don't get around to > copying WAL file DEF until much later after ABC has been completed and > DEF is only partial, or other such scenarios). Remember, the scenario > we're talking about here is where someone did a pg_start_backup, took > their time copying all the files in PGDATA, including pg_wal, and then > at some later point ran pg_stop_backup. We have no idea when those WAL > files were copied and they could have been anywhere between the > pg_start_backup point and the pg_stop_backup point. > >> (Note: There was a discussion about replacing xl_prev with LSN of the >> current record, and IMHO that would work just as well, although it might be >> a bit more expensive for some operations.) > > I haven't thought very much about this so don't have much of an opinion > on it one way or the other at this point. > >>> CRC's are per WAL record, and possibly some WAL records might not be ok >>> to replay, or at least we need to make sure that we replay the right set >>> of WAL in the right order even when there are partial WAL files being >>> given to PG (that aren't named that way...). The more I think about >>> this, I think we really need to avoid partial WAL files entirely- what >>> are we going to do when we get to the end of one? We'd need to request >>> the full one from the restore command anyway, so seems like we should >>> just go ahead and get it from the archive, the question is if there's an >>> easy/cheap way to detect partial WAL files in pg_wal. >> >> As explained above, I don't think this is actually a problem. The checksums >> do cover the whole file thanks to chaining, and there are ways to detect >> partial segments. IMHO it's fine if we replay a segment and then find out it >> was partial and that we need to fetch it from archive anyway and re-apply it >> - it should not be very common case, except when the user does something >> silly. > > As long as we *do* go off and try to fetch that WAL file and replay it, > and don't assume that the end of that partial WAL file means the end of > WAL replay, then I think you may be right and that it'd be fine, but it > does seem a bit risky to me. > >>> As I mention above, seems like what we should really be doing is having >>> a way to know when a WAL file is full but still in pg_wal for some >>> reason (hasn't been replayed yet due to intential lag on the replica, or >>> unintentional lag on the replica, etc), and perhaps we even only do that >>> on replicas or have tools like pg_basebackup and pgbackrest do that when >>> they're doing a restore, so that it's clear to PG that all these files >>> are full WAL and they can replay them completely. >> >> IMHO we can deduce if from first xl_prev of the next segment (assuming we >> have the next segment available locally, which we likely have except for the >> last one). > > See above for why I don't think it's actually that simple.. > >>> I was actually just thinking about having pgbackrest do exactly that. :) >>> We already have the full sha256 checksum of every WAL file in the >>> pgbackrest repository, it seems like it wouldn't be too hard to >>> calculate the checksum of the files in pg_wal and ask the repo what the >>> checksums are for those WAL files and then use the files in pg_wal if >>> they checksums match. I can already imagine the argument coming back >>> though- that'd require additional testing to ensure it's done correctly >>> and I'm really not sure that it'd end up being all that much better >>> except in some very special circumstances where there's a low bandwidth >>> connection to the repo and often a lot of WAL left over in pg_wal. >> >> I don't think it can be done in an external tool entirely, at least not >> using restore_command alone. We remove the local segment before even >> invoking restore_command, so it's too late to check checksums etc. > > Wasn't this entire discussion started because we were calling > restore_command first instead of reading from pg_wal..? Or do we > actively go wipe out pg_wal before doing that (I didn't think we did, > but I could certainly be wrong on that point). > That's how I read this part of RestoreArchivedFile: https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlogarchive.c#L110 The very first thing it does is checking if the local file exists, and if it does then calling unlink(). >> We'd need invoking some other command before restore_command, which would do >> this comparison and decide whether to use local or remote WAL. > > Ugh, that sounds like a lot of additional complication that I'm not sure > would be all that useful or sensible for this particular case, if that's > what it would require. I'd rather we try to figure out some way to get > rid of restore_command entirely instead of bolting on more things around > it. > The simpler the better of course. All I'm saying is that (assuming my understanding of RestoreArchivedFile is correct) we can't just do that in the current restore_command. We do need a way to ask the archive for some metadata/checksums, and restore_command is too late. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: