Archiving done right - Mailing list pgsql-hackers
From | David Steele |
---|---|
Subject | Archiving done right |
Date | |
Msg-id | 55D734BC.9000602@pgmasters.net Whole thread Raw |
Responses |
Re: Archiving done right
|
List | pgsql-hackers |
I split this into a new thread to talk about archiving best practices. On 8/18/15 2:25 PM, Stephen Frost wrote: > * Stephen Frost (sfrost@snowman.net) wrote: >> * Robert Haas (robertmhaas@gmail.com) wrote: >>> On Mon, Aug 17, 2015 at 2:50 PM, Peter Eisentraut <peter_e@gmx.net> wrote: >>>> The commit message for de76884 contains some important information about >>>> the purpose and use of the new .partial WAL files. But I don't see >>>> anything about this in the documentation or another user-visible place. >>>> We should probably add something. >>> >>> +1. Heikki talked about this at his PGCon presentation, and I thought >>> that was pretty helpful stuff, but not everyone was there (or will >>> remember what he said when the time comes that they need to know). >> >> Hopefully the audio or video of it will be posted eventually.. I was >> hoping to review it myself and to discuss the points he made in depth >> with David, to make sure we cover all of them (pretty sure we do, but >> good to verify). I finally had a chance to watch Heikki's lecture on warm standbys from PGCon. It's clear to me that the "correct" archiving of WAL segments is in many ways the core issue. I've been working on this problem for quite some time in the context of my work with pgBackRest so I thought it might be good to sum up the salient points from the talk, where I've gotten with implementation, and what remains to be done. Hopefully this will provoke some comments. 1) The archiver should fsync the file before returning success I took it one step further at Andres' recommendation: a. copy the WAL segment to a temp file b. fsync the file c. rename the WAL segment to the final file name d. fsync the directory 2) The archiver should accept duplicate WAL segments Since archive_status is not copied during backups there are scenarios outside of standby where Postgres may try to archive a WAL segment that has already been archived. Heikki mentioned some time ago that a good archiver should allow duplicates without complaint (I'm not sure of the thread, but something to do with the 9.5 work). Since pgBackRest calculates SHA1 hashes for everything it handles this was pretty straight-forward to implement. The WAL file name and hash are compared against the archive - if it's an exact match then 0 is returned to Postgres. 3) The archiver should accept partial segments Here's where it gets interesting. 9.5 will label the partial WAL segment with a special extension but previous versions do not. This left a choice - rename or delete the conflicting WAL segment that was already in the archive - IF the master had already archived it. Controlled failovers are almost guaranteed to have the conflicting segment archived because of the graceful shutdown, emergency failovers somewhat less so. It really depends on the nature of the failure. One technique < 9.5 would be to accept duplicate WAL and just assign a different name (.dup1, .dup2, ...) but I decided not to do that because it might mask real issues and the delete/rename WAL segment is a one-time deal after promotion. So far I've stuck with that. Heikki's solution for 9.5 is clearly better but pgBackRest currently supports Postgres >= 8.3 so it will be many years before this can be the only solution. Anyone have ideas on what could be done here in an automated way for older versions? 4) On a possible archive gap after promotion If the master was behind on archiving before it died (a very possible scenario) then there could be a gap in archive between where the master left off and the standby began. In the current scheme, if the standby has not been archiving and then gets promoted, won't it try to push the older WAL segments because the .done files do not exist in archive_status? Or does the archiver not cross timelines? I don't have specific tests for this yet but I thought that was the behavior I had seen in the past. Of course, this is not fool-proof because some of the WAL segments in the gap may have already been recycled on the standby. As Heikki mentioned, this could be solved with pg_receivexlog running on the backup server (assuming you have one). I've resisted integrating this tool with pgBackRest but now I realize that it could work in concert with archive_command to provide a more resilient solution. 5) Which brings us to archive_mode = always I already have tests to make sure that the archiver and asynchronous archiver do not conflict and I think this can easily be extended to multiple synchronous/asynchronous archivers against a single archive. I believe it may already be "safe" but one process might receive an error depending on timing. pgBackRest has its own protocol and remote process so taking locks to prevent race conditions works either locally or remotely. If the Linux kernel is sufficiently recent (>= 2.6.12) it should work on NFS as well but I have not tested it yet. 6) And lastly, archiving to the correct archive During the talk somebody mentioned the possibility of archiving to the wrong location and mixing up two databases' WAL segments. pgBackRest prevents this by reading the xlog version and database system id out of the WAL header and comparing that to what is already in the archive. -- -David david@pgmasters.net
pgsql-hackers by date: