Thread: Two fsync related performance issues?
Hello hackers,
1. StartupXLOG() does fsync on the whole data directory early in crash recovery. I'm wondering if we could skip some directories (at least pg_log/ and the table directories), since WAL replay etc. can ensure consistency. Here is the related code.
if (ControlFile->state != DB_SHUTDOWNED &&
    ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
{
    RemoveTempXlogFiles();
    SyncDataDirectory();
}
I have this concern because I saw an issue in a real production environment where the startup process needed 10+ seconds to start WAL replay after being relaunched due to an elog(PANIC) (this was seen on the Postgres-based product Greenplum, but it is a common issue in Postgres as well). I highly suspect the delay was mostly due to this. It is also known that on public clouds fsync is much slower than on local storage, so the slowness should be more severe in the cloud. If we could at least skip fsync on the table directories we would avoid a lot of per-file fsync calls - this could save many seconds during crash recovery.
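To make the idea concrete, here is a rough sketch of a skip check that SyncDataDirectory()'s walkdir() callback could consult. The helper and the exact list entries are hypothetical and untested; as discussed below, which directories are actually safe to skip would need careful review.

/* Hypothetical sketch, for src/backend/storage/file/fd.c */
static const char *const sync_skip_dirs[] = {
    "./pg_log",         /* server text logs; no consistency impact */
    "./pg_stat_tmp",    /* rebuilt after crash anyway */
    NULL
};

static bool
skip_fsync_prefix(const char *path)
{
    for (int i = 0; sync_skip_dirs[i] != NULL; i++)
    {
        size_t      len = strlen(sync_skip_dirs[i]);

        if (strncmp(path, sync_skip_dirs[i], len) == 0)
            return true;
    }
    return false;
}

/*
 * ... and near the top of datadir_fsync_fname():
 *
 *     if (skip_fsync_prefix(fname))
 *         return;
 */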
2. CheckPointTwoPhase()
This may be a small issue.
See the code below,
for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
    RecreateTwoPhaseFile(gxact->xid, buf, len);
RecreateTwoPhaseFile() writes a state file for a prepared transaction and fsyncs it. It might be better to write all the files first and fsync them in one pass afterwards, since the kernel can start asynchronous writeback while the file contents are being written. If TwoPhaseState->numPrepXacts is large we could batch the work to stay under the fd resource limit. I have not tested this yet, but it should speed up checkpoints/restartpoints a bit.
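To illustrate, a rough sketch of the two-pass idea. RecreateTwoPhaseFileNoSync() is a hypothetical variant that skips the per-file fsync and returns the open fd; per-xact buffers, fd-limit batching and cleanup on error are elided.

int        *fds = palloc(TwoPhaseState->numPrepXacts * sizeof(int));
int         i;

/* first pass: write all state files, letting kernel writeback start */
for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
{
    GlobalTransaction gxact = TwoPhaseState->prepXacts[i];

    /* buf/len would be per-xact, as in the existing code */
    fds[i] = RecreateTwoPhaseFileNoSync(gxact->xid, buf, len);
}

/* second pass: fsync; much of the data may already be on disk */
for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
{
    if (pg_fsync(fds[i]) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not fsync two-phase state file: %m")));
    close(fds[i]);
}
pfree(fds);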
Any thoughts?
Regards.
On 2020/05/12 9:42, Paul Guo wrote:
> Hello hackers,
>
> 1. StartupXLOG() does fsync on the whole data directory early in crash recovery. I'm wondering if we could skip some directories (at least pg_log/ and the table directories), since WAL replay etc. can ensure consistency.

I agree that we can skip the log directory, but I'm not sure skipping the table directories is really safe. Also ISTM that we can skip the directories whose contents are removed or zeroed during recovery, for example pg_snapshots, pg_subtrans, etc.

> 2. CheckPointTwoPhase()
>
> RecreateTwoPhaseFile() writes a state file for a prepared transaction and fsyncs it. It might be better to write all the files first and fsync them in one pass afterwards, since the kernel can start asynchronous writeback while the file contents are being written. If TwoPhaseState->numPrepXacts is large we could batch the work to stay under the fd resource limit. I have not tested this yet, but it should speed up checkpoints/restartpoints a bit.

It seems worth making the patch and measuring the performance improvement.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Tue, May 12, 2020 at 12:55:37PM +0900, Fujii Masao wrote:
> On 2020/05/12 9:42, Paul Guo wrote:
>> 1. StartupXLOG() does fsync on the whole data directory early in
>> crash recovery. I'm wondering if we could skip some directories
>> (at least pg_log/ and the table directories), since WAL replay
>> etc. can ensure consistency.
>
> I agree that we can skip the log directory, but I'm not sure skipping
> the table directories is really safe. Also ISTM that we can skip the
> directories whose contents are removed or zeroed during recovery,
> for example pg_snapshots, pg_subtrans, etc.

Basically excludeDirContents[] as of basebackup.c.

>> RecreateTwoPhaseFile() writes a state file for a prepared
>> transaction and fsyncs it. It might be better to write all the
>> files first and fsync them in one pass afterwards, since the kernel
>> can start asynchronous writeback while the file contents are being
>> written. If TwoPhaseState->numPrepXacts is large we could batch
>> the work to stay under the fd resource limit. I have not tested
>> this yet, but it should speed up checkpoints/restartpoints a bit.
>
> It seems worth making the patch and measuring the performance improvement.

You would need to do some micro-benchmarking here, so you could plug
in some pg_rusage_init() & co within this code path with many 2PC
files present at the same time. However, I would believe that this is
not really worth the potential code complications.
--
Michael
Thanks for the replies.
On Tue, May 12, 2020 at 2:04 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, May 12, 2020 at 12:55:37PM +0900, Fujii Masao wrote:
> On 2020/05/12 9:42, Paul Guo wrote:
>> 1. StartupXLOG() does fsync on the whole data directory early in
>> the crash recovery. I'm wondering if we could skip some
>> directories (at least the pg_log/, table directories) since wal,
>> etc could ensure consistency.
>
> I agree that we can skip log directory but I'm not sure if skipping
> table directory is really safe. Also ISTM that we can skip the directories
> that those contents are removed or zeroed during recovery,
> for example, pg_snapshots, pg_substrans, etc.
Basically excludeDirContents[] as of basebackup.c.
Table directory and WAL fsyncs probably dominate the fsync time. Do we
know of any real scenario that requires fsyncing the table directories?
Paul Guo <pguo@pivotal.io> writes:
> Table directory and WAL fsyncs probably dominate the fsync time. Do we
> know of any real scenario that requires fsyncing the table directories?

Yes, there are filesystems where that's absolutely required. See past discussions that led to putting in those fsyncs (we did not always have them).

			regards, tom lane
On Mon, May 11, 2020 at 8:43 PM Paul Guo <pguo@pivotal.io> wrote:
> I have this concern because I saw an issue in a real production environment where the startup process needed 10+ seconds to start WAL replay after being relaunched due to an elog(PANIC) [...]

I've seen this problem be way worse than that. Running fsync() on all the files and performing the unlogged table cleanup steps can together take minutes or, I think, even tens of minutes. What I think sucks most in this area is that we don't even emit any log messages if the process takes a long time, so the user has no idea why things are apparently hanging. I think we really ought to try to figure out some way to give the user a periodic progress indication when this kind of thing is underway, so that they at least have some idea what's happening.

As Tom says, I don't think there's any realistic way that we can disable it altogether, but maybe there's some way we could make it quicker, like some kind of parallelism, or by overlapping it with other things. It seems to me that we have to complete the fsync pass before we can safely checkpoint, but I don't know that it needs to be done any sooner than that... not sure though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 20, 2020 at 12:51 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I've seen this problem be way worse than that. Running fsync() on all
> the files and performing the unlogged table cleanup steps can together
> take minutes or, I think, even tens of minutes. [...]
>
> As Tom says, I don't think there's any realistic way that we can
> disable it altogether, but maybe there's some way we could make it
> quicker, like some kind of parallelism, or by overlapping it with
> other things. It seems to me that we have to complete the fsync pass
> before we can safely checkpoint, but I don't know that it needs to be
> done any sooner than that... not sure though.

I suppose you could do with the whole directory tree what register_dirty_segment() does, for the pathnames that you recognise as regular md.c segment names. Then it'd be done as part of the next checkpoint, though you might want to bring the pre_sync_fname() stuff back into it somehow to get more I/O parallelism on Linux (elsewhere it does nothing). Of course that syscall could block, and the checkpointer queue can fill up and then you have to do it synchronously anyway, so you'd have to look into whether that's enough.

The whole concept of SyncDataDirectory() bothers me a bit, though, because although it's apparently trying to be safe by being conservative, it assumes a model of writeback error handling that we now know to be false on Linux. And then it thrashes the inode cache to make it more likely that error state is forgotten, just for good measure.

What would a precise version of this look like? Maybe we really only need to fsync relation files that recovery modifies (as we already do), plus those that it would have touched but didn't because of the page LSN (a new behaviour to catch segment files that were written by the last run but not yet flushed, which I guess in practice would only happen with full_page_writes=off)? (If you were paranoid enough to believe that the buffer cache were actively trying to trick you and marked dirty pages clean and lost the error state so you'll never hear about it, you might even want to rewrite such pages once.)
On Tue, May 19, 2020 at 4:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> What would a precise version of this look like? Maybe we really only
> need to fsync relation files that recovery modifies (as we already
> do), plus those that it would have touched but didn't because of the
> page LSN [...]

I suspect there was also a worry that perhaps we'd been running before with fsync=off, or that maybe we'd just created this data directory with a non-fsyncing tool like 'cp' or 'tar -xvf'. In normal cases we shouldn't need to be nearly that conservative, but it's unclear how we can know when it's needed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 12 May 2020, 08:42 Paul Guo, <pguo@pivotal.io> wrote:
> Hello hackers,
>
> 1. StartupXLOG() does fsync on the whole data directory early in crash recovery. I'm wondering if we could skip some directories (at least pg_log/ and the table directories), since WAL replay etc. can ensure consistency. Here is the related code.
>
> if (ControlFile->state != DB_SHUTDOWNED &&
>     ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
> {
>     RemoveTempXlogFiles();
>     SyncDataDirectory();
> }
This would actually be a good candidate for a thread pool. Dispatch sync requests and don't wait. Come back later when they're done.
Unsure if that's at all feasible given that pretty much all the Pg APIs aren't thread safe though. No palloc, no elog/ereport, etc. However I don't think we're ready to run bgworkers or use shm_mq etc at that stage.
Of course if OSes would provide asynchronous IO interfaces that weren't utterly vile and broken, we wouldn't have to worry...
> RecreateTwoPhaseFile() writes a state file for a prepared transaction and fsyncs it. It might be better to write all the files first and fsync them in one pass afterwards, since the kernel can start asynchronous writeback while the file contents are being written. If TwoPhaseState->numPrepXacts is large we could batch the work to stay under the fd resource limit. I have not tested this yet, but it should speed up checkpoints/restartpoints a bit.
I seem to recall some hints we can set on a FD or mmapped range that encourage dirty buffers to be written without blocking us, too. I'll have to look them up...
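(The candidates are probably sync_file_range() and posix_fadvise() on Linux; pg_flush_data() already uses them where available. A minimal sketch of the non-blocking hint, Linux-only and untested:)

#define _GNU_SOURCE
#include <fcntl.h>

/* ask the kernel to start writeback on the whole file without waiting */
(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);

/* rougher fallback hint used elsewhere: drop the pages from cache */
(void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);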
On Wed, May 27, 2020 at 12:31 AM Craig Ringer <craig@2ndquadrant.com> wrote:
> This would actually be a good candidate for a thread pool. Dispatch sync requests and don't wait. Come back later when they're done.
>
> Unsure if that's at all feasible given that pretty much all the Pg APIs aren't thread safe though. No palloc, no elog/ereport, etc. However I don't think we're ready to run bgworkers or use shm_mq etc at that stage.

We could run auxiliary processes. I think it'd be very useful if we had a general purpose worker pool that could perform arbitrary tasks and could even replace current and future dedicated launcher and worker processes, but in this case I think the problem is quite closely linked to infrastructure that we already have. I think we should:

1. Run the checkpointer during crash recovery (see https://commitfest.postgresql.org/29/2706/).

2. In walkdir(), don't call stat() on all the files, so that we don't immediately fault in all 42 bazillion inodes synchronously on a cold system (see https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BFzxupGGN4GpUdbzZN%2Btn6FQPHo8w0Q%2BAPH5Wz8RG%2Bww%40mail.gmail.com).

3. In src/common/relpath.c, add a function ParseRelationPath() that does the opposite of GetRelationPath().

4. In datadir_fsync_fname(), if ParseRelationPath() is able to recognise a file as being a relation file, build a FileTag and call RegisterSyncRequest() to tell the checkpointer to sync this file as part of the end checkpoint (currently the end-of-recovery checkpoint, but that could also be relaxed). A sketch of this step follows below.

There are a couple of things I'm not sure about though, which is why I don't have a patch for 3 and 4:

1. You have to move a few things around to avoid hideous modularity violations: it'd be weird if fd.c knew how to make md.c's sync requests. So you'd need to find a new place to put the crash-recovery data-dir-sync routine, but then it can't use walkdir().

2. I don't know how to fit the pre_sync_fname() part into this scheme. Or even if you still need it: if recovery is short, it probably doesn't help much, and if it's long then the kernel is likely to have started writing back before the checkpoint anyway due to dirty writeback timer policies. On the other hand, it'd be nice to start the work of *opening* the files sooner than the start of the checkpoint, if cold inode access is slow, so perhaps a little bit more infrastructure is needed: a new way to queue a message for the checkpointer that says "hey, why don't you start presyncing stuff". On the third hand, it's arguably better to wait for more pages to get dirtied by recovery before doing any pre-sync work anyway, because the WAL will likely be redirtying the same pages again and we'd ideally not like to write our data out twice, which is one of the reasons to want to collapse the work into the next checkpoint. So I'm not sure what the best plan is here.

As I mentioned earlier, I think it's also possible to do smarter analysis based on WAL information so that we don't even need to open all 42 bazillion files, but just the ones touched since the last checkpoint, if you're prepared to ignore the previously-running-with-fsync=off scenario Robert mentioned. I'm not too sure about that. But something like the above scheme would at least get some more concurrency going, without changing the set of hazards we believe our scheme protects against (I mean, you could argue that SyncDataDirectory() is a bit like using a sledgehammer to crack an unspecified nut, and then not even quite hitting it if it's a Linux nut, but I'm trying to practise kai-zen here).
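For concreteness, step 4 might look roughly like this. ParseRelationPath() is the hypothetical function from step 3, this is untested, and the modularity caveat above applies (fd.c shouldn't really know about md.c's handler):

/* Sketch of a modified datadir_fsync_fname() in fd.c */
static void
datadir_fsync_fname(const char *fname, bool isdir, int elevel)
{
    RelFileNode rnode;
    ForkNumber  forknum;
    uint32      segno;

    /* hypothetical: recognise base/<db>/<rel>[.<seg>] style paths */
    if (!isdir && ParseRelationPath(fname, &rnode, &forknum, &segno))
    {
        FileTag     tag;

        memset(&tag, 0, sizeof(tag));
        tag.handler = SYNC_HANDLER_MD;
        tag.rnode = rnode;
        tag.forknum = forknum;
        tag.segno = segno;

        /* defer to the checkpointer; synced at the next checkpoint */
        (void) RegisterSyncRequest(&tag, SYNC_REQUEST, true);
        return;
    }

    /* everything else: synchronous fsync as before */
    fsync_fname_ext(fname, isdir, true, elevel);
}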
On Tue, May 12, 2020 at 12:43 PM Paul Guo <pguo@pivotal.io> wrote:
> 2. CheckPointTwoPhase()
>
> RecreateTwoPhaseFile() writes a state file for a prepared transaction and fsyncs it. It might be better to write all the files first and fsync them in one pass afterwards [...]

Hi Paul,

I hadn't previously focused on this second part of your email. I think the fsync() call in RecreateTwoPhaseFile() might be a candidate for processing by the checkpoint code through the new facilities in sync.c, which effectively does something like what you describe. Take a look at the code I'm proposing for slru.c in https://commitfest.postgresql.org/29/2669/ and also in md.c. You'd need a way to describe the path of these files in a FileTag struct, so you can defer the work; it will invoke your callback to sync the file as part of the next checkpoint (or panic if it can't). You also need to make sure to tell it to forget the sync request before you unlink the file.

Although it still has to call fsync one-by-one on the files at checkpoint time, by deferring it until then there is a good chance the kernel has already done the work so you don't have to go off-CPU at all. I hope that means we'd often never have to perform the fsync at all, because the file is usually gone before we reach the checkpoint. Do I have that right?
On Thu, Sep 3, 2020 at 11:30 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> 4. In datadir_fsync_fname(), if ParseRelationPath() is able to
> recognise a file as being a relation file, build a FileTag and call
> RegisterSyncRequest() to tell the checkpointer to sync this file as
> part of the end checkpoint (currently the end-of-recovery checkpoint,
> but that could also be relaxed).

For the record, Andres Freund mentioned a few problems with this off-list and suggested we consider calling Linux syncfs() for each top level directory that could potentially be on a different filesystem. That seems like a nice idea to look into.
On Thu, Sep 3, 2020 at 12:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, May 12, 2020 at 12:43 PM Paul Guo <pguo@pivotal.io> wrote:
>> RecreateTwoPhaseFile(gxact->xid, buf, len);
>
> I hadn't previously focused on this second part of your email. I
> think the fsync() call in RecreateTwoPhaseFile() might be a candidate
> for processing by the checkpoint code through the new facilities in
> sync.c, which effectively does something like what you describe.

I looked at this more closely and realised that I misunderstood; I was thinking of a problem like the one that was already solved years ago with commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71. Sorry for the noise.
On Wed, Sep 9, 2020 at 3:49 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> For the record, Andres Freund mentioned a few problems with this
> off-list and suggested we consider calling Linux syncfs() for each top
> level directory that could potentially be on a different filesystem.
> That seems like a nice idea to look into.

Here's an experimental patch to try that. One problem is that before Linux 5.8, syncfs() doesn't report failures[1]. I'm not sure what to think about that; in the current coding we just log them and carry on anyway, but any theoretical problems that causes for BLK_DONE should be moot anyway because of FPIs which result in more writes and syncs. Another is that it may affect other files that aren't under pgdata as collateral damage, but that seems acceptable. It also makes me a bit sad that this wouldn't help other OSes.

(Archeological note: The improved syncfs() error reporting is linked to 2018 PostgreSQL/Linux hacker discussions[2], because it was thought that syncfs() might be useful for checkpointing, though I believe since then things have moved on and the new thinking is that we'd use a new proposed interface to read per-filesystem I/O error counters while checkpointing.)

[1] https://man7.org/linux/man-pages/man2/sync.2.html
[2] https://lwn.net/Articles/752063/
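The heart of the syncfs() approach is small; roughly this shape (a sketch, not the actual patch; Linux-only, error handling simplified, and the pre-5.8 reporting caveat above applies):

#include <fcntl.h>
#include <unistd.h>     /* syncfs() is Linux-specific; needs _GNU_SOURCE */

static void
do_syncfs(const char *path)
{
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return;             /* sketch: real code would report at LOG */
    if (syncfs(fd) < 0)
        elog(LOG, "could not synchronize file system for \"%s\": %m", path);
    close(fd);
}

/* called in place of SyncDataDirectory()'s per-file walk: */
do_syncfs(".");             /* the data directory's filesystem */
do_syncfs("pg_wal");        /* pg_wal may be a symlink to another fs */
/* ... plus one call per entry under pg_tblspc/ ... */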
On Mon, Oct 5, 2020 at 2:38 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's an experimental patch to try that. One problem is that before
> Linux 5.8, syncfs() doesn't report failures[1]. [...]

... and for comparison/discussion, here is an alternative patch that figures out precisely which files need to be fsync'd using information in the WAL. On a system with full_page_writes=on, this effectively means that we don't have to do anything at all for relation files, as described in more detail in the commit message. You still need to fsync the WAL files to make sure you can't replay records from the log that were written but not yet fdatasync'd, addressed in the patch. I'm not yet sure which other kinds of special files might need special treatment.

Some thoughts:

1. Both patches avoid the need to open many files. With 1 million tables this can take over a minute even on a fast system with warm caches and/or fast local storage, before replay begins, and for a cold system with high latency storage it can be a serious problem.

2. The syncfs() one is comparatively simple, but it only works on Linux. If you used sync() (or sync(); sync()) instead, you'd be relying on non-POSIX behaviour, because POSIX says it doesn't wait for completion and indeed many non-Linux systems really are like that.

3. Though we know of kernel/filesystem pairs that can mark buffers clean while retaining the dirty contents (which would cause corruption with the current code in master, or either of these patches), fortunately those systems can't possibly run with full_page_writes=off, so such problems are handled the same way torn pages are fixed.

4. Perhaps you could set a flag in the buffer to say 'needs sync' as a way to avoid repeatedly requesting syncs for the same page, but it might not be worth the hassle involved.

Some other considerations that have been mentioned to me by colleagues I talked to about this:

1. The ResetUnloggedRelations() code still has to visit all relation files looking for init forks after a crash. But that turns out to be OK: it only reads directory entries in a straight line, and doesn't stat() or open() files with non-matching names, so unless you have very many unlogged tables the problem is already avoided. (It also calls fsync() on the result, which could perhaps be replaced with a deferred request too; not sure, that's for another day.)

2. There may be some more directories that need special fsync() treatment. SLRUs are already covered (either handled by checkpoint or they hold ephemeral data), and I think pg_tblspc changes will be redone. pg_logical? I am not sure.
On Wed, Oct 7, 2020 at 6:17 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... and for comparison/discussion, here is an alternative patch that
> figures out precisely which files need to be fsync'd using information
> in the WAL. [...]

Michael Banck reported[1] a system that spent 20 minutes in SyncDataDirectory(). His summary caused me to go back and read the discussions[2][3] that produced the current behaviour via commits 2ce439f3 and d8179b00, and I wanted to add a couple more observations about the two draft patches mentioned above.

About the need to sync files that were dirtied during a previous run:

1. The syncfs() patch has the same ignore-errors-and-press-on behaviour as d8179b00 gave us, though on Linux < 5.8 it would not even report them at LOG level.

2. The "precise" fsync() patch defers the work until after redo, but if you get errors while processing the queued syncs, you would not be able to complete the end-of-recovery checkpoint. This is correct behaviour in my opinion; any such checkpoint that is allowed to complete would be a lie, and would make the corruption permanent, releasing the WAL that was our only hope of recovering. If you want to run a so-damaged system for forensic purposes, you could always bring it up with fsync=off, or consider the idea from a nearby thread to allow the end-of-recovery checkpoint to be disabled (then the eventual first checkpoint will likely take down your system, but that's the case with the current ignore-errors-and-hope-for-the-best-after-crash coding for the *next* checkpoint, assuming your damaged filesystem continues to produce errors; it's just more principled IMHO). I recognise that this sounds like an absolutist argument that might attract some complaints on practical grounds, but I don't think it really makes too much difference in practice. Consider a typical Linux filesystem: individual errors aren't going to be reported more than once, and full_page_writes must be on on such a system, so we'll be writing out all affected pages again and then trying to fsync again in the end-of-recovery checkpoint. So despite our attempt at creating a please-ignore-errors-and-corrupt-my-database-and-play-on mode, you'll likely panic again if I/O errors persist, or survive and continue without corruption if the error-producing conditions were fleeting and transient (as in the thin provisioning EIO or NFS ENOSPC conditions discussed in other threads).

About the need to fsync everything in sight on a system that previously ran with fsync=off, as discussed in the earlier threads:

1. The syncfs() patch provides about the same weak guarantee as the current coding. Something like: it can convert all checkpoints that were logged in the time since the kernel started from fiction to fact, except those corrupted by (unlikely) I/O errors, which may only be reported in the kernel logs, if at all.

2. The "precise" fsync() patch provides no such weak guarantee. It takes the last checkpoint at face value, and can't help you with anything that happened before that.

The problem I have with this is that the current coding *only does it for crash scenarios*. So, if you're moving a system from fsync=off to fsync=on with a clean shutdown in between, we already don't help you. Effectively, you need to run sync(1) (but see man page for caveats and kernel logs for errors) to convert your earlier checkpoints from fiction to fact. So all we're discussing here is what we do with a system that crashed. Why is that a sane time to transition from fsync=off to fsync=on, and, supposing someone does that, why should we offer any more guarantees about that than we do when you make the same transition on a system that shut down cleanly?

[1] https://www.postgresql.org/message-id/flat/4a5d233fcd28b5f1008aec79119b02b5a9357600.camel%40credativ.de
[2] https://www.postgresql.org/message-id/flat/20150114105908.GK5245%40awork2.anarazel.de#1525fab691dbaaef35108016f0b99467
[3] https://www.postgresql.org/message-id/flat/20150523172627.GA24277%40msg.df7cb.de
Hi,

On Wednesday, 2020-10-07 at 18:17 +1300, Thomas Munro wrote:
> ... and for comparison/discussion, here is an alternative patch that
> figures out precisely which files need to be fsync'd using information
> in the WAL. [...]
>
> Some thoughts:
>
> 1. Both patches avoid the need to open many files. With 1 million
> tables this can take over a minute even on a fast system with warm
> caches and/or fast local storage, before replay begins, and for a cold
> system with high latency storage it can be a serious problem.

You mention "serious problem" and "over a minute", but I don't recall you mentioning how long it takes with those two patches (or maybe I missed it), so here goes: I created an instance with 250 databases on 250 tablespaces on SAN storage. This is on 12.4; the patches can be backpatched with minimal changes.

After pg_ctl stop -m immediate, a pg_ctl start -w (or rather the time between the two log messages "database system was interrupted; last known up at %s" and "database system was not properly shut down; automatic recovery in progress") took:

1. 12-13 seconds on vanilla
2. usually < 10 ms, sometimes 70-80 ms with the syncfs patch
3. 4 ms with the optimized sync patch

That's a dramatic improvement, but maybe also a best case scenario, as no traffic happened since the last checkpoint. I did some light pgbench before killing the server again, but couldn't get the optimized sync patch to take any longer; might try harder at some point but won't have much more time to test today.

Michael

--
Michael Banck
Projektleiter / Senior Berater
credativ GmbH
Hi,

On Wednesday, 2020-10-07 at 18:17 +1300, Thomas Munro wrote:
> ... and for comparison/discussion, here is an alternative patch that
> figures out precisely which files need to be fsync'd using information
> in the WAL.

One question about this: Did you consider the case of a basebackup being copied/restored somewhere and the restore/PITR being started? Shouldn't Postgres then sync the whole data directory first in order to assure durability, or do we consider this to be on the tool that does the copying? Or is this not needed somehow?

My understanding is that Postgres would go through the same code path during PITR:

    if (ControlFile->state != DB_SHUTDOWNED &&
        ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
    {
        RemoveTempXlogFiles();
        SyncDataDirectory();

If I didn't miss anything, that would be a point for the syncfs patch?

Michael

--
Michael Banck
Projektleiter / Senior Berater
credativ GmbH
On Wed, Oct 14, 2020 at 12:53 AM Michael Banck <michael.banck@credativ.de> wrote:
> One question about this: Did you consider the case of a basebackup being
> copied/restored somewhere and the restore/PITR being started? Shouldn't
> Postgres then sync the whole data directory first in order to assure
> durability, or do we consider this to be on the tool that does the
> copying? Or is this not needed somehow?

To go with precise fsyncs, we'd have to say that it's the job of the creator of the secondary copy. Unfortunately that's not a terribly convenient thing to do (or at least the details vary).

[The devil's advocate enters the chat] Let me ask you this: If you copy the pgdata directory of a system that has shut down cleanly, for example with cp or rsync as described in the manual, who will sync the files before you start up the cluster? Not us, anyway, because SyncDataDirectory() only runs after a crash. A checkpoint is, after all, a promise that all changes up to some LSN are durably on disk, and we leave it up to people who are copying files around underneath us while we're not running to worry about what happens if you make that untrue. Now, why is a database that crashed any different?

> My understanding is that Postgres would go through the same code path
> during PITR:
>
>     if (ControlFile->state != DB_SHUTDOWNED &&
>         ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
>     {
>         RemoveTempXlogFiles();
>         SyncDataDirectory();
>
> If I didn't miss anything, that would be a point for the syncfs patch?

Yeah.
On Wed, Oct 14, 2020 at 02:48:18PM +1300, Thomas Munro wrote:
> On Wed, Oct 14, 2020 at 12:53 AM Michael Banck
> <michael.banck@credativ.de> wrote:
>> One question about this: Did you consider the case of a basebackup being
>> copied/restored somewhere and the restore/PITR being started? Shouldn't
>> Postgres then sync the whole data directory first in order to assure
>> durability, or do we consider this to be on the tool that does the
>> copying? Or is this not needed somehow?
>
> To go with precise fsyncs, we'd have to say that it's the job of the
> creator of the secondary copy. Unfortunately that's not a terribly
> convenient thing to do (or at least the details vary).

Yeah, it is safer to assume that it is the responsibility of the backup tool to ensure that, because it could be possible that a host is unplugged just after taking a backup, and having Postgres do this work at the beginning of archive recovery would not help in most cases. IMO this comes back to the point where we usually should not care much how long a backup takes as long as it is done right. Users care much more about how long a restore takes until consistency is reached. And this is in line with things that have been done via bc34223b or 96a7128.
--
Michael
Hi,

On Wednesday, 2020-10-14 at 14:06 +0900, Michael Paquier wrote:
> Yeah, it is safer to assume that it is the responsibility of the
> backup tool to ensure that, because it could be possible that a host is
> unplugged just after taking a backup, and having Postgres do this work
> at the beginning of archive recovery would not help in most cases.
> IMO this comes back to the point where we usually should not care much
> how long a backup takes as long as it is done right. Users care much
> more about how long a restore takes until consistency is reached. And
> this is in line with things that have been done via bc34223b or
> 96a7128.

I agree that the backup tool should make sure the backup is durable and this is out of scope. I was worried more about the restore part. Right now, https://www.postgresql.org/docs/13/continuous-archiving.html#BACKUP-PITR-RECOVERY says this in point 4:

|Restore the database files from your file system backup. Be sure that
|they are restored with the right ownership (the database system user,
|not root!) and with the right permissions. If you are using
|tablespaces, you should verify that the symbolic links in pg_tblspc/
|were correctly restored.

There's no word of running sync afterwards or otherwise making sure that everything is back in a durable fashion. Currently, we run fsync on all the files on startup anyway during a PITR, but with the second patch, that will no longer happen. Maybe that is not a problem, but if that's the case, it's at least not clear to me.

Also, Tom seemed to imply up-thread in 3750.1589826415@sss.pgh.pa.us that syncing the files was necessary, but maybe I'm reading too much into his rather short mail.

Michael

--
Michael Banck
Projektleiter / Senior Berater
credativ GmbH
On Wed, 14 Oct 2020, 13:06 Michael Paquier, <michael@paquier.xyz> wrote:
> On Wed, Oct 14, 2020 at 02:48:18PM +1300, Thomas Munro wrote:
>> On Wed, Oct 14, 2020 at 12:53 AM Michael Banck
>> <michael.banck@credativ.de> wrote:
>>> One question about this: Did you consider the case of a basebackup being
>>> copied/restored somewhere and the restore/PITR being started? Shouldn't
>>> Postgres then sync the whole data directory first in order to assure
>>> durability, or do we consider this to be on the tool that does the
>>> copying? Or is this not needed somehow?
>>
>> To go with precise fsyncs, we'd have to say that it's the job of the
>> creator of the secondary copy. Unfortunately that's not a terribly
>> convenient thing to do (or at least the details vary).
>
> Yeah, it is safer to assume that it is the responsibility of the
> backup tool to ensure that, because it could be possible that a host is
> unplugged just after taking a backup, and having Postgres do this work
> at the beginning of archive recovery would not help in most cases.
Let's document that assumption in the docs for pg_basebackup and the file system copy based replica creation docs. With a reference to initdb's datadir sync option.
> IMO this comes back to the point where we usually should not care much
> how long a backup takes as long as it is done right. Users care much
> more about how long a restore takes until consistency is reached. And
> this is in line with things that have been done via bc34223b or
> 96a7128.
On Tue, Dec 01, 2020 at 07:39:30PM +0800, Craig Ringer wrote:
> On Wed, 14 Oct 2020, 13:06 Michael Paquier, <michael@paquier.xyz> wrote:
>> Yeah, it is safer to assume that it is the responsibility of the
>> backup tool to ensure that, because it could be possible that a host is
>> unplugged just after taking a backup, and having Postgres do this work
>> at the beginning of archive recovery would not help in most cases.
>
> Let's document that assumption in the docs for pg_basebackup and the file
> system copy based replica creation docs. With a reference to initdb's
> datadir sync option.

Do you have any suggestion about that, in the shape of a patch perhaps?
--
Michael
On Wed, 2 Dec 2020 at 15:41, Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Dec 01, 2020 at 07:39:30PM +0800, Craig Ringer wrote:
>> On Wed, 14 Oct 2020, 13:06 Michael Paquier, <michael@paquier.xyz> wrote:
>>> Yeah, it is safer to assume that it is the responsibility of the
>>> backup tool to ensure that, because it could be possible that a host is
>>> unplugged just after taking a backup, and having Postgres do this work
>>> at the beginning of archive recovery would not help in most cases.
>>
>> Let's document that assumption in the docs for pg_basebackup and the file
>> system copy based replica creation docs. With a reference to initdb's
>> datadir sync option.
>
> Do you have any suggestion about that, in the shape of a patch perhaps?
I'll try to get to that, but I have some other docs patches outstanding that I'd like to resolve first.