Re: Two fsync related performance issues? - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Two fsync related performance issues?
Date:
Msg-id: CA+hUKGLi20-oSkTFS_uc3fuWxrJ=-X=zt7rYaeCiJKGsauG=xg@mail.gmail.com
In response to: Re: Two fsync related performance issues? (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Two fsync related performance issues?
List: pgsql-hackers
On Wed, May 20, 2020 at 12:51 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, May 11, 2020 at 8:43 PM Paul Guo <pguo@pivotal.io> wrote:
> > I have this concern since I saw an issue in a real product environment
> > that the startup process needs 10+ seconds to start wal replay after
> > relaunch due to elog(PANIC) (it was seen on the postgres-based product
> > Greenplum but it is a common issue in postgres also). I highly suspect
> > the delay was mostly due to this. Also it is noticed that on public
> > clouds fsync is much slower than that on local storage, so the slowness
> > should be more severe on cloud. If we at least disable fsync on the
> > table directories we could skip a lot of file fsyncs - this may save a
> > lot of seconds during crash recovery.
>
> I've seen this problem be way worse than that. Running fsync() on all
> the files and performing the unlogged table cleanup steps can together
> take minutes or, I think, even tens of minutes. What I think sucks
> most in this area is that we don't even emit any log messages if the
> process takes a long time, so the user has no idea why things are
> apparently hanging. I think we really ought to try to figure out some
> way to give the user a periodic progress indication when this kind of
> thing is underway, so that they at least have some idea what's
> happening.
>
> As Tom says, I don't think there's any realistic way that we can
> disable it altogether, but maybe there's some way we could make it
> quicker, like some kind of parallelism, or by overlapping it with
> other things. It seems to me that we have to complete the fsync pass
> before we can safely checkpoint, but I don't know that it needs to be
> done any sooner than that... not sure though.

I suppose you could do for the whole directory tree what register_dirty_segment() does, for the pathnames that you recognise as regular md.c segment names.
Then it'd be done as part of the next checkpoint, though you might want to bring the pre_sync_fname() stuff back into it somehow to get more I/O parallelism on Linux (elsewhere it does nothing). Of course that syscall could block, and the checkpointer queue can fill up and then you have to do it synchronously anyway, so you'd have to look into whether that's enough.

The whole concept of SyncDataDirectory() bothers me a bit though, because although it's apparently trying to be safe by being conservative, it assumes a model of write-back error handling that we now know to be false on Linux. And then it thrashes the inode cache to make it more likely that error state is forgotten, just for good measure.

What would a precise version of this look like? Maybe we really only need to fsync relation files that recovery modifies (as we already do), plus those that it would have touched but didn't because of the page LSN (a new behaviour to catch segment files that were written by the last run but not yet flushed, which I guess in practice would only happen with full_page_writes=off)? (If you were paranoid enough to believe that the buffer cache were actively trying to trick you and had marked dirty pages clean and lost the error state so you'll never hear about it, you might even want to rewrite such pages once.)
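For illustration, here is a rough standalone sketch of the segment-name test such a directory scan would need before registering a file with the checkpointer. The function name and the fork-suffix handling below are made up for the example, not taken from md.c:

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/*
 * Illustrative sketch (not PostgreSQL source): decide whether a file
 * name looks like a regular md.c relation segment, i.e. a relfilenode
 * number, an optional fork suffix (_fsm, _vm, _init), and an optional
 * ".<segno>" extension, e.g. "16384", "16384.1", "16384_fsm".
 */
static bool
looks_like_md_segment(const char *name)
{
    const char *p = name;

    /* Leading relfilenode: one or more digits. */
    if (!isdigit((unsigned char) *p))
        return false;
    while (isdigit((unsigned char) *p))
        p++;

    /* Optional fork suffix. */
    if (*p == '_')
    {
        if (strncmp(p, "_fsm", 4) == 0)
            p += 4;
        else if (strncmp(p, "_init", 5) == 0)
            p += 5;
        else if (strncmp(p, "_vm", 3) == 0)
            p += 3;
        else
            return false;
    }

    /* Optional segment number: ".<digits>". */
    if (*p == '.')
    {
        p++;
        if (!isdigit((unsigned char) *p))
            return false;
        while (isdigit((unsigned char) *p))
            p++;
    }

    /* Anything left over means it's not a plain segment name. */
    return *p == '\0';
}
```

Anything this test rejects (pg_filenode.map, pg_internal.init, etc.) would still have to be fsync'd directly, so it only moves the bulk relation-file work onto the checkpointer.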