Re: Two fsync related performance issues? - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Two fsync related performance issues?
Msg-id: CA+hUKGK+12pnYA-rs2cG+A-CfdYWo+WXCgxugLx5+PDgoMVWVw@mail.gmail.com
In response to: Re: Two fsync related performance issues? (Craig Ringer <craig@2ndquadrant.com>)
Responses: Re: Two fsync related performance issues?
List: pgsql-hackers
On Wed, May 27, 2020 at 12:31 AM Craig Ringer <craig@2ndquadrant.com> wrote:
> On Tue, 12 May 2020, 08:42 Paul Guo, <pguo@pivotal.io> wrote:
>> 1. StartupXLOG() does fsync on the whole data directory early in the crash recovery. I'm wondering if we could skip some directories (at least the pg_log/, table directories) since wal, etc could ensure consistency. Here is the related code.
>>
>>     if (ControlFile->state != DB_SHUTDOWNED &&
>>         ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
>>     {
>>         RemoveTempXlogFiles();
>>         SyncDataDirectory();
>>     }
>
> This would actually be a good candidate for a thread pool. Dispatch sync requests and don't wait. Come back later when they're done.
>
> Unsure if that's at all feasible given that pretty much all the Pg APIs aren't thread safe though. No palloc, no elog/ereport, etc.

However I don't think we're ready to run bgworkers or use shm_mq etc at that stage. We could run auxiliary processes. I think it'd be very useful if we had a general purpose worker pool that could perform arbitrary tasks and could even replace current and future dedicated launcher and worker processes, but in this case I think the problem is quite closely linked to infrastructure that we already have. I think we should:

1. Run the checkpointer during crash recovery (see https://commitfest.postgresql.org/29/2706/).

2. In walkdir(), don't call stat() on all the files, so that we don't immediately fault in all 42 bazillion inodes synchronously on a cold system (see https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BFzxupGGN4GpUdbzZN%2Btn6FQPHo8w0Q%2BAPH5Wz8RG%2Bww%40mail.gmail.com).

3. In src/common/relpath.c, add a function ParseRelationPath() that does the opposite of GetRelationPath().

4. In datadir_fsync_fname(), if ParseRelationPath() is able to recognise a file as being a relation file, build a FileTag and call RegisterSyncRequest() to tell the checkpointer to sync this file as part of the end checkpoint (currently the end-of-recovery checkpoint, but that could also be relaxed). (A rough sketch of what 3 and 4 could look like appears at the end of this message.)

There are a couple of things I'm not sure about though, which is why I don't have a patch for 3 and 4:

1. You have to move a few things around to avoid hideous modularity violations: it'd be weird if fd.c knew how to make md.c's sync requests. So you'd need to find a new place to put the crash-recovery data-dir-sync routine, but then it can't use walkdir().

2. I don't know how to fit the pre_sync_fname() part into this scheme. Or even if you still need it: if recovery is short, it probably doesn't help much, and if it's long then the kernel is likely to have started writing back before the checkpoint anyway due to dirty writeback timer policies. On the other hand, it'd be nice to start the work of *opening* the files sooner than the start of the checkpoint, if cold inode access is slow, so perhaps a little bit more infrastructure is needed: a new way to queue a message for the checkpointer that says "hey, why don't you start presyncing stuff". On the third hand, it's arguably better to wait for more pages to get dirtied by recovery before doing any pre-sync work anyway, because the WAL will likely be redirtying the same pages again and we'd ideally not like to write our data out twice, which is one of the reasons to want to collapse the work into the next checkpoint. So I'm not sure what the best plan is here.
As I mentioned earlier, I think it's also possible to do smarter analysis based on WAL information so that we don't even need to open all 42 bazillion files, but just the ones touched since the last checkpoint, if you're prepared to ignore the previously-running-with-fsync=off scenario Robert mentioned. I'm not too sure about that. But something like the above scheme would at least get some more concurrency going, without changing the set of hazards we believe our scheme protects against (I mean, you could argue that SyncDataDirectory() is a bit like using a sledgehammer to crack an unspecified nut, and then not even quite hitting it if it's a Linux nut, but I'm trying to practise kai-zen here).
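To make points 3 and 4 a little more concrete, here is a minimal sketch, ignoring for now the modularity problem flagged above (fd.c making md.c-style sync requests directly). ParseRelationPath() is the proposed function rather than existing code, and the FileTag fields and SYNC_HANDLER_MD / RegisterSyncRequest() usage are meant to follow md.c's conventions; the details would need checking against sync.h and relpath.h. ParseRelationPath() might look something like:

    /*
     * Sketch only: roughly the inverse of GetRelationPath(), for the common
     * "base/<dbOid>/<relNode>[_<fork>][.<segno>]" layout.  A real version
     * would also handle the global/ and pg_tblspc/ layouts.
     */
    static bool
    ParseRelationPath(const char *path, RelFileNode *rnode,
                      ForkNumber *forknum, uint32 *segno)
    {
        const char *p = path;
        char       *end;

        *forknum = MAIN_FORKNUM;
        *segno = 0;

        /* SyncDataDirectory()'s walkdir() hands us paths like "./base/...". */
        if (strncmp(p, "./", 2) == 0)
            p += 2;

        if (strncmp(p, "base/", 5) != 0)
            return false;
        p += 5;

        rnode->spcNode = DEFAULTTABLESPACE_OID;
        rnode->dbNode = strtoul(p, &end, 10);
        if (end == p || *end != '/')
            return false;
        p = end + 1;

        /* Temp files ("t<backendId>_...") and non-numeric names fail here. */
        rnode->relNode = strtoul(p, &end, 10);
        if (end == p)
            return false;
        p = end;

        /* Optional fork suffix. */
        if (strncmp(p, "_fsm", 4) == 0)
        {
            *forknum = FSM_FORKNUM;
            p += 4;
        }
        else if (strncmp(p, "_vm", 3) == 0)
        {
            *forknum = VISIBILITYMAP_FORKNUM;
            p += 3;
        }
        else if (strncmp(p, "_init", 5) == 0)
        {
            *forknum = INIT_FORKNUM;
            p += 5;
        }
        else if (*p == '_')
            return false;

        /* Optional segment suffix. */
        if (*p == '.')
        {
            *segno = strtoul(p + 1, &end, 10);
            if (end == p + 1)
                return false;
            p = end;
        }

        return *p == '\0';
    }

and datadir_fsync_fname() (SyncDataDirectory()'s walkdir() callback) could then route recognised relation files to the checkpointer, falling back to a synchronous fsync for everything else:

    /*
     * Sketch only: assumes the checkpointer is already running during
     * crash recovery, per point 1 above.
     */
    static void
    datadir_fsync_fname(const char *fname, bool isdir, int elevel)
    {
        RelFileNode rnode;
        ForkNumber  forknum;
        uint32      segno;

        if (!isdir && ParseRelationPath(fname, &rnode, &forknum, &segno))
        {
            FileTag     tag;

            /* Same tag layout md.c uses for its own sync requests. */
            memset(&tag, 0, sizeof(tag));
            tag.handler = SYNC_HANDLER_MD;
            tag.forknum = forknum;
            tag.rnode = rnode;
            tag.segno = segno;

            /* Queue it; the end-of-recovery checkpoint will fsync it. */
            RegisterSyncRequest(&tag, SYNC_REQUEST, true);
            return;
        }

        /* Everything else is still synced synchronously, as today. */
        fsync_fname_ext(fname, isdir, true, elevel);
    }

Whether RegisterSyncRequest() should retry until the request is accepted (the third argument) or give up and fall back to an immediate fsync here is one of the details tied to how early the checkpointer can be running, so treat the above as illustrative only.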