Re: Two fsync related performance issues? - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Two fsync related performance issues?
Date
Msg-id CA+hUKGK+12pnYA-rs2cG+A-CfdYWo+WXCgxugLx5+PDgoMVWVw@mail.gmail.com
Whole thread Raw
In response to Re: Two fsync related performance issues?  (Craig Ringer <craig@2ndquadrant.com>)
Responses Re: Two fsync related performance issues?
List pgsql-hackers
On Wed, May 27, 2020 at 12:31 AM Craig Ringer <craig@2ndquadrant.com> wrote:
> On Tue, 12 May 2020, 08:42 Paul Guo, <pguo@pivotal.io> wrote:
>> 1. StartupXLOG() does fsync on the whole data directory early in the crash recovery. I'm wondering if we could skip
somedirectories (at least the pg_log/, table directories) since wal, etc could ensure consistency. Here is the related
code.
>>
>>       if (ControlFile->state != DB_SHUTDOWNED &&
>>           ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
>>       {
>>           RemoveTempXlogFiles();
>>           SyncDataDirectory();
>>       }
>
> This would actually be a good candidate for a thread pool. Dispatch sync requests and don't wait. Come back later
whenthey're done.
 
>
> Unsure if that's at all feasible given that pretty much all the Pg APIs aren't thread safe though. No palloc, no
elog/ereport,etc. However I don't think we're ready to run bgworkers or use shm_mq etc at that stage.
 

We could run auxiliary processes.  I think it'd be very useful if we
had a general purpose worker pool that could perform arbitrary tasks
and could even replace current and future dedicated launcher and
worker processes, but in this case I think the problem is quite
closely linked to infrastructure that we already have.  I think we
should:

1.  Run the checkpointer during crash recovery (see
https://commitfest.postgresql.org/29/2706/).
2.  In walkdir(), don't call stat() on all the files, so that we don't
immediately fault in all 42 bazillion inodes synchronously on a cold
system (see
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BFzxupGGN4GpUdbzZN%2Btn6FQPHo8w0Q%2BAPH5Wz8RG%2Bww%40mail.gmail.com).
3.  In src/common/relpath.c, add a function ParseRelationPath() that
does the opposite of GetRelationPath().
4.  In datadir_fsync_fname(), if ParseRelationPath() is able to
recognise a file as being a relation file, build a FileTag and call
RegisterSyncRequest() to tell the checkpointer to sync this file as
part of the end checkpoint (currently the end-of-recovery checkpoint,
but that could also be relaxed).

There are a couple of things I'm not sure about though, which is why I
don't have a patch for 3 and 4:
1.  You have to move a few things around to avoid hideous modularity
violations: it'd be weird if fd.c knew how to make md.c's sync
requests.  So you'd need to find a new place to put the crash-recovery
data-dir-sync routine, but then it can't use walkdir().
2.  I don't know how to fit the pre_sync_fname() part into this
scheme.  Or even if you still need it: if recovery is short, it
probably doesn't help much, and if it's long then the kernel is likely
to have started writing back before the checkpoint anyway due to dirty
writeback timer policies.  On the other hand, it'd be nice to start
the work of *opening* the files sooner than the the start of the
checkpoint, if cold inode access is slow, so perhaps a little bit more
infrastructure is needed; a new way to queue a message for the
checkpointer that says "hey, why don't you start presyncing stuff".
On the third hand, it's arguably better to wait for more pages to get
dirtied by recovery before doing any pre-sync work anyway, because the
WAL will likely be redirtying the same pages again we'd ideally not
like to write our data out twice, which is one of the reasons to want
to collapse the work into the next checkpoint.  So I'm not sure what
the best plan is here.

As I mentioned earlier, I think it's also possible to do smarter
analysis based on WAL information so that we don't even need to open
all 42 bazillion files, but just the ones touched since the last
checkpoint, if you're prepared to ignore the
previously-running-with-fsync=off scenario Robert mentioned.  I'm not
too sure about that.  But something like the above scheme would at
least get some more concurrency going, without changing the set of
hazards we believe our scheme protects against (I mean, you could
argue that SyncDataDirectory() is a bit like using a sledgehammer to
crack an unspecified nut, and then not even quite hitting it if it's a
Linux nut, but I'm trying to practise kai-zen here).



pgsql-hackers by date:

Previous
From: Peter Geoghegan
Date:
Subject: Re: [NBTREE] Possible NULL pointer dereference (backend/access/nbtree/nbutils.c)
Next
From: Tatsuro Yamada
Date:
Subject: Re: list of extended statistics on psql