Re: Refactoring the checkpointer's fsync request queue - Mailing list pgsql-hackers

From Shawn Debnath
Subject Re: Refactoring the checkpointer's fsync request queue
Date
Msg-id 20190222181857.GA77583@f01898859afd.ant.amazon.com
In response to Re: Refactoring the checkpointer's fsync request queue  (Thomas Munro <thomas.munro@gmail.com>)
Responses Re: Refactoring the checkpointer's fsync request queue
List pgsql-hackers
> It's interesting that you measure performance improvements that IIUC
> can come only from dropping the bitmapset stuff (or I guess bugs).  I
> don't understand the mechanism for that improvement yet.

I will be digging into this a bit more to understand what is really 
causing the improvements. But first, I need to get the refactor patch 
into better shape :-)

> The idea of just including the segment number (in some representation,
> possibly opaque to this code) in the hash table key instead of
> carrying a segment list seems pretty good from here (and I withdraw
> the sorted vector machinery I mentioned earlier as it's redundant for
> this project)... except for one detail.  In your patch, you still have
> FORGET_RELATION_FSYNC and FORGET_DATABASE_FSYNC.  That relies on this
> sync manager code being able to understand which stuff to drop from
> the hash table, which means that it still has knowledge of the key
> hierarchy, so that it can match all entries for the relation or
> database.  That's one of the things that Andres is arguing against.

You are correct. I actually did mention having a callback to do the 
request resolution in a response to Andres back up in the thread, but 
oops, it completely slipped my mind with my last patch.

> How about we take all that sync-related stuff, that Shawn has moved
> from md.c into smgr.c, and my earlier patch had moved out of md.c into
> smgrsync.c, and we put it into a new place
> src/backend/storage/file/sync.c?  Or somewhere else, but not under
> smgr.  It doesn't know anything about smgr concepts, and it can be
> used to schedule file sync for any kind of file, not just the smgr
> implementations' files.  Though they'd be the main customers, I guess.
> md.c and undofile.c etc would call it to register stuff, and
> checkpointer.c would call it to actually perform all the fsync calls.
> 
> If we do that, the next question is how to represent filenames.  One
> idea is to use tags that can be converted back to a path.  I suppose
> there could be a table of function pointers somewhere, and the tag
> could be a discriminated union?  Or just a discriminator + a small
> array of dumb uint32_t of a size big enough for all users, a bit like
> lock tags.
> 
> One reason to want to use small fixed-sized tags is to avoid atomicity
> problems in the future when it comes to the fd-passing work, as
> mentioned up-thread.  Here are some other ideas, to avoid having to
> use tags:
> 
> * send the paths through a shm queue, but the fds through the Unix
> domain socket; the messages carrying fds somehow point to the pathname
> in the shm queue (and deal with slight disorder...)
> * send the paths through the socket, but hold an LWLock while doing so
> to make sure it's atomic, no matter what the size
> * somehow prove that it's really already atomic even for long paths,
> on every operating system we support, and that it'll never change, so
> there is no problem here
> 
> Another problem with variable sized pathnames even without the future
> fd-passing work is that it's harder to size the shm queue: the current
> code sets max_requests to NBuffers, which makes some kind of sense
> because that's a hard upper bound on the number of dirty segments
> there could possibly be at a given moment in time (one you'll
> practically never hit), after deduplication.  It's harder to come up
> with a decent size for a new variable-sized-message queue; MAXPGPATH *
> NBuffers would be insanely large (it'd be 1/8th the size of the buffer
> pool), but if you make it any smaller there is no guarantee that
> compacting it can create space.  Perhaps the solution to that is
> simply to block/wait while shm queue is full -- but that might have
> deadlock problems.

I think I have a much better understanding of what Andres is 
envisioning and agree with what Thomas has said so far. To summarize, 
we want a "sync" component at the level of the fd, which components 
higher up the chain like md, undo, slru and checkpointer will use to 
track and process fsync requests (I am refraining from putting in an 
ascii diagram here!). These checkpoint requests will be opaque to the 
sync machinery, which will rely on the requesters to provide the 
necessary details.  I agree with Thomas that passing full or variable 
pathnames is not the right way to go, for all the reasons he mentioned 
in his email.  However, if we want to, we can easily extend the 
checkpoint request in the future to include a type, 
CHECKPOINT_REQUEST_FD or CHECKPOINT_REQUEST_PATH, and make the current 
relfilenode requests type CHECKPOINT_REQUEST_RELFILENODE. Sync can 
then act on requests based on their type, and in some cases wouldn't 
need to interact with any other component.
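
To make this concrete, here is a rough sketch of what such a typed 
request could look like. Every name below is hypothetical; none of 
this is in the patch:

    /* Hypothetical sketch only -- nothing here is in the patch. */
    typedef enum CheckpointRequestType
    {
        CHECKPOINT_REQUEST_RELFILENODE,
        CHECKPOINT_REQUEST_FD,
        CHECKPOINT_REQUEST_PATH
    } CheckpointRequestType;

    typedef struct CheckpointRequest
    {
        CheckpointRequestType type;   /* discriminator */
        int         handler_id;       /* maps to a component's callbacks */
        union
        {
            struct
            {
                RelFileNode rnode;
                ForkNumber  forknum;
                BlockNumber segno;    /* segment number, not block */
            }           rel;          /* CHECKPOINT_REQUEST_RELFILENODE */
            int         fd;           /* CHECKPOINT_REQUEST_FD */
            /* PATH would need out-of-line storage, per the above */
        }           u;
    } CheckpointRequest;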

The pieces of information we need to process fsyncs are: (1) 
determining whether a request invalidates other requests the queue 
currently holds, and (2) determining the full path of the file to 
issue fsync on.
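
In callback form, that boils down to two functions per registered 
component, along these lines (signatures illustrative only):

    /* Illustrative only: one set of these per registered component. */
    typedef struct SyncCallbacks
    {
        /* does this forget request cancel the pending request? */
        bool        (*request_matches) (const CheckpointRequest *pending,
                                        const CheckpointRequest *forget);
        /* returns a palloc'd path of the file to fsync */
        char       *(*file_path) (const CheckpointRequest *req);
    } SyncCallbacks;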

I think using callbacks is a better path forward than having md or 
other components issue an invalidate request for each and every 
segment, which can get quite heavy-handed for large databases. 
Performance would be the same as today, since we already scan the 
entire hash table when we encounter a forget request. The new approach 
will add only one function call inside that loop, which does a simple 
compare.
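
For illustration, with the hypothetical callbacks above, the forget 
pass could look roughly like this:

    /*
     * Sketch: scan the pending-ops table and let each component's
     * callback decide which entries the incoming forget request
     * cancels.  The sync_callbacks array and the req/canceled fields
     * are assumed extensions of today's structures.
     */
    HASH_SEQ_STATUS seq;
    PendingOperationEntry *entry;

    hash_seq_init(&seq, pendingOpsTable);
    while ((entry = hash_seq_search(&seq)) != NULL)
    {
        const SyncCallbacks *cb = &sync_callbacks[entry->req.handler_id];

        if (cb->request_matches(&entry->req, forget))
            entry->canceled = true;  /* skip it at checkpoint time */
    }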

And Thomas brought up a good point offline: if we followed the smgr 
approach for the callbacks, it would lead to a header file dependency 
nightmare. It would be best for components like md to register their 
callback functions with sync so that sync doesn't have to include 
higher level header files to get access to their prototypes.

At smgrinit() time, mdinit() would call into sync and register its 
callbacks with an ID. We can use the same value that we are using for 
smgr_which to map the callbacks. Each fsync request will then also 
carry this ID, which the sync mechanism will use to call the handlers 
for resolving forget requests or obtaining paths for files.
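
Concretely, the registration could look something like this (names 
invented for illustration; SYNC_HANDLER_MD would mirror md's 
smgr_which value):

    /* Hypothetical registration from mdinit() */
    static const SyncCallbacks md_sync_callbacks = {
        .request_matches = mdrequestmatches,
        .file_path = mdfilepath
    };

    void
    mdinit(void)
    {
        /* ... existing md initialization ... */
        sync_register_callbacks(SYNC_HANDLER_MD, &md_sync_callbacks);
    }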

I think this should satisfy Andres' requirement of not teaching smgr 
anything about segmentation while keeping sync-related knowledge well 
below the smgr layer. This scheme works well for undo and generic 
block storage, though we might have to teach immedsync to accept block 
numbers so that undo and other storage managers can determine which 
segment a block maps to.
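
For example (assuming immedsync here means smgrimmedsync()), the 
change could be as small as this, where the second signature is 
hypothetical:

    /* today */
    void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);

    /* hypothetical: let the implementation derive the segment */
    void smgrimmedsync(SMgrRelation reln, ForkNumber forknum,
                       BlockNumber blocknum);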

Thoughts? I am going to get started on revising the patch unless I 
hear otherwise.  And I love the great feedback btw, thank you.

-- 
Shawn Debnath
Amazon Web Services (AWS)

