Re: Refactoring the checkpointer's fsync request queue - Mailing list pgsql-hackers
From: Shawn Debnath
Subject: Re: Refactoring the checkpointer's fsync request queue
Msg-id: 20190222181857.GA77583@f01898859afd.ant.amazon.com
In response to: Re: Refactoring the checkpointer's fsync request queue (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
> It's interesting that you measure performance improvements that IIUC
> can come only from dropping the bitmapset stuff (or I guess bugs). I
> don't understand the mechanism for that improvement yet.

I will be digging into this a bit more to understand what is really
causing the improvements. But first, I need to get the refactor patch
into better shape :-)

> The idea of just including the segment number (in some representation,
> possibly opaque to this code) in the hash table key instead of
> carrying a segment list seems pretty good from here (and I withdraw
> the sorted vector machinery I mentioned earlier as it's redundant for
> this project)... except for one detail. In your patch, you still have
> FORGET_RELATION_FSYNC and FORGET_DATABASE_FSYNC. That relies on this
> sync manager code being able to understand which stuff to drop from
> the hash table, which means that it still has knowledge of the key
> hierarchy, so that it can match all entries for the relation or
> database. That's one of the things that Andres is arguing against.

You are correct. I actually did mention having a callback do the
request resolution in a response to Andres earlier in the thread, but
oops, it completely slipped my mind with my last patch.

> How about we take all that sync-related stuff, that Shawn has moved
> from md.c into smgr.c, and my earlier patch had moved out of md.c into
> smgrsync.c, and we put it into a new place
> src/backend/storage/file/sync.c? Or somewhere else, but not under
> smgr. It doesn't know anything about smgr concepts, and it can be
> used to schedule file sync for any kind of file, not just the smgr
> implementations' files. Though they'd be the main customers, I guess.
> md.c and undofile.c etc would call it to register stuff, and
> checkpointer.c would call it to actually perform all the fsync calls.
>
> If we do that, the next question is how to represent filenames. One
> idea is to use tags that can be converted back to a path. I suppose
> there could be a table of function pointers somewhere, and the tag
> could be a discriminated union? Or just a discriminator + a small
> array of dumb uint32_t of a size big enough for all users, a bit like
> lock tags.
>
> One reason to want to use small fixed-sized tags is to avoid atomicity
> problems in the future when it comes to the fd-passing work, as
> mentioned up-thread. Here are some other ideas, to avoid having to
> use tags:
>
> * send the paths through a shm queue, but the fds through the Unix
> domain socket; the messages carrying fds somehow point to the pathname
> in the shm queue (and deal with slight disorder...)
> * send the paths through the socket, but hold an LWLock while doing so
> to make sure it's atomic, no matter what the size
> * somehow prove that it's really already atomic even for long paths,
> on every operating system we support, and that it'll never change, so
> there is no problem here
>
> Another problem with variable sized pathnames even without the future
> fd-passing work is that it's harder to size the shm queue: the current
> code sets max_requests to NBuffers, which makes some kind of sense
> because that's a hard upper bound on the number of dirty segments
> there could possibly be at a given moment in time (one you'll
> practically never hit), after deduplication. It's harder to come up
> with a decent size for a new variable-sized-message queue; MAXPGPATH *
> NBuffers would be insanely large (it'd be 1/8th the size of the buffer
> pool), but if you make it any smaller there is no guarantee that
> compacting it can create space. Perhaps the solution to that is
> simply to block/wait while the shm queue is full -- but that might
> have deadlock problems.

I think I now have a much better understanding of what Andres is
envisioning, and I agree with what Thomas has said so far. To
summarize, we want a "sync" component at the level of fds that
components higher up the chain, like md, undo, slru, and the
checkpointer, will use to track and process fsync requests (I am
refraining from putting an ASCII diagram here!). These checkpoint
requests will be opaque to the sync machinery, which will rely on the
requesters to provide the necessary details.

I agree with Thomas that passing full or variable-length pathnames is
not the right way to go, for all the reasons he mentioned in his
email. However, if we want to, we can easily extend the checkpoint
request in the future to include a type, CHECKPOINT_REQUEST_FD or
CHECKPOINT_REQUEST_PATH, and classify the current relfilenode requests
as CHECKPOINT_REQUEST_RELFILENODE. Sync could then act on each request
based on its type, and in some cases wouldn't need to interact with
any other component.
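To make that concrete, here is a rough sketch of what such a
fixed-size, self-describing tag could look like. Every name below is
hypothetical (nothing here is from an actual patch); the payload
follows the "discriminator + a small array of dumb uint32_t" idea,
and it assumes postgres.h for the int16/uint32 typedefs:

/*
 * Hypothetical sketch only.  A small fixed-size tag, a bit like a lock
 * tag: fixed size means it can be written to the checkpointer's
 * shared-memory queue atomically and used directly as a hash table key.
 */
typedef enum SyncRequestType
{
    CHECKPOINT_REQUEST_RELFILENODE, /* today's md.c segment requests */
    CHECKPOINT_REQUEST_FD,          /* possible future: fd-passing */
    CHECKPOINT_REQUEST_PATH         /* possible future: path-based */
} SyncRequestType;

typedef struct SyncRequestTag
{
    int16       handler;    /* ID of the requester (md, undo, slru, ...);
                             * used by sync to look up its callbacks */
    int16       type;       /* SyncRequestType */
    uint32      payload[4]; /* opaque to sync; e.g. spcNode, dbNode,
                             * relNode and segment number for
                             * CHECKPOINT_REQUEST_RELFILENODE */
} SyncRequestTag;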
To process fsyncs, we need two pieces of information: (1) whether a
request invalidates other requests the queue currently holds, and (2)
the full path of the file to issue the fsync on. I think callbacks are
a better path forward than having md or other components issue an
invalidate request for each and every segment, which could get quite
heavy-handed for large databases. Performance would be the same as
today, since we already scan the entire hash table when we encounter a
forget request; the new approach merely adds one function call inside
that loop to do a simple compare.

Thomas also brought up a good point offline: if we followed the smgr
model for the callbacks, it would lead to a header file dependency
nightmare. It would be better for components like md to register their
callback functions with sync, so that sync doesn't have to include
higher-level header files to get at their prototypes. At smgrinit()
time, mdinit() would call into sync and register its callbacks under
an ID; we can reuse the same value we already use for smgr_which to
map to the callbacks. Each fsync request would then carry this ID,
which the sync machinery would use to find the handlers for resolving
forget requests or obtaining file paths. I think this satisfies
Andres' requirement of not teaching smgr anything about segmentation,
while keeping sync-related knowledge well below the smgr layer.
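For concreteness, registration could look roughly like this. Again,
all names here (SyncHandlerOps, sync_register_handler, mdfilepath,
mdsyncmatches) are hypothetical and only illustrate the shape:

/*
 * Hypothetical sketch only.  Components hand sync a table of callbacks
 * at init time, so sync.c never needs md.c or undofile.c headers.
 */
typedef struct SyncHandlerOps
{
    /* resolve a tag to a path (buffer of MAXPGPATH bytes) to fsync */
    void        (*filepath) (const SyncRequestTag *tag, char *path);

    /* does this forget request cancel that pending request? */
    bool        (*matches) (const SyncRequestTag *forget,
                            const SyncRequestTag *pending);
} SyncHandlerOps;

extern void sync_register_handler(int which, const SyncHandlerOps *ops);

/* in md.c, reusing its smgr_which value (0) as the handler ID */
static void mdfilepath(const SyncRequestTag *tag, char *path);
static bool mdsyncmatches(const SyncRequestTag *forget,
                          const SyncRequestTag *pending);

static const SyncHandlerOps md_sync_ops = {
    mdfilepath,
    mdsyncmatches
};

void
mdinit(void)
{
    sync_register_handler(0, &md_sync_ops);
    /* ... existing mdinit() work ... */
}

With something like this in place, handling a forget request stays a
single sequential scan of the pending-operations hash table, with sync
calling matches() once per entry instead of interpreting the key
itself.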
This scheme works well for undo and generic block storage, though we
might have to teach immedsync to accept block numbers so that undo and
other storage managers can determine which segment a request maps to.

Thoughts? I am going to get started on revising the patch unless I
hear otherwise. And btw, love the great feedback. Thank you.

-- 
Shawn Debnath
Amazon Web Services (AWS)