Re: Refactoring the checkpointer's fsync request queue - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: Refactoring the checkpointer's fsync request queue |
Date | |
Msg-id | CAEepm=0_0r4H1LLbG42XbVHHLoDnRh3JWGtj-rnvdWa4TV77Vw@mail.gmail.com Whole thread Raw |
In response to | Re: Refactoring the checkpointer's fsync request queue (Shawn Debnath <sdn@amazon.com>) |
Responses |
Re: Refactoring the checkpointer's fsync request queue
|
List | pgsql-hackers |
On Wed, Feb 13, 2019 at 3:58 PM Shawn Debnath <sdn@amazon.com> wrote: > On Wed, Jan 30, 2019 at 09:59:38PM -0800, Shawn Debnath wrote: > > I wonder if it might be better to introduce two different functions > > catering to the two different use cases for forcing an immediate sync: > > > > - sync a relation > > smgrimmedsyncrel(SMgrRelation, ForkNumber) > > - sync a specific segment > > smgrimmedsyncseg(SMgrRelation, ForkNumber, SegmentNumber) > > > > This will avoid having to specify InvalidSegmentNumber for majority of > > the callers today. > > I have gone ahead and rebased the refactor patch so it could cleanly > apply on heapam.c, see patch v7. > > I am also attaching a patch (v8) that implements smgrimmedsyncrel() and > smgrimmedsyncseg() as I mentioned in the previous email. It avoids > callers to pass in InvalidSegmentNumber when they just want to sync the > whole relation. As a side effect, I was able to get rid of some extra > checkpointer.h includes. Hi Shawn, Thanks! And sorry for not replying sooner -- I got distracted by FOSDEM (and the associated 20 thousand miles of travel). On that trip I had a chance to discuss this patch with Andres Freund in person, and he opined that it might be better for the fsync request queue to work in terms of pathnames. Instead of the approach in this patch, where a backend sends an fsync request for { reflfilenode, segno } inside mdwrite(), and then the checkpointer processes the request by calling smgrdimmedsyncrel(), he speculated that it'd be better to have mdwrite() send an fsync request for a pathname, and then the checkpointer would just open that file by name and fsync() it. That is, the checkpointer wouldn't call back into smgr. One of the advantages of that approach is that there are probably other files that need to be fsync'd for each checkpoint that could benefit from being offloaded to the checkpointer. Another is that you break the strange cycle mentioned above. Here's a possible problem with it. The fsync request messages would have to be either large (MAXPGPATH) or variable sized and potentially large. I am a bit worried that such messages would be problematic for the atomicity requirement of the (future) fd-passing patch that passes it via a Unix domain socket (which is a bit tricky because SOCK_SEQPACKET and SOCK_DGRAM aren't portable enough, so we probably have to use SOCK_STREAM, but there is no formal guarantee like PIPE_BUF; we know that in practice small messages will be atomic, but certainty decreases with larger messages. This needs more study...). You need to include the path even in a message containing an fd, because the checkpointer will use that as a hashtable key to merge received requests. Perhaps you'd solve that by using a small tag that can be converted back to a path (as you noticed my last patch had some leftover dead code from an experiment along those lines), but then I think you'd finish up needing an smgr interface to convert it back to the path (implementation different for md.c, undofile.c, slru.c). So you don't exactly break the cycle mentioned earlier. Hmm, or perhaps you could avoid even thinking about atomicity by passing 1 byte fd-bearing messages via the pipe, and pathnames via shared memory, in the same order. Another consideration if we do that is that the existing scheme has a kind of hierarchy that allows fsync requests to be cancelled in bulk when you drop relations and databases. That is, the checkpointer knows about the internal hierarchy of tablespace, db, rel, seg. If we get rid of that and have just paths, it seems like a bad idea to teach the checkpointer about the internal structure of the paths (even though we know they contain the same elements encoded somehow). You'd have to send an explicit cancel for every key; that is, if you're dropping a relation, you need to generate a cancel message for every segment, and if you're dropping a database, you need to generate a cancel message for every segment of every relation. Once again, if you used some kind of tag that is passed back to smgr, you could probably come up with a way to do it. Thoughts? -- Thomas Munro http://www.enterprisedb.com
pgsql-hackers by date: