Re: Avoiding smgrimmedsync() during nbtree index builds - Mailing list pgsql-hackers
From | Melanie Plageman |
---|---|
Subject | Re: Avoiding smgrimmedsync() during nbtree index builds |
Date | |
Msg-id | CAAKRu_ZQEpk6Q1WtNLgfXBdCmdU5xN_w0boVO6faO_Ax+ckjig@mail.gmail.com Whole thread Raw |
In response to | Re: Avoiding smgrimmedsync() during nbtree index builds (Justin Pryzby <pryzby@telsasoft.com>) |
Responses |
Re: Avoiding smgrimmedsync() during nbtree index builds
Re: Avoiding smgrimmedsync() during nbtree index builds |
List | pgsql-hackers |
On Wed, Mar 2, 2022 at 8:09 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > Rebased to appease cfbot. > > I ran these paches under a branch which shows code coverage in cirrus. It > looks good to my eyes. > https://api.cirrus-ci.com/v1/artifact/task/5212346552418304/coverage/coverage/00-index.html thanks for doing that and for the rebase! since I made updates, the attached version 6 is also rebased. To Dmitry's question: On Sun, Feb 13, 2022 at 9:33 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote: > > > On Wed, Feb 09, 2022 at 01:49:30PM -0500, Melanie Plageman wrote: > > I don't see it in the discussion, so naturally curious -- why directmgr > is not used for bloom index, e.g. in blbuildempty? thanks for pointing this out. blbuildempty() is also included now. bloom doesn't seem to use smgr* anywhere except blbuildempty(), so that is the only place I made changes in bloom index build. v6 has the following updates/changes: - removed erroneous extra calls to unbuffered_prep() and unbuffered_finish() in GiST and btree index builds - removed unnecessary includes - some comments were updated to accurately reflect use of directmgr - smgrwrite doesn't WAL-log anymore. This one I'm not sure about. I think it makes sense that unbuffered_extend() of non-empty pages of WAL-logged relations (or the init fork of unlogged relations) do log_newpage(), but I wasn't so sure about smgrwrite(). Currently all callers of smgrwrite() do log_newpage() and anyone using mdwrite() will end up writing the whole page. However, it seems possible that a caller of the directmgr API might wish to do a write to a particular offset and, either because of that, or, for some other reason, require different logging than that done in log_newpage(). I didn't want to make a separate wrapper function for WAL-logging in directmgr because it felt like one more step to forget. - heapam_relation_set_new_filenode doesn't use directmgr API anymore for unlogged relations. It doesn't extend or write, so it seemed like a special case better left alone. Note that the ambuildempty() functions which also write to the init fork of an unlogged relation still use the directmgr API. It is a bit confusing because they pass do_wal=true to unbuffered_prep() even though they are unlogged relations because they must log and fsync. - interface changes to unbuffered_prep() I removed the parameters to unbuffered_prep() which required the user to specify if fsync should be added to pendingOps or done with smgrimmedsync(). Understanding all of the combinations of these parameters and when they were needed was confusing and the interface felt like a foot gun. Special cases shouldn't use this interface. I prefer the idea that users of this API expect that 1) empty pages won't be checksummed or WAL logged 2) for relations that are WAL-logged, when the build is done, the relation will be fsync'd by the backend (unless the fsync optimization is being used) 3) the only case in which fsync requests are added to the pendingOps queue is when the fsync optimization is being used (which saves the redo pointer and check it later to determine if it needs to fsync itself) I also added the parameter "do_wal" to unbuffered_prep() and the UnBufferedWriteState struct. This is used when extending the file to determine whether or not to log_newpage(). unbuffered_extend() and unbuffered_write() no longer take do_wal as a parameter. Note that callers need to pass do_wal=true/false to unbuffered_prep() based on whether or not they want log_newpage() called during unbuffered_extend()--not simply based on whether or not the relation in question is WAL-logged. do_wal is the only member of the UnBufferedWriteState struct in the first patch in the set, but I think it makes sense to keep the struct around since I anticipate that the patch containing the other members needed for the fsync optimization will be committed at the same time. One final note on unbuffered_prep() -- I am thinking of renaming "do_optimization" to "try_optimization" or maybe "request_fsync_optimization". The interface (of unbuffered_prep()) would be better if we always tried to do the optimization when relevant (when the relation is WAL-logged). - freespace map, visimap, and hash index don't use directmgr API anymore For most use cases writing/extending outside shared buffers, it isn't safe to rely on requesting fsync from checkpointer. visimap, fsm, and hash index all request fsync from checkpointer for the pages they write with smgrextend() and don't smgrimmedsync() when finished adding pages (even when the hash index is WAL-logged). Supporting these exceptions made the interface incoherent, so I cut them. - added unbuffered_extend_range() This one is a bit unfortunate. Because GiST index build uses log_newpages() to log a whole page range but calls smgrextend() for each of those pages individually, I couldn't use the unbuffered_extend() function easily. So, I thought it might be useful in other contexts as well to have a function which calls smgrextend() for a range of pages and then calls log_newpages(). I've added this. However, there are two parts of GiST index build flush ready pages that didn't work with this either. The first is that it does an error check on the block numbers one at a time while looping through them to write the pages. To retain this check, I loop through the ready pages an extra time before calling unbuffered_extend(), which is probably not acceptable. Also, GiST needs to use a custom LSN for the pages. To achieve this, I added a "custom_lsn" parameter to unbuffered_extend_range(). This isn't great either. If this was a more common case, I could pass the custom LSN to unbuffered_prep(). - Melanie
Attachment
pgsql-hackers by date: