Re: Avoiding smgrimmedsync() during nbtree index builds - Mailing list pgsql-hackers
From: Melanie Plageman
Subject: Re: Avoiding smgrimmedsync() during nbtree index builds
Date:
Msg-id: CAAKRu_aw72w70X1P=ba20K8iGUvSkyz7Yk03wPPh3f9WgmcJ3g@mail.gmail.com
In response to: Re: Avoiding smgrimmedsync() during nbtree index builds (Melanie Plageman <melanieplageman@gmail.com>)
Responses: Re: Avoiding smgrimmedsync() during nbtree index builds
List: pgsql-hackers
On Mon, May 3, 2021 at 5:24 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> On Thu, Jan 21, 2021 at 5:51 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2021-01-21 23:54:04 +0200, Heikki Linnakangas wrote:
> > > On 21/01/2021 22:36, Andres Freund wrote:
> > > >
> > > > Thinking through the correctness of replacing smgrimmedsync() with sync
> > > > requests, the potential problems that I can see are:
> > > >
> > > > 1) redo point falls between the log_newpage() and the
> > > > write()/register_dirty_segment() in smgrextend/smgrwrite.
> > > > 2) redo point falls between write() and register_dirty_segment()
> > > >
> > > > But both of these are fine in the context of initially filling a newly
> > > > created relfilenode, as far as I can tell? Otherwise the current
> > > > smgrimmedsync() approach wouldn't be safe either, as far as I can tell?
> > >
> > > Hmm. If the redo point falls between write() and the
> > > register_dirty_segment(), and the checkpointer finishes the whole
> > > checkpoint before register_dirty_segment(), you are not safe. That can't
> > > happen with write from the buffer manager, because the checkpointer would
> > > block waiting for the flush of the buffer to finish.
> >
> > Hm, right.
> >
> > The easiest way to address that race would be to just record the redo
> > pointer in _bt_leafbuild() and continue to do the smgrimmedsync if a
> > checkpoint started since the start of the index build.
> >
> > Another approach would be to utilize PGPROC.delayChkpt, but I would
> > rather not unnecessarily expand the use of that.
> >
> > It's kind of interesting - in my aio branch I moved the
> > register_dirty_segment() to before the actual asynchronous write (due to
> > availability of the necessary data), which ought to be safe because of
> > the buffer interlocking. But that doesn't work here, or for other places
> > doing writes without going through s_b. It'd be great if we could come
> > up with a general solution, but I don't immediately see anything great.
> >
> > The best I can come up with is adding helper functions to wrap some of
> > the complexity for "unbuffered" writes of doing an immedsync iff the
> > redo pointer changed. Something very roughly like
> >
> > typedef struct UnbufferedWriteState { XLogRecPtr redo; uint64 numwrites;} UnbufferedWriteState;
> > void unbuffered_prep(UnbufferedWriteState* state);
> > void unbuffered_write(UnbufferedWriteState* state, ...);
> > void unbuffered_extend(UnbufferedWriteState* state, ...);
> > void unbuffered_finish(UnbufferedWriteState* state);
> >
> > which wouldn't just do the dance to avoid the immedsync() if possible,
> > but also took care of PageSetChecksumInplace() (and PageEncryptInplace()
> > if we get that [1]).
>
> Regarding the implementation, I think having an API to do these
> "unbuffered" or "direct" writes outside of shared buffers is a good
> idea. In this specific case, the proposed API would not change the code
> much. I would just wrap the small diffs I added to the beginning and end
> of _bt_load() in the API calls for unbuffered_prep() and
> unbuffered_finish() and then tuck away the second half of
> _bt_blwritepage() in unbuffered_write()/unbuffered_extend(). I figured I
> would do so after ensuring the correctness of the logic in this patch.
> Then I will work on a patch which implements the unbuffered_write() API
> and demonstrates its utility with at least a few of the most compelling
> use cases in the code.
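To make the bracketing described at the end of the quoted mail concrete, here is a minimal caller-side sketch. It follows the prototypes Andres sketches above, but the extra smgr/fork/block arguments and the helper name build_unbuffered() are illustrative assumptions, not the attached patch:

/*
 * Hypothetical caller-side sketch only: the argument lists are assumed,
 * since Andres's mail leaves them as "...".
 */
#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/bufpage.h"
#include "storage/smgr.h"

typedef struct UnbufferedWriteState
{
    XLogRecPtr  redo;           /* redo pointer recorded at prep time */
} UnbufferedWriteState;

/* assumed argument lists for illustration */
extern void unbuffered_prep(UnbufferedWriteState *state);
extern void unbuffered_extend(UnbufferedWriteState *state, SMgrRelation smgrrel,
                              ForkNumber forknum, BlockNumber blkno, Page page);
extern void unbuffered_finish(UnbufferedWriteState *state, SMgrRelation smgrrel,
                              ForkNumber forknum);

static void
build_unbuffered(SMgrRelation smgrrel, Page *pages, BlockNumber npages)
{
    UnbufferedWriteState wstate;
    BlockNumber blkno;

    /* record the redo pointer before writing anything outside shared buffers */
    unbuffered_prep(&wstate);

    for (blkno = 0; blkno < npages; blkno++)
    {
        /*
         * The wrapper would set the page checksum and call smgrextend()
         * with skipFsync = false, so a sync request is registered with
         * the checkpointer for each segment touched.
         */
        unbuffered_extend(&wstate, smgrrel, MAIN_FORKNUM, blkno, pages[blkno]);
    }

    /* fall back to smgrimmedsync() only if a checkpoint began in between */
    unbuffered_finish(&wstate, smgrrel, MAIN_FORKNUM);
}

The point of routing everything through the wrappers is that the sync requests registered by the skipFsync = false writes make the expensive smgrimmedsync() necessary only when the redo pointer has moved.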
I've taken a pass at writing the API for "direct" or "unbuffered" writes
and extends. It introduces the suggested functions: unbuffered_prep(),
unbuffered_finish(), unbuffered_write(), and unbuffered_extend(). This is
a rough cut -- corrections welcome and encouraged!

unbuffered_prep() saves the xlog redo pointer at the time it is called.
Then, if the redo pointer hasn't changed by the time unbuffered_finish()
is called, the backend can avoid calling smgrimmedsync(). Note that this
only works if intervening calls to smgrwrite() and smgrextend() pass
skipFsync = false.

unbuffered_write() and unbuffered_extend() might be usable even where
unbuffered_prep() and unbuffered_finish() are not -- for example, hash
indexes do something I don't entirely understand: they call smgrextend()
directly when allocating buckets but then initialize the new bucket pages
using the bufmgr machinery.

I also intend to move accounting of pages written and extended into the
unbuffered_write() and unbuffered_extend() functions, using the functions
I propose in [1] to populate a new view, pg_stat_buffers. Then this
"unbuffered" IO would be counted in stats.

I picked the name "direct" for the directory in src/backend/storage
because I thought these functions are analogous to direct IO in Linux --
in that they do IO without going through the Postgres bufmgr --
unPGbuffered, basically. Other suggestions were "raw" and "relIO". "Raw"
seemed confusing, since raw device IO is pretty far from what is
happening here. "RelIO" didn't seem (to me) like it belonged next to
bufmgr. However, "direct" and "unbuffered" will both soon become fraught
terminology with the introduction of AIO and direct IO to Postgres...

- Melanie

[1] https://www.postgresql.org/message-id/flat/20200124195226.lth52iydq2n2uilq%40alap3.anarazel.de
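As a concrete illustration of the prep/finish bookkeeping described above, here is a minimal hypothetical sketch. It assumes GetRedoRecPtr() and smgrimmedsync() as the underlying primitives and repeats the UnbufferedWriteState declaration so it stands alone; the attached patch may differ in detail:

#include "postgres.h"
#include "access/xlog.h"
#include "storage/smgr.h"

/* as declared in the earlier sketch */
typedef struct UnbufferedWriteState
{
    XLogRecPtr  redo;           /* redo pointer recorded at prep time */
} UnbufferedWriteState;

void
unbuffered_prep(UnbufferedWriteState *state)
{
    /* remember the last established redo pointer before any unbuffered IO */
    state->redo = GetRedoRecPtr();
}

void
unbuffered_finish(UnbufferedWriteState *state, SMgrRelation smgrrel,
                  ForkNumber forknum)
{
    /*
     * If the redo pointer moved, a checkpoint began after unbuffered_prep()
     * and may have finished without seeing the sync requests registered by
     * our smgrwrite()/smgrextend() calls (made with skipFsync = false), so
     * sync immediately.  Otherwise those pending requests are sufficient.
     */
    if (GetRedoRecPtr() != state->redo)
        smgrimmedsync(smgrrel, forknum);
}

With this scheme, the common case of no checkpoint during the build costs only a redo-pointer comparison, while the race Heikki describes still falls back to an immediate sync.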
Attachment