Re: Avoiding smgrimmedsync() during nbtree index builds - Mailing list pgsql-hackers
From | Andres Freund
Subject | Re: Avoiding smgrimmedsync() during nbtree index builds
Msg-id | 20220123215541.xxv54akpdcjy5klb@alap3.anarazel.de
In response to | Re: Avoiding smgrimmedsync() during nbtree index builds (Melanie Plageman <melanieplageman@gmail.com>)
Responses | Re: Avoiding smgrimmedsync() during nbtree index builds
List | pgsql-hackers
Hi,

On 2022-01-11 12:10:54 -0500, Melanie Plageman wrote:
> On Mon, Jan 10, 2022 at 5:50 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > I have attached a v3 which includes two commits -- one of which
> > implements the directmgr API and uses it and the other which adds
> > functionality to use either directmgr or bufmgr API during index build.
> >
> > Also registering for march commitfest.
>
> Forgot directmgr.h. Attached v4

Are you looking at the failures Justin pointed out? Something isn't quite
right yet. See https://postgr.es/m/20220116202559.GW14051%40telsasoft.com
and the subsequent mail about it also triggering once on linux.


> Thus, the backend must ensure that
> either the Redo pointer has not moved or that the data is fsync'd before
> freeing the page.

"freeing"?


> This is not a problem with pages written in shared buffers because the
> checkpointer will block until all buffers that were dirtied before it
> began finish before it moves the Redo pointer past their associated WAL
> entries.

> This commit makes two main changes:
>
> 1) It wraps smgrextend() and smgrwrite() in functions from a new API
> for writing data outside of shared buffers, directmgr.
>
> 2) It saves the XLOG Redo pointer location before doing the write or
> extend. It also adds an fsync request for the page to the
> checkpointer's pending-ops table. Then, after doing the write or
> extend, if the Redo pointer has moved (meaning a checkpoint has
> started since it saved it last), then the backend fsync's the page
> itself. Otherwise, it lets the checkpointer take care of fsync'ing
> the page the next time it processes the pending-ops table.
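IIUC, the scheme in 2) boils down to roughly the following. This is a
paraphrase, not the patch's actual code: the struct fields match the hunks
further down, GetRedoRecPtr() and smgrimmedsync() are existing functions,
everything else is just illustration:

    typedef struct UnBufferedWriteState
    {
        SMgrRelation smgr_rel;      /* relation being written to */
        XLogRecPtr   redo;          /* Redo pointer saved by _prep() */
    } UnBufferedWriteState;

    void
    unbuffered_prep(UnBufferedWriteState *wstate)
    {
        /* remember where the Redo pointer was before any writes happen */
        wstate->redo = GetRedoRecPtr();
    }

    void
    unbuffered_finish(UnBufferedWriteState *wstate, ForkNumber forknum)
    {
        /*
         * fsync requests for the intervening writes were queued in the
         * checkpointer's pending-ops table.  If no checkpoint started since
         * unbuffered_prep(), that's sufficient; if one did start, it may
         * have missed those requests, so sync the fork here instead.
         */
        if (wstate->redo != GetRedoRecPtr())
            smgrimmedsync(wstate->smgr_rel, forknum);
    }

I.e. the backend only falls back to syncing the fork itself when a
checkpoint could have raced with the build.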
Why combine those two into one commit?


> @@ -654,9 +657,8 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
>  	/* Now extend the file */
>  	while (vm_nblocks_now < vm_nblocks)
>  	{
> -		PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
> -
> -		smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
> +		// TODO: aren't these pages empty? why checksum them
> +		unbuffered_extend(&ub_wstate, VISIBILITYMAP_FORKNUM, vm_nblocks_now, (Page) pg.data, false);

Yea, it's a bit odd. PageSetChecksumInplace() will just return immediately:

    /* If we don't need a checksum, just return */
    if (PageIsNew(page) || !DataChecksumsEnabled())
        return;

OTOH, it seems easier to have it there than to later forget it, when e.g.
adding some actual initial content to the pages during the smgrextend().


> @@ -560,6 +562,8 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
>
>  	wstate.heap = btspool->heap;
>  	wstate.index = btspool->index;
> +	wstate.ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
> +	wstate.ub_wstate.redo = InvalidXLogRecPtr;
>  	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
>  	/* _bt_mkscankey() won't set allequalimage without metapage */
>  	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);

> @@ -656,31 +660,19 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>  		if (!wstate->btws_zeropage)
>  			wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
>  		/* don't set checksum for all-zero page */
> -		smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
> -				   wstate->btws_pages_written++,
> -				   (char *) wstate->btws_zeropage,
> -				   true);
> +		unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, wstate->btws_pages_written++, wstate->btws_zeropage, true);
>  	}

There's a bunch of long lines in here...


> -	/*
> -	 * When we WAL-logged index pages, we must nonetheless fsync index files.
> -	 * Since we're building outside shared buffers, a CHECKPOINT occurring
> -	 * during the build has no way to flush the previously written data to
> -	 * disk (indeed it won't know the index even exists).  A crash later on
> -	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
> -	 * earlier WAL entries.  If we do not fsync those pages here, they might
> -	 * still not be on disk when the crash occurs.
> -	 */
>  	if (wstate->btws_use_wal)
> -		smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> +		unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
>  }

The API of unbuffered_finish() only sometimes getting called, but
unbuffered_prep() being unconditional, strikes me as prone to bugs. Perhaps
it'd make sense to pass in whether the relation needs to be synced or not
instead?


> spgbuildempty(Relation index)
> {
> 	Page		page;
> +	UnBufferedWriteState wstate;
> +
> +	wstate.smgr_rel = RelationGetSmgr(index);
> +	unbuffered_prep(&wstate);

I don't think that's actually safe, and one of the instances could be the
cause of the bug CI is seeing:

 * Note: since a relcache flush can cause the file handle to be closed again,
 * it's unwise to hold onto the pointer returned by this function for any
 * long period.  Recommended practice is to just re-execute RelationGetSmgr
 * each time you need to access the SMgrRelation.  It's quite cheap in
 * comparison to whatever an smgr function is going to do.
 */
static inline SMgrRelation
RelationGetSmgr(Relation rel)

Greetings,

Andres Freund
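PS: spelling out the "recommended practice" bit above: re-execute
RelationGetSmgr() at the point of the smgr call instead of caching the
result in UnBufferedWriteState. Sketch only; unbuffered_extend_rel() is a
made-up name for illustration, not something in the patch:

    void
    unbuffered_extend_rel(Relation rel, ForkNumber forknum,
                          BlockNumber blocknum, Page page, bool skipFsync)
    {
        PageSetChecksumInplace(page, blocknum);

        /*
         * Re-resolve the SMgrRelation here, in case a relcache flush closed
         * the one a caller might otherwise have cached earlier.
         */
        smgrextend(RelationGetSmgr(rel), forknum, blocknum,
                   (char *) page, skipFsync);
    }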