Re: Avoiding smgrimmedsync() during nbtree index builds - Mailing list pgsql-hackers
From:           Andres Freund
Subject:        Re: Avoiding smgrimmedsync() during nbtree index builds
Date:
Msg-id:         20220310183251.3h4qokwp5leit34j@alap3.anarazel.de
In response to: Re: Avoiding smgrimmedsync() during nbtree index builds (Melanie Plageman <melanieplageman@gmail.com>)
Responses:      Re: Avoiding smgrimmedsync() during nbtree index builds
List:           pgsql-hackers
Hi,

> From a06407b19c8d168ea966e45c0e483b38d49ddc12 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 4 Mar 2022 14:48:39 -0500
> Subject: [PATCH v6 1/4] Add unbuffered IO API

I think this or one of the following patches should update
src/backend/access/transam/README

> @@ -164,6 +164,16 @@ void
>  blbuildempty(Relation index)
>  {
>      Page        metapage;
> +    UnBufferedWriteState wstate;
> +
> +    /*
> +     * Though this is an unlogged relation, pass do_wal=true since the init
> +     * fork of an unlogged index must be wal-logged and fsync'd. This currently
> +     * has no effect, as unbuffered_write() does not do log_newpage()
> +     * internally. However, were this to be replaced with unbuffered_extend(),
> +     * do_wal must be true to ensure the data is logged and fsync'd.
> +     */
> +    unbuffered_prep(&wstate, true);

Wonder if unbuffered_write should have an assert checking that no
writes to INIT_FORKNUM are non-durable? Looks like it's pretty easy to
forget...
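Something along these lines at the top of unbuffered_write() /
unbuffered_extend() might do. Untested sketch, and it assumes do_wal is
what ends up making the write durable in this API:

    /*
     * Writes bypassing shared buffers to the init fork must always be
     * durable, since the init fork is what gets copied to reset the
     * relation during crash recovery.
     */
    Assert(forknum != INIT_FORKNUM || wstate->do_wal);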
I'd choose unbuffered_begin over _prep().

>      /* Construct metapage. */
>      metapage = (Page) palloc(BLCKSZ);

> @@ -176,18 +186,13 @@ blbuildempty(Relation index)
>       * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we need
>       * this even when wal_level=minimal.
>       */
> -    PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
> -    smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
> -              (char *) metapage, true);
> +    unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
> +            BLOOM_METAPAGE_BLKNO, metapage);
> +
>      log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
>                  BLOOM_METAPAGE_BLKNO, metapage, true);
>
> -    /*
> -     * An immediate sync is required even if we xlog'd the page, because the
> -     * write did not go through shared_buffers and therefore a concurrent
> -     * checkpoint may have moved the redo pointer past our xlog record.
> -     */
> -    smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
> +    unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
>  }

I mildly prefer complete over finish, but ...

> - * We can't use the normal heap_insert function to insert into the new
> - * heap, because heap_insert overwrites the visibility information.
> - * We use a special-purpose raw_heap_insert function instead, which
> - * is optimized for bulk inserting a lot of tuples, knowing that we have
> - * exclusive access to the heap. raw_heap_insert builds new pages in
> - * local storage. When a page is full, or at the end of the process,
> - * we insert it to WAL as a single record and then write it to disk
> - * directly through smgr. Note, however, that any data sent to the new
> - * heap's TOAST table will go through the normal bufmgr.
> + * We can't use the normal heap_insert function to insert into the new heap,
> + * because heap_insert overwrites the visibility information. We use a
> + * special-purpose raw_heap_insert function instead, which is optimized for
> + * bulk inserting a lot of tuples, knowing that we have exclusive access to the
> + * heap. raw_heap_insert builds new pages in local storage. When a page is
> + * full, or at the end of the process, we insert it to WAL as a single record
> + * and then write it to disk directly through directmgr. Note, however, that
> + * any data sent to the new heap's TOAST table will go through the normal
> + * bufmgr.

Part of this just reflows existing lines that seem otherwise unchanged,
making it harder to see the actual change.

> @@ -643,13 +644,6 @@ _bt_blnewpage(uint32 level)
>  static void
>  _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>  {
> -    /* XLOG stuff */
> -    if (wstate->btws_use_wal)
> -    {
> -        /* We use the XLOG_FPI record type for this */
> -        log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
> -    }
> -
>      /*
>       * If we have to write pages nonsequentially, fill in the space with
>       * zeroes until we come back and overwrite. This is not logically
> @@ -661,33 +655,33 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>      {
>          if (!wstate->btws_zeropage)
>              wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
> -        /* don't set checksum for all-zero page */
> -        smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
> -                   wstate->btws_pages_written++,
> -                   (char *) wstate->btws_zeropage,
> -                   true);
> +
> +        unbuffered_extend(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
> +                MAIN_FORKNUM, wstate->btws_pages_written++,
> +                wstate->btws_zeropage, true);
>      }

For a bit I thought the true argument to unbuffered_extend was about
durability or registering fsyncs. Perhaps worth making it a flags
argument, with an enum for the flag values?
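E.g. something like the following. The names are made up, just to
illustrate the shape, not a concrete proposal:

typedef enum UnBufferedExtendFlags
{
    UNBUFFERED_EXTEND_EMPTY_PAGE = 1 << 0,  /* skip checksum and WAL logging */
} UnBufferedExtendFlags;

extern void unbuffered_extend(UnBufferedWriteState *wstate,
                              SMgrRelation smgrrel, ForkNumber forknum,
                              BlockNumber blocknum, Page page, int flags);

which would make call sites like the above self-documenting:

    unbuffered_extend(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
                      MAIN_FORKNUM, wstate->btws_pages_written++,
                      wstate->btws_zeropage, UNBUFFERED_EXTEND_EMPTY_PAGE);

instead of a bare true.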
> diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
> new file mode 100644
> index 0000000000..42c37daa7a
> --- /dev/null
> +++ b/src/backend/storage/direct/directmgr.c
> @@ -0,0 +1,98 @@

Now that the API is called unbuffered, the filename / directory seem
confusing.

> +void
> +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal)
> +{
> +    wstate->do_wal = do_wal;
> +}
> +
> +void
> +unbuffered_extend(UnBufferedWriteState *wstate, SMgrRelation
> +        smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page, bool
> +        empty)
> +{
> +    /*
> +     * Don't checksum empty pages
> +     */
> +    if (!empty)
> +        PageSetChecksumInplace(page, blocknum);
> +
> +    smgrextend(smgrrel, forknum, blocknum, (char *) page, true);
> +
> +    /*
> +     * Don't WAL-log empty pages
> +     */
> +    if (!empty && wstate->do_wal)
> +        log_newpage(&(smgrrel)->smgr_rnode.node, forknum,
> +                blocknum, page, true);
> +}
> +
> +void
> +unbuffered_extend_range(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
> +        ForkNumber forknum, int num_pages, BlockNumber *blocknums, Page *pages,
> +        bool empty, XLogRecPtr custom_lsn)
> +{
> +    for (int i = 0; i < num_pages; i++)
> +    {
> +        Page        page = pages[i];
> +        BlockNumber blkno = blocknums[i];
> +
> +        if (!XLogRecPtrIsInvalid(custom_lsn))
> +            PageSetLSN(page, custom_lsn);
> +
> +        if (!empty)
> +            PageSetChecksumInplace(page, blkno);
> +
> +        smgrextend(smgrrel, forknum, blkno, (char *) page, true);
> +    }
> +
> +    if (!empty && wstate->do_wal)
> +        log_newpages(&(smgrrel)->smgr_rnode.node, forknum, num_pages,
> +                blocknums, pages, true);
> +}
> +
> +void
> +unbuffered_write(UnBufferedWriteState *wstate, SMgrRelation smgrrel, ForkNumber
> +        forknum, BlockNumber blocknum, Page page)
> +{
> +    PageSetChecksumInplace(page, blocknum);
> +
> +    smgrwrite(smgrrel, forknum, blocknum, (char *) page, true);
> +}

Seems like several of these should have some minimal documentation?
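For unbuffered_extend() that could be something along these lines. Just a
sketch of the wording, based on my reading of the patch:

/*
 * unbuffered_extend - write out a new page to the end of a relation fork,
 * bypassing shared buffers.
 *
 * Checksums the page and, if do_wal was passed to unbuffered_prep(),
 * WAL-logs it, unless 'empty' is true, in which case both are skipped.
 * Callers must call unbuffered_finish() afterwards to ensure the data
 * reaches disk.
 */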
> +/*
> + * When writing data outside shared buffers, a concurrent CHECKPOINT can move
> + * the redo pointer past our WAL entries and won't flush our data to disk. If
> + * the database crashes before the data makes it to disk, our WAL won't be
> + * replayed and the data will be lost.
> + * Thus, if a CHECKPOINT begins between unbuffered_prep() and
> + * unbuffered_finish(), the backend must fsync the data itself.
> + */

Hm. The last sentence sounds like the fsync happens conditionally, but at
this point in the series it doesn't.

> +typedef struct UnBufferedWriteState
> +{
> +    /*
> +     * When writing WAL-logged relation data outside of shared buffers, there
> +     * is a risk of a concurrent CHECKPOINT moving the redo pointer past the
> +     * data's associated WAL entries. To avoid this, callers in this situation
> +     * must fsync the pages they have written themselves. This is necessary
> +     * only if the relation is WAL-logged or in special cases such as the init
> +     * fork of an unlogged index.
> +     */
> +    bool        do_wal;
> +} UnBufferedWriteState;
> +/*
> + * prototypes for functions in directmgr.c
> + */

Missing newline between the end of the struct and the comment.

> +extern void
> +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal);

In headers we don't put the return type on a separate line :/

> --- a/contrib/bloom/blinsert.c
> +++ b/contrib/bloom/blinsert.c
> @@ -173,7 +173,7 @@ blbuildempty(Relation index)
>       * internally. However, were this to be replaced with unbuffered_extend(),
>       * do_wal must be true to ensure the data is logged and fsync'd.
>       */
> -    unbuffered_prep(&wstate, true);
> +    unbuffered_prep(&wstate, true, false);

This makes me think this really should be a flags argument... I also
don't like the current name of the parameter: "do_optimization" doesn't
explain much.

> +bool RedoRecPtrChanged(XLogRecPtr comparator_ptr)
> +{

Newline after the return type.

>  void
> -unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal)
> +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal, bool
> +        do_optimization)

See earlier comments about documentation and parameter naming...

> + * These callers can optionally use the following optimization: attempt to
> + * use the sync request queue and fall back to fsync'ing the pages
> + * themselves if the Redo pointer moves between the start and finish of
> + * their write. In order to do this, they must set do_optimization to true
> + * so that the redo pointer is saved before the write begins.
>   */

When do we not want this?

> From 17fb22142ade65fdbe8c90889e49d0be60ba45e4 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 4 Mar 2022 15:53:05 -0500
> Subject: [PATCH v6 3/4] BTree index use unbuffered IO optimization
>
> While building a btree index, the backend can avoid fsync'ing all of the
> pages if it uses the optimization introduced in a prior commit.
>
> This can substantially improve performance when many indexes are being
> built during DDL operations.
> ---
>  src/backend/access/nbtree/nbtree.c  | 2 +-
>  src/backend/access/nbtree/nbtsort.c | 6 +++++-
>  2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
> index 6b78acefbe..fc5cce4603 100644
> --- a/src/backend/access/nbtree/nbtree.c
> +++ b/src/backend/access/nbtree/nbtree.c
> @@ -161,7 +161,7 @@ btbuildempty(Relation index)
>       * internally. However, were this to be replaced with unbuffered_extend(),
>       * do_wal must be true to ensure the data is logged and fsync'd.
>       */
> -    unbuffered_prep(&wstate, true, false);
> +    unbuffered_prep(&wstate, true, true);
>
>      /* Construct metapage. */
>      metapage = (Page) palloc(BLCKSZ);
> diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
> index d6d0d4b361..f1b9e2e24e 100644
> --- a/src/backend/access/nbtree/nbtsort.c
> +++ b/src/backend/access/nbtree/nbtsort.c
> @@ -1189,7 +1189,11 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
>      int64       tuples_done = 0;
>      bool        deduplicate;
>
> -    unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, false);
> +    /*
> +     * The fsync optimization done by directmgr is only relevant if
> +     * WAL-logging, so pass btws_use_wal for this parameter.
> +     */
> +    unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, wstate->btws_use_wal);
>
>      deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
>          BTGetDeduplicateItems(wstate->index);

Why just here?

> From 377c195bccf2dd2529e64d0d453104485f7662b7 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 4 Mar 2022 15:52:45 -0500
> Subject: [PATCH v6 4/4] Use shared buffers when possible for index build
>
> When there are not too many tuples, building the index in shared buffers
> makes sense. It allows the buffer manager to handle how best to do the
> IO.
> ---

Perhaps it'd be worth making this an independent patch that could be
applied separately?

>  src/backend/access/nbtree/nbtree.c  |  32 ++--
>  src/backend/access/nbtree/nbtsort.c | 273 +++++++++++++++++++++-------
>  2 files changed, 223 insertions(+), 82 deletions(-)
>
> diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
> index fc5cce4603..d3982b9388 100644
> --- a/src/backend/access/nbtree/nbtree.c
> +++ b/src/backend/access/nbtree/nbtree.c
> @@ -152,34 +152,24 @@ void
>  btbuildempty(Relation index)
>  {
>      Page        metapage;
> -    UnBufferedWriteState wstate;
> +    Buffer      metabuf;
>
>      /*
> -     * Though this is an unlogged relation, pass do_wal=true since the init
> -     * fork of an unlogged index must be wal-logged and fsync'd. This currently
> -     * has no effect, as unbuffered_write() does not do log_newpage()
> -     * internally. However, were this to be replaced with unbuffered_extend(),
> -     * do_wal must be true to ensure the data is logged and fsync'd.
> +     * Allocate a buffer for metapage and initialize metapage.
>       */
> -    unbuffered_prep(&wstate, true, true);
> -
> -    /* Construct metapage. */
> -    metapage = (Page) palloc(BLCKSZ);
> +    metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK,
> +            NULL);
> +    metapage = BufferGetPage(metabuf);
>      _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
>
>      /*
> -     * Write the page and log it. It might seem that an immediate sync would
> -     * be sufficient to guarantee that the file exists on disk, but recovery
> -     * itself might remove it while replaying, for example, an
> -     * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record. Therefore, we need
> -     * this even when wal_level=minimal.
> +     * Mark metapage buffer as dirty and XLOG it
>       */
> -    unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
> -            BTREE_METAPAGE, metapage);
> -    log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
> -            BTREE_METAPAGE, metapage, true);
> -
> -    unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
> +    START_CRIT_SECTION();
> +    MarkBufferDirty(metabuf);
> +    log_newpage_buffer(metabuf, true);
> +    END_CRIT_SECTION();
> +    _bt_relbuf(index, metabuf);
>  }

I don't understand why this patch changes btbuildempty()? That data is
never accessed again, so it doesn't really seem beneficial to shuffle it
through shared buffers, even if we benefit from using s_b for the main
fork...

> +    /*
> +     * If not using shared buffers, for a WAL-logged relation, save the redo
> +     * pointer location in case a checkpoint begins during the index build.
> +     */
> +    if (wstate->_bt_bl_unbuffered_prep)
> +        wstate->_bt_bl_unbuffered_prep(&wstate->ub_wstate,
> +                wstate->btws_use_wal, wstate->btws_use_wal);

Code would probably look cleaner if an empty callback were used when no
_bt_bl_unbuffered_prep callback is needed.
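I.e. install a no-op callback for the in-shared-buffers case, something
like (hypothetical name, signature taken from the call site above):

static void
_bt_unbuffered_prep_noop(UnBufferedWriteState *wstate, bool do_wal,
                         bool do_optimization)
{
    /* nothing to prepare when building in shared buffers */
}

so that the call site becomes an unconditional

    wstate->_bt_bl_unbuffered_prep(&wstate->ub_wstate,
            wstate->btws_use_wal, wstate->btws_use_wal);

without the NULL check.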
>  /*
> - * allocate workspace for a new, clean btree page, not linked to any siblings.
> + * Set up workspace for a new, clean btree page, not linked to any siblings.
> + * Caller must allocate the passed in page.

More interesting bit seems to be whether the passed in page needs to be
initialized?

> @@ -1154,20 +1285,24 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
>           * back one slot. Then we can dump out the page.
>           */
>          _bt_slideleft(s->btps_page);
> -        _bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
> +        wstate->_bt_blwritepage(wstate, s->btps_page, s->btps_blkno, s->btps_buf);
> +        s->btps_buf = InvalidBuffer;
>          s->btps_page = NULL;    /* writepage freed the workspace */
>      }

Do we really have to add underscores even to struct members? That just
looks wrong.

Greetings,

Andres Freund