
From Andres Freund
Subject Re: Avoiding smgrimmedsync() during nbtree index builds
Date
Msg-id 20220310183251.3h4qokwp5leit34j@alap3.anarazel.de
In response to Re: Avoiding smgrimmedsync() during nbtree index builds  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: Avoiding smgrimmedsync() during nbtree index builds  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
Hi,

> From a06407b19c8d168ea966e45c0e483b38d49ddc12 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 4 Mar 2022 14:48:39 -0500
> Subject: [PATCH v6 1/4] Add unbuffered IO API

I think this or one of the following patches should update src/backend/access/transam/README


> @@ -164,6 +164,16 @@ void
>  blbuildempty(Relation index)
>  {
>      Page        metapage;
> +    UnBufferedWriteState wstate;
> +
> +    /*
> +     * Though this is an unlogged relation, pass do_wal=true since the init
> +     * fork of an unlogged index must be wal-logged and fsync'd. This currently
> +     * has no effect, as unbuffered_write() does not do log_newpage()
> +     * internally. However, were this to be replaced with unbuffered_extend(),
> +     * do_wal must be true to ensure the data is logged and fsync'd.
> +     */
> +    unbuffered_prep(&wstate, true);

Wonder if unbuffered_write should have an assert checking that no writes to
INIT_FORKNUM are non-durable? Looks like it's pretty easy to forget...
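
Something like the below, as a quick sketch (assuming do_wal is the relevant
durability marker here):

    Assert(forknum != INIT_FORKNUM || wstate->do_wal);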

I'd choose unbuffered_begin over _prep().


>      /* Construct metapage. */
>      metapage = (Page) palloc(BLCKSZ);
> @@ -176,18 +186,13 @@ blbuildempty(Relation index)
>       * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
>       * this even when wal_level=minimal.
>       */
> -    PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
> -    smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
> -              (char *) metapage, true);
> +    unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
> +            BLOOM_METAPAGE_BLKNO, metapage);
> +
>      log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
>                  BLOOM_METAPAGE_BLKNO, metapage, true);
>  
> -    /*
> -     * An immediate sync is required even if we xlog'd the page, because the
> -     * write did not go through shared_buffers and therefore a concurrent
> -     * checkpoint may have moved the redo pointer past our xlog record.
> -     */
> -    smgrimmedsync(RelationGetSmgr(index), INIT_FORKNUM);
> +    unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
>  }

I mildly prefer complete over finish, but ...



> - * We can't use the normal heap_insert function to insert into the new
> - * heap, because heap_insert overwrites the visibility information.
> - * We use a special-purpose raw_heap_insert function instead, which
> - * is optimized for bulk inserting a lot of tuples, knowing that we have
> - * exclusive access to the heap.  raw_heap_insert builds new pages in
> - * local storage.  When a page is full, or at the end of the process,
> - * we insert it to WAL as a single record and then write it to disk
> - * directly through smgr.  Note, however, that any data sent to the new
> - * heap's TOAST table will go through the normal bufmgr.
> + * We can't use the normal heap_insert function to insert into the new heap,
> + * because heap_insert overwrites the visibility information. We use a
> + * special-purpose raw_heap_insert function instead, which is optimized for
> + * bulk inserting a lot of tuples, knowing that we have exclusive access to the
> + * heap.  raw_heap_insert builds new pages in local storage.  When a page is
> + * full, or at the end of the process, we insert it to WAL as a single record
> + * and then write it to disk directly through directmgr.  Note, however, that
> + * any data sent to the new heap's TOAST table will go through the normal
> + * bufmgr.

Part of this just reflows existing lines that seem otherwise unchanged, making
it harder to see the actual change.



> @@ -643,13 +644,6 @@ _bt_blnewpage(uint32 level)
>  static void
>  _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>  {
> -    /* XLOG stuff */
> -    if (wstate->btws_use_wal)
> -    {
> -        /* We use the XLOG_FPI record type for this */
> -        log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
> -    }
> -
>      /*
>       * If we have to write pages nonsequentially, fill in the space with
>       * zeroes until we come back and overwrite.  This is not logically
> @@ -661,33 +655,33 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>      {
>          if (!wstate->btws_zeropage)
>              wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
> -        /* don't set checksum for all-zero page */
> -        smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
> -                   wstate->btws_pages_written++,
> -                   (char *) wstate->btws_zeropage,
> -                   true);
> +
> +        unbuffered_extend(&wstate->ub_wstate, RelationGetSmgr(wstate->index),
> +                MAIN_FORKNUM, wstate->btws_pages_written++,
> +                wstate->btws_zeropage, true);
>      }

For a bit I thought the true argument to unbuffered_extend was about
durability or registering fsyncs. Perhaps worth making it a flags argument,
with an enum defining the individual flags?
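
E.g. something like this sketch (names invented):

    typedef enum UnbufferedWriteFlags
    {
        UNBUFFERED_WRITE_EMPTY = 1 << 0, /* all-zero page: skip checksum and WAL */
    } UnbufferedWriteFlags;

    extern void unbuffered_extend(UnBufferedWriteState *wstate,
                                  SMgrRelation smgrrel, ForkNumber forknum,
                                  BlockNumber blocknum, Page page, int flags);

That way call sites read unbuffered_extend(..., UNBUFFERED_WRITE_EMPTY)
instead of a bare true.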


> diff --git a/src/backend/storage/direct/directmgr.c b/src/backend/storage/direct/directmgr.c
> new file mode 100644
> index 0000000000..42c37daa7a
> --- /dev/null
> +++ b/src/backend/storage/direct/directmgr.c
> @@ -0,0 +1,98 @@

Now that the API is called unbuffered, the filename / directory seem
confusing.


> +void
> +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal)
> +{
> +    wstate->do_wal = do_wal;
> +}
> +
> +void
> +unbuffered_extend(UnBufferedWriteState *wstate, SMgrRelation
> +        smgrrel, ForkNumber forknum, BlockNumber blocknum, Page page, bool
> +        empty)
> +{
> +    /*
> +     * Don't checksum empty pages
> +     */
> +    if (!empty)
> +        PageSetChecksumInplace(page, blocknum);
> +
> +    smgrextend(smgrrel, forknum, blocknum, (char *) page, true);
> +
> +    /*
> +     * Don't WAL-log empty pages
> +     */
> +    if (!empty && wstate->do_wal)
> +        log_newpage(&(smgrrel)->smgr_rnode.node, forknum,
> +                    blocknum, page, true);
> +}
> +
> +void
> +unbuffered_extend_range(UnBufferedWriteState *wstate, SMgrRelation smgrrel,
> +        ForkNumber forknum, int num_pages, BlockNumber *blocknums, Page *pages,
> +        bool empty, XLogRecPtr custom_lsn)
> +{
> +    for (int i = 0; i < num_pages; i++)
> +    {
> +        Page        page = pages[i];
> +        BlockNumber blkno = blocknums[i];
> +
> +        if (!XLogRecPtrIsInvalid(custom_lsn))
> +            PageSetLSN(page, custom_lsn);
> +
> +        if (!empty)
> +            PageSetChecksumInplace(page, blkno);
> +
> +        smgrextend(smgrrel, forknum, blkno, (char *) page, true);
> +    }
> +
> +    if (!empty && wstate->do_wal)
> +        log_newpages(&(smgrrel)->smgr_rnode.node, forknum, num_pages,
> +                blocknums, pages, true);
> +}
> +
> +void
> +unbuffered_write(UnBufferedWriteState *wstate, SMgrRelation smgrrel, ForkNumber
> +        forknum, BlockNumber blocknum, Page page)
> +{
> +    PageSetChecksumInplace(page, blocknum);
> +
> +    smgrwrite(smgrrel, forknum, blocknum, (char *) page, true);
> +}

Seems like several of these should have some minimal documentation?
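
E.g. for unbuffered_write(), something roughly like (wording obviously up for
debate):

    /*
     * unbuffered_write
     *        Checksum and write out a page via smgr, bypassing shared buffers.
     *
     * The caller is responsible for WAL-logging the page where needed and for
     * calling unbuffered_finish() afterwards to guarantee durability.
     */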


> +/*
> + * When writing data outside shared buffers, a concurrent CHECKPOINT can move
> + * the redo pointer past our WAL entries and won't flush our data to disk. If
> + * the database crashes before the data makes it to disk, our WAL won't be
> + * replayed and the data will be lost.
> + * Thus, if a CHECKPOINT begins between unbuffered_prep() and
> + * unbuffered_finish(), the backend must fsync the data itself.
> + */

Hm. The last sentence sounds like the fsync happens only conditionally, but
at this point it's unconditional.



> +typedef struct UnBufferedWriteState
> +{
> +    /*
> +     * When writing WAL-logged relation data outside of shared buffers, there
> +     * is a risk of a concurrent CHECKPOINT moving the redo pointer past the
> +     * data's associated WAL entries. To avoid this, callers in this situation
> +     * must fsync the pages they have written themselves. This is necessary
> +     * only if the relation is WAL-logged or in special cases such as the init
> +     * fork of an unlogged index.
> +     */
> +    bool do_wal;
> +} UnBufferedWriteState;
> +/*
> + * prototypes for functions in directmgr.c
> + */

Missing newline between the end of the struct and the comment.

> +extern void
> +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal);

In headers we don't put the return type on a separate line :/
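
I.e. just:

    extern void unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal);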





> --- a/contrib/bloom/blinsert.c
> +++ b/contrib/bloom/blinsert.c
> @@ -173,7 +173,7 @@ blbuildempty(Relation index)
>       * internally. However, were this to be replaced with unbuffered_extend(),
>       * do_wal must be true to ensure the data is logged and fsync'd.
>       */
> -    unbuffered_prep(&wstate, true);
> +    unbuffered_prep(&wstate, true, false);

This makes me think this really should be a flag argument...

I also don't like the current name of the parameter: "do_optimization"
doesn't explain much.


> +bool RedoRecPtrChanged(XLogRecPtr comparator_ptr)
> +{

Missing newline after the return type.
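
I.e.:

    bool
    RedoRecPtrChanged(XLogRecPtr comparator_ptr)
    {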

>  void
> -unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal)
> +unbuffered_prep(UnBufferedWriteState *wstate, bool do_wal, bool
> +        do_optimization)

See earlier comments about documentation and parameter naming...


> +     * These callers can optionally use the following optimization: attempt to
> +     * use the sync request queue and fall back to fsync'ing the pages
> +     * themselves if the Redo pointer moves between the start and finish of
> +     * their write. In order to do this, they must set do_optimization to true
> +     * so that the redo pointer is saved before the write begins.
>       */

When do we not want this?



> From 17fb22142ade65fdbe8c90889e49d0be60ba45e4 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 4 Mar 2022 15:53:05 -0500
> Subject: [PATCH v6 3/4] BTree index use unbuffered IO optimization
> 
> While building a btree index, the backend can avoid fsync'ing all of the
> pages if it uses the optimization introduced in a prior commit.
> 
> This can substantially improve performance when many indexes are being
> built during DDL operations.
> ---
>  src/backend/access/nbtree/nbtree.c  | 2 +-
>  src/backend/access/nbtree/nbtsort.c | 6 +++++-
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
> index 6b78acefbe..fc5cce4603 100644
> --- a/src/backend/access/nbtree/nbtree.c
> +++ b/src/backend/access/nbtree/nbtree.c
> @@ -161,7 +161,7 @@ btbuildempty(Relation index)
>       * internally. However, were this to be replaced with unbuffered_extend(),
>       * do_wal must be true to ensure the data is logged and fsync'd.
>       */
> -    unbuffered_prep(&wstate, true, false);
> +    unbuffered_prep(&wstate, true, true);
>  
>      /* Construct metapage. */
>      metapage = (Page) palloc(BLCKSZ);
> diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
> index d6d0d4b361..f1b9e2e24e 100644
> --- a/src/backend/access/nbtree/nbtsort.c
> +++ b/src/backend/access/nbtree/nbtsort.c
> @@ -1189,7 +1189,11 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
>      int64        tuples_done = 0;
>      bool        deduplicate;
>  
> -    unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, false);
> +    /*
> +     * The fsync optimization done by directmgr is only relevant if
> +     * WAL-logging, so pass btws_use_wal for this parameter.
> +     */
> +    unbuffered_prep(&wstate->ub_wstate, wstate->btws_use_wal, wstate->btws_use_wal);
>  
>      deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
>          BTGetDeduplicateItems(wstate->index);

Why just here?



> From 377c195bccf2dd2529e64d0d453104485f7662b7 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Fri, 4 Mar 2022 15:52:45 -0500
> Subject: [PATCH v6 4/4] Use shared buffers when possible for index build
> 
> When there are not too many tuples, building the index in shared buffers
> makes sense. It allows the buffer manager to handle how best to do the
> IO.
> ---

Perhaps it'd be worth making this an independent patch that could be applied
separately?


>  src/backend/access/nbtree/nbtree.c  |  32 ++--
>  src/backend/access/nbtree/nbtsort.c | 273 +++++++++++++++++++++-------
>  2 files changed, 223 insertions(+), 82 deletions(-)
> 
> diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
> index fc5cce4603..d3982b9388 100644
> --- a/src/backend/access/nbtree/nbtree.c
> +++ b/src/backend/access/nbtree/nbtree.c
> @@ -152,34 +152,24 @@ void
>  btbuildempty(Relation index)
>  {
>      Page        metapage;
> -    UnBufferedWriteState wstate;
> +    Buffer metabuf;
>  
>      /*
> -     * Though this is an unlogged relation, pass do_wal=true since the init
> -     * fork of an unlogged index must be wal-logged and fsync'd. This currently
> -     * has no effect, as unbuffered_write() does not do log_newpage()
> -     * internally. However, were this to be replaced with unbuffered_extend(),
> -     * do_wal must be true to ensure the data is logged and fsync'd.
> +     * Allocate a buffer for metapage and initialize metapage.
>       */
> -    unbuffered_prep(&wstate, true, true);
> -
> -    /* Construct metapage. */
> -    metapage = (Page) palloc(BLCKSZ);
> +    metabuf = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_ZERO_AND_LOCK,
> +            NULL);
> +    metapage = BufferGetPage(metabuf);
>      _bt_initmetapage(metapage, P_NONE, 0, _bt_allequalimage(index, false));
>  
>      /*
> -     * Write the page and log it.  It might seem that an immediate sync would
> -     * be sufficient to guarantee that the file exists on disk, but recovery
> -     * itself might remove it while replaying, for example, an
> -     * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we need
> -     * this even when wal_level=minimal.
> +     * Mark metapage buffer as dirty and XLOG it
>       */
> -    unbuffered_write(&wstate, RelationGetSmgr(index), INIT_FORKNUM,
> -            BTREE_METAPAGE, metapage);
> -    log_newpage(&RelationGetSmgr(index)->smgr_rnode.node, INIT_FORKNUM,
> -                BTREE_METAPAGE, metapage, true);
> -
> -    unbuffered_finish(&wstate, RelationGetSmgr(index), INIT_FORKNUM);
> +    START_CRIT_SECTION();
> +    MarkBufferDirty(metabuf);
> +    log_newpage_buffer(metabuf, true);
> +    END_CRIT_SECTION();
> +    _bt_relbuf(index, metabuf);
>  }

I don't understand why this patch changes btbuildempty()? That data is never
accessed again, so it doesn't really seem beneficial to shuffle it through
shared buffers, even if we benefit from using s_b for the main fork...



> +    /*
> +     * If not using shared buffers, for a WAL-logged relation, save the redo
> +     * pointer location in case a checkpoint begins during the index build.
> +     */
> +    if (wstate->_bt_bl_unbuffered_prep)
> +        wstate->_bt_bl_unbuffered_prep(&wstate->ub_wstate,
> +                wstate->btws_use_wal, wstate->btws_use_wal);

Code would probably look cleaner if an empty callback were used when no
_bt_bl_unbuffered_prep callback is needed.
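
E.g. a sketch along those lines (name invented):

    static void
    _bt_bl_unbuffered_prep_noop(UnBufferedWriteState *wstate,
                                bool do_wal, bool do_optimization)
    {
        /* no setup needed when the build goes through shared buffers */
    }

Then the call site could invoke wstate->_bt_bl_unbuffered_prep()
unconditionally.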



>  /*
> - * allocate workspace for a new, clean btree page, not linked to any siblings.
> + * Set up workspace for a new, clean btree page, not linked to any siblings.
> + * Caller must allocate the passed in page.

The more interesting bit seems to be whether the passed-in page needs to be
initialized?


> @@ -1154,20 +1285,24 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
>           * back one slot.  Then we can dump out the page.
>           */
>          _bt_slideleft(s->btps_page);
> -        _bt_blwritepage(wstate, s->btps_page, s->btps_blkno);
> +        wstate->_bt_blwritepage(wstate, s->btps_page, s->btps_blkno, s->btps_buf);
> +        s->btps_buf = InvalidBuffer;
>          s->btps_page = NULL;    /* writepage freed the workspace */
>      }

Do we really have to add underscores even to struct members? That just looks
wrong.


Greetings,

Andres Freund


