Re: Avoiding smgrimmedsync() during nbtree index builds - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Avoiding smgrimmedsync() during nbtree index builds
Date
Msg-id 20220123215541.xxv54akpdcjy5klb@alap3.anarazel.de
Whole thread Raw
In response to Re: Avoiding smgrimmedsync() during nbtree index builds  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: Avoiding smgrimmedsync() during nbtree index builds
List pgsql-hackers
Hi,

On 2022-01-11 12:10:54 -0500, Melanie Plageman wrote:
> On Mon, Jan 10, 2022 at 5:50 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > I have attached a v3 which includes two commits -- one of which
> > implements the directmgr API and uses it and the other which adds
> > functionality to use either directmgr or bufmgr API during index build.
> >
> > Also registering for march commitfest.
> 
> Forgot directmgr.h. Attached v4

Are you looking at the failures Justin pointed out? Something isn't quite
right yet. See https://postgr.es/m/20220116202559.GW14051%40telsasoft.com and
the subsequent mail about it also triggering on once on linux.


> Thus, the backend must ensure that
> either the Redo pointer has not moved or that the data is fsync'd before
> freeing the page.

"freeing"?


> This is not a problem with pages written in shared buffers because the
> checkpointer will block until all buffers that were dirtied before it
> began finish before it moves the Redo pointer past their associated WAL
> entries.

> This commit makes two main changes:
> 
> 1) It wraps smgrextend() and smgrwrite() in functions from a new API
>    for writing data outside of shared buffers, directmgr.
> 
> 2) It saves the XLOG Redo pointer location before doing the write or
>    extend. It also adds an fsync request for the page to the
>    checkpointer's pending-ops table. Then, after doing the write or
>    extend, if the Redo pointer has moved (meaning a checkpoint has
>    started since it saved it last), then the backend fsync's the page
>    itself. Otherwise, it lets the checkpointer take care of fsync'ing
>    the page the next time it processes the pending-ops table.

Why combine those two into one commit?


> @@ -654,9 +657,8 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
>      /* Now extend the file */
>      while (vm_nblocks_now < vm_nblocks)
>      {
> -        PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
> -
> -        smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
> +        // TODO: aren't these pages empty? why checksum them
> +        unbuffered_extend(&ub_wstate, VISIBILITYMAP_FORKNUM, vm_nblocks_now, (Page) pg.data, false);

Yea, it's a bit odd. PageSetChecksumInplace() will just return immediately:

    /* If we don't need a checksum, just return */
    if (PageIsNew(page) || !DataChecksumsEnabled())
        return;

OTOH, it seems easier to have it there than to later forget it, when
e.g. adding some actual initial content to the pages during the smgrextend().



> @@ -560,6 +562,8 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
>  
>      wstate.heap = btspool->heap;
>      wstate.index = btspool->index;
> +    wstate.ub_wstate.smgr_rel = RelationGetSmgr(btspool->index);
> +    wstate.ub_wstate.redo = InvalidXLogRecPtr;
>      wstate.inskey = _bt_mkscankey(wstate.index, NULL);
>      /* _bt_mkscankey() won't set allequalimage without metapage */
>      wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
> @@ -656,31 +660,19 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
>          if (!wstate->btws_zeropage)
>              wstate->btws_zeropage = (Page) palloc0(BLCKSZ);
>          /* don't set checksum for all-zero page */
> -        smgrextend(RelationGetSmgr(wstate->index), MAIN_FORKNUM,
> -                   wstate->btws_pages_written++,
> -                   (char *) wstate->btws_zeropage,
> -                   true);
> +        unbuffered_extend(&wstate->ub_wstate, MAIN_FORKNUM, wstate->btws_pages_written++, wstate->btws_zeropage,
true);
>      }

There's a bunch of long lines in here...


> -    /*
> -     * When we WAL-logged index pages, we must nonetheless fsync index files.
> -     * Since we're building outside shared buffers, a CHECKPOINT occurring
> -     * during the build has no way to flush the previously written data to
> -     * disk (indeed it won't know the index even exists).  A crash later on
> -     * would replay WAL from the checkpoint, therefore it wouldn't replay our
> -     * earlier WAL entries. If we do not fsync those pages here, they might
> -     * still not be on disk when the crash occurs.
> -     */
>      if (wstate->btws_use_wal)
> -        smgrimmedsync(RelationGetSmgr(wstate->index), MAIN_FORKNUM);
> +        unbuffered_finish(&wstate->ub_wstate, MAIN_FORKNUM);
>  }

The API of unbuffered_finish() only sometimes getting called, but
unbuffered_prep() being unconditional, strikes me as prone to bugs. Perhaps
it'd make sense to pass in whether the relation needs to be synced or not instead?



>  spgbuildempty(Relation index)
>  {
>      Page        page;
> +    UnBufferedWriteState wstate;
> +
> +    wstate.smgr_rel = RelationGetSmgr(index);
> +    unbuffered_prep(&wstate);

I don't think that's actually safe, and one of the instances could be the
cause cause of the bug CI is seeing:

 * Note: since a relcache flush can cause the file handle to be closed again,
 * it's unwise to hold onto the pointer returned by this function for any
 * long period.  Recommended practice is to just re-execute RelationGetSmgr
 * each time you need to access the SMgrRelation.  It's quite cheap in
 * comparison to whatever an smgr function is going to do.
 */
static inline SMgrRelation
RelationGetSmgr(Relation rel)


Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: row filtering for logical replication
Next
From: Andres Freund
Date:
Subject: Re: pg_basebackup fsyncs some files despite --no-sync (was: Adding CI to our tree)