Re: Avoiding smgrimmedsync() during nbtree index builds - Mailing list pgsql-hackers

From: Heikki Linnakangas
Subject: Re: Avoiding smgrimmedsync() during nbtree index builds
Date:
Msg-id: 812dbe94-5547-830b-714e-ad26906fb6f0@iki.fi
In response to: Avoiding smgrimmedsync() during nbtree index builds (Andres Freund <andres@anarazel.de>)
Responses: Re: Avoiding smgrimmedsync() during nbtree index builds
List: pgsql-hackers
On 21/01/2021 22:36, Andres Freund wrote:
> Hi,
> 
> Every nbtree index build currently does an smgrimmedsync at the end:
> 
> /*
>   * Read tuples in correct sort order from tuplesort, and load them into
>   * btree leaves.
>   */
> static void
> _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
> ...
>     /*
>      * When we WAL-logged index pages, we must nonetheless fsync index files.
>      * Since we're building outside shared buffers, a CHECKPOINT occurring
>      * during the build has no way to flush the previously written data to
>      * disk (indeed it won't know the index even exists).  A crash later on
>      * would replay WAL from the checkpoint, therefore it wouldn't replay our
>      * earlier WAL entries. If we do not fsync those pages here, they might
>      * still not be on disk when the crash occurs.
>      */
>     if (wstate->btws_use_wal)
>     {
>         RelationOpenSmgr(wstate->index);
>         smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
>     }
> 
> In cases where we create lots of small indexes, e.g. because of an initial
> schema load, partition creation or something like that, this turns out
> to be a major limiting factor (unless one turns fsync off).
> 
> 
> One way to address that would be to put newly built indexes into s_b
> (using a strategy, to avoid blowing out the whole cache), instead of
> using smgrwrite() etc directly. But that's a discussion with a bit more
> complex tradeoffs.
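
For reference, "using a strategy" would presumably mean going through the
regular buffer manager with a bulk-write BufferAccessStrategy, roughly along
these lines (an untested sketch; the hypothetical _bt_fill_page() stands in
for whatever fills each new page):

    BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKWRITE);
    Buffer      buf;
    Page        page;

    /* allocate a new block through shared buffers, ring-buffered */
    buf = ReadBufferExtended(wstate->index, MAIN_FORKNUM, P_NEW,
                             RBM_NORMAL, strategy);
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    page = BufferGetPage(buf);

    _bt_fill_page(page);        /* hypothetical: load the sorted tuples */

    START_CRIT_SECTION();
    MarkBufferDirty(buf);
    if (wstate->btws_use_wal)
        log_newpage_buffer(buf, true);
    END_CRIT_SECTION();

    UnlockReleaseBuffer(buf);
    FreeAccessStrategy(strategy);
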
> 
> 
> What I wonder is why the issue addressed in the comment I copied above
> can't more efficiently be addressed using sync requests, like we do for
> other writes?  It's possibly a bit more complicated than just passing
> skipFsync=false to smgrwrite/smgrextend, but it should be quite doable?

Makes sense.
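
For concreteness, the change being discussed would presumably look roughly
like this in nbtsort.c's _bt_blwritepage() (an untested sketch; the
zero-page padding loop and error handling are omitted), together with
removing the trailing smgrimmedsync() in _bt_load():

    static void
    _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
    {
        /* ensure rd_smgr is open (could have been closed by relcache flush) */
        RelationOpenSmgr(wstate->index);

        /* WAL-log the page as before */
        if (wstate->btws_use_wal)
            log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);

        if (blkno >= wstate->btws_pages_written)
        {
            /*
             * Extending the relation: pass skipFsync=false so that md.c
             * queues a sync request (register_dirty_segment) for the
             * checkpointer, instead of relying on a final smgrimmedsync().
             */
            smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
                       (char *) page, false);
            wstate->btws_pages_written++;
        }
        else
        {
            /* overwriting a previously-written page: likewise register a sync request */
            smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
                      (char *) page, false);
        }

        pfree(page);
    }

With skipFsync=false, mdextend()/mdwrite() call register_dirty_segment(),
which forwards the fsync request to the checkpointer (or performs the fsync
itself if the request queue is full), so the sync work moves out of the
index build.
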

> A quick hack (probably not quite correct!) to evaluate the benefit shows
> that the attached script takes 2m17.223s with the smgrimmedsync and
> 0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
> the former case, CPU bound in the latter.
> 
> Creating lots of tables with indexes (directly or indirectly through
> relations having a toast table) is pretty common, particularly after the
> introduction of partitioning.
> 
> 
> Thinking through the correctness of replacing smgrimmedsync() with sync
> requests, the potential problems that I can see are:
> 
> 1) redo point falls between the log_newpage() and the
>     write()/register_dirty_segment() in smgrextend/smgrwrite.
> 2) redo point falls between write() and register_dirty_segment()
> 
> But both of these are fine in the context of initially filling a newly
> created relfilenode, as far as I can tell? Otherwise the current
> smgrimmedsync() approach wouldn't be safe either?

Hmm. If the redo point falls between the write() and the
register_dirty_segment(), and the checkpointer finishes the whole
checkpoint before register_dirty_segment() runs, you are not safe. That
cannot happen with writes that go through the buffer manager, because the
checkpointer would block waiting for the flush of the buffer to finish.
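
To spell out where that window sits in the md.c write path (a simplified
rendering, not the literal source):

    void
    mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
            char *buffer, bool skipFsync)
    {
        MdfdVec    *v = _mdfd_getseg(reln, forknum, blocknum, skipFsync,
                                     EXTENSION_FAIL);
        off_t       seekpos = (off_t) BLCKSZ *
                              (blocknum % ((BlockNumber) RELSEG_SIZE));

        /* write the block into the segment file */
        FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos,
                  WAIT_EVENT_DATA_FILE_WRITE);

        /*
         * If the redo point is chosen here, and the checkpointer then
         * absorbs its (still empty) sync request queue and completes the
         * whole checkpoint before the call below runs, the request is only
         * seen by the *next* checkpoint; a crash before then replays WAL
         * only from the new redo point, so the page may never reach disk.
         * With writes that go through the buffer manager this cannot
         * happen, because the checkpointer blocks waiting for the
         * in-progress flush of the buffer to finish.
         */

        if (!skipFsync && !SmgrIsTemp(reln))
            register_dirty_segment(reln, forknum, v);
    }
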

- Heikki


