Avoiding smgrimmedsync() during nbtree index builds - Mailing list pgsql-hackers

From Andres Freund
Subject Avoiding smgrimmedsync() during nbtree index builds
Date
Msg-id 20210121203656.tc7kqildbqnyihog@alap3.anarazel.de
Whole thread Raw
Responses Re: Avoiding smgrimmedsync() during nbtree index builds
List pgsql-hackers
Hi,

Every nbtree index build currently does an smgrimmedsync at the end:

/*
 * Read tuples in correct sort order from tuplesort, and load them into
 * btree leaves.
 */
static void
_bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
...
    /*
     * When we WAL-logged index pages, we must nonetheless fsync index files.
     * Since we're building outside shared buffers, a CHECKPOINT occurring
     * during the build has no way to flush the previously written data to
     * disk (indeed it won't know the index even exists).  A crash later on
     * would replay WAL from the checkpoint, therefore it wouldn't replay our
     * earlier WAL entries. If we do not fsync those pages here, they might
     * still not be on disk when the crash occurs.
     */
    if (wstate->btws_use_wal)
    {
        RelationOpenSmgr(wstate->index);
        smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
    }

In cases we create lots of small indexes, e.g. because of an initial
schema load, partition creation or something like that, that turns out
to be a major limiting factor (unless one turns fsync off).


One way to address that would be to put newly built indexes into s_b
(using a strategy, to avoid blowing out the whole cache), instead of
using smgrwrite() etc directly. But that's a discussion with a bit more
complex tradeoffs.


What I wonder is why the issue addressed in the comment I copied above
can't more efficiently be addressed using sync requests, like we do for
other writes?  It's possibly bit more complicated than just passing
skipFsync=false to smgrwrite/smgrextend, but it should be quite doable?


A quick hack (probably not quite correct!) to evaluate the benefit shows
that the attached script takes 2m17.223s with the smgrimmedsync and
0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
the former case, CPU bound in the latter.

Creating lots of tables with indexes (directly or indirectly through
relations having a toast table) is pretty common, particularly after the
introduction of partitioning.


Thinking through the correctness of replacing smgrimmedsync() with sync
requests, the potential problems that I can see are:

1) redo point falls between the log_newpage() and the
   write()/register_dirty_segment() in smgrextend/smgrwrite.
2) redo point falls between write() and register_dirty_segment()

But both of these are fine in the context of initially filling a newly
created relfilenode, as far as I can tell? Otherwise the current
smgrimmedsync() approach wouldn't be safe either, as far as I can tell?

Greetings,

Andres Freund

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Race condition in recovery?
Next
From: Alexey Kondratov
Date:
Subject: Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly