Re: Avoiding smgrimmedsync() during nbtree index builds - Mailing list pgsql-hackers
From:           Heikki Linnakangas
Subject:        Re: Avoiding smgrimmedsync() during nbtree index builds
Msg-id:         812dbe94-5547-830b-714e-ad26906fb6f0@iki.fi
In response to: Avoiding smgrimmedsync() during nbtree index builds (Andres Freund <andres@anarazel.de>)
Responses:      Re: Avoiding smgrimmedsync() during nbtree index builds
List:           pgsql-hackers
On 21/01/2021 22:36, Andres Freund wrote:
> Hi,
>
> Every nbtree index build currently does an smgrimmedsync at the end:
>
> /*
>  * Read tuples in correct sort order from tuplesort, and load them into
>  * btree leaves.
>  */
> static void
> _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
> ...
> 	/*
> 	 * When we WAL-logged index pages, we must nonetheless fsync index files.
> 	 * Since we're building outside shared buffers, a CHECKPOINT occurring
> 	 * during the build has no way to flush the previously written data to
> 	 * disk (indeed it won't know the index even exists). A crash later on
> 	 * would replay WAL from the checkpoint, therefore it wouldn't replay our
> 	 * earlier WAL entries. If we do not fsync those pages here, they might
> 	 * still not be on disk when the crash occurs.
> 	 */
> 	if (wstate->btws_use_wal)
> 	{
> 		RelationOpenSmgr(wstate->index);
> 		smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
> 	}
>
> In cases we create lots of small indexes, e.g. because of an initial
> schema load, partition creation or something like that, that turns out
> to be a major limiting factor (unless one turns fsync off).
>
>
> One way to address that would be to put newly built indexes into s_b
> (using a strategy, to avoid blowing out the whole cache), instead of
> using smgrwrite() etc directly. But that's a discussion with a bit more
> complex tradeoffs.
>
>
> What I wonder is why the issue addressed in the comment I copied above
> can't more efficiently be addressed using sync requests, like we do for
> other writes? It's possibly bit more complicated than just passing
> skipFsync=false to smgrwrite/smgrextend, but it should be quite doable?

Makes sense.

> A quick hack (probably not quite correct!) to evaluate the benefit shows
> that the attached script takes 2m17.223s with the smgrimmedsync and
> 0m22.870s passing skipFsync=false to write/extend. Entirely IO bound in
> the former case, CPU bound in the latter.
>
> Creating lots of tables with indexes (directly or indirectly through
> relations having a toast table) is pretty common, particularly after the
> introduction of partitioning.
>
>
> Thinking through the correctness of replacing smgrimmedsync() with sync
> requests, the potential problems that I can see are:
>
> 1) redo point falls between the log_newpage() and the
>    write()/register_dirty_segment() in smgrextend/smgrwrite.
> 2) redo point falls between write() and register_dirty_segment()
>
> But both of these are fine in the context of initially filling a newly
> created relfilenode, as far as I can tell? Otherwise the current
> smgrimmedsync() approach wouldn't be safe either, as far as I can tell?

Hmm. If the redo point falls between write() and the
register_dirty_segment(), and the checkpointer finishes the whole
checkpoint before register_dirty_segment(), you are not safe. That can't
happen with write from the buffer manager, because the checkpointer
would block waiting for the flush of the buffer to finish.

- Heikki
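To make the mechanics of the proposal concrete, here is a rough, untested
sketch of the kind of change being discussed, loosely following the shape of
_bt_blwritepage(); it only illustrates the idea of passing skipFsync=false,
it is not the actual patch:

	/*
	 * Sketch: write the page with skipFsync = false, so that md.c's
	 * register_dirty_segment() queues an fsync request with the
	 * checkpointer, instead of fsyncing the whole file ourselves via
	 * smgrimmedsync() at the end of _bt_load().
	 */
	RelationOpenSmgr(wstate->index);
	if (blkno == wstate->btws_pages_written)
	{
		/* extending the relation; a sync request is registered for us */
		smgrextend(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
				   (char *) page, false);
		wstate->btws_pages_written++;
	}
	else
	{
		/* overwriting a block that was written (zero-filled) earlier */
		smgrwrite(wstate->index->rd_smgr, MAIN_FORKNUM, blkno,
				  (char *) page, false);
	}

	/*
	 * With that, the smgrimmedsync() quoted above could be dropped from
	 * _bt_load().  The open question is the window between write() and
	 * register_dirty_segment(): if a checkpoint's redo point falls in
	 * that window and the checkpoint completes before the request is
	 * registered, the page is covered neither by the checkpoint nor by
	 * a queued fsync request.
	 */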