Re: Relation bulk write facility - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Relation bulk write facility |
Date | |
Msg-id | cac7d1b6-8358-40be-af0b-21bc9b27d34c@iki.fi Whole thread Raw |
In response to | Re: Relation bulk write facility (Noah Misch <noah@leadboat.com>) |
Responses |
Re: Relation bulk write facility
|
List | pgsql-hackers |
On 02/07/2024 02:24, Noah Misch wrote: > On Tue, Jul 02, 2024 at 12:53:05AM +0300, Heikki Linnakangas wrote: >> On 01/07/2024 23:52, Noah Misch wrote: >>> Commit 8af2565 wrote: >>>> --- /dev/null >>>> +++ b/src/backend/storage/smgr/bulk_write.c >>> >>>> +/* >>>> + * Finish bulk write operation. >>>> + * >>>> + * This WAL-logs and flushes any remaining pending writes to disk, and fsyncs >>>> + * the relation if needed. >>>> + */ >>>> +void >>>> +smgr_bulk_finish(BulkWriteState *bulkstate) >>>> +{ >>>> + /* WAL-log and flush any remaining pages */ >>>> + smgr_bulk_flush(bulkstate); >>>> + >>>> + /* >>>> + * When we wrote out the pages, we passed skipFsync=true to avoid the >>>> + * overhead of registering all the writes with the checkpointer. Register >>>> + * the whole relation now. >>>> + * >>>> + * There is one hole in that idea: If a checkpoint occurred while we were >>>> + * writing the pages, it already missed fsyncing the pages we had written >>>> + * before the checkpoint started. A crash later on would replay the WAL >>>> + * starting from the checkpoint, therefore it wouldn't replay our earlier >>>> + * WAL records. So if a checkpoint started after the bulk write, fsync >>>> + * the files now. >>>> + */ >>>> + if (!SmgrIsTemp(bulkstate->smgr)) >>>> + { >>> >>> Shouldn't this be "if (bulkstate->use_wal)"? The GetRedoRecPtr()-based >>> decision is irrelevant to the !wal case. Either we don't need fsync at all >>> (TEMP or UNLOGGED) or smgrDoPendingSyncs() will do it (wal_level=minimal). >> >> The point of GetRedoRecPtr() is to detect if a checkpoint has started >> concurrently. It works for that purpose whether or not the bulk load is >> WAL-logged. It is not compared with the LSNs of WAL records written by the >> bulk load. > > I think the significance of start_RedoRecPtr is it preceding all records > needed to recreate the bulk write. If start_RedoRecPtr==GetRedoRecPtr() and > we crash after commit, we're indifferent to whether the rel gets synced at a > checkpoint before that crash or rebuilt from WAL after that crash. If > start_RedoRecPtr!=GetRedoRecPtr(), some WAL of the bulk write is already > deleted, so only smgrimmedsync() suffices. Overall, while it is not compared > with LSNs in WAL records, it's significant only to the extent that such a WAL > record exists. What am I missing? You're right. You pointed out below that we don't need to register or immediately fsync the relation if it was not WAL-logged, I missed that. In the alternative universe that we did need to fsync() even in !use_wal case, the point of the start_RedoRecPtr==GetRedoRecPtr() was to detect whether the last checkpoint "missed" fsyncing the files that we wrote. But the point is moot now. >> Unlogged tables do need to be fsync'd. The scenario is: >> >> 1. Bulk load an unlogged table. >> 2. Shut down Postgres cleanly >> 3. Pull power plug from server, and restart. >> >> We talked about this earlier in the "Unlogged relation copy is not fsync'd" >> thread [1]. I had already forgotten about that; that bug actually still >> exists in back branches, and we should fix it.. >> >> [1] https://www.postgresql.org/message-id/flat/65e94fc8-ce1d-dd02-3be3-fda0fe8f2965%40iki.fi > > Ah, that's right. I agree this code suffices for unlogged. As a further > optimization, it would be valid to ignore GetRedoRecPtr() for unlogged and > always call smgrregistersync(). (For any rel, smgrimmedsync() improves on > smgrregistersync() only if we fail to reach the shutdown checkpoint. Without > a shutdown checkpoint, unlogged rels get reset anyway.) > >>> I don't see any functional problem, but this likely arranges for an >>> unnecessary sync when a checkpoint starts between mdcreate() and >>> here. (The mdcreate() sync may also be unnecessary, but that's >>> longstanding.) >> Hmm, yes we might do two fsyncs() with wal_level=minimal, unnecessarily. It >> seems hard to eliminate the redundancy. smgr_bulk_finish() could skip the >> fsync, if it knew that smgrDoPendingSyncs() will do it later. However, >> smgrDoPendingSyncs() might also decide to WAL-log the relation instead of >> fsyncing it, and in that case we do still need the fsync. > > We do not need the fsync in the "WAL-log the relation instead" case; see > https://postgr.es/m/20230921062210.GA110358@rfd.leadboat.com Ah, true, I missed that log_newpage_range() loads the pages to the buffer cache and dirties them. That kinds of sucks actually, I wish it didn't need to dirty the buffers. > So maybe like this: > > if (use_wal) /* includes init forks */ > current logic; > else if (unlogged) > smgrregistersync; > /* else temp || (permanent && wal_level=minimal): nothing to do */ Makes sense, except that we cannot distinguish between unlogged relations and permanent relations with !use_wal here. It would be nice to have relpersistence flag in SMgrRelation. I remember wanting to have that before, although I don't remember what the context was exactly. >> Fortunately, fsync() on a file that's already flushed to disk is pretty >> cheap. > > Yep. I'm more concerned about future readers wondering why the function is > using LSNs to decide what to do about data that doesn't appear in WAL. A > comment could be another way to fix that, though. Agreed, this is all very subtle, and deserves a good comment. What do you think of the attached? -- Heikki Linnakangas Neon (https://neon.tech)
Attachment
pgsql-hackers by date: