Re: In-place persistance change of a relation - Mailing list pgsql-hackers
From | Kyotaro Horiguchi |
---|---|
Subject | Re: In-place persistance change of a relation |
Date | |
Msg-id | 20201112.155537.905170546612920882.horikyota.ntt@gmail.com Whole thread Raw |
List | pgsql-hackers |
At Wed, 11 Nov 2020 09:56:44 -0500, Stephen Frost <sfrost@snowman.net> wrote in > Greetings, > > * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote: > > At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost <sfrost@snowman.net> wrote in > > > * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote: > > > > For fuel(?) of the discussion, I tried a very-quick PoC for in-place > > > > ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some > > > > trials of several ways, I drifted to the following way after poking > > > > several ways. > > > > > > > > 1. Flip BM_PERMANENT of active buffers > > > > 2. adding/removing init fork > > > > 3. sync files, > > > > 4. Flip pg_class.relpersistence. > > > > > > > > It always skips table copy in the SET UNLOGGED case, and only when > > > > wal_level=minimal in the SET LOGGED case. Crash recovery seems > > > > working by some brief testing by hand. > > > > > > Somehow missed that this patch more-or-less does what I was referring to > > > down-thread, but I did want to mention that it looks like it's missing a > > > necessary FlushRelationBuffers() call before the sync, otherwise there > > > could be dirty buffers for the relation that's being set to LOGGED (with > > > wal_level=minimal), which wouldn't be good. See the comments above > > > smgrimmedsync(). > > > > Right. Thanks. However, since SetRelFileNodeBuffersPersistence() > > called just above scans shared buffers so I don't want to just call > > FlushRelationBuffers() separately. Instead, I added buffer-flush to > > SetRelFileNodeBuffersPersistence(). > > Maybe I'm missing something, but it sure looks like in the patch that > SetRelFileNodeBuffersPersistence() is being called after the > smgrimmedsync() call, and I don't think you get to just switch the order > of those- the sync is telling the kernel to make sure it's written to > disk, while the FlushBuffer() is just writing it into the kernel but > doesn't provide any guarantee that the data has actually made it to > disk. We have to FlushBuffer() first, and then call smgrimmedsync(). > Perhaps there's a way to avoid having to go through shared buffers > twice, and I generally agreed it'd be good if we could avoid doing so, > but this approach doesn't look like it actually works. Yeah, sorry for the rare-baked version.. I was confused about the order at the time. The next version works like this: LOGGED->UNLOGGED <collect reloids to process> for each relations: <set buffer persistence to !BM_PERMANENT (wal-logged if walleve > minimal> <create init fork> if it is index call ambuildempty() (which syncs the init fork) else WAL-log smgr_create then sync the init file. <update catalog> ... commit time: <do nogthing> abort time: <unlink init fork> <revert buffer persistence> UNLOGGED->LOGGED <collect reloids to process> for each relations: <set buffer persistence to !BM_PERMANENT (wal-logged if walleve > minimal> <record drop-init-fork to pending-deletes> <sync storage files> <update catalog> ... commit time: <log smgrunlink> <smgrunlink init fork> abort time: <revert buffer persistence> > > FWIW this is a revised version of the PoC, which has some known > > problems. > > > > - Flipping of Buffer persistence is not WAL-logged nor even be able to > > be safely roll-backed. (It might be better to drop buffers). > > Not sure if it'd be better to drop buffers or not, but figuring out how > to deal with rollback seems pretty important. How is the persistence > change in the catalog not WAL-logged though..? Rollback works as the above. Buffer persistence change is registered in pending-deletes. Persistence change in catalog is rolled back in the ordinary way (or automatically). If wal_level > minimal, persistence change of buffers is propagated to standbys by WAL. However I'm not sure we need wal-logging otherwise, the next version emits WAL since SMGR_CREATE is always logged by existing code. > > - This version handles indexes but not yet handle toast relatins. > > Would need to be fixed, of course. Fixed. > > - tableAMs are supposed to support this feature. (but I'm not sure > > it's worth allowing them not to do so). > > Seems like they should. Init fork of index relations needs a call to ambuildempty() instead of "log_smgrcreate-smgrimmedsync" after smgrcreate. Instead of adding similar interface in indexAm, I reverted changes of tableam and make RelationCreate/DropInitFork() directly do that. That introduces new include of amapi.h to storage.c, which is a bit uneasy. The previous version give up the in-place persistence change in the case where wal_level > minimal and SET LOGGED since that needs WAL to be emitted. However, in-place change still has the advantage of not running a table copy. So the next verson always runs persistence change in-place. As suggested by Andres, I'll send a summary of this patch. The patch will be attached to the coming mail. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
pgsql-hackers by date: