At Wed, 11 Nov 2020 09:56:44 -0500, Stephen Frost <sfrost@snowman.net> wrote in
> Greetings,
>
> * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
> > At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost <sfrost@snowman.net> wrote in
> > > * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
> > > > For fuel(?) of the discussion, I tried a very-quick PoC for in-place
> > > > ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
> > > > trials of several ways, I drifted to the following way after poking
> > > > several ways.
> > > >
> > > > 1. Flip BM_PERMANENT of active buffers
> > > > 2. adding/removing init fork
> > > > 3. sync files,
> > > > 4. Flip pg_class.relpersistence.
> > > >
> > > > It always skips table copy in the SET UNLOGGED case, and only when
> > > > wal_level=minimal in the SET LOGGED case. Crash recovery seems
> > > > working by some brief testing by hand.
> > >
> > > Somehow missed that this patch more-or-less does what I was referring to
> > > down-thread, but I did want to mention that it looks like it's missing a
> > > necessary FlushRelationBuffers() call before the sync, otherwise there
> > > could be dirty buffers for the relation that's being set to LOGGED (with
> > > wal_level=minimal), which wouldn't be good. See the comments above
> > > smgrimmedsync().
> >
> > Right. Thanks. However, since SetRelFileNodeBuffersPersistence()
> > called just above scans shared buffers so I don't want to just call
> > FlushRelationBuffers() separately. Instead, I added buffer-flush to
> > SetRelFileNodeBuffersPersistence().
>
> Maybe I'm missing something, but it sure looks like in the patch that
> SetRelFileNodeBuffersPersistence() is being called after the
> smgrimmedsync() call, and I don't think you get to just switch the order
> of those- the sync is telling the kernel to make sure it's written to
> disk, while the FlushBuffer() is just writing it into the kernel but
> doesn't provide any guarantee that the data has actually made it to
> disk. We have to FlushBuffer() first, and then call smgrimmedsync().
> Perhaps there's a way to avoid having to go through shared buffers
> twice, and I generally agreed it'd be good if we could avoid doing so,
> but this approach doesn't look like it actually works.
Yeah, sorry for the rare-baked version.. I was confused about the
order at the time. The next version works like this:
LOGGED->UNLOGGED
<collect reloids to process>
for each relations:
<set buffer persistence to !BM_PERMANENT (wal-logged if walleve > minimal>
<create init fork>
if it is index call ambuildempty() (which syncs the init fork)
else WAL-log smgr_create then sync the init file.
<update catalog>
...
commit time:
<do nogthing>
abort time:
<unlink init fork>
<revert buffer persistence>
UNLOGGED->LOGGED
<collect reloids to process>
for each relations:
<set buffer persistence to !BM_PERMANENT (wal-logged if walleve > minimal>
<record drop-init-fork to pending-deletes>
<sync storage files>
<update catalog>
...
commit time:
<log smgrunlink>
<smgrunlink init fork>
abort time:
<revert buffer persistence>
> > FWIW this is a revised version of the PoC, which has some known
> > problems.
> >
> > - Flipping of Buffer persistence is not WAL-logged nor even be able to
> > be safely roll-backed. (It might be better to drop buffers).
>
> Not sure if it'd be better to drop buffers or not, but figuring out how
> to deal with rollback seems pretty important. How is the persistence
> change in the catalog not WAL-logged though..?
Rollback works as the above. Buffer persistence change is registered
in pending-deletes. Persistence change in catalog is rolled back in
the ordinary way (or automatically).
If wal_level > minimal, persistence change of buffers is propagated to
standbys by WAL. However I'm not sure we need wal-logging otherwise,
the next version emits WAL since SMGR_CREATE is always logged by
existing code.
> > - This version handles indexes but not yet handle toast relatins.
>
> Would need to be fixed, of course.
Fixed.
> > - tableAMs are supposed to support this feature. (but I'm not sure
> > it's worth allowing them not to do so).
>
> Seems like they should.
Init fork of index relations needs a call to ambuildempty() instead of
"log_smgrcreate-smgrimmedsync" after smgrcreate. Instead of adding
similar interface in indexAm, I reverted changes of tableam and make
RelationCreate/DropInitFork() directly do that. That introduces new
include of amapi.h to storage.c, which is a bit uneasy.
The previous version give up the in-place persistence change in the
case where wal_level > minimal and SET LOGGED since that needs WAL to
be emitted. However, in-place change still has the advantage of not
running a table copy. So the next verson always runs persistence
change in-place.
As suggested by Andres, I'll send a summary of this patch. The patch
will be attached to the coming mail.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center