Re: In-place persistance change of a relation - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: In-place persistance change of a relation
Date
Msg-id 20201112.155537.905170546612920882.horikyota.ntt@gmail.com
Whole thread Raw
List pgsql-hackers
At Wed, 11 Nov 2020 09:56:44 -0500, Stephen Frost <sfrost@snowman.net> wrote in 
> Greetings,
> 
> * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
> > At Tue, 10 Nov 2020 09:33:12 -0500, Stephen Frost <sfrost@snowman.net> wrote in 
> > > * Kyotaro Horiguchi (horikyota.ntt@gmail.com) wrote:
> > > > For fuel(?) of the discussion, I tried a very-quick PoC for in-place
> > > > ALTER TABLE SET LOGGED/UNLOGGED and resulted as attached. After some
> > > > trials of several ways, I drifted to the following way after poking
> > > > several ways.
> > > > 
> > > > 1. Flip BM_PERMANENT of active buffers
> > > > 2. adding/removing init fork
> > > > 3. sync files,
> > > > 4. Flip pg_class.relpersistence.
> > > > 
> > > > It always skips table copy in the SET UNLOGGED case, and only when
> > > > wal_level=minimal in the SET LOGGED case.  Crash recovery seems
> > > > working by some brief testing by hand.
> > > 
> > > Somehow missed that this patch more-or-less does what I was referring to
> > > down-thread, but I did want to mention that it looks like it's missing a
> > > necessary FlushRelationBuffers() call before the sync, otherwise there
> > > could be dirty buffers for the relation that's being set to LOGGED (with
> > > wal_level=minimal), which wouldn't be good.  See the comments above
> > > smgrimmedsync().
> > 
> > Right. Thanks.  However, since SetRelFileNodeBuffersPersistence()
> > called just above scans shared buffers so I don't want to just call
> > FlushRelationBuffers() separately.  Instead, I added buffer-flush to
> > SetRelFileNodeBuffersPersistence().
> 
> Maybe I'm missing something, but it sure looks like in the patch that
> SetRelFileNodeBuffersPersistence() is being called after the
> smgrimmedsync() call, and I don't think you get to just switch the order
> of those- the sync is telling the kernel to make sure it's written to
> disk, while the FlushBuffer() is just writing it into the kernel but
> doesn't provide any guarantee that the data has actually made it to
> disk.  We have to FlushBuffer() first, and then call smgrimmedsync().
> Perhaps there's a way to avoid having to go through shared buffers
> twice, and I generally agreed it'd be good if we could avoid doing so,
> but this approach doesn't look like it actually works.

Yeah, sorry for the rare-baked version.. I was confused about the
order at the time.  The next version works like this:

LOGGED->UNLOGGED
<collect reloids to process>

for each relations:
  <set buffer persistence to !BM_PERMANENT (wal-logged if walleve > minimal>
  <create init fork>
  if it is index call ambuildempty() (which syncs the init fork)
  else WAL-log smgr_create then sync the init file.
  <update catalog>
...
commit time:
  <do nogthing>
abort time:
  <unlink init fork>
  <revert buffer persistence>

UNLOGGED->LOGGED
<collect reloids to process>

for each relations:
  <set buffer persistence to !BM_PERMANENT (wal-logged if walleve > minimal>
  <record drop-init-fork to pending-deletes>
  <sync storage files>
  <update catalog>
...
commit time:
  <log smgrunlink>
  <smgrunlink init fork>
abort time:
  <revert buffer persistence>


> > FWIW this is a revised version of the PoC, which has some known
> > problems.
> > 
> > - Flipping of Buffer persistence is not WAL-logged nor even be able to
> >   be safely roll-backed. (It might be better to drop buffers).
> 
> Not sure if it'd be better to drop buffers or not, but figuring out how
> to deal with rollback seems pretty important.  How is the persistence
> change in the catalog not WAL-logged though..?

Rollback works as the above. Buffer persistence change is registered
in pending-deletes.  Persistence change in catalog is rolled back in
the ordinary way (or automatically).

If wal_level > minimal, persistence change of buffers is propagated to
standbys by WAL.  However I'm not sure we need wal-logging otherwise,
the next version emits WAL since SMGR_CREATE is always logged by
existing code.

> > - This version handles indexes but not yet handle toast relatins.
> 
> Would need to be fixed, of course.

Fixed.

> > - tableAMs are supposed to support this feature. (but I'm not sure
> >   it's worth allowing them not to do so).
> 
> Seems like they should.

Init fork of index relations needs a call to ambuildempty() instead of
"log_smgrcreate-smgrimmedsync" after smgrcreate. Instead of adding
similar interface in indexAm, I reverted changes of tableam and make
RelationCreate/DropInitFork() directly do that.  That introduces new
include of amapi.h to storage.c, which is a bit uneasy.

The previous version give up the in-place persistence change in the
case where wal_level > minimal and SET LOGGED since that needs WAL to
be emitted.  However, in-place change still has the advantage of not
running a table copy. So the next verson always runs persistence
change in-place.

As suggested by Andres, I'll send a summary of this patch. The patch
will be attached to the coming mail.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: Kyotaro Horiguchi
Date:
Subject: Re: In-placre persistance change of a relation
Next
From: Peter Smith
Date:
Subject: Re: [HACKERS] logical decoding of two-phase transactions