Re: [HACKERS] Unlogged tables cleanup - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [HACKERS] Unlogged tables cleanup
Date
Msg-id 20190513173735.3znpqfkb3gasig6v@alap3.anarazel.de
Whole thread Raw
In response to Re: [HACKERS] Unlogged tables cleanup  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Unlogged tables cleanup
List pgsql-hackers
On 2019-05-13 13:09:17 -0400, Robert Haas wrote:
> On Mon, May 13, 2019 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:
> > > AFAICS ResetUnloggedRelations copies the init fork after replaying WAL,
> > > so it would be sufficient to have the init fork be recovered from WAL
> > > for that to work.  However, we also do ResetUnloggedRelations *before*
> > > replaying WAL in order to remove leftover not-init-fork files, and that
> > > process requires that the init fork is present at that time.
> >
> > What scenario are you precisely wondering about? That
> > ResetUnloggedRelations() could overwrite the main fork, while not yet
> > having valid contents (due to the lack of smgrimmedsync())? Shouldn't
> > that only be possible while still in an inconsistent state? A checkpoint
> > would have serialized the correct contents, and we'd not reach HS
> > consistency before having replayed that WAL records resetting the table
> > and the init fork consistency?
> 
> I think I see what Alvaro is talking about, or at least I think I see
> *a* possible problem based on his remarks.
> 
> Suppose we create an unlogged table and then crash. The main fork
> makes it to disk, and the init fork does not.  Before WAL replay, we
> remove any main forks that have init forks, but because the init fork
> was lost, that does not happen.  Recovery recreates the init fork.
> After WAL replay, we try to copy_file() each _init fork to the
> corresponding main fork. That fails, because copy_file() expects to be
> able to create the target file, and here it can't do that because it
> already exists.

Ugh, this is all such a mess. But, isn't this broken independently of
the smgrimmedsync() issue? In a basebackup case, the basebackup could
have included the main fork, but not the init fork, and the reverse. WAL
replay *solely* needs to be able to recover from that.  At the very
least we'd have to do the cleanup step after becoming consistent, not
just before recovery even started.


> If that's the scenario, I'm not sure the smgrimmedsync() call is
> sufficient.  Suppose we log_smgrcreate() but then crash before
> smgrimmedsync()... seems like we'd need to do them in the other order,
> or else maybe just pass a flag to copy_file() telling it not to be so
> picky.

I think I mentioned this elsewhere recently, I think our smgr WAL
logging makes this unneccessarily hard. We should just have the intial
smgrcreate record (be it for the main fork, or just the init fork), have
a flag that tells WAL replay to clean out everything pre-existing


Hm, I'm also a bit confused as to how ResetUnloggedRelations() got to
take its argument as a bitmask - given that we only pass in individual
flags at the moment.  Seems pretty clear to me that the call at the end
of recovery would need to be UNLOGGED_RELATION_CLEANUP |
UNLOGGED_RELATION_INIT, unless we make some bigger changes?

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: [HACKERS] Unlogged tables cleanup
Next
From: Andres Freund
Date:
Subject: Re: [HACKERS] Unlogged tables cleanup