Re: [HACKERS] Unlogged tables cleanup - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: [HACKERS] Unlogged tables cleanup |
Date | |
Msg-id | 20190513173735.3znpqfkb3gasig6v@alap3.anarazel.de Whole thread Raw |
In response to | Re: [HACKERS] Unlogged tables cleanup (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: [HACKERS] Unlogged tables cleanup
|
List | pgsql-hackers |
On 2019-05-13 13:09:17 -0400, Robert Haas wrote: > On Mon, May 13, 2019 at 12:50 PM Andres Freund <andres@anarazel.de> wrote: > > > AFAICS ResetUnloggedRelations copies the init fork after replaying WAL, > > > so it would be sufficient to have the init fork be recovered from WAL > > > for that to work. However, we also do ResetUnloggedRelations *before* > > > replaying WAL in order to remove leftover not-init-fork files, and that > > > process requires that the init fork is present at that time. > > > > What scenario are you precisely wondering about? That > > ResetUnloggedRelations() could overwrite the main fork, while not yet > > having valid contents (due to the lack of smgrimmedsync())? Shouldn't > > that only be possible while still in an inconsistent state? A checkpoint > > would have serialized the correct contents, and we'd not reach HS > > consistency before having replayed that WAL records resetting the table > > and the init fork consistency? > > I think I see what Alvaro is talking about, or at least I think I see > *a* possible problem based on his remarks. > > Suppose we create an unlogged table and then crash. The main fork > makes it to disk, and the init fork does not. Before WAL replay, we > remove any main forks that have init forks, but because the init fork > was lost, that does not happen. Recovery recreates the init fork. > After WAL replay, we try to copy_file() each _init fork to the > corresponding main fork. That fails, because copy_file() expects to be > able to create the target file, and here it can't do that because it > already exists. Ugh, this is all such a mess. But, isn't this broken independently of the smgrimmedsync() issue? In a basebackup case, the basebackup could have included the main fork, but not the init fork, and the reverse. WAL replay *solely* needs to be able to recover from that. At the very least we'd have to do the cleanup step after becoming consistent, not just before recovery even started. > If that's the scenario, I'm not sure the smgrimmedsync() call is > sufficient. Suppose we log_smgrcreate() but then crash before > smgrimmedsync()... seems like we'd need to do them in the other order, > or else maybe just pass a flag to copy_file() telling it not to be so > picky. I think I mentioned this elsewhere recently, I think our smgr WAL logging makes this unneccessarily hard. We should just have the intial smgrcreate record (be it for the main fork, or just the init fork), have a flag that tells WAL replay to clean out everything pre-existing Hm, I'm also a bit confused as to how ResetUnloggedRelations() got to take its argument as a bitmask - given that we only pass in individual flags at the moment. Seems pretty clear to me that the call at the end of recovery would need to be UNLOGGED_RELATION_CLEANUP | UNLOGGED_RELATION_INIT, unless we make some bigger changes? Greetings, Andres Freund
pgsql-hackers by date: