Re: Unlogged tables cleanup - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Unlogged tables cleanup
Date
Msg-id CAB7nPqSstNc=ZgMW-mNj1ZpHCwFOZm2sMfDptJ+tC0+ZGtQhMw@mail.gmail.com
Whole thread Raw
In response to Re: Unlogged tables cleanup  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Unlogged tables cleanup  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Thu, Nov 17, 2016 at 7:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah, but surely it's obvious that if you don't WAL log it, it's not
> going to work for archiving or streaming.  It's a lot less obvious why
> you have to WAL log it when you're not doing either of those things if
> you've already written it and fsync'd it locally.  How is it going to
> disappear if it's been fsync'd, one wonders?

That's not obvious that replaying a WAL record for a database creation
or tablespace creation would cause that for sure! I didn't know that
redo was wiping them out with rmtree() at replay before looking at
this bug. One more thing to recall when fixing an issue in this area
in the future.

>> More seriously, if there could be more details regarding that, I would
>> think that we could say something like "logging the init fork is
>> mandatory in any case to ensure its on-disk presence when at recovery
>> replay, even on non-default tablespaces whose base location are
>> deleted and re-created from scratch if the WAL record in charge of
>> creating this tablespace gets replayed". The problem shows up because
>> of tablespaces being deleted at replay at the end... So perhaps this
>> makes sense. What do you think?
>
> Yes, that's about what I think we need to explain.

OK, what do you think about the attached then?

I came up with something like that for those code paths:
-   /* Write the page.  If archiving/streaming, XLOG it. */
+   /*
+    * Write the page and log it unconditionally.  This is important
+    * particularly for indexes created on tablespaces and databases
+    * whose creation happened after the last redo pointer as recovery
+    * removes any of their existing content when the corresponding
+    * create records are replayed.
+    */
I have not been able to use the word "create" less than that. The
patch is really repetitive, but I think that we had better mention the
need of logging the content in each code path and not touch
log_newpage() to keep it a maximum flexible.

> Actually, I'm
> wondering if we ought to just switch this over to using the delayChkpt
> mechanism instead of going through this complicated song-and-dance.
> Incurring an immediate sync to avoid having to WAL-log this is a bit
> tenuous, but having to WAL-log it AND do the immediate sync just seems
> silly, and I'm actually a bit worried that even with your fix there
> might still be a latent bug somewhere.  We couldn't switch mechanisms
> cleanly in the 9.2 branch (cf.
> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
> back-patch it in the form you have it in already, but it's probably
> worth switching over at least in master.

Thinking through it... Could we not just rip off the sync phase
because there is delayChkpt? It seems to me that what counts is that
the commit of the transaction that creates the relation does not get
past the redo point. It would matter for read uncommitted transactions
but that concept does not exist in PG, and never will. On
back-branches it is definitely safer to keep the sync, I am just
thinking about a HEAD-only optimization here as you do.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Patch: Implement failover on libpq connect level.
Next
From: David Steele
Date:
Subject: Re: Fix checkpoint skip logic on idle systems by tracking LSN progress