Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Date
Msg-id CA+TgmobVgbceRgw0pi0R1Xp=mhK-pLTuKGz-G1cqNxhRa-XWQw@mail.gmail.com
Whole thread Raw
In response to Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints  (Andres Freund <andres@anarazel.de>)
Responses Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints  (Andres Freund <andres@anarazel.de>)
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints  (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers
On Thu, Feb 17, 2022 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:
> Could you or Dilip outline how it now works, and what exactly makes it safe
> etc (e.g. around locking, invalidation processing, snapshots, xid horizons)?
>
> I just scrolled through the patchset without finding such an explanation, so
> it's a bit hard to judge.

That's a good question and it's making me think about a few things I
hadn't considered before.

Dilip can add more here, but my impression is that most problems are
prevented by CREATE DATABASE, with or without this patch, starts by
acquiring a ShareLock on the database, preventing new connections, and
then making sure there are no existing connections. That means nothing
in the target database can be changing, which I think makes a lot of
the stuff on your list a non-issue. Any problems that remain have to
be the result of something that CREATE DATABASE does having a bad
interaction either with something that is completed beforehand or
something that begins afterward. There just can't be overlap, and I
think that rules out most problems.

Now you pointed out earlier one problem that it doesn't fix: unlike
the current method, this method involves reading buffers from the
template database into shared_buffers, and those buffers, once read,
stick around even after the operation finishes. That's not an
intrinsic problem, though, because a connection to the database could
do the same thing. However, again as you pointed out, it is a problem,
though, if we do it with less locking than a real database connection
would have done. It seems to me that if there are other problems here,
they have to be basically of the same sort: they have to leave the
system in some state which is otherwise impossible. Do you see some
other kind of hazard, or more examples of how that could happen? I
think the leftover buffers in shared_buffers have to be basically the
only thing, because apart from that, how is this any different than a
file copy?

The only other kind of hazard I can think of is: could it be unsafe to
try to interpret the contents of a database to which no one else is
connected at present due to any of the issues you mention? But what's
the hazard exactly? It can't be a problem if we've failed to process
sinval messages for the target database, because we're not using any
caches anyway. We can't. It can't be unsafe to test visibility of XIDs
for that database, because in an alternate universe some backend could
have connected to that database and seen the same XIDs. One thing we
COULD be doing wrong is using the wrong snapshot to test the
visibility of XIDs. The patch uses GetActiveSnapshot(), and I'm
thinking that is probably wrong. Shouldn't it be GetLatestSnapshot()?
And do we need to worry about snapshots being database-specific? Maybe
so.

> > But if it is the latter then there's really no point to that kind of cleanup
> > work and we should probably just give up now.
>
> This thread is long. Could you summarize what lead you to consider other
> approaches (e.g. looking in the filesystem for relfilenodes) as not feasible /
> too ugly / ...?

I don't think it's infeasible to look at the filesystem for files and
just copy whatever files we find. It's a plausible alternate design. I
just don't like it as well. I think that relying on the filesystem
contents to tell us what's going on is kind of hacky. The only
technical issue I see there is that the WAL logging might require more
kludgery, since that mechanism is kind of intertwined with
shared_buffers. You'd have to get the right block references into the
WAL record, and you have to make sure that checkpoints don't move the
redo pointer at an inopportune moment.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: O(n) tasks cause lengthy startups and checkpoints
Next
From: David Rowley
Date:
Subject: Re: Use generation context to speed up tuplesorts