Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date
Msg-id 544EC007.40702@fuzzy.cz
Whole thread Raw
In response to Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
List pgsql-hackers
On 27.10.2014 17:24, Heikki Linnakangas wrote:
> On 10/27/2014 03:46 PM, Tom Lane wrote:
>> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>>> On 10/27/2014 03:21 PM, Tomas Vondra wrote:
>>>> Thinking about this a bit more, do we really need a full checkpoint?
>>>> That
>>>> is a checkpoint of all the databases in the cluster? Why
>>>> checkpointing the
>>>> source database is not enough?
>>
>>> A full checkpoint ensures that you always begin recovery *after* the
>>> DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
>>> crash recovery (except when you crash before the transaction commits, in
>>> which case it doesn't matter if the new database's directory is borked).
>>
>> Yeah.  After re-reading the 2005 thread, I wonder if we shouldn't just
>> bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
>> all the copied files instead of doing a "cp -r"-equivalent directory
>> copy.
>> That would fix a number of existing replay hazards as well as making it
>> safe to do what Tomas wants.  In the small scale this would cause more
>> I/O
>> (2 copies of the template database's data) but in production situations
>> we might well come out ahead by avoiding a forced checkpoint of the rest
>> of the cluster.  Also I guess we could skip WAL-logging if WAL archiving
>> is off, similarly to the existing optimization for CREATE INDEX etc.
> 
> That would be a nasty surprise for anyone who's using CREATE DATABASE
> as a fast way to clone a large database. But I would be OK with that,
> at least if we can skip the WAL-logging with wal_level=minimal.

That's true. Sadly, I can't think of a solution that would address both
use cases at the same time :-(

The only thing I can think of is having two CREATE DATABASE "flavors".
One keeping the current approach (suitable for fast cloning) and one
with the WAL logging (minimizing the CREATE DATABASE duration the impact
on other backends).

It will probably make the code significantly more complex, which is not
exactly desirable, I guess. Also, if we keep the current code (even if
only as a special case) it won't eliminate the existing replay hazards
(which was one of the Tom's arguments for biting the bullet).

I'm also thinking that for wal_level=archive and large databases, this
won't really eliminate the checkpoint as it will likely generate enough
WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?

That being said, our CREATE DATABASE docs currently say this
   Although it is possible to copy a database other than template1 by   specifying its name as the template, this is
not(yet) intended as   a general-purpose "COPY DATABASE" facility. The principal   limitation is that no other sessions
canbe connected to the   template database while it is being copied. CREATE DATABASE will   fail if any other
connectionexists when it starts; otherwise, new   connections to the template database are locked out until CREATE
DATABASEcompletes. See Section 21.3 for more information.
 

I think that this limitation pretty much means no one should use CREATE
DATABASE for cloning live databases in production environment (because
of the locking).

It also seems to me the "general-purpose COPY DATABASE" described in the
docs is what we're describing in this thread.

regards
Tomas



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Reducing the cost of sinval messaging
Next
From: Adam Brightwell
Date:
Subject: Re: Directory/File Access Permissions for COPY and Generic File Access Functions