Home > mailing lists

Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date	October 27, 2014 12:47:42
Msg-id	544E1499.6070105@vmware.com Whole thread Raw
In response to	proposal: CREATE DATABASE vs. (partial) CHECKPOINT (Tomas Vondra <tv@fuzzy.cz>)
Responses	Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
List	pgsql-hackers

Tree view

On 10/26/2014 11:47 PM, Tomas Vondra wrote:
> After eyeballing the code for an hour or two, I think CREATE DATABASE
> should be fine with performing only a 'partial checkpoint' on the
> template database - calling FlushDatabaseBuffers and processing unlink
> requests, as suggested by the comment in createdb().

Hmm. You could replace the first checkpoint with that, but I don't think 
that's enough for the second. To get any significant performance 
benefit, you need to get rid of both checkpoints, because doing two 
checkpoints one after another is almost as fast as doing a single 
checkpoint; the second checkpoint has very little work to do because the 
first checkpoint already flushed out everything.

The second checkpoint, after copying but before commit, is done because 
(from the comments in createdb function):

>  * #1: When PITR is off, we don't XLOG the contents of newly created
>  * indexes; therefore the drop-and-recreate-whole-directory behavior
>  * of DBASE_CREATE replay would lose such indexes.
>
>  * #2: Since we have to recopy the source database during DBASE_CREATE
>  * replay, we run the risk of copying changes in it that were
>  * committed after the original CREATE DATABASE command but before the
>  * system crash that led to the replay.  This is at least unexpected
>  * and at worst could lead to inconsistencies, eg duplicate table
>  * names.

Doing only FlushDatabaseBuffers would not prevent these issues - you 
need a full checkpoint. These issues are better explained here: 
http://www.postgresql.org/message-id/28884.1119727671@sss.pgh.pa.us

To solve #1, we could redesign CREATE DATABASE so that replaying the 
DBASE_CREATE record doesn't zap the old directory, and also doesn't copy 
any files. We could instead just assume that if the transaction commits, 
all the files have been copied and fsync'd already, like we assume that 
if a CREATE INDEX commits in wal_level=minimal, the underlying file was 
fsync'd before the commit.

That would also solve #2, when doing crash recovery. But it would remain 
when doing archive recovery. I guess we could still redo the copy when 
in archive recovery mode. I believe it would be the first time we have a 
WAL record that's replayed differently in crash recovery than in archive 
recovery, so here be dragons...

> It's not exactly trivial change, but it does not seem frighteningly
> difficult coding either.
>
> The templates are usually static, so this would minimize both the CREATE
> DATABASE duration and disruption to the cluster it causes.

I wonder if we should bite the bullet and start WAL-logging all the 
files that are copied from the template database to the new database. 
When the template database is small (template0 is 6.4MB currently), that 
wouldn't generate too much WAL. We could perhaps do that only if the 
template database is small, and do the checkpoints otherwise, although I 
wouldn't like to have subtly different behavior depending on database 
size like that.

- Heikki

pgsql-hackers by date:

From: Alvaro Herrera
Date: 27 October 2014, 12:29:40
Subject: Re: On partitioning

From: Pavel Stehule
Date: 27 October 2014, 13:06:14
Subject: Re: strip nulls functions for json and jsonb

Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT - Mailing list pgsql-hackers

Previous

Next