Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Msg-id: 22147331d79514289d417eac645ed588.squirrel@2.emaily.eu
In response to: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT (Heikki Linnakangas <hlinnakangas@vmware.com>)
List: pgsql-hackers
On 27 October 2014, 10:47, Heikki Linnakangas wrote:
> On 10/26/2014 11:47 PM, Tomas Vondra wrote:
>> After eyeballing the code for an hour or two, I think CREATE DATABASE
>> should be fine with performing only a 'partial checkpoint' on the
>> template database - calling FlushDatabaseBuffers and processing unlink
>> requests, as suggested by the comment in createdb().
>
> Hmm. You could replace the first checkpoint with that, but I don't think
> that's enough for the second. To get any significant performance
> benefit, you need to get rid of both checkpoints, because doing two
> checkpoints one after another is almost as fast as doing a single
> checkpoint; the second checkpoint has very little work to do because the
> first checkpoint already flushed out everything.

Yes, that's why I wrote that two checkpoints not too far apart are
effectively a single checkpoint. OTOH if the template database is not
small, copying the data may take a while, increasing the distance
between the two checkpoints.
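
To make that concrete, the first checkpoint in createdb() currently
looks like this, and could in principle be replaced along these lines.
FlushDatabaseBuffers() is the existing function in bufmgr.c; the
unlink-queue helper is made up, because today only the checkpointer
drains those pending requests, so some new interface would be needed:

    /* what createdb() does now */
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
                      CHECKPOINT_WAIT);

    /*
     * Sketch of the partial variant: flush only the dirty buffers of
     * the template database ...
     */
    FlushDatabaseBuffers(src_dboid);

    /*
     * ... and process pending unlink requests, so we cannot copy files
     * of relations already scheduled for deletion.  Hypothetical
     * function - no such backend-callable interface exists today.
     */
    ProcessUnlinkRequests();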

While our template databases are small (on the order of ~10-100 MB),
there are probably people using this to clone much larger databases.

> The second checkpoint, after copying but before commit, is done because
> (from the comments in createdb function):
>
>>  * #1: When PITR is off, we don't XLOG the contents of newly created
>>  * indexes; therefore the drop-and-recreate-whole-directory behavior
>>  * of DBASE_CREATE replay would lose such indexes.
>>
>>  * #2: Since we have to recopy the source database during DBASE_CREATE
>>  * replay, we run the risk of copying changes in it that were
>>  * committed after the original CREATE DATABASE command but before the
>>  * system crash that led to the replay.  This is at least unexpected
>>  * and at worst could lead to inconsistencies, eg duplicate table
>>  * names.
>
> Doing only FlushDatabaseBuffers would not prevent these issues - you
> need a full checkpoint. These issues are better explained here:
> http://www.postgresql.org/message-id/28884.1119727671@sss.pgh.pa.us
>
> To solve #1, we could redesign CREATE DATABASE so that replaying the
> DBASE_CREATE record doesn't zap the old directory, and also doesn't copy
> any files. We could instead just assume that if the transaction commits,
> all the files have been copied and fsync'd already, like we assume that
> if a CREATE INDEX commits in wal_level=minimal, the underlying file was
> fsync'd before the commit.
>
> That would also solve #2, when doing crash recovery. But it would remain
> when doing archive recovery. I guess we could still redo the copy when
> in archive recovery mode. I believe it would be the first time we have a
> WAL record that's replayed differently in crash recovery than in archive
> recovery, so here be dragons...

Yeah ...
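
If I understand the idea correctly, dbase_redo() would branch on the
existing InArchiveRecovery flag from xlog.c, roughly like this (a
sketch of the idea only, not working code):

    /*
     * In dbase_redo() (dbcommands.c), replaying XLOG_DBASE_CREATE;
     * "record" is the WAL record being replayed.
     */
    xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *)
        XLogRecGetData(record);
    char *src_path = GetDatabasePath(xlrec->src_db_id,
                                     xlrec->src_tablespace_id);
    char *dst_path = GetDatabasePath(xlrec->db_id,
                                     xlrec->tablespace_id);

    if (InArchiveRecovery)
    {
        /*
         * Archive recovery: keep the current behavior - remove any
         * leftover target directory (the existing rmtree() code) and
         * recopy the template.
         */
        copydir(src_path, dst_path, false);
    }
    else
    {
        /*
         * Crash recovery: assume the files were copied and fsync'd
         * before the transaction committed (as with CREATE INDEX under
         * wal_level=minimal), so there is nothing to do here.
         */
    }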

>> It's not exactly trivial change, but it does not seem frighteningly
>> difficult coding either.
>>
>> The templates are usually static, so this would minimize both the CREATE
>> DATABASE duration and disruption to the cluster it causes.
>
> I wonder if we should bite the bullet and start WAL-logging all the
> files that are copied from the template database to the new database.
> When the template database is small (template0 is 6.4MB currently), that
> wouldn't generate too much WAL. We could perhaps do that only if the
> template database is small, and do the checkpoints otherwise, although I
> wouldn't like to have subtly different behavior depending on database
> size like that.

IMHO, writing all the data into the WAL would be the cleanest solution.
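
FWIW there is already a precedent for that pattern: copy_relation_data()
in tablecmds.c WAL-logs every page with log_newpage() when ALTER TABLE
... SET TABLESPACE moves a relation. A block-by-block copy for CREATE
DATABASE could look about the same - a sketch, with the function name
and arguments being mine:

    #include "postgres.h"

    #include "access/xlog.h"        /* log_newpage() */
    #include "storage/bufpage.h"
    #include "storage/smgr.h"

    /* copy one fork of a relation block by block, WAL-logging each page */
    static void
    copy_fork_with_wal(SMgrRelation src, SMgrRelation dst,
                       ForkNumber forkNum)
    {
        char       *buf = (char *) palloc(BLCKSZ);
        Page        page = (Page) buf;
        BlockNumber nblocks = smgrnblocks(src, forkNum);
        BlockNumber blkno;

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            smgrread(src, forkNum, blkno, buf);

            /* emit a full-page image; page_std = false because not all
             * forks have the standard page layout */
            log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page,
                        false);

            PageSetChecksumInplace(page, blkno);

            /* skipFsync = true, we sync the whole fork below */
            smgrextend(dst, forkNum, blkno, buf, true);
        }

        /* make sure the new file reaches disk before commit */
        smgrimmedsync(dst, forkNum);

        pfree(buf);
    }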

Also, what counts as a small database? I don't think a static value will
work, because the sweet spot between the current approach (forcing two
checkpoints) and writing everything into the WAL depends on the amount
of dirty buffers that need to be checkpointed, which is mostly driven by
the size of shared buffers and the write activity. For small shared
buffers and/or a mostly-read workload, checkpoints are cheap, so the
'small database' threshold (below which the WAL approach wins) is much
lower than for large shared buffers or write-heavy workloads.

So maybe if we could determine the amount of data to be checkpointed and
then base the decision on that, it would work better? This would also
have to take into account that writing into the WAL is sequential, while
a checkpoint usually causes random writes all over the data files (which
are more expensive).
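
For example, something along these lines could produce that estimate -
just a sketch: the loop is loosely modeled on what pg_buffercache does,
the locking is simplified, and the 10x penalty for random writes below
is a number I made up:

    #include "postgres.h"

    #include "storage/buf_internals.h"
    #include "storage/bufmgr.h"

    /* count dirty buffers currently in shared memory */
    static int
    CountDirtyBuffers(void)
    {
        int     ndirty = 0;
        int     i;

        for (i = 0; i < NBuffers; i++)
        {
            volatile BufferDesc *bufHdr = &BufferDescriptors[i];

            LockBufHdr(bufHdr);
            if (bufHdr->flags & BM_DIRTY)
                ndirty++;
            UnlockBufHdr(bufHdr);
        }

        return ndirty;
    }

and then the (made-up) decision rule might be: WAL-log the copy when

    template_bytes < 10 * (uint64) CountDirtyBuffers() * BLCKSZ

i.e. when writing the template sequentially into the WAL is cheaper than
checkpointing the dirty buffers with mostly random I/O.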

Another option might be forcing just a "spread" checkpoint rather than
an immediate one (which is what we do now). That would not fix the
CREATE DATABASE duration (it would actually make it longer), but it
would lower the impact on other activity on the machine.
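
In code that is just dropping CHECKPOINT_IMMEDIATE from the existing
RequestCheckpoint() call in createdb():

    /* now: an immediate checkpoint, running at full speed */
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
                      CHECKPOINT_WAIT);

    /* spread variant: still forced and waited on, but paced by
     * checkpoint_completion_target like a regular timed checkpoint */
    RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT);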

regards
Tomas



