Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Date
Msg-id aea01367-d988-fcfa-5224-bbc91839d5d6@iki.fi
Whole thread Raw
In response to [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
List pgsql-hackers
On 15/06/2021 14:20, Dilip Kumar wrote:
> Design Idea:
> -----------------
> First, create the target database directory along with the version
> file and WAL-log this operation.  Create the "relation map file" in
> the target database and copy the content from the source database. For
> this, we can use some modified versions of the write_relmap_file() and
> WAL-log the relmap create operation along with the file content.  Now,
> read the relmap file to find the relfilenode for pg_class and then we
> read pg_class block by block and decode the tuples. For reading the
> pg_class blocks, we can use ReadBufferWithoutRelCache() so that we
> don't need the relcache.  Nothing prevents us from checking visibility
> for tuples in another database because CLOG is global to the cluster.
> And nothing prevents us from deforming those tuples because the column
> definitions for pg_class have to be the same in every database. Then
> we can get the relfilenode of every file we need to copy, and prepare
> a list of all such relfilenode.

I guess that would work, but you could also walk the database directory 
like copydir() does. How you find the relations to copy is orthogonal to 
whether you WAL-log them or use checkpoints. And whether you use the 
buffer cache is also orthogonal to the rest of the proposal; you could 
issue FlushDatabaseBuffers() instead of a checkpoint.

> Next, for each relfilenode in the
> source database, create a respective relfilenode in the target
> database (for all forks) using smgrcreate, which is already a
> WAL-logged operation.  Now read the source relfilenode block by block
> using ReadBufferWithoutRelCache() and copy the block to the target
> relfilenode using smgrextend() and WAL-log them using log_newpage().
> For the source database, we can not directly use the smgrread(),
> because there could be some dirty buffers so we will have to read them
> through the buffer manager interface, otherwise, we will have to flush
> all the dirty buffers.

Yeah, WAL-logging the contents of the source database would certainly be 
less weird than the current system. As Julien also pointed out, the 
question is, are there people using on "CREATE DATABASE foo TEMPLATE 
bar" to copy a large source database, on the premise that it's fast 
because it skips WAL-logging?

In principle, we could have both mechanisms, and use the new WAL-logged 
system if the database is small, and the old system with checkpoints if 
it's large. But I don't like idea of having to maintain both.

- Heikki



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: Race condition in recovery?
Next
From: Robert Haas
Date:
Subject: Re: Race condition in recovery?