Thread: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

[Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
Currently, CREATE DATABASE forces a checkpoint, then copies all the
files, then forces another checkpoint. The comments in the createdb()
function explain the reasons for this. The attached patch fixes this
problem by making CREATE DATABASE completely WAL-logged so that now we
can avoid checkpoints.  The patch modifies both CREATE DATABASE and
ALTER DATABASE..SET TABLESPACE to be fully WAL-logged.

One main advantage of this change is that it will be cheaper. Forcing
checkpoints on an idle system is no big deal, but when the system is
under heavy write load, it's very expensive. Another advantage is that
it makes things better for features like TDE, which might want the
pages in the source database to be encrypted using a different key or
nonce than the pages in the target database.


Design Idea:
-----------------
First, create the target database directory along with the version
file and WAL-log this operation.  Create the "relation map file" in
the target database and copy the content from the source database. For
this, we can use some modified versions of the write_relmap_file() and
WAL-log the relmap create operation along with the file content.  Now,
read the relmap file to find the relfilenode for pg_class and then we
read pg_class block by block and decode the tuples. For reading the
pg_class blocks, we can use ReadBufferWithoutRelCache() so that we
don't need the relcache.  Nothing prevents us from checking visibility
for tuples in another database because CLOG is global to the cluster.
And nothing prevents us from deforming those tuples because the column
definitions for pg_class have to be the same in every database. Then
we can get the relfilenode of every file we need to copy, and prepare
a list of all such relfilenode.  Next, for each relfilenode in the
source database, create a respective relfilenode in the target
database (for all forks) using smgrcreate, which is already a
WAL-logged operation.  Now read the source relfilenode block by block
using ReadBufferWithoutRelCache() and copy the block to the target
relfilenode using smgrextend() and WAL-log them using log_newpage().
For the source database, we can not directly use the smgrread(),
because there could be some dirty buffers so we will have to read them
through the buffer manager interface, otherwise, we will have to flush
all the dirty buffers.

WAL sequence using pg_waldump
----------------------------------------------------
1. (new wal to create db dir and write PG_VERSION file)
rmgr: Database desc: CREATE create dir 1663/16394

2. (new wal to create and write relmap file)
rmgr: RelMap   desc: CREATE database 16394 tablespace 1663 size 512

2. (create relfilenode)
rmgr: Storage  desc: CREATE base/16394/16384
rmgr: Storage  desc: CREATE base/16394/2619

3. (write page data)
rmgr: XLOG     desc: FPI , blkref #0: rel 1663/16394/2619 blk 0 FPW
rmgr: XLOG     desc: FPI , blkref #0: rel 1663/16394/2619 blk 1 FPW
............
4. (create other forks)
rmgr: Storage  desc: CREATE base/16394/2619_fsm
rmgr: Storage  CREATE base/16394/2619_vm
.............

I have attached a POC patch, which shows this idea, with this patch
all basic sanity testing and the "check-world" is passing.

Open points:
-------------------
- This is a POC patch so needs more refactoring/cleanup and testing.
- Might need to relook into the SMGR level API usage.


Credits:
-----------
Thanks to Robert Haas, for suggesting this idea and the high-level design.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Julien Rouhaud
Date:
On Tue, Jun 15, 2021 at 04:50:24PM +0530, Dilip Kumar wrote:
> Currently, CREATE DATABASE forces a checkpoint, then copies all the
> files, then forces another checkpoint. The comments in the createdb()
> function explain the reasons for this. The attached patch fixes this
> problem by making CREATE DATABASE completely WAL-logged so that now we
> can avoid checkpoints.  The patch modifies both CREATE DATABASE and
> ALTER DATABASE..SET TABLESPACE to be fully WAL-logged.
> 
> One main advantage of this change is that it will be cheaper. Forcing
> checkpoints on an idle system is no big deal, but when the system is
> under heavy write load, it's very expensive. Another advantage is that
> it makes things better for features like TDE, which might want the
> pages in the source database to be encrypted using a different key or
> nonce than the pages in the target database.

I only had a quick look at the patch but AFAICS your patch makes the new
behavior mandatory.  Wouldn't it make sense to have a way to use the previous
approach?  People creating wanting to copy somewhat big database and with a
slow replication may prefer to pay 2 checkpoints rather than stream everything.
Same for people who have an otherwise idle system (I often use that to make
temporary backups and/or prepare multiple dataset and most of the time the
checkpoint is basically free).



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Heikki Linnakangas
Date:
On 15/06/2021 14:20, Dilip Kumar wrote:
> Design Idea:
> -----------------
> First, create the target database directory along with the version
> file and WAL-log this operation.  Create the "relation map file" in
> the target database and copy the content from the source database. For
> this, we can use some modified versions of the write_relmap_file() and
> WAL-log the relmap create operation along with the file content.  Now,
> read the relmap file to find the relfilenode for pg_class and then we
> read pg_class block by block and decode the tuples. For reading the
> pg_class blocks, we can use ReadBufferWithoutRelCache() so that we
> don't need the relcache.  Nothing prevents us from checking visibility
> for tuples in another database because CLOG is global to the cluster.
> And nothing prevents us from deforming those tuples because the column
> definitions for pg_class have to be the same in every database. Then
> we can get the relfilenode of every file we need to copy, and prepare
> a list of all such relfilenode.

I guess that would work, but you could also walk the database directory 
like copydir() does. How you find the relations to copy is orthogonal to 
whether you WAL-log them or use checkpoints. And whether you use the 
buffer cache is also orthogonal to the rest of the proposal; you could 
issue FlushDatabaseBuffers() instead of a checkpoint.

> Next, for each relfilenode in the
> source database, create a respective relfilenode in the target
> database (for all forks) using smgrcreate, which is already a
> WAL-logged operation.  Now read the source relfilenode block by block
> using ReadBufferWithoutRelCache() and copy the block to the target
> relfilenode using smgrextend() and WAL-log them using log_newpage().
> For the source database, we can not directly use the smgrread(),
> because there could be some dirty buffers so we will have to read them
> through the buffer manager interface, otherwise, we will have to flush
> all the dirty buffers.

Yeah, WAL-logging the contents of the source database would certainly be 
less weird than the current system. As Julien also pointed out, the 
question is, are there people using on "CREATE DATABASE foo TEMPLATE 
bar" to copy a large source database, on the premise that it's fast 
because it skips WAL-logging?

In principle, we could have both mechanisms, and use the new WAL-logged 
system if the database is small, and the old system with checkpoints if 
it's large. But I don't like idea of having to maintain both.

- Heikki



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Tue, Jun 15, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 15/06/2021 14:20, Dilip Kumar wrote:
> > Design Idea:
. Then
> > we can get the relfilenode of every file we need to copy, and prepare
> > a list of all such relfilenode.
>
> I guess that would work, but you could also walk the database directory
> like copydir() does. How you find the relations to copy is orthogonal to
> whether you WAL-log them or use checkpoints. And whether you use the
> buffer cache is also orthogonal to the rest of the proposal; you could
> issue FlushDatabaseBuffers() instead of a checkpoint.

Yeah, that would also work, but I thought since we are already
avoiding the checkpoint so let's avoid FlushDatabaseBuffers() also and
directly use the lower level buffer manager API which doesn't need
recache.  And I am using pg_class to identify the useful relfilenode
so that we can avoid processing some unwanted relfilenode but yeah I
agree that this is orthogonal to whether we use checkpoint or not.

> Yeah, WAL-logging the contents of the source database would certainly be
> less weird than the current system. As Julien also pointed out, the
> question is, are there people using on "CREATE DATABASE foo TEMPLATE
> bar" to copy a large source database, on the premise that it's fast
> because it skips WAL-logging?
>
> In principle, we could have both mechanisms, and use the new WAL-logged
> system if the database is small, and the old system with checkpoints if
> it's large. But I don't like idea of having to maintain both.

Yeah, I agree in some cases, where we don't have many dirty buffers,
checkpointing can be faster.  I think code wise maintaining two
approaches will not be a very difficult job because the old approach
just calls copydir(), but I am thinking about how can we decide which
approach is better in which scenario.  I don't think we can take calls
just based on the database size?  It would also depend upon many other
factors e.g. how busy your system is, how many total dirty buffers are
there in the cluster right? because checkpoint will affect the
performance of the operation going on in other databases in the
cluster.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Adam Brusselback
Date:
Am I mistaken in thinking that this would allow CREATE DATABASE to run inside a transaction block now, further reducing the DDL commands that are non-transactional?

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andrew Dunstan
Date:
On 6/15/21 8:04 AM, Heikki Linnakangas wrote:
>
> Yeah, WAL-logging the contents of the source database would certainly
> be less weird than the current system. As Julien also pointed out, the
> question is, are there people using on "CREATE DATABASE foo TEMPLATE
> bar" to copy a large source database, on the premise that it's fast
> because it skips WAL-logging?


I'm 100% certain there are. It's not even a niche case.


>
> In principle, we could have both mechanisms, and use the new
> WAL-logged system if the database is small, and the old system with
> checkpoints if it's large. But I don't like idea of having to maintain
> both.
>
>

Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Julien Rouhaud
Date:
On Tue, Jun 15, 2021 at 9:31 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> Rather than use size, I'd be inclined to say use this if the source
> database is marked as a template, and use the copydir approach for
> anything that isn't.

Looks like a good approach.



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Kyotaro Horiguchi
Date:
At Tue, 15 Jun 2021 22:07:32 +0800, Julien Rouhaud <rjuju123@gmail.com> wrote in 
> On Tue, Jun 15, 2021 at 9:31 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> >
> > Rather than use size, I'd be inclined to say use this if the source
> > database is marked as a template, and use the copydir approach for
> > anything that isn't.
> 
> Looks like a good approach.

If we are willing to maintain the two methods.

Couldn't we just skip the checkpoints if the database is known to
"clean", which means no page has been loaded for the database since
startup? We can use the "template" mark to reject connections to the
database.  (I'm afraid that we also should prevent vacuum to visit the
template databases, but...)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> Rather than use size, I'd be inclined to say use this if the source
> database is marked as a template, and use the copydir approach for
> anything that isn't.

Yeah, that is possible, on the other thought wouldn't it be good to
provide control to the user by providing two different commands, e.g.
COPY DATABASE for the existing method (copydir) and CREATE DATABASE
for the new method (fully wal logged)?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2021-06-15 16:50:24 +0530, Dilip Kumar wrote:
> The patch modifies both CREATE DATABASE and ALTER DATABASE..SET
> TABLESPACE to be fully WAL-logged.

Generally quite a bit in favor of this - the current approach is very
heavyweight, slow and I think we have a few open corner bugs related to
it.


> Design Idea:
> -----------------
> First, create the target database directory along with the version
> file and WAL-log this operation.

What happens if you crash / promote at this point?


> Create the "relation map file" in the target database and copy the
> content from the source database. For this, we can use some modified
> versions of the write_relmap_file() and WAL-log the relmap create
> operation along with the file content.  Now, read the relmap file to
> find the relfilenode for pg_class and then we read pg_class block by
> block and decode the tuples.

This doesn't seem like a great approach - you're not going to be able to
use much of the normal infrastructure around processing tuples. So it
seems like it'd end up with quite a bit of special case code that needs
to maintained in parallel.


> Now read the source relfilenode block by block using
> ReadBufferWithoutRelCache() and copy the block to the target
> relfilenode using smgrextend() and WAL-log them using log_newpage().
> For the source database, we can not directly use the smgrread(),
> because there could be some dirty buffers so we will have to read them
> through the buffer manager interface, otherwise, we will have to flush
> all the dirty buffers.

I think we might need a bit more batching for the WAL logging. There are
cases of template database considerably bigger than the default and the
overhead of logging each write separately seems likely to be noticable.

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2021-06-15 18:11:23 +0530, Dilip Kumar wrote:
> On Tue, Jun 15, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 15/06/2021 14:20, Dilip Kumar wrote:
> > > Design Idea:
> . Then
> > > we can get the relfilenode of every file we need to copy, and prepare
> > > a list of all such relfilenode.
> >
> > I guess that would work, but you could also walk the database directory
> > like copydir() does. How you find the relations to copy is orthogonal to
> > whether you WAL-log them or use checkpoints. And whether you use the
> > buffer cache is also orthogonal to the rest of the proposal; you could
> > issue FlushDatabaseBuffers() instead of a checkpoint.
> 
> Yeah, that would also work, but I thought since we are already
> avoiding the checkpoint so let's avoid FlushDatabaseBuffers() also and
> directly use the lower level buffer manager API which doesn't need
> recache.  And I am using pg_class to identify the useful relfilenode
> so that we can avoid processing some unwanted relfilenode but yeah I
> agree that this is orthogonal to whether we use checkpoint or not.

It's not entirely obvious to me that it's important to avoid
FlushDatabaseBuffers() on its own. Forcing a checkpoint is problematic because
it unnecessarily writes out dirty buffers in other databases, triggers FPWs
etc. Normally a database used as a template won't have a meaningful amount of
dirty buffers itself, so the FlushDatabaseBuffers() shouldn't trigger a lot of
writes. Of course, there is the matter of FlushDatabaseBuffers() not being
cheap with a large shared_buffers - but I suspect that's not a huge factor
compared to the rest of the database creation cost.

I think the better argument for going through shared buffers is that it might
be worth doing so for the *target* database. A common use of frequently
creating databases, in particular with a non-default template database, is to
run regression tests with pre-created schema / data - writing out all that data
just to have it then dropped a few seconds later after the regression test
completed is wasteful.



> > In principle, we could have both mechanisms, and use the new WAL-logged
> > system if the database is small, and the old system with checkpoints if
> > it's large. But I don't like idea of having to maintain both.
> 
> Yeah, I agree in some cases, where we don't have many dirty buffers,
> checkpointing can be faster.

I don't think the main issue is the speed of checkpointing itself? The reaoson
to maintain the old paths is that the "new approach" is bloating WAL volume,
no? Right now cloning a 1TB database costs a few hundred bytes of WAL and about
1TB of write IO. With the proposed approach, the write volume approximately
doubles, because there'll also be about 1TB in WAL.

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tomas Vondra
Date:

On 6/15/21 3:31 PM, Andrew Dunstan wrote:
> 
> On 6/15/21 8:04 AM, Heikki Linnakangas wrote:
>>
>> Yeah, WAL-logging the contents of the source database would certainly
>> be less weird than the current system. As Julien also pointed out, the
>> question is, are there people using on "CREATE DATABASE foo TEMPLATE
>> bar" to copy a large source database, on the premise that it's fast
>> because it skips WAL-logging?
> 
> 
> I'm 100% certain there are. It's not even a niche case.
> 
> 
>>
>> In principle, we could have both mechanisms, and use the new
>> WAL-logged system if the database is small, and the old system with
>> checkpoints if it's large. But I don't like idea of having to maintain
>> both.
>>
>>
> 
> Rather than use size, I'd be inclined to say use this if the source
> database is marked as a template, and use the copydir approach for
> anything that isn't.
> 


I think we should be asking what is the benefit of that use case, and 
perhaps try addressing that without having to maintain two entirely 
different ways to do CREATE DATABASE. It's not like we're sure the 
current code is 100% reliable in various corner cases, I doubt having 
two separate approaches will improve the situation :-/

I can see three reasons why people want to skip the WAL logging:

1) it's faster, because there's no CPU and I/O for building the WAL

   I wonder if some optimization / batching could help with (1), as
   suggested by Andres elsewhere in this thread.

2) it saves the amount of WAL (could matter with large template 
databases and WAL archiving, etc.)

   We can't really do much about this - we need to log all the data. But
   the batching from (1) might help a bit too, I guess.

3) saves the amount of WAL that needs to be copied to standby, so that 
there's no increase of replication lag, etc. particularly when the 
network link has limited bandwidth

   I think this is a more general issue - some operations that may
   generate a lot of WAL, and we generally assume it's better to do
   that rather than hold exclusive locks for long time. But maybe we
   could have some throttling, to limit the amount of WAL per second,
   similarly to what we have to plain vacuum.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
()log_newpage()On Thu, Jun 17, 2021 at 3:28 AM Andres Freund
<andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-06-15 16:50:24 +0530, Dilip Kumar wrote:
> > The patch modifies both CREATE DATABASE and ALTER DATABASE..SET
> > TABLESPACE to be fully WAL-logged.
>
> Generally quite a bit in favor of this - the current approach is very
> heavyweight, slow and I think we have a few open corner bugs related to
> it.

Great!

>
> > Design Idea:
> > -----------------
> > First, create the target database directory along with the version
> > file and WAL-log this operation.
>
> What happens if you crash / promote at this point?

I will check this.

> > Create the "relation map file" in the target database and copy the
> > content from the source database. For this, we can use some modified
> > versions of the write_relmap_file() and WAL-log the relmap create
> > operation along with the file content.  Now, read the relmap file to
> > find the relfilenode for pg_class and then we read pg_class block by
> > block and decode the tuples.
>
> This doesn't seem like a great approach - you're not going to be able to
> use much of the normal infrastructure around processing tuples. So it
> seems like it'd end up with quite a bit of special case code that needs
> to maintained in parallel.

Yeah, this needs some special-purpose code but it is not too much
code.  I agree that instead of scanning the pg_class we can scan all
the tablespaces and under that identify the source database directory
as we do now.  And from there we can copy each relfilenode block by
block with wal log.  Honestly, these both seem like a special-purpose
code.  Another problem with directly scanning the directory is, how we
are supposed to get the "relpersistence" which is stored in pg_class
tuple right?

>
> > Now read the source relfilenode block by block using
> > ReadBufferWithoutRelCache() and copy the block to the target
> > relfilenode using smgrextend() and WAL-log them using log_newpage().
> > For the source database, we can not directly use the smgrread(),
> > because there could be some dirty buffers so we will have to read them
> > through the buffer manager interface, otherwise, we will have to flush
> > all the dirty buffers.
>
> I think we might need a bit more batching for the WAL logging. There are
> cases of template database considerably bigger than the default and the
> overhead of logging each write separately seems likely to be noticable.

Yeah, we can do that, and instead of using log_newpage() we can use
log_newpages(), to log multiple pages at once.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Jun 17, 2021 at 3:43 AM Andres Freund <andres@anarazel.de> wrote:

> > Yeah, that would also work, but I thought since we are already
> > avoiding the checkpoint so let's avoid FlushDatabaseBuffers() also and
> > directly use the lower level buffer manager API which doesn't need
> > recache.  And I am using pg_class to identify the useful relfilenode
> > so that we can avoid processing some unwanted relfilenode but yeah I
> > agree that this is orthogonal to whether we use checkpoint or not.
>
> It's not entirely obvious to me that it's important to avoid
> FlushDatabaseBuffers() on its own. Forcing a checkpoint is problematic because
> it unnecessarily writes out dirty buffers in other databases, triggers FPWs
> etc. Normally a database used as a template won't have a meaningful amount of
> dirty buffers itself, so the FlushDatabaseBuffers() shouldn't trigger a lot of
> writes. Of course, there is the matter of FlushDatabaseBuffers() not being
> cheap with a large shared_buffers - but I suspect that's not a huge factor
> compared to the rest of the database creation cost.

Okay so if I think from that POW, then maybe we can just
FlushDatabaseBuffers() and then directly use smgrread() calls.

> I think the better argument for going through shared buffers is that it might
> be worth doing so for the *target* database. A common use of frequently
> creating databases, in particular with a non-default template database, is to
> run regression tests with pre-created schema / data - writing out all that data
> just to have it then dropped a few seconds later after the regression test
> completed is wasteful.

Okay, I am not sure how common this use case is but for this use case
it makes sense to use bufmgr for the target database.

> > > In principle, we could have both mechanisms, and use the new WAL-logged
> > > system if the database is small, and the old system with checkpoints if
> > > it's large. But I don't like idea of having to maintain both.
> >
> > Yeah, I agree in some cases, where we don't have many dirty buffers,
> > checkpointing can be faster.
>
> I don't think the main issue is the speed of checkpointing itself? The reaoson
> to maintain the old paths is that the "new approach" is bloating WAL volume,
> no? Right now cloning a 1TB database costs a few hundred bytes of WAL and about
> 1TB of write IO. With the proposed approach, the write volume approximately
> doubles, because there'll also be about 1TB in WAL.

Make sense.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Heikki Linnakangas
Date:
On 17/06/2021 08:45, Dilip Kumar wrote:
> Another problem with directly scanning the directory is, how we
> are supposed to get the "relpersistence" which is stored in pg_class
> tuple right?

You only need relpersistence if you want to use the buffer cache, right? 
I think that's a good argument for not using it.

- Heikki



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Jun 17, 2021 at 2:50 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 17/06/2021 08:45, Dilip Kumar wrote:
> > Another problem with directly scanning the directory is, how we
> > are supposed to get the "relpersistence" which is stored in pg_class
> > tuple right?
>
> You only need relpersistence if you want to use the buffer cache, right?
> I think that's a good argument for not using it.

Yeah, that is the one place,  another place I am using it to decide
whether to WAL log the new page while writing into the target
relfilenode, if it is unlogged relation then I am not WAL logging. But
now, I think that is not the right idea, during creating the database
we should WAL log all the pages irrespective of whether the table is
logged or unlogged.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Wed, Jun 16, 2021 at 6:13 PM Andres Freund <andres@anarazel.de> wrote:
> I don't think the main issue is the speed of checkpointing itself? The reaoson
> to maintain the old paths is that the "new approach" is bloating WAL volume,
> no? Right now cloning a 1TB database costs a few hundred bytes of WAL and about
> 1TB of write IO. With the proposed approach, the write volume approximately
> doubles, because there'll also be about 1TB in WAL.

This is a good point, but on the other hand, I think this smells a lot
like the wal_level=minimal optimization where we don't need to log
data being bulk-loaded into a table created in the same transaction if
wal_level=minimal. In theory, that optimization has a lot of value,
but in practice it gets a lot of bad press on this list, because (1)
sometimes doing the fsync is more expensive than writing the extra WAL
would have been and (2) most people want to run with
wal_level=replica/logical so it ends up being a code path that isn't
used much and is therefore more likely than average to have bugs
nobody's terribly interested in fixing (except Noah ... thanks Noah!).
If we add features in the future, lke TDE or perhaps incremental
backup, that rely on new pages getting new LSNs instead of recycled
ones, this may turn into the same kind of wart. And as with that
optimization, you're probably not even better off unless the database
is pretty big, and you might be worse off if you have to do fsyncs or
flush buffers synchronously. I'm not severely opposed to keeping both
methods around, so if that's really what people want to do, OK, but I
guess I wonder whether we're really going to be happy with that
decision down the road.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Jun 17, 2021 at 5:20 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> You only need relpersistence if you want to use the buffer cache, right?
> I think that's a good argument for not using it.

I think the root of the problem with this feature is that it doesn't
go through shared_buffers, so in my opinion, it would be better if we
can make it all go through shared_buffers. It seems like you're
advocating a middle ground where half of the operation goes through
shared_buffers and the other half doesn't, but that sounds like
getting rid of half of the hack when we could have gotten rid of all
of it. I think things that don't go through shared_buffers are bad,
and we should be making an effort to get rid of them where we can
reasonably do so. I believe I've both introduced and fixed my share of
bugs that were caused by such cases, and I think the behavior of the
whole system would be a lot easier to reason about if we had fewer of
those, or none.

I can also think of at least one significant advantage of driving this
off the remote database's pg_class rather than the filesystem
contents. It's a known defect of PostgreSQL that if you create a table
and then crash, you leave behind a dead file that never gets removed.
If you now copy the database that contains that orphaned file, you
would ideally prefer not to copy that file, but if you do a copy based
on the filesystem contents, then you will. If you drive the copy off
of pg_class, you won't.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2021-06-17 13:53:38 -0400, Robert Haas wrote:
> On Thu, Jun 17, 2021 at 5:20 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > You only need relpersistence if you want to use the buffer cache, right?
> > I think that's a good argument for not using it.

Do we really need pg_class to figure this out? Can't we just check if
the relation has an init fork?


> I can also think of at least one significant advantage of driving this
> off the remote database's pg_class rather than the filesystem
> contents. It's a known defect of PostgreSQL that if you create a table
> and then crash, you leave behind a dead file that never gets removed.
> If you now copy the database that contains that orphaned file, you
> would ideally prefer not to copy that file, but if you do a copy based
> on the filesystem contents, then you will. If you drive the copy off
> of pg_class, you won't.

I'm very unconvinced this is the place to tackle the issue of orphan
relfilenodes. It'd be one thing if it were doable by existing code,
e.g. because we supported cross-database relation accesses fully, but we
don't.

Adding a hacky special case implementation for cross-database relation
accesses that violates all kinds of assumptions (like holding a lock on
a relation when accessing it / pinning pages, processing relcache
invals, ...) doesn't seem like a good plan.

I don't think this is an academic concern: You need to read from shared
buffers to read the "remote" pg_class, otherwise you'll potentially miss
changes. But it's not correct to read in pages or to pin pages without
holding a lock, and there's code that relies on that (see
e.g. InvalidateBuffer()).

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Jun 17, 2021 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:
> Adding a hacky special case implementation for cross-database relation
> accesses that violates all kinds of assumptions (like holding a lock on
> a relation when accessing it / pinning pages, processing relcache
> invals, ...) doesn't seem like a good plan.

I agree that we don't want hacky code that violates assumptions, but
bypassing shared_buffers is a bit hacky, too. Can't we lock the
relations as we're copying them? We know pg_class's OID a fortiori,
and we can find out all the other OIDs as we go.

I'm just thinking that the hackiness of going around shared_buffers
feels irreducible, but maybe the hackiness in the patch is something
that can be solved with more engineering.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2021-06-17 14:22:52 -0400, Robert Haas wrote:
> On Thu, Jun 17, 2021 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:
> > Adding a hacky special case implementation for cross-database relation
> > accesses that violates all kinds of assumptions (like holding a lock on
> > a relation when accessing it / pinning pages, processing relcache
> > invals, ...) doesn't seem like a good plan.
> 
> I agree that we don't want hacky code that violates assumptions, but
> bypassing shared_buffers is a bit hacky, too. Can't we lock the
> relations as we're copying them? We know pg_class's OID a fortiori,
> and we can find out all the other OIDs as we go.

We possibly can - but I'm not sure that won't end up violating some
other assumptions.


> I'm just thinking that the hackiness of going around shared_buffers
> feels irreducible, but maybe the hackiness in the patch is something
> that can be solved with more engineering.

Which bypassing of shared buffers are you talking about here? We'd still
have to solve a subset of the issues around locking (at least on the
source side), but I don't think we need to read pg_class contents to be
able to go through shared_buffers? As I suggested, we can use the init
fork presence to infer relpersistence?

Or do you mean that looking at the filesystem at all is bypassing shared
buffers?

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Jun 17, 2021 at 2:48 PM Andres Freund <andres@anarazel.de> wrote:
> Or do you mean that looking at the filesystem at all is bypassing shared
> buffers?

This is what I mean. I think we will end up in a better spot if we can
avoid doing that without creating too much ugliness elsewhere.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Jun 18, 2021 at 12:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 17, 2021 at 2:48 PM Andres Freund <andres@anarazel.de> wrote:
> > Or do you mean that looking at the filesystem at all is bypassing shared
> > buffers?
>
> This is what I mean. I think we will end up in a better spot if we can
> avoid doing that without creating too much ugliness elsewhere.
>

The patch was not getting applied on head so I have rebased it, along
with that now I have used bufmgr layer for writing writing/logging
destination pages as well instead of directly using sgmr layer.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Tue, Jul 6, 2021 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jun 18, 2021 at 12:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Thu, Jun 17, 2021 at 2:48 PM Andres Freund <andres@anarazel.de> wrote:
> > > Or do you mean that looking at the filesystem at all is bypassing shared
> > > buffers?
> >
> > This is what I mean. I think we will end up in a better spot if we can
> > avoid doing that without creating too much ugliness elsewhere.
> >
>
> The patch was not getting applied on head so I have rebased it, along
> with that now I have used bufmgr layer for writing writing/logging
> destination pages as well instead of directly using sgmr layer.

I have done further cleanup of the patch and also divided it into 3 patches.

0001 - Currently, write_relmap_file and load_relmap_file are tightly
coupled with shared_map and local_map.  As part of the higher level
patch set we need remap read/write interfaces that are not dependent
upon shared_map and local_map, and we should be able to pass map
memory as an external parameter instead.

0002- Support new interfaces in relmapper, 1) Support copying the
relmap file from one database path to the other database path. 2) Like
RelationMapOidToFilenode, provide another interface which do the same
but instead of getting it for the database we are connected to it will
get it for the input database path. These interfaces are required for
the next patch for supporting the wal logged created database.

0003- The main patch for WAL logging the created database operation.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Sep 2, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 0003- The main patch for WAL logging the created database operation.

Andres pointed out that this approach ends up accessing relations
without taking a lock on them. It doesn't look like you did anything
about that.

+ /* Built-in oids are mapped directly */
+ if (classForm->oid < FirstGenbkiObjectId)
+ relfilenode = classForm->oid;
+ else if (OidIsValid(classForm->relfilenode))
+ relfilenode = classForm->relfilenode;
+ else
+ continue;

Am I missing something, or is this totally busted?

[rhaas pgsql]$ createdb
[rhaas pgsql]$ psql
psql (15devel)
Type "help" for help.

rhaas=# select oid::regclass from pg_class where relfilenode not in
(0, oid) and oid < 10000;
 oid
-----
(0 rows)

rhaas=# vacuum full pg_attrdef;
VACUUM
rhaas=# select oid::regclass from pg_class where relfilenode not in
(0, oid) and oid < 10000;
              oid
--------------------------------
 pg_attrdef_adrelid_adnum_index
 pg_attrdef_oid_index
 pg_toast.pg_toast_2604
 pg_toast.pg_toast_2604_index
 pg_attrdef
(5 rows)

  /*
+ * Now drop all buffers holding data of the target database; they should
+ * no longer be dirty so DropDatabaseBuffers is safe.

The way things worked before, this was true, but now AFAICS it's
false. I'm not sure whether that means that DropDatabaseBuffers() here
is actually unsafe or whether it just means that you haven't updated
the comment to explain the reason.

+ * Since we copy the file directly without looking at the shared buffers,
+ * we'd better first flush out any pages of the source relation that are
+ * in shared buffers.  We assume no new changes will be made while we are
+ * holding exclusive lock on the rel.

Ditto.

+ /* As always, WAL must hit the disk before the data update does. */

Actually, the way it's coded now, part of the on-disk changes are done
before WAL is issued, and part are done after. I doubt that's the
right idea. There's nothing special about writing the actual payload
bytes vs. the other on-disk changes (creating directories and files).
In any case the ordering deserves a better-considered comment than
this one.

+ XLogRegisterData((char *) PG_MAJORVERSION, nbytes);

Surely this is utterly pointless.

+ CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
  PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
  PointerGetDatum(&fparms));

I'd leave braces around the code for which we're ensuring error
cleanup even if it's just one line.

+ if (info == XLOG_DBASEDIR_CREATE)
  {
  xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);

Seems odd to rename the record but not change the name of the struct.
I think I would be inclined to keep the existing record name, but if
we're going to change it we should be more thorough.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Jun 18, 2021 at 12:18 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2021-06-17 14:22:52 -0400, Robert Haas wrote:
> On Thu, Jun 17, 2021 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:
> > Adding a hacky special case implementation for cross-database relation
> > accesses that violates all kinds of assumptions (like holding a lock on
> > a relation when accessing it / pinning pages, processing relcache
> > invals, ...) doesn't seem like a good plan.
>
> I agree that we don't want hacky code that violates assumptions, but
> bypassing shared_buffers is a bit hacky, too. Can't we lock the
> relations as we're copying them? We know pg_class's OID a fortiori,
> and we can find out all the other OIDs as we go. 

We possibly can - but I'm not sure that won't end up violating some
other assumptions.

Yeah, we can surely lock the relation as described by Robert, but IMHO, while creating the database we are already holding the exclusive lock on the database and there is no one else allowed to be connected to the database, so do we actually need to bother about the lock for the correctness?
 
> I'm just thinking that the hackiness of going around shared_buffers
> feels irreducible, but maybe the hackiness in the patch is something
> that can be solved with more engineering.

Which bypassing of shared buffers are you talking about here? We'd still
have to solve a subset of the issues around locking (at least on the
source side), but I don't think we need to read pg_class contents to be
able to go through shared_buffers? As I suggested, we can use the init
fork presence to infer relpersistence?

I believe we want to avoid scanning pg_class for identifying the relation list so that we can avoid this special-purpose code?  IMHO, scanning the disk, such as going through all the tablespaces and then finding the source database directory and identifying each relfilenode, also appears to be a special-purpose code, unless I am missing what you mean by special-purpose code.

Or do you mean that looking at the filesystem at all is bypassing shared
buffers?
 
I think we already have such a code in multiple places where we bypass the shared buffers for copying the relation e.g. index_copy_data(), heapam_relation_copy_data().  So the current system as it stands, we can not claim that we are designing something for the first time where we are bypassing the shared buffers.   So if we are thinking that bypassing the shared buffers is hackish and we don't want to use it for the new patches then lets avoid it completely even while identifying the relfilenodes to be copied.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Sep 2, 2021 at 8:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Sep 2, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 0003- The main patch for WAL logging the created database operation.

Andres pointed out that this approach ends up accessing relations
without taking a lock on them. It doesn't look like you did anything
about that.

I missed that, I have shared my opinion about this in my last email [1]
 

+ /* Built-in oids are mapped directly */
+ if (classForm->oid < FirstGenbkiObjectId)
+ relfilenode = classForm->oid;
+ else if (OidIsValid(classForm->relfilenode))
+ relfilenode = classForm->relfilenode;
+ else
+ continue;

Am I missing something, or is this totally busted?

Oops, I think the condition should be like below, but I will think carefully before posting the next version if there is something else I am missing.

if (OidIsValid(classForm->relfilenode))
   relfilenode = classForm->relfilenode;
else if  if (classForm->oid < FirstGenbkiObjectId)
   relfilenode = classForm->oid;
else
  continue
 
  /*
+ * Now drop all buffers holding data of the target database; they should
+ * no longer be dirty so DropDatabaseBuffers is safe.

The way things worked before, this was true, but now AFAICS it's
false. I'm not sure whether that means that DropDatabaseBuffers() here
is actually unsafe or whether it just means that you haven't updated
the comment to explain the reason.

I think DropDatabaseBuffers(), itself is unsafe, we just copied pages using bufmgr and dirtied the buffers so dropping buffers is definitely unsafe here.
 
+ * Since we copy the file directly without looking at the shared buffers,
+ * we'd better first flush out any pages of the source relation that are
+ * in shared buffers.  We assume no new changes will be made while we are
+ * holding exclusive lock on the rel.

Ditto.

Yeah this comment no longer makes sense now.
 

+ /* As always, WAL must hit the disk before the data update does. */

Actually, the way it's coded now, part of the on-disk changes are done
before WAL is issued, and part are done after. I doubt that's the
right idea. 
There's nothing special about writing the actual payload
bytes vs. the other on-disk changes (creating directories and files).
In any case the ordering deserves a better-considered comment than
this one.

Agreed to all, but In general, I think WAL hitting the disk before data is more applicable for the shared buffers, where we want to flush the WAL first before writing the shared buffer so that in case of torn page we have an option to recover the page from previous FPI. But in such cases where we are creating a directory or file there is no such requirement.   Anyways, I agreed with the comments that it should be more uniform and the comment should be correct.

+ XLogRegisterData((char *) PG_MAJORVERSION, nbytes);

Surely this is utterly pointless.

Yeah it is.  During the WAL replay also we know the PG_MAJORVERSION :)
 
+ CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
  PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
  PointerGetDatum(&fparms));

I'd leave braces around the code for which we're ensuring error
cleanup even if it's just one line.
 
Okay
 
+ if (info == XLOG_DBASEDIR_CREATE)
  {
  xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);

Seems odd to rename the record but not change the name of the struct.
I think I would be inclined to keep the existing record name, but if
we're going to change it we should be more thorough.

Right, I think we can leave the record name as it is. 


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Fri, Sep 3, 2021 at 6:23 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> + /* Built-in oids are mapped directly */
>> + if (classForm->oid < FirstGenbkiObjectId)
>> + relfilenode = classForm->oid;
>> + else if (OidIsValid(classForm->relfilenode))
>> + relfilenode = classForm->relfilenode;
>> + else
>> + continue;
>>
>> Am I missing something, or is this totally busted?
>
> Oops, I think the condition should be like below, but I will think carefully before posting the next version if there
issomething else I am missing. 
>
> if (OidIsValid(classForm->relfilenode))
>    relfilenode = classForm->relfilenode;
> else if  if (classForm->oid < FirstGenbkiObjectId)
>    relfilenode = classForm->oid;
> else
>   continue

What about mapped rels that have been rewritten at some point?

> Agreed to all, but In general, I think WAL hitting the disk before data is more applicable for the shared buffers,
wherewe want to flush the WAL first before writing the shared buffer so that in case of torn page we have an option to
recoverthe page from previous FPI. But in such cases where we are creating a directory or file there is no such
requirement.  Anyways, I agreed with the comments that it should be more uniform and the comment should be correct. 

There have been previous debates about whether WAL records for
filesystem operations should be issued before or after those
operations are performed. I'm not sure how easy those discussion are
to find in the archives, but it's very relevant here. I think the
short version is - if we write a WAL record first and then the
operation fails afterward, we have to PANIC. But if we perform the
operation first and then write the WAL record if it succeeds, then we
could crash before writing WAL and end up out of sync with our
standbys. If we then later do any WAL-logged operation locally that
depends on that operation having been performed, replay will fail on
the standby. There used to be, or maybe still are, comments in the
code defending the latter approach, but more recently it's been
strongly criticized. The thinking, AIUI, is basically that filesystem
operations really ought not to fail, because nobody should be doing
weird things to the data directory, and if they do, panicking is OK.
But having replay fail in strange ways on the standby later is not OK.

I'm not sure if everyone agrees with that logic; it seems somewhat
debatable. I *think* I personally agree with it but ... I'm not even
100% sure about that.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2021-09-03 14:25:10 +0530, Dilip Kumar wrote:
> Yeah, we can surely lock the relation as described by Robert, but IMHO,
> while creating the database we are already holding the exclusive lock on
> the database and there is no one else allowed to be connected to the
> database, so do we actually need to bother about the lock for the
> correctness?

The problem is that checkpointer, bgwriter, buffer reclaim don't care about
the database of the buffer they're working on... The exclusive lock on the
database doesn't change anything about that. Perhaps you can justify it's safe
because there can't be any dirty buffers or such though.


> I think we already have such a code in multiple places where we bypass the
> shared buffers for copying the relation
> e.g. index_copy_data(), heapam_relation_copy_data().

That's not at all comparable. We hold an exclusive lock on the relation at
that point, and we don't have a separate implementation of reading tuples from
the table or something like that.

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Sat, Sep 4, 2021 at 3:24 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2021-09-03 14:25:10 +0530, Dilip Kumar wrote:
> Yeah, we can surely lock the relation as described by Robert, but IMHO,
> while creating the database we are already holding the exclusive lock on
> the database and there is no one else allowed to be connected to the
> database, so do we actually need to bother about the lock for the
> correctness?

The problem is that checkpointer, bgwriter, buffer reclaim don't care about
the database of the buffer they're working on... The exclusive lock on the
database doesn't change anything about that.

But these directly operate on the buffers and In my patch, whether we are reading the pg_class for identifying the relfilenode or we are copying the relation block by block we are always holding the lock on the buffer.
 
Perhaps you can justify it's safe
because there can't be any dirty buffers or such though.


> I think we already have such a code in multiple places where we bypass the
> shared buffers for copying the relation
> e.g. index_copy_data(), heapam_relation_copy_data().

That's not at all comparable. We hold an exclusive lock on the relation at
that point, and we don't have a separate implementation of reading tuples from
the table or something like that.

Okay, but my example was against the point Robert raised that he feels that bypassing the shared buffer anywhere is hackish.  But yeah, I agree his point might be that even if we are using it in existing code we can not justify it.

For moving forward I think the main open concerns we have as of now are

1. Special purpose code of scanning pg_class, so that we can solve it by scanning the source database directory, I think Robert doesn't like this approach because we are directly scanning to directory and bypassing the shared buffers?  But this is not any worse than what we have now right?  I mean now also we are scanning the directory directly, so only change will be instead of copying files directly we will read file and copy block by block.

2. Another problem is, while copying the relation we are accessing the relation buffers but we are not holding the relation lock, but we are already holding the buffer so I am not sure do we really have a problem here w.r.t checkpointer, bgwriter?  But if we have the problem then also we can create the lock tag and acquire the relation lock.

3. While copying the relation whether to use the bufmgr or directly use the smgr?

If we use the bufmgr then maybe we can avoid flushing some of the buffers to the disk and save some I/O but in general we copy from the template database so there might not be a lot of dirty buffers and we might not save anything,  OTOH, if we directly use the smgr for copying the relation data we can reuse some existing code RelationCopyStorage() and the patch will be simpler.  Other than just code simplicity or IO there is also a concern by Robert that he doesn't like to bypass the bufmgr, and that will be applicable to the point #1 as well as #3.

Thoughts?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
On 2021-09-05 14:22:51 +0530, Dilip Kumar wrote:
> On Sat, Sep 4, 2021 at 3:24 AM Andres Freund <andres@anarazel.de> wrote:
> 
> > Hi,
> >
> > On 2021-09-03 14:25:10 +0530, Dilip Kumar wrote:
> > > Yeah, we can surely lock the relation as described by Robert, but IMHO,
> > > while creating the database we are already holding the exclusive lock on
> > > the database and there is no one else allowed to be connected to the
> > > database, so do we actually need to bother about the lock for the
> > > correctness?
> >
> > The problem is that checkpointer, bgwriter, buffer reclaim don't care about
> > the database of the buffer they're working on... The exclusive lock on the
> > database doesn't change anything about that.
> 
> 
> But these directly operate on the buffers and In my patch, whether we are
> reading the pg_class for identifying the relfilenode or we are copying the
> relation block by block we are always holding the lock on the buffer.

I don't think a buffer lock is really sufficient. See e.g. code like:

static void
InvalidateBuffer(BufferDesc *buf)
{
...
    /*
     * We assume the only reason for it to be pinned is that someone else is
     * flushing the page out.  Wait for them to finish.  (This could be an
     * infinite loop if the refcount is messed up... it would be nice to time
     * out after awhile, but there seems no way to be sure how many loops may
     * be needed.  Note that if the other guy has pinned the buffer but not
     * yet done StartBufferIO, WaitIO will fall through and we'll effectively
     * be busy-looping here.)
     */
    if (BUF_STATE_GET_REFCOUNT(buf_state) != 0)
    {
        UnlockBufHdr(buf, buf_state);
        LWLockRelease(oldPartitionLock);
        /* safety check: should definitely not be our *own* pin */
        if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0)
            elog(ERROR, "buffer is pinned in InvalidateBuffer");
        WaitIO(buf);
        goto retry;
    }

IOW, currently we assume that you're only allowed to pin a block in a relation
while you hold a lock on the relation. It might be a good idea to change that,
but it's not as trivial as one might think - consider e.g. dropping a
relation's buffers while holding an exclusive lock: If there's potential
concurrent reads of that buffer we'd be in trouble.


> 3. While copying the relation whether to use the bufmgr or directly use the
> smgr?
> 
> If we use the bufmgr then maybe we can avoid flushing some of the buffers
> to the disk and save some I/O but in general we copy from the template
> database so there might not be a lot of dirty buffers and we might not save
> anything

I would assume the big benefit would be that the *target* database does not
have to be written out / shared buffer is immediately populated.

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Mon, Sep 6, 2021 at 1:58 AM Andres Freund <andres@anarazel.de> wrote:
On 2021-09-05 14:22:51 +0530, Dilip Kumar wrote: 
> But these directly operate on the buffers and In my patch, whether we are
> reading the pg_class for identifying the relfilenode or we are copying the
> relation block by block we are always holding the lock on the buffer.

I don't think a buffer lock is really sufficient. See e.g. code like:

I agree that the only buffer lock is not sufficient, but here we are talking about the case where we are already holding the exclusive lock on the database + the buffer lock.   So the cases like below which should be called only from the drop relation must be protected by the database exclusive lock and the other example like buffer reclaim/checkpointer should be protected by the buffer pin + lock.   Having said that, I am not against the point that we should not acquire the relation lock in our case.  I agree that if there is an assumption that for holding the buffer pin we must be holding the relation lock then better not to break that.


static void
InvalidateBuffer(BufferDesc *buf)
{
...
        /*
         * We assume the only reason for it to be pinned is that someone else is
         * flushing the page out.  Wait for them to finish.  (This could be an
         * infinite loop if the refcount is messed up... it would be nice to time
         * out after awhile, but there seems no way to be sure how many loops may
         * be needed.  Note that if the other guy has pinned the buffer but not
         * yet done StartBufferIO, WaitIO will fall through and we'll effectively
         * be busy-looping here.)
         */
        if (BUF_STATE_GET_REFCOUNT(buf_state) != 0)
        {
                UnlockBufHdr(buf, buf_state);
                LWLockRelease(oldPartitionLock);
                /* safety check: should definitely not be our *own* pin */
                if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0)
                        elog(ERROR, "buffer is pinned in InvalidateBuffer");
                WaitIO(buf);
                goto retry;
        }

IOW, currently we assume that you're only allowed to pin a block in a relation
while you hold a lock on the relation. It might be a good idea to change that,
but it's not as trivial as one might think - consider e.g. dropping a
relation's buffers while holding an exclusive lock: If there's potential
concurrent reads of that buffer we'd be in trouble.

> 3. While copying the relation whether to use the bufmgr or directly use the
> smgr?
>
> If we use the bufmgr then maybe we can avoid flushing some of the buffers
> to the disk and save some I/O but in general we copy from the template
> database so there might not be a lot of dirty buffers and we might not save
> anything

I would assume the big benefit would be that the *target* database does not
have to be written out / shared buffer is immediately populated.

Okay, that makes sense.  Infact for using the shared buffers for the destination database's relation we don't even have the locking issue, because that database is not yet accessible to anyone right?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Mon, Sep 6, 2021 at 11:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote

Okay, that makes sense.  Infact for using the shared buffers for the destination database's relation we don't even have the locking issue, because that database is not yet accessible to anyone right?

Based on all these discussions I am planning to change the design as below,

- FlushDatabaseBuffers().

- Scan the database directory under each tablespace and prepare a tablespace-wise relfilenode list, along with this we will remember the persistent level as well based on the presence of INITFORK.

- Next, copy each relfilenode to the destination, while copying for the source relation directly use the smgrread whereas for the destination relation use bufmgr.  The main reasons for not using the bufmgr for the source relations are a) We can avoid acquiring a special-purpose lock on the relation b) We are copying from the template database so in most of the cases there might not be many dirty buffers for that database so there is no real need for using the shared buffers.

Any objections to the above design?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Fri, Sep 3, 2021 at 5:54 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we already have such a code in multiple places where we bypass the
> > shared buffers for copying the relation
> > e.g. index_copy_data(), heapam_relation_copy_data().
>
> That's not at all comparable. We hold an exclusive lock on the relation at
> that point, and we don't have a separate implementation of reading tuples from
> the table or something like that.

I don't think there's a way to do this that is perfectly clean, so the
discussion here is really about finding the least unpleasant
alternative. I *really* like the idea of using pg_class to figure out
what relations to copy. As far as I'm concerned, pg_class is the
canonical list of what's in the database, and to the extent that the
filesystem happens to agree, that's good luck. From that perspective,
using the filesystem to figure out what to copy is by definition a
hack.

Now, having to use dedicated tuple-reading code is also a hack, but to
me that's largely an accident of questionable design decisions
elsewhere. You can't read a buffer with just the minimal amount of
information that you need to read a buffer; you have to have a
relcache entry, so we have things like ReadBufferWithoutRelcache and
CreateFakeRelcacheEntry. It's a little crazy to me that someone saw
that ReadBuffer() needed a thing which some callers might not have and
instead of saying "hmm, maybe we ought to change the arguments so that
anyone with enough information to call this function can do so," they
said "hmm, let's create a fake object that is not really the same as a
real one but good enough to fool the function into doing the right
thing, probably." I think the code layering here is just flat-out
broken and ought to be fixed. A layer whose job it is to read and
write blocks should not know that relations are even a thing. (The
widespread use of global variables in the relcache code, the catcache
code, and many other places in lieu of explicit parameter-passing just
makes everything a lot worse.)

So I think if we commit to the hackiness of the sort that this patch
introduces, there is some hope of things getting better in the future.
I don't think it's a real easy path forward, but maybe it's possible.
If on the other hand we commit to using the filesystem, I don't see
how it ever gets any better. Unlogged tables are a great example of a
feature that depended on the filesystem and it now seems to me to be -
by far - the worst thing about that feature. I have no idea how to get
rid of that dependency or all of the associated problems without
reverting the feature. But in this case, we seem to have another
option, and so I think we should take it.

Your (or other people's mileage) may vary ... this is just my view of it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Sep 8, 2021 at 9:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Sep 3, 2021 at 5:54 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we already have such a code in multiple places where we bypass the
> > shared buffers for copying the relation
> > e.g. index_copy_data(), heapam_relation_copy_data().
>
> That's not at all comparable. We hold an exclusive lock on the relation at
> that point, and we don't have a separate implementation of reading tuples from
> the table or something like that.

I don't think there's a way to do this that is perfectly clean, so the
discussion here is really about finding the least unpleasant
alternative. I *really* like the idea of using pg_class to figure out
what relations to copy. As far as I'm concerned, pg_class is the
canonical list of what's in the database, and to the extent that the
filesystem happens to agree, that's good luck. From that perspective,
using the filesystem to figure out what to copy is by definition a
hack.

I agree with you, even though I think that scanning pg_class for identifying the relfilenode looks like a more sensible thing to do than directly scanning the file system, we need to consider one point that, now also in current system  (in create database) we are scanning the directory for copying the file so instead of copying them directly we need to logically identify the relfilenode and then copy it block by block, so maybe this approach will not make anyone unhappy because it is not any worse than the current system.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Sat, Sep 11, 2021 at 12:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I agree with you, even though I think that scanning pg_class for identifying the relfilenode looks like a more
sensiblething to do than directly scanning the file system, we need to consider one point that, now also in current
system (in create database) we are scanning the directory for copying the file so instead of copying them directly we
needto logically identify the relfilenode and then copy it block by block, so maybe this approach will not make anyone
unhappybecause it is not any worse than the current system. 

So, I agree. If we can't get agreement on this approach, then we can
do that, and as you say, it's no worse than what we are doing now. But
I am just trying to lay out my view of why I think that's not as good
as this.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Sep 2, 2021 at 8:52 PM Robert Haas <robertmhaas@gmail.com> wrote:

PFA, updated version of the patch, where I have fixed the issues
reported by you and also done some more refactoring and patch split,
next I am planning to post the patch with another approach where we
scan the directory instead of scanning the pg_class for identifying
the relfilenodes.  For specific comments please find my response
inline,


> Andres pointed out that this approach ends up accessing relations
> without taking a lock on them. It doesn't look like you did anything
> about that.

Now I have acquired a lock before scanning the pg_class as well as
other relfilenode.

>
> + /* Built-in oids are mapped directly */
> + if (classForm->oid < FirstGenbkiObjectId)
> + relfilenode = classForm->oid;
> + else if (OidIsValid(classForm->relfilenode))
> + relfilenode = classForm->relfilenode;
> + else
> + continue;
>
> Am I missing something, or is this totally busted?

Handled the mapped relation using relmapper.

>   /*
> + * Now drop all buffers holding data of the target database; they should
> + * no longer be dirty so DropDatabaseBuffers is safe.
>
> The way things worked before, this was true, but now AFAICS it's
> false. I'm not sure whether that means that DropDatabaseBuffers() here
> is actually unsafe or whether it just means that you haven't updated
> the comment to explain the reason.

Now we can only drop the buffer specific to old tablespace not the new
tablespace so can not directly use the dboid, so extended the
DropDatabaseBuffers interface to take tablespace oid as and input and
updated the comments accordingly.

> + * Since we copy the file directly without looking at the shared buffers,
> + * we'd better first flush out any pages of the source relation that are
> + * in shared buffers.  We assume no new changes will be made while we are
> + * holding exclusive lock on the rel.
>
> Ditto.

I think these comments is related to index_copy_data() and this is
still valid, it is showing in the patch due to some refactoring so I
have separated out this refactoring patch as 0003 to avoid confusion.

>
> + /* As always, WAL must hit the disk before the data update does. */
>
> Actually, the way it's coded now, part of the on-disk changes are done
> before WAL is issued, and part are done after. I doubt that's the
> right idea. There's nothing special about writing the actual payload
> bytes vs. the other on-disk changes (creating directories and files).
> In any case the ordering deserves a better-considered comment than
> this one.

Changed, now WAL first and then disk change.


Open question:
- Scan pg_class vs scan directories
- Whether to retain the old created database mechanism as option or not.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Mon, Sep 27, 2021 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> Open question:
> - Scan pg_class vs scan directories
> - Whether to retain the old created database mechanism as option or not.

I have done some code improvement in 0001 and 0002.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Mon, Oct 4, 2021 at 2:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have implemented the patch with approach2 as well, i.e. instead of
scanning the pg-class, we scan the directory.

IMHO, we have already discussed most of the advantages and
disadvantages of both approaches so I don't want to mention those
again. But I have noticed one more issue with the approach2,
basically, if we scan the directory then we don't have any way to
identify the relation-OID and that is required in order to acquire the
relation lock before copying it, right?

Patch details:
0001 to 0006 implements an approach1
0007 removes the code of pg_class scanning and adds the directory scan.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
John Naylor
Date:
Hi,

I've looked over this patch set and email thread a couple times, and I don't see anything amiss, but I'm also not terribly familiar with the subsystems this part of the code relies on. I haven't yet tried to stress test with a large database, but it seems like a good idea to do so.

I have a couple comments and questions:

0006:

+ * XXX We can optimize RelationMapOidToFileenodeForDatabase API
+ * so that instead of reading the relmap file every time, it can
+ * save it in a temporary variable and use it for subsequent
+ * calls.  Then later reset it once we're done or at the
+ * transaction end.

Do we really need to consider optimizing here? Only a handful of relations will be found in the relmap, right?

+ * Once we start copying files from the source database, we need to be able
+ * to clean 'em up if we fail.  Use an ENSURE block to make sure this
+ * happens.  (This is not a 100% solution, because of the possibility of
+ * failure during transaction commit after we leave this routine, but it
+ * should handle most scenarios.)

This comment in master started with

- * Once we start copying subdirectories, we need to be able to clean 'em

Is the distinction important enough to change this comment? Also, is "most scenarios" still true with the patch? I haven't read into how ENSURE works.

Same with this comment change, seems fine the way it was:

- * Use an ENSURE block to make sure we remove the debris if the copy fails
- * (eg, due to out-of-disk-space).  This is not a 100% solution, because
- * of the possibility of failure during transaction commit, but it should
- * handle most scenarios.
+ * Use an ENSURE block to make sure we remove the debris if the copy fails.
+ * This is not a 100% solution, because of the possibility of failure
+ * during transaction commit, but it should handle most scenarios.

And do we need additional tests? Maybe we don't, but it seems good to make sure.

I haven't looked at 0007, and I have no opinion on which approach is better.

--

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Tue, Nov 23, 2021 at 10:29 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
>
> Hi,
>
> I've looked over this patch set and email thread a couple times, and I don't see anything amiss, but I'm also not
terriblyfamiliar with the subsystems this part of the code relies on. I haven't yet tried to stress test with a large
database,but it seems like a good idea to do so. 

Thanks, John for looking into the patches.  Yeah, that makes sense,
next week I will try to test with a large database and maybe with
multiple tablespaces as well to see how this behaves.

> I have a couple comments and questions:
>
> 0006:
>
> + * XXX We can optimize RelationMapOidToFileenodeForDatabase API
> + * so that instead of reading the relmap file every time, it can
> + * save it in a temporary variable and use it for subsequent
> + * calls.  Then later reset it once we're done or at the
> + * transaction end.
>
> Do we really need to consider optimizing here? Only a handful of relations will be found in the relmap, right?

You are right, it is actually not required I will remove this comment.

>
> + * Once we start copying files from the source database, we need to be able
> + * to clean 'em up if we fail.  Use an ENSURE block to make sure this
> + * happens.  (This is not a 100% solution, because of the possibility of
> + * failure during transaction commit after we leave this routine, but it
> + * should handle most scenarios.)
>
> This comment in master started with
>
> - * Once we start copying subdirectories, we need to be able to clean 'em
>
> Is the distinction important enough to change this comment? Also, is "most scenarios" still true with the patch? I
haven'tread into how ENSURE works. 

Actually, it is like PG_TRY(), CATCH() block with extra assurance to
cleanup on shm_exit as well.  And in the cleanup function, we go
through all the tablespaces and remove the new DB-related directory
which we are trying to create.  And you are right, we actually don't
need to change the comments.

> Same with this comment change, seems fine the way it was:

Correct.

> - * Use an ENSURE block to make sure we remove the debris if the copy fails
> - * (eg, due to out-of-disk-space).  This is not a 100% solution, because
> - * of the possibility of failure during transaction commit, but it should
> - * handle most scenarios.
> + * Use an ENSURE block to make sure we remove the debris if the copy fails.
> + * This is not a 100% solution, because of the possibility of failure
> + * during transaction commit, but it should handle most scenarios.
>
> And do we need additional tests? Maybe we don't, but it seems good to make sure.
>
> I haven't looked at 0007, and I have no opinion on which approach is better.

Okay, I like approach 6 because of mainly two reasons, 1) it is not
directly scanning the raw file to identify which files to copy so
seems cleaner to me 2) with 0007 if we directly scan directory we
don't know the relation oid, so before acquiring the buffer lock there
is no way to acquire the relation lock.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Greg Nancarrow
Date:
On Tue, Oct 5, 2021 at 7:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Patch details:
> 0001 to 0006 implements an approach1
> 0007 removes the code of pg_class scanning and adds the directory scan.
>

I had a scan through the patches, though have not yet actually run any
tests to try to better gauge their benefit.
I do have some initial review comments though:

0003

src/backend/commands/tablecmds.c
(1) RelationCopyAllFork()
In the following comment:

+/*
+ * Copy source smgr all fork's data to the destination smgr.
+ */

Shouldn't it say "smgr relation"?
Also, you could additionally say ", using a specified fork data
copying function." or something like that, to account for the
additional argument.


0006

src/backend/commands/dbcommands.c
(1) function prototype location

The following prototype is currently located in the "non-export
function prototypes" section of the source file, but it's not static -
shouldn't it be in dbcommands.h?

+void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+        ForkNumber forkNum, char relpersistence);

(2) CreateDirAndVersionFile()
Shouldn't the following code:

+ fd = OpenTransientFile(versionfile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+ if (fd < 0 && errno == EEXIST && isRedo)
+   fd = OpenTransientFile(versionfile, O_RDWR | PG_BINARY);

actually be:

+ fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+ if (fd < 0 && errno == EEXIST && isRedo)
+   fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);

since we're only writing to that file descriptor and we want to
truncate the file if it already exists.

The current comment says "... open it in the write mode.", but should
say "... open it in write mode."

Also, shouldn't you be writing a newline (\n) after the
PG_MAJORVERSION ? (compare with code in initdb.c)

(3) GetDatabaseRelationList()
Shouldn't:

+  if (PageIsNew(page) || PageIsEmpty(page))
+    continue;

be:

+  if (PageIsNew(page) || PageIsEmpty(page))
+  {
+    UnlockReleaseBuffer(buf);
+    continue;
+  }

?

Also, in the following code:

+  if (rnodelist == NULL)
+    rnodelist = list_make1(relinfo);
+  else
+    rnodelist = lappend(rnodelist, relinfo);

it should really be "== NIL" rather than "== NULL".
But in any case, that code can just be:

   rnodelist = lappend(rnodelist, relinfo);

because lappend() will create a list if the first arg is NIL.

(4) RelationCopyStorageUsingBuffer()

In the function comments, IMO it is better to use "APIs" instead of "apis".

Also, better to use "get" instead of "got" in the following comment:

+  /* If we got a cancel signal during the copy of the data, quit */


0007

(I think I prefer the first approach rather than this 2nd approach)

src/backend/commands/dbcommands.c
(1) createdb()
pfree(srcpath) seems to be missing, in the case that CopyDatabase() gets called.

(2) GetRelfileNodeFromFileName()
%s in sscanf() allows an unbounded read and is considered potentially
dangerous (allows buffer overflow), especially here where
FORKNAMECHARS is so small.

+  nmatch = sscanf(filename, "%u_%s", &relfilenode, forkname);

how about using the following instead in this case:

+  nmatch = sscanf(filename, "%u_%4s", &relfilenode, forkname);

?

(even if there were > 4 chars after the underscore, it would still
match and InvalidOid would be returned because nmatch==2)

Regards,
Greg Nancarrow
Fujitsu Australia



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Nov 25, 2021 at 1:07 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Tue, Oct 5, 2021 at 7:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Patch details:
> > 0001 to 0006 implements an approach1
> > 0007 removes the code of pg_class scanning and adds the directory scan.
> >
>
> I had a scan through the patches, though have not yet actually run any
> tests to try to better gauge their benefit.
> I do have some initial review comments though:
>
> 0003
>
> src/backend/commands/tablecmds.c
> (1) RelationCopyAllFork()
> In the following comment:
>
> +/*
> + * Copy source smgr all fork's data to the destination smgr.
> + */
>
> Shouldn't it say "smgr relation"?
> Also, you could additionally say ", using a specified fork data
> copying function." or something like that, to account for the
> additional argument.
>
>
> 0006
>
> src/backend/commands/dbcommands.c
> (1) function prototype location
>
> The following prototype is currently located in the "non-export
> function prototypes" section of the source file, but it's not static -
> shouldn't it be in dbcommands.h?
>
> +void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
> +        ForkNumber forkNum, char relpersistence);
>
> (2) CreateDirAndVersionFile()
> Shouldn't the following code:
>
> + fd = OpenTransientFile(versionfile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
> + if (fd < 0 && errno == EEXIST && isRedo)
> +   fd = OpenTransientFile(versionfile, O_RDWR | PG_BINARY);
>
> actually be:
>
> + fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
> + if (fd < 0 && errno == EEXIST && isRedo)
> +   fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
>
> since we're only writing to that file descriptor and we want to
> truncate the file if it already exists.
>
> The current comment says "... open it in the write mode.", but should
> say "... open it in write mode."
>
> Also, shouldn't you be writing a newline (\n) after the
> PG_MAJORVERSION ? (compare with code in initdb.c)
>
> (3) GetDatabaseRelationList()
> Shouldn't:
>
> +  if (PageIsNew(page) || PageIsEmpty(page))
> +    continue;
>
> be:
>
> +  if (PageIsNew(page) || PageIsEmpty(page))
> +  {
> +    UnlockReleaseBuffer(buf);
> +    continue;
> +  }
>
> ?
>
> Also, in the following code:
>
> +  if (rnodelist == NULL)
> +    rnodelist = list_make1(relinfo);
> +  else
> +    rnodelist = lappend(rnodelist, relinfo);
>
> it should really be "== NIL" rather than "== NULL".
> But in any case, that code can just be:
>
>    rnodelist = lappend(rnodelist, relinfo);
>
> because lappend() will create a list if the first arg is NIL.
>
> (4) RelationCopyStorageUsingBuffer()
>
> In the function comments, IMO it is better to use "APIs" instead of "apis".
>
> Also, better to use "get" instead of "got" in the following comment:
>
> +  /* If we got a cancel signal during the copy of the data, quit */

Thanks for the review and many valuable comments, I have fixed all of
them except this comment (/* If we got a cancel signal during the copy
of the data, quit */) because this looks fine to me.  0007, I have
dropped from the patchset for now.  I have also included fixes for
comments given by John.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Greg Nancarrow
Date:
On Thu, Nov 25, 2021 at 10:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Thanks for the review and many valuable comments, I have fixed all of
> them except this comment (/* If we got a cancel signal during the copy
> of the data, quit */) because this looks fine to me.  0007, I have
> dropped from the patchset for now.  I have also included fixes for
> comments given by John.
>

Any progress/results yet on testing against a large database (as
suggested by John Naylor) and multiple tablespaces?

Thanks for the patch updates.
I have some additional minor comments:

0002

(1) Tidy patch comment

I suggest minor tidying of the patch comment, as follows:

Support new interfaces in relmapper, 1) Support copying the
relmap file from one database path to another database path.
2) Like RelationMapOidToFilenode, provide another interface
which does the same but, instead of getting it for the database
we are connected to, it will get it for the input database
path.

These interfaces are required for the next patch, for supporting
the WAL-logged created database.


0003

src/include/commands/tablecmds.h
(1) typedef void (*copy_relation_storage) ...

The new typename "copy_relation_storage" needs to be added to
src/tools/pgindent/typedefs.list


0006

src/backend/commands/dbcommands.c
(1) CreateDirAndVersionFile

After writing to the file, you should probably pfree(buf.data), right?
Actually, I don't think StringInfo (dynamic string allocation) is
needed here, since the version string is so short, so why not just use
a local "char buf[16]" buffer and snprintf() the
PG_MAJORVERSION+newline into that?

Also (as mentioned in my first review) shouldn't the "O_TRUNC" flag be
additionally specified in the case when OpenTransientFile() is tried
for a 2nd time because of errno==EEXIST on the 1st attempt? (otherwise
if the existing file did contain something you'd end up writing after
the existing data in the file).


src/backend/commands/dbcommands.c
(2) typedef struct CreateDBRelInfo ... CreateDBRelInfo

The new typename "CreateDBRelInfo" needs to be added to
src/tools/pgindent/typedefs.list

src/bin/pg_rewind/parsexlog.c
(3) Include additional header file

It seems that the following additional header file is not needed to
compile the source file:

+#include "utils/relmapper.h"


Regards,
Greg Nancarrow
Fujitsu Australia



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Greg Nancarrow
Date:
On Thu, Nov 25, 2021 at 10:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Thanks for the review and many valuable comments, I have fixed all of
> them except this comment (/* If we got a cancel signal during the copy
> of the data, quit */) because this looks fine to me.  0007, I have
> dropped from the patchset for now.  I have also included fixes for
> comments given by John.
>

I found the following issue with the patches applied:

A server crash occurs after the following sequence of commands:

create tablespace tbsp1 location '<directory>/tbsp1';
create tablespace tbsp2 location '<directory>/tbsp2';
create database test1 tablespace tbsp1;
create database test2 template test1 tablespace tbsp2;
alter database test2 set tablespace tbsp1;
checkpoint;

The following type of message is seen in the server log:

2021-12-01 16:48:26.623 AEDT [67423] PANIC:  could not fsync file
"pg_tblspc/16385/PG_15_202111301/16387/3394": No such file or
directory
2021-12-01 16:48:27.228 AEDT [67422] LOG:  checkpointer process (PID
67423) was terminated by signal 6: Aborted
2021-12-01 16:48:27.228 AEDT [67422] LOG:  terminating any other
active server processes
2021-12-01 16:48:27.233 AEDT [67422] LOG:  all server processes
terminated; reinitializing

Also (prior to running the checkpoint command above) I've seen errors
like the following when running pg_dumpall:

pg_dump: error: connection to server on socket "/tmp/.s.PGSQL.5432"
failed: PANIC:  could not open critical system index 2662
pg_dumpall: error: pg_dump failed on database "test2", exiting

Hopefully the above example will help in tracking down the cause.


Regards,
Greg Nancarrow
Fujitsu Australia



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Dec 1, 2021 at 12:07 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Thu, Nov 25, 2021 at 10:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Thanks for the review and many valuable comments, I have fixed all of
> > them except this comment (/* If we got a cancel signal during the copy
> > of the data, quit */) because this looks fine to me.  0007, I have
> > dropped from the patchset for now.  I have also included fixes for
> > comments given by John.
> >
>
> I found the following issue with the patches applied:
>
> A server crash occurs after the following sequence of commands:
>
> create tablespace tbsp1 location '<directory>/tbsp1';
> create tablespace tbsp2 location '<directory>/tbsp2';
> create database test1 tablespace tbsp1;
> create database test2 template test1 tablespace tbsp2;
> alter database test2 set tablespace tbsp1;
> checkpoint;
>
> The following type of message is seen in the server log:
>
> 2021-12-01 16:48:26.623 AEDT [67423] PANIC:  could not fsync file
> "pg_tblspc/16385/PG_15_202111301/16387/3394": No such file or
> directory

Thanks a lot for testing this. From the error, it seems like some of
the old buffer w.r.t. the previous tablespace is not dropped after the
movedb.  Actually, we are calling DropDatabaseBuffers() after copying
to a new tablespace and dropping all the buffers of this database
w.r.t the old tablespace.  But seems something is missing, I will
reproduce this and try to fix it by tomorrow.  I will also fix the
other review comments raised by you in the previous mail.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Dec 1, 2021 at 6:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> Thanks a lot for testing this. From the error, it seems like some of
> the old buffer w.r.t. the previous tablespace is not dropped after the
> movedb.  Actually, we are calling DropDatabaseBuffers() after copying
> to a new tablespace and dropping all the buffers of this database
> w.r.t the old tablespace.  But seems something is missing, I will
> reproduce this and try to fix it by tomorrow.  I will also fix the
> other review comments raised by you in the previous mail.

Okay, I got the issue, basically we are dropping the database buffers
but not unregistering the existing sync request for database buffers
w.r.t old tablespace. Attached patch fixes that.  I also had to extend
ForgetDatabaseSyncRequests so that we can delete the sync request of
the database for the particular tablespace so added another patch for
the same (0006).

I will test the performance scenario next week, which is suggested by John.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
I see that this patch is reducing the database creation time by almost 3-4 times provided that the template database has some user data in it. However, there are couple of points to be noted:

1) It makes the crash recovery a bit slower than before if the crash has occurred after the execution of a create database statement. Moreover, if the template database size is big, it might even generate a lot of WAL files which the user needs to be aware of.

2) This will put a lot of load on the first checkpoint that will occur after creating the database statement. I will experiment around this to see if this has any side effects.

--

Further, the code changes in the patch looks good. I just have few comments:

+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+   LOCKTAG     tag;
+   LOCALLOCK  *locallock;
+   LockAcquireResult res;
+
+   SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);

Should there be an assertion statement here to ensure that relid->dbid and  relid->relid is valid?

--

    if (info == XLOG_DBASE_CREATE)
    {
        xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
-       char       *src_path;
-       char       *dst_path;
-       struct stat st;
-
-       src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
-       dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+       char       *dbpath;

-       /*
-        * Our theory for replaying a CREATE is to forcibly drop the target
-        * subdirectory if present, then re-copy the source data. This may be
-        * more work than needed, but it is simple to implement.
-        */
-       if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
-       {
-           if (!rmtree(dst_path, true))
-               /* If this failed, copydir() below is going to error. */
-               ereport(WARNING,
-                       (errmsg("some useless files may be left behind in old database directory \"%s\"",
-                               dst_path)));
-       }

I think this is a significant change and probably needs some kind of explanation/comments as-in why we are just creating a dir and copying the version file when replaying create database operation. Earlier, this meant replaying the complete create database operation, that doesn't seem to be the case now.

--

Have you intentionally skipped pg_internal.init file from being copied to the target database?

--
With Regards,
Ashutosh Sharma.


On Thu, Dec 2, 2021 at 7:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Dec 1, 2021 at 6:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> Thanks a lot for testing this. From the error, it seems like some of
> the old buffer w.r.t. the previous tablespace is not dropped after the
> movedb.  Actually, we are calling DropDatabaseBuffers() after copying
> to a new tablespace and dropping all the buffers of this database
> w.r.t the old tablespace.  But seems something is missing, I will
> reproduce this and try to fix it by tomorrow.  I will also fix the
> other review comments raised by you in the previous mail.

Okay, I got the issue, basically we are dropping the database buffers
but not unregistering the existing sync request for database buffers
w.r.t old tablespace. Attached patch fixes that.  I also had to extend
ForgetDatabaseSyncRequests so that we can delete the sync request of
the database for the particular tablespace so added another patch for
the same (0006).

I will test the performance scenario next week, which is suggested by John.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Dec 3, 2021 at 7:38 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> I see that this patch is reducing the database creation time by almost 3-4 times provided that the template database
hassome user data in it. However, there are couple of points to be noted: 

Thanks a lot for looking into the patches.
>
> 1) It makes the crash recovery a bit slower than before if the crash has occurred after the execution of a create
databasestatement. Moreover, if the template database size is big, it might even generate a lot of WAL files which the
userneeds to be aware of. 

Yes it will but actually that is the only correct way to do it, in
current we are just logging the WAL as copying the source directory to
destination directory without really noting down exactly what we
wanted to copy, so we are force to do the checkpoint right after
create database because in crash recovery we can not actually replay
that WAL.  Because WAL just say copy the source to destination so it
is very much possible that at the DO time source directory had some
different content than the REDO time so this would have created the
inconsistencies in the crash recovery so to avoid this bug they force
the checkpoint so now also if you do force checkpoint then again crash
recovery will be equally fast.  So I would not say that we have made
crash recovery slow but we have removed some bugs and with that now we
don't need to force the checkpoint.  Also note that in current code
even with force checkpoint the bug is not completely avoided in all
the cases, check below comments from the code[1].

> 2) This will put a lot of load on the first checkpoint that will occur after creating the database statement. I will
experimentaround this to see if this has any side effects. 

But now a checkpoint can happen at its own need and there is no need
to force a checkpoint like it was before patch.

So the major goal of this patch is 1) Correctly WAL log the create
database which is hack in the current system,  2) Avoid force
checkpoints, 3) We copy page by page so it will support TDE because if
the source and destination database has different encryption then we
can reencrypt the page before copying to destination database, which
is not possible in current system as we are copying directory  4) Now
the new database pages will get the latest LSN which is the correct
things earlier new database pages were getting copied directly with
old LSN only.


> Further, the code changes in the patch looks good. I just have few comments:

I will look into the other comments and get back to you, thanks.

[1]
* In PITR replay, the first of these isn't an issue, and the second
* is only a risk if the CREATE DATABASE and subsequent template
* database change both occur while a base backup is being taken.
* There doesn't seem to be much we can do about that except document
* it as a limitation.
*
* Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
* we can avoid this.
*/
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
On Fri, Dec 3, 2021 at 8:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Dec 3, 2021 at 7:38 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> I see that this patch is reducing the database creation time by almost 3-4 times provided that the template database has some user data in it. However, there are couple of points to be noted:

Thanks a lot for looking into the patches.
>
> 1) It makes the crash recovery a bit slower than before if the crash has occurred after the execution of a create database statement. Moreover, if the template database size is big, it might even generate a lot of WAL files which the user needs to be aware of.

Yes it will but actually that is the only correct way to do it, in
current we are just logging the WAL as copying the source directory to
destination directory without really noting down exactly what we
wanted to copy, so we are force to do the checkpoint right after
create database because in crash recovery we can not actually replay
that WAL.  Because WAL just say copy the source to destination so it
is very much possible that at the DO time source directory had some
different content than the REDO time so this would have created the
inconsistencies in the crash recovery so to avoid this bug they force
the checkpoint so now also if you do force checkpoint then again crash
recovery will be equally fast.  So I would not say that we have made
crash recovery slow but we have removed some bugs and with that now we
don't need to force the checkpoint.  Also note that in current code
even with force checkpoint the bug is not completely avoided in all
the cases, check below comments from the code[1].

> 2) This will put a lot of load on the first checkpoint that will occur after creating the database statement. I will experiment around this to see if this has any side effects.

But now a checkpoint can happen at its own need and there is no need
to force a checkpoint like it was before patch.

So the major goal of this patch is 1) Correctly WAL log the create
database which is hack in the current system,  2) Avoid force
checkpoints, 3) We copy page by page so it will support TDE because if
the source and destination database has different encryption then we
can reencrypt the page before copying to destination database, which
is not possible in current system as we are copying directory  4) Now
the new database pages will get the latest LSN which is the correct
things earlier new database pages were getting copied directly with
old LSN only.

OK. Understood, thanks.!

--
With Regards,
Ashutosh Sharma. 

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
Here are few more review comments:

1) It seems that we are not freeing the memory allocated for buf.data in CreateDirAndVersionFile().

--

+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{

2) Do we need to pass dbpath here? I mean why not reconstruct it from dbid and tsid.

--

3) Not sure if this point has already been discussed, Will we be able to recover the data when wal_level is set to minimal because the following condition would be false with this wal level.

+   use_wal = XLogIsNeeded() &&
+       (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);

--
With Regards,
Ashutosh Sharma.

On Mon, Dec 6, 2021 at 9:12 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
On Fri, Dec 3, 2021 at 8:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Dec 3, 2021 at 7:38 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> I see that this patch is reducing the database creation time by almost 3-4 times provided that the template database has some user data in it. However, there are couple of points to be noted:

Thanks a lot for looking into the patches.
>
> 1) It makes the crash recovery a bit slower than before if the crash has occurred after the execution of a create database statement. Moreover, if the template database size is big, it might even generate a lot of WAL files which the user needs to be aware of.

Yes it will but actually that is the only correct way to do it, in
current we are just logging the WAL as copying the source directory to
destination directory without really noting down exactly what we
wanted to copy, so we are force to do the checkpoint right after
create database because in crash recovery we can not actually replay
that WAL.  Because WAL just say copy the source to destination so it
is very much possible that at the DO time source directory had some
different content than the REDO time so this would have created the
inconsistencies in the crash recovery so to avoid this bug they force
the checkpoint so now also if you do force checkpoint then again crash
recovery will be equally fast.  So I would not say that we have made
crash recovery slow but we have removed some bugs and with that now we
don't need to force the checkpoint.  Also note that in current code
even with force checkpoint the bug is not completely avoided in all
the cases, check below comments from the code[1].

> 2) This will put a lot of load on the first checkpoint that will occur after creating the database statement. I will experiment around this to see if this has any side effects.

But now a checkpoint can happen at its own need and there is no need
to force a checkpoint like it was before patch.

So the major goal of this patch is 1) Correctly WAL log the create
database which is hack in the current system,  2) Avoid force
checkpoints, 3) We copy page by page so it will support TDE because if
the source and destination database has different encryption then we
can reencrypt the page before copying to destination database, which
is not possible in current system as we are copying directory  4) Now
the new database pages will get the latest LSN which is the correct
things earlier new database pages were getting copied directly with
old LSN only.

OK. Understood, thanks.!

--
With Regards,
Ashutosh Sharma. 

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Mon, Dec 6, 2021 at 9:17 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Here are few more review comments:

Thanks for reviewing it.

> 1) It seems that we are not freeing the memory allocated for buf.data in CreateDirAndVersionFile().

Yeah this was a problem in v6 but I have fixed in v7, can you check that.
>
> + */
> +static void
> +CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
> +{
>
> 2) Do we need to pass dbpath here? I mean why not reconstruct it from dbid and tsid.

Yeah we can do that but I thought computing dbpath has some cost and
since the caller already has it why not to pass it.

>
> 3) Not sure if this point has already been discussed, Will we be able to recover the data when wal_level is set to
minimalbecause the following condition would be false with this wal level.
 
>
> +   use_wal = XLogIsNeeded() &&
> +       (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
>

Since we are creating new relfilenode this is fine, refer "Skipping
WAL for New RelFileNode" in src/backend/access/transam/README

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
Thank you, Dilip for the quick response. I am okay with the changes done in the v7 patch.

One last point - If we try to clone a huge database, as expected CREATE DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints which seems to be affecting the performance slightly.

--
With Regards,
Ashutosh Sharma.

On Mon, Dec 6, 2021 at 9:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Dec 6, 2021 at 9:17 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Here are few more review comments:

Thanks for reviewing it.

> 1) It seems that we are not freeing the memory allocated for buf.data in CreateDirAndVersionFile().

Yeah this was a problem in v6 but I have fixed in v7, can you check that.
>
> + */
> +static void
> +CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
> +{
>
> 2) Do we need to pass dbpath here? I mean why not reconstruct it from dbid and tsid.

Yeah we can do that but I thought computing dbpath has some cost and
since the caller already has it why not to pass it.

>
> 3) Not sure if this point has already been discussed, Will we be able to recover the data when wal_level is set to minimal because the following condition would be false with this wal level.
>
> +   use_wal = XLogIsNeeded() &&
> +       (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
>

Since we are creating new relfilenode this is fine, refer "Skipping
WAL for New RelFileNode" in src/backend/access/transam/README

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Mon, Dec 6, 2021 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> One last point - If we try to clone a huge database, as expected CREATE DATABASE emits a lot of WALs, causing a lot
ofintermediate checkpoints which seems to be affecting the performance slightly.
 

Yes, I think this needs to be characterized better. If you have a big
shared buffers setting and a lot of those buffers are dirty and the
template database is small, all of which is fairly normal, then this
new approach should be much quicker. On the other hand, what if the
situation is reversed? Perhaps you have a small shared buffers and not
much of it is dirty and the template database is gigantic. Then maybe
this new approach will be slower. But right now I think we don't know
where the crossover point is, and I think we should try to figure that
out.

So for example, imagine tests with 1GB of shard_buffers, 8GB, and
64GB. And template databases with sizes of whatever the default is,
1GB, 10GB, 100GB. Repeatedly make 75% of the pages dirty and then
create a new database from one of the templates. And then just measure
the performance. Maybe for large databases this approach is just
really the pits -- and if your max_wal_size is too small, it
definitely will be. But, I don't know, maybe with reasonable settings
it's not that bad. Writing everything to disk twice - once to WAL and
once to the target directory - has to be more expensive than doing it
once. But on the other hand, it's all sequential I/O and the data
pages don't need to be fsync'd, so perhaps the overhead is relatively
mild. I don't know.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
Thanks Robert for sharing your thoughts.

On Mon, Dec 6, 2021 at 11:16 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 6, 2021 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> One last point - If we try to clone a huge database, as expected CREATE DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints which seems to be affecting the performance slightly.

Yes, I think this needs to be characterized better. If you have a big
shared buffers setting and a lot of those buffers are dirty and the
template database is small, all of which is fairly normal, then this
new approach should be much quicker. On the other hand, what if the
situation is reversed? Perhaps you have a small shared buffers and not
much of it is dirty and the template database is gigantic. Then maybe
this new approach will be slower. But right now I think we don't know
where the crossover point is, and I think we should try to figure that
out.

Yes I think so too.
 

So for example, imagine tests with 1GB of shard_buffers, 8GB, and
64GB. And template databases with sizes of whatever the default is,
1GB, 10GB, 100GB. Repeatedly make 75% of the pages dirty and then
create a new database from one of the templates. And then just measure
the performance. Maybe for large databases this approach is just
really the pits -- and if your max_wal_size is too small, it
definitely will be. But, I don't know, maybe with reasonable settings
it's not that bad. Writing everything to disk twice - once to WAL and
once to the target directory - has to be more expensive than doing it
once. But on the other hand, it's all sequential I/O and the data
pages don't need to be fsync'd, so perhaps the overhead is relatively
mild. I don't know.

So far, I haven't found much performance overhead with a few gb of data in the template database. It's just a bit with the default settings, perhaps setting a higher value of max_wal_size would reduce this overhead.

--
With Regards,
Ashutosh Sharma. 

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Mon, Dec 6, 2021 at 7:53 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Thank you, Dilip for the quick response. I am okay with the changes done in the v7 patch.
>
> One last point - If we try to clone a huge database, as expected CREATE DATABASE emits a lot of WALs, causing a lot
ofintermediate checkpoints which seems to be affecting the performance slightly.
 

Yeah, that is a valid point because instead of just one WAL for
createdb we will generate WAL for each page in the database, so I
agree that if the max_wal_size is not enough for those WALs then we
might have to pay the cost of multiple checkpoints.  However, if we
compare it with the current mechanism then now it is a forced
checkpoint and there is no way to avoid it whereas with the new
approach user can set enough max_wal_size and they can avoid it.  So
in other words now the checkpoint is driven by the amount of resource
which is true for any other operation e.g. ALTER TABLE SET TABLESPACE
so now it is in more sync with the rest of the system, but without the
patch, it was a special purpose forced checkpoint only for the
createdb.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Neha Sharma
Date:
Hello Dilip,

While testing the v7 patches, I am observing a crash with the below test case.

Test case:
create tablespace tab location '<dir_path>/test_dir';
create tablespace tab1 location '<dir_path>/test_dir1';
create database test tablespace tab;
\c test
create table t( a int PRIMARY KEY,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
insert into t values (generate_series(1,2000000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
alter database test set tablespace pg_default;
alter database test set tablespace tab;
\c test1
alter table t set tablespace tab;

 Logfile says:
2021-12-08 23:31:58.855 +04 [134252] PANIC:  could not fsync file "base/16386/4152": No such file or directory
2021-12-08 23:31:59.398 +04 [134251] LOG:  checkpointer process (PID 134252) was terminated by signal 6: Aborted


Thanks.
--
Regards,
Neha Sharma


On Tue, Dec 7, 2021 at 12:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Dec 6, 2021 at 7:53 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Thank you, Dilip for the quick response. I am okay with the changes done in the v7 patch.
>
> One last point - If we try to clone a huge database, as expected CREATE DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints which seems to be affecting the performance slightly.

Yeah, that is a valid point because instead of just one WAL for
createdb we will generate WAL for each page in the database, so I
agree that if the max_wal_size is not enough for those WALs then we
might have to pay the cost of multiple checkpoints.  However, if we
compare it with the current mechanism then now it is a forced
checkpoint and there is no way to avoid it whereas with the new
approach user can set enough max_wal_size and they can avoid it.  So
in other words now the checkpoint is driven by the amount of resource
which is true for any other operation e.g. ALTER TABLE SET TABLESPACE
so now it is in more sync with the rest of the system, but without the
patch, it was a special purpose forced checkpoint only for the
createdb.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Greg Nancarrow
Date:
On Thu, Dec 9, 2021 at 6:57 AM Neha Sharma <neha.sharma@enterprisedb.com> wrote:
>
> While testing the v7 patches, I am observing a crash with the below test case.
>
> Test case:
> create tablespace tab location '<dir_path>/test_dir';
> create tablespace tab1 location '<dir_path>/test_dir1';
> create database test tablespace tab;
> \c test
> create table t( a int PRIMARY KEY,b text);
> CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from
generate_series(1,256) g';
 
> insert into t values (generate_series(1,2000000), large_val());
> alter table t set tablespace tab1 ;
> \c postgres
> create database test1 template test;
> alter database test set tablespace pg_default;
> alter database test set tablespace tab;
> \c test1
> alter table t set tablespace tab;
>
>  Logfile says:
> 2021-12-08 23:31:58.855 +04 [134252] PANIC:  could not fsync file "base/16386/4152": No such file or directory
> 2021-12-08 23:31:59.398 +04 [134251] LOG:  checkpointer process (PID 134252) was terminated by signal 6: Aborted
>

I tried to reproduce the issue using your test scenario, but I needed
to reduce the amount of inserted data (so reduced 2000000 to 20000)
due to disk space.
I then consistently get an error like the following:

postgres=# alter database test set tablespace pg_default;
ERROR:  could not create file
"pg_tblspc/16385/PG_15_202111301/16386/36395": File exists

(this only happens when the patch is used)


Regards,
Greg Nancarrow
Fujitsu Australia



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Neha Sharma
Date:



On Thu, Dec 9, 2021 at 4:26 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
On Thu, Dec 9, 2021 at 6:57 AM Neha Sharma <neha.sharma@enterprisedb.com> wrote:
>
> While testing the v7 patches, I am observing a crash with the below test case.
>
> Test case:
> create tablespace tab location '<dir_path>/test_dir';
> create tablespace tab1 location '<dir_path>/test_dir1';
> create database test tablespace tab;
> \c test
> create table t( a int PRIMARY KEY,b text);
> CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
> insert into t values (generate_series(1,2000000), large_val());
> alter table t set tablespace tab1 ;
> \c postgres
> create database test1 template test;
> alter database test set tablespace pg_default;
> alter database test set tablespace tab;
> \c test1
> alter table t set tablespace tab;
>
>  Logfile says:
> 2021-12-08 23:31:58.855 +04 [134252] PANIC:  could not fsync file "base/16386/4152": No such file or directory
> 2021-12-08 23:31:59.398 +04 [134251] LOG:  checkpointer process (PID 134252) was terminated by signal 6: Aborted
>

I tried to reproduce the issue using your test scenario, but I needed
to reduce the amount of inserted data (so reduced 2000000 to 20000)
due to disk space.
I then consistently get an error like the following:

postgres=# alter database test set tablespace pg_default;
ERROR:  could not create file
"pg_tblspc/16385/PG_15_202111301/16386/36395": File exists

(this only happens when the patch is used)

 
Yes, I was also getting this, and moving further we get a crash when we alter the table of database test1.
Below is the output of the test at my end.

postgres=# create tablespace tab1 location '/home/edb/PGsources/postgresql/inst/bin/rep_test1';
CREATE TABLESPACE
postgres=# create tablespace tab location '/home/edb/PGsources/postgresql/inst/bin/rep_test';
CREATE TABLESPACE
postgres=# create database test tablespace tab;
CREATE DATABASE
postgres=# \c test
You are now connected to database "test" as user "edb".
test=# create table t( a int PRIMARY KEY,b text);
CREATE TABLE
test=# CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
CREATE FUNCTION
test=# insert into t values (generate_series(1,2000000), large_val());
INSERT 0 2000000
test=# alter table t set tablespace tab1 ;
ALTER TABLE
test=# \c postgres
You are now connected to database "postgres" as user "edb".
postgres=# create database test1 template test;
CREATE DATABASE
postgres=# alter database test set tablespace pg_default;
ERROR:  could not create file "pg_tblspc/16384/PG_15_202111301/16386/2016395": File exists
postgres=# alter database test set tablespace tab;
ALTER DATABASE
postgres=# \c test1
You are now connected to database "test1" as user "edb".
test1=# alter table t set tablespace tab;
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?> 

Regards,
Greg Nancarrow
Fujitsu Australia

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
Hi,

The issue here is that we are trying to create a table that exists inside a non-default tablespace when doing ALTER DATABASE. I think this should be skipped otherwise we will come across the error like shown below:

ashu@postgres=# alter database test set tablespace pg_default;
ERROR:  58P02: could not create file "pg_tblspc/16385/PG_15_202111301/16386/16390": File exists

I have taken the above from Neha's test-case.

--

Attached patch fixes this. I am passing a new boolean flag named *movedb* to CopyDatabase() so that it could skip the creation of tables existing in non-default tablespace when doing alter database. Alternatively, we can also rename the boolean flag movedb to createdb and pass its value accordingly from movedb() or createdb(). Either way looks fine to me. Kindly check the attached patch for the changes.

Dilip, Could you please check the attached patch and let me know if it looks fine or not?

Neha, can you please re-run the test-cases with the attached patch.

Thanks,

--
With Regards,
Ashutosh Sharma.

On Thu, Dec 9, 2021 at 8:43 AM Neha Sharma <neha.sharma@enterprisedb.com> wrote:



On Thu, Dec 9, 2021 at 4:26 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
On Thu, Dec 9, 2021 at 6:57 AM Neha Sharma <neha.sharma@enterprisedb.com> wrote:
>
> While testing the v7 patches, I am observing a crash with the below test case.
>
> Test case:
> create tablespace tab location '<dir_path>/test_dir';
> create tablespace tab1 location '<dir_path>/test_dir1';
> create database test tablespace tab;
> \c test
> create table t( a int PRIMARY KEY,b text);
> CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
> insert into t values (generate_series(1,2000000), large_val());
> alter table t set tablespace tab1 ;
> \c postgres
> create database test1 template test;
> alter database test set tablespace pg_default;
> alter database test set tablespace tab;
> \c test1
> alter table t set tablespace tab;
>
>  Logfile says:
> 2021-12-08 23:31:58.855 +04 [134252] PANIC:  could not fsync file "base/16386/4152": No such file or directory
> 2021-12-08 23:31:59.398 +04 [134251] LOG:  checkpointer process (PID 134252) was terminated by signal 6: Aborted
>

I tried to reproduce the issue using your test scenario, but I needed
to reduce the amount of inserted data (so reduced 2000000 to 20000)
due to disk space.
I then consistently get an error like the following:

postgres=# alter database test set tablespace pg_default;
ERROR:  could not create file
"pg_tblspc/16385/PG_15_202111301/16386/36395": File exists

(this only happens when the patch is used)

 
Yes, I was also getting this, and moving further we get a crash when we alter the table of database test1.
Below is the output of the test at my end.

postgres=# create tablespace tab1 location '/home/edb/PGsources/postgresql/inst/bin/rep_test1';
CREATE TABLESPACE
postgres=# create tablespace tab location '/home/edb/PGsources/postgresql/inst/bin/rep_test';
CREATE TABLESPACE
postgres=# create database test tablespace tab;
CREATE DATABASE
postgres=# \c test
You are now connected to database "test" as user "edb".
test=# create table t( a int PRIMARY KEY,b text);
CREATE TABLE
test=# CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
CREATE FUNCTION
test=# insert into t values (generate_series(1,2000000), large_val());
INSERT 0 2000000
test=# alter table t set tablespace tab1 ;
ALTER TABLE
test=# \c postgres
You are now connected to database "postgres" as user "edb".
postgres=# create database test1 template test;
CREATE DATABASE
postgres=# alter database test set tablespace pg_default;
ERROR:  could not create file "pg_tblspc/16384/PG_15_202111301/16386/2016395": File exists
postgres=# alter database test set tablespace tab;
ALTER DATABASE
postgres=# \c test1
You are now connected to database "test1" as user "edb".
test1=# alter table t set tablespace tab;
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?> 

Regards,
Greg Nancarrow
Fujitsu Australia
Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Dec 9, 2021 at 12:42 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi,
>
> The issue here is that we are trying to create a table that exists inside a non-default tablespace when doing ALTER
DATABASE.I think this should be skipped otherwise we will come across the error like shown below: 
>
> ashu@postgres=# alter database test set tablespace pg_default;
> ERROR:  58P02: could not create file "pg_tblspc/16385/PG_15_202111301/16386/16390": File exists
>
> I have taken the above from Neha's test-case.
>
> --
>
> Attached patch fixes this. I am passing a new boolean flag named *movedb* to CopyDatabase() so that it could skip the
creationof tables existing in non-default tablespace when doing alter database. Alternatively, we can also rename the
booleanflag movedb to createdb and pass its value accordingly from movedb() or createdb(). Either way looks fine to me.
Kindlycheck the attached patch for the changes. 
>
> Dilip, Could you please check the attached patch and let me know if it looks fine or not?
>
> Neha, can you please re-run the test-cases with the attached patch.

Thanks Ahustosh,  yeah I have observed the same, earlier we were
directly copying the whole directory so this was not an issue, now if
some tables of the database are already in the destination tablespace
then we should skip them while copying.  I will review your patch and
merge into the main patch.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Neha Sharma
Date:
On Thu, Dec 9, 2021 at 11:12 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Hi,

The issue here is that we are trying to create a table that exists inside a non-default tablespace when doing ALTER DATABASE. I think this should be skipped otherwise we will come across the error like shown below:

ashu@postgres=# alter database test set tablespace pg_default;
ERROR:  58P02: could not create file "pg_tblspc/16385/PG_15_202111301/16386/16390": File exists
 
Thanks Ashutosh for the patch, the mentioned issue has been resolved with the patch.

But I am still able to reproduce the crash consistently on top of this patch + v7 patches,just the test case has been modified.

create tablespace tab1 location '<dir_path>/test1';
create tablespace tab location '<dir_path>/test';
create database test tablespace tab;
\c test
create table t( a int PRIMARY KEY,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
insert into t values (generate_series(1,100000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
\c test1
alter table t set tablespace tab;
\c postgres
alter database test1 set tablespace tab1;

--Cancel the below command after few seconds
alter database test1 set tablespace pg_default;

\c test1
alter table t set tablespace tab1;


Logfile Snippet:
2021-12-09 17:49:18.110 +04 [18151] PANIC:  could not fsync file "base/116398/116400": No such file or directory
2021-12-09 17:49:19.105 +04 [18150] LOG:  checkpointer process (PID 18151) was terminated by signal 6: Aborted
2021-12-09 17:49:19.105 +04 [18150] LOG:  terminating any other active server processes

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Dec 9, 2021 at 7:23 PM Neha Sharma <neha.sharma@enterprisedb.com> wrote:
>
> On Thu, Dec 9, 2021 at 11:12 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

> \c postgres
> alter database test1 set tablespace tab1;
>
> --Cancel the below command after few seconds
> alter database test1 set tablespace pg_default;
>
> \c test1
> alter table t set tablespace tab1;
>
>
> Logfile Snippet:
> 2021-12-09 17:49:18.110 +04 [18151] PANIC:  could not fsync file "base/116398/116400": No such file or directory
> 2021-12-09 17:49:19.105 +04 [18150] LOG:  checkpointer process (PID 18151) was terminated by signal 6: Aborted
> 2021-12-09 17:49:19.105 +04 [18150] LOG:  terminating any other active server processes

Yeah, it seems like the fsync requests produced while copying database
objects to the new tablespace are not unregistered.  This seems like a
different issue than previously raised.  I will work on this next
week, thanks for testing.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
On Thu, Dec 9, 2021 at 7:23 PM Neha Sharma <neha.sharma@enterprisedb.com> wrote:
On Thu, Dec 9, 2021 at 11:12 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
Hi,

The issue here is that we are trying to create a table that exists inside a non-default tablespace when doing ALTER DATABASE. I think this should be skipped otherwise we will come across the error like shown below:

ashu@postgres=# alter database test set tablespace pg_default;
ERROR:  58P02: could not create file "pg_tblspc/16385/PG_15_202111301/16386/16390": File exists
 
Thanks Ashutosh for the patch, the mentioned issue has been resolved with the patch.

But I am still able to reproduce the crash consistently on top of this patch + v7 patches,just the test case has been modified.

create tablespace tab1 location '<dir_path>/test1';
create tablespace tab location '<dir_path>/test';
create database test tablespace tab;
\c test
create table t( a int PRIMARY KEY,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
insert into t values (generate_series(1,100000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
\c test1
alter table t set tablespace tab;
\c postgres
alter database test1 set tablespace tab1;

--Cancel the below command after few seconds
alter database test1 set tablespace pg_default;

\c test1
alter table t set tablespace tab1;


Logfile Snippet:
2021-12-09 17:49:18.110 +04 [18151] PANIC:  could not fsync file "base/116398/116400": No such file or directory
2021-12-09 17:49:19.105 +04 [18150] LOG:  checkpointer process (PID 18151) was terminated by signal 6: Aborted
2021-12-09 17:49:19.105 +04 [18150] LOG:  terminating any other active server processes

This is different from the issue you raised earlier. As Dilip said, we need to unregister sync requests for files that got successfully copied to the target database, but the overall alter database statement failed. We are doing this when the database is created successfully, but not when it fails.
Probably doing the same inside the cleanup function movedb_failure_callback() should fix the problem. 

--
With Regards,
Ashutosh Sharma.

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Dec 10, 2021 at 7:39 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>>
>> Logfile Snippet:
>> 2021-12-09 17:49:18.110 +04 [18151] PANIC:  could not fsync file "base/116398/116400": No such file or directory
>> 2021-12-09 17:49:19.105 +04 [18150] LOG:  checkpointer process (PID 18151) was terminated by signal 6: Aborted
>> 2021-12-09 17:49:19.105 +04 [18150] LOG:  terminating any other active server processes
>
>
> This is different from the issue you raised earlier. As Dilip said, we need to unregister sync requests for files
thatgot successfully copied to the target database, but the overall alter database statement failed. We are doing this
whenthe database is created successfully, but not when it fails. 
> Probably doing the same inside the cleanup function movedb_failure_callback() should fix the problem.

Correct, I have done this cleanup, apart from this we have dropped the
fsyc request in create database failure case as well and also need to
drop buffer in error case of creatdb as well as movedb.  I have also
fixed the other issue for which you gave the patch (a bit differently)
basically, in case of movedb the source and destination dboid are same
so we don't need an additional parameter and also readjusted the
conditions to avoid nested if.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
+       /*
+        * If the relation is from the default tablespace then we need to
+        * create it in the destinations db's default tablespace.  Otherwise,
+        * we need to create in the same tablespace as it is in the source
+        * database.
+        */

This comment looks a bit confusing to me especially because when we say destination db's default tablespace people may think of pg_default tablespace (at least I think so). Basically what you are trying to say here - "If the relation exists in the same tablespace as the src database, then in the destination db also it should be the same or something like that.. " So, why not put it that way instead of referring to it as the default tablespace. It's just my view. If you disagree you can ignore it.

--

+       else if (src_dboid == dst_dboid)
+           continue;
+       else
+           dstrnode.spcNode = srcrnode.spcNode;;

There is an extra semicolon here.

--
With Regards,
Ashutosh Sharma.


On Sun, Dec 12, 2021 at 1:39 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Dec 10, 2021 at 7:39 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>>
>> Logfile Snippet:
>> 2021-12-09 17:49:18.110 +04 [18151] PANIC:  could not fsync file "base/116398/116400": No such file or directory
>> 2021-12-09 17:49:19.105 +04 [18150] LOG:  checkpointer process (PID 18151) was terminated by signal 6: Aborted
>> 2021-12-09 17:49:19.105 +04 [18150] LOG:  terminating any other active server processes
>
>
> This is different from the issue you raised earlier. As Dilip said, we need to unregister sync requests for files that got successfully copied to the target database, but the overall alter database statement failed. We are doing this when the database is created successfully, but not when it fails.
> Probably doing the same inside the cleanup function movedb_failure_callback() should fix the problem.

Correct, I have done this cleanup, apart from this we have dropped the
fsyc request in create database failure case as well and also need to
drop buffer in error case of creatdb as well as movedb.  I have also
fixed the other issue for which you gave the patch (a bit differently)
basically, in case of movedb the source and destination dboid are same
so we don't need an additional parameter and also readjusted the
conditions to avoid nested if.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Mon, Dec 13, 2021 at 8:34 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> +       /*
> +        * If the relation is from the default tablespace then we need to
> +        * create it in the destinations db's default tablespace.  Otherwise,
> +        * we need to create in the same tablespace as it is in the source
> +        * database.
> +        */
>
> This comment looks a bit confusing to me especially because when we say destination db's default tablespace people
maythink of pg_default tablespace (at least I think so). Basically what you are trying to say here - "If the relation
existsin the same tablespace as the src database, then in the destination db also it should be the same or something
likethat.. " So, why not put it that way instead of referring to it as the default tablespace. It's just my view. If
youdisagree you can ignore it. 
>
> --
>
> +       else if (src_dboid == dst_dboid)
> +           continue;
> +       else
> +           dstrnode.spcNode = srcrnode.spcNode;;
>
> There is an extra semicolon here.


Noted. I will fix them in the next version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Bruce Momjian
Date:
On Thu, Dec  2, 2021 at 07:19:50PM +0530, Dilip Kumar wrote:
From the patch:

> Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
> then forces another checkpoint. The comments in the createdb() function
> explain the reasons for this. The attached patch fixes this problem by making
> create database completely WAL logged so that we can avoid the checkpoints.
> 
> This can also be useful for supporting the TDE. For example, if we need different
> encryption for the source and the target database then we can not re-encrypt the
> page data if we copy the whole directory.  But with this patch, we are copying
> page by page so we have an opportunity to re-encrypt the page before copying that
> to the target database.

Uh, why is this true?  Why can't we just copy the heap/index files 8k at
a time and reencrypt them during the file copy, rather than using shared
buffers?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Dec 16, 2021 at 12:15 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Thu, Dec  2, 2021 at 07:19:50PM +0530, Dilip Kumar wrote:
> From the patch:
>
> > Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
> > then forces another checkpoint. The comments in the createdb() function
> > explain the reasons for this. The attached patch fixes this problem by making
> > create database completely WAL logged so that we can avoid the checkpoints.
> >
> > This can also be useful for supporting the TDE. For example, if we need different
> > encryption for the source and the target database then we can not re-encrypt the
> > page data if we copy the whole directory.  But with this patch, we are copying
> > page by page so we have an opportunity to re-encrypt the page before copying that
> > to the target database.
>
> Uh, why is this true?  Why can't we just copy the heap/index files 8k at
> a time and reencrypt them during the file copy, rather than using shared
> buffers?

Hi Bruce,

Yeah, you are right that if we copy in 8k block then we can re-encrypt
the page, but in the current system, we are not copying block by
block.  So the main effort for this patch is not only for TDE but to
get rid of the checkpoint we are forced to do before and after create
database.  So my point is that in this patch since we are copying page
by page we get an opportunity to re-encrypt the page.  I agree that if
the re-encryption would have been the main goal of this patch then
true we can copy files in 8k blocks and re-encrypt those blocks, that
time even if we have to access some page data for re-encryption (like
nonce) then also we can do it, but that is not the main objective.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Neha Sharma
Date:
Hi,

While testing the v8 patches in a hot-standby setup, it was observed the master is crashing with the below error;

2021-12-16 19:32:47.757 +04 [101483] PANIC:  could not fsync file "pg_tblspc/16385/PG_15_202112111/16386/16391": No such file or directory
2021-12-16 19:32:48.917 +04 [101482] LOG:  checkpointer process (PID 101483) was terminated by signal 6: Aborted

Parameters configured at master:
wal_level = hot_standby
max_wal_senders = 3
hot_standby = on
max_standby_streaming_delay= -1
wal_consistency_checking='all'
max_wal_size= 10GB
checkpoint_timeout= 1d
log_min_messages=debug1

Test Case:
create tablespace tab1 location '/home/edb/PGsources/postgresql/inst/bin/test1';
create tablespace tab location '/home/edb/PGsources/postgresql/inst/bin/test';
create database test tablespace tab;
\c test
create table t( a int PRIMARY KEY,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
insert into t values (generate_series(1,100000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
\c test1
alter table t set tablespace tab;
\c postgres
alter database test1 set tablespace tab1;

--cancel the below command
alter database test1 set tablespace pg_default; --press ctrl+c
\c test1
alter table t set tablespace tab1;


Log file attached for reference.

Thanks.
--
Regards,
Neha Sharma


On Thu, Dec 16, 2021 at 4:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Dec 16, 2021 at 12:15 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Thu, Dec  2, 2021 at 07:19:50PM +0530, Dilip Kumar wrote:
> From the patch:
>
> > Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
> > then forces another checkpoint. The comments in the createdb() function
> > explain the reasons for this. The attached patch fixes this problem by making
> > create database completely WAL logged so that we can avoid the checkpoints.
> >
> > This can also be useful for supporting the TDE. For example, if we need different
> > encryption for the source and the target database then we can not re-encrypt the
> > page data if we copy the whole directory.  But with this patch, we are copying
> > page by page so we have an opportunity to re-encrypt the page before copying that
> > to the target database.
>
> Uh, why is this true?  Why can't we just copy the heap/index files 8k at
> a time and reencrypt them during the file copy, rather than using shared
> buffers?

Hi Bruce,

Yeah, you are right that if we copy in 8k block then we can re-encrypt
the page, but in the current system, we are not copying block by
block.  So the main effort for this patch is not only for TDE but to
get rid of the checkpoint we are forced to do before and after create
database.  So my point is that in this patch since we are copying page
by page we get an opportunity to re-encrypt the page.  I agree that if
the re-encryption would have been the main goal of this patch then
true we can copy files in 8k blocks and re-encrypt those blocks, that
time even if we have to access some page data for re-encryption (like
nonce) then also we can do it, but that is not the main objective.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
I am getting the below error when running the same test-case that Neha shared in her previous email.

ERROR:  55000: some relations of database "test1" are already in tablespace "tab1"
HINT:  You must move them back to the database's default tablespace before using this command.
LOCATION:  movedb, dbcommands.c:1555

test-case:
========
create tablespace tab1 location '/home/ashu/test1';
create tablespace tab location '/home/ashu/test';

create database test tablespace tab;
\c test

create table t(a int primary key, b text);

create or replace function large_val() returns text language sql as 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';

insert into t values (generate_series(1,100000), large_val());

alter table t set tablespace tab1 ;

\c postgres
create database test1 template test;

\c test1
alter table t set tablespace tab;

\c postgres
alter database test1 set tablespace tab1; -- this fails with  the given error.

Observations:
===========
Please note that before running above alter database statement, the table 't'  is moved to tablespace 'tab' from 'tab1' so not sure why ReadDir() is returning true when searching for table 't' in tablespace 'tab1'. It should have returned NULL here:

 while ((xlde = ReadDir(dstdir, dst_dbpath)) != NULL)
        {
            if (strcmp(xlde->d_name, ".") == 0 ||
                strcmp(xlde->d_name, "..") == 0)
                continue;

            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                     errmsg("some relations of database \"%s\" are already in tablespace \"%s\"",
                            dbname, tblspcname),
                     errhint("You must move them back to the database's default tablespace before using this command.")));
        }

Also, if I run the checkpoint explicitly before executing the above alter database statement, this error doesn't appear which means it only happens with the new changes because earlier we were doing the force checkpoint at the end of createdb statement.

--
With Regards,
Ashutosh Sharma.

On Thu, Dec 16, 2021 at 9:26 PM Neha Sharma <neha.sharma@enterprisedb.com> wrote:
Hi,

While testing the v8 patches in a hot-standby setup, it was observed the master is crashing with the below error;

2021-12-16 19:32:47.757 +04 [101483] PANIC:  could not fsync file "pg_tblspc/16385/PG_15_202112111/16386/16391": No such file or directory
2021-12-16 19:32:48.917 +04 [101482] LOG:  checkpointer process (PID 101483) was terminated by signal 6: Aborted

Parameters configured at master:
wal_level = hot_standby
max_wal_senders = 3
hot_standby = on
max_standby_streaming_delay= -1
wal_consistency_checking='all'
max_wal_size= 10GB
checkpoint_timeout= 1d
log_min_messages=debug1

Test Case:
create tablespace tab1 location '/home/edb/PGsources/postgresql/inst/bin/test1';
create tablespace tab location '/home/edb/PGsources/postgresql/inst/bin/test';
create database test tablespace tab;
\c test
create table t( a int PRIMARY KEY,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
insert into t values (generate_series(1,100000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
\c test1
alter table t set tablespace tab;
\c postgres
alter database test1 set tablespace tab1;

--cancel the below command
alter database test1 set tablespace pg_default; --press ctrl+c
\c test1
alter table t set tablespace tab1;


Log file attached for reference.

Thanks.
--
Regards,
Neha Sharma


On Thu, Dec 16, 2021 at 4:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Dec 16, 2021 at 12:15 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Thu, Dec  2, 2021 at 07:19:50PM +0530, Dilip Kumar wrote:
> From the patch:
>
> > Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
> > then forces another checkpoint. The comments in the createdb() function
> > explain the reasons for this. The attached patch fixes this problem by making
> > create database completely WAL logged so that we can avoid the checkpoints.
> >
> > This can also be useful for supporting the TDE. For example, if we need different
> > encryption for the source and the target database then we can not re-encrypt the
> > page data if we copy the whole directory.  But with this patch, we are copying
> > page by page so we have an opportunity to re-encrypt the page before copying that
> > to the target database.
>
> Uh, why is this true?  Why can't we just copy the heap/index files 8k at
> a time and reencrypt them during the file copy, rather than using shared
> buffers?

Hi Bruce,

Yeah, you are right that if we copy in 8k block then we can re-encrypt
the page, but in the current system, we are not copying block by
block.  So the main effort for this patch is not only for TDE but to
get rid of the checkpoint we are forced to do before and after create
database.  So my point is that in this patch since we are copying page
by page we get an opportunity to re-encrypt the page.  I agree that if
the re-encryption would have been the main goal of this patch then
true we can copy files in 8k blocks and re-encrypt those blocks, that
time even if we have to access some page data for re-encryption (like
nonce) then also we can do it, but that is not the main objective.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Mon, Dec 6, 2021 at 12:45 PM Robert Haas <robertmhaas@gmail.com> wrote:
> So for example, imagine tests with 1GB of shard_buffers, 8GB, and
> 64GB. And template databases with sizes of whatever the default is,
> 1GB, 10GB, 100GB. Repeatedly make 75% of the pages dirty and then
> create a new database from one of the templates. And then just measure
> the performance. Maybe for large databases this approach is just
> really the pits -- and if your max_wal_size is too small, it
> definitely will be. But, I don't know, maybe with reasonable settings
> it's not that bad. Writing everything to disk twice - once to WAL and
> once to the target directory - has to be more expensive than doing it
> once. But on the other hand, it's all sequential I/O and the data
> pages don't need to be fsync'd, so perhaps the overhead is relatively
> mild. I don't know.

I have been tied up with other things for a bit now and have not had
time to look at this thread; sorry about that. I have a little more
time available now so I thought I would take a look at this again and
see where things stand.

Sadly, it doesn't appear to me that anyone has done any performance
testing of this patch, along the lines suggested above or otherwise,
and I think it's a crucial question for the patch. My reading of this
thread is that nobody really likes the idea of maintaining two methods
for performing CREATE DATABASE, but nobody wants to hose people who
are using it to clone large databases, either. To some extent those
things are inexorably in conflict. If we postulate that the 10TB
template database is on a local RAID array with 40 spindles, while
pg_wal is on an iSCSI volume that we access via a 128kB ISDN link,
then the new system is going to be infinitely worse. But real
situations aren't likely to be that bad, and it would be useful in my
opinion to have an idea how bad they actually are.

I'm somewhat inclined to propose that we keep the existing method
around along with the new method. Even though nobody really likes
that, we don't necessarily have to maintain both methods forever. If,
say, we use the new method by default in all cases, but add an option
to get the old method back if you need it, we could leave it that way
for a few years and then propose removing the old method (and the
switch to activate it) and see if anyone complains. That way, if the
new method turns out to suck in certain cases, users have a way out.
However, I still think doing some performance testing would be a
really good idea. It's not a great plan to make decisions about this
kind of thing in an information vacuum.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Wed, Dec 22, 2021 at 9:32 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> I couldn't find the mdpostchkpt() function. Are you talking about
> SyncPostCheckpoint() ? Anyway, as you have rightly said, we need to
> unlink all the files available inside the dst_tablespaceoid/dst_dboid/
> directory by scanning the pendingUnlinks list. And finally we don't
> want the next checkpoint to unlink this file again and PANIC so for
> that we have to update the entry for this unlinked rel file in the
> hash table i.e. cancel the sync request for it.

Until commit 3eb77eba5a51780d5cf52cd66a9844cd4d26feb0 in April 2019,
there was an mdpostckpt function, which is probably what was meant
here.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Sun, Dec 12, 2021 at 3:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Correct, I have done this cleanup, apart from this we have dropped the
> fsyc request in create database failure case as well and also need to
> drop buffer in error case of creatdb as well as movedb.  I have also
> fixed the other issue for which you gave the patch (a bit differently)
> basically, in case of movedb the source and destination dboid are same
> so we don't need an additional parameter and also readjusted the
> conditions to avoid nested if.

Amazingly to me given how much time has passed, these patches still
apply, although I think there are a few outstanding issues that you
promised to fix in the next version and haven't yet addressed.

In 0007, I think you will need to work a bit harder. I don't think
that you can just add a second argument to
ForgetDatabaseSyncRequests() that makes it do something other than
what the name of the function suggests but without renaming the
function or updating any comments. Elsewhere we have things like
TablespaceCreateDbspace and ResetUnloggedRelationsInDbspaceDir so
perhaps we ought to just add a new function with a name inspired by
those precedents alongside the existing one, rather than doing it this
way.

In 0008, this is a bit confusing:

+               PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+               memcpy(dstPage, srcPage, BLCKSZ);

After a minute, I figured out that the point here was to force
log_newpage() to actually set the LSN, but how about a comment?

I kind of wonder whether GetDatabaseRelationList should be broken into
two functions so that don't have quite such deep nesting. And I wonder
if maybe the return value of GetActiveSnapshot() should be cached in a
local variable.

On the whole I think there aren't huge code-level issues here, even if
things need to be tweaked here and there and bugs fixed. The real key
is arriving at a set of design trade-offs that doesn't make anyone too
upset.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Tue, Feb 8, 2022 at 10:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> I have been tied up with other things for a bit now and have not had
> time to look at this thread; sorry about that. I have a little more
> time available now so I thought I would take a look at this again and
> see where things stand.

Thanks for looking into this.

>
> Sadly, it doesn't appear to me that anyone has done any performance
> testing of this patch, along the lines suggested above or otherwise,
> and I think it's a crucial question for the patch.

Yeah, actually some performance testing started as shared by Ahustosh
[1] and soon after that we got side tracked by another issue[2] which
we thought had to be fixed before we proceed with this feature.

My reading of this
> thread is that nobody really likes the idea of maintaining two methods
> for performing CREATE DATABASE, but nobody wants to hose people who
> are using it to clone large databases, either. To some extent those
> things are inexorably in conflict. If we postulate that the 10TB
> template database is on a local RAID array with 40 spindles, while
> pg_wal is on an iSCSI volume that we access via a 128kB ISDN link,
> then the new system is going to be infinitely worse. But real
> situations aren't likely to be that bad, and it would be useful in my
> opinion to have an idea how bad they actually are.

Yeah that makes sense, I will work on performance testing in this line
and also on previous ideas you suggested.

> I'm somewhat inclined to propose that we keep the existing method
> around along with the new method. Even though nobody really likes
> that, we don't necessarily have to maintain both methods forever. If,
> say, we use the new method by default in all cases, but add an option
> to get the old method back if you need it, we could leave it that way
> for a few years and then propose removing the old method (and the
> switch to activate it) and see if anyone complains. That way, if the
> new method turns out to suck in certain cases, users have a way out.
> However, I still think doing some performance testing would be a
> really good idea. It's not a great plan to make decisions about this
> kind of thing in an information vacuum.

Yeah that makes sense to me.

Now, one bigger question is can we proceed with this patch without
fixing [2], IMHO, if we are deciding to keep the old method also
intact then one option could be that for now only change CREATE
DATABASE to support both old and new way of creating database and for
time being leave the ALTER DATABASE SET TABLESPACE alone and let it
work only with the old method?  And another option is that we first
fix the issue related to the tombstone file and then come back to
this?

IMHO, the first option could be better in a way that we have already
made better progress in this patch and this is in better shape than
the other patch we are trying to make for removing the tombstone
files.


[1]https://www.postgresql.org/message-id/CAE9k0Pkg20tHq8oiJ%2BxXa9%3Daf3QZCSYTw99aBaPthA1UMKhnTg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CA%2BTgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Bruce Momjian
Date:
On Tue, Feb  8, 2022 at 12:09:08PM -0500, Robert Haas wrote:
> Sadly, it doesn't appear to me that anyone has done any performance
> testing of this patch, along the lines suggested above or otherwise,
> and I think it's a crucial question for the patch. My reading of this
> thread is that nobody really likes the idea of maintaining two methods
> for performing CREATE DATABASE, but nobody wants to hose people who
> are using it to clone large databases, either. To some extent those
> things are inexorably in conflict. If we postulate that the 10TB
> template database is on a local RAID array with 40 spindles, while
> pg_wal is on an iSCSI volume that we access via a 128kB ISDN link,
> then the new system is going to be infinitely worse. But real
> situations aren't likely to be that bad, and it would be useful in my
> opinion to have an idea how bad they actually are.

Honestly, I never understood why the checkpoint during CREATE DATABASE
was as problem --- we checkpoint by default every five minutes anyway,
so why is an additional two a problem --- it just means the next
checkpoint will do less work.  It is hard to see how avoiding
checkpoints to add WAL writes, fscyncs, and replication traffic could be
a win.

I see the patch justification outlined here:

    https://www.postgresql.org/message-id/CAFiTN-sP6yLVTfjR42mEfvFwJ-SZ2iEtG1t0j=QX09X=BM+KWQ@mail.gmail.com

TDE is mentioned as a value for this patch, but I don't see why it is
needed --- TDE can easily decrypt/encrypt the pages while they are
copied.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andrew Dunstan
Date:
On 6/16/21 03:52, Dilip Kumar wrote:
> On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>> Rather than use size, I'd be inclined to say use this if the source
>> database is marked as a template, and use the copydir approach for
>> anything that isn't.
> Yeah, that is possible, on the other thought wouldn't it be good to
> provide control to the user by providing two different commands, e.g.
> COPY DATABASE for the existing method (copydir) and CREATE DATABASE
> for the new method (fully wal logged)?
>


This proposal seems to have gotten lost.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Feb 9, 2022 at 7:49 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> Honestly, I never understood why the checkpoint during CREATE DATABASE
> was as problem --- we checkpoint by default every five minutes anyway,
> so why is an additional two a problem --- it just means the next
> checkpoint will do less work.  It is hard to see how avoiding
> checkpoints to add WAL writes, fscyncs, and replication traffic could be
> a win.

But don't you think that the current way of WAL logging the CREATE
DATABASE is a bit hacky?  I mean we are just logically WAL logging the
source and destination directory paths without actually WAL logging
what content we want to copy.  IMHO this is against the basic
principle of WAL and that's the reason we are forcefully checkpointing
to avoid replaying that WAL during crash recovery.  Even after this
some of the code comments say that we have limitations during PITR[1]
and we want to avoid it sometime in the future.

[1]
* In PITR replay, the first of these isn't an issue, and the second
* is only a risk if the CREATE DATABASE and subsequent template
* database change both occur while a base backup is being taken.
* There doesn't seem to be much we can do about that except document
* it as a limitation.
*
* Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
* we can avoid this.
*/
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
>
> On 6/16/21 03:52, Dilip Kumar wrote:
> > On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> >> Rather than use size, I'd be inclined to say use this if the source
> >> database is marked as a template, and use the copydir approach for
> >> anything that isn't.
> > Yeah, that is possible, on the other thought wouldn't it be good to
> > provide control to the user by providing two different commands, e.g.
> > COPY DATABASE for the existing method (copydir) and CREATE DATABASE
> > for the new method (fully wal logged)?
> >
>
>
> This proposal seems to have gotten lost.

Yeah, I am planning to work on this part so that we can support both methods.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Wed, Feb 9, 2022 at 9:19 AM Bruce Momjian <bruce@momjian.us> wrote:
> Honestly, I never understood why the checkpoint during CREATE DATABASE
> was as problem --- we checkpoint by default every five minutes anyway,
> so why is an additional two a problem --- it just means the next
> checkpoint will do less work.  It is hard to see how avoiding
> checkpoints to add WAL writes, fscyncs, and replication traffic could be
> a win.

Try running pgbench with the --progress option and enough concurrent
jobs to keep a moderately large system busy and watching what happens
to the tps each time a checkpoint occurs. It's extremely dramatic, or
at least it was the last time I ran such tests. I think that
performance will sometimes drop by a factor of five or more when the
checkpoint hits, and take multiple minutes to recover.

I think your statement that doing an extra checkpoint "just means the
next checkpoint will do less work" is kind of misleading. That's
certainly true in some situations. But when the same pages are being
dirtied over and over again, an extra checkpoint often means that the
system will do MUCH MORE work, because every checkpoint triggers a new
set of full-page writes over the actively-updated portion of the
database.

I think that very few people run systems with heavy write workloads
with checkpoint_timeout=5m, precisely because of this issue. Almost
every system I see has had that raised to at least 10m and sometimes
30m or more. It can make a massive difference.

> I see the patch justification outlined here:
>
>         https://www.postgresql.org/message-id/CAFiTN-sP6yLVTfjR42mEfvFwJ-SZ2iEtG1t0j=QX09X=BM+KWQ@mail.gmail.com
>
> TDE is mentioned as a value for this patch, but I don't see why it is
> needed --- TDE can easily decrypt/encrypt the pages while they are
> copied.

That's true, but depending on what other design decisions we make,
WAL-logging it might be a problem.

Right now, when someone creates a new database, we log a single record
that basically says "go copy the directory'". That's very different
than what we normally do, which is to log changes to individual pages,
or where required, small groups of pages (e.g. a single WAL record is
written for an UPDATE even though it may touch two pages). The fact
that in this case we only log a single WAL record for an operation
that could touch an unbounded amount of data is why this needs special
handling around checkpoints. It also introduces a certain amount of
fragility into the system, because if for some reason the source
directory on the standby doesn't exactly match the source directory on
the primary, the new databases won't match either. Any errors that
creep into the process can be propagated around to other places by a
system like this. However, ordinarily that doesn't happen, which is
why we've been able to use this system successfully for so many years.

The other reason we've been able to use this successfully is that
we're confident that we can perform exactly the same operation on the
standby as we do on the primary knowing only the relevant directory
names. If we say "copy this directory to there" we believe we'll be
able to do that exactly the same way on the standby. Is that still
true with TDE? Well, it depends. If the encryption can be performed
knowing only the key and the identity of the block (database OID,
tablespace OID, relfilenode, fork, block number) then it's true. But
if the encryption needs to, for example, generate a random nonce for
each block, then it's false. If you want the standby to be an exact
copy of the master in a system where new blocks get random nonces,
then you need to replicate the copy block-by-block, not as one
gigantic operation, so that you can log the nonce you picked for each
block. On the other hand, maybe you DON'T want the standby to be an
exact copy of the master. If, for example, you imagine a system where
the master and standby aren't even using the same key, then this is a
lot less relevant.

I can't predict whether PostgreSQL will get TDE in the future, and if
it does, I can't predict what form it will take. Therefore any strong
statement about whether this will benefit TDE or not seems to me to be
pretty questionable - we don't know that it will be useful, and we
don't know that it won't. But, like Dilip, I think the way we're
WAL-logging CREATE DATABASE right now is a hack, and I *know* it can
cause massive performance drops on busy systems.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Wed, Feb 9, 2022 at 10:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> > On 6/16/21 03:52, Dilip Kumar wrote:
> > > On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> > >> Rather than use size, I'd be inclined to say use this if the source
> > >> database is marked as a template, and use the copydir approach for
> > >> anything that isn't.
> > > Yeah, that is possible, on the other thought wouldn't it be good to
> > > provide control to the user by providing two different commands, e.g.
> > > COPY DATABASE for the existing method (copydir) and CREATE DATABASE
> > > for the new method (fully wal logged)?
> >
> > This proposal seems to have gotten lost.
>
> Yeah, I am planning to work on this part so that we can support both methods.

But can we pick a different syntax? In my view this should be an
option to CREATE DATABASE rather than a whole new command.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andrew Dunstan
Date:
On 2/9/22 10:58, Dilip Kumar wrote:
> On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>>
>> On 6/16/21 03:52, Dilip Kumar wrote:
>>> On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>>>> Rather than use size, I'd be inclined to say use this if the source
>>>> database is marked as a template, and use the copydir approach for
>>>> anything that isn't.
>>> Yeah, that is possible, on the other thought wouldn't it be good to
>>> provide control to the user by providing two different commands, e.g.
>>> COPY DATABASE for the existing method (copydir) and CREATE DATABASE
>>> for the new method (fully wal logged)?
>>>
>>
>> This proposal seems to have gotten lost.
> Yeah, I am planning to work on this part so that we can support both methods.
>

OK, many thanks.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Tue, Feb 8, 2022 at 11:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Now, one bigger question is can we proceed with this patch without
> fixing [2], IMHO, if we are deciding to keep the old method also
> intact then one option could be that for now only change CREATE
> DATABASE to support both old and new way of creating database and for
> time being leave the ALTER DATABASE SET TABLESPACE alone and let it
> work only with the old method?  And another option is that we first
> fix the issue related to the tombstone file and then come back to
> this?
>
> IMHO, the first option could be better in a way that we have already
> made better progress in this patch and this is in better shape than
> the other patch we are trying to make for removing the tombstone
> files.

Yeah, it's getting quite close to the end of this release cycle. I'm
not sure whether we can get anything committed here at all in the time
we have remaining, but I agree with you that this patch seems like a
better prospect than that one.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Bruce Momjian
Date:
On Wed, Feb  9, 2022 at 11:00:06AM -0500, Robert Haas wrote:
> Try running pgbench with the --progress option and enough concurrent
> jobs to keep a moderately large system busy and watching what happens
> to the tps each time a checkpoint occurs. It's extremely dramatic, or
> at least it was the last time I ran such tests. I think that
> performance will sometimes drop by a factor of five or more when the
> checkpoint hits, and take multiple minutes to recover.
> 
> I think your statement that doing an extra checkpoint "just means the
> next checkpoint will do less work" is kind of misleading. That's
> certainly true in some situations. But when the same pages are being
> dirtied over and over again, an extra checkpoint often means that the
> system will do MUCH MORE work, because every checkpoint triggers a new
> set of full-page writes over the actively-updated portion of the
> database.
> 
> I think that very few people run systems with heavy write workloads
> with checkpoint_timeout=5m, precisely because of this issue. Almost
> every system I see has had that raised to at least 10m and sometimes
> 30m or more. It can make a massive difference.

Well, I think the worst case is that the checkpoint happens exactly
between two checkpoints, so you are checkpointing twice as often, but if
it happens just before or after a checkpoint, I assume the effect would
be minimal.

So, it seems we are weighing having a checkpoint happen in the middle of
a checkpoint interval vs writing more WAL.  If the WAL traffic, without
CREATE DATABASE, is high, and the template database is small, writing
more WAL and skipping the checkpoint will be win, but if the WAL traffic
is small and the template database is big, the extra WAL will be a loss.
Is this accurate?

> I can't predict whether PostgreSQL will get TDE in the future, and if
> it does, I can't predict what form it will take. Therefore any strong
> statement about whether this will benefit TDE or not seems to me to be
> pretty questionable - we don't know that it will be useful, and we

Agreed.  We would want to have a different heap/index key on the standby
so we can rotate the heap/index key.

> don't know that it won't. But, like Dilip, I think the way we're
> WAL-logging CREATE DATABASE right now is a hack, and I *know* it can

Yes, it is a hack, but it seems to be a clever one that we might have
chosen if it had not been part of the original system.

> cause massive performance drops on busy systems.

See above.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Wed, Feb 9, 2022 at 1:34 PM Bruce Momjian <bruce@momjian.us> wrote:
> Well, I think the worst case is that the checkpoint happens exactly
> between two checkpoints, so you are checkpointing twice as often, but if
> it happens just before or after a checkpoint, I assume the effect would
> be minimal.

I agree for the most part. I think that if checkpoints happen every 8
minutes normally and the extra checkpoint happens 2 minutes after the
previous checkpoint, the impact may be almost as bad as if it had
happened right in the middle. If it happens 5 seconds after the
previous checkpoint, it should be low impact.

> So, it seems we are weighing having a checkpoint happen in the middle of
> a checkpoint interval vs writing more WAL.  If the WAL traffic, without
> CREATE DATABASE, is high, and the template database is small, writing
> more WAL and skipping the checkpoint will be win, but if the WAL traffic
> is small and the template database is big, the extra WAL will be a loss.
> Is this accurate?

I think that's basically correct. I would expect that the worry about
big template database is mostly about template databases that are
REALLY big. I think if your template database is 10GB you probably
shouldn't be worried about this feature. 10GB of extra WAL isn't
nothing, but if you've got reasonably capable hardware, it's not
overloaded, and max_wal_size is big enough, it's probably not going to
have a huge impact. Also, most of the impact will probably be on the
CREATE DATABASE command itself, and other things running on the system
at the same time will be impacted to a lesser degree. I think it's
even possible that you will be happier with this feature than without,
because you may like the idea that CREATE DATABASE itself is slow more
than you like the idea of it making everything else on the system
slow. On the other hand, if your template database is 1TB, the extra
WAL is probably going to be a fairly big problem.

Basically I think for most people this should be neutral or a win. For
people with really large template databases, it's a loss. Hence the
discussion about having a way for people who prefer the current
behavior to keep it.

> Agreed.  We would want to have a different heap/index key on the standby
> so we can rotate the heap/index key.

I don't like that design, and I don't think that's what we should do,
but I understand that you feel differently. IMHO, this thread is not
the place to hash that out.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Julien Rouhaud
Date:
On Wed, Feb 09, 2022 at 02:30:08PM -0500, Robert Haas wrote:
> On Wed, Feb 9, 2022 at 1:34 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Well, I think the worst case is that the checkpoint happens exactly
> > between two checkpoints, so you are checkpointing twice as often, but if
> > it happens just before or after a checkpoint, I assume the effect would
> > be minimal.
> 
> I agree for the most part. I think that if checkpoints happen every 8
> minutes normally and the extra checkpoint happens 2 minutes after the
> previous checkpoint, the impact may be almost as bad as if it had
> happened right in the middle. If it happens 5 seconds after the
> previous checkpoint, it should be low impact.

But the extra checkpoints will be immediate, while on a properly configured
system it should be spread checkpoint.  That will add some more overhead.

> > So, it seems we are weighing having a checkpoint happen in the middle of
> > a checkpoint interval vs writing more WAL.  If the WAL traffic, without
> > CREATE DATABASE, is high, and the template database is small, writing
> > more WAL and skipping the checkpoint will be win, but if the WAL traffic
> > is small and the template database is big, the extra WAL will be a loss.
> > Is this accurate?
> 
> I think that's basically correct. I would expect that the worry about
> big template database is mostly about template databases that are
> REALLY big. I think if your template database is 10GB you probably
> shouldn't be worried about this feature. 10GB of extra WAL isn't
> nothing, but if you've got reasonably capable hardware, it's not
> overloaded, and max_wal_size is big enough, it's probably not going to
> have a huge impact. Also, most of the impact will probably be on the
> CREATE DATABASE command itself, and other things running on the system
> at the same time will be impacted to a lesser degree. I think it's
> even possible that you will be happier with this feature than without,
> because you may like the idea that CREATE DATABASE itself is slow more
> than you like the idea of it making everything else on the system
> slow. On the other hand, if your template database is 1TB, the extra
> WAL is probably going to be a fairly big problem.
> 
> Basically I think for most people this should be neutral or a win. For
> people with really large template databases, it's a loss. Hence the
> discussion about having a way for people who prefer the current
> behavior to keep it.

Those extra WALs will also impact backups and replication.  You could have
fancy hardware, a read-mostly workload and the need to replicate over a slow
WAN, and in that case the 10GB could be much more problematic.



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Feb 9, 2022 at 9:31 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Feb 9, 2022 at 10:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> > > On 6/16/21 03:52, Dilip Kumar wrote:
> > > > On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> > > >> Rather than use size, I'd be inclined to say use this if the source
> > > >> database is marked as a template, and use the copydir approach for
> > > >> anything that isn't.
> > > > Yeah, that is possible, on the other thought wouldn't it be good to
> > > > provide control to the user by providing two different commands, e.g.
> > > > COPY DATABASE for the existing method (copydir) and CREATE DATABASE
> > > > for the new method (fully wal logged)?
> > >
> > > This proposal seems to have gotten lost.
> >
> > Yeah, I am planning to work on this part so that we can support both methods.
>
> But can we pick a different syntax? In my view this should be an
> option to CREATE DATABASE rather than a whole new command.

Maybe we can provide something like

CREATE DATABASE..WITH WAL_LOG=true/false ? OR
CREATE DATABASE..WITH WAL_LOG_DATA_PAGE=true/false ? OR
CREATE DATABASE..WITH CHECKPOINT=true/false ? OR

And then we can explain in documentation about these options?  I think
default should be new method?


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Feb 10, 2022 at 2:52 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> Those extra WALs will also impact backups and replication.  You could have
> fancy hardware, a read-mostly workload and the need to replicate over a slow
> WAN, and in that case the 10GB could be much more problematic.

True, I guess, but how bad does your WAN have to be for that to be an
issue? On a 1 gigabit/second link, that's a little over 2 minutes of
transfer time. That's not nothing, but it's not extreme, either,
especially because there's no sense in querying an empty database.
You're going to have to put some stuff in that database before you can
do anything meaningful with it, and that's going to have to be
replicated with or without this feature.

I am not saying it couldn't be a problem, and that's why I'm endorsing
making the behavior optional. But I think that it's a niche scenario.
You need a bigger-than-normal template database, a slow WAN link, AND
you need the amount of data loaded into the databases you create from
the template to be small enough to make the cost of logging the
template pages material. If you create a 10GB database from a template
and then load 200GB of data into it, the WAL-logging overhead of
creating the template is only 5%.

I won't really be surprised if we hear that someone has a 10GB
template database and likes to make a ton of copies and only change
500 rows in each one while replicating the whole thing over a slow
WAN. That can definitely happen, and I'm sure whoever is doing that
has reasons for it which they consider good and sufficient. However, I
don't think there are likely to be a ton of people doing stuff like
that - just a few.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-02-10 10:32:42 -0500, Robert Haas wrote:
> I won't really be surprised if we hear that someone has a 10GB
> template database and likes to make a ton of copies and only change
> 500 rows in each one while replicating the whole thing over a slow
> WAN. That can definitely happen, and I'm sure whoever is doing that
> has reasons for it which they consider good and sufficient. However, I
> don't think there are likely to be a ton of people doing stuff like
> that - just a few.

Yea. I would be a bit more concerned if we made creating template databases
very cheap, e.g. by using file copy-on-write functionality like we have for
pg_upgrade.  But right now it's a fairly hefty operation anyway.

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andrew Dunstan
Date:
On 2/10/22 07:32, Dilip Kumar wrote:
> On Wed, Feb 9, 2022 at 9:31 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Feb 9, 2022 at 10:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>>>> On 6/16/21 03:52, Dilip Kumar wrote:
>>>>> On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>>>>>> Rather than use size, I'd be inclined to say use this if the source
>>>>>> database is marked as a template, and use the copydir approach for
>>>>>> anything that isn't.
>>>>> Yeah, that is possible, on the other thought wouldn't it be good to
>>>>> provide control to the user by providing two different commands, e.g.
>>>>> COPY DATABASE for the existing method (copydir) and CREATE DATABASE
>>>>> for the new method (fully wal logged)?
>>>> This proposal seems to have gotten lost.
>>> Yeah, I am planning to work on this part so that we can support both methods.
>> But can we pick a different syntax? In my view this should be an
>> option to CREATE DATABASE rather than a whole new command.
> Maybe we can provide something like
>
> CREATE DATABASE..WITH WAL_LOG=true/false ? OR
> CREATE DATABASE..WITH WAL_LOG_DATA_PAGE=true/false ? OR
> CREATE DATABASE..WITH CHECKPOINT=true/false ? OR
>
> And then we can explain in documentation about these options?  I think
> default should be new method?
>
>

The last one at least has the advantage that it doesn't invent yet
another keyword.

I can live with the new method being the default. I'm sure it would be
highlighted in the release notes too.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Fri, Feb 11, 2022 at 12:11 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> The last one at least has the advantage that it doesn't invent yet
> another keyword.

We don't need a new keyword for this as long as it lexes as one token,
because createdb_opt_name accepts IDENT. So I think we should focus on
trying to come up with something that is as clear as we know how to
make it.

What I find difficult about doing that is that this is all a bunch of
technical details that users may have difficulty understanding. If we
say WAL_LOG or WAL_LOG_DATA, a reasonably but not incredibly
well-informed user will assume that skipping WAL is not really an
option. If we say CHECKPOINT, a reasonably but not incredibly
well-informed user will presume they don't want one (I think).
CHECKPOINT also seems like it's naming the switch by the unwanted side
effect, which doesn't seem too flattering to the existing method.

How about something like LOG_AS_CLONE? That makes it clear, I hope,
that we're logging it a different way, but that method of logging it
is different in each case. You'd still have to read the documentation
to find out what it really means, but at least it seems like it points
you more in the right direction. To me, anyway.

> I can live with the new method being the default. I'm sure it would be
> highlighted in the release notes too.

That would make sense.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Bruce Momjian
Date:
On Fri, Feb 11, 2022 at 12:35:50PM -0500, Robert Haas wrote:
> How about something like LOG_AS_CLONE? That makes it clear, I hope,
> that we're logging it a different way, but that method of logging it
> is different in each case. You'd still have to read the documentation
> to find out what it really means, but at least it seems like it points
> you more in the right direction. To me, anyway.

I think CLONE would be confusing since we don't use that term often,
maybe LOG_DB_COPY or LOG_FILE_COPY?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Fri, Feb 11, 2022 at 12:50 PM Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Feb 11, 2022 at 12:35:50PM -0500, Robert Haas wrote:
> > How about something like LOG_AS_CLONE? That makes it clear, I hope,
> > that we're logging it a different way, but that method of logging it
> > is different in each case. You'd still have to read the documentation
> > to find out what it really means, but at least it seems like it points
> > you more in the right direction. To me, anyway.
>
> I think CLONE would be confusing since we don't use that term often,
> maybe LOG_DB_COPY or LOG_FILE_COPY?

Yeah, maybe. But it's not clear to me with that kind of naming whether
TRUE or FALSE would be the existing behavior? One version logs a
single record for the whole database, and the other logs a record per
database block. Neither version logs per file. LOG_COPIED_BLOCKS,
maybe?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Bruce Momjian
Date:
On Fri, Feb 11, 2022 at 01:18:58PM -0500, Robert Haas wrote:
> On Fri, Feb 11, 2022 at 12:50 PM Bruce Momjian <bruce@momjian.us> wrote:
> > On Fri, Feb 11, 2022 at 12:35:50PM -0500, Robert Haas wrote:
> > > How about something like LOG_AS_CLONE? That makes it clear, I hope,
> > > that we're logging it a different way, but that method of logging it
> > > is different in each case. You'd still have to read the documentation
> > > to find out what it really means, but at least it seems like it points
> > > you more in the right direction. To me, anyway.
> >
> > I think CLONE would be confusing since we don't use that term often,
> > maybe LOG_DB_COPY or LOG_FILE_COPY?
> 
> Yeah, maybe. But it's not clear to me with that kind of naming whether
> TRUE or FALSE would be the existing behavior? One version logs a
> single record for the whole database, and the other logs a record per
> database block. Neither version logs per file. LOG_COPIED_BLOCKS,
> maybe?

Yes, I like BLOCKS more than FILE.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andrew Dunstan
Date:
On 2/11/22 13:32, Bruce Momjian wrote:
> On Fri, Feb 11, 2022 at 01:18:58PM -0500, Robert Haas wrote:
>> On Fri, Feb 11, 2022 at 12:50 PM Bruce Momjian <bruce@momjian.us> wrote:
>>> On Fri, Feb 11, 2022 at 12:35:50PM -0500, Robert Haas wrote:
>>>> How about something like LOG_AS_CLONE? That makes it clear, I hope,
>>>> that we're logging it a different way, but that method of logging it
>>>> is different in each case. You'd still have to read the documentation
>>>> to find out what it really means, but at least it seems like it points
>>>> you more in the right direction. To me, anyway.
>>> I think CLONE would be confusing since we don't use that term often,
>>> maybe LOG_DB_COPY or LOG_FILE_COPY?
>> Yeah, maybe. But it's not clear to me with that kind of naming whether
>> TRUE or FALSE would be the existing behavior? One version logs a
>> single record for the whole database, and the other logs a record per
>> database block. Neither version logs per file. LOG_COPIED_BLOCKS,
>> maybe?
> Yes, I like BLOCKS more than FILE.


I'm not really sure any single parameter name is going to capture the
subtlety involved here.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Fri, Feb 11, 2022 at 3:40 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> I'm not really sure any single parameter name is going to capture the
> subtlety involved here.

I mean to some extent that's inevitable, but it's not a reason not to
do the best we can.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Fri, Feb 11, 2022 at 1:32 PM Bruce Momjian <bruce@momjian.us> wrote:
> > Yeah, maybe. But it's not clear to me with that kind of naming whether
> > TRUE or FALSE would be the existing behavior? One version logs a
> > single record for the whole database, and the other logs a record per
> > database block. Neither version logs per file. LOG_COPIED_BLOCKS,
> > maybe?
>
> Yes, I like BLOCKS more than FILE.

Cool.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Alvaro Herrera
Date:
On 2022-Feb-11, Robert Haas wrote:

> What I find difficult about doing that is that this is all a bunch of
> technical details that users may have difficulty understanding. If we
> say WAL_LOG or WAL_LOG_DATA, a reasonably but not incredibly
> well-informed user will assume that skipping WAL is not really an
> option. If we say CHECKPOINT, a reasonably but not incredibly
> well-informed user will presume they don't want one (I think).
> CHECKPOINT also seems like it's naming the switch by the unwanted side
> effect, which doesn't seem too flattering to the existing method.

It seems you're thinking deciding what to do based on an option that
gets a boolean argument.  But what about making the argument be an enum?
For example

CREATE DATABASE ... WITH (STRATEGY = LOG);    -- default if option is omitted
CREATE DATABASE ... WITH (STRATEGY = CHECKPOINT);

So the user has to think about it in terms of some strategy to choose,
rather than enabling or disabling some flag with nontrivial
implications.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere."                (Lamar Owen)



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andrew Dunstan
Date:
On 2/11/22 15:47, Robert Haas wrote:
> On Fri, Feb 11, 2022 at 3:40 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>> I'm not really sure any single parameter name is going to capture the
>> subtlety involved here.
> I mean to some extent that's inevitable, but it's not a reason not to
> do the best we can.


True.

I do think we should be wary of any name starting with "LOG", though.
Long experience tells us that's something that confuses users when it
refers to the WAL.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Fri, Feb 11, 2022 at 4:08 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> It seems you're thinking deciding what to do based on an option that
> gets a boolean argument.  But what about making the argument be an enum?
> For example
>
> CREATE DATABASE ... WITH (STRATEGY = LOG);      -- default if option is omitted
> CREATE DATABASE ... WITH (STRATEGY = CHECKPOINT);
>
> So the user has to think about it in terms of some strategy to choose,
> rather than enabling or disabling some flag with nontrivial
> implications.

I don't like those particular strategy names very much, but in general
I think that could be a way to go, too. I somewhat hope we never end
up with THREE strategies for creating a new database, but now that I
think about it, we might. Somebody might want to use a fancy FS
primitive that clones a directory at the FS level, or something.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-02-11 16:19:12 -0500, Robert Haas wrote:
> I somewhat hope we never end up with THREE strategies for creating a new
> database, but now that I think about it, we might. Somebody might want to
> use a fancy FS primitive that clones a directory at the FS level, or
> something.

I think that'd be a great, and pretty easy to implement, feature. But it seems
like it'd be mostly orthogonal to the "WAL log data" vs "checkpoint data"
question? On the primary / single node system using "WAL log data" with "COW
file copy" would work well.

I bet using COW file copies would speed up our own regression tests noticeably
- on slower systems we spend a fair bit of time and space creating template0
and postgres, with the bulk of the data never changing.

Template databases are also fairly commonly used by application developers to
avoid the cost of rerunning all the setup DDL & initial data loading for
different tests. Making that measurably cheaper would be a significant win.

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Sat, Feb 12, 2022 at 2:38 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2022-Feb-11, Robert Haas wrote:
>
> > What I find difficult about doing that is that this is all a bunch of
> > technical details that users may have difficulty understanding. If we
> > say WAL_LOG or WAL_LOG_DATA, a reasonably but not incredibly
> > well-informed user will assume that skipping WAL is not really an
> > option. If we say CHECKPOINT, a reasonably but not incredibly
> > well-informed user will presume they don't want one (I think).
> > CHECKPOINT also seems like it's naming the switch by the unwanted side
> > effect, which doesn't seem too flattering to the existing method.
>
> It seems you're thinking deciding what to do based on an option that
> gets a boolean argument.  But what about making the argument be an enum?
> For example
>
> CREATE DATABASE ... WITH (STRATEGY = LOG);      -- default if option is omitted
> CREATE DATABASE ... WITH (STRATEGY = CHECKPOINT);
>
> So the user has to think about it in terms of some strategy to choose,
> rather than enabling or disabling some flag with nontrivial
> implications.


Yeah I think being explicit about giving the strategy to the user
looks like a better option.  Now they can choose whether they want it
to create using WAL log or using CHECKPOINT.  Otherwise, if we give a
flag then we will have to give an explanation that if they choose not
to WAL log then we will have to do a checkpoint internally.  So I
think giving LOG vs CHECKPOINT as an explicit option looks better to
me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Sun, Feb 13, 2022 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>

I have done performance testing with different template DB sizes and
different amounts of dirty shared buffers and I think as expected the
bigger the dirty shared buffer the checkpoint approach becomes costly
and OTOH the larger the template DB size the WAL log approach takes
more time.

I think it is very common to have larger shared buffers and of course,
if somebody has configured such a large shared buffer then a good % of
it will be dirty most of the time.  So IMHO in the future, the WAL log
approach is going to be more usable in general.  However, this is just
my opinion, and others may have completely different thoughts and
anyhow we are keeping options for both the approaches so no worry.

Next, I am planning to do some more tests, where we are having pgbench
running and concurrently we do CREATEDB maybe every 1 minute and see
what is the CREATEDB time as well as what is the impact on pgbench
performance.  Because currently I have only measured CREATEDB time but
we must be knowing the impact of createdb on the other system as well.

Test setup:
max_wal_size=64GB
checkpoint_timeout=15min
- CREATE base TABLE of size of Shared Buffers
- CREATE template database and table in it of varying sizes (as per test)
- CHECKPOINT (write out dirty buffers)
- UPDATE 70% of tuple in base table (dirty 70% of shared buffers)
- CREATE database using template db. (Actual test target)

test1:
1 GB shared buffers, template DB size = 6MB, dirty shared buffer=70%
Head: 2341.665 ms
Patch: 85.229 ms

test2:
1 GB shared buffers, template DB size = 1GB, dirty shared buffer=70%
Head: 4044 ms
Patch: 8376 ms

test3:
8 GB shared buffers, template DB size = 1GB, dirty shared buffer=70%
Head: 21398 ms
Patch: 9834 ms

test4:
8 GB shared buffers, template DB size = 10GB, dirty shared buffer=95%
Head: 38574 ms
Patch: 77160 ms

test4:
32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70%
Head: 47656 ms
Patch: 79767 ms

test5:
64 GB shared buffers, template DB size = 1GB, dirty shared buffer=70%
Head: 59151 ms
Patch: 8742 ms

test6:
64 GB shared buffers, template DB size = 50GB, dirty shared buffer=50%
Head: 171614 ms
Patch: 406040 ms

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Sun, Feb 13, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrot>
> test4:
> 32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70%
> Head: 47656 ms
> Patch: 79767 ms

This seems like the most surprising result of the bunch. Here, the
template DB is both small enough to fit in shared_buffers and small
enough not to trigger a checkpoint all by itself, and yet the patch
loses.

Did you checkpoint between one test and the next, or might this test
have been done after a bunch of WAL had already been written since the
last checkpoint so that the 10GB pushed it over the edge?

BTW, you have test4 twice in your list of results.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Sun, Feb 13, 2022 at 9:56 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Feb 13, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrot>
> > test4:
> > 32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70%
> > Head: 47656 ms
> > Patch: 79767 ms
>
> This seems like the most surprising result of the bunch. Here, the
> template DB is both small enough to fit in shared_buffers and small
> enough not to trigger a checkpoint all by itself, and yet the patch
> loses.

Well this is not really surprising to me because what I have noticed
is that with the new approach the createdb time is completely
dependent upon the template db size.  So if the source db size is 10GB
it is taking around 80sec and the shared buffers size does not have a
major impact.  Maybe a very small shared buffer can have more impact
so I will test that as well.

>
> Did you checkpoint between one test and the next, or might this test
> have been done after a bunch of WAL had already been written since the
> last checkpoint so that the 10GB pushed it over the edge?

Not really,  I am running each test with a new initdb so that could
not be an issue.

> BTW, you have test4 twice in your list of results.

My bad, those are different tests.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Sun, Feb 13, 2022 at 12:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Feb 13, 2022 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

> Next, I am planning to do some more tests, where we are having pgbench
> running and concurrently we do CREATEDB maybe every 1 minute and see
> what is the CREATEDB time as well as what is the impact on pgbench
> performance.  Because currently I have only measured CREATEDB time but
> we must be knowing the impact of createdb on the other system as well.

I have done tests with the pgbench as well.  So basically I did not
notice any significant difference in the TPS, I was expecting there
should be some difference due to the checkpoint on the head so maybe I
need to test with more backend maybe.  And createdb time there is a
huge difference. I think this is because template1 db is very small so
patch is getting completed in no time whereas head is taking huge time
because of high dirty shared buffers (due to concurrent pgbench).

config:
echo "logging_collector=on" >> data/postgresql.conf
echo "port = 5432" >> data/postgresql.conf
echo "max_wal_size=64GB" >> data/postgresql.conf
echo "checkpoint_timeout=15min" >> data/postgresql.conf
echo "shared_buffers=32GB" >> data/postgresql.conf

Test:
./pgbench -i -s 1000 postgres
./pgbench -c 32 -j 32 -T 1200 -M prepared postgres >> result.txt
-- Concurrently run below script every 1 mins
CREATE DATABASE mydb log_copied_blocks=true/false;

Results:
- Pgbench TPS: Did not observe any difference head vs patch
- Create db time(very small template):
head:  21000 ms to 42000 ms (at different time)
patch: 80 ms

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Mon, Feb 14, 2022 at 10:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Feb 13, 2022 at 9:56 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Sun, Feb 13, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrot>
> > > test4:
> > > 32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70%
> > > Head: 47656 ms
> > > Patch: 79767 ms
> >
> > This seems like the most surprising result of the bunch. Here, the
> > template DB is both small enough to fit in shared_buffers and small
> > enough not to trigger a checkpoint all by itself, and yet the patch
> > loses.
>
> Well this is not really surprising to me because what I have noticed
> is that with the new approach the createdb time is completely
> dependent upon the template db size.  So if the source db size is 10GB
> it is taking around 80sec and the shared buffers size does not have a
> major impact.  Maybe a very small shared buffer can have more impact
> so I will test that as well.

I have done some more experiments just to understand where we are
spending most of the time.  First I have tried with synchronous commit
and fsync off and the creation time dropped from 80s to 70s then I
just removed the log_newpage then time further dropped to 50s.  I have
also tried with different shared buffer sizes and observed that
reducing or increasing the shared buffer size does not have much
impact on the created db with the new approach.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Mon, Feb 14, 2022 at 12:01 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Well this is not really surprising to me because what I have noticed
> is that with the new approach the createdb time is completely
> dependent upon the template db size.  So if the source db size is 10GB
> it is taking around 80sec and the shared buffers size does not have a
> major impact.  Maybe a very small shared buffer can have more impact
> so I will test that as well.

OK. Well, then this approach is somewhat worse than I expected for
moderately large template databases. But it seems very good for small
template databases, especially when there is other work in progress on
the system.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
Hi Dilip,

On Sun, Feb 13, 2022 at 12:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Feb 13, 2022 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
>
> I have done performance testing with different template DB sizes and
> different amounts of dirty shared buffers and I think as expected the
> bigger the dirty shared buffer the checkpoint approach becomes costly
> and OTOH the larger the template DB size the WAL log approach takes
> more time.
>
> I think it is very common to have larger shared buffers and of course,
> if somebody has configured such a large shared buffer then a good % of
> it will be dirty most of the time.  So IMHO in the future, the WAL log
> approach is going to be more usable in general.  However, this is just
> my opinion, and others may have completely different thoughts and
> anyhow we are keeping options for both the approaches so no worry.
>
> Next, I am planning to do some more tests, where we are having pgbench
> running and concurrently we do CREATEDB maybe every 1 minute and see
> what is the CREATEDB time as well as what is the impact on pgbench
> performance.  Because currently I have only measured CREATEDB time but
> we must be knowing the impact of createdb on the other system as well.
>
> Test setup:
> max_wal_size=64GB
> checkpoint_timeout=15min
> - CREATE base TABLE of size of Shared Buffers
> - CREATE template database and table in it of varying sizes (as per test)
> - CHECKPOINT (write out dirty buffers)
> - UPDATE 70% of tuple in base table (dirty 70% of shared buffers)
> - CREATE database using template db. (Actual test target)
>
> test1:
> 1 GB shared buffers, template DB size = 6MB, dirty shared buffer=70%
> Head: 2341.665 ms
> Patch: 85.229 ms
>
> test2:
> 1 GB shared buffers, template DB size = 1GB, dirty shared buffer=70%
> Head: 4044 ms
> Patch: 8376 ms
>
> test3:
> 8 GB shared buffers, template DB size = 1GB, dirty shared buffer=70%
> Head: 21398 ms
> Patch: 9834 ms
>
> test4:
> 8 GB shared buffers, template DB size = 10GB, dirty shared buffer=95%
> Head: 38574 ms
> Patch: 77160 ms
>
> test4:
> 32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70%
> Head: 47656 ms
> Patch: 79767 ms
>

Is it possible to see the WAL size generated by these two statements:
UPDATE 70% of the tuple in the base table (dirty 70% of the shared
buffers) && CREATE database using template DB (Actual test target).
Just wanted to know if it can exceed the max_wal_size of 64GB. Also,
is it possible to try with minimal wal_level? Sorry for asking you
this, I could try it myself but I don't have any high level system to
try it.

--
With Regards,
Ashutosh Sharma.



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Mon, Feb 14, 2022 at 9:17 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
>
> Is it possible to see the WAL size generated by these two statements:
> UPDATE 70% of the tuple in the base table (dirty 70% of the shared
> buffers) && CREATE database using template DB (Actual test target).
> Just wanted to know if it can exceed the max_wal_size of 64GB.

I think we already know the wal size generated by creating a db with
an old and new approach.  With the old approach it is just one WAL log
and with the new approach it is going to log every single block of the
database.  Yeah the updating 70% of the database could have some
impact but for verification purposes I tested without the update and
still the create db with WAL log is taking almost the same time.  But
anyway when I test next time I will verify again that no force
checkpoint is triggered.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Sun, Feb 13, 2022 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Feb 12, 2022 at 2:38 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

> > It seems you're thinking deciding what to do based on an option that
> > gets a boolean argument.  But what about making the argument be an enum?
> > For example
> >
> > CREATE DATABASE ... WITH (STRATEGY = LOG);      -- default if option is omitted
> > CREATE DATABASE ... WITH (STRATEGY = CHECKPOINT);
> >
> > So the user has to think about it in terms of some strategy to choose,
> > rather than enabling or disabling some flag with nontrivial
> > implications.
>
>
> Yeah I think being explicit about giving the strategy to the user
> looks like a better option.  Now they can choose whether they want it
> to create using WAL log or using CHECKPOINT.  Otherwise, if we give a
> flag then we will have to give an explanation that if they choose not
> to WAL log then we will have to do a checkpoint internally.  So I
> think giving LOG vs CHECKPOINT as an explicit option looks better to
> me.

So do we have consensus to use (STRATEGY = LOG/CHECKPOINT or do we
think that keeping it bool i.e. Is LOG_COPIED_BLOCKS a better option?
Once we have consensus on this I will make this change and
documentation as well along with the other changes suggested by
Robert.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Mon, Feb 14, 2022 at 11:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> So do we have consensus to use (STRATEGY = LOG/CHECKPOINT or do we
> think that keeping it bool i.e. Is LOG_COPIED_BLOCKS a better option?
> Once we have consensus on this I will make this change and
> documentation as well along with the other changes suggested by
> Robert.

I think we have consensus on STRATEGY. I'm not sure if we have
consensus on what the option values should be. If we had an option to
use fs-based cloning, that would also need to issue a checkpoint,
which makes me think that CHECKPOINT is not the best name.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Bruce Momjian
Date:
On Mon, Feb 14, 2022 at 12:27:10PM -0500, Robert Haas wrote:
> On Mon, Feb 14, 2022 at 11:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > So do we have consensus to use (STRATEGY = LOG/CHECKPOINT or do we
> > think that keeping it bool i.e. Is LOG_COPIED_BLOCKS a better option?
> > Once we have consensus on this I will make this change and
> > documentation as well along with the other changes suggested by
> > Robert.
> 
> I think we have consensus on STRATEGY. I'm not sure if we have
> consensus on what the option values should be. If we had an option to
> use fs-based cloning, that would also need to issue a checkpoint,
> which makes me think that CHECKPOINT is not the best name.

I think if we want LOG, it has tob e WAL_LOG instead of just LOG.  Was
there discussion that the user _has_ to specify and option instead of
using a default?  That doesn't seem good.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Mon, Feb 14, 2022 at 1:58 PM Bruce Momjian <bruce@momjian.us> wrote:
> > I think we have consensus on STRATEGY. I'm not sure if we have
> > consensus on what the option values should be. If we had an option to
> > use fs-based cloning, that would also need to issue a checkpoint,
> > which makes me think that CHECKPOINT is not the best name.
>
> I think if we want LOG, it has tob e WAL_LOG instead of just LOG.  Was
> there discussion that the user _has_ to specify and option instead of
> using a default?  That doesn't seem good.

I agree. I think we can set a default, which can be either whatever we
think will be best on average, or maybe it can be conditional based on
the database size or something.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Maciek Sakrejda
Date:
Andrew made a good case above for avoiding LOG:

>I do think we should be wary of any name starting with "LOG", though.
>Long experience tells us that's something that confuses users when it
refers to the WAL.


Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Tue, Feb 15, 2022 at 2:01 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
>
Here is the updated version of the patch, the changes are 1) Fixed
review comments given by Robert and one open comment from Ashutosh.
2) Preserved the old create db method. 3) As agreed upthread for now
we are using the new strategy only for createdb not for movedb so I
have removed the changes in ForgetDatabaseSyncRequests() and
DropDatabaseBuffers().  3) Provided a database creation strategy
option as of now I have kept it as below.

CREATE DATABASE ... WITH (STRATEGY = WAL_LOG);      -- default if
option is omitted
CREATE DATABASE ... WITH (STRATEGY = FILE_COPY);

I have updated the document but I was not sure how much internal
information to be exposed to the user so I will work on that based on
feedback from others.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Tue, Feb 15, 2022 at 6:49 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Here is the updated version of the patch, the changes are 1) Fixed
> review comments given by Robert and one open comment from Ashutosh.
> 2) Preserved the old create db method. 3) As agreed upthread for now
> we are using the new strategy only for createdb not for movedb so I
> have removed the changes in ForgetDatabaseSyncRequests() and
> DropDatabaseBuffers().  3) Provided a database creation strategy
> option as of now I have kept it as below.
>
> CREATE DATABASE ... WITH (STRATEGY = WAL_LOG);      -- default if
> option is omitted
> CREATE DATABASE ... WITH (STRATEGY = FILE_COPY);

All right. I think there have been two design-level objections to this
patch, and this resolves one of them. The other one is trickier,
because AFAICT it's basically an opinion question: is accessing
pg_class in the template database from some backend that is connected
to another database too ugly to be acceptable? Several people have
expressed concerns about that, but it's not clear to me whether they
are essentially saying "that is not what I would do if I were doing
this project" or more like "if you commit something that does it that
way I will be enraged and demand an immediate revert and the removal
of your commit bit." If it's the former, I think it's possible to
clean up various details of these patches to make them look nicer than
they do at present and get something committed for PostgreSQL 15. But
if it is the latter then there's really no point to that kind of
cleanup work and we should probably just give up now. So, Andres,
Heikki, and anybody else with a strong opinion, can you clarify how
vigorously you hate this design, or don't?

My own opinion is that this is actually rather elegant. It just makes
sense to me that the right way to figure out what relations are in a
database is to get that list from the database rather than from the
filesystem. Nobody would accept the idea of making \d work by listing
out the directory contents rather than by walking pg_class, and so the
only reason we ought to accept that in the case of cloning a database
is if the code is too ugly any other way. But is it really? It's true
that this patch set does some refactoring of interfaces in order to
make that work, and there's a few things about how it does that that I
think could be improved, but on the whole, it's seems like a
remarkably small amount of code to do something that we've long
considered absolutely taboo. Now, it's nowhere close to being
something that could be used to allow fully general cross-database
access, and there are severe problems with the idea of allowing any
such thing. In particular, there are various places that test for
connections to a database, and aren't going to be expected processes
not connected to the database to be touching it. My belief is that a
heavyweight lock on the database is a suitable surrogate, because we
actually take such a lock when connecting to a database, and that
forms part of the regular interlock. Taking such locks routinely for
short periods would be expensive and might create other problems, but
doing it for a maintenance operation seems OK. Also, if we wanted to
actually support full cross-database access, locking wouldn't be the
only problem by far. We'd have to deal with things like the relcache
and the catcache, which would be hard, and might increase the cost of
very common things that we need to be cheap. But none of that is
implicated in this patch, which only generalizes code paths that are
not so commonly taken as to pose a problem, and manages to reuse quite
a bit of code rather than introducing entirely new code to do the same
things.
.
It does introduce some new code here and there, though: there isn't
zero duplication. The biggest chunk of that FWICS is in 0006, in
GetDatabaseRelationList and GetRelListFromPage. I just can't get
excited about that. It's literally about two screens worth of code.
We're not talking about duplicating src/backend/access/heapam or
something like that. I do think it would be a good idea to split it up
just a bit more: I think the code inside GetRelListFromPage that is
guarded by HeapTupleSatisfiesVisibility() could be moved into a
separate subroutine, and I think that would likely look a big nicer.
But fundamentally I just don't see a huge issue here. That is not to
say there isn't a huge issue here: just that I don't see it.

Comments?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-02-17 14:27:09 -0500, Robert Haas wrote:
> The other one is trickier, because AFAICT it's basically an opinion
> question: is accessing pg_class in the template database from some backend
> that is connected to another database too ugly to be acceptable? Several
> people have expressed concerns about that, but it's not clear to me whether
> they are essentially saying "that is not what I would do if I were doing
> this project" or more like "if you commit something that does it that way I
> will be enraged and demand an immediate revert and the removal of your
> commit bit." If it's the former, I think it's possible to clean up various
> details of these patches to make them look nicer than they do at present and
> get something committed for PostgreSQL 15.

Could you or Dilip outline how it now works, and what exactly makes it safe
etc (e.g. around locking, invalidation processing, snapshots, xid horizons)?

I just scrolled through the patchset without finding such an explanation, so
it's a bit hard to judge.


> But if it is the latter then there's really no point to that kind of cleanup
> work and we should probably just give up now.

This thread is long. Could you summarize what lead you to consider other
approaches (e.g. looking in the filesystem for relfilenodes) as not feasible /
too ugly / ...?

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Feb 17, 2022 at 4:13 PM Andres Freund <andres@anarazel.de> wrote:
> Could you or Dilip outline how it now works, and what exactly makes it safe
> etc (e.g. around locking, invalidation processing, snapshots, xid horizons)?
>
> I just scrolled through the patchset without finding such an explanation, so
> it's a bit hard to judge.

That's a good question and it's making me think about a few things I
hadn't considered before.

Dilip can add more here, but my impression is that most problems are
prevented by CREATE DATABASE, with or without this patch, starts by
acquiring a ShareLock on the database, preventing new connections, and
then making sure there are no existing connections. That means nothing
in the target database can be changing, which I think makes a lot of
the stuff on your list a non-issue. Any problems that remain have to
be the result of something that CREATE DATABASE does having a bad
interaction either with something that is completed beforehand or
something that begins afterward. There just can't be overlap, and I
think that rules out most problems.

Now you pointed out earlier one problem that it doesn't fix: unlike
the current method, this method involves reading buffers from the
template database into shared_buffers, and those buffers, once read,
stick around even after the operation finishes. That's not an
intrinsic problem, though, because a connection to the database could
do the same thing. However, again as you pointed out, it is a problem,
though, if we do it with less locking than a real database connection
would have done. It seems to me that if there are other problems here,
they have to be basically of the same sort: they have to leave the
system in some state which is otherwise impossible. Do you see some
other kind of hazard, or more examples of how that could happen? I
think the leftover buffers in shared_buffers have to be basically the
only thing, because apart from that, how is this any different than a
file copy?

The only other kind of hazard I can think of is: could it be unsafe to
try to interpret the contents of a database to which no one else is
connected at present due to any of the issues you mention? But what's
the hazard exactly? It can't be a problem if we've failed to process
sinval messages for the target database, because we're not using any
caches anyway. We can't. It can't be unsafe to test visibility of XIDs
for that database, because in an alternate universe some backend could
have connected to that database and seen the same XIDs. One thing we
COULD be doing wrong is using the wrong snapshot to test the
visibility of XIDs. The patch uses GetActiveSnapshot(), and I'm
thinking that is probably wrong. Shouldn't it be GetLatestSnapshot()?
And do we need to worry about snapshots being database-specific? Maybe
so.

> > But if it is the latter then there's really no point to that kind of cleanup
> > work and we should probably just give up now.
>
> This thread is long. Could you summarize what lead you to consider other
> approaches (e.g. looking in the filesystem for relfilenodes) as not feasible /
> too ugly / ...?

I don't think it's infeasible to look at the filesystem for files and
just copy whatever files we find. It's a plausible alternate design. I
just don't like it as well. I think that relying on the filesystem
contents to tell us what's going on is kind of hacky. The only
technical issue I see there is that the WAL logging might require more
kludgery, since that mechanism is kind of intertwined with
shared_buffers. You'd have to get the right block references into the
WAL record, and you have to make sure that checkpoints don't move the
redo pointer at an inopportune moment.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-02-17 18:00:19 -0500, Robert Haas wrote:
> Now you pointed out earlier one problem that it doesn't fix: unlike
> the current method, this method involves reading buffers from the
> template database into shared_buffers, and those buffers, once read,
> stick around even after the operation finishes.

Yea, I don't see a problem with that. A concurrent DROP DATABASE or such would
be problematic, but the locking prevents that.


> The only other kind of hazard I can think of is: could it be unsafe to
> try to interpret the contents of a database to which no one else is
> connected at present due to any of the issues you mention? But what's
> the hazard exactly?

I don't quite know. But I don't think it's impossible to run into trouble in
this area. E.g. xid horizons are computed in a database specific way. If the
routine reading pg_class did hot pruning, you could end up removing data
that's actually needed for a logical slot in the other database because the
backend local horizon state was computed for the "local" database?

Could there be problems because other backends wouldn't see the backend
accessing the other database as being connected to that database
(PGPROC->databaseId)?

Or what if somebody optimized snapshots to disregard readonly transactions in
other databases?


> It can't be a problem if we've failed to process sinval messages for the
> target database, because we're not using any caches anyway.

Could you end up with an outdated relmap entry? Probably not.


> We can't. It can't be unsafe to test visibility of XIDs for that database,
> because in an alternate universe some backend could have connected to that
> database and seen the same XIDs.

That's a weak argument, because in that alternative universe a PGPROC entry
with the PGPROC->databaseId = template_databases_oid would exist.

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Feb 18, 2022 at 4:30 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
>
> > This thread is long. Could you summarize what lead you to consider other
> > approaches (e.g. looking in the filesystem for relfilenodes) as not feasible /
> > too ugly / ...?
>
> I don't think it's infeasible to look at the filesystem for files and
> just copy whatever files we find. It's a plausible alternate design. I
> just don't like it as well. I think that relying on the filesystem
> contents to tell us what's going on is kind of hacky. The only
> technical issue I see there is that the WAL logging might require more
> kludgery, since that mechanism is kind of intertwined with
> shared_buffers. You'd have to get the right block references into the
> WAL record, and you have to make sure that checkpoints don't move the
> redo pointer at an inopportune moment.


Actually based on the previous discussion, I also tried to write the
POC with the file system scanning approach to identify the relation to
be copied seet patch 0007 in this thread [1].  And later we identified
one issue [2], i.e. while scanning directly the disk file we will only
know the relfilenode but we can not identify the relation oid that
means we can not lock the relation.  Now, I am not saying that there
is no way to work around that issue but that was also one of the
reasons for not pursuing that approach.

[1] https://www.postgresql.org/message-id/CAFiTN-v1KYsVAhq_fOWFa27LZiw9uK4n4cz5XmQJxJpsVcfq1w%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAFiTN-v%3DU58by_BeiZruNhykxk1q9XUxF%2BqLzD2LZAsEn2EBkg%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Feb 18, 2022 at 5:09 AM Andres Freund <andres@anarazel.de> wrote:

Thanks a lot Andres for taking time to read the thread and patch.

> > The only other kind of hazard I can think of is: could it be unsafe to
> > try to interpret the contents of a database to which no one else is
> > connected at present due to any of the issues you mention? But what's
> > the hazard exactly?
>
> I don't quite know. But I don't think it's impossible to run into trouble in
> this area. E.g. xid horizons are computed in a database specific way. If the
> routine reading pg_class did hot pruning, you could end up removing data
> that's actually needed for a logical slot in the other database because the
> backend local horizon state was computed for the "local" database?

I agree that while computing the xid horizon (ComputeXidHorizons()),
we only consider the backend which are connected to the same database
to which we are connected.  But we don't need to worry here because we
know the fact that there could be absolutely no backend connected to
the database we are trying to copy so we don't need to worry about
pruning the tuples which are visible to other backends.

Now if we are worried about the replication slot then for that we also
consider the xmin horizon from the replication slots so I don't think
that we have any problem here as well.  And we also consider the
walsender as well for computing the xid horizon.

> Could there be problems because other backends wouldn't see the backend
> accessing the other database as being connected to that database
> (PGPROC->databaseId)?

You mean that other backend will not consider this backend (which is
copying database) as connected to source database?  Yeah that is
correct but what is the problem in that, other backends can not
connect to the source database so what problem can they create to the
backend which is copying the database.

> Or what if somebody optimized snapshots to disregard readonly transactions in
> other databases?

Can you elaborate on this point?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Feb 17, 2022 at 6:39 PM Andres Freund <andres@anarazel.de> wrote:
> > The only other kind of hazard I can think of is: could it be unsafe to
> > try to interpret the contents of a database to which no one else is
> > connected at present due to any of the issues you mention? But what's
> > the hazard exactly?
>
> I don't quite know. But I don't think it's impossible to run into trouble in
> this area. E.g. xid horizons are computed in a database specific way. If the
> routine reading pg_class did hot pruning, you could end up removing data
> that's actually needed for a logical slot in the other database because the
> backend local horizon state was computed for the "local" database?

Yeah, but it doesn't -- and shouldn't. There's no HeapScanDesc here,
so we can't accidentally wander into heap_page_prune_opt(). We should
document the things we're thinking about here in the comments to
prevent future mistakes, but I think for the moment we are OK.

> Could there be problems because other backends wouldn't see the backend
> accessing the other database as being connected to that database
> (PGPROC->databaseId)?

I think that if there's any hazard here, it must be related to
snapshots, which brings us to your next point:

> Or what if somebody optimized snapshots to disregard readonly transactions in
> other databases?

So there are two related questions here. One is whether the snapshot
that we're using to access the template database can be unsafe, and
the other is whether the read-only access that we're performing inside
the template database could mess up somebody else's snapshot. Let's
deal with the second point first: nobody else knows that we're reading
from the template database, and nobody else is reading from the
template database except possibly for someone who is doing exactly
what we're doing. Therefore, I think this hazard can be ruled out.

On the first point, a key point in my opinion is that there can be no
in-flight transactions in the template database, because nobody is
connected to it, and prepared transactions in a template database are
verboten. It therefore can't matter if we include too few XIDs in our
snapshot, or if our xmin is too new. The reverse case can matter,
though: if the xmin of our snapshot were too old, or if we had extra
XIDs in our snapshot, then we might think that a transaction is still
in progress when it isn't. Therefore, I think the patch is wrong to
use GetActiveSnapshot() and must use GetLatestSnapshot() *after* it's
finished making sure that nobody is using the template database. I
don't think there's a hazard beyond that, though. Let's consider the
two ways in which things could go wrong:

1. Extra XIDs in the snapshot. Any current or future optimization of
snapshots would presumably be trying to make them smaller by removing
XIDs from the snapshot, not making them bigger by adding XIDs to the
snapshot. I guess in theory you can imagine an optimization that tests
for the presence of XIDs by some method other than scanning through an
array, and which feels free to add XIDs to the snapshot if they "can't
matter," but I think it's up to the author of that hypothetical future
patch to make sure they don't break anything in so doing -- especially
because it's entirely possible for our session to see XIDs used by a
backend in some other database, because they could show up in shared
catalogs. I think that's why, as far as I can tell, we only use the
database ID when setting pruning thresholds, and not for snapshots.

2. xmin of snapshot too new. There are no in-progress transactions in
the template database, so how can this even happen? If we set the xmin
"in the future," then we could get confused about what's visible due
to wraparound, but that seems crazy. I don't see how there can be a
problem here.

> > It can't be a problem if we've failed to process sinval messages for the
> > target database, because we're not using any caches anyway.
>
> Could you end up with an outdated relmap entry? Probably not.

Again, we're not relying on caching -- we read the file.

> > We can't. It can't be unsafe to test visibility of XIDs for that database,
> > because in an alternate universe some backend could have connected to that
> > database and seen the same XIDs.
>
> That's a weak argument, because in that alternative universe a PGPROC entry
> with the PGPROC->databaseId = template_databases_oid would exist.

So what? As I also argue above, I don't think that affects snapshot
generation, and if it did it wouldn't matter anyway, because it could
only remove in-progress transactions from the snapshot, and there
aren't any in the template database anyhow.

Another way of looking at this is: we could just as well use
SnapshotSelf or (if it still existed) SnapshotNow to test visibility.
In a world where there are no transactions in flight, it's the same
thing. In fact, maybe we should do it that way, just to make it
clearer what's happening.

I think these are really good questions you are raising, so I'm not
trying to be dismissive. But after some thought I'm not yet seeing any
problems (other than the use of GetActiveSnapshot).

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
I'm not sure about the current status, but found it while playing
around with the latest changes a bit, so thought of sharing it here.

+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         This is used for copying the database directory.  Currently, we have
+         two strategies the <literal>WAL_LOG_BLOCK</literal> and the

Isn't it wal_log instead of wal_log_block?

I think when users input wrong strategy with createdb command, we
should provide a hint message showing allowed values for strategy
types along with an error message. This will be helpful for the users.

On Tue, Feb 15, 2022 at 5:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Feb 15, 2022 at 2:01 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
> >
> Here is the updated version of the patch, the changes are 1) Fixed
> review comments given by Robert and one open comment from Ashutosh.
> 2) Preserved the old create db method. 3) As agreed upthread for now
> we are using the new strategy only for createdb not for movedb so I
> have removed the changes in ForgetDatabaseSyncRequests() and
> DropDatabaseBuffers().  3) Provided a database creation strategy
> option as of now I have kept it as below.
>
> CREATE DATABASE ... WITH (STRATEGY = WAL_LOG);      -- default if
> option is omitted
> CREATE DATABASE ... WITH (STRATEGY = FILE_COPY);
>
> I have updated the document but I was not sure how much internal
> information to be exposed to the user so I will work on that based on
> feedback from others.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Tue, Feb 22, 2022 at 8:27 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
I'm not sure about the current status, but found it while playing
around with the latest changes a bit, so thought of sharing it here.

+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         This is used for copying the database directory.  Currently, we have
+         two strategies the <literal>WAL_LOG_BLOCK</literal> and the

Isn't it wal_log instead of wal_log_block?

I think when users input wrong strategy with createdb command, we
should provide a hint message showing allowed values for strategy
types along with an error message. This will be helpful for the users.

I will fix these two comments while posting the next version.
 
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Tue, Mar 1, 2022 at 5:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Feb 22, 2022 at 8:27 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
I'm not sure about the current status, but found it while playing
around with the latest changes a bit, so thought of sharing it here.

+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         This is used for copying the database directory.  Currently, we have
+         two strategies the <literal>WAL_LOG_BLOCK</literal> and the

Isn't it wal_log instead of wal_log_block?

I think when users input wrong strategy with createdb command, we
should provide a hint message showing allowed values for strategy
types along with an error message. This will be helpful for the users.

I will fix these two comments while posting the next version.
 

The new version of the patch fixes these 2 comments pointed by Ashutosh and also splits the GetRelListFromPage() function as suggested by Robert and uses the latest snapshot for scanning the pg_class instead of active snapshot as pointed out by Robert.  I haven't yet added the test case to create a database using this new strategy option.  So if we are okay with these two options FILE_COPY and WAL_LOG then I will add test cases for the same.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Mar 3, 2022 at 11:22 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> The new version of the patch fixes these 2 comments pointed by Ashutosh and also splits the GetRelListFromPage()
functionas suggested by Robert and uses the latest snapshot for scanning the pg_class instead of active snapshot as
pointedout by Robert.  I haven't yet added the test case to create a database using this new strategy option.  So if we
areokay with these two options FILE_COPY and WAL_LOG then I will add test cases for the same. 

Reviewing 0001, the boundaries of the critical section move slightly,
but only over a memcpy, which can't fail, so that seems fine. But this
comment looks ominous:

     * Note: we're cheating a little bit here by assuming that mapped files
     * are either in pg_global or the database's default tablespace.

It's not clear to me how the code that follows relies on this
assumption, but the overall patch set would make that not true any
more, so there's some kind of an issue to think about there.

It's a little asymmetric that load_relmap_file() gets a subroutine
read_relmap_file() while write_relmap_file() gets a subroutine
write_relmap_file_internal(). Perhaps we could call the functions
{load,write}_named_relmap_file() or something of that sort.

Reviewing 0002, your comment updates in relmap_redo() are not
complete. Note that there's an unmodified comment that says "Write out
the new map and send sinval" just above where you modify the code to
only conditionally send sinval. I'm somewhat uncomfortable with the
shape of this logic, too. It looks weird to be sometimes calling
write_relmap_file and sometimes write_relmap_file_internal. You'd
expect functions with those names to be called at different
abstraction levels, rather than at parallel call sites. The renaming I
proposed would help with this but it's not just a cosmetic issue: the
logic to construct mapfilename is in this function in one case, but in
the called function in the other case. I can't help but think that the
write_relmap_file()/write_relmap_file_internal() split isn't entirely
the right thing.

I think part of the confusion here is that, pre-patch,
write_relmap_file() gets called during both recovery and normal
running, and the updates to shared_map or local_map are actually
nonsense during recovery, because the local map at least is local to
whatever our database is, and we don't have a database connection if
we're the startup process. After your patch, we're still going through
write_relmap_file in recovery in some cases, but really those map
updates don't seem like things that should be happening at all. And on
the other hand it's not clear to me why the CRC stuff isn't needed in
all cases, but that's only going to happen when we go through the
non-internal version of the function. You've probably spent more time
looking at this code than I have, but I'm wondering if the division
should be like this: we have one function that does the actual update,
and another function that does the update plus sets global variables.
Recovery always uses the first one, and outside of recovery we use the
first one for the create-database case and the second one otherwise.
Thoughts?

Regarding 0003, my initial thought was to like the fact that you'd
avoided duplicating code by using a function parameter, but as I look
at it a bit more, it's not clear to me that it's enough code that we
really care about not duplicating it. I would not expect to find a
function called RelationCopyAllFork() in tablecmds.c. I'd expect to
find it in storage.c, I think. And I think I'd be surprised to find
out that it doesn't actually know anything about copying; it's
basically just a loop over the forks to which you can supply your own
copy-function. And the fact that it's got an argument of type
copy_relation_storage and the argument name is copy_storage and the
value is sometimes RelationCopyStorageis a terminological muddle, too.
So I feel like perhaps this needs more thought.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Mar 9, 2022 at 3:12 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
Thanks for reviewing and valuable feedback.

> Reviewing 0001, the boundaries of the critical section move slightly,
> but only over a memcpy, which can't fail, so that seems fine. But this
> comment looks ominous:
>
>      * Note: we're cheating a little bit here by assuming that mapped files
>      * are either in pg_global or the database's default tablespace.
>
> It's not clear to me how the code that follows relies on this
> assumption, but the overall patch set would make that not true any
> more, so there's some kind of an issue to think about there.

I think the comments are w.r.t choosing the file path, because here we
assume either it is in the global tablespace or default tablespace of
the database.  Here also the comment is partially true because we also
assume that it will be in the default tablespace of the database
(because we do not need to worry about the shared relations).  But I
think this comments can move to the caller function where we are
creating the file path.

if (shared)
{
snprintf(mapfilename, sizeof(mapfilename), "global/%s",
RELMAPPER_FILENAME);
}
else
{
snprintf(mapfilename, sizeof(mapfilename), "%s/%s",
dbpath, RELMAPPER_FILENAME);
}

> It's a little asymmetric that load_relmap_file() gets a subroutine
> read_relmap_file() while write_relmap_file() gets a subroutine
> write_relmap_file_internal(). Perhaps we could call the functions
> {load,write}_named_relmap_file() or something of that sort.

Yeah this should be changed.

> Reviewing 0002, your comment updates in relmap_redo() are not
> complete. Note that there's an unmodified comment that says "Write out
> the new map and send sinval" just above where you modify the code to
> only conditionally send sinval. I'm somewhat uncomfortable with the
> shape of this logic, too. It looks weird to be sometimes calling
> write_relmap_file and sometimes write_relmap_file_internal. You'd
> expect functions with those names to be called at different
> abstraction levels, rather than at parallel call sites. The renaming I
> proposed would help with this but it's not just a cosmetic issue: the
> logic to construct mapfilename is in this function in one case, but in
> the called function in the other case. I can't help but think that the
> write_relmap_file()/write_relmap_file_internal() split isn't entirely
> the right thing.
>
> I think part of the confusion here is that, pre-patch,
> write_relmap_file() gets called during both recovery and normal
> running, and the updates to shared_map or local_map are actually
> nonsense during recovery, because the local map at least is local to
> whatever our database is, and we don't have a database connection if
> we're the startup process.

Yeah you are correct about the local map, but I am not sure whether we
can rely on not updating the shared map in the startup process.
Because how can we guarantee that now or in future the startup process
can never look into the map?  I agree that it is not connected to the
database so it doesn't make sense to look into the local map but how
we are going to ensure the shared map.  Said that I think there are
only 3 function which must be looking at these maps directly
RelationMapOidToFilenode(), RelationMapFilenodeToOid() and
RelationMapUpdateMap() and these functions are called from a very few
places and I don't think these should be called during recovery.  So
probably we can put a elog saying they should never be called during
recovery?

After your patch, we're still going through
> write_relmap_file in recovery in some cases, but really those map
> updates don't seem like things that should be happening at all. And on
> the other hand it's not clear to me why the CRC stuff isn't needed in
> all cases, but that's only going to happen when we go through the
> non-internal version of the function. You've probably spent more time
> looking at this code than I have, but I'm wondering if the division
> should be like this: we have one function that does the actual update,
> and another function that does the update plus sets global variables.
> Recovery always uses the first one, and outside of recovery we use the
> first one for the create-database case and the second one otherwise.
> Thoughts?

Right, infact now also if you see the logic, the
write_relmap_file_internal() is taking care of the actual update and
the write_relmap_file() is doing update + setting the global
variables.  So yeah we can rename as you suggested in 0001 and here
also we can change the logic as you suggested that the recovery and
createdb will only call the first function which is just doing the
update.


> Regarding 0003, my initial thought was to like the fact that you'd
> avoided duplicating code by using a function parameter, but as I look
> at it a bit more, it's not clear to me that it's enough code that we
> really care about not duplicating it. I would not expect to find a
> function called RelationCopyAllFork() in tablecmds.c.

Okay, actually I see this logic of copying the fork at a few different
places like
index_copy_data() in tablecmds.c. and then in
heapam_relation_copy_data() in heapam_handler.c.  So I was not sure
what could be right place for this function so I kept it in the same
file (tablecmds.c) because I splitted it from the function in this
file.

I'd expect to
> find it in storage.c, I think. And I think I'd be surprised to find
> out that it doesn't actually know anything about copying; it's
> basically just a loop over the forks to which you can supply your own
> copy-function.

Yeah but it eventually expects a function pointer to copy storage so
we can not completely deny that it knows nothing about the copy?

And the fact that it's got an argument of type
> copy_relation_storage and the argument name is copy_storage and the
> value is sometimes RelationCopyStorageis a terminological muddle, too.
> So I feel like perhaps this needs more thought.

One option is that we can duplicate this loop in dbcommand.c as well
like we are having it already in tablecmds.c and heapam_handler.c?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Wed, Mar 9, 2022 at 6:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Yeah you are correct about the local map, but I am not sure whether we
> can rely on not updating the shared map in the startup process.
> Because how can we guarantee that now or in future the startup process
> can never look into the map?  I agree that it is not connected to the
> database so it doesn't make sense to look into the local map but how
> we are going to ensure the shared map.  Said that I think there are
> only 3 function which must be looking at these maps directly
> RelationMapOidToFilenode(), RelationMapFilenodeToOid() and
> RelationMapUpdateMap() and these functions are called from a very few
> places and I don't think these should be called during recovery.  So
> probably we can put a elog saying they should never be called during
> recovery?

Yeah, that seems reasonable.

> Right, infact now also if you see the logic, the
> write_relmap_file_internal() is taking care of the actual update and
> the write_relmap_file() is doing update + setting the global
> variables.  So yeah we can rename as you suggested in 0001 and here
> also we can change the logic as you suggested that the recovery and
> createdb will only call the first function which is just doing the
> update.

But I think we want the path construction to be managed by the
function rather than the caller, too.

> I'd expect to
> > find it in storage.c, I think. And I think I'd be surprised to find
> > out that it doesn't actually know anything about copying; it's
> > basically just a loop over the forks to which you can supply your own
> > copy-function.
>
> Yeah but it eventually expects a function pointer to copy storage so
> we can not completely deny that it knows nothing about the copy?

Sure, I guess. It's just not obvious why the argument would actually
need to be a function that copies storage, or why there's more than
one way to copy storage. I'd rather keep all the code paths unified,
if we can, and set behavior via flags or something, maybe. I'm not
sure whether that's realistic, though.

> And the fact that it's got an argument of type
> > copy_relation_storage and the argument name is copy_storage and the
> > value is sometimes RelationCopyStorageis a terminological muddle, too.
> > So I feel like perhaps this needs more thought.
>
> One option is that we can duplicate this loop in dbcommand.c as well
> like we are having it already in tablecmds.c and heapam_handler.c?

Yeah, I think this is also worth considering.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Mar 9, 2022 at 6:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > Right, infact now also if you see the logic, the
> > write_relmap_file_internal() is taking care of the actual update and
> > the write_relmap_file() is doing update + setting the global
> > variables.  So yeah we can rename as you suggested in 0001 and here
> > also we can change the logic as you suggested that the recovery and
> > createdb will only call the first function which is just doing the
> > update.
>
> But I think we want the path construction to be managed by the
> function rather than the caller, too.

I have completely changed the logic for this refactoring.  Basically,
write_relmap_file(), is already having parameters to control whether
to write wal, send inval and we are already passing the dbpath.
Instead of making a new function I just pass one additional parameter
to this function itself about whether we are creating a new map or not
and I think with that changes are very less and this looks cleaner to
me.  Similarly for load_relmap_file() also I just had to pass the
dbpath and memory for destination map.  Please have a look and let me
know your thoughts.

> Sure, I guess. It's just not obvious why the argument would actually
> need to be a function that copies storage, or why there's more than
> one way to copy storage. I'd rather keep all the code paths unified,
> if we can, and set behavior via flags or something, maybe. I'm not
> sure whether that's realistic, though.

I try considering that, I think it doesn't look good to make it flag
based,  One of the main problem I noticed is that now for copying
either we need to call RelationCopyStorageis() or
RelationCopyStorageUsingBuffer() based on the input flag.  But if we
move the main copy function to the storage.c then the storage.c will
have depedency on bufmgr functions because I don't think we can keep
RelationCopyStorageUsingBuffer() inside storage.c.  So for now, I have
duplicated the loop which is already there in index_copy_data() and
heapam_relation_copy_data() and kept that in bufmgr.c and also moved
RelationCopyStorageUsingBuffer() into the bufmgr.c.  I think bufmgr.c
is already having function which are dealing with smgr things so I
feel this is the right place for the function.

Other changes:
1.  0001 and 0002 are merged because now we are not really refactoring
these function and just passing the additioanl arguments to it make
sense to combine the changes.
2. Same with 0003, that now we are not refactoring existing functions
but providing new interfaces so merged it with the 0004 (which was
0006 previously)

I think we should also write the test cases for create database
strategy.  But I do not see any test case for create database for
testing the existing options.  So I am wondering whether we should add
the test case only for the new option we are providing or we should
create a  separate path which tests the new option as well as the
existing options.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
Here are some review comments on the latest patch
(v11-0004-WAL-logged-CREATE-DATABASE.patch). I actually did the review
of the v10 patch but that applies for this latest version as well.

+               /* Now errors are fatal ... */
+               START_CRIT_SECTION();

Did you mean PANIC instead of FATAL?

==

+
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                                        errmsg("invalid create
strategy %s", strategy),
+                                        errhint("Valid strategies are
\"wal_log\", and \"file_copy\".")));
+       }


Should this be - "invalid createdb strategy" instead of "invalid
create strategy"?

==

+               /*
+                * In case of ALTER DATABASE SET TABLESPACE we don't need to do
+                * anything for the object which are not in the source
db's default
+                * tablespace.  The source and destination dboid will be same in
+                * case of ALTER DATABASE SET TABLESPACE.
+                */
+               else if (src_dboid == dst_dboid)
+                       continue;
+               else
+                       dstrnode.spcNode = srcrnode.spcNode;


Is this change still required? Do we support the WAL_COPY strategy for
ALTER DATABASE?

==

+               /* Open the source and the destination relation at
smgr level. */
+               src_smgr = smgropen(srcrnode, InvalidBackendId);
+               dst_smgr = smgropen(dstrnode, InvalidBackendId);
+
+               /* Copy relation storage from source to the destination. */
+               CreateAndCopyRelationData(src_smgr, dst_smgr,
relinfo->relpersistence);

Do we need to do smgropen for destination relfilenode here? Aren't we
already doing that inside RelationCreateStorage?

==

+       use_wal = XLogIsNeeded() &&
+               (relpersistence == RELPERSISTENCE_PERMANENT ||
copying_initfork);
+
+       /* Get number of blocks in the source relation. */
+       nblocks = smgrnblocks(src, forkNum);

What if number of blocks in a source relation is ZERO? Should we check
for that and return immediately. We have already done smgrcreate.

==

+       /* We don't need to copy the shared objects to the target. */
+       if (classForm->reltablespace == GLOBALTABLESPACE_OID)
+               return NULL;
+
+       /*
+        * If the object doesn't have the storage then nothing to be
+        * done for that object so just ignore it.
+        */
+       if (!RELKIND_HAS_STORAGE(classForm->relkind))
+               return NULL;

We can probably club together above two if-checks.

==

+      <varlistentry>
+       <term><replaceable class="parameter">strategy</replaceable></term>
+       <listitem>
+        <para>
+         This is used for copying the database directory.  Currently, we have
+         two strategies the <literal>WAL_LOG</literal> and the
+         <literal>FILE_COPY</literal>.  If <literal>WAL_LOG</literal> strategy
+         is used then the database will be copied block by block and it will
+         also WAL log each copied block.  Otherwise, if <literal>FILE_COPY

I think we need to mention the default strategy in the documentation page.

--
With Regards,
Ashutosh Sharma.

On Thu, Mar 10, 2022 at 4:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Mar 9, 2022 at 6:44 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > > Right, infact now also if you see the logic, the
> > > write_relmap_file_internal() is taking care of the actual update and
> > > the write_relmap_file() is doing update + setting the global
> > > variables.  So yeah we can rename as you suggested in 0001 and here
> > > also we can change the logic as you suggested that the recovery and
> > > createdb will only call the first function which is just doing the
> > > update.
> >
> > But I think we want the path construction to be managed by the
> > function rather than the caller, too.
>
> I have completely changed the logic for this refactoring.  Basically,
> write_relmap_file(), is already having parameters to control whether
> to write wal, send inval and we are already passing the dbpath.
> Instead of making a new function I just pass one additional parameter
> to this function itself about whether we are creating a new map or not
> and I think with that changes are very less and this looks cleaner to
> me.  Similarly for load_relmap_file() also I just had to pass the
> dbpath and memory for destination map.  Please have a look and let me
> know your thoughts.
>
> > Sure, I guess. It's just not obvious why the argument would actually
> > need to be a function that copies storage, or why there's more than
> > one way to copy storage. I'd rather keep all the code paths unified,
> > if we can, and set behavior via flags or something, maybe. I'm not
> > sure whether that's realistic, though.
>
> I try considering that, I think it doesn't look good to make it flag
> based,  One of the main problem I noticed is that now for copying
> either we need to call RelationCopyStorageis() or
> RelationCopyStorageUsingBuffer() based on the input flag.  But if we
> move the main copy function to the storage.c then the storage.c will
> have depedency on bufmgr functions because I don't think we can keep
> RelationCopyStorageUsingBuffer() inside storage.c.  So for now, I have
> duplicated the loop which is already there in index_copy_data() and
> heapam_relation_copy_data() and kept that in bufmgr.c and also moved
> RelationCopyStorageUsingBuffer() into the bufmgr.c.  I think bufmgr.c
> is already having function which are dealing with smgr things so I
> feel this is the right place for the function.
>
> Other changes:
> 1.  0001 and 0002 are merged because now we are not really refactoring
> these function and just passing the additioanl arguments to it make
> sense to combine the changes.
> 2. Same with 0003, that now we are not refactoring existing functions
> but providing new interfaces so merged it with the 0004 (which was
> 0006 previously)
>
> I think we should also write the test cases for create database
> strategy.  But I do not see any test case for create database for
> testing the existing options.  So I am wondering whether we should add
> the test case only for the new option we are providing or we should
> create a  separate path which tests the new option as well as the
> existing options.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Mar 10, 2022 at 7:22 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Here are some review comments on the latest patch
> (v11-0004-WAL-logged-CREATE-DATABASE.patch). I actually did the review
> of the v10 patch but that applies for this latest version as well.
>
> +               /* Now errors are fatal ... */
> +               START_CRIT_SECTION();
>
> Did you mean PANIC instead of FATAL?

I think here fatal didn't really mean the error level FATAL, that
means critical and I have seen it is used in other places also.  But I
really don't think we need this comments to removed to avoid any
confusion.

> ==
>
> +
> (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                                        errmsg("invalid create
> strategy %s", strategy),
> +                                        errhint("Valid strategies are
> \"wal_log\", and \"file_copy\".")));
> +       }
>
>
> Should this be - "invalid createdb strategy" instead of "invalid
> create strategy"?

Changed

> ==
>
> +               /*
> +                * In case of ALTER DATABASE SET TABLESPACE we don't need to do
> +                * anything for the object which are not in the source
> db's default
> +                * tablespace.  The source and destination dboid will be same in
> +                * case of ALTER DATABASE SET TABLESPACE.
> +                */
> +               else if (src_dboid == dst_dboid)
> +                       continue;
> +               else
> +                       dstrnode.spcNode = srcrnode.spcNode;
>
>
> Is this change still required? Do we support the WAL_COPY strategy for
> ALTER DATABASE?

Yeah now it is unreachable code so removed.

> ==
>
> +               /* Open the source and the destination relation at
> smgr level. */
> +               src_smgr = smgropen(srcrnode, InvalidBackendId);
> +               dst_smgr = smgropen(dstrnode, InvalidBackendId);
> +
> +               /* Copy relation storage from source to the destination. */
> +               CreateAndCopyRelationData(src_smgr, dst_smgr,
> relinfo->relpersistence);
>
> Do we need to do smgropen for destination relfilenode here? Aren't we
> already doing that inside RelationCreateStorage?

Yeah I have changed the complete logic and removed the smgr_open for
both source and destination and moved inside
CreateAndCopyRelationData, please check the updated code.

> ==
>
> +       use_wal = XLogIsNeeded() &&
> +               (relpersistence == RELPERSISTENCE_PERMANENT ||
> copying_initfork);
> +
> +       /* Get number of blocks in the source relation. */
> +       nblocks = smgrnblocks(src, forkNum);
>
> What if number of blocks in a source relation is ZERO? Should we check
> for that and return immediately. We have already done smgrcreate.

Yeah make sense to optimize, with that we will not have to get the
buffer strategy so done.

> ==
>
> +       /* We don't need to copy the shared objects to the target. */
> +       if (classForm->reltablespace == GLOBALTABLESPACE_OID)
> +               return NULL;
> +
> +       /*
> +        * If the object doesn't have the storage then nothing to be
> +        * done for that object so just ignore it.
> +        */
> +       if (!RELKIND_HAS_STORAGE(classForm->relkind))
> +               return NULL;
>
> We can probably club together above two if-checks.

Done

> ==
>
> +      <varlistentry>
> +       <term><replaceable class="parameter">strategy</replaceable></term>
> +       <listitem>
> +        <para>
> +         This is used for copying the database directory.  Currently, we have
> +         two strategies the <literal>WAL_LOG</literal> and the
> +         <literal>FILE_COPY</literal>.  If <literal>WAL_LOG</literal> strategy
> +         is used then the database will be copied block by block and it will
> +         also WAL log each copied block.  Otherwise, if <literal>FILE_COPY
>
> I think we need to mention the default strategy in the documentation page.

Done

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I have completely changed the logic for this refactoring.  Basically,
> write_relmap_file(), is already having parameters to control whether
> to write wal, send inval and we are already passing the dbpath.
> Instead of making a new function I just pass one additional parameter
> to this function itself about whether we are creating a new map or not
> and I think with that changes are very less and this looks cleaner to
> me.  Similarly for load_relmap_file() also I just had to pass the
> dbpath and memory for destination map.  Please have a look and let me
> know your thoughts.

It's not terrible, but how about something like the attached instead?
I think this has the effect of reducing the number of cases that the
low-level code needs to know about from 2 to 1, instead of making it
go up from 2 to 3.

> I think we should also write the test cases for create database
> strategy.  But I do not see any test case for create database for
> testing the existing options.  So I am wondering whether we should add
> the test case only for the new option we are providing or we should
> create a  separate path which tests the new option as well as the
> existing options.

FWIW, src/bin/scripts/t/020_createdb.pl does a little bit of testing
of this kind.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
Thanks Dilip for working on the review comments. I'll take a look at
the new version of patch and let you know my comments, if any.

--
With Regards,
Ashutosh Sharma.

On Thu, Mar 10, 2022 at 8:38 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Mar 10, 2022 at 7:22 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Here are some review comments on the latest patch
> > (v11-0004-WAL-logged-CREATE-DATABASE.patch). I actually did the review
> > of the v10 patch but that applies for this latest version as well.
> >
> > +               /* Now errors are fatal ... */
> > +               START_CRIT_SECTION();
> >
> > Did you mean PANIC instead of FATAL?
>
> I think here fatal didn't really mean the error level FATAL, that
> means critical and I have seen it is used in other places also.  But I
> really don't think we need this comments to removed to avoid any
> confusion.
>
> > ==
> >
> > +
> > (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +                                        errmsg("invalid create
> > strategy %s", strategy),
> > +                                        errhint("Valid strategies are
> > \"wal_log\", and \"file_copy\".")));
> > +       }
> >
> >
> > Should this be - "invalid createdb strategy" instead of "invalid
> > create strategy"?
>
> Changed
>
> > ==
> >
> > +               /*
> > +                * In case of ALTER DATABASE SET TABLESPACE we don't need to do
> > +                * anything for the object which are not in the source
> > db's default
> > +                * tablespace.  The source and destination dboid will be same in
> > +                * case of ALTER DATABASE SET TABLESPACE.
> > +                */
> > +               else if (src_dboid == dst_dboid)
> > +                       continue;
> > +               else
> > +                       dstrnode.spcNode = srcrnode.spcNode;
> >
> >
> > Is this change still required? Do we support the WAL_COPY strategy for
> > ALTER DATABASE?
>
> Yeah now it is unreachable code so removed.
>
> > ==
> >
> > +               /* Open the source and the destination relation at
> > smgr level. */
> > +               src_smgr = smgropen(srcrnode, InvalidBackendId);
> > +               dst_smgr = smgropen(dstrnode, InvalidBackendId);
> > +
> > +               /* Copy relation storage from source to the destination. */
> > +               CreateAndCopyRelationData(src_smgr, dst_smgr,
> > relinfo->relpersistence);
> >
> > Do we need to do smgropen for destination relfilenode here? Aren't we
> > already doing that inside RelationCreateStorage?
>
> Yeah I have changed the complete logic and removed the smgr_open for
> both source and destination and moved inside
> CreateAndCopyRelationData, please check the updated code.
>
> > ==
> >
> > +       use_wal = XLogIsNeeded() &&
> > +               (relpersistence == RELPERSISTENCE_PERMANENT ||
> > copying_initfork);
> > +
> > +       /* Get number of blocks in the source relation. */
> > +       nblocks = smgrnblocks(src, forkNum);
> >
> > What if number of blocks in a source relation is ZERO? Should we check
> > for that and return immediately. We have already done smgrcreate.
>
> Yeah make sense to optimize, with that we will not have to get the
> buffer strategy so done.
>
> > ==
> >
> > +       /* We don't need to copy the shared objects to the target. */
> > +       if (classForm->reltablespace == GLOBALTABLESPACE_OID)
> > +               return NULL;
> > +
> > +       /*
> > +        * If the object doesn't have the storage then nothing to be
> > +        * done for that object so just ignore it.
> > +        */
> > +       if (!RELKIND_HAS_STORAGE(classForm->relkind))
> > +               return NULL;
> >
> > We can probably club together above two if-checks.
>
> Done
>
> > ==
> >
> > +      <varlistentry>
> > +       <term><replaceable class="parameter">strategy</replaceable></term>
> > +       <listitem>
> > +        <para>
> > +         This is used for copying the database directory.  Currently, we have
> > +         two strategies the <literal>WAL_LOG</literal> and the
> > +         <literal>FILE_COPY</literal>.  If <literal>WAL_LOG</literal> strategy
> > +         is used then the database will be copied block by block and it will
> > +         also WAL log each copied block.  Otherwise, if <literal>FILE_COPY
> >
> > I think we need to mention the default strategy in the documentation page.
>
> Done
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
On Thu, Mar 10, 2022 at 10:18 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have completely changed the logic for this refactoring.  Basically,
> > write_relmap_file(), is already having parameters to control whether
> > to write wal, send inval and we are already passing the dbpath.
> > Instead of making a new function I just pass one additional parameter
> > to this function itself about whether we are creating a new map or not
> > and I think with that changes are very less and this looks cleaner to
> > me.  Similarly for load_relmap_file() also I just had to pass the
> > dbpath and memory for destination map.  Please have a look and let me
> > know your thoughts.
>
> It's not terrible, but how about something like the attached instead?
> I think this has the effect of reducing the number of cases that the
> low-level code needs to know about from 2 to 1, instead of making it
> go up from 2 to 3.
>

Looks better, but why do you want to pass elevel to the
load_relmap_file()? Are we calling this function from somewhere other
than read_relmap_file()? If not, do we have any plans to call this
function  directly bypassing read_relmap_file for any upcoming patch?

--
With Regards,
Ashutosh Sharma.



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Mar 10, 2022 at 10:18 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have completely changed the logic for this refactoring.  Basically,
> > write_relmap_file(), is already having parameters to control whether
> > to write wal, send inval and we are already passing the dbpath.
> > Instead of making a new function I just pass one additional parameter
> > to this function itself about whether we are creating a new map or not
> > and I think with that changes are very less and this looks cleaner to
> > me.  Similarly for load_relmap_file() also I just had to pass the
> > dbpath and memory for destination map.  Please have a look and let me
> > know your thoughts.
>
> It's not terrible, but how about something like the attached instead?
> I think this has the effect of reducing the number of cases that the
> low-level code needs to know about from 2 to 1, instead of making it
> go up from 2 to 3.

Yeah this looks cleaner, I will rebase the remaining patch.

> > I think we should also write the test cases for create database
> > strategy.  But I do not see any test case for create database for
> > testing the existing options.  So I am wondering whether we should add
> > the test case only for the new option we are providing or we should
> > create a  separate path which tests the new option as well as the
> > existing options.
>
> FWIW, src/bin/scripts/t/020_createdb.pl does a little bit of testing
> of this kind.

Okay, I think we need to support the strategy in createdb bin as well.
I will do that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Mar 11, 2022 at 11:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Mar 10, 2022 at 10:18 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > I have completely changed the logic for this refactoring.  Basically,
> > > write_relmap_file(), is already having parameters to control whether
> > > to write wal, send inval and we are already passing the dbpath.
> > > Instead of making a new function I just pass one additional parameter
> > > to this function itself about whether we are creating a new map or not
> > > and I think with that changes are very less and this looks cleaner to
> > > me.  Similarly for load_relmap_file() also I just had to pass the
> > > dbpath and memory for destination map.  Please have a look and let me
> > > know your thoughts.
> >
> > It's not terrible, but how about something like the attached instead?
> > I think this has the effect of reducing the number of cases that the
> > low-level code needs to know about from 2 to 1, instead of making it
> > go up from 2 to 3.
>
> Yeah this looks cleaner, I will rebase the remaining patch.

Here is the updated version of the patch set.

Changes, 1) it take Robert's patch as first refactoring patch 2)
Rebase other new relmapper apis on top of that in 0002 3) Some code
refactoring in main patch 0005 and also one problem fix, earlier in
wal log method I have removed ForceSyncCommit(), but IMHO that is
equally valid whether we use file_copy or wal_log because in both
cases we are creating the disk files.  4) Support strategy in createdb
tool and add test case as part of 0006.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
You may also need to add documentation to app-createdb.sgml. Currently
you have just added to create_database.sgml. Also, I had a quick look
at the new changes done in v13-0005-WAL-logged-CREATE-DATABASE.patch
and they seemed fine to me although I haven't put much emphasis on the
comments and other cosmetic stuff.

--
With Regards,
Ashutosh Sharma.

On Fri, Mar 11, 2022 at 3:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Mar 11, 2022 at 11:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Mar 10, 2022 at 10:18 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > > I have completely changed the logic for this refactoring.  Basically,
> > > > write_relmap_file(), is already having parameters to control whether
> > > > to write wal, send inval and we are already passing the dbpath.
> > > > Instead of making a new function I just pass one additional parameter
> > > > to this function itself about whether we are creating a new map or not
> > > > and I think with that changes are very less and this looks cleaner to
> > > > me.  Similarly for load_relmap_file() also I just had to pass the
> > > > dbpath and memory for destination map.  Please have a look and let me
> > > > know your thoughts.
> > >
> > > It's not terrible, but how about something like the attached instead?
> > > I think this has the effect of reducing the number of cases that the
> > > low-level code needs to know about from 2 to 1, instead of making it
> > > go up from 2 to 3.
> >
> > Yeah this looks cleaner, I will rebase the remaining patch.
>
> Here is the updated version of the patch set.
>
> Changes, 1) it take Robert's patch as first refactoring patch 2)
> Rebase other new relmapper apis on top of that in 0002 3) Some code
> refactoring in main patch 0005 and also one problem fix, earlier in
> wal log method I have removed ForceSyncCommit(), but IMHO that is
> equally valid whether we use file_copy or wal_log because in both
> cases we are creating the disk files.  4) Support strategy in createdb
> tool and add test case as part of 0006.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Fri, Mar 11, 2022 at 12:15 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Looks better, but why do you want to pass elevel to the
> load_relmap_file()? Are we calling this function from somewhere other
> than read_relmap_file()? If not, do we have any plans to call this
> function  directly bypassing read_relmap_file for any upcoming patch?

If it fails during CREATE DATABASE, it should be ERROR, not FATAL. In
that case, we only need to stop trying to create a database; we don't
need to terminate the session. On the other hand if we can't read our
own database's relmap files, that's an unrecoverable error, because we
will not be able to run any queries at all, so FATAL is appropriate.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
On Fri, Mar 11, 2022 at 8:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Mar 11, 2022 at 12:15 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > Looks better, but why do you want to pass elevel to the
> > load_relmap_file()? Are we calling this function from somewhere other
> > than read_relmap_file()? If not, do we have any plans to call this
> > function  directly bypassing read_relmap_file for any upcoming patch?
>
> If it fails during CREATE DATABASE, it should be ERROR, not FATAL. In
> that case, we only need to stop trying to create a database; we don't
> need to terminate the session. On the other hand if we can't read our
> own database's relmap files, that's an unrecoverable error, because we
> will not be able to run any queries at all, so FATAL is appropriate.
>

OK. I can see it being used in the v13 patch. In the previous patches
it was hard-coded with FATAL. Also, we simply error out when doing
file copy as I can see in the copy_file function. So yes FATAL is not
the right option to use when creating a database. Thanks.

--
With Regards,
Ashutosh Sharma.



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Fri, Mar 11, 2022 at 5:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Changes, 1) it take Robert's patch as first refactoring patch 2)
> Rebase other new relmapper apis on top of that in 0002 3) Some code
> refactoring in main patch 0005 and also one problem fix, earlier in
> wal log method I have removed ForceSyncCommit(), but IMHO that is
> equally valid whether we use file_copy or wal_log because in both
> cases we are creating the disk files.  4) Support strategy in createdb
> tool and add test case as part of 0006.

I don't think you've adequately considered temporary relations here.
It seems to be that ReadBufferWithoutRelcache() could not be safe on a
temprel, because we'd need a BackendId to access the underlying
storage. So I think that ReadBufferWithoutRelcache can only accept
unlogged or permanent, and maybe the argument ought to be a Boolean
instead of a relpersistence value. I thought that this problem might
be only cosmetic, but I checked the code that actually does the copy,
and there's no filter there on relpersistence either. And I think
there should be.


-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Fri, Mar 11, 2022 at 1:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think you've adequately considered temporary relations here.
> It seems to be that ReadBufferWithoutRelcache() could not be safe on a
> temprel, because we'd need a BackendId to access the underlying
> storage. So I think that ReadBufferWithoutRelcache can only accept
> unlogged or permanent, and maybe the argument ought to be a Boolean
> instead of a relpersistence value. I thought that this problem might
> be only cosmetic, but I checked the code that actually does the copy,
> and there's no filter there on relpersistence either. And I think
> there should be.

I hit "send" too quickly there:

rhaas=# create database fudge;
CREATE DATABASE
rhaas=# \c fudge
You are now connected to database "fudge" as user "rhaas".
fudge=# create temp table q ();
CREATE TABLE
fudge=# ^Z
[2]+  Stopped                 psql
[rhaas Downloads]$ pg_ctl stop -mi
waiting for server to shut down.... done
server stopped
[rhaas Downloads]$ %%
psql
\c
You are now connected to database "fudge" as user "rhaas".
fudge=# select * from pg_class where relpersistence='t';
  oid  | relname | relnamespace | reltype | reloftype | relowner |
relam | relfilenode | reltablespace | relpages | reltuples |
relallvisible | reltoastrelid | relhasindex | relisshared |
relpersistence | relkind | relnatts | relchecks | relhasrules |
relhastriggers | relhassubclass | relrowsecurity | relforcerowsecurity
| relispopulated | relreplident | relispartition | relrewrite |
relfrozenxid | relminmxid | relacl | reloptions | relpartbound

-------+---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+----------------+---------+----------+-----------+-------------+----------------+----------------+----------------+---------------------+----------------+--------------+----------------+------------+--------------+------------+--------+------------+--------------
 16388 | q       |        16386 |   16390 |         0 |       10 |
2 |       16388 |             0 |        0 |        -1 |             0
|             0 | f           | f           | t              | r
|        0 |         0 | f           | f              | f
| f              | f                   | t              | d
| f              |          0 |          721 |          1 |        |
         |
(1 row)

fudge=# \c rhaas
You are now connected to database "rhaas" as user "rhaas".
rhaas=# alter database fudge is_template true;
ALTER DATABASE
rhaas=# create database cookies template fudge;
CREATE DATABASE
rhaas=# \c cookies
You are now connected to database "cookies" as user "rhaas".
cookies=# select count(*) from pg_class where relpersistence='t';
 count
-------
     1
(1 row)

You have to be quick, because autovacuum will drop the orphaned temp
table when it notices it, but it is possible.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Fri, Mar 11, 2022 at 5:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Changes, 1) it take Robert's patch as first refactoring patch 2)
> Rebase other new relmapper apis on top of that in 0002 3) Some code
> refactoring in main patch 0005 and also one problem fix, earlier in
> wal log method I have removed ForceSyncCommit(), but IMHO that is
> equally valid whether we use file_copy or wal_log because in both
> cases we are creating the disk files.  4) Support strategy in createdb
> tool and add test case as part of 0006.

I think there's something wrong with what this patch is doing with the
XLOG records. It adds XLOG_DBASE_CREATEDIR, but the only new
XLogInsert() calls in the patch are passing XLOG_DBASE_CREATE, and no
existing references are adjusted. Similarly with xl_dbase_create_rec
and xl_dbase_createdir_rec. Why would we introduce a new record type
and not use it?

Let's not call the functions for the different strategies
CopyDatabase() and CopyDatabaseWithWal() but rather something that
matches up to the strategy names e.g. create_database_using_wal_log()
and create_database_using_file_copy(). There's something a little
funny about the names wal_log and file_copy ... they're not quite
parallel gramatically. But it's probably OK.

The changes to createdb_failure_params make me a little nervous. I
think we'd be in real trouble if we failed before completing both
DropDatabaseBuffers() and ForgetDatabaseSyncRequests(). However, it
looks to me like those are both intended to be no-fail operations, so
I don't see an actual hazard here. But, hmm, what about on the
recovery side? Suppose that we start copying the database block by
block and then either (a) the standby is promoted before the copy is
finished or (b) the copy fails. Now the standby has data in
shared_buffers for a database that does not exist. If that's not bad,
then why does createdb_failure_params need to DropDatabaseBuffers()?
But I bet it is bad. I wonder if we should be using
RelationCopyStorage() rather than this new function
RelationCopyStorageUsingBuffer(). That would avoid having the buffers
in shared_buffers, dodging the problem. But then we have a problem
with checkpoint interlocking: we could begin replay from a checkpoint
and find that the pages that were supposed to get copied prior to the
checkpoint were actually not copied, because the checkpoint record
could be written after we've logged a page being copied and before we
actually write the page. Or, we could crash after writing a whole lot
of pages and a checkpoint record, but before RelationCopyStorage()
fsyncs the destination fork. It doesn't seem advisable to hold off
checkpoints for the time it takes to copy an entire relation fork, so
the solution is apparently to keep the data in shared buffers after
all. But that brings us right back to square one. Have you thought
through this whole problem carefully? It seems like a total mess to me
at the moment, but maybe I'm missing something.

There seems to be no reason to specify specific values for the members
of enum CreateDBStrategy.

I think the naming of some of the new functions might need work, in
particular GetRelInfoFromTuple, GetRelListFromPage, and
GetDatabaseRelationList. The names seem kind of generic for what
they're doing. Maybe ScanSourceDatabasePgClass,
ScanSourceDatabasePgClassPage, ScanSourceDatabasePgClassTuple?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Sat, Mar 12, 2022 at 1:55 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
Responding to this specific issue..

> The changes to createdb_failure_params make me a little nervous. I
> think we'd be in real trouble if we failed before completing both
> DropDatabaseBuffers() and ForgetDatabaseSyncRequests(). However, it
> looks to me like those are both intended to be no-fail operations, so
> I don't see an actual hazard here.

I might be missing something but even without that I do not see a real
problem here.  The reason we are dropping the database buffers and
pending sync request is because right after this we are removing the
underlying files and if we just remove the files without dropping the
buffer from the buffer cache then the checkpointer will fail while
trying to flush the buffer.

But, hmm, what about on the
> recovery side? Suppose that we start copying the database block by
> block and then either (a) the standby is promoted before the copy is
> finished or (b) the copy fails.

I think the above logic will be valid in case of standby as well
because we are not really deleting the underlying files.

Now the standby has data in
> shared_buffers for a database that does not exist. If that's not bad,
> then why does createdb_failure_params need to DropDatabaseBuffers()?
> But I bet it is bad. I wonder if we should be using
> RelationCopyStorage() rather than this new function
> RelationCopyStorageUsingBuffer().

I am not sure how RelationCopyStorage() will help in the standby side,
because then also we will log the same WAL (XLOG_FPI) for each page
and standby side we will use buffer to apply this FPI so if you think
that there is a problem then it will be same with
RelationCopyStorage() at least on the standby side.

In fact while we are rewriting the relation during vacuum full that
time also we are calling log_newpage() under RelationCopyStorage() and
during standby if it gets promoted we will be having some buffers in
the buffer pool with the new relfilenode.  So I think our case is also
the same.

So here my stand is that we need to drop database buffers and remove
pending sync requests because we are deleting underlying files and if
we do not do that in some extreme cases then there is no need to drop
the buffers or remove the pending sync request and the worst
consequences would be waste of disk space.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Mar 11, 2022 at 11:51 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Mar 11, 2022 at 1:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I don't think you've adequately considered temporary relations here.
> > It seems to be that ReadBufferWithoutRelcache() could not be safe on a
> > temprel, because we'd need a BackendId to access the underlying
> > storage. So I think that ReadBufferWithoutRelcache can only accept
> > unlogged or permanent, and maybe the argument ought to be a Boolean
> > instead of a relpersistence value. I thought that this problem might
> > be only cosmetic, but I checked the code that actually does the copy,
> > and there's no filter there on relpersistence either. And I think
> > there should be.

Yeah right for now, this api can only support unlogged or permanent.
I will fix this

> I hit "send" too quickly there:
>
> rhaas=# create database fudge;
> CREATE DATABASE
> rhaas=# \c fudge
> You are now connected to database "fudge" as user "rhaas".
> fudge=# create temp table q ();
> CREATE TABLE
> fudge=# ^Z
> [2]+  Stopped                 psql
> [rhaas Downloads]$ pg_ctl stop -mi
> waiting for server to shut down.... done
> server stopped
> [rhaas Downloads]$ %%
> psql
> \c
> You are now connected to database "fudge" as user "rhaas".
> fudge=# select * from pg_class where relpersistence='t';
>   oid  | relname | relnamespace | reltype | reloftype | relowner |
> relam | relfilenode | reltablespace | relpages | reltuples |
> relallvisible | reltoastrelid | relhasindex | relisshared |
> relpersistence | relkind | relnatts | relchecks | relhasrules |
> relhastriggers | relhassubclass | relrowsecurity | relforcerowsecurity
> | relispopulated | relreplident | relispartition | relrewrite |
> relfrozenxid | relminmxid | relacl | reloptions | relpartbound
>
-------+---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+----------------+---------+----------+-----------+-------------+----------------+----------------+----------------+---------------------+----------------+--------------+----------------+------------+--------------+------------+--------+------------+--------------
>  16388 | q       |        16386 |   16390 |         0 |       10 |
> 2 |       16388 |             0 |        0 |        -1 |             0
> |             0 | f           | f           | t              | r
> |        0 |         0 | f           | f              | f
> | f              | f                   | t              | d
> | f              |          0 |          721 |          1 |        |
>          |
> (1 row)
>
> fudge=# \c rhaas
> You are now connected to database "rhaas" as user "rhaas".
> rhaas=# alter database fudge is_template true;
> ALTER DATABASE
> rhaas=# create database cookies template fudge;
> CREATE DATABASE
> rhaas=# \c cookies
> You are now connected to database "cookies" as user "rhaas".
> cookies=# select count(*) from pg_class where relpersistence='t';
>  count
> -------
>      1
> (1 row)

I think this is not a right example to show the problem, no?  Because
you are showing the pg_class entry and the pg_class is not a temp
relation so even if we avoid copying the temp relation pg_class will
be copied right? so you will still see this uncleaned temp relation
entry.  I could reproduce exactly the same issue without my patch as
well.

So I agree we need to avoid copying temp relations but with that the
above behavior will not change.  Am I missing something?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Sat, Mar 12, 2022 at 11:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> In fact while we are rewriting the relation during vacuum full that
> time also we are calling log_newpage() under RelationCopyStorage() and
> during standby if it gets promoted we will be having some buffers in
> the buffer pool with the new relfilenode.  So I think our case is also
> the same.
>
> So here my stand is that we need to drop database buffers and remove
> pending sync requests because we are deleting underlying files and if
> we do not do that in some extreme cases then there is no need to drop
> the buffers or remove the pending sync request and the worst
> consequences would be waste of disk space.

So other than this open point I have fixed other comments given by you
which includes,

- Avoid copying temp relfilenode
- Rename of functions CopyDatabase* -> CreateDatabaseUsing*
- GetDatabaseRelationList and friends to ScanSourceDatabasePgClass*
- Removed unused structure and macro because we are using the same WAL
for copying the database using the old method or creating the
directory and version files for the new method.  Do you think we
should introduce a new WAL for that instead of using the same?

Other than that, I have fixed some mistakes in comments and supported
tab completion for the new options.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Sat, Mar 12, 2022 at 12:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> So here my stand is that we need to drop database buffers and remove
> pending sync requests because we are deleting underlying files and if
> we do not do that in some extreme cases then there is no need to drop
> the buffers or remove the pending sync request and the worst
> consequences would be waste of disk space.

Hmm, I guess you're right.

On Mon, Mar 14, 2022 at 7:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> - Removed unused structure and macro because we are using the same WAL
> for copying the database using the old method or creating the
> directory and version files for the new method.  Do you think we
> should introduce a new WAL for that instead of using the same?

I think it would make sense to have two different WAL records e.g.
XLOG_DBASE_CREATE_WAL_LOG and XLOG_DBASE_CREATE_FILE_COPY. Then it's
easy to see how this could be generalized to other strategies in the
future.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Mon, Mar 14, 2022 at 7:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Other than that, I have fixed some mistakes in comments and supported
> tab completion for the new options.

I was looking at 0001 and 0002 again and realized that I swapped the
names load_relmap_file() and read_relmap_file() from what I should
have done. Here's a revised version. With this, read_relmap_file() and
write_relmap_file() become functions that just read and write the file
without touching any global variables, and load_relmap_file() is the
function that reads data from the file and puts it into a global
variable, which seems more sensible than the way I had it before.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Mon, Mar 14, 2022 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 14, 2022 at 7:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Other than that, I have fixed some mistakes in comments and supported
> > tab completion for the new options.
>
> I was looking at 0001 and 0002 again and realized that I swapped the
> names load_relmap_file() and read_relmap_file() from what I should
> have done. Here's a revised version. With this, read_relmap_file() and
> write_relmap_file() become functions that just read and write the file
> without touching any global variables, and load_relmap_file() is the
> function that reads data from the file and puts it into a global
> variable, which seems more sensible than the way I had it before.

Regarding 0003 and 0005, I'm not a fan of 'bool isunlogged'. I think
'bool permanent' would be better (note BM_PERMANENT). This would
involve reversing true and false.

Regarding 0004, I can't really see a reason for this function to take
a LockRelId as a parameter rather than two separate OIDs. I also can't
entirely see why it should be called LockRelationId. Maybe
LockRelationInDatabaseById(Oid dbid, Oid relid, LOCKMODE lockmode)?
Note that neither caller actually has a LockRelId available; both have
to construct one.

Regarding 0005:

+       CREATEDB_WAL_LOG = 0,
+       CREATEDB_FILE_COPY = 1

I still think you don't need = 0 and = 1 here.

I'll probably go through and do a pass over the comments once you post
the next version of this. There seems to be work needed in a bunch of
places, but it probably makes more sense for me to go through and
adjust the things that seem to need it rather than listing a bunch of
changes for you to make.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Mon, Mar 14, 2022 at 10:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> Regarding 0004, I can't really see a reason for this function to take
> a LockRelId as a parameter rather than two separate OIDs. I also can't
> entirely see why it should be called LockRelationId. Maybe
> LockRelationInDatabaseById(Oid dbid, Oid relid, LOCKMODE lockmode)?
> Note that neither caller actually has a LockRelId available; both have
> to construct one.

Actually we already have an existing function
UnlockRelationId(LockRelId *relid, LOCKMODE lockmode) so it makes more
sense to have a parallel lock function.  Do you still think we should
change?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Mon, Mar 14, 2022 at 12:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Mon, Mar 14, 2022 at 10:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Regarding 0004, I can't really see a reason for this function to take
> > a LockRelId as a parameter rather than two separate OIDs. I also can't
> > entirely see why it should be called LockRelationId. Maybe
> > LockRelationInDatabaseById(Oid dbid, Oid relid, LOCKMODE lockmode)?
> > Note that neither caller actually has a LockRelId available; both have
> > to construct one.
>
> Actually we already have an existing function
> UnlockRelationId(LockRelId *relid, LOCKMODE lockmode) so it makes more
> sense to have a parallel lock function.  Do you still think we should
> change?

Oh! OK, well, then what you did makes sense, for consistency. Didn't
realize that.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Mon, Mar 14, 2022 at 10:04 PM Robert Haas <robertmhaas@gmail.com> wrote:

> I think it would make sense to have two different WAL records e.g.
> XLOG_DBASE_CREATE_WAL_LOG and XLOG_DBASE_CREATE_FILE_COPY. Then it's
> easy to see how this could be generalized to other strategies in the
> future.

Done that way.  In dbase_desc(), for XLOG_DBASE_CREATE_FILE_COPY I
have kept the older description i.e. "copy dir" and for
XLOG_DBASE_CREATE_WAL_LOG it is "create dir", because logically the
first one is actually copying and the second one is just creating the
directory.  Do you think we should be using "copy dir file_copy" and
"copy dir wal_log" in the description as well?

> On Mon, Mar 14, 2022 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I was looking at 0001 and 0002 again and realized that I swapped the
> > names load_relmap_file() and read_relmap_file() from what I should
> > have done. Here's a revised version. With this, read_relmap_file() and
> > write_relmap_file() become functions that just read and write the file
> > without touching any global variables, and load_relmap_file() is the
> > function that reads data from the file and puts it into a global
> > variable, which seems more sensible than the way I had it before.

Okay, I have included this patch and rebased other patches on top of that.

> Regarding 0003 and 0005, I'm not a fan of 'bool isunlogged'. I think
> 'bool permanent' would be better (note BM_PERMANENT). This would
> involve reversing true and false.

Okay changed.

> Regarding 0005:
>
> +       CREATEDB_WAL_LOG = 0,
> +       CREATEDB_FILE_COPY = 1
>
> I still think you don't need = 0 and = 1 here.

Done

> I'll probably go through and do a pass over the comments once you post
> the next version of this. There seems to be work needed in a bunch of
> places, but it probably makes more sense for me to go through and
> adjust the things that seem to need it rather than listing a bunch of
> changes for you to make.

Sure, thanks.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
Few comments on the latest patch:

-               /* We need to construct the pathname for this database */
-               dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+               if (xlrec->dbid != InvalidOid)
+                       dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
+               else
+                       dbpath = pstrdup("global");

Do we really need this change? Is GetDatabasePath() alone not capable
of handling it?

==

+static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple,
+
                                    Oid tbid, Oid dbid,
+
                                    char *srcpath);
+static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid,
+
            Oid dbid, char *srcpath,
+
            List *rnodelist, Snapshot snapshot);
+static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char
*srcpath);

I think we can shorten these function names to probably
ScanSourceDBPgClassRel(), ScanSourceDBPgClassTuple() and likewise?

--
With Regards,
Ashutosh Sharma.

On Tue, Mar 15, 2022 at 3:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Mar 14, 2022 at 10:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > I think it would make sense to have two different WAL records e.g.
> > XLOG_DBASE_CREATE_WAL_LOG and XLOG_DBASE_CREATE_FILE_COPY. Then it's
> > easy to see how this could be generalized to other strategies in the
> > future.
>
> Done that way.  In dbase_desc(), for XLOG_DBASE_CREATE_FILE_COPY I
> have kept the older description i.e. "copy dir" and for
> XLOG_DBASE_CREATE_WAL_LOG it is "create dir", because logically the
> first one is actually copying and the second one is just creating the
> directory.  Do you think we should be using "copy dir file_copy" and
> "copy dir wal_log" in the description as well?
>
> > On Mon, Mar 14, 2022 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > I was looking at 0001 and 0002 again and realized that I swapped the
> > > names load_relmap_file() and read_relmap_file() from what I should
> > > have done. Here's a revised version. With this, read_relmap_file() and
> > > write_relmap_file() become functions that just read and write the file
> > > without touching any global variables, and load_relmap_file() is the
> > > function that reads data from the file and puts it into a global
> > > variable, which seems more sensible than the way I had it before.
>
> Okay, I have included this patch and rebased other patches on top of that.
>
> > Regarding 0003 and 0005, I'm not a fan of 'bool isunlogged'. I think
> > 'bool permanent' would be better (note BM_PERMANENT). This would
> > involve reversing true and false.
>
> Okay changed.
>
> > Regarding 0005:
> >
> > +       CREATEDB_WAL_LOG = 0,
> > +       CREATEDB_FILE_COPY = 1
> >
> > I still think you don't need = 0 and = 1 here.
>
> Done
>
> > I'll probably go through and do a pass over the comments once you post
> > the next version of this. There seems to be work needed in a bunch of
> > places, but it probably makes more sense for me to go through and
> > adjust the things that seem to need it rather than listing a bunch of
> > changes for you to make.
>
> Sure, thanks.
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Few comments on the latest patch:
>
> -               /* We need to construct the pathname for this database */
> -               dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> +               if (xlrec->dbid != InvalidOid)
> +                       dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> +               else
> +                       dbpath = pstrdup("global");
>
> Do we really need this change? Is GetDatabasePath() alone not capable
> of handling it?

Well, I mean, that function has a special case for
GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid
is 0.

> I think we can shorten these function names to probably
> ScanSourceDBPgClassRel(), ScanSourceDBPgClassTuple() and likewise?

We could, but I don't think it's an improvement.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Ashutosh Sharma
Date:
On Tue, Mar 15, 2022 at 10:17 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Few comments on the latest patch:
> >
> > -               /* We need to construct the pathname for this database */
> > -               dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> > +               if (xlrec->dbid != InvalidOid)
> > +                       dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> > +               else
> > +                       dbpath = pstrdup("global");
> >
> > Do we really need this change? Is GetDatabasePath() alone not capable
> > of handling it?
>
> Well, I mean, that function has a special case for
> GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid
> is 0.
>

Wouldn't this be true only in case of a shared map file (when dbOid is
Invalid and tblspcOid is globaltablespace_oid) or am I missing
something?

--
With Regards,
Ashutosh Sharma.



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Tue, Mar 15, 2022 at 1:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > Few comments on the latest patch:
> > >
> > > -               /* We need to construct the pathname for this database */
> > > -               dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> > > +               if (xlrec->dbid != InvalidOid)
> > > +                       dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> > > +               else
> > > +                       dbpath = pstrdup("global");
> > >
> > > Do we really need this change? Is GetDatabasePath() alone not capable
> > > of handling it?
> >
> > Well, I mean, that function has a special case for
> > GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid
> > is 0.
> >
>
> Wouldn't this be true only in case of a shared map file (when dbOid is
> Invalid and tblspcOid is globaltablespace_oid) or am I missing
> something?

*facepalm*

Good catch, sorry that I'm slow on the uptake today.

v3 attached.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Tue, Mar 15, 2022 at 11:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Mar 15, 2022 at 1:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > > > Few comments on the latest patch:
> > > >
> > > > -               /* We need to construct the pathname for this database */
> > > > -               dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> > > > +               if (xlrec->dbid != InvalidOid)
> > > > +                       dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid);
> > > > +               else
> > > > +                       dbpath = pstrdup("global");
> > > >
> > > > Do we really need this change? Is GetDatabasePath() alone not capable
> > > > of handling it?
> > >
> > > Well, I mean, that function has a special case for
> > > GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid
> > > is 0.
> > >
> >
> > Wouldn't this be true only in case of a shared map file (when dbOid is
> > Invalid and tblspcOid is globaltablespace_oid) or am I missing
> > something?
>
> *facepalm*
>
> Good catch, sorry that I'm slow on the uptake today.
>
> v3 attached.

Thanks Ashutosh and Robert.  Other pacthes cleanly applied on this
patch still generated a new version so that we can find all patches
together.  There are no other changes.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Wed, Mar 16, 2022 at 12:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Thanks Ashutosh and Robert.  Other pacthes cleanly applied on this
> patch still generated a new version so that we can find all patches
> together.  There are no other changes.

I committed my v3 of my refactoring patch, here 0001.

I'm working over the comments in the rest of the patch series and will
post an updated version when I get done. I think I will likely merge
all the remaining patches together just to make it simpler to manage;
we can split things out again if we need to do that.

One question that occurred to me when looking this over is whether, or
why, it's safe against concurrent smgr invalidations. It seems to me
that every loop in the new CREATE DATABASE code needs to
CHECK_FOR_INTERRUPTS() -- some do already -- and when they do that, I
think we might receive an invalidation message that causes us to
smgrclose() some or all of the things where we previously did
smgropen(). I don't quite see why that can't cause problems here. I
tried running the src/bin/scripts regression tests with
debug_discard_caches=1 and none of the tests failed, so there may very
well be a reason why this is actually totally fine, but I don't know
what it is. On the other hand, it may be that things went horribly
wrong and the tests are just smart enough to catch it, or maybe
there's a problematic scenario which those tests just don't hit. I
don't know. Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Mar 18, 2022 at 1:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Mar 16, 2022 at 12:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Thanks Ashutosh and Robert.  Other pacthes cleanly applied on this
> > patch still generated a new version so that we can find all patches
> > together.  There are no other changes.
>
> I committed my v3 of my refactoring patch, here 0001.
>
> I'm working over the comments in the rest of the patch series and will
> post an updated version when I get done. I think I will likely merge
> all the remaining patches together just to make it simpler to manage;
> we can split things out again if we need to do that.

Thanks for the effort.

> One question that occurred to me when looking this over is whether, or
> why, it's safe against concurrent smgr invalidations.

We are only accessing the smgr of the source database and the
destination database.  And there is no one else that can be connected
to the source db and the destination db is not visible to anyone.  So
do we really need to worry about the concurrent smgr invalidation?
What am I missing?

It seems to me
> that every loop in the new CREATE DATABASE code needs to
> CHECK_FOR_INTERRUPTS() -- some do already -- and when they do that,

Yes, the pg_class reading code is missing this check so we need to put
it.  But copying code like
CreateDatabaseUsingWalLog() have it inside the deepest loop in
RelationCopyStorageUsingBuffer() and similarly
CreateDatabaseUsingFileCopy() have it in copydir().  Maybe we should
put it in all loop so that we do not skip checking due to some
condition.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Fri, Mar 18, 2022 at 12:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > One question that occurred to me when looking this over is whether, or
> > why, it's safe against concurrent smgr invalidations.
>
> We are only accessing the smgr of the source database and the
> destination database.  And there is no one else that can be connected
> to the source db and the destination db is not visible to anyone.  So
> do we really need to worry about the concurrent smgr invalidation?
> What am I missing?

A sinval reset can occur at any moment due to an overflow of the
queue. That acts as a universal reset of everything. So you can't
reason on the basis of what somebody might be sending.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Sun, Mar 20, 2022 at 12:03 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Mar 18, 2022 at 12:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > One question that occurred to me when looking this over is whether, or
> > > why, it's safe against concurrent smgr invalidations.
> >
> > We are only accessing the smgr of the source database and the
> > destination database.  And there is no one else that can be connected
> > to the source db and the destination db is not visible to anyone.  So
> > do we really need to worry about the concurrent smgr invalidation?
> > What am I missing?
>
> A sinval reset can occur at any moment due to an overflow of the
> queue. That acts as a universal reset of everything. So you can't
> reason on the basis of what somebody might be sending.

I thought that way because IIUC, when we are locking the database
tuple we are ensuring that we are calling
ReceiveSharedInvalidMessages() right?   And IIUC
ReceiveSharedInvalidMessages(), is designed such a way that it will
consume all the outstanding messages and that's the reason it loops
multiple times if it identifies that the queue is full.  And if my
assumption here is correct then I think it is also correct that now we
only need to worry about anyone generating new invalidations and that
is not possible in this case.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Sun, Mar 20, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I thought that way because IIUC, when we are locking the database
> tuple we are ensuring that we are calling
> ReceiveSharedInvalidMessages() right?   And IIUC
> ReceiveSharedInvalidMessages(), is designed such a way that it will
> consume all the outstanding messages and that's the reason it loops
> multiple times if it identifies that the queue is full.  And if my
> assumption here is correct then I think it is also correct that now we
> only need to worry about anyone generating new invalidations and that
> is not possible in this case.

Well, I don't see how that chain of logic addresses my concern about
sinval reset.

Mind you, I'm not sure there's an actual problem here, because I tried
testing the patch with debug_discard_caches=1 and nothing failed. But
I still don't understand WHY nothing failed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Mon, Mar 21, 2022 at 7:07 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Mar 20, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I thought that way because IIUC, when we are locking the database
> > tuple we are ensuring that we are calling
> > ReceiveSharedInvalidMessages() right?   And IIUC
> > ReceiveSharedInvalidMessages(), is designed such a way that it will
> > consume all the outstanding messages and that's the reason it loops
> > multiple times if it identifies that the queue is full.  And if my
> > assumption here is correct then I think it is also correct that now we
> > only need to worry about anyone generating new invalidations and that
> > is not possible in this case.
>
> Well, I don't see how that chain of logic addresses my concern about
> sinval reset.
>
> Mind you, I'm not sure there's an actual problem here, because I tried
> testing the patch with debug_discard_caches=1 and nothing failed. But
> I still don't understand WHY nothing failed.

Okay, I see what you are saying.  Yeah this looks like a problem to me
as well.  I will try to reproduce this issue.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Mon, Mar 21, 2022 at 8:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Mar 21, 2022 at 7:07 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Sun, Mar 20, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > I thought that way because IIUC, when we are locking the database
> > > tuple we are ensuring that we are calling
> > > ReceiveSharedInvalidMessages() right?   And IIUC
> > > ReceiveSharedInvalidMessages(), is designed such a way that it will
> > > consume all the outstanding messages and that's the reason it loops
> > > multiple times if it identifies that the queue is full.  And if my
> > > assumption here is correct then I think it is also correct that now we
> > > only need to worry about anyone generating new invalidations and that
> > > is not possible in this case.
> >
> > Well, I don't see how that chain of logic addresses my concern about
> > sinval reset.
> >
> > Mind you, I'm not sure there's an actual problem here, because I tried
> > testing the patch with debug_discard_caches=1 and nothing failed. But
> > I still don't understand WHY nothing failed.
>
> Okay, I see what you are saying.  Yeah this looks like a problem to me
> as well.  I will try to reproduce this issue.

I tried to debug the case but I realized that somehow
CHECK_FOR_INTERRUPTS() is not calling the
AcceptInvalidationMessages() and I could not find the same while
looking into the code as well.   While debugging I noticed that
AcceptInvalidationMessages() is called multiple times but that is only
through LockRelationId() but while locking the relation we had already
closed the previous smgr because at a time we keep only one smgr open.
And that's the reason it is not hitting the issue which we think it
could. Is there any condition under which it will call
AcceptInvalidationMessages() through CHECK_FOR_INTERRUPTS() ? because
I could not see while debugging as well as in code.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Mon, Mar 21, 2022 at 11:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I tried to debug the case but I realized that somehow
> CHECK_FOR_INTERRUPTS() is not calling the
> AcceptInvalidationMessages() and I could not find the same while
> looking into the code as well.   While debugging I noticed that
> AcceptInvalidationMessages() is called multiple times but that is only
> through LockRelationId() but while locking the relation we had already
> closed the previous smgr because at a time we keep only one smgr open.
> And that's the reason it is not hitting the issue which we think it
> could. Is there any condition under which it will call
> AcceptInvalidationMessages() through CHECK_FOR_INTERRUPTS() ? because
> I could not see while debugging as well as in code.

Yeah, I think the reason you can't find it is that it's not there. I
was confused in what I wrote earlier. I think we only process sinval
catchups when we're idle, not at every CHECK_FOR_INTERRUPTS(). And I
think the reason for that is precisely that it would be hard to write
correct code otherwise, since invalidations might then get processed
in a lot more places. So ... I guess all we really need to do here is
avoid assuming that the results of smgropen() are valid across any
code that might acquire relation locks. Which I think the code already
does.

But on a related note, why doesn't CreateDatabaseUsingWalLog() acquire
locks on both the source and destination relations? It looks like
you're only taking locks for the source template database ... but I
thought the intention here was to make sure that we didn't pull pages
into shared_buffers without holding a lock on the relation and/or the
database? I suppose the point is that while the template database
might be concurrently dropped, nobody can be doing anything
concurrently to the target database because nobody knows that it
exists yet. Still, I think that this would be the only case where we
let pages into shared_buffers without a relation or database lock,
though maybe I'm confused about this point, too. If not, perhaps we
should consider locking the target database OID and each relation OID
as we are copying it?

I guess I'm imagining that there might be more code pathways in the
future that want to ensure that there are no remaining buffers for
some particular database or relation OID. It seems natural to want to
be able to take some lock that prevents buffers from being added, and
then go and get rid of all the ones that are there already. But I
admit I can't quite think of a concrete case where we'd want to do
something like this where the patch as coded would be a problem. I'm
just thinking perhaps taking locks is fairly harmless and might avoid
some hypothetical problem later.

Thoughts?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Mon, Mar 21, 2022 at 11:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Mar 21, 2022 at 11:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I tried to debug the case but I realized that somehow
> > CHECK_FOR_INTERRUPTS() is not calling the
> > AcceptInvalidationMessages() and I could not find the same while
> > looking into the code as well.   While debugging I noticed that
> > AcceptInvalidationMessages() is called multiple times but that is only
> > through LockRelationId() but while locking the relation we had already
> > closed the previous smgr because at a time we keep only one smgr open.
> > And that's the reason it is not hitting the issue which we think it
> > could. Is there any condition under which it will call
> > AcceptInvalidationMessages() through CHECK_FOR_INTERRUPTS() ? because
> > I could not see while debugging as well as in code.
>
> Yeah, I think the reason you can't find it is that it's not there. I
> was confused in what I wrote earlier. I think we only process sinval
> catchups when we're idle, not at every CHECK_FOR_INTERRUPTS(). And I
> think the reason for that is precisely that it would be hard to write
> correct code otherwise, since invalidations might then get processed
> in a lot more places. So ... I guess all we really need to do here is
> avoid assuming that the results of smgropen() are valid across any
> code that might acquire relation locks. Which I think the code already
> does.
>
> But on a related note, why doesn't CreateDatabaseUsingWalLog() acquire
> locks on both the source and destination relations? It looks like
> you're only taking locks for the source template database ... but I
> thought the intention here was to make sure that we didn't pull pages
> into shared_buffers without holding a lock on the relation and/or the
> database? I suppose the point is that while the template database
> might be concurrently dropped, nobody can be doing anything
> concurrently to the target database because nobody knows that it
> exists yet. Still, I think that this would be the only case where we
> let pages into shared_buffers without a relation or database lock,
> though maybe I'm confused about this point, too. If not, perhaps we
> should consider locking the target database OID and each relation OID
> as we are copying it?
>
> I guess I'm imagining that there might be more code pathways in the
> future that want to ensure that there are no remaining buffers for
> some particular database or relation OID. It seems natural to want to
> be able to take some lock that prevents buffers from being added, and
> then go and get rid of all the ones that are there already. But I
> admit I can't quite think of a concrete case where we'd want to do
> something like this where the patch as coded would be a problem. I'm
> just thinking perhaps taking locks is fairly harmless and might avoid
> some hypothetical problem later.
>
> Thoughts?

I think this make sense.  I haven't changed the original patch as you
told you were improving on some comments, so in order to avoid
conflict I have created this add on patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Tue, Mar 22, 2022 at 10:28 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
>
> I think this make sense.  I haven't changed the original patch as you
> told you were improving on some comments, so in order to avoid
> conflict I have created this add on patch.
>

In my previous patch mistakenly I used src_dboid instead of
dest_dboid.  Fixed in this version.  For destination db I have used
lock mode as  AccessSharedLock.  Logically if we see access wise we
don't want anyone else to be accessing that db but that is anyway
protected because it is not visible to anyone else.  So I think
AccessSharedLock should be correct here because we are just taking
this lock because we are accessing pages in shared buffers from this
database's relations.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Tue, Mar 22, 2022 at 5:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> In my previous patch mistakenly I used src_dboid instead of
> dest_dboid.  Fixed in this version.  For destination db I have used
> lock mode as  AccessSharedLock.  Logically if we see access wise we
> don't want anyone else to be accessing that db but that is anyway
> protected because it is not visible to anyone else.  So I think
> AccessSharedLock should be correct here because we are just taking
> this lock because we are accessing pages in shared buffers from this
> database's relations.

Here's my worked-over version of your previous patch. I haven't tried
to incorporate your incremental patch that you just posted.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Tue, Mar 22, 2022 at 11:23 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Here's my worked-over version of your previous patch. I haven't tried
> to incorporate your incremental patch that you just posted.

Also, please have a look at the XXX comments that I added in a few
places where I think you need to make further changes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-03-22 11:23:16 -0400, Robert Haas wrote:
> From 116bcdb6174a750b7ef7ae05ef6f39cebaf9bcf5 Mon Sep 17 00:00:00 2001
> From: Robert Haas <rhaas@postgresql.org>
> Date: Tue, 22 Mar 2022 11:22:26 -0400
> Subject: [PATCH v1] Add new block-by-block strategy for CREATE DATABASE.

I might have missed it because I just skimmed the patch. But I still think it
should contain a comment detailing why accessing catalogs from another
database is safe in this instance, and perhaps a comment or three in places
that could break it (e.g. snapshot computation, horizon stuff).

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Tue, Mar 22, 2022 at 11:42 AM Andres Freund <andres@anarazel.de> wrote:
> I might have missed it because I just skimmed the patch. But I still think it
> should contain a comment detailing why accessing catalogs from another
> database is safe in this instance, and perhaps a comment or three in places
> that could break it (e.g. snapshot computation, horizon stuff).

Please see the function header comment for ScanSourceDatabasePgClass.
I don't quite see how changes in those places would break this, but if
you want to be more specific perhaps I will see the light?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Mon, Mar 21, 2022 at 2:23 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 21, 2022 at 11:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I tried to debug the case but I realized that somehow
> > CHECK_FOR_INTERRUPTS() is not calling the
> > AcceptInvalidationMessages() and I could not find the same while
> > looking into the code as well.   While debugging I noticed that
> > AcceptInvalidationMessages() is called multiple times but that is only
> > through LockRelationId() but while locking the relation we had already
> > closed the previous smgr because at a time we keep only one smgr open.
> > And that's the reason it is not hitting the issue which we think it
> > could. Is there any condition under which it will call
> > AcceptInvalidationMessages() through CHECK_FOR_INTERRUPTS() ? because
> > I could not see while debugging as well as in code.
>
> Yeah, I think the reason you can't find it is that it's not there. I
> was confused in what I wrote earlier. I think we only process sinval
> catchups when we're idle, not at every CHECK_FOR_INTERRUPTS(). And I
> think the reason for that is precisely that it would be hard to write
> correct code otherwise, since invalidations might then get processed
> in a lot more places. So ... I guess all we really need to do here is
> avoid assuming that the results of smgropen() are valid across any
> code that might acquire relation locks. Which I think the code already
> does.

So I talked to Andres and Thomas about this and they told me that I
was right to worry about this problem. Over on the thread about "wrong
fds used for refilenodes after pg_upgrade relfilenode changes
Reply-To:" there is a plan to make use ProcSignalBarrier to make smgr
objects disappear, and ProcSignalBarrier can be processed at any
CHECK_FOR_INTERRUPTS(), so then we'd have a problem here. Commit
f10f0ae420ee62400876ab34dca2c09c20dcd030 established a policy that you
should always re-fetch the smgr object instead of reusing one you've
already got, and even before that it was known to be unsafe to keep
them around for any period of time, because anything that opened a
relation, including a syscache lookup, could potentially accept
invalidations. So most of our code is already hardened against the
possibility of smgr objects disappearing. I have a feeling there may
be some that isn't, but it would be good if this patch didn't
introduce more such code at the same time that patch is trying to
introduce more ways to get rid of smgr objects. It was suggested to me
that what this patch ought to be doing is calling
CreateFakeRelcacheEntry() and then using RelationGetSmgr(fakerel)
every time we need the SmgrRelation, without ever keeping it around
for any amount of code. That way, if the smgr relation gets closed out
from under us at a CHECK_FOR_INTERRUPTS(), we'll just recreate it at
the next RelationGetSmgr() call.

Andres also noted that he thinks the patch performs redundant cleanup,
because of the fact that it uses RelationCreateStorage. That will
arrange to remove files on abort, but createdb() also has its own
mechanism for that. It doesn't seem like a thing to do twice in two
different ways.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Tue, Mar 22, 2022 at 8:53 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Mar 22, 2022 at 5:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > In my previous patch mistakenly I used src_dboid instead of
> > dest_dboid.  Fixed in this version.  For destination db I have used
> > lock mode as  AccessSharedLock.  Logically if we see access wise we
> > don't want anyone else to be accessing that db but that is anyway
> > protected because it is not visible to anyone else.  So I think
> > AccessSharedLock should be correct here because we are just taking
> > this lock because we are accessing pages in shared buffers from this
> > database's relations.
>
> Here's my worked-over version of your previous patch. I haven't tried
> to incorporate your incremental patch that you just posted.

Thanks for working on the comments.  Please find the updated version
which include below changes
- Worked on the XXX comments added by you.
- Added database level lock for the target database as well.
- Used fake relcache and removed direct access to the smgr, I think it
was not really necessary in
ScanSourceDatabasePgClass() because we are using it for a very short
period of time but still I have changed it, let me know if you think
that it is unneccessary to create the fake relcache here.
- Removed extra space in createdb.c and fixed test case.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Mar 23, 2022 at 2:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> So I talked to Andres and Thomas about this and they told me that I
> was right to worry about this problem. Over on the thread about "wrong
> fds used for refilenodes after pg_upgrade relfilenode changes
> Reply-To:" there is a plan to make use ProcSignalBarrier to make smgr
> objects disappear, and ProcSignalBarrier can be processed at any
> CHECK_FOR_INTERRUPTS(), so then we'd have a problem here. Commit
> f10f0ae420ee62400876ab34dca2c09c20dcd030 established a policy that you
> should always re-fetch the smgr object instead of reusing one you've
> already got, and even before that it was known to be unsafe to keep
> them around for any period of time, because anything that opened a
> relation, including a syscache lookup, could potentially accept
> invalidations. So most of our code is already hardened against the
> possibility of smgr objects disappearing. I have a feeling there may
> be some that isn't, but it would be good if this patch didn't
> introduce more such code at the same time that patch is trying to
> introduce more ways to get rid of smgr objects. It was suggested to me
> that what this patch ought to be doing is calling
> CreateFakeRelcacheEntry() and then using RelationGetSmgr(fakerel)
> every time we need the SmgrRelation, without ever keeping it around
> for any amount of code. That way, if the smgr relation gets closed out
> from under us at a CHECK_FOR_INTERRUPTS(), we'll just recreate it at
> the next RelationGetSmgr() call.

Okay, I have changed this in my latest version of the patch.


> Andres also noted that he thinks the patch performs redundant cleanup,
> because of the fact that it uses RelationCreateStorage. That will
> arrange to remove files on abort, but createdb() also has its own
> mechanism for that. It doesn't seem like a thing to do twice in two
> different ways.

Okay this is an interesting point.  So one option is that in case of
failure while using the wal log strategy we do not remove the database
directory, because an abort transaction will take care of removing the
relation file.  But then in failure case we will leave the orphaned
database directory with version file and the relmap file.  Another
option is to do the redundant cleanup as we are doing now.  Any other
options?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Wed, Mar 23, 2022 at 4:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Okay this is an interesting point.  So one option is that in case of
> failure while using the wal log strategy we do not remove the database
> directory, because an abort transaction will take care of removing the
> relation file.  But then in failure case we will leave the orphaned
> database directory with version file and the relmap file.  Another
> option is to do the redundant cleanup as we are doing now.  Any other
> options?

I think our overriding goal should be to get everything using one
mechanism. It doesn't look straightforward to get everything to go
through the PendingRelDelete mechanism, because as you say, it can't
handle non-relation files or directories. However, what if we opt out
of that mechanism? We could do that either by not using
RelationCreateStorage() in the first place and directly calling
smgrcreate(), or by using RelationPreserveStorage() afterwards to yank
the file back out of the list.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Mar 23, 2022 at 5:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Mar 23, 2022 at 4:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Okay this is an interesting point.  So one option is that in case of
> > failure while using the wal log strategy we do not remove the database
> > directory, because an abort transaction will take care of removing the
> > relation file.  But then in failure case we will leave the orphaned
> > database directory with version file and the relmap file.  Another
> > option is to do the redundant cleanup as we are doing now.  Any other
> > options?
>
> I think our overriding goal should be to get everything using one
> mechanism. It doesn't look straightforward to get everything to go
> through the PendingRelDelete mechanism, because as you say, it can't
> handle non-relation files or directories. However, what if we opt out
> of that mechanism? We could do that either by not using
> RelationCreateStorage() in the first place and directly calling
> smgrcreate(), or by using RelationPreserveStorage() afterwards to yank
> the file back out of the list.

I think directly using smgrcreate() is a better idea instead of first
registering and then unregistering it.   I have made that change in
the attached patch.  After this change now we can merge creating the
MAIN_FORKNUM also in the loop below where we are creating other
fork[1] with one extra condition but I think current code is in more
sync with the other code where we are doing the similar things so I
have not merged it in the loop.  Please let me know if you think
otherwise.

[1]
+    /*
+     * Create and copy all forks of the relation.  We are not using
+     * RelationCreateStorage() as it is registering the cleanup for the
+     * underlying relation storage on the transaction abort.  But during create
+     * database failure, we have a separate cleanup mechanism for the whole
+     * database directory. Therefore, we don't need to register cleanup for
+     * each individual relation storage.
+     */
+    smgrcreate(RelationGetSmgr(dst_rel), MAIN_FORKNUM, false);
+    if (permanent)
+        log_smgrcreate(&dst_rnode, MAIN_FORKNUM);
+
+    /* copy main fork. */
+    RelationCopyStorageUsingBuffer(src_rel, dst_rel, MAIN_FORKNUM, permanent);
+
+    /* copy those extra forks that exist */
+    for (ForkNumber forkNum = MAIN_FORKNUM + 1;
+         forkNum <= MAX_FORKNUM; forkNum++)
+    {
+        if (smgrexists(RelationGetSmgr(src_rel), forkNum))
+        {
+            smgrcreate(RelationGetSmgr(dst_rel), forkNum, false);
+

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Wed, Mar 23, 2022 at 9:19 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I think directly using smgrcreate() is a better idea instead of first
> registering and then unregistering it.   I have made that change in
> the attached patch.  After this change now we can merge creating the
> MAIN_FORKNUM also in the loop below where we are creating other
> fork[1] with one extra condition but I think current code is in more
> sync with the other code where we are doing the similar things so I
> have not merged it in the loop.  Please let me know if you think
> otherwise.

Generally I think our practice is that we do the main fork
unconditionally (because it should always be there) and the others
only if they exist. I suggest that you make this consistent with that,
but you could do it like if (forkNum != MAIN_FORKNUM &&
!smgrexists(...)) continue if that seems nicer.

Do you think that this version handles pending syncs correctly? I
think perhaps that is overlooked.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Mar 23, 2022 at 7:03 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Mar 23, 2022 at 9:19 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I think directly using smgrcreate() is a better idea instead of first
> > registering and then unregistering it.   I have made that change in
> > the attached patch.  After this change now we can merge creating the
> > MAIN_FORKNUM also in the loop below where we are creating other
> > fork[1] with one extra condition but I think current code is in more
> > sync with the other code where we are doing the similar things so I
> > have not merged it in the loop.  Please let me know if you think
> > otherwise.
>
> Generally I think our practice is that we do the main fork
> unconditionally (because it should always be there) and the others
> only if they exist. I suggest that you make this consistent with that,
> but you could do it like if (forkNum != MAIN_FORKNUM &&
> !smgrexists(...)) continue if that seems nicer.

Maybe we can do that.

> Do you think that this version handles pending syncs correctly? I
> think perhaps that is overlooked.

Yeah I missed that.  So options are either we go to the other approach
and call RelationPreserveStorage() after
RelationCreateStorage(), or we expose the AddPendingSync() function
from the storage layer and then conditionally use it.  I think if we
are planning to expose this api then we better rename it to
RelationAddPendingSync().  Honestly, I do not have any specific
preference here.  I can try both the approaches and send both if you
or anyone else do not have any preference here?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-03-23 18:49:11 +0530, Dilip Kumar wrote:
> I think directly using smgrcreate() is a better idea instead of first
> registering and then unregistering it.   I have made that change in
> the attached patch.  After this change now we can merge creating the
> MAIN_FORKNUM also in the loop below where we are creating other
> fork[1] with one extra condition but I think current code is in more
> sync with the other code where we are doing the similar things so I
> have not merged it in the loop.  Please let me know if you think
> otherwise.

FWIW, this fails tests: https://cirrus-ci.com/build/4929662173315072
https://cirrus-ci.com/task/6651773434724352?logs=test_bin#L121
https://cirrus-ci.com/task/6088823481303040?logs=test_world#L2377

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Mar 23, 2022 at 9:13 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-03-23 18:49:11 +0530, Dilip Kumar wrote:
> > I think directly using smgrcreate() is a better idea instead of first
> > registering and then unregistering it.   I have made that change in
> > the attached patch.  After this change now we can merge creating the
> > MAIN_FORKNUM also in the loop below where we are creating other
> > fork[1] with one extra condition but I think current code is in more
> > sync with the other code where we are doing the similar things so I
> > have not merged it in the loop.  Please let me know if you think
> > otherwise.
>
> FWIW, this fails tests: https://cirrus-ci.com/build/4929662173315072
> https://cirrus-ci.com/task/6651773434724352?logs=test_bin#L121
> https://cirrus-ci.com/task/6088823481303040?logs=test_world#L2377

Strange to see that these changes are making a failure in the
file_copy strategy[1] because we made changes only related to the
wal_log strategy.  However I will look into this.  Thanks.
[1]
Failed test 'createdb -T foobar2 foobar5 -S file_copy exit code 0'


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Mar 23, 2022 at 9:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Mar 23, 2022 at 7:03 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Wed, Mar 23, 2022 at 9:19 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > I think directly using smgrcreate() is a better idea instead of first
> > > registering and then unregistering it.   I have made that change in
> > > the attached patch.  After this change now we can merge creating the
> > > MAIN_FORKNUM also in the loop below where we are creating other
> > > fork[1] with one extra condition but I think current code is in more
> > > sync with the other code where we are doing the similar things so I
> > > have not merged it in the loop.  Please let me know if you think
> > > otherwise.
> >
> > Generally I think our practice is that we do the main fork
> > unconditionally (because it should always be there) and the others
> > only if they exist. I suggest that you make this consistent with that,
> > but you could do it like if (forkNum != MAIN_FORKNUM &&
> > !smgrexists(...)) continue if that seems nicer.
>
> Maybe we can do that.
>
> > Do you think that this version handles pending syncs correctly? I
> > think perhaps that is overlooked.
>
> Yeah I missed that.  So options are either we go to the other approach
> and call RelationPreserveStorage() after
> RelationCreateStorage(),

Here is the patch with this approach, I am not sending both patches
with different approaches in the same mail otherwise cfbot might
generate conflict while applying the patch I think, so I will send it
in a seperate mail.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Mar 23, 2022 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Mar 23, 2022 at 9:13 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2022-03-23 18:49:11 +0530, Dilip Kumar wrote:
> > > I think directly using smgrcreate() is a better idea instead of first
> > > registering and then unregistering it.   I have made that change in
> > > the attached patch.  After this change now we can merge creating the
> > > MAIN_FORKNUM also in the loop below where we are creating other
> > > fork[1] with one extra condition but I think current code is in more
> > > sync with the other code where we are doing the similar things so I
> > > have not merged it in the loop.  Please let me know if you think
> > > otherwise.
> >
> > FWIW, this fails tests: https://cirrus-ci.com/build/4929662173315072
> > https://cirrus-ci.com/task/6651773434724352?logs=test_bin#L121
> > https://cirrus-ci.com/task/6088823481303040?logs=test_world#L2377
>
> Strange to see that these changes are making a failure in the
> file_copy strategy[1] because we made changes only related to the
> wal_log strategy.  However I will look into this.  Thanks.
> [1]
> Failed test 'createdb -T foobar2 foobar5 -S file_copy exit code 0'

I could not see any reason for it to fail, and I could not reproduce
it either.  Is it possible to access the server log for this cfbot
failure?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-03-23 22:29:40 +0530, Dilip Kumar wrote:
> I could not see any reason for it to fail, and I could not reproduce
> it either.  Is it possible to access the server log for this cfbot
> failure?

Yes, near the top, below the cpu / memory graphs, there's a file
navigator. Should have all files ending with *.log or starting with
regress_log_*.

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Mar 23, 2022 at 10:37 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-03-23 22:29:40 +0530, Dilip Kumar wrote:
> > I could not see any reason for it to fail, and I could not reproduce
> > it either.  Is it possible to access the server log for this cfbot
> > failure?
>
> Yes, near the top, below the cpu / memory graphs, there's a file
> navigator. Should have all files ending with *.log or starting with
> regress_log_*.

Okay, I think I have found the reasoning for this failure, basically,
if we see the below logs then the second statement is failing with
foobar5 already exists and that is because some of the above test case
is conditionally generating the same name.  So the fix is to use a
different name.

2022-03-23 13:53:54.554 UTC [32647][client backend]
[020_createdb.pl][3/12:0] LOG:  statement: CREATE DATABASE foobar5
TEMPLATE template0 LOCALE_PROVIDER icu ICU_LOCALE 'en';
......
2022-03-23 13:53:55.374 UTC [32717][client backend]
[020_createdb.pl][3/46:0] LOG:  statement: CREATE DATABASE foobar5
STRATEGY file_copy TEMPLATE foobar2;
2022-03-23 13:53:55.390 UTC [32717][client backend]
[020_createdb.pl][3/46:0] ERROR:  database "foobar5" already exists

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Mar 23, 2022 at 10:50 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Mar 23, 2022 at 10:37 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2022-03-23 22:29:40 +0530, Dilip Kumar wrote:
> > > I could not see any reason for it to fail, and I could not reproduce
> > > it either.  Is it possible to access the server log for this cfbot
> > > failure?
> >
> > Yes, near the top, below the cpu / memory graphs, there's a file
> > navigator. Should have all files ending with *.log or starting with
> > regress_log_*.
>
> Okay, I think I have found the reasoning for this failure, basically,
> if we see the below logs then the second statement is failing with
> foobar5 already exists and that is because some of the above test case
> is conditionally generating the same name.  So the fix is to use a
> different name.

In the latest version I have fixed this issue by using a non
conflicting name, because when it was compiled with-icu the foobar5
was already used and we were seeing failure.  Apart from this I have
fixed the duplicate cleanup problem by passing an extra parameter to
RelationCreateStorage, which decides whether to register for on-abort
delete or not and added the comments for the same.  IMHO this looks
the most cleaner way to do it, please check the patch and let me know
your thoughts.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Mar 24, 2022 at 1:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> In the latest version I have fixed this issue by using a non
> conflicting name, because when it was compiled with-icu the foobar5
> was already used and we were seeing failure.  Apart from this I have
> fixed the duplicate cleanup problem by passing an extra parameter to
> RelationCreateStorage, which decides whether to register for on-abort
> delete or not and added the comments for the same.  IMHO this looks
> the most cleaner way to do it, please check the patch and let me know
> your thoughts.

I think that might be an OK way to do it. I think if we were starting
from scratch we'd probably want to come up with some better system,
but that's true of a lot of things.

I went over your version and changed some comments. I also added
documentation for the new wait event. Here's a new version.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Mar 24, 2022 at 9:29 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Mar 24, 2022 at 1:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > In the latest version I have fixed this issue by using a non
> > conflicting name, because when it was compiled with-icu the foobar5
> > was already used and we were seeing failure.  Apart from this I have
> > fixed the duplicate cleanup problem by passing an extra parameter to
> > RelationCreateStorage, which decides whether to register for on-abort
> > delete or not and added the comments for the same.  IMHO this looks
> > the most cleaner way to do it, please check the patch and let me know
> > your thoughts.
>
> I think that might be an OK way to do it. I think if we were starting
> from scratch we'd probably want to come up with some better system,
> but that's true of a lot of things.

Right.

> I went over your version and changed some comments. I also added
> documentation for the new wait event. Here's a new version.
>

Thanks, I have gone through your changes in comments and docs and those LGTM.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Mar 24, 2022 at 12:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Thanks, I have gone through your changes in comments and docs and those LGTM.

It looks like this patch will need to be updated for Alvaro's commit
49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. The newly added test
029_replay_tsp_drops.pl fails with this patch applied. The standby log
shows:

2022-03-25 10:00:10.022 EDT [38209] LOG:  entering standby mode
2022-03-25 10:00:10.024 EDT [38209] LOG:  redo starts at 0/3000028
2022-03-25 10:00:10.062 EDT [38209] FATAL:  could not create directory
"pg_tblspc/16385/PG_15_202203241/16390": No such file or directory
2022-03-25 10:00:10.062 EDT [38209] CONTEXT:  WAL redo at 0/43EBD88
for Database/CREATE_WAL_LOG: create dir 16385/16390

On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need
to mirror some of the logic that was added to the replay code for the
existing strategy, but I haven't figured out the details.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Mar 25, 2022 at 7:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Mar 24, 2022 at 12:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Thanks, I have gone through your changes in comments and docs and those LGTM.
>
> It looks like this patch will need to be updated for Alvaro's commit
> 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. The newly added test
> 029_replay_tsp_drops.pl fails with this patch applied. The standby log
> shows:
>
> 2022-03-25 10:00:10.022 EDT [38209] LOG:  entering standby mode
> 2022-03-25 10:00:10.024 EDT [38209] LOG:  redo starts at 0/3000028
> 2022-03-25 10:00:10.062 EDT [38209] FATAL:  could not create directory
> "pg_tblspc/16385/PG_15_202203241/16390": No such file or directory
> 2022-03-25 10:00:10.062 EDT [38209] CONTEXT:  WAL redo at 0/43EBD88
> for Database/CREATE_WAL_LOG: create dir 16385/16390
>
> On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need
> to mirror some of the logic that was added to the replay code for the
> existing strategy, but I haven't figured out the details.
>

Yeah, I think I got it, for XLOG_DBASE_CREATE_WAL_LOG now we will have
to handle the missing parent directory case, like Alvaro handled for
the XLOG_DBASE_CREATE(_FILE_COPY) case.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Mar 25, 2022 at 8:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need
> > to mirror some of the logic that was added to the replay code for the
> > existing strategy, but I haven't figured out the details.
> >
>
> Yeah, I think I got it, for XLOG_DBASE_CREATE_WAL_LOG now we will have
> to handle the missing parent directory case, like Alvaro handled for
> the XLOG_DBASE_CREATE(_FILE_COPY) case.

I have updated the patch so now we skip the XLOG_DBASE_CREATE_WAL_LOG
as well if the tablespace directory is missing.  But with our new
wal_log method there will be other follow up wal logs like,
XLOG_RELMAP_UPDATE, XLOG_SMGR_CREATE and XLOG_FPI.

I have put the similar logic for relmap_update WAL replay as well, but
we don't need this  for smgr_create or fpi.  Because the mdcreate() is
taking care of creating missing directory in TablespaceCreateDbspace()
and fpi only logged after we create the new smgr at least in case of
create database.

Now, is it possible to get the FPI without smgr_create wal in other
cases?  If it is then that problem is orthogonal to this path, but
anyway I could not find any such scenario.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Sat, Mar 26, 2022 at 5:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Mar 25, 2022 at 8:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > > On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need
> > > to mirror some of the logic that was added to the replay code for the
> > > existing strategy, but I haven't figured out the details.
> > >
> >
> > Yeah, I think I got it, for XLOG_DBASE_CREATE_WAL_LOG now we will have
> > to handle the missing parent directory case, like Alvaro handled for
> > the XLOG_DBASE_CREATE(_FILE_COPY) case.
>
> I have updated the patch so now we skip the XLOG_DBASE_CREATE_WAL_LOG
> as well if the tablespace directory is missing.  But with our new
> wal_log method there will be other follow up wal logs like,
> XLOG_RELMAP_UPDATE, XLOG_SMGR_CREATE and XLOG_FPI.
>
> I have put the similar logic for relmap_update WAL replay as well,

There was some mistake in the last patch, basically, for relmap update
also I have checked the missing tablespace directory but I should have
checked the missing database directory so I have fixed that.

> Now, is it possible to get the FPI without smgr_create wal in other
> cases?  If it is then that problem is orthogonal to this path, but
> anyway I could not find any such scenario.

I have digged further into it, tried manually removing the directory
before XLOG_FPI, but I noticed that during FPI also
XLogReadBufferExtended() take cares of creating the missing files
using smgrcreate() and that intern take care of missing directory
creation so I don't think we have any problem here.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Mon, Mar 28, 2022 at 2:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have put the similar logic for relmap_update WAL replay as well,
>
> There was some mistake in the last patch, basically, for relmap update
> also I have checked the missing tablespace directory but I should have
> checked the missing database directory so I have fixed that.
>
> > Now, is it possible to get the FPI without smgr_create wal in other
> > cases?  If it is then that problem is orthogonal to this path, but
> > anyway I could not find any such scenario.
>
> I have digged further into it, tried manually removing the directory
> before XLOG_FPI, but I noticed that during FPI also
> XLogReadBufferExtended() take cares of creating the missing files
> using smgrcreate() and that intern take care of missing directory
> creation so I don't think we have any problem here.

I don't understand whether XLOG_RELMAP_UPDATE should be just doing
smgrcreate() as we would for most WAL records or whether it should be
adopting the new system introduced by
49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
over here:

http://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com

But apart from that question your adaptations here look reasonable to me.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Tue, Mar 29, 2022 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Mar 28, 2022 at 2:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > I have put the similar logic for relmap_update WAL replay as well,
> >
> > There was some mistake in the last patch, basically, for relmap update
> > also I have checked the missing tablespace directory but I should have
> > checked the missing database directory so I have fixed that.
> >
> > > Now, is it possible to get the FPI without smgr_create wal in other
> > > cases?  If it is then that problem is orthogonal to this path, but
> > > anyway I could not find any such scenario.
> >
> > I have digged further into it, tried manually removing the directory
> > before XLOG_FPI, but I noticed that during FPI also
> > XLogReadBufferExtended() take cares of creating the missing files
> > using smgrcreate() and that intern take care of missing directory
> > creation so I don't think we have any problem here.
>
> I don't understand whether XLOG_RELMAP_UPDATE should be just doing
> smgrcreate()

XLOG_RELMAP_UPDATE is for the complete database so for which relnode
it will create smgr?  I think you probably meant
TablespaceCreateDbspace()?

 as we would for most WAL records or whether it should be
> adopting the new system introduced by
> 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
> over here:

okay, thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Mon, Mar 28, 2022 at 3:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> smgrcreate() as we would for most WAL records or whether it should be
> adopting the new system introduced by
> 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
> over here:
>
> http://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com
>
> But apart from that question your adaptations here look reasonable to me.

That commit having been reverted, I committed v6 instead. Let's see
what breaks...

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> That commit having been reverted, I committed v6 instead. Let's see
> what breaks...

It fails in CI (for the mirror of the postgres repo on github):
https://cirrus-ci.com/task/6279465603956736?logs=test_bin#L121
tap test log:
https://api.cirrus-ci.com/v1/artifact/task/6279465603956736/log/src/bin/scripts/tmp_check/log/regress_log_020_createdb
postmaster log:
https://api.cirrus-ci.com/v1/artifact/task/6279465603956736/log/src/bin/scripts/tmp_check/log/020_createdb_main.log

recent versions failed similarly on cfbot:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/37/3192
https://cirrus-ci.com/task/5217140407009280?logs=test_bin#L121

# Running: createdb -T foobar2 foobar6 -S wal_log
createdb: error: too many command-line arguments (first is "wal_log")
Try "createdb --help" for more information.
not ok 31 - createdb -T foobar2 foobar6 -S wal_log exit code 0

#   Failed test 'createdb -T foobar2 foobar6 -S wal_log exit code 0'
#   at t/020_createdb.pl line 117.
not ok 32 - create database with WAL_LOG strategy: SQL found in server log

#   Failed test 'create database with WAL_LOG strategy: SQL found in server log'
#   at t/020_createdb.pl line 117.
#                   ''
#     doesn't match '(?^:statement: CREATE DATABASE foobar6 STRATEGY wal_log TEMPLATE foobar2)'
# Running: createdb -T foobar2 foobar7 -S file_copy
createdb: error: too many command-line arguments (first is "file_copy")
Try "createdb --help" for more information.
not ok 33 - createdb -T foobar2 foobar7 -S file_copy exit code 0

#   Failed test 'createdb -T foobar2 foobar7 -S file_copy exit code 0'
#   at t/020_createdb.pl line 122.
not ok 34 - create database with FILE_COPY strategy: SQL found in server log

#   Failed test 'create database with FILE_COPY strategy: SQL found in server log'
#   at t/020_createdb.pl line 122.
#                   ''
#     doesn't match '(?^:statement: CREATE DATABASE foobar7 STRATEGY file_copy TEMPLATE foobar2)'

Looks like there's some problem with commandline parsing?

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Tue, Mar 29, 2022 at 1:35 PM Andres Freund <andres@anarazel.de> wrote:
> # Running: createdb -T foobar2 foobar6 -S wal_log
> createdb: error: too many command-line arguments (first is "wal_log")
> Try "createdb --help" for more information.
> not ok 31 - createdb -T foobar2 foobar6 -S wal_log exit code 0
>
> Looks like there's some problem with commandline parsing?

Apparently getopt_long() is fussier on Windows. I have committed a fix.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Looks like there's some problem with commandline parsing?

That test script is expecting glibc-like laxness of switch
parsing.  Put the switches before the non-switch arguments.

            regards, tom lane



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Tue, Mar 29, 2022 at 1:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Looks like there's some problem with commandline parsing?
>
> That test script is expecting glibc-like laxness of switch
> parsing.  Put the switches before the non-switch arguments.

I just did that. :-)

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Mar 29, 2022 at 1:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That test script is expecting glibc-like laxness of switch
>> parsing.  Put the switches before the non-switch arguments.

> I just did that. :-)

Yup, you pushed while I was typing.

FWIW, I don't think it's "Windows" enforcing this, it's our own
src/port/getopt[_long].c.  If there were a well-defined spec
for what glibc does with such cases, it might be interesting to
try to make our version bug-compatible with theirs.  But AFAIK
it's some random algorithm that they probably feel at liberty
to change.

            regards, tom lane



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Tue, Mar 29, 2022 at 2:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Mar 29, 2022 at 1:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> That test script is expecting glibc-like laxness of switch
> >> parsing.  Put the switches before the non-switch arguments.
>
> > I just did that. :-)
>
> Yup, you pushed while I was typing.
>
> FWIW, I don't think it's "Windows" enforcing this, it's our own
> src/port/getopt[_long].c.  If there were a well-defined spec
> for what glibc does with such cases, it might be interesting to
> try to make our version bug-compatible with theirs.  But AFAIK
> it's some random algorithm that they probably feel at liberty
> to change.

I guess that characterization surprises me. The man page for
getopt_long() says this, and has for a long time at least on systems
I've used:

ENVIRONMENT
     POSIXLY_CORRECT  If set, option processing stops when the first non-
                      option is found and a leading `-' or `+' in the
                      optstring is ignored.

And also this:

BUGS
     The argv argument is not really const as its elements may be permuted
     (unless POSIXLY_CORRECT is set).

Doesn't that make it pretty clear what the GNU version is doing?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Mar 29, 2022 at 2:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> it's some random algorithm that they probably feel at liberty
>> to change.

> I guess that characterization surprises me. The man page for
> getopt_long() says this, and has for a long time at least on systems
> I've used:

Yeah, they say they follow the POSIX spec when you set POSIXLY_CORRECT.
What they don't spell out in any detail is what they do when you don't.
We know that it involves rearranging the argv[] array behind the
application's back, but not what the rules are for doing that.  In
particular, they must have some undocumented and probably not very safe
method for deciding which arguments are neither switches nor switch
arguments.

(Actually, if I recall previous discussions properly, another stumbling
block to doing anything here is that we'd also have to change all our
documentation to explain it.  Fixing the command line synopses would
be a mess already, and explaining the rules would be worse.)

            regards, tom lane



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Tue, Mar 29, 2022 at 2:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Mar 29, 2022 at 2:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> it's some random algorithm that they probably feel at liberty
> >> to change.
>
> > I guess that characterization surprises me. The man page for
> > getopt_long() says this, and has for a long time at least on systems
> > I've used:
>
> Yeah, they say they follow the POSIX spec when you set POSIXLY_CORRECT.
> What they don't spell out in any detail is what they do when you don't.
> We know that it involves rearranging the argv[] array behind the
> application's back, but not what the rules are for doing that.  In
> particular, they must have some undocumented and probably not very safe
> method for deciding which arguments are neither switches nor switch
> arguments.

I mean, I think of an option as something that starts with '-'. The
documentation contains a caveat that says: "The special argument ‘--’
forces in all cases the end of option scanning." So I think I would
expect it just looks for arguments starting with '-' that do not
follow an argument that is exactly "--".

<looks around for the source code>

https://github.com/gcc-mirror/gcc/blob/master/libiberty/getopt.c

   If an element of ARGV starts with '-', and is not exactly "-" or "--",
   then it is an option element.  The characters of this element
   (aside from the initial '-') are option characters.  If `getopt'
   is called repeatedly, it returns successively each of the option characters
   from each of the option elements.

OK - so I was off slightly. Either "-" or "--" terminates the options
list. Apart from that anything starting with "-" is an option.

I think you're overestimating the level of mystery that's present
here, as well as the likelihood that the rules could ever be changed.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> I committed v6 instead.

Just noticed that it makes initdb a bit slower / the cluster a bit bigger,
because now there's WAL traffic from creating the databases.  There's an
optimization (albeit insufficient) to reduce WAL traffic in bootstrap mode,
but not for single user mode when the CREATE DATABASEs happen.

In an optimized build, with wal-segsize 1 (the most extreme case) using
FILE_COPY vs WAL_LOG:

perf stat ~/build/postgres/dev-optimize/install/bin/initdb /tmp/initdb/ --wal-segsize=1
WAL_LOG:

            487.58 msec task-clock                #    0.848 CPUs utilized
             2,874      context-switches          #    5.894 K/sec
                 0      cpu-migrations            #    0.000 /sec
            10,209      page-faults               #   20.938 K/sec
     1,550,483,095      cycles                    #    3.180 GHz
     2,537,618,094      instructions              #    1.64  insn per cycle
       492,780,121      branches                  #    1.011 G/sec
         7,384,884      branch-misses             #    1.50% of all branches

       0.575213800 seconds time elapsed

       0.349812000 seconds user
       0.133225000 seconds sys

FILE_COPY:

            476.54 msec task-clock                #    0.854 CPUs utilized
             3,005      context-switches          #    6.306 K/sec
                 0      cpu-migrations            #    0.000 /sec
            10,050      page-faults               #   21.090 K/sec
     1,516,058,200      cycles                    #    3.181 GHz
     2,504,126,907      instructions              #    1.65  insn per cycle
       488,042,856      branches                  #    1.024 G/sec
         7,327,364      branch-misses             #    1.50% of all branches

       0.557934976 seconds time elapsed

       0.360473000 seconds user
       0.112109000 seconds sys


the numbers are similar if repeated.

du -s /tmp/initdb/
WAL_LOG: 35112
FILE_COPY: 29288

So it seems we should specify a strategy in initdb? It kind of makes sense -
we're not going to read anything from those database. And because of the
ringbuffer of 256kB, we'll not even reduce IO meaningfully.

- Andres



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Mar 30, 2022 at 6:47 AM Andres Freund <andres@anarazel.de> wrote:
>
>
> du -s /tmp/initdb/
> WAL_LOG: 35112
> FILE_COPY: 29288
>
> So it seems we should specify a strategy in initdb? It kind of makes sense -
> we're not going to read anything from those database. And because of the
> ringbuffer of 256kB, we'll not even reduce IO meaningfully.

I think this makes sense, so you mean with initdb we will always use
file_copy or we want to give a command line option for initdb ?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Tue, Mar 29, 2022 at 9:25 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Mar 28, 2022 at 3:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > smgrcreate() as we would for most WAL records or whether it should be
> > adopting the new system introduced by
> > 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
> > over here:
> >
> > http://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com
> >
> > But apart from that question your adaptations here look reasonable to me.
>
> That commit having been reverted, I committed v6 instead. Let's see
> what breaks...
>

There was a duplicate error check for the invalid createdb strategy
option in the test case, although it would not create any issue but it
is duplicate so I have fixed it in the attached patch.



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-03-30 09:28:58 +0530, Dilip Kumar wrote:
> On Wed, Mar 30, 2022 at 6:47 AM Andres Freund <andres@anarazel.de> wrote:
> >
> >
> > du -s /tmp/initdb/
> > WAL_LOG: 35112
> > FILE_COPY: 29288
> >
> > So it seems we should specify a strategy in initdb? It kind of makes sense -
> > we're not going to read anything from those database. And because of the
> > ringbuffer of 256kB, we'll not even reduce IO meaningfully.
> 
> I think this makes sense, so you mean with initdb we will always use
> file_copy or we want to give a command line option for initdb ?

Don't see a need for a commandline option / a situation where using WAL_LOG
would be preferrable for initdb.

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> I committed v6 instead.

I was just discussing the WAL prefetching patch with Thomas. A question in
that discussion made me look at the coverage of REDO for CREATE DATABASE:
https://coverage.postgresql.org/src/backend/commands/dbcommands.c.gcov.html

Seems there's currently nothing hitting the REDO for
XLOG_DBASE_CREATE_FILE_COPY (currently line 3019). I think it'd be good to
keep coverage for that. How about adding a
  CREATE DATABASE ... STRATEGY file_copy
to 001_stream_rep.pl?


Might be worth adding a test for ALTER DATABASE ... SET TABLESPACE at the same
time, this patch did affect that path in some minor ways. And, somewhat
shockingly, we don't have a single test for it.

- Andres



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Mar 31, 2022 at 5:07 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> > I committed v6 instead.
>
> I was just discussing the WAL prefetching patch with Thomas. A question in
> that discussion made me look at the coverage of REDO for CREATE DATABASE:
> https://coverage.postgresql.org/src/backend/commands/dbcommands.c.gcov.html
>
> Seems there's currently nothing hitting the REDO for
> XLOG_DBASE_CREATE_FILE_COPY (currently line 3019). I think it'd be good to
> keep coverage for that. How about adding a
>   CREATE DATABASE ... STRATEGY file_copy
> to 001_stream_rep.pl?
>
>
> Might be worth adding a test for ALTER DATABASE ... SET TABLESPACE at the same
> time, this patch did affect that path in some minor ways. And, somewhat
> shockingly, we don't have a single test for it.

I will add tests for both of these cases and send the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Mar 31, 2022 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Mar 31, 2022 at 5:07 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> > > I committed v6 instead.
> >
> > I was just discussing the WAL prefetching patch with Thomas. A question in
> > that discussion made me look at the coverage of REDO for CREATE DATABASE:
> > https://coverage.postgresql.org/src/backend/commands/dbcommands.c.gcov.html
> >
> > Seems there's currently nothing hitting the REDO for
> > XLOG_DBASE_CREATE_FILE_COPY (currently line 3019). I think it'd be good to
> > keep coverage for that. How about adding a
> >   CREATE DATABASE ... STRATEGY file_copy
> > to 001_stream_rep.pl?
> >
> >
> > Might be worth adding a test for ALTER DATABASE ... SET TABLESPACE at the same
> > time, this patch did affect that path in some minor ways. And, somewhat
> > shockingly, we don't have a single test for it.
>
> I will add tests for both of these cases and send the patch.


0001 is changing the strategy to file copy during initdb and 0002
patch adds the test cases for both these cases.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Mar 31, 2022 at 3:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 0001 is changing the strategy to file copy during initdb and 0002
> patch adds the test cases for both these cases.

IMHO, 0001 looks fine, except for needing some adjustments to the wording.

I'm less sure about 0002. It's testing the stuff Andres mentioned, but
I'm not sure how good the tests are.

Andres, thoughts? Do you want me to polish and commit 0001?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-03-31 13:22:24 +0530, Dilip Kumar wrote:
> 0001 is changing the strategy to file copy during initdb and 0002
> patch adds the test cases for both these cases.

Thanks!

> From 4a997e2a95074a520777cd2b369f9c728b360969 Mon Sep 17 00:00:00 2001
> From: Dilip Kumar <dilipkumar@localhost.localdomain>
> Date: Thu, 31 Mar 2022 10:43:16 +0530
> Subject: [PATCH 1/2] Use file_copy strategy during initdb
> 
> Because skipping the checkpoint during initdb will not result
> in significant savings, so there is no point in using wal_log
> as that will simply increase the cluster size by generating
> extra wal.
> ---
>  src/bin/initdb/initdb.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
> index 5e36943..1256082 100644
> --- a/src/bin/initdb/initdb.c
> +++ b/src/bin/initdb/initdb.c
> @@ -1856,6 +1856,11 @@ make_template0(FILE *cmdfd)
>       * it would fail. To avoid that, assign a fixed OID to template0 rather
>       * than letting the server choose one.
>       *
> +     * Using file_copy strategy is preferable over wal_log here because
> +     * skipping the checkpoint during initdb will not result in significant
> +     * savings, so there is no point in using wal_log as that will simply
> +     * increase the cluster size by generating extra wal.

It's not just the increase in size, it's also the increase in time due to WAL logging.


>       * (Note that, while the user could have dropped and recreated these
>       * objects in the old cluster, the problem scenario only exists if the OID
>       * that is in use in the old cluster is also used in the new cluster - and
> @@ -1863,7 +1868,7 @@ make_template0(FILE *cmdfd)
>       */
>      static const char *const template0_setup[] = {
>          "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID = "
> -        CppAsString2(Template0ObjectId) ";\n\n",
> +        CppAsString2(Template0ObjectId) " STRATEGY = file_copy;\n\n",

I'd perhaps break this into a separate line, but...


> From d0759bcfc4fed674e938e4a03159f5953ca9718d Mon Sep 17 00:00:00 2001
> From: Dilip Kumar <dilipkumar@localhost.localdomain>
> Date: Thu, 31 Mar 2022 12:07:19 +0530
> Subject: [PATCH 2/2] Create database test coverage
> 
> Test create database strategy wal replay and alter database
> set tablespace.
> ---
>  src/test/modules/test_misc/t/002_tablespace.pl | 12 ++++++++++++
>  src/test/recovery/t/001_stream_rep.pl          | 24 ++++++++++++++++++++++++
>  2 files changed, 36 insertions(+)
> 
> diff --git a/src/test/modules/test_misc/t/002_tablespace.pl b/src/test/modules/test_misc/t/002_tablespace.pl
> index 04e5439..f3bbddc 100644
> --- a/src/test/modules/test_misc/t/002_tablespace.pl
> +++ b/src/test/modules/test_misc/t/002_tablespace.pl
> @@ -83,7 +83,19 @@ $result = $node->psql('postgres',
>      "ALTER TABLE t SET tablespace regress_ts1");
>  ok($result == 0, 'move table in-place->abs');
>  
> +# Test ALTER DATABASE SET TABLESPACE
> +$result = $node->psql('postgres',
> +    "CREATE DATABASE testdb TABLESPACE regress_ts1");
> +ok($result == 0, 'create database in tablespace 1');
> +$result = $node->psql('testdb',
> +    "CREATE TABLE t ()");
> +ok($result == 0, 'create table in testdb database');
> +$result = $node->psql('postgres',
> +    "ALTER DATABASE testdb SET TABLESPACE regress_ts2");
> +ok($result == 0, 'move database to tablespace 2');

This just tests the command doesn't fail, but not whether it actually did
something useful. Seem we should at least insert a row or two into the the
table, and verify they can be accessed?


> +# Create database with different strategies and check its presence in standby
> +$node_primary->safe_psql('postgres',
> +    "CREATE DATABASE testdb1 STRATEGY = FILE_COPY; ");
> +$node_primary->safe_psql('testdb1',
> +    "CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
> +$node_primary->safe_psql('postgres',
> +    "CREATE DATABASE testdb2 STRATEGY = WAL_LOG; ");
> +$node_primary->safe_psql('testdb2',
> +    "CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
> +
> +# Wait for standbys to catch up
> +$primary_lsn = $node_primary->lsn('write');
> +$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
> +
> +$result =
> +  $node_standby_1->safe_psql('testdb1', "SELECT count(*) FROM tab_int");
> +print "standby 1: $result\n";
> +is($result, qq(10), 'check streamed content on standby 1');
> +
> +$result =
> +  $node_standby_1->safe_psql('testdb2', "SELECT count(*) FROM tab_int");
> +print "standby 1: $result\n";
> +is($result, qq(10), 'check streamed content on standby 1');
> +
>  # Check that only READ-only queries can run on standbys
>  is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
>      3, 'read-only queries on standby 1');

I'd probably add a function for creating database / table and then testing it,
with a strategy parameter. That way we can afterwards add more tests verifying
that everything worked.

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-03-31 10:05:10 -0400, Robert Haas wrote:
> On Thu, Mar 31, 2022 at 3:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > 0001 is changing the strategy to file copy during initdb and 0002
> > patch adds the test cases for both these cases.
> 
> IMHO, 0001 looks fine, except for needing some adjustments to the wording.

Agreed.


> I'm less sure about 0002. It's testing the stuff Andres mentioned, but
> I'm not sure how good the tests are.

I came to a similar conclusion. It's still better than nothing, but it's just
a small bit of additional effort to do some basic testing that e.g. the move
actually worked...


> Andres, thoughts? Do you want me to polish and commit 0001?

Yes please!


FWIW, once the freeze is done I'm planning to set up scripting to see which
parts of the code we whacked around don't have test coverage...

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Mar 31, 2022 at 12:25 PM Andres Freund <andres@anarazel.de> wrote:
> > Andres, thoughts? Do you want me to polish and commit 0001?
>
> Yes please!

Here is a polished version. Comments?

> FWIW, once the freeze is done I'm planning to set up scripting to see which
> parts of the code we whacked around don't have test coverage...

Sounds terrifying.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
On 2022-03-31 14:31:43 -0400, Robert Haas wrote:
> On Thu, Mar 31, 2022 at 12:25 PM Andres Freund <andres@anarazel.de> wrote:
> > > Andres, thoughts? Do you want me to polish and commit 0001?
> >
> > Yes please!
> 
> Here is a polished version. Comments?

LGTM.



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Mar 31, 2022 at 2:44 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-03-31 14:31:43 -0400, Robert Haas wrote:
> > On Thu, Mar 31, 2022 at 12:25 PM Andres Freund <andres@anarazel.de> wrote:
> > > > Andres, thoughts? Do you want me to polish and commit 0001?
> > >
> > > Yes please!
> >
> > Here is a polished version. Comments?
>
> LGTM.

Committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Mar 31, 2022 at 9:52 PM Andres Freund <andres@anarazel.de> wrote:

> > +     "ALTER DATABASE testdb SET TABLESPACE regress_ts2");
> > +ok($result == 0, 'move database to tablespace 2');
>
> This just tests the command doesn't fail, but not whether it actually did
> something useful. Seem we should at least insert a row or two into the the
> table, and verify they can be accessed?

Now, added some tuples and verified them.


> >  # Check that only READ-only queries can run on standbys
> >  is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
> >       3, 'read-only queries on standby 1');
>
> I'd probably add a function for creating database / table and then testing it,
> with a strategy parameter. That way we can afterwards add more tests verifying
> that everything worked.

I have created a function to create a database and table and verify
the content in it.  Another option is we can just keep the database
and table creation inside the function and the verification part
outside it so that if some future test case wants to create some extra
content and verify it then they can do so.   But with the current
tests in mind the way I got it in the attached patch has less
duplicate code so I preferred it this way.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> I committed v6 instead.

Coverity complains that this patch added GetDatabasePath() calls without
freeing its return value. Normally that'd be easy to dismiss, due to memory
contexts, but there's no granular resets in CreateDatabaseUsingFileCopy(). And
obviously there can be a lot of relations in one database - we shouldn't hold
onto the same path over and over again.

The case in recovery is worse, because there we don't have a memory context to
reset afaics. Oddly enough, it sure looks like we have an existing version of
this bug in the file-copy path?

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Sun, Apr 3, 2022 at 9:52 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> > I committed v6 instead.
>
> Coverity complains that this patch added GetDatabasePath() calls without
> freeing its return value. Normally that'd be easy to dismiss, due to memory
> contexts, but there's no granular resets in CreateDatabaseUsingFileCopy(). And
> obviously there can be a lot of relations in one database - we shouldn't hold
> onto the same path over and over again.

> The case in recovery is worse, because there we don't have a memory context to
> reset afaics. Oddly enough, it sure looks like we have an existing version of
> this bug in the file-copy path?

Yeah, I see that the createdb() and dbase_redo() had this existing
problem and with this patch we have created a few more such
occurrences.
The attached patch fixes it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Justin Pryzby
Date:
On Tue, Mar 29, 2022 at 11:55:05AM -0400, Robert Haas wrote:
> On Mon, Mar 28, 2022 at 3:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > smgrcreate() as we would for most WAL records or whether it should be
> > adopting the new system introduced by
> > 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern
> > over here:
> >
> > http://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com
> >
> > But apart from that question your adaptations here look reasonable to me.
> 
> That commit having been reverted, I committed v6 instead. Let's see
> what breaks...

There's a crash

2022-07-31 01:22:51.437 CDT client backend[13362] [unknown] PANIC:  could not open critical system index 2662

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007efe27999801 in __GI_abort () at abort.c:79
#2  0x00005583891941dc in errfinish (filename=<optimized out>, filename@entry=0x558389420437 "relcache.c",
lineno=lineno@entry=4328,
    funcname=funcname@entry=0x558389421680 <__func__.33178> "load_critical_index") at elog.c:675
#3  0x00005583891713ef in load_critical_index (indexoid=indexoid@entry=2662, heapoid=heapoid@entry=1259) at
relcache.c:4328
#4  0x0000558389172667 in RelationCacheInitializePhase3 () at relcache.c:4103
#5  0x00005583891b93a4 in InitPostgres (in_dbname=in_dbname@entry=0x55838a50d468 "a", dboid=dboid@entry=0,
username=username@entry=0x55838a50d448"pryzbyj", useroid=useroid@entry=0,
 
    load_session_libraries=<optimized out>, override_allow_connections=override_allow_connections@entry=false,
out_dbname=0x0)at postinit.c:1087
 
#6  0x0000558388daa7bb in PostgresMain (dbname=0x55838a50d468 "a", username=username@entry=0x55838a50d448 "pryzbyj") at
postgres.c:4081
#7  0x0000558388b9f423 in BackendRun (port=port@entry=0x55838a505dd0) at postmaster.c:4490
#8  0x0000558388ba6e07 in BackendStartup (port=port@entry=0x55838a505dd0) at postmaster.c:4218
#9  0x0000558388ba747f in ServerLoop () at postmaster.c:1808
#10 0x0000558388ba8f93 in PostmasterMain (argc=7, argv=<optimized out>) at postmaster.c:1480
#11 0x0000558388840e1f in main (argc=7, argv=0x55838a4dc000) at main.c:197

while :; do psql -qh /tmp postgres -c "DROP DATABASE a" -c "CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log";
done
# Run this for a few loops and then ^C or hold down ^C until it stops,
# and then connect to postgres and try to connect to 'a':
postgres=# \c a
2022-07-31 01:22:51.437 CDT client backend[13362] [unknown] PANIC:  could not open critical system index 2662

Unfortunately, that isn't very consistent, and you have have to run it a bunch
of times...

I don't know if it's an issue of any significance that CREATE DATABASE / ^C
leaves behind a broken database, but it is an issue that the cluster crashes.

While struggling to reproduce that problem, I also hit this warning, which may
or may not be the same.  I added an abort() after WARNING in aset.c to get a
backtrace.

WARNING:  problem in alloc set PortalContext: bogus aset link in block 0x55a63f2f9d60, chunk 0x55a63f2fb138

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No existe el archivo o el directorio.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f81144f1801 in __GI_abort () at abort.c:79
#2  0x000055a63c834c5d in AllocSetCheck (context=context@entry=0x55a63f26fea0) at aset.c:1491
#3  0x000055a63c835b09 in AllocSetDelete (context=0x55a63f26fea0) at aset.c:638
#4  0x000055a63c854322 in MemoryContextDelete (context=0x55a63f26fea0) at mcxt.c:252
#5  0x000055a63c8591d5 in PortalDrop (portal=portal@entry=0x55a63f2bb7a0, isTopCommit=isTopCommit@entry=false) at
portalmem.c:596
#6  0x000055a63c3e4a7b in exec_simple_query (query_string=query_string@entry=0x55a63f24db90 "CREATE DATABASE a TEMPLATE
postgresSTRATEGY wal_log ;") at postgres.c:1253
 
#7  0x000055a63c3e7fc1 in PostgresMain (dbname=<optimized out>, username=username@entry=0x55a63f279448 "pryzbyj") at
postgres.c:4505
#8  0x000055a63c1dc423 in BackendRun (port=port@entry=0x55a63f271dd0) at postmaster.c:4490
#9  0x000055a63c1e3e07 in BackendStartup (port=port@entry=0x55a63f271dd0) at postmaster.c:4218
#10 0x000055a63c1e447f in ServerLoop () at postmaster.c:1808
#11 0x000055a63c1e5f93 in PostmasterMain (argc=7, argv=<optimized out>) at postmaster.c:1480
#12 0x000055a63be7de1f in main (argc=7, argv=0x55a63f248000) at main.c:197

I reproduced that by running this a couple dozen times in an interactive psql.
It doesn't seem to affect STRATEGY=file_copy.

SET statement_timeout=0; DROP DATABASE a; SET statement_timeout='60ms'; CREATE DATABASE a TEMPLATE postgres STRATEGY
wal_log; \c a \c postgres
 

Also, if I understand correctly, this patch seems to assume that nobody is
connected to the source database.  But what's actually enforced is just that
nobody *else* is connected.  Is it any issue that the current DB can be used as
a source?  Anyway, both of the above problems are reproducible using a
different database.

|postgres=# CREATE DATABASE new TEMPLATE postgres STRATEGY wal_log;
|CREATE DATABASE

-- 
Justin



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Tue, Aug 2, 2022 at 1:50 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Unfortunately, that isn't very consistent, and you have have to run it a bunch
> of times...

I was eventually able to reproduce this in part by using the
interactive psql method you describe. It didn't crash, but it did spit
out a bunch of funny error messages:

postgres=# SET statement_timeout=0; DROP DATABASE a; SET
statement_timeout='60ms'; CREATE DATABASE a TEMPLATE postgres STRATEGY
wal_log ; \c a \c postgres
SET
ERROR:  database "a" does not exist
SET
ERROR:  canceling statement due to statement timeout
WARNING:  problem in alloc set PortalContext: req size > alloc size
for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: bad size 0 for chunk
0x7f99508911f0 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: bad single-chunk
0x7f9950891208 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: found inconsistent
memory block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: req size > alloc size
for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: bad size 0 for chunk
0x7f99508911f0 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: bad single-chunk
0x7f9950891208 in block 0x7f9950890800
WARNING:  problem in alloc set PortalContext: found inconsistent
memory block 0x7f9950890800
connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:
database "a" does not exist
Previous connection kept
postgres=# select * from pg_database;
 oid |  datname  | datdba | encoding | datlocprovider | datistemplate
| datallowconn | datconnlimit | datfrozenxid | datminmxid |
dattablespace | datcollate  |  datctype   | daticulocale |
datcollversion |           datacl

-----+-----------+--------+----------+----------------+---------------+--------------+--------------+--------------+------------+---------------+-------------+-------------+--------------+----------------+----------------------------
   5 | postgres  |     10 |        6 | c              | f
| t            |           -1 |          718 |          1 |
1663 | en_US.UTF-8 | en_US.UTF-8 |              |                |
   1 | template1 |     10 |        6 | c              | t
| t            |           -1 |          718 |          1 |
1663 | en_US.UTF-8 | en_US.UTF-8 |              |                |
{=c/rhaas,rhaas=CTc/rhaas}
   4 | template0 |     10 |        6 | c              | t
| f            |           -1 |          718 |          1 |
1663 | en_US.UTF-8 | en_US.UTF-8 |              |                |
{=c/rhaas,rhaas=CTc/rhaas}
(3 rows)

I then set backtrace_functions='AllocSetCheck' and reproduced it
again, which led to stack traces like this:

2022-08-02 16:50:32.490 EDT [98814] WARNING:  problem in alloc set
PortalContext: bad single-chunk 0x7f9950886608 in block 0x7f9950885c00
2022-08-02 16:50:32.490 EDT [98814] BACKTRACE:
2   postgres                            0x000000010cd37ef5 AllocSetCheck + 549
3   postgres                            0x000000010cd37730 AllocSetReset + 48
4   postgres                            0x000000010cd3f6f1
MemoryContextResetOnly + 81
5   postgres                            0x000000010cd378b9 AllocSetDelete + 73
6   postgres                            0x000000010cd41e09 PortalDrop + 425
7   postgres                            0x000000010cd427bb
AtCleanup_Portals + 203
8   postgres                            0x000000010c86476d
CleanupTransaction + 29
9   postgres                            0x000000010c865d4f
AbortCurrentTransaction + 63
10  postgres                            0x000000010cba1395 PostgresMain + 885
11  postgres                            0x000000010caf5472 PostmasterMain + 7586
12  postgres                            0x000000010ca31e3d main + 2205
13  libdyld.dylib                       0x00007fff699afcc9 start + 1
14  ???                                 0x0000000000000001 0x0 + 1

I recompiled with -O0 and hacked the code that emits the BACKTRACE:
bit to go into an infinite loop if it's hit, which enabled me to hook
up a debugger at the point of the failure. The debugger says:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x000000010e98a157
postgres`send_message_to_server_log(edata=0x000000010ec0f658) at
elog.c:2916:4
    frame #1: 0x000000010e9866d6 postgres`EmitErrorReport at elog.c:1537:3
    frame #2: 0x000000010e986016 postgres`errfinish(filename="aset.c",
lineno=1470, funcname="AllocSetCheck") at elog.c:592:2
    frame #3: 0x000000010e9c8465
postgres`AllocSetCheck(context=0x00007ff77c80d200) at aset.c:1469:5
    frame #4: 0x000000010e9c7c05
postgres`AllocSetDelete(context=0x00007ff77c80d200) at aset.c:638:2
    frame #5: 0x000000010e9d368b
postgres`MemoryContextDelete(context=0x00007ff77c80d200) at
mcxt.c:252:2
  * frame #6: 0x000000010e9d705b
postgres`PortalDrop(portal=0x00007ff77e028920, isTopCommit=false) at
portalmem.c:596:2
    frame #7: 0x000000010e9d7e0e postgres`AtCleanup_Portals at portalmem.c:907:3
    frame #8: 0x000000010e22030d postgres`CleanupTransaction at xact.c:2890:2
    frame #9: 0x000000010e2219da postgres`AbortCurrentTransaction at
xact.c:3328:4
    frame #10: 0x000000010e763237
postgres`PostgresMain(dbname="postgres", username="rhaas") at
postgres.c:4232:3
    frame #11: 0x000000010e6625aa
postgres`BackendRun(port=0x00007ff77c1042c0) at postmaster.c:4490:2
    frame #12: 0x000000010e661b18
postgres`BackendStartup(port=0x00007ff77c1042c0) at
postmaster.c:4218:3
    frame #13: 0x000000010e66088a postgres`ServerLoop at postmaster.c:1808:7
    frame #14: 0x000000010e65def2 postgres`PostmasterMain(argc=1,
argv=0x00007ff77ae05cf0) at postmaster.c:1480:11
    frame #15: 0x000000010e50521f postgres`main(argc=1,
argv=0x00007ff77ae05cf0) at main.c:197:3
    frame #16: 0x00007fff699afcc9 libdyld.dylib`start + 1
(lldb) fr sel 6
frame #6: 0x000000010e9d705b
postgres`PortalDrop(portal=0x00007ff77e028920, isTopCommit=false) at
portalmem.c:596:2
   593 MemoryContextDelete(portal->holdContext);
   594
   595 /* release subsidiary storage */
-> 596 MemoryContextDelete(portal->portalContext);
   597
   598 /* release portal struct (it's in TopPortalContext) */
   599 pfree(portal);
(lldb) fr sel 3
frame #3: 0x000000010e9c8465
postgres`AllocSetCheck(context=0x00007ff77c80d200) at aset.c:1469:5
   1466 * Check chunk size
   1467 */
   1468 if (dsize > chsize)
-> 1469 elog(WARNING, "problem in alloc set %s: req size > alloc size
for chunk %p in block %p",
   1470 name, chunk, block);
   1471 if (chsize < (1 << ALLOC_MINBITS))
   1472 elog(WARNING, "problem in alloc set %s: bad size %zu for chunk
%p in block %p",
(lldb) p dsize
(Size) $3 = 20
(lldb) p chsize
(Size) $4 = 0

It seems like CreateDatabaseUsingWalLog() must be doing something that
corrupts PortalContext, but at the moment I'm not sure what that thing
could be.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> WARNING:  problem in alloc set PortalContext: req size > alloc size
> for chunk 0x7f99508911f0 in block 0x7f9950890800
> WARNING:  problem in alloc set PortalContext: bad size 0 for chunk
> 0x7f99508911f0 in block 0x7f9950890800
> WARNING:  problem in alloc set PortalContext: bad single-chunk
> 0x7f9950891208 in block 0x7f9950890800
> WARNING:  problem in alloc set PortalContext: found inconsistent
> memory block 0x7f9950890800
> WARNING:  problem in alloc set PortalContext: req size > alloc size
> for chunk 0x7f99508911f0 in block 0x7f9950890800
> WARNING:  problem in alloc set PortalContext: bad size 0 for chunk
> 0x7f99508911f0 in block 0x7f9950890800
> WARNING:  problem in alloc set PortalContext: bad single-chunk
> 0x7f9950891208 in block 0x7f9950890800
> WARNING:  problem in alloc set PortalContext: found inconsistent
> memory block 0x7f9950890800

This looks like nothing so much as the fallout from something scribbling
past the end of an allocated palloc chunk, or perhaps writing on
already-freed space.  Perhaps running the test case under valgrind
would help to finger the culprit.

            regards, tom lane



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Justin Pryzby
Date:
On Tue, Aug 02, 2022 at 05:46:34PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > WARNING:  problem in alloc set PortalContext: req size > alloc size for chunk 0x7f99508911f0 in block
0x7f9950890800
> 
> This looks like nothing so much as the fallout from something scribbling
> past the end of an allocated palloc chunk, or perhaps writing on
> already-freed space.  Perhaps running the test case under valgrind
> would help to finger the culprit.

Yeah but my test case is so poor that it's a chore ...

(Sorry for that, but it took me 2 days to be able to reproduce the problem so I
sent it sooner rather than looking for a better way ... )

I got this interesting looking thing.

==11628== Invalid write of size 8
==11628==    at 0x1D12B3A: smgrsetowner (smgr.c:213)
==11628==    by 0x1C7C224: RelationGetSmgr (rel.h:572)
==11628==    by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
==11628==    by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
==11628==    by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)
==11628==    by 0x1D27A7C: PortalRun (pquery.c:791)
==11628==    by 0x1D1E33D: exec_simple_query (postgres.c:1243)
==11628==    by 0x1D218BC: PostgresMain (postgres.c:4505)
==11628==  Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 free'd
==11628==    at 0x4033A3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==11628==    by 0x217D7C2: AllocSetReset (aset.c:608)
==11628==    by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
==11628==    by 0x217DBD5: AllocSetDelete (aset.c:654)
==11628==    by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
==11628==    by 0x21A109F: PortalDrop (portalmem.c:596)
==11628==    by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
==11628==    by 0x11FEAB1: CleanupTransaction (xact.c:2890)
==11628==    by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
==11628==    by 0x1D2158C: PostgresMain (postgres.c:4232)
==11628==    by 0x1B15DB5: BackendRun (postmaster.c:4490)
==11628==    by 0x1B1D799: BackendStartup (postmaster.c:4218)
==11628==  Block was alloc'd at
==11628==    at 0x40327F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==11628==    by 0x217F0DC: AllocSetAlloc (aset.c:920)
==11628==    by 0x219E4D2: palloc (mcxt.c:1082)
==11628==    by 0x14A14BE: ScanSourceDatabasePgClassTuple (dbcommands.c:444)
==11628==    by 0x14A1CD8: ScanSourceDatabasePgClassPage (dbcommands.c:384)
==11628==    by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
==11628==    by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)

-- 
Justin



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
On 2022-08-02 17:04:16 -0500, Justin Pryzby wrote:
> I got this interesting looking thing.
> 
> ==11628== Invalid write of size 8
> ==11628==    at 0x1D12B3A: smgrsetowner (smgr.c:213)
> ==11628==    by 0x1C7C224: RelationGetSmgr (rel.h:572)
> ==11628==    by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
> ==11628==    by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
> ==11628==    by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
> ==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
> ==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> ==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
> ==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> ==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)
> ==11628==    by 0x1D27A7C: PortalRun (pquery.c:791)
> ==11628==    by 0x1D1E33D: exec_simple_query (postgres.c:1243)
> ==11628==    by 0x1D218BC: PostgresMain (postgres.c:4505)
> ==11628==  Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 free'd
> ==11628==    at 0x4033A3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==11628==    by 0x217D7C2: AllocSetReset (aset.c:608)
> ==11628==    by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
> ==11628==    by 0x217DBD5: AllocSetDelete (aset.c:654)
> ==11628==    by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
> ==11628==    by 0x21A109F: PortalDrop (portalmem.c:596)
> ==11628==    by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
> ==11628==    by 0x11FEAB1: CleanupTransaction (xact.c:2890)
> ==11628==    by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
> ==11628==    by 0x1D2158C: PostgresMain (postgres.c:4232)
> ==11628==    by 0x1B15DB5: BackendRun (postmaster.c:4490)
> ==11628==    by 0x1B1D799: BackendStartup (postmaster.c:4218)
> ==11628==  Block was alloc'd at
> ==11628==    at 0x40327F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==11628==    by 0x217F0DC: AllocSetAlloc (aset.c:920)
> ==11628==    by 0x219E4D2: palloc (mcxt.c:1082)
> ==11628==    by 0x14A14BE: ScanSourceDatabasePgClassTuple (dbcommands.c:444)
> ==11628==    by 0x14A1CD8: ScanSourceDatabasePgClassPage (dbcommands.c:384)
> ==11628==    by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
> ==11628==    by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
> ==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
> ==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> ==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
> ==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> ==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)

Ick. That looks like somehow we end up with smgr entries still pointing to
fake relcache entries, created in a prior attempt at create database.

Looks like you'd need error trapping to call FreeFakeRelcacheEntry() (or just
smgrclearowner()) in case of error.

Or perhaps we can instead prevent the fake relcache entry being set as the
owner in the first place?

Why do we even need fake relcache entries here?  Looks like all that they're
used for is a bunch of RelationGetSmgr() calls? Can't we instead just pass the
rnode to smgropen()? Given that we're doing that once for every buffer in the
body of RelationCopyStorageUsingBuffer(), doing it in a bunch of other
less-frequent places can't be a problem.
can't 

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Aug 3, 2022 at 3:53 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-08-02 17:04:16 -0500, Justin Pryzby wrote:
> > I got this interesting looking thing.
> >
> > ==11628== Invalid write of size 8
> > ==11628==    at 0x1D12B3A: smgrsetowner (smgr.c:213)
> > ==11628==    by 0x1C7C224: RelationGetSmgr (rel.h:572)
> > ==11628==    by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
> > ==11628==    by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
> > ==11628==    by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
> > ==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
> > ==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> > ==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
> > ==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> > ==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)
> > ==11628==    by 0x1D27A7C: PortalRun (pquery.c:791)
> > ==11628==    by 0x1D1E33D: exec_simple_query (postgres.c:1243)
> > ==11628==    by 0x1D218BC: PostgresMain (postgres.c:4505)
> > ==11628==  Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 free'd
> > ==11628==    at 0x4033A3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> > ==11628==    by 0x217D7C2: AllocSetReset (aset.c:608)
> > ==11628==    by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
> > ==11628==    by 0x217DBD5: AllocSetDelete (aset.c:654)
> > ==11628==    by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
> > ==11628==    by 0x21A109F: PortalDrop (portalmem.c:596)
> > ==11628==    by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
> > ==11628==    by 0x11FEAB1: CleanupTransaction (xact.c:2890)
> > ==11628==    by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
> > ==11628==    by 0x1D2158C: PostgresMain (postgres.c:4232)
> > ==11628==    by 0x1B15DB5: BackendRun (postmaster.c:4490)
> > ==11628==    by 0x1B1D799: BackendStartup (postmaster.c:4218)
> > ==11628==  Block was alloc'd at
> > ==11628==    at 0x40327F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> > ==11628==    by 0x217F0DC: AllocSetAlloc (aset.c:920)
> > ==11628==    by 0x219E4D2: palloc (mcxt.c:1082)
> > ==11628==    by 0x14A14BE: ScanSourceDatabasePgClassTuple (dbcommands.c:444)
> > ==11628==    by 0x14A1CD8: ScanSourceDatabasePgClassPage (dbcommands.c:384)
> > ==11628==    by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
> > ==11628==    by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
> > ==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
> > ==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> > ==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
> > ==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> > ==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)
>
> Ick. That looks like somehow we end up with smgr entries still pointing to
> fake relcache entries, created in a prior attempt at create database.

The surprising thing is how the smgr entry survived the transaction
abort, I mean AtEOXact_SMgr should have closed the smgr and should
have removed from the
smgr cache.

> Looks like you'd need error trapping to call FreeFakeRelcacheEntry() (or just
> smgrclearowner()) in case of error.
>
> Or perhaps we can instead prevent the fake relcache entry being set as the
> owner in the first place?
>
> Why do we even need fake relcache entries here?  Looks like all that they're
> used for is a bunch of RelationGetSmgr() calls? Can't we instead just pass the
> rnode to smgropen()? Given that we're doing that once for every buffer in the
> body of RelationCopyStorageUsingBuffer(), doing it in a bunch of other
> less-frequent places can't be a problem.
> can't

I think in some of the previous versions of the patch we were using
smgropen() but changed it so that we do not reuse the smgr after it
gets removed during interrupt processing, see discussion here[1]

[1]
https://www.postgresql.org/message-id/CA%2BTgmoYKovODW2Y7rQmmRFaKu445p9uAahjpgfbY8eyeL07BXA%40mail.gmail.com

From the Valgrind report, it is clear that we are getting the smgr
entry whose smgr->smgr_owner is pointing into the fake relcache entry.
So I am investigating further how it is possible for the smgr created
during a previous create database attempt to survive beyond abort
transaction.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Aug 3, 2022 at 11:28 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 3, 2022 at 3:53 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2022-08-02 17:04:16 -0500, Justin Pryzby wrote:
> > > I got this interesting looking thing.
> > >
> > > ==11628== Invalid write of size 8
> > > ==11628==    at 0x1D12B3A: smgrsetowner (smgr.c:213)
> > > ==11628==    by 0x1C7C224: RelationGetSmgr (rel.h:572)
> > > ==11628==    by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
> > > ==11628==    by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
> > > ==11628==    by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
> > > ==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
> > > ==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> > > ==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
> > > ==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> > > ==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)
> > > ==11628==    by 0x1D27A7C: PortalRun (pquery.c:791)
> > > ==11628==    by 0x1D1E33D: exec_simple_query (postgres.c:1243)
> > > ==11628==    by 0x1D218BC: PostgresMain (postgres.c:4505)
> > > ==11628==  Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 free'd
> > > ==11628==    at 0x4033A3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> > > ==11628==    by 0x217D7C2: AllocSetReset (aset.c:608)
> > > ==11628==    by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
> > > ==11628==    by 0x217DBD5: AllocSetDelete (aset.c:654)
> > > ==11628==    by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
> > > ==11628==    by 0x21A109F: PortalDrop (portalmem.c:596)
> > > ==11628==    by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
> > > ==11628==    by 0x11FEAB1: CleanupTransaction (xact.c:2890)
> > > ==11628==    by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
> > > ==11628==    by 0x1D2158C: PostgresMain (postgres.c:4232)
> > > ==11628==    by 0x1B15DB5: BackendRun (postmaster.c:4490)
> > > ==11628==    by 0x1B1D799: BackendStartup (postmaster.c:4218)
> > > ==11628==  Block was alloc'd at
> > > ==11628==    at 0x40327F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> > > ==11628==    by 0x217F0DC: AllocSetAlloc (aset.c:920)
> > > ==11628==    by 0x219E4D2: palloc (mcxt.c:1082)
> > > ==11628==    by 0x14A14BE: ScanSourceDatabasePgClassTuple (dbcommands.c:444)
> > > ==11628==    by 0x14A1CD8: ScanSourceDatabasePgClassPage (dbcommands.c:384)
> > > ==11628==    by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
> > > ==11628==    by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
> > > ==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
> > > ==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> > > ==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
> > > ==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> > > ==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)
> >
> > Ick. That looks like somehow we end up with smgr entries still pointing to
> > fake relcache entries, created in a prior attempt at create database.
>
> The surprising thing is how the smgr entry survived the transaction
> abort, I mean AtEOXact_SMgr should have closed the smgr and should
> have removed from the
> smgr cache.
>

Okay, so AtEOXact_SMgr will only get rid of unowned smgr and ours are
owned by a fake relcache and whose lifetime is just portal memory
context which will go away at the transaction end.  So as Andres
suggested options could be that a) we catch the error and do
FreeFakeRelcacheEntry.  b) directly use smgropen instead of
RelationGetSmgr because actually, we do not want the owner to be set
for these smgrs.

I think option b) looks better to me, I will prepare a patch and test
whether the error goes away with that or not.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Aug 3, 2022 at 12:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>

> Okay, so AtEOXact_SMgr will only get rid of unowned smgr and ours are
> owned by a fake relcache and whose lifetime is just portal memory
> context which will go away at the transaction end.  So as Andres
> suggested options could be that a) we catch the error and do
> FreeFakeRelcacheEntry.  b) directly use smgropen instead of
> RelationGetSmgr because actually, we do not want the owner to be set
> for these smgrs.
>
> I think option b) looks better to me, I will prepare a patch and test
> whether the error goes away with that or not.
>

Here is the patch which directly uses smgropen instead of using fake
relcache entry.  We don't preserve the smgr pointer and whenever
required we again call the smgropen.

With this patch it resolved the problem for me at least what I was
able to reproduce.  I was able to reproduce the WARNING messages that
Robert got as well as the valgrind error that Justin got and with this
patch both are resolved.

@Justin can you help in verifying the original issue?

Another alternative could be that continue using fake relcache entry
but instead of RelationGetSmgr() create some new function which
doesn't set the owner in the smgr.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Aug 3, 2022 at 1:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 3, 2022 at 12:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
>
> > Okay, so AtEOXact_SMgr will only get rid of unowned smgr and ours are
> > owned by a fake relcache and whose lifetime is just portal memory
> > context which will go away at the transaction end.  So as Andres
> > suggested options could be that a) we catch the error and do
> > FreeFakeRelcacheEntry.  b) directly use smgropen instead of
> > RelationGetSmgr because actually, we do not want the owner to be set
> > for these smgrs.
> >
> > I think option b) looks better to me, I will prepare a patch and test
> > whether the error goes away with that or not.
> >
>
> Here is the patch which directly uses smgropen instead of using fake
> relcache entry.  We don't preserve the smgr pointer and whenever
> required we again call the smgropen.
>
> With this patch it resolved the problem for me at least what I was
> able to reproduce.  I was able to reproduce the WARNING messages that
> Robert got as well as the valgrind error that Justin got and with this
> patch both are resolved.

Another version of the patch which closes the smgr at the end using
smgrcloserellocator() and I have also added a commit message.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Another version of the patch which closes the smgr at the end using
> smgrcloserellocator() and I have also added a commit message.

Hmm, but didn't we decide against doing it that way intentionally? The
comment you're deleting says "If we didn't do this and used the smgr
layer directly, we would have to worry about invalidations."

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, 3 Aug 2022 at 9:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Another version of the patch which closes the smgr at the end using
> smgrcloserellocator() and I have also added a commit message.

Hmm, but didn't we decide against doing it that way intentionally? The
comment you're deleting says "If we didn't do this and used the smgr
layer directly, we would have to worry about invalidations."

I think we only need to worry if we keep the smgr reference around and try to reuse it.  But in this patch I am not keeping the reference to the smgr.

Dilip
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Justin Pryzby
Date:
On Wed, Aug 03, 2022 at 04:45:23PM +0530, Dilip Kumar wrote:
> Another version of the patch which closes the smgr at the end using
> smgrcloserellocator() and I have also added a commit message.

Thanks for providing a patch.
This seems to fix the second problem with accessing freed memory.

But I reproduced the first problem with a handful of tries interrupting the
while loop:

2022-08-03 10:39:50.129 CDT client backend[5530] [unknown] PANIC:  could not open critical system index 2662

In the failure, when trying to connect to the new "a" DB, it does this:

[pid 10700] openat(AT_FDCWD, "base/17003/pg_filenode.map", O_RDONLY) = 11
[pid 10700] read(11,
"\27'Y\0\21\0\0\0\353\4\0\0\353\4\0\0\341\4\0\0\341\4\0\0\347\4\0\0\347\4\0\0\337\4\0\0\337\4\0\0\24\v\0\0\24\v\0\0\25\v\0\0\25\v\0\0K\20\0\0K\20\0\0L\20\0\0L\20\0\0\202\n\0\0\202\n\0\0\203\n\0\0\203\n\0\0\217\n\0\0\217\n\0\0\220\n\0\0\220\n\0\0b\n\0\0b\n\0\0c\n\0\0c\n\0\0f\n\0\0f\n\0\0g\n\0\0g\n\0\0\177\r\0\0\177\r\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\362\366\252\337",
524)= 524 
[pid 10700] close(11)                   = 0
[pid 10700] openat(AT_FDCWD, "base/17003/pg_internal.init", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 10700] openat(AT_FDCWD, "base/17003/1259", O_RDWR) = 11
[pid 10700] lseek(11, 0, SEEK_END)      = 106496
[pid 10700] lseek(11, 0, SEEK_END)      = 106496

And then reads nothing but zero bytes from FD 11 (rel 1259/pg_class)

So far, I haven't succeeded in eliciting anything useful from valgrind.

--
Justin



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Wed, Aug 3, 2022 at 9:32 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Wed, Aug 03, 2022 at 04:45:23PM +0530, Dilip Kumar wrote:
> > Another version of the patch which closes the smgr at the end using
> > smgrcloserellocator() and I have also added a commit message.
>
> Thanks for providing a patch.
> This seems to fix the second problem with accessing freed memory.

Thanks for the confirmation.

> But I reproduced the first problem with a handful of tries interrupting the
> while loop:
>
> 2022-08-03 10:39:50.129 CDT client backend[5530] [unknown] PANIC:  could not open critical system index 2662
>
> In the failure, when trying to connect to the new "a" DB, it does this:
>
> [pid 10700] openat(AT_FDCWD, "base/17003/pg_filenode.map", O_RDONLY) = 11
> [pid 10700] read(11,
"\27'Y\0\21\0\0\0\353\4\0\0\353\4\0\0\341\4\0\0\341\4\0\0\347\4\0\0\347\4\0\0\337\4\0\0\337\4\0\0\24\v\0\0\24\v\0\0\25\v\0\0\25\v\0\0K\20\0\0K\20\0\0L\20\0\0L\20\0\0\202\n\0\0\202\n\0\0\203\n\0\0\203\n\0\0\217\n\0\0\217\n\0\0\220\n\0\0\220\n\0\0b\n\0\0b\n\0\0c\n\0\0c\n\0\0f\n\0\0f\n\0\0g\n\0\0g\n\0\0\177\r\0\0\177\r\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\362\366\252\337",
524)= 524 
> [pid 10700] close(11)                   = 0
> [pid 10700] openat(AT_FDCWD, "base/17003/pg_internal.init", O_RDONLY) = -1 ENOENT (No such file or directory)
> [pid 10700] openat(AT_FDCWD, "base/17003/1259", O_RDWR) = 11
> [pid 10700] lseek(11, 0, SEEK_END)      = 106496
> [pid 10700] lseek(11, 0, SEEK_END)      = 106496
>
> And then reads nothing but zero bytes from FD 11 (rel 1259/pg_class)
>
> So far, I haven't succeeded in eliciting anything useful from valgrind.

I tried multiple times but had no luck with reproducing this issue.
Do you have some logs to know that just before ^C what was the last
query executed and whether it got canceled or executed completely?
Because theoretically, if the create database failed anywhere in
between then it should at least clean the directory of that newly
created database but seems there are some corrupted data in that
directory so seems it is not symptoms of just the create database
failure but some combination of multiple things.  I will put more
thought into this tomorrow.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Justin Pryzby
Date:
On Wed, Aug 03, 2022 at 11:02:00AM -0500, Justin Pryzby wrote:
> But I reproduced the first problem with a handful of tries interrupting the
> while loop:
> 
> 2022-08-03 10:39:50.129 CDT client backend[5530] [unknown] PANIC:  could not open critical system index 2662
...
> So far, I haven't succeeded in eliciting anything useful from valgrind.

Now, I've reproduced the problem under valgrind, but it doesn't show anything
useful:

pryzbyj@pryzbyj:~$ while :; do psql -h /tmp template1 -c "DROP DATABASE a" -c "CREATE DATABASE a TEMPLATE postgres
STRATEGYwal_log"; done
 
ERROR:  database "a" does not exist
CREATE DATABASE
^CCancel request sent
ERROR:  canceling statement due to user request
ERROR:  database "a" already exists
^C
pryzbyj@pryzbyj:~$ ^C
pryzbyj@pryzbyj:~$ ^C
pryzbyj@pryzbyj:~$ ^C
pryzbyj@pryzbyj:~$ psql -h /tmp a -c ''
2022-08-03 11:57:39.178 CDT client backend[31321] [unknown] PANIC:  could not open critical system index 2662
psql: error: falló la conexión al servidor en el socket «/tmp/.s.PGSQL.5432»: PANIC:  could not open critical system
index2662
 


On the server process, nothing interesting but the backtrace (the error was
before this, while writing the relation file, but there's nothing suspicious).

2022-08-03 11:08:06.628 CDT client backend[2841] [unknown] PANIC:  could not open critical system index 2662
==2841==
==2841== Process terminating with default action of signal 6 (SIGABRT)
==2841==    at 0x5FBBE97: raise (raise.c:51)
==2841==    by 0x5FBD800: abort (abort.c:79)
==2841==    by 0x2118DEF: errfinish (elog.c:675)
==2841==    by 0x20F6002: load_critical_index (relcache.c:4328)
==2841==    by 0x20F727A: RelationCacheInitializePhase3 (relcache.c:4103)
==2841==    by 0x213DFA5: InitPostgres (postinit.c:1087)
==2841==    by 0x1D20D72: PostgresMain (postgres.c:4081)
==2841==    by 0x1B15CFD: BackendRun (postmaster.c:4490)
==2841==    by 0x1B1D6E1: BackendStartup (postmaster.c:4218)
==2841==    by 0x1B1DD59: ServerLoop (postmaster.c:1808)
==2841==    by 0x1B1F86D: PostmasterMain (postmaster.c:1480)
==2841==    by 0x17B7150: main (main.c:197)

Below, I reproduced it without valgrind (and without LANG):

pryzbyj@pryzbyj:~/src/postgres$ while :; do psql -qh /tmp template1 -c "DROP DATABASE a" -c "CREATE DATABASE a TEMPLATE
postgresSTRATEGY wal_log"; done
 
2022-08-03 11:59:52.675 CDT checkpointer[1881] LOG:  checkpoint starting: immediate force wait
2022-08-03 11:59:52.862 CDT checkpointer[1881] LOG:  checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added,
0removed, 0 recycled; write=0.045 s, sync=0.038 s, total=0.188 s; sync files=3, longest=0.019 s, average=0.013 s;
distance=3kB, estimate=3 kB; lsn=0/24862508, redo lsn=0/248624D0
 
2022-08-03 11:59:53.213 CDT checkpointer[1881] LOG:  checkpoint starting: immediate force wait
2022-08-03 11:59:53.409 CDT checkpointer[1881] LOG:  checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added,
0removed, 0 recycled; write=0.030 s, sync=0.054 s, total=0.196 s; sync files=4, longest=0.029 s, average=0.014 s;
distance=4042kB, estimate=4042 kB; lsn=0/24C54D88, redo lsn=0/24C54D50
 
^CCancel request sent
2022-08-03 11:59:53.750 CDT checkpointer[1881] LOG:  checkpoint starting: immediate force wait
2022-08-03 11:59:53.930 CDT checkpointer[1881] LOG:  checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added,
0removed, 1 recycled; write=0.029 s, sync=0.027 s, total=0.181 s; sync files=4, longest=0.022 s, average=0.007 s;
distance=4042kB, estimate=4042 kB; lsn=0/250476D0, redo lsn=0/25047698
 
2022-08-03 11:59:54.270 CDT checkpointer[1881] LOG:  checkpoint starting: immediate force wait
^C2022-08-03 11:59:54.294 CDT client backend[1903] psql ERROR:  canceling statement due to user request
2022-08-03 11:59:54.294 CDT client backend[1903] psql STATEMENT:  DROP DATABASE a
Cancel request sent
ERROR:  canceling statement due to user request
2022-08-03 11:59:54.296 CDT client backend[1903] psql ERROR:  database "a" already exists
2022-08-03 11:59:54.296 CDT client backend[1903] psql STATEMENT:  CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log
ERROR:  database "a" already exists
^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ 2022-08-03 11:59:54.427 CDT checkpointer[1881] LOG:  checkpoint complete: wrote 4
buffers(0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.024 s, sync=0.036 s, total=0.158 s; sync files=4,
longest=0.027s, average=0.009 s; distance=4042 kB, estimate=4042 kB; lsn=0/2543A108, redo lsn=0/2543A0A8
 
^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ psql -h /tmp a -c ''
                                                                                          2022-08-03 11:59:56.617 CDT
clientbackend[1914] [unknown] PANIC:  could not open critical system index 2662
 



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-08-03 12:01:18 -0500, Justin Pryzby wrote:
> Now, I've reproduced the problem under valgrind, but it doesn't show anything
> useful

Yea, that looks like an issue on a different level.

> 
> pryzbyj@pryzbyj:~$ while :; do psql -h /tmp template1 -c "DROP DATABASE a" -c "CREATE DATABASE a TEMPLATE postgres
STRATEGYwal_log"; done
 
> ERROR:  database "a" does not exist
> CREATE DATABASE
> ^CCancel request sent
> ERROR:  canceling statement due to user request
> ERROR:  database "a" already exists
> ^C

Hm. This looks more like an issue of DROP DATABASE not being interruptible. I
suspect this isn't actually related to STRATEGY wal_log and could likely be
reproduced in older versions too.

It's pretty obvious that dropdb() isn't safe against being interrupted. We
delete the data before we have committed the deletion of the pg_database
entry.

Seems like we should hold interrupts across the remove_dbtablespaces() until
*after* we've committed the transaction?

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Justin Pryzby
Date:
On Wed, Aug 03, 2022 at 11:26:43AM -0700, Andres Freund wrote:
> Hm. This looks more like an issue of DROP DATABASE not being interruptible. I
> suspect this isn't actually related to STRATEGY wal_log and could likely be
> reproduced in older versions too.

I couldn't reproduce it with file_copy, but my recipe isn't exactly reliable.
That may just mean that it's easier to hit now.

-- 
Justin



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Aug 4, 2022 at 12:18 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Wed, Aug 03, 2022 at 11:26:43AM -0700, Andres Freund wrote:
> > Hm. This looks more like an issue of DROP DATABASE not being interruptible. I
> > suspect this isn't actually related to STRATEGY wal_log and could likely be
> > reproduced in older versions too.
>
> I couldn't reproduce it with file_copy, but my recipe isn't exactly reliable.
> That may just mean that it's easier to hit now.

I think this looks like a problem with drop db but IMHO you are seeing
this behavior only when a database is created using WAL LOG because in
this strategy we are using buffers to write the destination database
pages and some of the dirty buffers and sync requests might still be
pending.  And now when we try to drop the database it drops all the
dirty buffers and all pending sync requests and then before it
actually removes the directory it gets interrupted and now you see the
database directory on disk which is partially corrupted.  See below
sequence of drop database


dropdb()
{
...
DropDatabaseBuffers(db_id);
...
ForgetDatabaseSyncRequests(db_id);
...
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
 -- Inside this it can process the cancel query and get interrupted
remove_dbtablespaces(db_id);
..
}

I reproduced the same error by inducing error just before
WaitForProcSignalBarrier.

postgres[14968]=# CREATE DATABASE a STRATEGY WAL_LOG ; drop database a;
CREATE DATABASE
ERROR:  XX000: test error
LOCATION:  dropdb, dbcommands.c:1684
postgres[14968]=# \c a
connection to server on socket "/tmp/.s.PGSQL.5432" failed: PANIC:
could not open critical system index 2662
Previous connection kept
postgres[14968]=#


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Thu, Aug 4, 2022 at 9:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Aug 4, 2022 at 12:18 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> >
> > On Wed, Aug 03, 2022 at 11:26:43AM -0700, Andres Freund wrote:
> > > Hm. This looks more like an issue of DROP DATABASE not being interruptible. I
> > > suspect this isn't actually related to STRATEGY wal_log and could likely be
> > > reproduced in older versions too.
> >
> > I couldn't reproduce it with file_copy, but my recipe isn't exactly reliable.
> > That may just mean that it's easier to hit now.
>
> I think this looks like a problem with drop db but IMHO you are seeing
> this behavior only when a database is created using WAL LOG because in
> this strategy we are using buffers to write the destination database
> pages and some of the dirty buffers and sync requests might still be
> pending.  And now when we try to drop the database it drops all the
> dirty buffers and all pending sync requests and then before it
> actually removes the directory it gets interrupted and now you see the
> database directory on disk which is partially corrupted.  See below
> sequence of drop database
>
>
> dropdb()
> {
> ...
> DropDatabaseBuffers(db_id);
> ...
> ForgetDatabaseSyncRequests(db_id);
> ...
> RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
>
> WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
>  -- Inside this it can process the cancel query and get interrupted
> remove_dbtablespaces(db_id);
> ..
> }
>
> I reproduced the same error by inducing error just before
> WaitForProcSignalBarrier.
>
> postgres[14968]=# CREATE DATABASE a STRATEGY WAL_LOG ; drop database a;
> CREATE DATABASE
> ERROR:  XX000: test error
> LOCATION:  dropdb, dbcommands.c:1684
> postgres[14968]=# \c a
> connection to server on socket "/tmp/.s.PGSQL.5432" failed: PANIC:
> could not open critical system index 2662
> Previous connection kept
> postgres[14968]=#

So basically, from this we can say it is completely a problem with
drop databases, I mean I can produce any behavior by interrupting drop
database
1. If we created some tables/inserted data and the drop database got
cancelled, it might have a database directory and those objects are
lost.
2.  Or you can even drop the database directory and then get cancelled
before deleting the pg_database entry then also you will end up with a
corrupted database, doesn't matter whether you created it with WAL LOG
or FILE COPY.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Another version of the patch which closes the smgr at the end using
> smgrcloserellocator() and I have also added a commit message.

I have reviewed this patch and I don't see a problem with it. However,
it would be nice if Andres or someone else who understands this area
well (Tom? Thomas?) would also review it, because I also reviewed
what's in the tree now and that turns out to be buggy, which leads me
to conclude that I don't understand this area as well as would be
desirable.

I'm inclined to hold off on committing this until next week, not only
for that reason, but also because there's a wrap planned on Monday,
and committing anything now seems like it might have too much of a
risk of upsetting that plan.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Justin Pryzby
Date:
On Thu, Aug 04, 2022 at 04:07:01PM -0400, Robert Haas wrote:
> I'm inclined to hold off on committing this until next week, not only

+1

I don't see any reason to hurry to fix problems that occur when DROP DATABASE
is interrupted.

Sorry to beat up your patches so much and for such crappy test cases^C

-- 
Justin



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Thu, Aug 04, 2022 at 04:07:01PM -0400, Robert Haas wrote:
>> I'm inclined to hold off on committing this until next week, not only

> +1

+1 ... there are some other v15 open items that I don't think we'll
see fixed for beta3, either.

            regards, tom lane



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-08-03 16:45:23 +0530, Dilip Kumar wrote:
> Another version of the patch which closes the smgr at the end using
> smgrcloserellocator() and I have also added a commit message.

What's the motivation behind the explicit close?


> @@ -258,8 +258,8 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
>      Page        page;
>      List       *rlocatorlist = NIL;
>      LockRelId    relid;
> -    Relation    rel;
>      Snapshot    snapshot;
> +    SMgrRelation    smgr;
>      BufferAccessStrategy bstrategy;
>  
>      /* Get pg_class relfilenumber. */
> @@ -276,16 +276,9 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
>      rlocator.dbOid = dbid;
>      rlocator.relNumber = relfilenumber;
>  
> -    /*
> -     * We can't use a real relcache entry for a relation in some other
> -     * database, but since we're only going to access the fields related to
> -     * physical storage, a fake one is good enough. If we didn't do this and
> -     * used the smgr layer directly, we would have to worry about
> -     * invalidations.
> -     */
> -    rel = CreateFakeRelcacheEntry(rlocator);
> -    nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
> -    FreeFakeRelcacheEntry(rel);
> +    smgr = smgropen(rlocator, InvalidBackendId);
> +    nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
> +    smgrclose(smgr);

Why are you opening and then closing again? Part of the motivation for the
question is that a local SMgrRelation variable may lead to it being used
further, opening up interrupt processing issues.


> +    rlocator.locator = src_rlocator;
> +    smgrcloserellocator(rlocator);
> +
> +    rlocator.locator = dst_rlocator;
> +    smgrcloserellocator(rlocator);

As mentioned above, it's not clear to me why this is now done...

Otherwise looks good to me.

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-08-04 16:07:01 -0400, Robert Haas wrote:
> On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Another version of the patch which closes the smgr at the end using
> > smgrcloserellocator() and I have also added a commit message.
> 
> I have reviewed this patch and I don't see a problem with it. However,
> it would be nice if Andres or someone else who understands this area
> well (Tom? Thomas?) would also review it, because I also reviewed
> what's in the tree now and that turns out to be buggy, which leads me
> to conclude that I don't understand this area as well as would be
> desirable.

I don't think this issue is something I'd have caught "originally"
either. It's probably more of a "fake relcache entry" issue (or undocumented
limitation) than a bug in the new code.

I did a quick review upthread - some minor quibbles aside, I think it looks
good.


> I'm inclined to hold off on committing this until next week, not only
> for that reason, but also because there's a wrap planned on Monday,
> and committing anything now seems like it might have too much of a
> risk of upsetting that plan.

Makes sense. Unlikely to be a blocker for anybody.

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-08-04 16:38:35 +0530, Dilip Kumar wrote:
> So basically, from this we can say it is completely a problem with
> drop databases, I mean I can produce any behavior by interrupting drop
> database
> 1. If we created some tables/inserted data and the drop database got
> cancelled, it might have a database directory and those objects are
> lost.
> 2.  Or you can even drop the database directory and then get cancelled
> before deleting the pg_database entry then also you will end up with a
> corrupted database, doesn't matter whether you created it with WAL LOG
> or FILE COPY.

Yea. I think at the very least we need to start holding interrupts before the
DropDatabaseBuffers() and do so until commit. That's probably best done by
doing the transaction commit inside dropdb.

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I have reviewed this patch and I don't see a problem with it. However,
> it would be nice if Andres or someone else who understands this area
> well (Tom? Thomas?) would also review it, because I also reviewed
> what's in the tree now and that turns out to be buggy, which leads me
> to conclude that I don't understand this area as well as would be
> desirable.

FWIW, I approve of getting rid of the use of CreateFakeRelcacheEntry
here, because I do not think that mechanism is meant to be used
outside of WAL replay.  However, this patch fails to remove it from
CreateAndCopyRelationData, which seems likely to be just as much
at risk.

The "invalidation" comment bothered me for awhile, but I think it's
fine: we know that no other backend can connect to the source DB
because we have it locked, and we know that no other backend can
connect to the destination DB because it doesn't exist yet according
to the catalogs, so nothing could possibly occur to invalidate our
idea of where the physical files are.  It would be nice to document
these assumptions, though, rather than merely remove all the relevant
commentary.

While I'm at it, I would like to strenuously object to the current
framing of CreateAndCopyRelationData as a general-purpose copying
mechanism.  Because of the above assumptions, I think it's utterly
unsafe to use anywhere except in CREATE DATABASE.  The header comment
fails to warn about that at all, and placing it in bufmgr.c rather
than static in dbcommands.c is just an invitation to future misuse.
Perhaps I'm overly sensitive to that because I just finished cleaning
up somebody's misuse of non-general-purpose code (1aa8dad41), but
as this stands I think it's positively dangerous.

            regards, tom lane



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Yea. I think at the very least we need to start holding interrupts before the
> DropDatabaseBuffers() and do so until commit. That's probably best done by
> doing the transaction commit inside dropdb.

We've talked before about ignoring interrupts across commit, but
I find the idea a bit scary.  In any case, DROP DATABASE is far
from the only place with a problem.

            regards, tom lane



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Justin Pryzby
Date:
On Thu, Aug 04, 2022 at 06:02:50PM -0400, Tom Lane wrote:
> The "invalidation" comment bothered me for awhile, but I think it's
> fine: we know that no other backend can connect to the source DB
> because we have it locked,

About that - is it any problem that the currently-connected db can be used as a
template?  It's no issue for 2-phase commit, because "create database" cannot
run in an txn.

-- 
Justin



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-08-04 18:05:25 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Yea. I think at the very least we need to start holding interrupts before the
> > DropDatabaseBuffers() and do so until commit. That's probably best done by
> > doing the transaction commit inside dropdb.
> 
> We've talked before about ignoring interrupts across commit, but
> I find the idea a bit scary.

I'm not actually suggesting to do so across commit, just until the
commit. Maintaining that seems easiest if dropdb() does the commit internally.


> In any case, DROP DATABASE is far from the only place with a problem.

What other place has a database corrupting potential of this magnitude just
because interrupts are accepted?  We throw valid s_b contents away and then
accept interrupts before committing - with predictable results. We also accept
interrupts as part of deleting the db data dir (due to catalog access).

Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tom Lane
Date:
I wrote:
> While I'm at it, I would like to strenuously object to the current
> framing of CreateAndCopyRelationData as a general-purpose copying
> mechanism.

And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
not completely broken?

        /* Read block from source relation. */
        srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
                                           RBM_NORMAL, bstrategy_src,
                                           permanent);
        srcPage = BufferGetPage(srcBuf);
        if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
        {
            ReleaseBuffer(srcBuf);
            continue;
        }

        /* Use P_NEW to extend the destination relation. */
        dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
                                           RBM_NORMAL, bstrategy_dst,
                                           permanent);

You can't skip pages just because they are empty.  Well, maybe you could
if you were doing something to ensure that you zero-fill the corresponding
blocks on the destination side.  But this isn't doing that.  It's using
P_NEW for dstBuf, which will have the effect of silently collapsing out
such pages.  Maybe in isolation a heap could withstand that, but its
indexes won't be happy (and I guess t_ctid chain links won't either).

I think you should just lose the if() stanza.  There's no optimization to
be had here that's worth any extra complication.

(This seems worth fixing before beta3, as it looks like a rather
nasty data corruption hazard.)

            regards, tom lane



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tom Lane
Date:
I wrote:
> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
> not completely broken?

[pile^2]  Also, what is the rationale for locking the target buffer
but not the source buffer?  That seems pretty hard to justify from
here, even granting the assumption that we don't expect any other
processes to be interested in these buffers (which I don't grant,
because checkpointer).

            regards, tom lane



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-08-04 19:01:06 -0400, Tom Lane wrote:
> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
> not completely broken?
> 
>         /* Read block from source relation. */
>         srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
>                                            RBM_NORMAL, bstrategy_src,
>                                            permanent);
>         srcPage = BufferGetPage(srcBuf);
>         if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
>         {
>             ReleaseBuffer(srcBuf);
>             continue;
>         }
> 
>         /* Use P_NEW to extend the destination relation. */
>         dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
>                                            RBM_NORMAL, bstrategy_dst,
>                                            permanent);
> 
> You can't skip pages just because they are empty.  Well, maybe you could
> if you were doing something to ensure that you zero-fill the corresponding
> blocks on the destination side.  But this isn't doing that.  It's using
> P_NEW for dstBuf, which will have the effect of silently collapsing out
> such pages.  Maybe in isolation a heap could withstand that, but its
> indexes won't be happy (and I guess t_ctid chain links won't either).
> 
> I think you should just lose the if() stanza.  There's no optimization to
> be had here that's worth any extra complication.
> 
> (This seems worth fixing before beta3, as it looks like a rather
> nasty data corruption hazard.)

Ugh, yes. And even with this fixed I think this should grow at least an
assertion that the block numbers match, probably even an elog.

Greetings,

Andres



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-08-04 18:05:25 -0400, Tom Lane wrote:
>> In any case, DROP DATABASE is far from the only place with a problem.

> What other place has a database corrupting potential of this magnitude just
> because interrupts are accepted?  We throw valid s_b contents away and then
> accept interrupts before committing - with predictable results. We also accept
> interrupts as part of deleting the db data dir (due to catalog access).

Those things would be better handled by moving the data-discarding
steps to post-commit.  Maybe that argues for having an internal
commit halfway through DROP DATABASE: remove pg_database row,
commit, start new transaction, clean up.

            regards, tom lane



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-08-04 19:01:06 -0400, Tom Lane wrote:
>> (This seems worth fixing before beta3, as it looks like a rather
>> nasty data corruption hazard.)

> Ugh, yes. And even with this fixed I think this should grow at least an
> assertion that the block numbers match, probably even an elog.

Yeah, the assumption that P_NEW would automatically match the source block
was making me itchy too.  An explicit test-and-elog seems worth the
cycles.

            regards, tom lane



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On August 4, 2022 4:11:13 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>I wrote:
>> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
>> not completely broken?
>
>[pile^2]  Also, what is the rationale for locking the target buffer
>but not the source buffer?  That seems pretty hard to justify from
>here, even granting the assumption that we don't expect any other
>processes to be interested in these buffers (which I don't grant,
>because checkpointer).

I'm not arguing it's good or should stay that way, but it's probably okayish that checkpointer / bgwriter have access,
giventhat they will never modify buffers. They just take a lock to prevent concurrent modifications, which
RelationCopyStorageUsingBufferhopefully doesn't do.  

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On August 4, 2022 4:20:16 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Yeah, the assumption that P_NEW would automatically match the source block
>was making me itchy too.  An explicit test-and-elog seems worth the
>cycles.

Is there a good reason to rely on P_NEW at all? Both from an efficiency and robustness POV it seems like it'd be better
touse smgrextend to bulk extend just after smgrcreate() and then fill s_b u using RBM_ZERO (or whatever it is called).
Thatbulk smgrextend would later be a good point to use fallocate so the FS can immediately size the file correctly,
withouta lot of pointless writes of zeroes.  

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On August 4, 2022 4:11:13 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> [pile^2]  Also, what is the rationale for locking the target buffer
>> but not the source buffer?  That seems pretty hard to justify from
>> here, even granting the assumption that we don't expect any other
>> processes to be interested in these buffers (which I don't grant,
>> because checkpointer).

> I'm not arguing it's good or should stay that way, but it's probably okayish that checkpointer / bgwriter have
access,given that they will never modify buffers. They just take a lock to prevent concurrent modifications, which
RelationCopyStorageUsingBufferhopefully doesn't do.  

I'm not arguing that it's actively broken today --- but AFAIR,
every other access to a shared buffer takes a buffer lock.
It does not seem to me to be very future-proof for this code to
decide it's exempt from that rule, without so much as a comment
justifying it.  Furthermore, what's the gain?  We aren't expecting
contention here, I think.  If we were, then it probably *would* be
actively broken.

            regards, tom lane



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Is there a good reason to rely on P_NEW at all? Both from an efficiency
> and robustness POV it seems like it'd be better to use smgrextend to
> bulk extend just after smgrcreate() and then fill s_b u using RBM_ZERO
> (or whatever it is called).  That bulk smgrextend would later be a good
> point to use fallocate so the FS can immediately size the file
> correctly, without a lot of pointless writes of zeroes.

Hmm ... makes sense.  We'd be using smgrextend to write just the last page
of the file, relying on its API spec "Note that we assume writing a block
beyond current EOF causes intervening file space to become filled with
zeroes".  I don't know that we're using that assumption aggressively
today, but as long as it doesn't confuse the kernel it'd probably be fine.

            regards, tom lane



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Aug 4, 2022 at 6:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I have reviewed this patch and I don't see a problem with it. However,
> > it would be nice if Andres or someone else who understands this area
> > well (Tom? Thomas?) would also review it, because I also reviewed
> > what's in the tree now and that turns out to be buggy, which leads me
> > to conclude that I don't understand this area as well as would be
> > desirable.
>
> FWIW, I approve of getting rid of the use of CreateFakeRelcacheEntry
> here, because I do not think that mechanism is meant to be used
> outside of WAL replay.  However, this patch fails to remove it from
> CreateAndCopyRelationData, which seems likely to be just as much
> at risk.

It looks to me like it does?

> The "invalidation" comment bothered me for awhile, but I think it's
> fine: we know that no other backend can connect to the source DB
> because we have it locked, and we know that no other backend can
> connect to the destination DB because it doesn't exist yet according
> to the catalogs, so nothing could possibly occur to invalidate our
> idea of where the physical files are.  It would be nice to document
> these assumptions, though, rather than merely remove all the relevant
> commentary.

I don't think that's the point. We could always suffer a sinval reset
or a PROCSIGNAL_BARRIER_SMGRRELEASE. But since the code avoids ever
reusing the smgr, it should be OK. I think.

> While I'm at it, I would like to strenuously object to the current
> framing of CreateAndCopyRelationData as a general-purpose copying
> mechanism.  Because of the above assumptions, I think it's utterly
> unsafe to use anywhere except in CREATE DATABASE.  The header comment
> fails to warn about that at all, and placing it in bufmgr.c rather
> than static in dbcommands.c is just an invitation to future misuse.
> Perhaps I'm overly sensitive to that because I just finished cleaning
> up somebody's misuse of non-general-purpose code (1aa8dad41), but
> as this stands I think it's positively dangerous.

OK. No objection to you revising the comments however you feel is appropriate.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Aug 4, 2022 at 7:01 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
> not completely broken?

Ouch. That's pretty bad.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Aug 4, 2022 at 7:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [pile^2]  Also, what is the rationale for locking the target buffer
> but not the source buffer?  That seems pretty hard to justify from
> here, even granting the assumption that we don't expect any other
> processes to be interested in these buffers (which I don't grant,
> because checkpointer).

Ooph. I agree.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Aug 5, 2022 at 4:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > While I'm at it, I would like to strenuously object to the current
> > framing of CreateAndCopyRelationData as a general-purpose copying
> > mechanism.
>
> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
> not completely broken?
>
>         /* Read block from source relation. */
>         srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
>                                            RBM_NORMAL, bstrategy_src,
>                                            permanent);
>         srcPage = BufferGetPage(srcBuf);
>         if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
>         {
>             ReleaseBuffer(srcBuf);
>             continue;
>         }
>
>         /* Use P_NEW to extend the destination relation. */
>         dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
>                                            RBM_NORMAL, bstrategy_dst,
>                                            permanent);
>
> You can't skip pages just because they are empty.  Well, maybe you could
> if you were doing something to ensure that you zero-fill the corresponding
> blocks on the destination side.  But this isn't doing that.  It's using
> P_NEW for dstBuf, which will have the effect of silently collapsing out
> such pages.  Maybe in isolation a heap could withstand that, but its
> indexes won't be happy (and I guess t_ctid chain links won't either).
>
> I think you should just lose the if() stanza.  There's no optimization to
> be had here that's worth any extra complication.
>
> (This seems worth fixing before beta3, as it looks like a rather
> nasty data corruption hazard.)

Yeah this is broken.

--
Dilip
-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Aug 5, 2022 at 5:36 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On August 4, 2022 4:20:16 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >Yeah, the assumption that P_NEW would automatically match the source block
> >was making me itchy too.  An explicit test-and-elog seems worth the
> >cycles.
>
> Is there a good reason to rely on P_NEW at all?

I think there were 2 arguments for which we used bufmgr instead of
smgrextend for the destination database

1) (Comment from Andres) The big benefit would be that the *target*
database does not have to be written out / shared buffer is
immediately populated. [1]
2) (Comment from Robert) We wanted to avoid writing new code which
bypasses the shared buffers.

[1]https://www.postgresql.org/message-id/20210905202800.ji4fnfs3xzhvo7l6%40alap3.anarazel.de

 Both from an efficiency and robustness POV it seems like it'd be
better to use smgrextend to bulk extend just after smgrcreate() and
then fill s_b u using RBM_ZERO (or whatever it is called).  That bulk
smgrextend would later be a good point to use fallocate so the FS can
immediately size the file correctly, without a lot of pointless writes
of zeroes.

Yeah okay, so you mean since we already know the nblocks in the source
file so we can directly do smgrextend in bulk before the copy loop and
then we can just copy block by block using bufmgr with proper blkno
instead of P_NEW.  Yeah I think this looks optimized to me and this
will take care of the above 2 points I mentioned that we will still
have the target database pages in the shared buffers and we are not
bypassing the shared buffers also.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Aug 5, 2022 at 2:59 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-08-03 16:45:23 +0530, Dilip Kumar wrote:
> > Another version of the patch which closes the smgr at the end using
> > smgrcloserellocator() and I have also added a commit message.
>
> What's the motivation behind the explicit close?
>
>
> > @@ -258,8 +258,8 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
> >       Page            page;
> >       List       *rlocatorlist = NIL;
> >       LockRelId       relid;
> > -     Relation        rel;
> >       Snapshot        snapshot;
> > +     SMgrRelation    smgr;
> >       BufferAccessStrategy bstrategy;
> >
> >       /* Get pg_class relfilenumber. */
> > @@ -276,16 +276,9 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
> >       rlocator.dbOid = dbid;
> >       rlocator.relNumber = relfilenumber;
> >
> > -     /*
> > -      * We can't use a real relcache entry for a relation in some other
> > -      * database, but since we're only going to access the fields related to
> > -      * physical storage, a fake one is good enough. If we didn't do this and
> > -      * used the smgr layer directly, we would have to worry about
> > -      * invalidations.
> > -      */
> > -     rel = CreateFakeRelcacheEntry(rlocator);
> > -     nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
> > -     FreeFakeRelcacheEntry(rel);
> > +     smgr = smgropen(rlocator, InvalidBackendId);
> > +     nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
> > +     smgrclose(smgr);
>
> Why are you opening and then closing again? Part of the motivation for the
> question is that a local SMgrRelation variable may lead to it being used
> further, opening up interrupt processing issues.

Yeah okay, I think there is no reason to close, in the previous
version I had like below and I think that's a better idea.

nblocks = smgrnblocks(smgropen(rlocator, InvalidBackendId), MAIN_FORKNUM)

>
> > +     rlocator.locator = src_rlocator;
> > +     smgrcloserellocator(rlocator);
> > +
> > +     rlocator.locator = dst_rlocator;
> > +     smgrcloserellocator(rlocator);
>
> As mentioned above, it's not clear to me why this is now done...
>
> Otherwise looks good to me.

Yeah maybe it is not necessary to close as these unowned smgr will
automatically get closed on the transaction end.  Actually the
previous person of the patch had both these comments fixed.  The
reason for explicitly closing it is that I have noticed that most of
the places we explicitly close the smgr where we do smgropen e.g.
index_copy_data(), heapam_relation_copy_data() OTOH some places we
don't close it e.g. IssuePendingWritebacks().   So now I think that in
our case better we do not close it because I do not like this specific
code at the end to close the smgr.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Yeah maybe it is not necessary to close as these unowned smgr will
> automatically get closed on the transaction end.  Actually the
> previous person of the patch had both these comments fixed.  The
> reason for explicitly closing it is that I have noticed that most of
> the places we explicitly close the smgr where we do smgropen e.g.
> index_copy_data(), heapam_relation_copy_data() OTOH some places we
> don't close it e.g. IssuePendingWritebacks().   So now I think that in
> our case better we do not close it because I do not like this specific
> code at the end to close the smgr.

PFA patches for different problems discussed in the thread

0001 - Fix the problem of skipping the empty block and buffer lock on
source buffer
0002 - Remove fake relcache entry (same as 0001-BugfixInWalLogCreateDB.patch)
0003 - Optimization to avoid extending block by block

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-08-04 19:14:08 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-08-04 18:05:25 -0400, Tom Lane wrote:
> >> In any case, DROP DATABASE is far from the only place with a problem.
> 
> > What other place has a database corrupting potential of this magnitude just
> > because interrupts are accepted?  We throw valid s_b contents away and then
> > accept interrupts before committing - with predictable results. We also accept
> > interrupts as part of deleting the db data dir (due to catalog access).
> 
> Those things would be better handled by moving the data-discarding
> steps to post-commit.  Maybe that argues for having an internal
> commit halfway through DROP DATABASE: remove pg_database row,
> commit, start new transaction, clean up.

That'd still require holding interrupts, I think. We shouldn't accept
interrupts until the on-disk data is actually deleted.


In theory I think we should have a pg_database column indicating whether the
database is valid or not. For database creation, insert the pg_database row
with valid=false, commit, then do the filesystem operation, then mark as
valid, commit. For database drop, mark as invalid, commit, remove filesystem
stuff, delete row, commit.  With dropdb allowed against an invalid database,
but obviously nothing else.  But clearly this isn't a short term /
backpatchable thing.


Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tom Lane
Date:
Dilip Kumar <dilipbalaut@gmail.com> writes:
> PFA patches for different problems discussed in the thread

> 0001 - Fix the problem of skipping the empty block and buffer lock on
> source buffer
> 0002 - Remove fake relcache entry (same as 0001-BugfixInWalLogCreateDB.patch)
> 0003 - Optimization to avoid extending block by block

I pushed 0001, because it seems fairly critical to get that in before
beta3.  The others can stand more leisurely discussion.

I note from
https://coverage.postgresql.org/src/backend/storage/buffer/bufmgr.c.gcov.html
that the block-skipping path is actually taken in our tests (this won't be
visible there for very much longer of course).  So we actually *are*
making a corrupt copy, and we haven't noticed.  This is perhaps not too
surprising, because the only test case that I can find is in
020_createdb.pl:

$node->issues_sql_like(
    [ 'createdb', '-T', 'foobar2', '-S', 'wal_log', 'foobar6' ],
    qr/statement: CREATE DATABASE foobar6 STRATEGY wal_log TEMPLATE foobar2/,
    'create database with WAL_LOG strategy');

which is, um, not exactly a robust test of whether anything happened
at all, let alone whether it was correct.  I'm not real sure that
this test would even notice if the CREATE reported failure.

            regards, tom lane



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Tom Lane
Date:
Dilip Kumar <dilipbalaut@gmail.com> writes:
> On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> Yeah maybe it is not necessary to close as these unowned smgr will
>> automatically get closed on the transaction end.

I do not think this is a great idea for the per-relation smgrs created
during RelationCopyStorageUsingBuffer.  Yeah, they'll be mopped up at
transaction end, but that doesn't mean that creating possibly tens of
thousands of transient smgrs isn't going to cause performance issues.

I think RelationCopyStorageUsingBuffer needs to open and then close
the smgrs it uses, which means that ReadBufferWithoutRelcache is not the
appropriate API for it to use, either; need to go down another level.

            regards, tom lane



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Sat, Aug 6, 2022 at 9:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Dilip Kumar <dilipbalaut@gmail.com> writes:
> > On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >> Yeah maybe it is not necessary to close as these unowned smgr will
> >> automatically get closed on the transaction end.
>
> I do not think this is a great idea for the per-relation smgrs created
> during RelationCopyStorageUsingBuffer.  Yeah, they'll be mopped up at
> transaction end, but that doesn't mean that creating possibly tens of
> thousands of transient smgrs isn't going to cause performance issues.

Okay, so for that we can simply call smgrcloserellocator(rlocator);
before exiting the RelationCopyStorageUsingBuffer() right?

> I think RelationCopyStorageUsingBuffer needs to open and then close
> the smgrs it uses, which means that ReadBufferWithoutRelcache is not the
> appropriate API for it to use, either; need to go down another level.

Not sure how going down another level would help, the whole point is
that we don't want to keep the reference of the smgr for a long time
especially in the loop which is interruptible.  So everytime we need
smgr we can call smgropen and if it is already in the smgr cache then
we will get it from there.  So I think it makes sense that when we are
exiting the function that time we can just call smgrcloserellocator()
so that if it is opened it will be closed and otherwise it will do
nothing.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Andres Freund
Date:
Hi,

On 2022-08-07 09:24:40 +0530, Dilip Kumar wrote:
> On Sat, Aug 6, 2022 at 9:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Dilip Kumar <dilipbalaut@gmail.com> writes:
> > > On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >> Yeah maybe it is not necessary to close as these unowned smgr will
> > >> automatically get closed on the transaction end.
> >
> > I do not think this is a great idea for the per-relation smgrs created
> > during RelationCopyStorageUsingBuffer.  Yeah, they'll be mopped up at
> > transaction end, but that doesn't mean that creating possibly tens of
> > thousands of transient smgrs isn't going to cause performance issues.

I was assuming that the files would get reopened at the end of the transaction
anyway, but it looks like that's not the case, unless wal_level=minimal.

Hm. CreateAndCopyRelationData() calls RelationCreateStorage() with
register_delete = false, which is ok because createdb_failure_callback will
clean things up. But that's another thing that's not great for a routine with
a general name...


> Okay, so for that we can simply call smgrcloserellocator(rlocator);
> before exiting the RelationCopyStorageUsingBuffer() right?

Yea, I think so.


> > I think RelationCopyStorageUsingBuffer needs to open and then close
> > the smgrs it uses, which means that ReadBufferWithoutRelcache is not the
> > appropriate API for it to use, either; need to go down another level.
> 
> Not sure how going down another level would help, the whole point is
> that we don't want to keep the reference of the smgr for a long time
> especially in the loop which is interruptible.

Yea, I'm not following either.


Greetings,

Andres Freund



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Sun, Aug 7, 2022 at 9:47 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-08-07 09:24:40 +0530, Dilip Kumar wrote:
> > On Sat, Aug 6, 2022 at 9:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >
> > > Dilip Kumar <dilipbalaut@gmail.com> writes:
> > > > On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >> Yeah maybe it is not necessary to close as these unowned smgr will
> > > >> automatically get closed on the transaction end.
> > >
> > > I do not think this is a great idea for the per-relation smgrs created
> > > during RelationCopyStorageUsingBuffer.  Yeah, they'll be mopped up at
> > > transaction end, but that doesn't mean that creating possibly tens of
> > > thousands of transient smgrs isn't going to cause performance issues.
>
> I was assuming that the files would get reopened at the end of the transaction
> anyway, but it looks like that's not the case, unless wal_level=minimal.
>
> Hm. CreateAndCopyRelationData() calls RelationCreateStorage() with
> register_delete = false, which is ok because createdb_failure_callback will
> clean things up. But that's another thing that's not great for a routine with
> a general name...
>
>
> > Okay, so for that we can simply call smgrcloserellocator(rlocator);
> > before exiting the RelationCopyStorageUsingBuffer() right?
>
> Yea, I think so.

Done, along with that, I have also got the hunk of smgropen and
smgrclose in ScanSourceDatabasePgClass() which I had in v1 patch[1].
Because here we do not want to reuse the smgr of the pg_class again so
instead of closing at the end with smgrcloserellocator() we can just
keep the smgr reference and close immediately after getting the number
of blocks.  Whereas in CreateAndCopyRelationData and
RelationCopyStorageUsingBuffer() we are using the smgr of the source
and dest relation multiple time so it make sense to not close it
immediately and we can close while exiting the function with
smgrcloserellocator().

[1]
+ smgr = smgropen(rlocator, InvalidBackendId);
+ nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
+ smgrclose(smgr);



-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Wed, Aug 10, 2022 at 1:01 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Done, along with that, I have also got the hunk of smgropen and
> smgrclose in ScanSourceDatabasePgClass() which I had in v1 patch[1].
> Because here we do not want to reuse the smgr of the pg_class again so
> instead of closing at the end with smgrcloserellocator() we can just
> keep the smgr reference and close immediately after getting the number
> of blocks.  Whereas in CreateAndCopyRelationData and
> RelationCopyStorageUsingBuffer() we are using the smgr of the source
> and dest relation multiple time so it make sense to not close it
> immediately and we can close while exiting the function with
> smgrcloserellocator().

As far as I know, this 0001 addresses all outstanding comments and
fixes the reported bug.

Does anyone think otherwise?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Thu, Aug 11, 2022 at 2:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
> As far as I know, this 0001 addresses all outstanding comments and
> fixes the reported bug.
>
> Does anyone think otherwise?

If they do, they're keeping quiet, so I committed this and
back-patched it to v15.

Regarding 0002 -- should it, perhaps, use PGAlignedBlock?

Although 0002 is formally a performance optimization, I'm inclined to
think that if we're going to commit it, it should also be back-patched
into v15, because letting the code diverge when we're not even out of
beta yet seems painful.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Aug 12, 2022 at 6:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Aug 11, 2022 at 2:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > As far as I know, this 0001 addresses all outstanding comments and
> > fixes the reported bug.
> >
> > Does anyone think otherwise?
>
> If they do, they're keeping quiet, so I committed this and
> back-patched it to v15.
>
> Regarding 0002 -- should it, perhaps, use PGAlignedBlock?

Yes we can do that, although here we are not using this buffer
directly as a "Page" so we do not have any real alignment issue but I
do not see any problem in using PGAlignedBlock so change that.

> Although 0002 is formally a performance optimization, I'm inclined to
> think that if we're going to commit it, it should also be back-patched
> into v15, because letting the code diverge when we're not even out of
> beta yet seems painful.

Yeah that makes sense to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Robert Haas
Date:
On Wed, Aug 17, 2022 at 12:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Although 0002 is formally a performance optimization, I'm inclined to
> > think that if we're going to commit it, it should also be back-patched
> > into v15, because letting the code diverge when we're not even out of
> > beta yet seems painful.
>
> Yeah that makes sense to me.

OK, done.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Justin Pryzby
Date:
On Tue, Aug 02, 2022 at 12:50:43PM -0500, Justin Pryzby wrote:
> Also, if I understand correctly, this patch seems to assume that nobody is
> connected to the source database.  But what's actually enforced is just that
> nobody *else* is connected.  Is it any issue that the current DB can be used as
> a source?  Anyway, both of the above problems are reproducible using a
> different database.
> 
> |postgres=# CREATE DATABASE new TEMPLATE postgres STRATEGY wal_log;
> |CREATE DATABASE

On Thu, Aug 04, 2022 at 05:16:04PM -0500, Justin Pryzby wrote:
> On Thu, Aug 04, 2022 at 06:02:50PM -0400, Tom Lane wrote:
> > The "invalidation" comment bothered me for awhile, but I think it's
> > fine: we know that no other backend can connect to the source DB
> > because we have it locked,
> 
> About that - is it any problem that the currently-connected db can be used as a
> template?  It's no issue for 2-phase commit, because "create database" cannot
> run in an txn.

Would anybody want to comment on this ?
Is it okay that the *current* DB can be used as a template ?



Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
Dilip Kumar
Date:
On Fri, Sep 2, 2022 at 5:25 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Tue, Aug 02, 2022 at 12:50:43PM -0500, Justin Pryzby wrote:
> > Also, if I understand correctly, this patch seems to assume that nobody is
> > connected to the source database.  But what's actually enforced is just that
> > nobody *else* is connected.  Is it any issue that the current DB can be used as
> > a source?  Anyway, both of the above problems are reproducible using a
> > different database.
> >
> > |postgres=# CREATE DATABASE new TEMPLATE postgres STRATEGY wal_log;
> > |CREATE DATABASE
>
> On Thu, Aug 04, 2022 at 05:16:04PM -0500, Justin Pryzby wrote:
> > On Thu, Aug 04, 2022 at 06:02:50PM -0400, Tom Lane wrote:
> > > The "invalidation" comment bothered me for awhile, but I think it's
> > > fine: we know that no other backend can connect to the source DB
> > > because we have it locked,
> >
> > About that - is it any problem that the currently-connected db can be used as a
> > template?  It's no issue for 2-phase commit, because "create database" cannot
> > run in an txn.
>
> Would anybody want to comment on this ?
> Is it okay that the *current* DB can be used as a template ?

I don't think there should be any problem with that.  The main problem
could have been that since we are reading the pg_class tuple block by
block there could be an issue if someone concurrently modifies the
pg_class or there are some tuples that are inserted by the prepared
transaction.  But in this case, the same backend can not have an open
prepared transaction while creating a database and that backend of
course can not perform any parallel operation as well.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



RE: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

From
"Yoshikazu Imai (Fujitsu)"
Date:
Hi.

While investigating the codes in RelationCopyStorageUsingBuffer, I wonder that
there is any reason to use RBM_NORMAL for acquiring destination buffer.
I think we can use RBM_ZERO_AND_LOCK here instead of RBM_NORMAL.

When we use RBM_NORMAL, a destination block is read by smgrread and it's
content is verified with PageIsVerifiedExtended in ReadBuffer_common.
A page verification normally succeeds because destination blocks are
zero-filled by using smgrextend, but do we trust that it is surely zero-filled?

On Fri, Aug 5, 2022 at 00:32 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Hmm ... makes sense.  We'd be using smgrextend to write just the last page
> of the file, relying on its API spec "Note that we assume writing a block
> beyond current EOF causes intervening file space to become filled with
> zeroes".  I don't know that we're using that assumption aggressively
> today, but as long as it doesn't confuse the kernel it'd probably be fine.

If there is a block which is not zero-filled, page verification would fail and
eventually CREATE DATABASE would fail.
(I also think that originally we don't need page verification for destination blocks
here because those blocks are just overwritten by source buffer.)

Also, from performance POV, I think it is good to use RBM_ZERO_AND_LOCK.
In RBM_NORMAL, destination blocks are read from disk by smgrread each time, but
in RBM_ZERO_AND_LOCK, we only set buffers zero-filled by MemSet and don't
access the disk which makes performance better.
If we expect the destination buffer is always zero-filled, we can use
RBM_ZERO_AND_LOCK.


Apart from above, there seems to be an old comment which is for the old codes
when acquiring destination buffer by using P_NEW.

"/* Use P_NEW to extend the destination relation. */"


--
Yoshikazu Imai

> -----Original Message-----
> From: Dilip Kumar <dilipbalaut@gmail.com>
> Sent: Friday, September 2, 2022 11:22 PM
> To: Justin Pryzby <pryzby@telsasoft.com>
> Cc: Robert Haas <robertmhaas@gmail.com>; Tom Lane <tgl@sss.pgh.pa.us>; Andres Freund <andres@anarazel.de>;
> Ashutosh Sharma <ashu.coek88@gmail.com>; Maciek Sakrejda <m.sakrejda@gmail.com>; Bruce Momjian
> <bruce@momjian.us>; Alvaro Herrera <alvherre@alvh.no-ip.org>; Andrew Dunstan <andrew@dunslane.net>; Heikki
> Linnakangas <hlinnaka@iki.fi>; pgsql-hackers@lists.postgresql.org; Thomas Munro <thomas.munro@gmail.com>
> Subject: Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
> 
> On Fri, Sep 2, 2022 at 5:25 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> >
> > On Tue, Aug 02, 2022 at 12:50:43PM -0500, Justin Pryzby wrote:
> > > Also, if I understand correctly, this patch seems to assume that
> > > nobody is connected to the source database.  But what's actually
> > > enforced is just that nobody *else* is connected.  Is it any issue
> > > that the current DB can be used as a source?  Anyway, both of the
> > > above problems are reproducible using a different database.
> > >
> > > |postgres=# CREATE DATABASE new TEMPLATE postgres STRATEGY wal_log;
> > > |CREATE DATABASE
> >
> > On Thu, Aug 04, 2022 at 05:16:04PM -0500, Justin Pryzby wrote:
> > > On Thu, Aug 04, 2022 at 06:02:50PM -0400, Tom Lane wrote:
> > > > The "invalidation" comment bothered me for awhile, but I think
> > > > it's
> > > > fine: we know that no other backend can connect to the source DB
> > > > because we have it locked,
> > >
> > > About that - is it any problem that the currently-connected db can
> > > be used as a template?  It's no issue for 2-phase commit, because
> > > "create database" cannot run in an txn.
> >
> > Would anybody want to comment on this ?
> > Is it okay that the *current* DB can be used as a template ?
> 
> I don't think there should be any problem with that.  The main problem could have been that since we are reading the
> pg_class tuple block by block there could be an issue if someone concurrently modifies the pg_class or there are
some
> tuples that are inserted by the prepared transaction.  But in this case, the same backend can not have an open
prepared
> transaction while creating a database and that backend of course can not perform any parallel operation as well.
> 
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>