Thread: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints
Currently, CREATE DATABASE forces a checkpoint, then copies all the files, then forces another checkpoint. The comments in the createdb() function explain the reasons for this. The attached patch fixes this problem by making CREATE DATABASE completely WAL-logged so that now we can avoid checkpoints. The patch modifies both CREATE DATABASE and ALTER DATABASE..SET TABLESPACE to be fully WAL-logged.

One main advantage of this change is that it will be cheaper. Forcing checkpoints on an idle system is no big deal, but when the system is under heavy write load, it's very expensive. Another advantage is that it makes things better for features like TDE, which might want the pages in the source database to be encrypted using a different key or nonce than the pages in the target database.

Design Idea:
-----------------
First, create the target database directory along with the version file and WAL-log this operation. Create the "relation map file" in the target database and copy the content from the source database. For this, we can use some modified versions of the write_relmap_file() and WAL-log the relmap create operation along with the file content. Now, read the relmap file to find the relfilenode for pg_class and then we read pg_class block by block and decode the tuples. For reading the pg_class blocks, we can use ReadBufferWithoutRelcache() so that we don't need the relcache. Nothing prevents us from checking visibility for tuples in another database because CLOG is global to the cluster. And nothing prevents us from deforming those tuples because the column definitions for pg_class have to be the same in every database. Then we can get the relfilenode of every file we need to copy, and prepare a list of all such relfilenode.

Next, for each relfilenode in the source database, create a respective relfilenode in the target database (for all forks) using smgrcreate, which is already a WAL-logged operation. Now read the source relfilenode block by block using ReadBufferWithoutRelcache() and copy the block to the target relfilenode using smgrextend() and WAL-log them using log_newpage(). For the source database, we can not directly use the smgrread(), because there could be some dirty buffers so we will have to read them through the buffer manager interface, otherwise, we will have to flush all the dirty buffers.

(A sketch of the block-copy loop described above follows this message.)

WAL sequence using pg_waldump
----------------------------------------------------
1. (new wal to create db dir and write PG_VERSION file)
rmgr: Database desc: CREATE create dir 1663/16394

2. (new wal to create and write relmap file)
rmgr: RelMap desc: CREATE database 16394 tablespace 1663 size 512

3. (create relfilenode)
rmgr: Storage desc: CREATE base/16394/16384
rmgr: Storage desc: CREATE base/16394/2619

4. (write page data)
rmgr: XLOG desc: FPI , blkref #0: rel 1663/16394/2619 blk 0 FPW
rmgr: XLOG desc: FPI , blkref #0: rel 1663/16394/2619 blk 1 FPW
............

5. (create other forks)
rmgr: Storage desc: CREATE base/16394/2619_fsm
rmgr: Storage desc: CREATE base/16394/2619_vm
.............

I have attached a POC patch, which shows this idea; with this patch, all basic sanity testing and "check-world" are passing.

Open points:
-------------------
- This is a POC patch, so it needs more refactoring/cleanup and testing.
- Might need to relook at the SMGR-level API usage.

Credits:
-----------
Thanks to Robert Haas for suggesting this idea and the high-level design.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
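[For illustration, a minimal sketch of the per-relation copy loop described in the proposal above. The helper name and surrounding setup are assumptions, not the patch's actual code; ReadBufferWithoutRelcache(), smgropen(), smgrnblocks(), smgrextend(), and log_newpage() are existing PostgreSQL functions (bufmgr.h, smgr.h, xloginsert.h).]

/*
 * Minimal sketch (assumed helper, not the actual patch code): copy one
 * fork of a source relfilenode into a target relfilenode that has already
 * been created with smgrcreate(), reading the source through the buffer
 * manager (so dirty buffers are seen) and WAL-logging every target page.
 */
static void
copy_fork_wal_logged(RelFileNode src, RelFileNode dst, ForkNumber forkNum)
{
	SMgrRelation srcsmgr = smgropen(src, InvalidBackendId);
	SMgrRelation dstsmgr = smgropen(dst, InvalidBackendId);
	BlockNumber nblocks = smgrnblocks(srcsmgr, forkNum);
	BlockNumber blkno;

	for (blkno = 0; blkno < nblocks; blkno++)
	{
		Buffer		buf;
		PGAlignedBlock pagecopy;

		/* Read the source block via shared buffers, not smgrread(). */
		buf = ReadBufferWithoutRelcache(src, forkNum, blkno,
										RBM_NORMAL, NULL);
		LockBuffer(buf, BUFFER_LOCK_SHARE);
		memcpy(pagecopy.data, BufferGetPage(buf), BLCKSZ);
		UnlockReleaseBuffer(buf);

		/* WAL first, then the data write to the target file. */
		log_newpage(&dst, forkNum, blkno, (Page) pagecopy.data, true);
		smgrextend(dstsmgr, forkNum, blkno, pagecopy.data, false);
	}
}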
On Tue, Jun 15, 2021 at 04:50:24PM +0530, Dilip Kumar wrote:
> Currently, CREATE DATABASE forces a checkpoint, then copies all the
> files, then forces another checkpoint. The comments in the createdb()
> function explain the reasons for this. The attached patch fixes this
> problem by making CREATE DATABASE completely WAL-logged so that now we
> can avoid checkpoints. The patch modifies both CREATE DATABASE and
> ALTER DATABASE..SET TABLESPACE to be fully WAL-logged.
>
> One main advantage of this change is that it will be cheaper. Forcing
> checkpoints on an idle system is no big deal, but when the system is
> under heavy write load, it's very expensive. Another advantage is that
> it makes things better for features like TDE, which might want the
> pages in the source database to be encrypted using a different key or
> nonce than the pages in the target database.

I only had a quick look at the patch, but AFAICS your patch makes the new behavior mandatory. Wouldn't it make sense to have a way to use the previous approach? People wanting to copy a somewhat big database over a slow replication link may prefer to pay for 2 checkpoints rather than stream everything. Same for people who have an otherwise idle system (I often use that to make temporary backups and/or prepare multiple datasets, and most of the time the checkpoint is basically free).
On 15/06/2021 14:20, Dilip Kumar wrote:
> Design Idea:
> -----------------
> First, create the target database directory along with the version
> file and WAL-log this operation. Create the "relation map file" in
> the target database and copy the content from the source database. For
> this, we can use some modified versions of the write_relmap_file() and
> WAL-log the relmap create operation along with the file content. Now,
> read the relmap file to find the relfilenode for pg_class and then we
> read pg_class block by block and decode the tuples. For reading the
> pg_class blocks, we can use ReadBufferWithoutRelcache() so that we
> don't need the relcache. Nothing prevents us from checking visibility
> for tuples in another database because CLOG is global to the cluster.
> And nothing prevents us from deforming those tuples because the column
> definitions for pg_class have to be the same in every database. Then
> we can get the relfilenode of every file we need to copy, and prepare
> a list of all such relfilenode.

I guess that would work, but you could also walk the database directory like copydir() does. How you find the relations to copy is orthogonal to whether you WAL-log them or use checkpoints. And whether you use the buffer cache is also orthogonal to the rest of the proposal; you could issue FlushDatabaseBuffers() instead of a checkpoint.

> Next, for each relfilenode in the
> source database, create a respective relfilenode in the target
> database (for all forks) using smgrcreate, which is already a
> WAL-logged operation. Now read the source relfilenode block by block
> using ReadBufferWithoutRelcache() and copy the block to the target
> relfilenode using smgrextend() and WAL-log them using log_newpage().
> For the source database, we can not directly use the smgrread(),
> because there could be some dirty buffers so we will have to read them
> through the buffer manager interface, otherwise, we will have to flush
> all the dirty buffers.

Yeah, WAL-logging the contents of the source database would certainly be less weird than the current system. As Julien also pointed out, the question is, are there people relying on "CREATE DATABASE foo TEMPLATE bar" to copy a large source database, on the premise that it's fast because it skips WAL-logging?

In principle, we could have both mechanisms, and use the new WAL-logged system if the database is small, and the old system with checkpoints if it's large. But I don't like the idea of having to maintain both.

- Heikki
On Tue, Jun 15, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 15/06/2021 14:20, Dilip Kumar wrote:
> > Design Idea:
> > [...] Then
> > we can get the relfilenode of every file we need to copy, and prepare
> > a list of all such relfilenode.
>
> I guess that would work, but you could also walk the database directory
> like copydir() does. How you find the relations to copy is orthogonal to
> whether you WAL-log them or use checkpoints. And whether you use the
> buffer cache is also orthogonal to the rest of the proposal; you could
> issue FlushDatabaseBuffers() instead of a checkpoint.

Yeah, that would also work, but I thought since we are already avoiding the checkpoint, let's avoid FlushDatabaseBuffers() also and directly use the lower level buffer manager API which doesn't need the relcache. And I am using pg_class to identify the useful relfilenode so that we can avoid processing some unwanted relfilenode but yeah I agree that this is orthogonal to whether we use checkpoint or not.

> Yeah, WAL-logging the contents of the source database would certainly be
> less weird than the current system. As Julien also pointed out, the
> question is, are there people relying on "CREATE DATABASE foo TEMPLATE
> bar" to copy a large source database, on the premise that it's fast
> because it skips WAL-logging?
>
> In principle, we could have both mechanisms, and use the new WAL-logged
> system if the database is small, and the old system with checkpoints if
> it's large. But I don't like the idea of having to maintain both.

Yeah, I agree in some cases, where we don't have many dirty buffers, checkpointing can be faster. I think code-wise, maintaining two approaches will not be a very difficult job because the old approach just calls copydir(), but I am thinking about how we can decide which approach is better in which scenario. I don't think we can make that call just based on the database size; it would also depend upon many other factors, e.g. how busy your system is and how many total dirty buffers there are in the cluster, right? Because a checkpoint will affect the performance of operations going on in other databases in the cluster.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Am I mistaken in thinking that this would allow CREATE DATABASE to run inside a transaction block now, further reducing the DDL commands that are non-transactional?
On 6/15/21 8:04 AM, Heikki Linnakangas wrote:
>
> Yeah, WAL-logging the contents of the source database would certainly
> be less weird than the current system. As Julien also pointed out, the
> question is, are there people relying on "CREATE DATABASE foo TEMPLATE
> bar" to copy a large source database, on the premise that it's fast
> because it skips WAL-logging?

I'm 100% certain there are. It's not even a niche case.

> In principle, we could have both mechanisms, and use the new
> WAL-logged system if the database is small, and the old system with
> checkpoints if it's large. But I don't like the idea of having to
> maintain both.

Rather than use size, I'd be inclined to say use this if the source database is marked as a template, and use the copydir approach for anything that isn't.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Tue, Jun 15, 2021 at 9:31 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> Rather than use size, I'd be inclined to say use this if the source
> database is marked as a template, and use the copydir approach for
> anything that isn't.

Looks like a good approach.
At Tue, 15 Jun 2021 22:07:32 +0800, Julien Rouhaud <rjuju123@gmail.com> wrote in
> On Tue, Jun 15, 2021 at 9:31 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> >
> > Rather than use size, I'd be inclined to say use this if the source
> > database is marked as a template, and use the copydir approach for
> > anything that isn't.
>
> Looks like a good approach.

If we are willing to maintain the two methods, couldn't we just skip the checkpoints if the database is known to be "clean", meaning no page of the database has been loaded since startup? We can use the "template" mark to reject connections to the database. (I'm afraid that we also should prevent vacuum from visiting the template databases, but...)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> Rather than use size, I'd be inclined to say use this if the source
> database is marked as a template, and use the copydir approach for
> anything that isn't.

Yeah, that is possible. Another thought: wouldn't it be good to give the user control by providing two different commands, e.g. COPY DATABASE for the existing method (copydir) and CREATE DATABASE for the new method (fully WAL-logged)?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,

On 2021-06-15 16:50:24 +0530, Dilip Kumar wrote:
> The patch modifies both CREATE DATABASE and ALTER DATABASE..SET
> TABLESPACE to be fully WAL-logged.

Generally quite a bit in favor of this - the current approach is very heavyweight, slow and I think we have a few open corner bugs related to it.

> Design Idea:
> -----------------
> First, create the target database directory along with the version
> file and WAL-log this operation.

What happens if you crash / promote at this point?

> Create the "relation map file" in the target database and copy the
> content from the source database. For this, we can use some modified
> versions of the write_relmap_file() and WAL-log the relmap create
> operation along with the file content. Now, read the relmap file to
> find the relfilenode for pg_class and then we read pg_class block by
> block and decode the tuples.

This doesn't seem like a great approach - you're not going to be able to use much of the normal infrastructure around processing tuples. So it seems like it'd end up with quite a bit of special case code that needs to be maintained in parallel.

> Now read the source relfilenode block by block using
> ReadBufferWithoutRelcache() and copy the block to the target
> relfilenode using smgrextend() and WAL-log them using log_newpage().
> For the source database, we can not directly use the smgrread(),
> because there could be some dirty buffers so we will have to read them
> through the buffer manager interface, otherwise, we will have to flush
> all the dirty buffers.

I think we might need a bit more batching for the WAL logging. There are cases of template databases considerably bigger than the default, and the overhead of logging each write separately seems likely to be noticeable.

Greetings,

Andres Freund
Hi,

On 2021-06-15 18:11:23 +0530, Dilip Kumar wrote:
> On Tue, Jun 15, 2021 at 5:34 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 15/06/2021 14:20, Dilip Kumar wrote:
> > > Design Idea:
> > > [...] Then
> > > we can get the relfilenode of every file we need to copy, and prepare
> > > a list of all such relfilenode.
> >
> > I guess that would work, but you could also walk the database directory
> > like copydir() does. How you find the relations to copy is orthogonal to
> > whether you WAL-log them or use checkpoints. And whether you use the
> > buffer cache is also orthogonal to the rest of the proposal; you could
> > issue FlushDatabaseBuffers() instead of a checkpoint.
>
> Yeah, that would also work, but I thought since we are already
> avoiding the checkpoint, let's avoid FlushDatabaseBuffers() also and
> directly use the lower level buffer manager API which doesn't need
> the relcache. And I am using pg_class to identify the useful relfilenode
> so that we can avoid processing some unwanted relfilenode but yeah I
> agree that this is orthogonal to whether we use checkpoint or not.

It's not entirely obvious to me that it's important to avoid FlushDatabaseBuffers() on its own. Forcing a checkpoint is problematic because it unnecessarily writes out dirty buffers in other databases, triggers FPWs etc. Normally a database used as a template won't have a meaningful amount of dirty buffers itself, so the FlushDatabaseBuffers() shouldn't trigger a lot of writes. Of course, there is the matter of FlushDatabaseBuffers() not being cheap with a large shared_buffers - but I suspect that's not a huge factor compared to the rest of the database creation cost.

I think the better argument for going through shared buffers is that it might be worth doing so for the *target* database. A common use of frequently creating databases, in particular with a non-default template database, is to run regression tests with pre-created schema / data - writing out all that data just to have it then dropped a few seconds later after the regression test completed is wasteful.

> > In principle, we could have both mechanisms, and use the new WAL-logged
> > system if the database is small, and the old system with checkpoints if
> > it's large. But I don't like the idea of having to maintain both.
>
> Yeah, I agree in some cases, where we don't have many dirty buffers,
> checkpointing can be faster.

I don't think the main issue is the speed of checkpointing itself? The reason to maintain the old paths is that the "new approach" is bloating WAL volume, no? Right now cloning a 1TB database costs a few hundred bytes of WAL and about 1TB of write IO. With the proposed approach, the write volume approximately doubles, because there'll also be about 1TB in WAL.

Greetings,

Andres Freund
On 6/15/21 3:31 PM, Andrew Dunstan wrote:
>
> On 6/15/21 8:04 AM, Heikki Linnakangas wrote:
>>
>> Yeah, WAL-logging the contents of the source database would certainly
>> be less weird than the current system. As Julien also pointed out, the
>> question is, are there people relying on "CREATE DATABASE foo TEMPLATE
>> bar" to copy a large source database, on the premise that it's fast
>> because it skips WAL-logging?
>
> I'm 100% certain there are. It's not even a niche case.
>
>> In principle, we could have both mechanisms, and use the new
>> WAL-logged system if the database is small, and the old system with
>> checkpoints if it's large. But I don't like the idea of having to
>> maintain both.
>
> Rather than use size, I'd be inclined to say use this if the source
> database is marked as a template, and use the copydir approach for
> anything that isn't.

I think we should be asking what the benefit of that use case is, and perhaps try addressing that without having to maintain two entirely different ways to do CREATE DATABASE. It's not like we're sure the current code is 100% reliable in various corner cases, and I doubt having two separate approaches will improve the situation :-/

I can see three reasons why people want to skip the WAL logging:

1) It's faster, because there's no CPU and I/O for building the WAL. I wonder if some optimization / batching could help with this, as suggested by Andres elsewhere in this thread.

2) It saves the amount of WAL (which could matter with large template databases and WAL archiving, etc.). We can't really do much about this - we need to log all the data. But the batching from (1) might help a bit too, I guess.

3) It saves the amount of WAL that needs to be copied to standbys, so that there's no increase of replication lag, etc., particularly when the network link has limited bandwidth. I think this is a more general issue - some operations may generate a lot of WAL, and we generally assume it's better to do that rather than hold exclusive locks for a long time. But maybe we could have some throttling, to limit the amount of WAL per second, similarly to what we have for plain vacuum (see the sketch below).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
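[To illustrate the kind of throttling suggested in point 3 above, here is a purely hypothetical sketch modeled on vacuum's cost-based delay; none of these names exist in PostgreSQL, and pg_usleep() is the only real function used.]

/*
 * Hypothetical WAL-rate throttle for the copy loop: once a made-up
 * per-interval byte budget is used up, nap, capping the WAL generation
 * rate much like vacuum_cost_delay caps vacuum I/O.
 */
static int64 wal_bytes_since_nap = 0;

static int	copydb_wal_limit_bytes = 8 * 1024 * 1024;	/* made-up GUC */
static int	copydb_nap_ms = 100;						/* made-up GUC */

static void
maybe_throttle_wal(int64 bytes_just_logged)
{
	wal_bytes_since_nap += bytes_just_logged;
	if (wal_bytes_since_nap >= copydb_wal_limit_bytes)
	{
		pg_usleep(copydb_nap_ms * 1000L);	/* let replication catch up */
		wal_bytes_since_nap = 0;
	}
}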
On Thu, Jun 17, 2021 at 3:28 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-06-15 16:50:24 +0530, Dilip Kumar wrote:
> > The patch modifies both CREATE DATABASE and ALTER DATABASE..SET
> > TABLESPACE to be fully WAL-logged.
>
> Generally quite a bit in favor of this - the current approach is very
> heavyweight, slow and I think we have a few open corner bugs related to
> it.

Great!

> > Design Idea:
> > -----------------
> > First, create the target database directory along with the version
> > file and WAL-log this operation.
>
> What happens if you crash / promote at this point?

I will check this.

> > Create the "relation map file" in the target database and copy the
> > content from the source database. For this, we can use some modified
> > versions of the write_relmap_file() and WAL-log the relmap create
> > operation along with the file content. Now, read the relmap file to
> > find the relfilenode for pg_class and then we read pg_class block by
> > block and decode the tuples.
>
> This doesn't seem like a great approach - you're not going to be able to
> use much of the normal infrastructure around processing tuples. So it
> seems like it'd end up with quite a bit of special case code that needs
> to be maintained in parallel.

Yeah, this needs some special-purpose code, but it is not too much code. I agree that instead of scanning pg_class we can scan all the tablespaces, under those identify the source database directory as we do now, and from there copy each relfilenode block by block with WAL logging. Honestly, both of these seem like special-purpose code. Another problem with directly scanning the directory is, how we are supposed to get the "relpersistence" which is stored in pg_class tuple right?

> > Now read the source relfilenode block by block using
> > ReadBufferWithoutRelcache() and copy the block to the target
> > relfilenode using smgrextend() and WAL-log them using log_newpage().
> > For the source database, we can not directly use the smgrread(),
> > because there could be some dirty buffers so we will have to read them
> > through the buffer manager interface, otherwise, we will have to flush
> > all the dirty buffers.
>
> I think we might need a bit more batching for the WAL logging. There are
> cases of template databases considerably bigger than the default, and the
> overhead of logging each write separately seems likely to be noticeable.

Yeah, we can do that: instead of using log_newpage() we can use log_newpages() to log multiple pages at once (see the sketch below).

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
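[A sketch of what that batching might look like. The loop shape and surrounding variables (dst, forkNum) are assumptions; log_newpages() is an existing function that emits one WAL record covering several full-page images.]

/*
 * Sketch (assumed code): accumulate pages and WAL-log them in batches
 * with log_newpages() instead of one log_newpage() call per page.
 */
#define COPY_BATCH_PAGES 16

static void
copy_fork_batched(RelFileNode dst, ForkNumber forkNum, BlockNumber nblocks)
{
	PGAlignedBlock copybuf[COPY_BATCH_PAGES];
	Page		pages[COPY_BATCH_PAGES];
	BlockNumber blknos[COPY_BATCH_PAGES];
	int			nbatch = 0;
	BlockNumber blkno;

	for (blkno = 0; blkno < nblocks; blkno++)
	{
		/* ... read source block blkno into copybuf[nbatch] as before ... */
		pages[nbatch] = (Page) copybuf[nbatch].data;
		blknos[nbatch] = blkno;

		if (++nbatch == COPY_BATCH_PAGES || blkno == nblocks - 1)
		{
			/* One WAL record covers the whole batch of full-page images. */
			log_newpages(&dst, forkNum, nbatch, blknos, pages, true);
			/* ... then smgrextend() each page of the batch ... */
			nbatch = 0;
		}
	}
}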
On Thu, Jun 17, 2021 at 3:43 AM Andres Freund <andres@anarazel.de> wrote:
> > Yeah, that would also work, but I thought since we are already
> > avoiding the checkpoint, let's avoid FlushDatabaseBuffers() also and
> > directly use the lower level buffer manager API which doesn't need
> > the relcache. And I am using pg_class to identify the useful relfilenode
> > so that we can avoid processing some unwanted relfilenode but yeah I
> > agree that this is orthogonal to whether we use checkpoint or not.
>
> It's not entirely obvious to me that it's important to avoid
> FlushDatabaseBuffers() on its own. Forcing a checkpoint is problematic because
> it unnecessarily writes out dirty buffers in other databases, triggers FPWs
> etc. Normally a database used as a template won't have a meaningful amount of
> dirty buffers itself, so the FlushDatabaseBuffers() shouldn't trigger a lot of
> writes. Of course, there is the matter of FlushDatabaseBuffers() not being
> cheap with a large shared_buffers - but I suspect that's not a huge factor
> compared to the rest of the database creation cost.

Okay, so if I think from that POV, then maybe we can just FlushDatabaseBuffers() and then directly use smgrread() calls.

> I think the better argument for going through shared buffers is that it might
> be worth doing so for the *target* database. A common use of frequently
> creating databases, in particular with a non-default template database, is to
> run regression tests with pre-created schema / data - writing out all that data
> just to have it then dropped a few seconds later after the regression test
> completed is wasteful.

Okay, I am not sure how common this use case is, but for this use case it makes sense to use the bufmgr for the target database.

> > > In principle, we could have both mechanisms, and use the new WAL-logged
> > > system if the database is small, and the old system with checkpoints if
> > > it's large. But I don't like the idea of having to maintain both.
> >
> > Yeah, I agree in some cases, where we don't have many dirty buffers,
> > checkpointing can be faster.
>
> I don't think the main issue is the speed of checkpointing itself? The reason
> to maintain the old paths is that the "new approach" is bloating WAL volume,
> no? Right now cloning a 1TB database costs a few hundred bytes of WAL and about
> 1TB of write IO. With the proposed approach, the write volume approximately
> doubles, because there'll also be about 1TB in WAL.

Makes sense.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 17/06/2021 08:45, Dilip Kumar wrote:
> Another problem with directly scanning the directory is, how we
> are supposed to get the "relpersistence" which is stored in pg_class
> tuple right?

You only need relpersistence if you want to use the buffer cache, right? I think that's a good argument for not using it.

- Heikki
On Thu, Jun 17, 2021 at 2:50 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 17/06/2021 08:45, Dilip Kumar wrote:
> > Another problem with directly scanning the directory is, how we
> > are supposed to get the "relpersistence" which is stored in pg_class
> > tuple right?
>
> You only need relpersistence if you want to use the buffer cache, right?
> I think that's a good argument for not using it.

Yeah, that is one place; another place I am using it is to decide whether to WAL-log the new page while writing into the target relfilenode: if it is an unlogged relation then I am not WAL-logging. But now I think that is not the right idea; while creating the database we should WAL-log all the pages irrespective of whether the table is logged or unlogged.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jun 16, 2021 at 6:13 PM Andres Freund <andres@anarazel.de> wrote:
> I don't think the main issue is the speed of checkpointing itself? The reason
> to maintain the old paths is that the "new approach" is bloating WAL volume,
> no? Right now cloning a 1TB database costs a few hundred bytes of WAL and about
> 1TB of write IO. With the proposed approach, the write volume approximately
> doubles, because there'll also be about 1TB in WAL.

This is a good point, but on the other hand, I think this smells a lot like the wal_level=minimal optimization where we don't need to log data being bulk-loaded into a table created in the same transaction if wal_level=minimal. In theory, that optimization has a lot of value, but in practice it gets a lot of bad press on this list, because (1) sometimes doing the fsync is more expensive than writing the extra WAL would have been and (2) most people want to run with wal_level=replica/logical, so it ends up being a code path that isn't used much and is therefore more likely than average to have bugs nobody's terribly interested in fixing (except Noah ... thanks Noah!).

If we add features in the future, like TDE or perhaps incremental backup, that rely on new pages getting new LSNs instead of recycled ones, this may turn into the same kind of wart. And as with that optimization, you're probably not even better off unless the database is pretty big, and you might be worse off if you have to do fsyncs or flush buffers synchronously.

I'm not severely opposed to keeping both methods around, so if that's really what people want to do, OK, but I guess I wonder whether we're really going to be happy with that decision down the road.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jun 17, 2021 at 5:20 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> You only need relpersistence if you want to use the buffer cache, right?
> I think that's a good argument for not using it.

I think the root of the problem with this feature is that it doesn't go through shared_buffers, so in my opinion, it would be better if we can make it all go through shared_buffers. It seems like you're advocating a middle ground where half of the operation goes through shared_buffers and the other half doesn't, but that sounds like getting rid of half of the hack when we could have gotten rid of all of it. I think things that don't go through shared_buffers are bad, and we should be making an effort to get rid of them where we can reasonably do so. I believe I've both introduced and fixed my share of bugs that were caused by such cases, and I think the behavior of the whole system would be a lot easier to reason about if we had fewer of those, or none.

I can also think of at least one significant advantage of driving this off the remote database's pg_class rather than the filesystem contents. It's a known defect of PostgreSQL that if you create a table and then crash, you leave behind a dead file that never gets removed. If you now copy the database that contains that orphaned file, you would ideally prefer not to copy that file, but if you do a copy based on the filesystem contents, then you will. If you drive the copy off of pg_class, you won't.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2021-06-17 13:53:38 -0400, Robert Haas wrote:
> On Thu, Jun 17, 2021 at 5:20 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > You only need relpersistence if you want to use the buffer cache, right?
> > I think that's a good argument for not using it.

Do we really need pg_class to figure this out? Can't we just check if the relation has an init fork?

> I can also think of at least one significant advantage of driving this
> off the remote database's pg_class rather than the filesystem
> contents. It's a known defect of PostgreSQL that if you create a table
> and then crash, you leave behind a dead file that never gets removed.
> If you now copy the database that contains that orphaned file, you
> would ideally prefer not to copy that file, but if you do a copy based
> on the filesystem contents, then you will. If you drive the copy off
> of pg_class, you won't.

I'm very unconvinced this is the place to tackle the issue of orphan relfilenodes. It'd be one thing if it were doable by existing code, e.g. because we supported cross-database relation accesses fully, but we don't. Adding a hacky special case implementation for cross-database relation accesses that violates all kinds of assumptions (like holding a lock on a relation when accessing it / pinning pages, processing relcache invals, ...) doesn't seem like a good plan.

I don't think this is an academic concern: You need to read from shared buffers to read the "remote" pg_class, otherwise you'll potentially miss changes. But it's not correct to read in pages or to pin pages without holding a lock, and there's code that relies on that (see e.g. InvalidateBuffer()).

Greetings,

Andres Freund
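[For reference, the init-fork check suggested above needs only existing smgr calls; a minimal sketch, with the surrounding loop and the rnode variable assumed:]

/* Sketch: infer relpersistence from the presence of an init fork. */
static char
relpersistence_from_initfork(RelFileNode rnode)
{
	SMgrRelation reln = smgropen(rnode, InvalidBackendId);

	return smgrexists(reln, INIT_FORKNUM) ?
		RELPERSISTENCE_UNLOGGED : RELPERSISTENCE_PERMANENT;
}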
On Thu, Jun 17, 2021 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:
> Adding a hacky special case implementation for cross-database relation
> accesses that violates all kinds of assumptions (like holding a lock on
> a relation when accessing it / pinning pages, processing relcache
> invals, ...) doesn't seem like a good plan.

I agree that we don't want hacky code that violates assumptions, but bypassing shared_buffers is a bit hacky, too. Can't we lock the relations as we're copying them? We know pg_class's OID a fortiori, and we can find out all the other OIDs as we go. I'm just thinking that the hackiness of going around shared_buffers feels irreducible, but maybe the hackiness in the patch is something that can be solved with more engineering.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2021-06-17 14:22:52 -0400, Robert Haas wrote:
> On Thu, Jun 17, 2021 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:
> > Adding a hacky special case implementation for cross-database relation
> > accesses that violates all kinds of assumptions (like holding a lock on
> > a relation when accessing it / pinning pages, processing relcache
> > invals, ...) doesn't seem like a good plan.
>
> I agree that we don't want hacky code that violates assumptions, but
> bypassing shared_buffers is a bit hacky, too. Can't we lock the
> relations as we're copying them? We know pg_class's OID a fortiori,
> and we can find out all the other OIDs as we go.

We possibly can - but I'm not sure that won't end up violating some other assumptions.

> I'm just thinking that the hackiness of going around shared_buffers
> feels irreducible, but maybe the hackiness in the patch is something
> that can be solved with more engineering.

Which bypassing of shared buffers are you talking about here? We'd still have to solve a subset of the issues around locking (at least on the source side), but I don't think we need to read pg_class contents to be able to go through shared_buffers? As I suggested, we can use the init fork presence to infer relpersistence? Or do you mean that looking at the filesystem at all is bypassing shared buffers?

Greetings,

Andres Freund
On Thu, Jun 17, 2021 at 2:48 PM Andres Freund <andres@anarazel.de> wrote:
> Or do you mean that looking at the filesystem at all is bypassing shared
> buffers?

This is what I mean. I think we will end up in a better spot if we can avoid doing that without creating too much ugliness elsewhere.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Jun 18, 2021 at 12:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jun 17, 2021 at 2:48 PM Andres Freund <andres@anarazel.de> wrote:
> > Or do you mean that looking at the filesystem at all is bypassing shared
> > buffers?
>
> This is what I mean. I think we will end up in a better spot if we can
> avoid doing that without creating too much ugliness elsewhere.

The patch was not getting applied on head, so I have rebased it; along with that, I have now used the bufmgr layer for writing/logging destination pages as well, instead of directly using the smgr layer.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 6, 2021 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jun 18, 2021 at 12:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Thu, Jun 17, 2021 at 2:48 PM Andres Freund <andres@anarazel.de> wrote:
> > > Or do you mean that looking at the filesystem at all is bypassing shared
> > > buffers?
> >
> > This is what I mean. I think we will end up in a better spot if we can
> > avoid doing that without creating too much ugliness elsewhere.
>
> The patch was not getting applied on head, so I have rebased it; along
> with that, I have now used the bufmgr layer for writing/logging
> destination pages as well, instead of directly using the smgr layer.

I have done further cleanup of the patch and also divided it into 3 patches.

0001 - Currently, write_relmap_file and load_relmap_file are tightly coupled with shared_map and local_map. As part of the higher-level patch set we need relmap read/write interfaces that are not dependent upon shared_map and local_map, and we should be able to pass the map memory as an external parameter instead.

0002- Support new interfaces in the relmapper: 1) support copying the relmap file from one database path to the other database path, and 2) like RelationMapOidToFilenode, provide another interface which does the same thing, but instead of getting it for the database we are connected to, it will get it for the input database path. These interfaces are required by the next patch for supporting the WAL-logged created database. (Hypothetical prototypes for these interfaces are sketched below.)

0003- The main patch for WAL logging the created database operation.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
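[The 0002 interfaces described above might look roughly like the following. These prototypes are inferred from the description and are hypothetical; the actual patch may use different names and signatures.]

/* Copy the relmap file from a source database path to a destination path. */
extern void CopyRelationMap(Oid dbid, Oid tsid,
							const char *srcdbpath, const char *dstdbpath);

/*
 * Like RelationMapOidToFilenode(), but for the database at the given path
 * rather than the one we are connected to.
 */
extern Oid	RelationMapOidToFilenodeForDatabase(char *dbpath, Oid relationId);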
On Thu, Sep 2, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 0003- The main patch for WAL logging the created database operation.

Andres pointed out that this approach ends up accessing relations without taking a lock on them. It doesn't look like you did anything about that.

+ /* Built-in oids are mapped directly */
+ if (classForm->oid < FirstGenbkiObjectId)
+ relfilenode = classForm->oid;
+ else if (OidIsValid(classForm->relfilenode))
+ relfilenode = classForm->relfilenode;
+ else
+ continue;

Am I missing something, or is this totally busted?

[rhaas pgsql]$ createdb
[rhaas pgsql]$ psql
psql (15devel)
Type "help" for help.

rhaas=# select oid::regclass from pg_class where relfilenode not in (0, oid) and oid < 10000;
 oid
-----
(0 rows)

rhaas=# vacuum full pg_attrdef;
VACUUM
rhaas=# select oid::regclass from pg_class where relfilenode not in (0, oid) and oid < 10000;
              oid
--------------------------------
 pg_attrdef_adrelid_adnum_index
 pg_attrdef_oid_index
 pg_toast.pg_toast_2604
 pg_toast.pg_toast_2604_index
 pg_attrdef
(5 rows)

/*
+ * Now drop all buffers holding data of the target database; they should
+ * no longer be dirty so DropDatabaseBuffers is safe.

The way things worked before, this was true, but now AFAICS it's false. I'm not sure whether that means that DropDatabaseBuffers() here is actually unsafe or whether it just means that you haven't updated the comment to explain the reason.

+ * Since we copy the file directly without looking at the shared buffers,
+ * we'd better first flush out any pages of the source relation that are
+ * in shared buffers. We assume no new changes will be made while we are
+ * holding exclusive lock on the rel.

Ditto.

+ /* As always, WAL must hit the disk before the data update does. */

Actually, the way it's coded now, part of the on-disk changes are done before WAL is issued, and part are done after. I doubt that's the right idea. There's nothing special about writing the actual payload bytes vs. the other on-disk changes (creating directories and files). In any case the ordering deserves a better-considered comment than this one.

+ XLogRegisterData((char *) PG_MAJORVERSION, nbytes);

Surely this is utterly pointless.

+ CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);

 PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
 PointerGetDatum(&fparms));

I'd leave braces around the code for which we're ensuring error cleanup even if it's just one line.

+ if (info == XLOG_DBASEDIR_CREATE)
 {
 xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);

Seems odd to rename the record but not change the name of the struct. I think I would be inclined to keep the existing record name, but if we're going to change it we should be more thorough.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Jun 18, 2021 at 12:18 AM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2021-06-17 14:22:52 -0400, Robert Haas wrote:
> > On Thu, Jun 17, 2021 at 2:17 PM Andres Freund <andres@anarazel.de> wrote:
> > > Adding a hacky special case implementation for cross-database relation
> > > accesses that violates all kinds of assumptions (like holding a lock on
> > > a relation when accessing it / pinning pages, processing relcache
> > > invals, ...) doesn't seem like a good plan.
> >
> > I agree that we don't want hacky code that violates assumptions, but
> > bypassing shared_buffers is a bit hacky, too. Can't we lock the
> > relations as we're copying them? We know pg_class's OID a fortiori,
> > and we can find out all the other OIDs as we go.
>
> We possibly can - but I'm not sure that won't end up violating some
> other assumptions.
Yeah, we can surely lock the relation as described by Robert, but IMHO, while creating the database we are already holding the exclusive lock on the database and there is no one else allowed to be connected to the database, so do we actually need to bother about the lock for the correctness?
> > I'm just thinking that the hackiness of going around shared_buffers
> > feels irreducible, but maybe the hackiness in the patch is something
> > that can be solved with more engineering.
>
> Which bypassing of shared buffers are you talking about here? We'd still
> have to solve a subset of the issues around locking (at least on the
> source side), but I don't think we need to read pg_class contents to be
> able to go through shared_buffers? As I suggested, we can use the init
> fork presence to infer relpersistence?
I believe we want to avoid scanning pg_class for identifying the relation list so that we can avoid this special-purpose code? IMHO, scanning the disk, such as going through all the tablespaces and then finding the source database directory and identifying each relfilenode, also appears to be a special-purpose code, unless I am missing what you mean by special-purpose code.
> Or do you mean that looking at the filesystem at all is bypassing shared
> buffers?
I think we already have such a code in multiple places where we bypass the shared buffers for copying the relation e.g. index_copy_data(), heapam_relation_copy_data(). So in the current system as it stands, we cannot claim that we are designing something for the first time that bypasses the shared buffers. But if we think that bypassing the shared buffers is hackish and we don't want to use it for the new patches, then let's avoid it completely, even while identifying the relfilenodes to be copied.
On Thu, Sep 2, 2021 at 8:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 2, 2021 at 2:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > 0003- The main patch for WAL logging the created database operation.
>
> Andres pointed out that this approach ends up accessing relations
> without taking a lock on them. It doesn't look like you did anything
> about that.
I missed that, I have shared my opinion about this in my last email [1]
> + /* Built-in oids are mapped directly */
> + if (classForm->oid < FirstGenbkiObjectId)
> + relfilenode = classForm->oid;
> + else if (OidIsValid(classForm->relfilenode))
> + relfilenode = classForm->relfilenode;
> + else
> + continue;
>
> Am I missing something, or is this totally busted?
Oops, I think the condition should be like below, but I will think carefully before posting the next version if there is something else I am missing.
if (OidIsValid(classForm->relfilenode))
    relfilenode = classForm->relfilenode;
else if (classForm->oid < FirstGenbkiObjectId)
    relfilenode = classForm->oid;
else
    continue;
> /*
> + * Now drop all buffers holding data of the target database; they should
> + * no longer be dirty so DropDatabaseBuffers is safe.
>
> The way things worked before, this was true, but now AFAICS it's
> false. I'm not sure whether that means that DropDatabaseBuffers() here
> is actually unsafe or whether it just means that you haven't updated
> the comment to explain the reason.
I think DropDatabaseBuffers() itself is unsafe; we just copied pages using the bufmgr and dirtied the buffers, so dropping the buffers is definitely unsafe here.
> + * Since we copy the file directly without looking at the shared buffers,
> + * we'd better first flush out any pages of the source relation that are
> + * in shared buffers. We assume no new changes will be made while we are
> + * holding exclusive lock on the rel.
>
> Ditto.
Yeah this comment no longer makes sense now.
> + /* As always, WAL must hit the disk before the data update does. */
>
> Actually, the way it's coded now, part of the on-disk changes are done
> before WAL is issued, and part are done after. I doubt that's the
> right idea. There's nothing special about writing the actual payload
> bytes vs. the other on-disk changes (creating directories and files).
> In any case the ordering deserves a better-considered comment than
> this one.
Agreed to all, but in general, I think WAL hitting the disk before data is more applicable for the shared buffers, where we want to flush the WAL first before writing the shared buffer so that in case of a torn page we have an option to recover the page from the previous FPI. But in such cases where we are creating a directory or file there is no such requirement. Anyways, I agree with the comments that it should be more uniform and the comment should be correct.
> + XLogRegisterData((char *) PG_MAJORVERSION, nbytes);
>
> Surely this is utterly pointless.
Yeah it is. During the WAL replay also we know the PG_MAJORVERSION :)
> + CopyDatabase(src_dboid, dboid, src_deftablespace, dst_deftablespace);
>
>  PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback,
>  PointerGetDatum(&fparms));
>
> I'd leave braces around the code for which we're ensuring error
> cleanup even if it's just one line.
Okay
> + if (info == XLOG_DBASEDIR_CREATE)
>  {
>  xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
>
> Seems odd to rename the record but not change the name of the struct.
> I think I would be inclined to keep the existing record name, but if
> we're going to change it we should be more thorough.
Right, I think we can leave the record name as it is.
On Fri, Sep 3, 2021 at 6:23 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> + /* Built-in oids are mapped directly */
>> + if (classForm->oid < FirstGenbkiObjectId)
>> + relfilenode = classForm->oid;
>> + else if (OidIsValid(classForm->relfilenode))
>> + relfilenode = classForm->relfilenode;
>> + else
>> + continue;
>>
>> Am I missing something, or is this totally busted?
>
> Oops, I think the condition should be like below, but I will think carefully
> before posting the next version if there is something else I am missing.
>
> if (OidIsValid(classForm->relfilenode))
>     relfilenode = classForm->relfilenode;
> else if (classForm->oid < FirstGenbkiObjectId)
>     relfilenode = classForm->oid;
> else
>     continue;

What about mapped rels that have been rewritten at some point?

> Agreed to all, but in general, I think WAL hitting the disk before data is
> more applicable for the shared buffers, where we want to flush the WAL first
> before writing the shared buffer so that in case of a torn page we have an
> option to recover the page from the previous FPI. But in such cases where we
> are creating a directory or file there is no such requirement. Anyways, I
> agree with the comments that it should be more uniform and the comment
> should be correct.

There have been previous debates about whether WAL records for filesystem operations should be issued before or after those operations are performed. I'm not sure how easy those discussions are to find in the archives, but they are very relevant here. I think the short version is - if we write a WAL record first and then the operation fails afterward, we have to PANIC. But if we perform the operation first and then write the WAL record if it succeeds, then we could crash before writing WAL and end up out of sync with our standbys. If we then later do any WAL-logged operation locally that depends on that operation having been performed, replay will fail on the standby.

There used to be, or maybe still are, comments in the code defending the latter approach, but more recently it's been strongly criticized. The thinking, AIUI, is basically that filesystem operations really ought not to fail, because nobody should be doing weird things to the data directory, and if they do, panicking is OK. But having replay fail in strange ways on the standby later is not OK. I'm not sure if everyone agrees with that logic; it seems somewhat debatable. I *think* I personally agree with it but ... I'm not even 100% sure about that.

--
Robert Haas
EDB: http://www.enterprisedb.com
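[To spell out the gap Robert points at: mapped relations store 0 in pg_class.relfilenode, and after a rewrite such as VACUUM FULL their current relfilenode is recorded only in the relmap file, so the OID-based fallback returns a stale value. A hedged sketch of a lookup that accounts for this, assuming the hypothetical relmap-by-path interface from the 0002 patch description and an srcdbpath variable from the surrounding code; RELKIND_HAS_STORAGE() is an existing macro.]

/* Sketch (assumed code): resolve the relfilenode, handling mapped rels. */
if (OidIsValid(classForm->relfilenode))
	relfilenode = classForm->relfilenode;	/* normal relation */
else if (RELKIND_HAS_STORAGE(classForm->relkind))
	/* mapped relation: the current relfilenode lives in the relmap file */
	relfilenode = RelationMapOidToFilenodeForDatabase(srcdbpath,
													  classForm->oid);
else
	continue;	/* no storage (views etc.), nothing to copy */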
Hi,

On 2021-09-03 14:25:10 +0530, Dilip Kumar wrote:
> Yeah, we can surely lock the relation as described by Robert, but IMHO,
> while creating the database we are already holding the exclusive lock on
> the database and there is no one else allowed to be connected to the
> database, so do we actually need to bother about the lock for the
> correctness?

The problem is that checkpointer, bgwriter, buffer reclaim don't care about the database of the buffer they're working on... The exclusive lock on the database doesn't change anything about that. Perhaps you can justify it's safe because there can't be any dirty buffers or such though.

> I think we already have such a code in multiple places where we bypass the
> shared buffers for copying the relation
> e.g. index_copy_data(), heapam_relation_copy_data().

That's not at all comparable. We hold an exclusive lock on the relation at that point, and we don't have a separate implementation of reading tuples from the table or something like that.

Greetings,

Andres Freund
On Sat, Sep 4, 2021 at 3:24 AM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2021-09-03 14:25:10 +0530, Dilip Kumar wrote:
> > Yeah, we can surely lock the relation as described by Robert, but IMHO,
> > while creating the database we are already holding the exclusive lock on
> > the database and there is no one else allowed to be connected to the
> > database, so do we actually need to bother about the lock for the
> > correctness?
>
> The problem is that checkpointer, bgwriter, buffer reclaim don't care about
> the database of the buffer they're working on... The exclusive lock on the
> database doesn't change anything about that.
But these directly operate on the buffers, and in my patch, whether we are reading the pg_class for identifying the relfilenode or we are copying the relation block by block, we are always holding the lock on the buffer.
> Perhaps you can justify it's safe
> because there can't be any dirty buffers or such though.
> > I think we already have such a code in multiple places where we bypass the
> > shared buffers for copying the relation
> > e.g. index_copy_data(), heapam_relation_copy_data().
> That's not at all comparable. We hold an exclusive lock on the relation at
> that point, and we don't have a separate implementation of reading tuples from
> the table or something like that.
Okay, but my example was against the point Robert raised that he feels that bypassing the shared buffer anywhere is hackish. But yeah, I agree his point might be that even if we are using it in existing code we can not justify it.
For moving forward, I think the main open concerns we have as of now are:

1. The special-purpose code for scanning pg_class. We could avoid it by scanning the source database directory instead; I think Robert doesn't like that approach because we would be scanning the directory directly and bypassing the shared buffers. But this is not any worse than what we have now, right? I mean, now also we are scanning the directory directly; the only change will be that instead of copying files directly we will read each file and copy it block by block.

2. Another problem is that while copying the relation we are accessing the relation buffers without holding the relation lock, though we are holding the buffer lock. I am not sure whether we really have a problem here w.r.t. the checkpointer and bgwriter, but if we do, we can also construct the lock tag and acquire the relation lock.
3. While copying the relation whether to use the bufmgr or directly use the smgr?
If we use the bufmgr then maybe we can avoid flushing some of the buffers to the disk and save some I/O but in general we copy from the template database so there might not be a lot of dirty buffers and we might not save anything. OTOH, if we directly use the smgr for copying the relation data, we can reuse some existing code (RelationCopyStorage()) and the patch will be simpler. Other than code simplicity and I/O, there is also Robert's concern about bypassing the bufmgr, which applies to point #1 as well as #3.
Thoughts?
On 2021-09-05 14:22:51 +0530, Dilip Kumar wrote:
> On Sat, Sep 4, 2021 at 3:24 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2021-09-03 14:25:10 +0530, Dilip Kumar wrote:
> > > Yeah, we can surely lock the relation as described by Robert, but IMHO,
> > > while creating the database we are already holding the exclusive lock on
> > > the database and there is no one else allowed to be connected to the
> > > database, so do we actually need to bother about the lock for the
> > > correctness?
> >
> > The problem is that checkpointer, bgwriter, buffer reclaim don't care about
> > the database of the buffer they're working on... The exclusive lock on the
> > database doesn't change anything about that.
>
> But these directly operate on the buffers, and in my patch, whether we are
> reading the pg_class for identifying the relfilenode or we are copying the
> relation block by block, we are always holding the lock on the buffer.

I don't think a buffer lock is really sufficient. See e.g. code like:

static void
InvalidateBuffer(BufferDesc *buf)
{
...
	/*
	 * We assume the only reason for it to be pinned is that someone else is
	 * flushing the page out. Wait for them to finish. (This could be an
	 * infinite loop if the refcount is messed up... it would be nice to time
	 * out after awhile, but there seems no way to be sure how many loops may
	 * be needed. Note that if the other guy has pinned the buffer but not
	 * yet done StartBufferIO, WaitIO will fall through and we'll effectively
	 * be busy-looping here.)
	 */
	if (BUF_STATE_GET_REFCOUNT(buf_state) != 0)
	{
		UnlockBufHdr(buf, buf_state);
		LWLockRelease(oldPartitionLock);

		/* safety check: should definitely not be our *own* pin */
		if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0)
			elog(ERROR, "buffer is pinned in InvalidateBuffer");
		WaitIO(buf);
		goto retry;
	}

IOW, currently we assume that you're only allowed to pin a block in a relation while you hold a lock on the relation. It might be a good idea to change that, but it's not as trivial as one might think - consider e.g. dropping a relation's buffers while holding an exclusive lock: If there's potential concurrent reads of that buffer we'd be in trouble.

> 3. While copying the relation whether to use the bufmgr or directly use the
> smgr?
>
> If we use the bufmgr then maybe we can avoid flushing some of the buffers
> to the disk and save some I/O but in general we copy from the template
> database so there might not be a lot of dirty buffers and we might not save
> anything

I would assume the big benefit would be that the *target* database does not have to be written out / shared buffer is immediately populated.

Greetings,

Andres Freund
On Mon, Sep 6, 2021 at 1:58 AM Andres Freund <andres@anarazel.de> wrote:
> On 2021-09-05 14:22:51 +0530, Dilip Kumar wrote:
> > But these directly operate on the buffers, and in my patch, whether we are
> > reading the pg_class for identifying the relfilenode or we are copying the
> > relation block by block, we are always holding the lock on the buffer.
>
> I don't think a buffer lock is really sufficient. See e.g. code like:
I agree that the buffer lock alone is not sufficient, but here we are talking about the case where we are already holding the exclusive lock on the database plus the buffer lock. So cases like the one below, which should be called only from a relation drop, must be protected by the database exclusive lock, and the other examples like buffer reclaim/checkpointer should be protected by the buffer pin + lock. Having said that, I am not against acquiring the relation lock in our case. I agree that if there is an assumption that to hold a buffer pin we must be holding the relation lock, then it's better not to break that.
> static void
> InvalidateBuffer(BufferDesc *buf)
> {
> ...
> 	/*
> 	 * We assume the only reason for it to be pinned is that someone else is
> 	 * flushing the page out. Wait for them to finish. (This could be an
> 	 * infinite loop if the refcount is messed up... it would be nice to time
> 	 * out after awhile, but there seems no way to be sure how many loops may
> 	 * be needed. Note that if the other guy has pinned the buffer but not
> 	 * yet done StartBufferIO, WaitIO will fall through and we'll effectively
> 	 * be busy-looping here.)
> 	 */
> 	if (BUF_STATE_GET_REFCOUNT(buf_state) != 0)
> 	{
> 		UnlockBufHdr(buf, buf_state);
> 		LWLockRelease(oldPartitionLock);
>
> 		/* safety check: should definitely not be our *own* pin */
> 		if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0)
> 			elog(ERROR, "buffer is pinned in InvalidateBuffer");
> 		WaitIO(buf);
> 		goto retry;
> 	}
>
> IOW, currently we assume that you're only allowed to pin a block in a relation
> while you hold a lock on the relation. It might be a good idea to change that,
> but it's not as trivial as one might think - consider e.g. dropping a
> relation's buffers while holding an exclusive lock: If there's potential
> concurrent reads of that buffer we'd be in trouble.
>
> > 3. While copying the relation whether to use the bufmgr or directly use the
> > smgr?
> >
> > If we use the bufmgr then maybe we can avoid flushing some of the buffers
> > to the disk and save some I/O but in general we copy from the template
> > database so there might not be a lot of dirty buffers and we might not save
> > anything
>
> I would assume the big benefit would be that the *target* database does not
> have to be written out / shared buffer is immediately populated.
Okay, that makes sense. In fact, for using the shared buffers for the destination database's relations we don't even have the locking issue, because that database is not yet accessible to anyone, right?
On Mon, Sep 6, 2021 at 11:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Okay, that makes sense. In fact, for using the shared buffers for the
> destination database's relations we don't even have the locking issue,
> because that database is not yet accessible to anyone, right?
Based on all these discussions, I am planning to change the design as below:

- FlushDatabaseBuffers().

- Scan the database directory under each tablespace and prepare a tablespace-wise relfilenode list; along with this we will remember the persistence level as well, based on the presence of the init fork.

- Next, copy each relfilenode to the destination: while copying, for the source relation directly use smgrread(), whereas for the destination relation use the bufmgr (see the sketch below). The main reasons for not using the bufmgr for the source relations are a) we can avoid acquiring a special-purpose lock on the relation, and b) we are copying from the template database, so in most cases there might not be many dirty buffers for that database and there is no real need to use the shared buffers.
Any objections to the above design?
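A minimal sketch of that copy loop, under the assumptions above (the source read via smgrread() is only valid after FlushDatabaseBuffers() has made the on-disk copy current; the destination relfilenode is assumed to have been smgrcreate()'d already; all names other than the smgr/bufmgr APIs are placeholders):

    #include "postgres.h"
    #include "miscadmin.h"
    #include "access/xloginsert.h"
    #include "storage/bufmgr.h"
    #include "storage/smgr.h"

    static void
    copy_fork_sketch(SMgrRelation src, RelFileNode dstnode, ForkNumber forkNum)
    {
        BlockNumber nblocks = smgrnblocks(src, forkNum);
        PGAlignedBlock srcbuf;
        BlockNumber blkno;

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            Buffer      dstbuf;

            /* read the source block directly from storage */
            smgrread(src, forkNum, blkno, srcbuf.data);

            /* allocate the next destination block in shared buffers */
            dstbuf = ReadBufferWithoutRelcache(dstnode, forkNum, P_NEW,
                                               RBM_NORMAL, NULL);
            LockBuffer(dstbuf, BUFFER_LOCK_EXCLUSIVE);

            START_CRIT_SECTION();
            memcpy(BufferGetPage(dstbuf), srcbuf.data, BLCKSZ);
            MarkBufferDirty(dstbuf);
            log_newpage_buffer(dstbuf, true);   /* WAL-log the full page image */
            END_CRIT_SECTION();

            UnlockReleaseBuffer(dstbuf);
        }
    }

Because the destination page goes through the buffer manager, it lands in shared buffers already populated and is WAL-logged as a full-page image, which is the benefit noted above.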
On Fri, Sep 3, 2021 at 5:54 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we already have such a code in multiple places where we bypass the
> > shared buffers for copying the relation
> > e.g. index_copy_data(), heapam_relation_copy_data().
>
> That's not at all comparable. We hold an exclusive lock on the relation at
> that point, and we don't have a separate implementation of reading tuples from
> the table or something like that.

I don't think there's a way to do this that is perfectly clean, so the discussion here is really about finding the least unpleasant alternative. I *really* like the idea of using pg_class to figure out what relations to copy. As far as I'm concerned, pg_class is the canonical list of what's in the database, and to the extent that the filesystem happens to agree, that's good luck. From that perspective, using the filesystem to figure out what to copy is by definition a hack.

Now, having to use dedicated tuple-reading code is also a hack, but to me that's largely an accident of questionable design decisions elsewhere. You can't read a buffer with just the minimal amount of information that you need to read a buffer; you have to have a relcache entry, so we have things like ReadBufferWithoutRelcache and CreateFakeRelcacheEntry. It's a little crazy to me that someone saw that ReadBuffer() needed a thing which some callers might not have and instead of saying "hmm, maybe we ought to change the arguments so that anyone with enough information to call this function can do so," they said "hmm, let's create a fake object that is not really the same as a real one but good enough to fool the function into doing the right thing, probably." I think the code layering here is just flat-out broken and ought to be fixed. A layer whose job it is to read and write blocks should not know that relations are even a thing. (The widespread use of global variables in the relcache code, the catcache code, and many other places in lieu of explicit parameter-passing just makes everything a lot worse.) So I think if we commit to the hackiness of the sort that this patch introduces, there is some hope of things getting better in the future. I don't think it's a real easy path forward, but maybe it's possible.

If on the other hand we commit to using the filesystem, I don't see how it ever gets any better. Unlogged tables are a great example of a feature that depended on the filesystem, and it now seems to me to be - by far - the worst thing about that feature. I have no idea how to get rid of that dependency or all of the associated problems without reverting the feature. But in this case, we seem to have another option, and so I think we should take it. Your (or other people's) mileage may vary ... this is just my view of it.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Sep 8, 2021 at 9:54 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Sep 3, 2021 at 5:54 PM Andres Freund <andres@anarazel.de> wrote:
> > I think we already have such a code in multiple places where we bypass the
> > shared buffers for copying the relation
> > e.g. index_copy_data(), heapam_relation_copy_data().
>
> That's not at all comparable. We hold an exclusive lock on the relation at
> that point, and we don't have a separate implementation of reading tuples from
> the table or something like that.
I don't think there's a way to do this that is perfectly clean, so the
discussion here is really about finding the least unpleasant
alternative. I *really* like the idea of using pg_class to figure out
what relations to copy. As far as I'm concerned, pg_class is the
canonical list of what's in the database, and to the extent that the
filesystem happens to agree, that's good luck. From that perspective,
using the filesystem to figure out what to copy is by definition a
hack.
I agree with you. Even though I think that scanning pg_class to identify the relfilenodes is a more sensible thing to do than directly scanning the file system, we need to consider one point: in the current system (in CREATE DATABASE) we already scan the directory for copying the files, so instead of copying them directly we would logically identify each relfilenode and then copy it block by block. So maybe this approach will not make anyone unhappy, because it is not any worse than the current system.
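To make the pg_class-scanning idea concrete, here is a rough sketch, assuming pg_class's relfilenode has already been obtained from the relmap. ReadBufferWithoutRelcache(), HeapTupleSatisfiesVisibility(), and GETSTRUCT() are existing APIs; everything else is a placeholder, and mapped relations (relfilenode = 0, handled via the relmapper) are simply skipped:

    #include "postgres.h"
    #include "access/heapam.h"
    #include "access/htup_details.h"
    #include "catalog/pg_class.h"
    #include "nodes/pg_list.h"
    #include "storage/bufmgr.h"
    #include "storage/smgr.h"
    #include "utils/snapshot.h"

    static List *
    collect_relfilenodes_sketch(RelFileNode rnode, Snapshot snapshot)
    {
        List       *result = NIL;
        SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
        BlockNumber nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
        BlockNumber blkno;

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            Buffer      buf;
            Page        page;
            OffsetNumber offnum;
            OffsetNumber maxoff;

            buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
                                            RBM_NORMAL, NULL);
            LockBuffer(buf, BUFFER_LOCK_SHARE);
            page = BufferGetPage(buf);

            if (PageIsNew(page) || PageIsEmpty(page))
            {
                UnlockReleaseBuffer(buf);
                continue;
            }

            maxoff = PageGetMaxOffsetNumber(page);
            for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
            {
                ItemId      itemid = PageGetItemId(page, offnum);
                HeapTupleData tuple;
                Form_pg_class classForm;

                if (!ItemIdIsNormal(itemid))
                    continue;

                tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
                tuple.t_len = ItemIdGetLength(itemid);
                tuple.t_tableOid = RelationRelationId;
                ItemPointerSet(&tuple.t_self, blkno, offnum);

                /* CLOG is cluster-wide, so visibility works across databases */
                if (!HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
                    continue;

                /* pg_class has the same layout in every database */
                classForm = (Form_pg_class) GETSTRUCT(&tuple);
                if (OidIsValid(classForm->relfilenode))
                    result = lappend_oid(result, classForm->relfilenode);
            }

            UnlockReleaseBuffer(buf);
        }

        return result;
    }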
On Sat, Sep 11, 2021 at 12:17 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I agree with you, even though I think that scanning pg_class for identifying the relfilenode looks like a more sensible thing to do than directly scanning the file system, we need to consider one point that, now also in current system (in create database) we are scanning the directory for copying the file so instead of copying them directly we need to logically identify the relfilenode and then copy it block by block, so maybe this approach will not make anyone unhappy because it is not any worse than the current system.

So, I agree. If we can't get agreement on this approach, then we can do that, and as you say, it's no worse than what we are doing now. But I am just trying to lay out my view of why I think that's not as good as this.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Sep 2, 2021 at 8:52 PM Robert Haas <robertmhaas@gmail.com> wrote:

PFA an updated version of the patch, where I have fixed the issues reported by you and also done some more refactoring and patch splitting. Next, I am planning to post a patch with another approach, where we scan the directory instead of scanning pg_class for identifying the relfilenodes. For the specific comments, please find my responses inline.

> Andres pointed out that this approach ends up accessing relations
> without taking a lock on them. It doesn't look like you did anything
> about that.

Now I acquire a lock before scanning pg_class as well as the other relfilenodes.

> + /* Built-in oids are mapped directly */
> + if (classForm->oid < FirstGenbkiObjectId)
> + relfilenode = classForm->oid;
> + else if (OidIsValid(classForm->relfilenode))
> + relfilenode = classForm->relfilenode;
> + else
> + continue;
>
> Am I missing something, or is this totally busted?

Handled the mapped relations using the relmapper.

> /*
> + * Now drop all buffers holding data of the target database; they should
> + * no longer be dirty so DropDatabaseBuffers is safe.
>
> The way things worked before, this was true, but now AFAICS it's
> false. I'm not sure whether that means that DropDatabaseBuffers() here
> is actually unsafe or whether it just means that you haven't updated
> the comment to explain the reason.

Now we can only drop the buffers specific to the old tablespace, not the new tablespace, so we cannot directly use the dboid alone. I extended the DropDatabaseBuffers interface to take the tablespace oid as an input and updated the comments accordingly.

> + * Since we copy the file directly without looking at the shared buffers,
> + * we'd better first flush out any pages of the source relation that are
> + * in shared buffers. We assume no new changes will be made while we are
> + * holding exclusive lock on the rel.
>
> Ditto.

I think this comment is related to index_copy_data() and is still valid; it shows up in the patch due to some refactoring, so I have separated out that refactoring as patch 0003 to avoid confusion.

> + /* As always, WAL must hit the disk before the data update does. */
>
> Actually, the way it's coded now, part of the on-disk changes are done
> before WAL is issued, and part are done after. I doubt that's the
> right idea. There's nothing special about writing the actual payload
> bytes vs. the other on-disk changes (creating directories and files).
> In any case the ordering deserves a better-considered comment than
> this one.

Changed: now WAL first and then the disk change.

Open questions:
- Scan pg_class vs. scan directories
- Whether to retain the old CREATE DATABASE mechanism as an option or not.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 27, 2021 at 12:23 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Open questions:
> - Scan pg_class vs. scan directories
> - Whether to retain the old CREATE DATABASE mechanism as an option or not.

I have done some code improvements in 0001 and 0002.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Oct 4, 2021 at 2:51 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

I have implemented the patch with approach 2 as well, i.e. instead of scanning pg_class, we scan the directory. IMHO we have already discussed most of the advantages and disadvantages of both approaches, so I don't want to repeat those again. But I have noticed one more issue with approach 2: if we scan the directory, then we don't have any way to identify the relation OID, and that is required in order to acquire the relation lock before copying it, right?

Patch details:
0001 to 0006 implement approach 1.
0007 removes the code of pg_class scanning and adds the directory scan.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
- v5-0001-Refactor-relmap-load-and-relmap-write-functions.patch
- v5-0002-Extend-relmap-interfaces.patch
- v5-0003-Refactor-index_copy_data.patch
- v5-0004-Extend-bufmgr-interfaces.patch
- v5-0005-New-interface-to-lock-relation-id.patch
- v5-0006-WAL-logged-CREATE-DATABASE.patch
- v5-0007-POC-WAL-LOG-CREATE-DATABASE-APPROACH-2.patch
Hi,
I've looked over this patch set and email thread a couple times, and I don't see anything amiss, but I'm also not terribly familiar with the subsystems this part of the code relies on. I haven't yet tried to stress test with a large database, but it seems like a good idea to do so.
I have a couple comments and questions:
0006:
+ * XXX We can optimize RelationMapOidToFileenodeForDatabase API
+ * so that instead of reading the relmap file every time, it can
+ * save it in a temporary variable and use it for subsequent
+ * calls. Then later reset it once we're done or at the
+ * transaction end.
Do we really need to consider optimizing here? Only a handful of relations will be found in the relmap, right?
+ * Once we start copying files from the source database, we need to be able
+ * to clean 'em up if we fail. Use an ENSURE block to make sure this
+ * happens. (This is not a 100% solution, because of the possibility of
+ * failure during transaction commit after we leave this routine, but it
+ * should handle most scenarios.)
This comment in master started with
- * Once we start copying subdirectories, we need to be able to clean 'em
Is the distinction important enough to change this comment? Also, is "most scenarios" still true with the patch? I haven't read into how ENSURE works.
Same with this comment change, seems fine the way it was:
- * Use an ENSURE block to make sure we remove the debris if the copy fails
- * (eg, due to out-of-disk-space). This is not a 100% solution, because
- * of the possibility of failure during transaction commit, but it should
- * handle most scenarios.
+ * Use an ENSURE block to make sure we remove the debris if the copy fails.
+ * This is not a 100% solution, because of the possibility of failure
+ * during transaction commit, but it should handle most scenarios.
And do we need additional tests? Maybe we don't, but it seems good to make sure.
I haven't looked at 0007, and I have no opinion on which approach is better.
On Tue, Nov 23, 2021 at 10:29 PM John Naylor <john.naylor@enterprisedb.com> wrote:
>
> I've looked over this patch set and email thread a couple times, and I don't see anything amiss, but I'm also not terribly familiar with the subsystems this part of the code relies on. I haven't yet tried to stress test with a large database, but it seems like a good idea to do so.

Thanks, John, for looking into the patches. Yeah, that makes sense; next week I will try to test with a large database, and maybe with multiple tablespaces as well, to see how this behaves.

> I have a couple comments and questions:
>
> 0006:
>
> + * XXX We can optimize RelationMapOidToFileenodeForDatabase API
> + * so that instead of reading the relmap file every time, it can
> + * save it in a temporary variable and use it for subsequent
> + * calls. Then later reset it once we're done or at the
> + * transaction end.
>
> Do we really need to consider optimizing here? Only a handful of relations will be found in the relmap, right?

You are right, it is actually not required; I will remove this comment.

> + * Once we start copying files from the source database, we need to be able
> + * to clean 'em up if we fail. Use an ENSURE block to make sure this
> + * happens. (This is not a 100% solution, because of the possibility of
> + * failure during transaction commit after we leave this routine, but it
> + * should handle most scenarios.)
>
> This comment in master started with
>
> - * Once we start copying subdirectories, we need to be able to clean 'em
>
> Is the distinction important enough to change this comment? Also, is "most scenarios" still true with the patch? I haven't read into how ENSURE works.

Actually, it is like a PG_TRY()/PG_CATCH() block with the extra assurance of cleanup on shmem exit as well. In the cleanup function, we go through all the tablespaces and remove the new DB-related directories that we were trying to create. And you are right, we actually don't need to change the comments.

> Same with this comment change, seems fine the way it was:

Correct.

> - * Use an ENSURE block to make sure we remove the debris if the copy fails
> - * (eg, due to out-of-disk-space). This is not a 100% solution, because
> - * of the possibility of failure during transaction commit, but it should
> - * handle most scenarios.
> + * Use an ENSURE block to make sure we remove the debris if the copy fails.
> + * This is not a 100% solution, because of the possibility of failure
> + * during transaction commit, but it should handle most scenarios.
>
> And do we need additional tests? Maybe we don't, but it seems good to make sure.
>
> I haven't looked at 0007, and I have no opinion on which approach is better.

Okay. I prefer the first approach (patches 0001-0006) for mainly two reasons: 1) it does not directly scan the raw files to identify which files to copy, so it seems cleaner to me; 2) with 0007, if we directly scan the directory, we don't know the relation OID, so there is no way to acquire the relation lock before acquiring the buffer lock.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
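For readers unfamiliar with the ENSURE machinery discussed above, its shape is roughly the following. The PG_ENSURE_ERROR_CLEANUP / PG_END_ENSURE_ERROR_CLEANUP macros are the real ones from storage/ipc.h; the callback, its parameter struct, and its body here are hypothetical sketches of the directory cleanup described:

    #include "postgres.h"
    #include "storage/ipc.h"

    /* hypothetical parameter struct and callback (the patch has its own) */
    typedef struct createdb_failure_params_sketch
    {
        Oid         dest_dboid;
    } createdb_failure_params_sketch;

    static void
    createdb_failure_callback_sketch(int code, Datum arg)
    {
        createdb_failure_params_sketch *fparms =
            (createdb_failure_params_sketch *) DatumGetPointer(arg);

        /* walk the tablespaces and rmtree() the new database directories;
         * body omitted in this sketch */
        (void) fparms;
    }

    /* usage pattern, as inside createdb(): */
        createdb_failure_params_sketch fparms;

        PG_ENSURE_ERROR_CLEANUP(createdb_failure_callback_sketch,
                                PointerGetDatum(&fparms));
        {
            /* copy the source database here; an ERROR anywhere in this
             * block, or a process exit, runs the callback - i.e.
             * PG_TRY/PG_CATCH plus shmem-exit coverage */
        }
        PG_END_ENSURE_ERROR_CLEANUP(createdb_failure_callback_sketch,
                                    PointerGetDatum(&fparms));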
On Tue, Oct 5, 2021 at 7:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Patch details:
> 0001 to 0006 implement approach 1.
> 0007 removes the code of pg_class scanning and adds the directory scan.

I had a scan through the patches, though have not yet actually run any tests to try to better gauge their benefit. I do have some initial review comments though:

0003

src/backend/commands/tablecmds.c
(1) RelationCopyAllFork()
In the following comment:

+/*
+ * Copy source smgr all fork's data to the destination smgr.
+ */

Shouldn't it say "smgr relation"?
Also, you could additionally say ", using a specified fork data copying function." or something like that, to account for the additional argument.

0006

src/backend/commands/dbcommands.c
(1) function prototype location

The following prototype is currently located in the "non-export function prototypes" section of the source file, but it's not static - shouldn't it be in dbcommands.h?

+void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
+ ForkNumber forkNum, char relpersistence);

(2) CreateDirAndVersionFile()
Shouldn't the following code:

+ fd = OpenTransientFile(versionfile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+ if (fd < 0 && errno == EEXIST && isRedo)
+ fd = OpenTransientFile(versionfile, O_RDWR | PG_BINARY);

actually be:

+ fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
+ if (fd < 0 && errno == EEXIST && isRedo)
+ fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);

since we're only writing to that file descriptor and we want to truncate the file if it already exists.

The current comment says "... open it in the write mode.", but should say "... open it in write mode."

Also, shouldn't you be writing a newline (\n) after the PG_MAJORVERSION? (compare with code in initdb.c)

(3) GetDatabaseRelationList()
Shouldn't:

+ if (PageIsNew(page) || PageIsEmpty(page))
+ continue;

be:

+ if (PageIsNew(page) || PageIsEmpty(page))
+ {
+ UnlockReleaseBuffer(buf);
+ continue;
+ }

?

Also, in the following code:

+ if (rnodelist == NULL)
+ rnodelist = list_make1(relinfo);
+ else
+ rnodelist = lappend(rnodelist, relinfo);

it should really be "== NIL" rather than "== NULL".
But in any case, that code can just be:

rnodelist = lappend(rnodelist, relinfo);

because lappend() will create a list if the first arg is NIL.

(4) RelationCopyStorageUsingBuffer()

In the function comments, IMO it is better to use "APIs" instead of "apis".

Also, better to use "get" instead of "got" in the following comment:

+ /* If we got a cancel signal during the copy of the data, quit */

0007 (I think I prefer the first approach rather than this 2nd approach)

src/backend/commands/dbcommands.c
(1) createdb()
pfree(srcpath) seems to be missing, in the case that CopyDatabase() gets called.

(2) GetRelfileNodeFromFileName()
%s in sscanf() allows an unbounded read and is considered potentially dangerous (allows buffer overflow), especially here where FORKNAMECHARS is so small.

+ nmatch = sscanf(filename, "%u_%s", &relfilenode, forkname);

How about using the following instead in this case:

+ nmatch = sscanf(filename, "%u_%4s", &relfilenode, forkname);

? (even if there were > 4 chars after the underscore, it would still match and InvalidOid would be returned because nmatch==2)

Regards,
Greg Nancarrow
Fujitsu Australia
On Thu, Nov 25, 2021 at 1:07 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Tue, Oct 5, 2021 at 7:07 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Patch details:
> > 0001 to 0006 implement approach 1.
> > 0007 removes the code of pg_class scanning and adds the directory scan.
>
> I had a scan through the patches, though have not yet actually run any
> tests to try to better gauge their benefit.
> I do have some initial review comments though:
>
> 0003
>
> src/backend/commands/tablecmds.c
> (1) RelationCopyAllFork()
> In the following comment:
>
> +/*
> + * Copy source smgr all fork's data to the destination smgr.
> + */
>
> Shouldn't it say "smgr relation"?
> Also, you could additionally say ", using a specified fork data
> copying function." or something like that, to account for the
> additional argument.
>
> 0006
>
> src/backend/commands/dbcommands.c
> (1) function prototype location
>
> The following prototype is currently located in the "non-export
> function prototypes" section of the source file, but it's not static -
> shouldn't it be in dbcommands.h?
>
> +void RelationCopyStorageUsingBuffer(SMgrRelation src, SMgrRelation dst,
> + ForkNumber forkNum, char relpersistence);
>
> (2) CreateDirAndVersionFile()
> Shouldn't the following code:
>
> + fd = OpenTransientFile(versionfile, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
> + if (fd < 0 && errno == EEXIST && isRedo)
> + fd = OpenTransientFile(versionfile, O_RDWR | PG_BINARY);
>
> actually be:
>
> + fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
> + if (fd < 0 && errno == EEXIST && isRedo)
> + fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
>
> since we're only writing to that file descriptor and we want to
> truncate the file if it already exists.
>
> The current comment says "... open it in the write mode.", but should
> say "... open it in write mode."
>
> Also, shouldn't you be writing a newline (\n) after the
> PG_MAJORVERSION? (compare with code in initdb.c)
>
> (3) GetDatabaseRelationList()
> Shouldn't:
>
> + if (PageIsNew(page) || PageIsEmpty(page))
> + continue;
>
> be:
>
> + if (PageIsNew(page) || PageIsEmpty(page))
> + {
> + UnlockReleaseBuffer(buf);
> + continue;
> + }
>
> ?
>
> Also, in the following code:
>
> + if (rnodelist == NULL)
> + rnodelist = list_make1(relinfo);
> + else
> + rnodelist = lappend(rnodelist, relinfo);
>
> it should really be "== NIL" rather than "== NULL".
> But in any case, that code can just be:
>
> rnodelist = lappend(rnodelist, relinfo);
>
> because lappend() will create a list if the first arg is NIL.
>
> (4) RelationCopyStorageUsingBuffer()
>
> In the function comments, IMO it is better to use "APIs" instead of "apis".
>
> Also, better to use "get" instead of "got" in the following comment:
>
> + /* If we got a cancel signal during the copy of the data, quit */

Thanks for the review and many valuable comments. I have fixed all of them except this comment (/* If we got a cancel signal during the copy of the data, quit */), because it looks fine to me. 0007 I have dropped from the patchset for now. I have also included fixes for the comments given by John.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Nov 25, 2021 at 10:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Thanks for the review and many valuable comments, I have fixed all of
> them except this comment (/* If we got a cancel signal during the copy
> of the data, quit */) because this looks fine to me. 0007, I have
> dropped from the patchset for now. I have also included fixes for
> comments given by John.

Any progress/results yet on testing against a large database (as suggested by John Naylor) and multiple tablespaces?

Thanks for the patch updates. I have some additional minor comments:

0002
(1) Tidy patch comment

I suggest minor tidying of the patch comment, as follows:

Support new interfaces in relmapper: 1) support copying the relmap file from one database path to another database path; 2) like RelationMapOidToFilenode, provide another interface which does the same, but instead of getting it for the database we are connected to, it will get it for the input database path. These interfaces are required for the next patch, for supporting the WAL-logged CREATE DATABASE.

0003
src/include/commands/tablecmds.h
(1) typedef void (*copy_relation_storage) ...
The new typename "copy_relation_storage" needs to be added to src/tools/pgindent/typedefs.list

0006
src/backend/commands/dbcommands.c
(1) CreateDirAndVersionFile

After writing to the file, you should probably pfree(buf.data), right?
Actually, I don't think StringInfo (dynamic string allocation) is needed here, since the version string is so short, so why not just use a local "char buf[16]" buffer and snprintf() the PG_MAJORVERSION+newline into that?
Also (as mentioned in my first review), shouldn't the "O_TRUNC" flag be additionally specified in the case when OpenTransientFile() is tried for a 2nd time because of errno==EEXIST on the 1st attempt? (Otherwise, if the existing file did contain something, you'd end up writing after the existing data in the file.)

src/backend/commands/dbcommands.c
(2) typedef struct CreateDBRelInfo ... CreateDBRelInfo
The new typename "CreateDBRelInfo" needs to be added to src/tools/pgindent/typedefs.list

src/bin/pg_rewind/parsexlog.c
(3) Include additional header file
It seems that the following additional header file is not needed to compile the source file:

+#include "utils/relmapper.h"

Regards,
Greg Nancarrow
Fujitsu Australia
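To illustrate that fixed-buffer suggestion, the version-file write could look roughly like the following sketch; versionfile and isRedo are as in the quoted snippets above, and fsync handling is omitted. OpenTransientFile()/CloseTransientFile() and PG_MAJORVERSION are the existing APIs and macro:

    #include "postgres.h"
    #include <fcntl.h>
    #include <unistd.h>
    #include "storage/fd.h"

        char        vbuf[16];
        int         fd;
        int         nbytes;

        /* major version plus newline, matching what initdb writes */
        nbytes = snprintf(vbuf, sizeof(vbuf), "%s\n", PG_MAJORVERSION);

        fd = OpenTransientFile(versionfile, O_WRONLY | O_CREAT | O_EXCL | PG_BINARY);
        if (fd < 0 && errno == EEXIST && isRedo)
            fd = OpenTransientFile(versionfile, O_WRONLY | O_TRUNC | PG_BINARY);
        if (fd < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not create file \"%s\": %m", versionfile)));

        if (write(fd, vbuf, nbytes) != nbytes)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not write to file \"%s\": %m", versionfile)));

        if (CloseTransientFile(fd) != 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not close file \"%s\": %m", versionfile)));

The O_TRUNC in the redo branch guarantees the rewrite starts from offset zero even if a previous attempt left a partially written file.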
On Thu, Nov 25, 2021 at 10:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Thanks for the review and many valuable comments, I have fixed all of
> them except this comment (/* If we got a cancel signal during the copy
> of the data, quit */) because this looks fine to me. 0007, I have
> dropped from the patchset for now. I have also included fixes for
> comments given by John.

I found the following issue with the patches applied: a server crash occurs after the following sequence of commands:

create tablespace tbsp1 location '<directory>/tbsp1';
create tablespace tbsp2 location '<directory>/tbsp2';
create database test1 tablespace tbsp1;
create database test2 template test1 tablespace tbsp2;
alter database test2 set tablespace tbsp1;
checkpoint;

The following type of message is seen in the server log:

2021-12-01 16:48:26.623 AEDT [67423] PANIC: could not fsync file "pg_tblspc/16385/PG_15_202111301/16387/3394": No such file or directory
2021-12-01 16:48:27.228 AEDT [67422] LOG: checkpointer process (PID 67423) was terminated by signal 6: Aborted
2021-12-01 16:48:27.228 AEDT [67422] LOG: terminating any other active server processes
2021-12-01 16:48:27.233 AEDT [67422] LOG: all server processes terminated; reinitializing

Also (prior to running the checkpoint command above), I've seen errors like the following when running pg_dumpall:

pg_dump: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: PANIC: could not open critical system index 2662
pg_dumpall: error: pg_dump failed on database "test2", exiting

Hopefully the above example will help in tracking down the cause.

Regards,
Greg Nancarrow
Fujitsu Australia
On Wed, Dec 1, 2021 at 12:07 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> I found the following issue with the patches applied:
>
> A server crash occurs after the following sequence of commands:
>
> create tablespace tbsp1 location '<directory>/tbsp1';
> create tablespace tbsp2 location '<directory>/tbsp2';
> create database test1 tablespace tbsp1;
> create database test2 template test1 tablespace tbsp2;
> alter database test2 set tablespace tbsp1;
> checkpoint;
>
> The following type of message is seen in the server log:
>
> 2021-12-01 16:48:26.623 AEDT [67423] PANIC: could not fsync file
> "pg_tblspc/16385/PG_15_202111301/16387/3394": No such file or
> directory

Thanks a lot for testing this. From the error, it seems like some of the old buffers w.r.t. the previous tablespace are not dropped after the movedb. Actually, we are calling DropDatabaseBuffers() after copying to the new tablespace, dropping all the buffers of this database w.r.t. the old tablespace. But it seems something is missing; I will reproduce this and try to fix it by tomorrow. I will also fix the other review comments raised by you in the previous mail.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 1, 2021 at 6:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Thanks a lot for testing this. From the error, it seems like some of
> the old buffers w.r.t. the previous tablespace are not dropped after the
> movedb. ... But it seems something is missing; I will
> reproduce this and try to fix it by tomorrow.

Okay, I got the issue: basically, we are dropping the database buffers but not unregistering the existing sync requests for database buffers w.r.t. the old tablespace. The attached patch fixes that. I also had to extend ForgetDatabaseSyncRequests so that we can delete the sync requests of the database for a particular tablespace, so I added another patch for that (0006).

I will test the performance scenario next week, as suggested by John.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachments:
- v7-0001-Refactor-relmap-load-and-relmap-write-functions.patch
- v7-0002-Extend-relmap-interfaces.patch
- v7-0003-Refactor-index_copy_data.patch
- v7-0004-Extend-bufmgr-interfaces.patch
- v7-0005-New-interface-to-lock-relation-id.patch
- v7-0006-Extend-ForgetDatabaseSyncRequests-interface.patch
- v7-0007-WAL-logged-CREATE-DATABASE.patch
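The shape of that 0006 fix is presumably something like the following sketch; the two-argument signatures are assumptions based on the description above (the existing interfaces take only a database OID), not taken from the patch itself:

    /* assumed extended declarations */
    extern void DropDatabaseBuffers(Oid dbid, Oid tspcoid);
    extern void ForgetDatabaseSyncRequests(Oid dbid, Oid tspcoid);

    /* hypothetical call sites in movedb(), after the copy to the new
     * tablespace has completed */
        DropDatabaseBuffers(db_id, src_tsoid);          /* drop old-tablespace buffers */
        ForgetDatabaseSyncRequests(db_id, src_tsoid);   /* cancel queued fsyncs so the
                                                         * checkpointer can't PANIC on
                                                         * files that no longer exist */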
I see that this patch reduces the database creation time by almost 3-4 times, provided that the template database has some user data in it. However, there are a couple of points to be noted:
1) It makes the crash recovery a bit slower than before if the crash has occurred after the execution of a create database statement. Moreover, if the template database size is big, it might even generate a lot of WAL files which the user needs to be aware of.
2) This will put a lot of load on the first checkpoint that will occur after creating the database statement. I will experiment around this to see if this has any side effects.
--
Further, the code changes in the patch look good. I just have a few comments:
+void
+LockRelationId(LockRelId *relid, LOCKMODE lockmode)
+{
+ LOCKTAG tag;
+ LOCALLOCK *locallock;
+ LockAcquireResult res;
+
+ SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);
Should there be an assertion statement here to ensure that relid->dbId and relid->relId are valid?
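Presumably something like this is what's being suggested (a sketch, not the patch's actual code):

    void
    LockRelationId(LockRelId *relid, LOCKMODE lockmode)
    {
        LOCKTAG     tag;

        /* the suggested sanity checks, before building the lock tag */
        Assert(OidIsValid(relid->dbId));
        Assert(OidIsValid(relid->relId));

        SET_LOCKTAG_RELATION(tag, relid->dbId, relid->relId);

        /* ... LockAcquire() etc., as in the snippet above ... */
    }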
--
if (info == XLOG_DBASE_CREATE)
{
xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
- char *src_path;
- char *dst_path;
- struct stat st;
-
- src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
- dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
+ char *dbpath;
- /*
- * Our theory for replaying a CREATE is to forcibly drop the target
- * subdirectory if present, then re-copy the source data. This may be
- * more work than needed, but it is simple to implement.
- */
- if (stat(dst_path, &st) == 0 && S_ISDIR(st.st_mode))
- {
- if (!rmtree(dst_path, true))
- /* If this failed, copydir() below is going to error. */
- ereport(WARNING,
- (errmsg("some useless files may be left behind in old database directory \"%s\"",
- dst_path)));
- }
I think this is a significant change and probably needs some kind of explanation/comments as to why we are just creating a dir and copying the version file when replaying the create database operation. Earlier, this meant replaying the complete create database operation; that doesn't seem to be the case now.
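For context, the replay path after the patch presumably reduces to something like the following sketch, reconstructed from the diff above; only the directory and PG_VERSION are re-created here, since the relation data now arrives through its own smgr-create and FPI records:

    if (info == XLOG_DBASE_CREATE)
    {
        xl_dbase_create_rec *xlrec = (xl_dbase_create_rec *) XLogRecGetData(record);
        char       *dbpath;

        dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);

        /*
         * Only the directory and the PG_VERSION file are re-created here;
         * the relation files and their contents are replayed separately
         * from their own WAL records.
         */
        CreateDirAndVersionFile(dbpath, xlrec->db_id, xlrec->tablespace_id, true);
        pfree(dbpath);
    }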
--
Have you intentionally skipped pg_internal.init file from being copied to the target database?
--
With Regards,
Ashutosh Sharma.
On Thu, Dec 2, 2021 at 7:20 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Wed, Dec 1, 2021 at 6:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Thanks a lot for testing this. From the error, it seems like some of
> the old buffer w.r.t. the previous tablespace is not dropped after the
> movedb. Actually, we are calling DropDatabaseBuffers() after copying
> to a new tablespace and dropping all the buffers of this database
> w.r.t the old tablespace. But seems something is missing, I will
> reproduce this and try to fix it by tomorrow. I will also fix the
> other review comments raised by you in the previous mail.
> Okay, I got the issue, basically we are dropping the database buffers
> but not unregistering the existing sync request for database buffers
> w.r.t old tablespace. Attached patch fixes that. I also had to extend
> ForgetDatabaseSyncRequests so that we can delete the sync request of
> the database for the particular tablespace so added another patch for
> the same (0006).
> I will test the performance scenario next week, which is suggested by John.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Dec 3, 2021 at 7:38 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> I see that this patch is reducing the database creation time by almost 3-4 times provided that the template database has some user data in it. However, there are couple of points to be noted:

Thanks a lot for looking into the patches.

> 1) It makes the crash recovery a bit slower than before if the crash has occurred after the execution of a create database statement. Moreover, if the template database size is big, it might even generate a lot of WAL files which the user needs to be aware of.

Yes, it will, but that is actually the only correct way to do it. Currently we just log a WAL record saying "copy the source directory to the destination directory", without noting down exactly what we wanted to copy, so we are forced to do a checkpoint right after CREATE DATABASE because in crash recovery we cannot actually replay that WAL correctly: the source directory may have different contents at REDO time than it had at DO time, which would create inconsistencies during crash recovery. To avoid this bug, the current code forces the checkpoint. Note that even now, if you force a checkpoint, crash recovery will be equally fast. So I would not say that we have made crash recovery slow; rather, we have removed some bugs, and as a result we no longer need to force the checkpoint. Also note that in the current code, even with the forced checkpoint, the bug is not completely avoided in all cases; see the comments from the code below [1].

> 2) This will put a lot of load on the first checkpoint that will occur after creating the database statement. I will experiment around this to see if this has any side effects.

But now a checkpoint can happen based on its own need, and there is no requirement to force one as before the patch.

So the major goals of this patch are: 1) correctly WAL-log CREATE DATABASE, which is a hack in the current system; 2) avoid forced checkpoints; 3) copy page by page, which will support TDE, because if the source and destination databases have different encryption, we can re-encrypt the page before copying it to the destination database, which is not possible in the current system since we copy whole directories; 4) the new database pages will now get the latest LSN, which is the correct thing; earlier, new database pages were copied directly with the old LSN.

> Further, the code changes in the patch looks good. I just have few comments:

I will look into the other comments and get back to you, thanks.

[1]
 * In PITR replay, the first of these isn't an issue, and the second
 * is only a risk if the CREATE DATABASE and subsequent template
 * database change both occur while a base backup is being taken.
 * There doesn't seem to be much we can do about that except document
 * it as a limitation.
 *
 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
 * we can avoid this.
 */
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Dec 3, 2021 at 8:28 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Dec 3, 2021 at 7:38 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> I see that this patch is reducing the database creation time by almost 3-4 times provided that the template database has some user data in it. However, there are couple of points to be noted:
Thanks a lot for looking into the patches.
>
> 1) It makes the crash recovery a bit slower than before if the crash has occurred after the execution of a create database statement. Moreover, if the template database size is big, it might even generate a lot of WAL files which the user needs to be aware of.
Yes, it will, but that is actually the only correct way to do it. Currently we just log a WAL record saying "copy the source directory to the destination directory", without noting down exactly what we wanted to copy, so we are forced to do a checkpoint right after CREATE DATABASE because in crash recovery we cannot actually replay that WAL correctly: the source directory may have different contents at REDO time than it had at DO time, which would create inconsistencies during crash recovery. To avoid this bug, the current code forces the checkpoint. Note that even now, if you force a checkpoint, crash recovery will be equally fast. So I would not say that we have made crash recovery slow; rather, we have removed some bugs, and as a result we no longer need to force the checkpoint. Also note that in the current code, even with the forced checkpoint, the bug is not completely avoided in all cases; see the comments from the code [1].
> 2) This will put a lot of load on the first checkpoint that will occur after creating the database statement. I will experiment around this to see if this has any side effects.
But now a checkpoint can happen based on its own need, and there is no requirement to force one as before the patch.

So the major goals of this patch are: 1) correctly WAL-log CREATE DATABASE, which is a hack in the current system; 2) avoid forced checkpoints; 3) copy page by page, which will support TDE, because if the source and destination databases have different encryption, we can re-encrypt the page before copying it to the destination database, which is not possible in the current system since we copy whole directories; 4) the new database pages will now get the latest LSN, which is the correct thing; earlier, new database pages were copied directly with the old LSN.
OK. Understood, thanks!
--
With Regards,
Ashutosh Sharma.
Here are a few more review comments:
1) It seems that we are not freeing the memory allocated for buf.data in CreateDirAndVersionFile().
--
+ */
+static void
+CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
+{
2) Do we need to pass dbpath here? I mean why not reconstruct it from dbid and tsid.
--
3) Not sure if this point has already been discussed, but will we be able to recover the data when wal_level is set to minimal? The following condition would be false with this wal level.
+ use_wal = XLogIsNeeded() &&
+ (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
--
With Regards,
Ashutosh Sharma.
On Mon, Dec 6, 2021 at 9:17 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Here are a few more review comments:

Thanks for reviewing it.

> 1) It seems that we are not freeing the memory allocated for buf.data in CreateDirAndVersionFile().

Yeah, this was a problem in v6, but I have fixed it in v7; can you check that?

> 2) Do we need to pass dbpath here? I mean why not reconstruct it from dbid and tsid.

Yeah, we could do that, but I thought computing dbpath has some cost, and since the caller already has it, why not pass it?

> 3) Not sure if this point has already been discussed, but will we be able to recover the data when wal_level is set to minimal? The following condition would be false with this wal level.
>
> + use_wal = XLogIsNeeded() &&
> + (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);

Since we are creating a new relfilenode, this is fine; refer to "Skipping WAL for New RelFileNode" in src/backend/access/transam/README.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
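That README rule shapes the copy of one fork roughly as in this sketch, which mirrors the existing RelationCopyStorage() in src/backend/catalog/storage.c (page verification and error handling trimmed; copying_initfork from the quoted condition is expressed here as forkNum == INIT_FORKNUM):

    #include "postgres.h"
    #include "access/xlog.h"
    #include "access/xloginsert.h"
    #include "catalog/pg_class.h"
    #include "storage/bufpage.h"
    #include "storage/smgr.h"

    static void
    copy_one_fork_sketch(SMgrRelation src, SMgrRelation dst,
                         ForkNumber forkNum, char relpersistence)
    {
        PGAlignedBlock buf;
        bool        use_wal;
        BlockNumber nblocks;
        BlockNumber blkno;

        /*
         * WAL-log page images unless this relfilenode, created in the
         * current transaction, may legally skip WAL (wal_level=minimal).
         */
        use_wal = XLogIsNeeded() &&
            (relpersistence == RELPERSISTENCE_PERMANENT ||
             forkNum == INIT_FORKNUM);

        nblocks = smgrnblocks(src, forkNum);
        for (blkno = 0; blkno < nblocks; blkno++)
        {
            smgrread(src, forkNum, blkno, buf.data);

            if (use_wal)
                log_newpage(&dst->smgr_rnode.node, forkNum, blkno,
                            (Page) buf.data, true);

            /* WAL first, then the data change (skipFsync = true) */
            smgrextend(dst, forkNum, blkno, buf.data, true);
        }

        /*
         * If we skipped WAL, the new file must reach disk before commit;
         * this sync is what keeps crash recovery correct even though no
         * page images were logged.
         */
        if (relpersistence == RELPERSISTENCE_PERMANENT && !use_wal)
            smgrimmedsync(dst, forkNum);
    }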
Thank you, Dilip, for the quick response. I am okay with the changes done in the v7 patch.
One last point - if we try to clone a huge database, as expected, CREATE DATABASE emits a lot of WAL, causing a lot of intermediate checkpoints, which seems to affect performance slightly.
--
With Regards,
Ashutosh Sharma.
On Mon, Dec 6, 2021 at 9:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Dec 6, 2021 at 9:17 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Here are few more review comments:
Thanks for reviewing it.
> 1) It seems that we are not freeing the memory allocated for buf.data in CreateDirAndVersionFile().
Yeah, this was a problem in v6, but I have fixed it in v7; can you check that?
>
> + */
> +static void
> +CreateDirAndVersionFile(char *dbpath, Oid dbid, Oid tsid, bool isRedo)
> +{
>
> 2) Do we need to pass dbpath here? I mean why not reconstruct it from dbid and tsid.
Yeah, we could do that, but I thought computing dbpath has some cost, and
since the caller already has it, why not pass it?
>
> 3) Not sure if this point has already been discussed, Will we be able to recover the data when wal_level is set to minimal because the following condition would be false with this wal level.
>
> + use_wal = XLogIsNeeded() &&
> + (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork);
>
Since we are creating a new relfilenode, this is fine; refer to "Skipping
WAL for New RelFileNode" in src/backend/access/transam/README.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 6, 2021 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> One last point - If we try to clone a huge database, as expected CREATE DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints which seems to be affecting the performance slightly.

Yes, I think this needs to be characterized better. If you have a big shared_buffers setting and a lot of those buffers are dirty and the template database is small, all of which is fairly normal, then this new approach should be much quicker. On the other hand, what if the situation is reversed? Perhaps you have a small shared_buffers and not much of it is dirty and the template database is gigantic. Then maybe this new approach will be slower. But right now I think we don't know where the crossover point is, and I think we should try to figure that out.

So for example, imagine tests with 1GB of shared_buffers, 8GB, and 64GB. And template databases with sizes of whatever the default is, 1GB, 10GB, 100GB. Repeatedly make 75% of the pages dirty and then create a new database from one of the templates. And then just measure the performance. Maybe for large databases this approach is just really the pits -- and if your max_wal_size is too small, it definitely will be. But, I don't know, maybe with reasonable settings it's not that bad. Writing everything to disk twice - once to WAL and once to the target directory - has to be more expensive than doing it once. But on the other hand, it's all sequential I/O and the data pages don't need to be fsync'd, so perhaps the overhead is relatively mild. I don't know.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Thanks Robert for sharing your thoughts.
On Mon, Dec 6, 2021 at 11:16 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Dec 6, 2021 at 9:23 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> One last point - If we try to clone a huge database, as expected CREATE DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints which seems to be affecting the performance slightly.
Yes, I think this needs to be characterized better. If you have a big
shared buffers setting and a lot of those buffers are dirty and the
template database is small, all of which is fairly normal, then this
new approach should be much quicker. On the other hand, what if the
situation is reversed? Perhaps you have a small shared buffers and not
much of it is dirty and the template database is gigantic. Then maybe
this new approach will be slower. But right now I think we don't know
where the crossover point is, and I think we should try to figure that
out.
Yes I think so too.
So for example, imagine tests with 1GB of shared_buffers, 8GB, and
64GB. And template databases with sizes of whatever the default is,
1GB, 10GB, 100GB. Repeatedly make 75% of the pages dirty and then
create a new database from one of the templates. And then just measure
the performance. Maybe for large databases this approach is just
really the pits -- and if your max_wal_size is too small, it
definitely will be. But, I don't know, maybe with reasonable settings
it's not that bad. Writing everything to disk twice - once to WAL and
once to the target directory - has to be more expensive than doing it
once. But on the other hand, it's all sequential I/O and the data
pages don't need to be fsync'd, so perhaps the overhead is relatively
mild. I don't know.
So far, I haven't found much performance overhead with a few GB of data in the template database. It's just a bit with the default settings; perhaps setting a higher value of max_wal_size would reduce this overhead.
--
With Regards,
Ashutosh Sharma.
On Mon, Dec 6, 2021 at 7:53 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Thank you, Dilip, for the quick response. I am okay with the changes done in the v7 patch.
>
> One last point - If we try to clone a huge database, as expected CREATE DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints which seems to be affecting the performance slightly.

Yeah, that is a valid point, because instead of just one WAL record for createdb we now generate WAL for each page in the database, so I agree that if max_wal_size is not big enough for those WALs, we might have to pay the cost of multiple checkpoints. However, if we compare it with the current mechanism, the checkpoint there is forced and there is no way to avoid it, whereas with the new approach the user can set a large enough max_wal_size and avoid it. So, in other words, the checkpoint is now driven by the amount of resources consumed, which is true for any other operation, e.g. ALTER TABLE SET TABLESPACE; it is more in sync with the rest of the system, whereas without the patch it was a special-purpose forced checkpoint only for createdb.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hello Dilip,
While testing the v7 patches, I am observing a crash with the below test case.
Test case:
create tablespace tab location '<dir_path>/test_dir';
create tablespace tab1 location '<dir_path>/test_dir1';
create database test tablespace tab;
\c test
create table t( a int PRIMARY KEY,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
insert into t values (generate_series(1,2000000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
alter database test set tablespace pg_default;
alter database test set tablespace tab;
\c test1
alter table t set tablespace tab;
Logfile says:
2021-12-08 23:31:58.855 +04 [134252] PANIC: could not fsync file "base/16386/4152": No such file or directory
2021-12-08 23:31:59.398 +04 [134251] LOG: checkpointer process (PID 134252) was terminated by signal 6: Aborted
Thanks.
--
Regards,
Neha Sharma
On Tue, Dec 7, 2021 at 12:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Dec 6, 2021 at 7:53 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Thank you, Dilip for the quick response. I am okay with the changes done in the v7 patch.
>
> One last point - If we try to clone a huge database, as expected CREATE DATABASE emits a lot of WALs, causing a lot of intermediate checkpoints which seems to be affecting the performance slightly.
Yeah, that is a valid point because instead of just one WAL for
createdb we will generate WAL for each page in the database, so I
agree that if the max_wal_size is not enough for those WALs then we
might have to pay the cost of multiple checkpoints. However, if we
compare it with the current mechanism then now it is a forced
checkpoint and there is no way to avoid it whereas with the new
approach user can set enough max_wal_size and they can avoid it. So
in other words now the checkpoint is driven by the amount of resource
which is true for any other operation e.g. ALTER TABLE SET TABLESPACE
so now it is in more sync with the rest of the system, but without the
patch, it was a special purpose forced checkpoint only for the
createdb.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 9, 2021 at 6:57 AM Neha Sharma <neha.sharma@enterprisedb.com> wrote:
>
> While testing the v7 patches, I am observing a crash with the below test case.
>
> Test case:
> create tablespace tab location '<dir_path>/test_dir';
> create tablespace tab1 location '<dir_path>/test_dir1';
> create database test tablespace tab;
> \c test
> create table t( a int PRIMARY KEY,b text);
> CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
> insert into t values (generate_series(1,2000000), large_val());
> alter table t set tablespace tab1 ;
> \c postgres
> create database test1 template test;
> alter database test set tablespace pg_default;
> alter database test set tablespace tab;
> \c test1
> alter table t set tablespace tab;
>
> Logfile says:
> 2021-12-08 23:31:58.855 +04 [134252] PANIC: could not fsync file "base/16386/4152": No such file or directory
> 2021-12-08 23:31:59.398 +04 [134251] LOG: checkpointer process (PID 134252) was terminated by signal 6: Aborted

I tried to reproduce the issue using your test scenario, but I needed to reduce the amount of inserted data (so reduced 2000000 to 20000) due to disk space. I then consistently get an error like the following:

postgres=# alter database test set tablespace pg_default;
ERROR: could not create file "pg_tblspc/16385/PG_15_202111301/16386/36395": File exists

(This only happens when the patch is used.)

Regards,
Greg Nancarrow
Fujitsu Australia
On Thu, Dec 9, 2021 at 4:26 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
On Thu, Dec 9, 2021 at 6:57 AM Neha Sharma <neha.sharma@enterprisedb.com> wrote:
>
> While testing the v7 patches, I am observing a crash with the below test case.
>
> Test case:
> create tablespace tab location '<dir_path>/test_dir';
> create tablespace tab1 location '<dir_path>/test_dir1';
> create database test tablespace tab;
> \c test
> create table t( a int PRIMARY KEY,b text);
> CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
> insert into t values (generate_series(1,2000000), large_val());
> alter table t set tablespace tab1 ;
> \c postgres
> create database test1 template test;
> alter database test set tablespace pg_default;
> alter database test set tablespace tab;
> \c test1
> alter table t set tablespace tab;
>
> Logfile says:
> 2021-12-08 23:31:58.855 +04 [134252] PANIC: could not fsync file "base/16386/4152": No such file or directory
> 2021-12-08 23:31:59.398 +04 [134251] LOG: checkpointer process (PID 134252) was terminated by signal 6: Aborted
>
I tried to reproduce the issue using your test scenario, but I needed
to reduce the amount of inserted data (so reduced 2000000 to 20000)
due to disk space.
I then consistently get an error like the following:
postgres=# alter database test set tablespace pg_default;
ERROR: could not create file
"pg_tblspc/16385/PG_15_202111301/16386/36395": File exists
(this only happens when the patch is used)
Yes, I was also getting this, and moving further we get a crash when we alter the table of database test1.
Below is the output of the test at my end.
postgres=# create tablespace tab1 location '/home/edb/PGsources/postgresql/inst/bin/rep_test1';
CREATE TABLESPACE
postgres=# create tablespace tab location '/home/edb/PGsources/postgresql/inst/bin/rep_test';
CREATE TABLESPACE
postgres=# create database test tablespace tab;
CREATE DATABASE
postgres=# \c test
You are now connected to database "test" as user "edb".
test=# create table t( a int PRIMARY KEY,b text);
CREATE TABLE
test=# CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
CREATE FUNCTION
test=# insert into t values (generate_series(1,2000000), large_val());
INSERT 0 2000000
test=# alter table t set tablespace tab1 ;
ALTER TABLE
test=# \c postgres
You are now connected to database "postgres" as user "edb".
postgres=# create database test1 template test;
CREATE DATABASE
postgres=# alter database test set tablespace pg_default;
ERROR: could not create file "pg_tblspc/16384/PG_15_202111301/16386/2016395": File exists
postgres=# alter database test set tablespace tab;
ALTER DATABASE
postgres=# \c test1
You are now connected to database "test1" as user "edb".
test1=# alter table t set tablespace tab;
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!?>
Regards,
Greg Nancarrow
Fujitsu Australia
Hi,

The issue here is that we are trying to create a table that exists inside a non-default tablespace when doing ALTER DATABASE. I think this should be skipped, otherwise we will come across an error like the one shown below:

ashu@postgres=# alter database test set tablespace pg_default;
ERROR: 58P02: could not create file "pg_tblspc/16385/PG_15_202111301/16386/16390": File exists

I have taken the above from Neha's test-case.

--

Attached patch fixes this. I am passing a new boolean flag named *movedb* to CopyDatabase() so that it can skip the creation of tables existing in a non-default tablespace when doing ALTER DATABASE. Alternatively, we can rename the boolean flag movedb to createdb and pass its value accordingly from movedb() or createdb(). Either way looks fine to me. Kindly check the attached patch for the changes.
Dilip, Could you please check the attached patch and let me know if it looks fine or not?
Neha, can you please re-run the test-cases with the attached patch.
Thanks,
--
With Regards,
Ashutosh Sharma.
Attachment
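To make the shape of that fix concrete, a minimal sketch of the skip logic described above might look like the following; movedb, srcrnode and src_tsid are illustrative guesses at the patch's variable names, not the actual diff:

    /*
     * Illustrative sketch only (names assumed, not taken from the actual
     * patch): inside CopyDatabase()'s loop over source relfilenodes, skip
     * relations stored outside the source database's default tablespace
     * when called from movedb(), since ALTER DATABASE .. SET TABLESPACE
     * leaves those files where they are and recreating them would fail
     * with "File exists".
     */
    if (movedb && srcrnode.spcNode != src_tsid)
        continue;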
On Thu, Dec 9, 2021 at 12:42 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi,
>
> The issue here is that we are trying to create a table that exists inside a non-default tablespace when doing ALTER DATABASE. I think this should be skipped otherwise we will come across the error like shown below:
>
> ashu@postgres=# alter database test set tablespace pg_default;
> ERROR: 58P02: could not create file "pg_tblspc/16385/PG_15_202111301/16386/16390": File exists
>
> I have taken the above from Neha's test-case.
>
> Attached patch fixes this. I am passing a new boolean flag named *movedb* to CopyDatabase() so that it could skip the creation of tables existing in non-default tablespace when doing alter database. Alternatively, we can also rename the boolean flag movedb to createdb and pass its value accordingly from movedb() or createdb(). Either way looks fine to me. Kindly check the attached patch for the changes.
>
> Dilip, Could you please check the attached patch and let me know if it looks fine or not?
>
> Neha, can you please re-run the test-cases with the attached patch.

Thanks Ashutosh, yeah I have observed the same. Earlier we were directly copying the whole directory so this was not an issue; now, if some tables of the database are already in the destination tablespace, we should skip them while copying. I will review your patch and merge it into the main patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 9, 2021 at 11:12 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> Hi,
>
> The issue here is that we are trying to create a table that exists inside a non-default tablespace when doing ALTER DATABASE. I think this should be skipped otherwise we will come across the error like shown below:
>
> ashu@postgres=# alter database test set tablespace pg_default;
> ERROR: 58P02: could not create file "pg_tblspc/16385/PG_15_202111301/16386/16390": File exists
Thanks Ashutosh for the patch, the mentioned issue has been resolved with it.
But I am still able to reproduce the crash consistently on top of this patch + the v7 patches; just the test case has been modified.
create tablespace tab1 location '<dir_path>/test1';
create tablespace tab location '<dir_path>/test';
create database test tablespace tab;
\c test
create table t( a int PRIMARY KEY,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
insert into t values (generate_series(1,100000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
\c test1
alter table t set tablespace tab;
\c postgres
alter database test1 set tablespace tab1;
--Cancel the below command after few seconds
alter database test1 set tablespace pg_default;
\c test1
alter table t set tablespace tab1;
Logfile Snippet:
2021-12-09 17:49:18.110 +04 [18151] PANIC: could not fsync file "base/116398/116400": No such file or directory
2021-12-09 17:49:19.105 +04 [18150] LOG: checkpointer process (PID 18151) was terminated by signal 6: Aborted
2021-12-09 17:49:19.105 +04 [18150] LOG: terminating any other active server processes
On Thu, Dec 9, 2021 at 7:23 PM Neha Sharma <neha.sharma@enterprisedb.com> wrote:
>
> On Thu, Dec 9, 2021 at 11:12 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> \c postgres
> alter database test1 set tablespace tab1;
>
> --Cancel the below command after few seconds
> alter database test1 set tablespace pg_default;
>
> \c test1
> alter table t set tablespace tab1;
>
> Logfile Snippet:
> 2021-12-09 17:49:18.110 +04 [18151] PANIC: could not fsync file "base/116398/116400": No such file or directory
> 2021-12-09 17:49:19.105 +04 [18150] LOG: checkpointer process (PID 18151) was terminated by signal 6: Aborted
> 2021-12-09 17:49:19.105 +04 [18150] LOG: terminating any other active server processes

Yeah, it seems like the fsync requests produced while copying database objects to the new tablespace are not unregistered. This seems like a different issue than previously raised. I will work on this next week, thanks for testing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 9, 2021 at 7:23 PM Neha Sharma <neha.sharma@enterprisedb.com> wrote:
> Thanks Ashutosh for the patch, the mentioned issue has been resolved with the patch.
> But I am still able to reproduce the crash consistently on top of this patch + v7 patches, just the test case has been modified.
>
> Logfile Snippet:
> 2021-12-09 17:49:18.110 +04 [18151] PANIC: could not fsync file "base/116398/116400": No such file or directory
> 2021-12-09 17:49:19.105 +04 [18150] LOG: checkpointer process (PID 18151) was terminated by signal 6: Aborted
> 2021-12-09 17:49:19.105 +04 [18150] LOG: terminating any other active server processes

This is different from the issue you raised earlier. As Dilip said, we need to unregister sync requests for files that got successfully copied to the target database, but the overall alter database statement failed. We are doing this when the database is created successfully, but not when it fails.
Probably doing the same inside the cleanup function movedb_failure_callback() should fix the problem.
--
With Regards,
Ashutosh Sharma.
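A rough sketch of the suggested cleanup, reusing the existing movedb_failure_callback() machinery in dbcommands.c; the calls shown only approximate the idea, and the actual fix (see the v8-0006 patch below) narrows the sync-request cancellation to the destination tablespace:

    /*
     * Approximate sketch (not the actual patch): on movedb() failure,
     * drop cached buffers and cancel pending fsync requests for the
     * partially copied destination before unlinking its files, so the
     * checkpointer never tries to fsync a file that no longer exists.
     */
    static void
    movedb_failure_callback(int code, Datum arg)
    {
        movedb_failure_params *fparms = (movedb_failure_params *) DatumGetPointer(arg);
        char       *dstpath = GetDatabasePath(fparms->dest_dboid, fparms->dest_tsoid);

        /* Invalidate any shared buffers holding pages of the aborted copy. */
        DropDatabaseBuffers(fparms->dest_dboid);

        /*
         * Cancel queued fsync requests; the real patch restricts this to
         * the destination tablespace, which this sketch glosses over.
         */
        ForgetDatabaseSyncRequests(fparms->dest_dboid);

        (void) rmtree(dstpath, true);
    }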
On Fri, Dec 10, 2021 at 7:39 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>>
>> Logfile Snippet:
>> 2021-12-09 17:49:18.110 +04 [18151] PANIC: could not fsync file "base/116398/116400": No such file or directory
>> 2021-12-09 17:49:19.105 +04 [18150] LOG: checkpointer process (PID 18151) was terminated by signal 6: Aborted
>> 2021-12-09 17:49:19.105 +04 [18150] LOG: terminating any other active server processes
>
> This is different from the issue you raised earlier. As Dilip said, we need to unregister sync requests for files that got successfully copied to the target database, but the overall alter database statement failed. We are doing this when the database is created successfully, but not when it fails.
> Probably doing the same inside the cleanup function movedb_failure_callback() should fix the problem.

Correct, I have done this cleanup. Apart from this, we now drop the fsync requests in the create database failure case as well, and we also drop buffers in the error case of both createdb and movedb. I have also fixed the other issue for which you gave the patch (a bit differently): basically, in the movedb case the source and destination dboid are the same, so we don't need an additional parameter, and I readjusted the conditions to avoid nested ifs.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v8-0001-Refactor-relmap-load-and-relmap-write-functions.patch
- v8-0002-Extend-relmap-interfaces.patch
- v8-0005-New-interface-to-lock-relation-id.patch
- v8-0004-Extend-bufmgr-interfaces.patch
- v8-0003-Refactor-index_copy_data.patch
- v8-0007-WAL-logged-CREATE-DATABASE.patch
- v8-0006-Extend-ForgetDatabaseSyncRequests-interface.patch
+ /*
+ * If the relation is from the default tablespace then we need to
+ * create it in the destinations db's default tablespace. Otherwise,
+ * we need to create in the same tablespace as it is in the source
+ * database.
+ */
This comment looks a bit confusing to me especially because when we say destination db's default tablespace people may think of pg_default tablespace (at least I think so). Basically what you are trying to say here - "If the relation exists in the same tablespace as the src database, then in the destination db also it should be the same or something like that.. " So, why not put it that way instead of referring to it as the default tablespace. It's just my view. If you disagree you can ignore it.
--
+ else if (src_dboid == dst_dboid)
+ continue;
+ else
+ dstrnode.spcNode = srcrnode.spcNode;;
There is an extra semicolon here.
--
With Regards,
Ashutosh Sharma.
On Mon, Dec 13, 2021 at 8:34 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> + /*
> + * If the relation is from the default tablespace then we need to
> + * create it in the destinations db's default tablespace. Otherwise,
> + * we need to create in the same tablespace as it is in the source
> + * database.
> + */
>
> This comment looks a bit confusing to me especially because when we say destination db's default tablespace people may think of pg_default tablespace (at least I think so). Basically what you are trying to say here - "If the relation exists in the same tablespace as the src database, then in the destination db also it should be the same or something like that.." So, why not put it that way instead of referring to it as the default tablespace. It's just my view. If you disagree you can ignore it.
>
> --
>
> + else if (src_dboid == dst_dboid)
> + continue;
> + else
> + dstrnode.spcNode = srcrnode.spcNode;;
>
> There is an extra semicolon here.

Noted. I will fix them in the next version.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Dec 2, 2021 at 07:19:50PM +0530, Dilip Kumar wrote:

From the patch:

> Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
> then forces another checkpoint. The comments in the createdb() function
> explain the reasons for this. The attached patch fixes this problem by making
> create database completely WAL logged so that we can avoid the checkpoints.
>
> This can also be useful for supporting the TDE. For example, if we need different
> encryption for the source and the target database then we can not re-encrypt the
> page data if we copy the whole directory. But with this patch, we are copying
> page by page so we have an opportunity to re-encrypt the page before copying that
> to the target database.

Uh, why is this true? Why can't we just copy the heap/index files 8k at
a time and reencrypt them during the file copy, rather than using shared
buffers?

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.
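For illustration, the block-at-a-time copy Bruce describes might look roughly like the following; decrypt_block() and encrypt_block() are assumed TDE primitives, not existing PostgreSQL functions:

    /*
     * Sketch of a streamed file copy with re-encryption, one 8kB block
     * at a time, bypassing shared buffers. decrypt_block() and
     * encrypt_block() are hypothetical TDE hooks; error handling is
     * abbreviated.
     */
    static void
    copy_file_reencrypt(int srcfd, int dstfd,
                        const void *src_key, const void *dst_key)
    {
        char        buf[BLCKSZ];
        ssize_t     nread;

        while ((nread = read(srcfd, buf, BLCKSZ)) == BLCKSZ)
        {
            decrypt_block(buf, src_key);    /* hypothetical */
            encrypt_block(buf, dst_key);    /* hypothetical */
            if (write(dstfd, buf, BLCKSZ) != BLCKSZ)
                elog(ERROR, "could not write block");
        }
        if (nread != 0)
            elog(ERROR, "could not read block");
    }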
On Thu, Dec 16, 2021 at 12:15 AM Bruce Momjian <bruce@momjian.us> wrote:
>
> On Thu, Dec 2, 2021 at 07:19:50PM +0530, Dilip Kumar wrote:
> From the patch:
>
> > Currently, CREATE DATABASE forces a checkpoint, then copies all the files,
> > then forces another checkpoint. The comments in the createdb() function
> > explain the reasons for this. The attached patch fixes this problem by making
> > create database completely WAL logged so that we can avoid the checkpoints.
> >
> > This can also be useful for supporting the TDE. For example, if we need different
> > encryption for the source and the target database then we can not re-encrypt the
> > page data if we copy the whole directory. But with this patch, we are copying
> > page by page so we have an opportunity to re-encrypt the page before copying that
> > to the target database.
>
> Uh, why is this true? Why can't we just copy the heap/index files 8k at
> a time and reencrypt them during the file copy, rather than using shared
> buffers?

Hi Bruce,

Yeah, you are right that if we copy in 8k blocks then we can re-encrypt the pages, but in the current system we are not copying block by block. The main effort of this patch is not only TDE but getting rid of the checkpoints we are forced to do before and after create database. So my point is that since this patch copies page by page, we get an opportunity to re-encrypt each page. I agree that if re-encryption had been the main goal of this patch then we could copy files in 8k blocks and re-encrypt those blocks; in that case, even if we had to access some page data for re-encryption (like a nonce), we could do it. But that is not the main objective.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,
While testing the v8 patches in a hot-standby setup, it was observed the master is crashing with the below error;
2021-12-16 19:32:47.757 +04 [101483] PANIC: could not fsync file "pg_tblspc/16385/PG_15_202112111/16386/16391": No such file or directory
2021-12-16 19:32:48.917 +04 [101482] LOG: checkpointer process (PID 101483) was terminated by signal 6: Aborted
Parameters configured at master:
wal_level = hot_standby
max_wal_senders = 3
hot_standby = on
max_standby_streaming_delay= -1
wal_consistency_checking='all'
max_wal_size= 10GB
checkpoint_timeout= 1d
log_min_messages=debug1
Test Case:
create tablespace tab1 location '/home/edb/PGsources/postgresql/inst/bin/test1';
create tablespace tab location '/home/edb/PGsources/postgresql/inst/bin/test';
create database test tablespace tab;
\c test
create table t( a int PRIMARY KEY,b text);
CREATE OR REPLACE FUNCTION large_val() RETURNS TEXT LANGUAGE SQL AS 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
insert into t values (generate_series(1,100000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
\c test1
alter table t set tablespace tab;
\c postgres
alter database test1 set tablespace tab1;
--cancel the below command
alter database test1 set tablespace pg_default; --press ctrl+c
\c test1
alter table t set tablespace tab1;
Log file attached for reference.
Thanks.
--
Regards,
Neha Sharma
Attachment
I am getting the below error when running the same test-case that Neha shared in her previous email.
ERROR: 55000: some relations of database "test1" are already in tablespace "tab1"
HINT: You must move them back to the database's default tablespace before using this command.
LOCATION: movedb, dbcommands.c:1555
test-case:
========
create tablespace tab1 location '/home/ashu/test1';
create tablespace tab location '/home/ashu/test';
create database test tablespace tab;
\c test
create table t(a int primary key, b text);
create or replace function large_val() returns text language sql as 'select array_agg(md5(g::text))::text from generate_series(1, 256) g';
insert into t values (generate_series(1,100000), large_val());
alter table t set tablespace tab1 ;
\c postgres
create database test1 template test;
\c test1
alter table t set tablespace tab;
\c postgres
alter database test1 set tablespace tab1; -- this fails with the given error.
Observations:
===========
Please note that before running the above alter database statement, table 't' was moved from tablespace 'tab1' to 'tab', so I am not sure why ReadDir() is returning an entry when searching for table 't' in tablespace 'tab1'. It should have returned NULL here:
while ((xlde = ReadDir(dstdir, dst_dbpath)) != NULL)
{
if (strcmp(xlde->d_name, ".") == 0 ||
strcmp(xlde->d_name, "..") == 0)
continue;
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("some relations of database \"%s\" are already in tablespace \"%s\"",
dbname, tblspcname),
errhint("You must move them back to the database's default tablespace before using this command.")));
}
Also, if I run the checkpoint explicitly before executing the above alter database statement, this error doesn't appear which means it only happens with the new changes because earlier we were doing the force checkpoint at the end of createdb statement.
--
With Regards,
Ashutosh Sharma.
On Mon, Dec 6, 2021 at 12:45 PM Robert Haas <robertmhaas@gmail.com> wrote:
> So for example, imagine tests with 1GB of shared_buffers, 8GB, and
> 64GB. And template databases with sizes of whatever the default is,
> 1GB, 10GB, 100GB. Repeatedly make 75% of the pages dirty and then
> create a new database from one of the templates. And then just measure
> the performance. Maybe for large databases this approach is just
> really the pits -- and if your max_wal_size is too small, it
> definitely will be. But, I don't know, maybe with reasonable settings
> it's not that bad. Writing everything to disk twice - once to WAL and
> once to the target directory - has to be more expensive than doing it
> once. But on the other hand, it's all sequential I/O and the data
> pages don't need to be fsync'd, so perhaps the overhead is relatively
> mild. I don't know.

I have been tied up with other things for a bit now and have not had time to look at this thread; sorry about that. I have a little more time available now so I thought I would take a look at this again and see where things stand.

Sadly, it doesn't appear to me that anyone has done any performance testing of this patch, along the lines suggested above or otherwise, and I think it's a crucial question for the patch. My reading of this thread is that nobody really likes the idea of maintaining two methods for performing CREATE DATABASE, but nobody wants to hose people who are using it to clone large databases, either. To some extent those things are inexorably in conflict. If we postulate that the 10TB template database is on a local RAID array with 40 spindles, while pg_wal is on an iSCSI volume that we access via a 128kB ISDN link, then the new system is going to be infinitely worse. But real situations aren't likely to be that bad, and it would be useful in my opinion to have an idea how bad they actually are.

I'm somewhat inclined to propose that we keep the existing method around along with the new method. Even though nobody really likes that, we don't necessarily have to maintain both methods forever. If, say, we use the new method by default in all cases, but add an option to get the old method back if you need it, we could leave it that way for a few years and then propose removing the old method (and the switch to activate it) and see if anyone complains. That way, if the new method turns out to suck in certain cases, users have a way out. However, I still think doing some performance testing would be a really good idea. It's not a great plan to make decisions about this kind of thing in an information vacuum.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Dec 22, 2021 at 9:32 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> I couldn't find the mdpostchkpt() function. Are you talking about
> SyncPostCheckpoint() ? Anyway, as you have rightly said, we need to
> unlink all the files available inside the dst_tablespaceoid/dst_dboid/
> directory by scanning the pendingUnlinks list. And finally we don't
> want the next checkpoint to unlink this file again and PANIC so for
> that we have to update the entry for this unlinked rel file in the
> hash table i.e. cancel the sync request for it.

Until commit 3eb77eba5a51780d5cf52cd66a9844cd4d26feb0 in April 2019, there was an mdpostckpt function, which is probably what was meant here.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Sun, Dec 12, 2021 at 3:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Correct, I have done this cleanup, apart from this we have dropped the
> fsync request in create database failure case as well and also need to
> drop buffer in error case of createdb as well as movedb. I have also
> fixed the other issue for which you gave the patch (a bit differently)
> basically, in case of movedb the source and destination dboid are same
> so we don't need an additional parameter and also readjusted the
> conditions to avoid nested if.

Amazingly to me given how much time has passed, these patches still apply, although I think there are a few outstanding issues that you promised to fix in the next version and haven't yet addressed.

In 0007, I think you will need to work a bit harder. I don't think that you can just add a second argument to ForgetDatabaseSyncRequests() that makes it do something other than what the name of the function suggests but without renaming the function or updating any comments. Elsewhere we have things like TablespaceCreateDbspace and ResetUnloggedRelationsInDbspaceDir so perhaps we ought to just add a new function with a name inspired by those precedents alongside the existing one, rather than doing it this way.

In 0008, this is a bit confusing:

+ PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
+ memcpy(dstPage, srcPage, BLCKSZ);

After a minute, I figured out that the point here was to force log_newpage() to actually set the LSN, but how about a comment?

I kind of wonder whether GetDatabaseRelationList should be broken into two functions so that we don't have quite such deep nesting. And I wonder if maybe the return value of GetActiveSnapshot() should be cached in a local variable.

On the whole I think there aren't huge code-level issues here, even if things need to be tweaked here and there and bugs fixed. The real key is arriving at a set of design trade-offs that doesn't make anyone too upset.

--
Robert Haas
EDB: http://www.enterprisedb.com
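A hedged guess at the kind of comment being requested, based only on the explanation above (dstPage, srcPage and dstBuf are the patch's variables as quoted; the patch author should confirm the rationale):

    /*
     * Initialize the destination page and then overwrite it with the
     * source page's contents.  Per the discussion above, the intent is
     * that log_newpage() will subsequently stamp the page with the LSN
     * of the WAL record it emits.
     */
    PageInit(dstPage, BufferGetPageSize(dstBuf), 0);
    memcpy(dstPage, srcPage, BLCKSZ);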
On Tue, Feb 8, 2022 at 10:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> I have been tied up with other things for a bit now and have not had
> time to look at this thread; sorry about that. I have a little more
> time available now so I thought I would take a look at this again and
> see where things stand.

Thanks for looking into this.

> Sadly, it doesn't appear to me that anyone has done any performance
> testing of this patch, along the lines suggested above or otherwise,
> and I think it's a crucial question for the patch.

Yeah, actually some performance testing was started, as shared by Ashutosh [1], and soon after that we got sidetracked by another issue [2] which we thought had to be fixed before we proceed with this feature.

> My reading of this thread is that nobody really likes the idea of
> maintaining two methods for performing CREATE DATABASE, but nobody
> wants to hose people who are using it to clone large databases,
> either. To some extent those things are inexorably in conflict. If we
> postulate that the 10TB template database is on a local RAID array
> with 40 spindles, while pg_wal is on an iSCSI volume that we access
> via a 128kB ISDN link, then the new system is going to be infinitely
> worse. But real situations aren't likely to be that bad, and it would
> be useful in my opinion to have an idea how bad they actually are.

Yeah, that makes sense. I will work on performance testing along these lines and also on the previous ideas you suggested.

> I'm somewhat inclined to propose that we keep the existing method
> around along with the new method. Even though nobody really likes
> that, we don't necessarily have to maintain both methods forever. If,
> say, we use the new method by default in all cases, but add an option
> to get the old method back if you need it, we could leave it that way
> for a few years and then propose removing the old method (and the
> switch to activate it) and see if anyone complains. That way, if the
> new method turns out to suck in certain cases, users have a way out.
> However, I still think doing some performance testing would be a
> really good idea. It's not a great plan to make decisions about this
> kind of thing in an information vacuum.

Yeah, that makes sense to me.

Now, one bigger question is whether we can proceed with this patch without fixing [2]. IMHO, if we are deciding to keep the old method intact as well, one option could be that for now we only change CREATE DATABASE to support both the old and new way of creating a database, and for the time being leave ALTER DATABASE SET TABLESPACE alone and let it work only with the old method. Another option is that we first fix the issue related to the tombstone file and then come back to this. IMHO, the first option could be better because we have already made good progress on this patch and it is in better shape than the other patch we are trying to make for removing the tombstone files.

[1] https://www.postgresql.org/message-id/CAE9k0Pkg20tHq8oiJ%2BxXa9%3Daf3QZCSYTw99aBaPthA1UMKhnTg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CA%2BTgmobM5FN5x0u3tSpoNvk_TZPFCdbcHxsXCoY1ytn1dXROvg%40mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 8, 2022 at 12:09:08PM -0500, Robert Haas wrote:
> Sadly, it doesn't appear to me that anyone has done any performance
> testing of this patch, along the lines suggested above or otherwise,
> and I think it's a crucial question for the patch. My reading of this
> thread is that nobody really likes the idea of maintaining two methods
> for performing CREATE DATABASE, but nobody wants to hose people who
> are using it to clone large databases, either. To some extent those
> things are inexorably in conflict. If we postulate that the 10TB
> template database is on a local RAID array with 40 spindles, while
> pg_wal is on an iSCSI volume that we access via a 128kB ISDN link,
> then the new system is going to be infinitely worse. But real
> situations aren't likely to be that bad, and it would be useful in my
> opinion to have an idea how bad they actually are.

Honestly, I never understood why the checkpoint during CREATE DATABASE was a problem --- we checkpoint by default every five minutes anyway, so why are an additional two a problem --- it just means the next checkpoint will do less work. It is hard to see how avoiding checkpoints by adding WAL writes, fsyncs, and replication traffic could be a win.

I see the patch justification outlined here:

https://www.postgresql.org/message-id/CAFiTN-sP6yLVTfjR42mEfvFwJ-SZ2iEtG1t0j=QX09X=BM+KWQ@mail.gmail.com

TDE is mentioned as a value for this patch, but I don't see why it is needed --- TDE can easily decrypt/encrypt the pages while they are copied.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.
On 6/16/21 03:52, Dilip Kumar wrote:
> On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>> Rather than use size, I'd be inclined to say use this if the source
>> database is marked as a template, and use the copydir approach for
>> anything that isn't.
> Yeah, that is possible, on the other thought wouldn't it be good to
> provide control to the user by providing two different commands, e.g.
> COPY DATABASE for the existing method (copydir) and CREATE DATABASE
> for the new method (fully wal logged)?

This proposal seems to have gotten lost.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Wed, Feb 9, 2022 at 7:49 PM Bruce Momjian <bruce@momjian.us> wrote:
>
> Honestly, I never understood why the checkpoint during CREATE DATABASE
> was a problem --- we checkpoint by default every five minutes anyway,
> so why is an additional two a problem --- it just means the next
> checkpoint will do less work. It is hard to see how avoiding
> checkpoints to add WAL writes, fsyncs, and replication traffic could be
> a win.

But don't you think that the current way of WAL logging CREATE DATABASE is a bit hacky? I mean we are just logically WAL logging the source and destination directory paths without actually WAL logging what content we want to copy. IMHO this is against the basic principle of WAL, and that's the reason we are forcefully checkpointing, to avoid replaying that WAL during crash recovery. Even after this, some of the code comments say that we have limitations during PITR [1] and we want to avoid them sometime in the future.

[1]
 * In PITR replay, the first of these isn't an issue, and the second
 * is only a risk if the CREATE DATABASE and subsequent template
 * database change both occur while a base backup is being taken.
 * There doesn't seem to be much we can do about that except document
 * it as a limitation.
 *
 * Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
 * we can avoid this.
 */
RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>
> On 6/16/21 03:52, Dilip Kumar wrote:
> > On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> >> Rather than use size, I'd be inclined to say use this if the source
> >> database is marked as a template, and use the copydir approach for
> >> anything that isn't.
> > Yeah, that is possible, on the other thought wouldn't it be good to
> > provide control to the user by providing two different commands, e.g.
> > COPY DATABASE for the existing method (copydir) and CREATE DATABASE
> > for the new method (fully wal logged)?
>
> This proposal seems to have gotten lost.

Yeah, I am planning to work on this part so that we can support both methods.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 9, 2022 at 9:19 AM Bruce Momjian <bruce@momjian.us> wrote:
> Honestly, I never understood why the checkpoint during CREATE DATABASE
> was a problem --- we checkpoint by default every five minutes anyway,
> so why is an additional two a problem --- it just means the next
> checkpoint will do less work. It is hard to see how avoiding
> checkpoints to add WAL writes, fsyncs, and replication traffic could be
> a win.

Try running pgbench with the --progress option and enough concurrent jobs to keep a moderately large system busy and watching what happens to the tps each time a checkpoint occurs. It's extremely dramatic, or at least it was the last time I ran such tests. I think that performance will sometimes drop by a factor of five or more when the checkpoint hits, and take multiple minutes to recover.

I think your statement that doing an extra checkpoint "just means the next checkpoint will do less work" is kind of misleading. That's certainly true in some situations. But when the same pages are being dirtied over and over again, an extra checkpoint often means that the system will do MUCH MORE work, because every checkpoint triggers a new set of full-page writes over the actively-updated portion of the database.

I think that very few people run systems with heavy write workloads with checkpoint_timeout=5m, precisely because of this issue. Almost every system I see has had that raised to at least 10m and sometimes 30m or more. It can make a massive difference.

> I see the patch justification outlined here:
>
> https://www.postgresql.org/message-id/CAFiTN-sP6yLVTfjR42mEfvFwJ-SZ2iEtG1t0j=QX09X=BM+KWQ@mail.gmail.com
>
> TDE is mentioned as a value for this patch, but I don't see why it is
> needed --- TDE can easily decrypt/encrypt the pages while they are
> copied.

That's true, but depending on what other design decisions we make, WAL-logging it might be a problem. Right now, when someone creates a new database, we log a single record that basically says "go copy the directory". That's very different than what we normally do, which is to log changes to individual pages, or where required, small groups of pages (e.g. a single WAL record is written for an UPDATE even though it may touch two pages). The fact that in this case we only log a single WAL record for an operation that could touch an unbounded amount of data is why this needs special handling around checkpoints. It also introduces a certain amount of fragility into the system, because if for some reason the source directory on the standby doesn't exactly match the source directory on the primary, the new databases won't match either. Any errors that creep into the process can be propagated around to other places by a system like this. However, ordinarily that doesn't happen, which is why we've been able to use this system successfully for so many years.

The other reason we've been able to use this successfully is that we're confident that we can perform exactly the same operation on the standby as we do on the primary knowing only the relevant directory names. If we say "copy this directory to there" we believe we'll be able to do that exactly the same way on the standby. Is that still true with TDE? Well, it depends. If the encryption can be performed knowing only the key and the identity of the block (database OID, tablespace OID, relfilenode, fork, block number) then it's true. But if the encryption needs to, for example, generate a random nonce for each block, then it's false.

If you want the standby to be an exact copy of the master in a system where new blocks get random nonces, then you need to replicate the copy block-by-block, not as one gigantic operation, so that you can log the nonce you picked for each block. On the other hand, maybe you DON'T want the standby to be an exact copy of the master. If, for example, you imagine a system where the master and standby aren't even using the same key, then this is a lot less relevant.

I can't predict whether PostgreSQL will get TDE in the future, and if it does, I can't predict what form it will take. Therefore any strong statement about whether this will benefit TDE or not seems to me to be pretty questionable - we don't know that it will be useful, and we don't know that it won't. But, like Dilip, I think the way we're WAL-logging CREATE DATABASE right now is a hack, and I *know* it can cause massive performance drops on busy systems.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Feb 9, 2022 at 10:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> > On 6/16/21 03:52, Dilip Kumar wrote:
> > > On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> > >> Rather than use size, I'd be inclined to say use this if the source
> > >> database is marked as a template, and use the copydir approach for
> > >> anything that isn't.
> > > Yeah, that is possible, on the other thought wouldn't it be good to
> > > provide control to the user by providing two different commands, e.g.
> > > COPY DATABASE for the existing method (copydir) and CREATE DATABASE
> > > for the new method (fully wal logged)?
> >
> > This proposal seems to have gotten lost.
>
> Yeah, I am planning to work on this part so that we can support both methods.

But can we pick a different syntax? In my view this should be an option to CREATE DATABASE rather than a whole new command.

--
Robert Haas
EDB: http://www.enterprisedb.com
On 2/9/22 10:58, Dilip Kumar wrote:
> On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>>
>> On 6/16/21 03:52, Dilip Kumar wrote:
>>> On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote:
>>>> Rather than use size, I'd be inclined to say use this if the source
>>>> database is marked as a template, and use the copydir approach for
>>>> anything that isn't.
>>> Yeah, that is possible, on the other thought wouldn't it be good to
>>> provide control to the user by providing two different commands, e.g.
>>> COPY DATABASE for the existing method (copydir) and CREATE DATABASE
>>> for the new method (fully wal logged)?
>>
>> This proposal seems to have gotten lost.
> Yeah, I am planning to work on this part so that we can support both methods.

OK, many thanks.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Tue, Feb 8, 2022 at 11:47 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Now, one bigger question is can we proceed with this patch without
> fixing [2], IMHO, if we are deciding to keep the old method also
> intact then one option could be that for now only change CREATE
> DATABASE to support both old and new way of creating database and for
> time being leave the ALTER DATABASE SET TABLESPACE alone and let it
> work only with the old method? And another option is that we first
> fix the issue related to the tombstone file and then come back to
> this?
>
> IMHO, the first option could be better in a way that we have already
> made better progress in this patch and this is in better shape than
> the other patch we are trying to make for removing the tombstone
> files.

Yeah, it's getting quite close to the end of this release cycle. I'm not sure whether we can get anything committed here at all in the time we have remaining, but I agree with you that this patch seems like a better prospect than that one.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Feb 9, 2022 at 11:00:06AM -0500, Robert Haas wrote:
> Try running pgbench with the --progress option and enough concurrent
> jobs to keep a moderately large system busy and watching what happens
> to the tps each time a checkpoint occurs. It's extremely dramatic, or
> at least it was the last time I ran such tests. I think that
> performance will sometimes drop by a factor of five or more when the
> checkpoint hits, and take multiple minutes to recover.
>
> I think your statement that doing an extra checkpoint "just means the
> next checkpoint will do less work" is kind of misleading. That's
> certainly true in some situations. But when the same pages are being
> dirtied over and over again, an extra checkpoint often means that the
> system will do MUCH MORE work, because every checkpoint triggers a new
> set of full-page writes over the actively-updated portion of the
> database.
>
> I think that very few people run systems with heavy write workloads
> with checkpoint_timeout=5m, precisely because of this issue. Almost
> every system I see has had that raised to at least 10m and sometimes
> 30m or more. It can make a massive difference.

Well, I think the worst case is that the extra checkpoint happens exactly in the middle of the interval between two scheduled checkpoints, so you are checkpointing twice as often, but if it happens just before or after a checkpoint, I assume the effect would be minimal.

So, it seems we are weighing having a checkpoint happen in the middle of a checkpoint interval vs. writing more WAL. If the WAL traffic, without CREATE DATABASE, is high, and the template database is small, writing more WAL and skipping the checkpoint will be a win, but if the WAL traffic is small and the template database is big, the extra WAL will be a loss. Is this accurate?

> I can't predict whether PostgreSQL will get TDE in the future, and if
> it does, I can't predict what form it will take. Therefore any strong
> statement about whether this will benefit TDE or not seems to me to be
> pretty questionable - we don't know that it will be useful, and we

Agreed. We would want to have a different heap/index key on the standby so we can rotate the heap/index key.

> don't know that it won't. But, like Dilip, I think the way we're
> WAL-logging CREATE DATABASE right now is a hack, and I *know* it can

Yes, it is a hack, but it seems to be a clever one that we might have chosen if it had not been part of the original system.

> cause massive performance drops on busy systems.

See above.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

If only the physical world exists, free will is an illusion.
On Wed, Feb 9, 2022 at 1:34 PM Bruce Momjian <bruce@momjian.us> wrote:
> Well, I think the worst case is that the extra checkpoint happens exactly
> in the middle of the interval between two scheduled checkpoints, so you
> are checkpointing twice as often, but if it happens just before or after
> a checkpoint, I assume the effect would be minimal.

I agree for the most part. I think that if checkpoints happen every 8 minutes normally and the extra checkpoint happens 2 minutes after the previous checkpoint, the impact may be almost as bad as if it had happened right in the middle. If it happens 5 seconds after the previous checkpoint, it should be low impact.

> So, it seems we are weighing having a checkpoint happen in the middle of
> a checkpoint interval vs. writing more WAL. If the WAL traffic, without
> CREATE DATABASE, is high, and the template database is small, writing
> more WAL and skipping the checkpoint will be a win, but if the WAL traffic
> is small and the template database is big, the extra WAL will be a loss.
> Is this accurate?

I think that's basically correct. I would expect that the worry about big template databases is mostly about template databases that are REALLY big. I think if your template database is 10GB you probably shouldn't be worried about this feature. 10GB of extra WAL isn't nothing, but if you've got reasonably capable hardware, it's not overloaded, and max_wal_size is big enough, it's probably not going to have a huge impact. Also, most of the impact will probably be on the CREATE DATABASE command itself, and other things running on the system at the same time will be impacted to a lesser degree. I think it's even possible that you will be happier with this feature than without, because you may like the idea that CREATE DATABASE itself is slow more than you like the idea of it making everything else on the system slow. On the other hand, if your template database is 1TB, the extra WAL is probably going to be a fairly big problem.

Basically I think for most people this should be neutral or a win. For people with really large template databases, it's a loss. Hence the discussion about having a way for people who prefer the current behavior to keep it.

> Agreed. We would want to have a different heap/index key on the standby
> so we can rotate the heap/index key.

I don't like that design, and I don't think that's what we should do, but I understand that you feel differently. IMHO, this thread is not the place to hash that out.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Feb 09, 2022 at 02:30:08PM -0500, Robert Haas wrote: > On Wed, Feb 9, 2022 at 1:34 PM Bruce Momjian <bruce@momjian.us> wrote: > > Well, I think the worst case is that the forced checkpoint happens exactly > > midway between two scheduled checkpoints, so you are checkpointing twice as often, > > but if it happens just before or after a checkpoint, I assume the effect would > > be minimal. > > I agree for the most part. I think that if checkpoints happen every 8 > minutes normally and the extra checkpoint happens 2 minutes after the > previous checkpoint, the impact may be almost as bad as if it had > happened right in the middle. If it happens 5 seconds after the > previous checkpoint, it should be low impact. But the extra checkpoints will be immediate, while on a properly configured system it should be a spread checkpoint. That will add some more overhead. > > So, it seems we are weighing having a checkpoint happen in the middle of > > a checkpoint interval vs writing more WAL. If the WAL traffic, without > > CREATE DATABASE, is high, and the template database is small, writing > > more WAL and skipping the checkpoint will be a win, but if the WAL traffic > > is small and the template database is big, the extra WAL will be a loss. > > Is this accurate? > > I think that's basically correct. I would expect that the worry about > a big template database is mostly about template databases that are > REALLY big. I think if your template database is 10GB you probably > shouldn't be worried about this feature. 10GB of extra WAL isn't > nothing, but if you've got reasonably capable hardware, it's not > overloaded, and max_wal_size is big enough, it's probably not going to > have a huge impact. Also, most of the impact will probably be on the > CREATE DATABASE command itself, and other things running on the system > at the same time will be impacted to a lesser degree. I think it's > even possible that you will be happier with this feature than without, > because you may like the idea that CREATE DATABASE itself is slow more > than you like the idea of it making everything else on the system > slow. On the other hand, if your template database is 1TB, the extra > WAL is probably going to be a fairly big problem. > > Basically I think for most people this should be neutral or a win. For > people with really large template databases, it's a loss. Hence the > discussion about having a way for people who prefer the current > behavior to keep it. Those extra WALs will also impact backups and replication. You could have fancy hardware, a read-mostly workload and the need to replicate over a slow WAN, and in that case the 10GB could be much more problematic.
On Wed, Feb 9, 2022 at 9:31 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Feb 9, 2022 at 10:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan <andrew@dunslane.net> wrote: > > > On 6/16/21 03:52, Dilip Kumar wrote: > > > > On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote: > > > >> Rather than use size, I'd be inclined to say use this if the source > > > >> database is marked as a template, and use the copydir approach for > > > >> anything that isn't. > > > > Yeah, that is possible, on the other thought wouldn't it be good to > > > > provide control to the user by providing two different commands, e.g. > > > > COPY DATABASE for the existing method (copydir) and CREATE DATABASE > > > > for the new method (fully wal logged)? > > > > > > This proposal seems to have gotten lost. > > > > Yeah, I am planning to work on this part so that we can support both methods. > > But can we pick a different syntax? In my view this should be an > option to CREATE DATABASE rather than a whole new command. Maybe we can provide something like CREATE DATABASE..WITH WAL_LOG=true/false ? OR CREATE DATABASE..WITH WAL_LOG_DATA_PAGE=true/false ? OR CREATE DATABASE..WITH CHECKPOINT=true/false ? And then we can explain in the documentation about these options? I think the default should be the new method? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Feb 10, 2022 at 2:52 AM Julien Rouhaud <rjuju123@gmail.com> wrote: > Those extra WALs will also impact backups and replication. You could have > fancy hardware, a read-mostly workload and the need to replicate over a slow > WAN, and in that case the 10GB could be much more problematic. True, I guess, but how bad does your WAN have to be for that to be an issue? On a 1 gigabit/second link, that's a little over 2 minutes of transfer time. That's not nothing, but it's not extreme, either, especially because there's no sense in querying an empty database. You're going to have to put some stuff in that database before you can do anything meaningful with it, and that's going to have to be replicated with or without this feature. I am not saying it couldn't be a problem, and that's why I'm endorsing making the behavior optional. But I think that it's a niche scenario. You need a bigger-than-normal template database, a slow WAN link, AND you need the amount of data loaded into the databases you create from the template to be small enough to make the cost of logging the template pages material. If you create a 10GB database from a template and then load 200GB of data into it, the WAL-logging overhead of creating the template is only 5%. I won't really be surprised if we hear that someone has a 10GB template database and likes to make a ton of copies and only change 500 rows in each one while replicating the whole thing over a slow WAN. That can definitely happen, and I'm sure whoever is doing that has reasons for it which they consider good and sufficient. However, I don't think there are likely to be a ton of people doing stuff like that - just a few. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2022-02-10 10:32:42 -0500, Robert Haas wrote: > I won't really be surprised if we hear that someone has a 10GB > template database and likes to make a ton of copies and only change > 500 rows in each one while replicating the whole thing over a slow > WAN. That can definitely happen, and I'm sure whoever is doing that > has reasons for it which they consider good and sufficient. However, I > don't think there are likely to be a ton of people doing stuff like > that - just a few. Yea. I would be a bit more concerned if we made creating template databases very cheap, e.g. by using file copy-on-write functionality like we have for pg_upgrade. But right now it's a fairly hefty operation anyway. Greetings, Andres Freund
On 2/10/22 07:32, Dilip Kumar wrote: > On Wed, Feb 9, 2022 at 9:31 PM Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Feb 9, 2022 at 10:59 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: >>> On Wed, Feb 9, 2022 at 9:25 PM Andrew Dunstan <andrew@dunslane.net> wrote: >>>> On 6/16/21 03:52, Dilip Kumar wrote: >>>>> On Tue, Jun 15, 2021 at 7:01 PM Andrew Dunstan <andrew@dunslane.net> wrote: >>>>>> Rather than use size, I'd be inclined to say use this if the source >>>>>> database is marked as a template, and use the copydir approach for >>>>>> anything that isn't. >>>>> Yeah, that is possible, on the other thought wouldn't it be good to >>>>> provide control to the user by providing two different commands, e.g. >>>>> COPY DATABASE for the existing method (copydir) and CREATE DATABASE >>>>> for the new method (fully wal logged)? >>>> This proposal seems to have gotten lost. >>> Yeah, I am planning to work on this part so that we can support both methods. >> But can we pick a different syntax? In my view this should be an >> option to CREATE DATABASE rather than a whole new command. > Maybe we can provide something like > > CREATE DATABASE..WITH WAL_LOG=true/false ? OR > CREATE DATABASE..WITH WAL_LOG_DATA_PAGE=true/false ? OR > CREATE DATABASE..WITH CHECKPOINT=true/false ? > > And then we can explain in the documentation about these options? I think > the default should be the new method? > > The last one at least has the advantage that it doesn't invent yet another keyword. I can live with the new method being the default. I'm sure it would be highlighted in the release notes too. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Fri, Feb 11, 2022 at 12:11 PM Andrew Dunstan <andrew@dunslane.net> wrote: > The last one at least has the advantage that it doesn't invent yet > another keyword. We don't need a new keyword for this as long as it lexes as one token, because createdb_opt_name accepts IDENT. So I think we should focus on trying to come up with something that is as clear as we know how to make it. What I find difficult about doing that is that this is all a bunch of technical details that users may have difficulty understanding. If we say WAL_LOG or WAL_LOG_DATA, a reasonably but not incredibly well-informed user will assume that skipping WAL is not really an option. If we say CHECKPOINT, a reasonably but not incredibly well-informed user will presume they don't want one (I think). CHECKPOINT also seems like it's naming the switch by the unwanted side effect, which doesn't seem too flattering to the existing method. How about something like LOG_AS_CLONE? That makes it clear, I hope, that we're logging it a different way, but that method of logging it is different in each case. You'd still have to read the documentation to find out what it really means, but at least it seems like it points you more in the right direction. To me, anyway. > I can live with the new method being the default. I'm sure it would be > highlighted in the release notes too. That would make sense. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 11, 2022 at 12:35:50PM -0500, Robert Haas wrote: > How about something like LOG_AS_CLONE? That makes it clear, I hope, > that we're logging it a different way, but that method of logging it > is different in each case. You'd still have to read the documentation > to find out what it really means, but at least it seems like it points > you more in the right direction. To me, anyway. I think CLONE would be confusing since we don't use that term often, maybe LOG_DB_COPY or LOG_FILE_COPY? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Fri, Feb 11, 2022 at 12:50 PM Bruce Momjian <bruce@momjian.us> wrote: > On Fri, Feb 11, 2022 at 12:35:50PM -0500, Robert Haas wrote: > > How about something like LOG_AS_CLONE? That makes it clear, I hope, > > that we're logging it a different way, but that method of logging it > > is different in each case. You'd still have to read the documentation > > to find out what it really means, but at least it seems like it points > > you more in the right direction. To me, anyway. > > I think CLONE would be confusing since we don't use that term often, > maybe LOG_DB_COPY or LOG_FILE_COPY? Yeah, maybe. But it's not clear to me with that kind of naming whether TRUE or FALSE would be the existing behavior? One version logs a single record for the whole database, and the other logs a record per database block. Neither version logs per file. LOG_COPIED_BLOCKS, maybe? -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 11, 2022 at 01:18:58PM -0500, Robert Haas wrote: > On Fri, Feb 11, 2022 at 12:50 PM Bruce Momjian <bruce@momjian.us> wrote: > > On Fri, Feb 11, 2022 at 12:35:50PM -0500, Robert Haas wrote: > > > How about something like LOG_AS_CLONE? That makes it clear, I hope, > > > that we're logging it a different way, but that method of logging it > > > is different in each case. You'd still have to read the documentation > > > to find out what it really means, but at least it seems like it points > > > you more in the right direction. To me, anyway. > > > > I think CLONE would be confusing since we don't use that term often, > > maybe LOG_DB_COPY or LOG_FILE_COPY? > > Yeah, maybe. But it's not clear to me with that kind of naming whether > TRUE or FALSE would be the existing behavior? One version logs a > single record for the whole database, and the other logs a record per > database block. Neither version logs per file. LOG_COPIED_BLOCKS, > maybe? Yes, I like BLOCKS more than FILE. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On 2/11/22 13:32, Bruce Momjian wrote: > On Fri, Feb 11, 2022 at 01:18:58PM -0500, Robert Haas wrote: >> On Fri, Feb 11, 2022 at 12:50 PM Bruce Momjian <bruce@momjian.us> wrote: >>> On Fri, Feb 11, 2022 at 12:35:50PM -0500, Robert Haas wrote: >>>> How about something like LOG_AS_CLONE? That makes it clear, I hope, >>>> that we're logging it a different way, but that method of logging it >>>> is different in each case. You'd still have to read the documentation >>>> to find out what it really means, but at least it seems like it points >>>> you more in the right direction. To me, anyway. >>> I think CLONE would be confusing since we don't use that term often, >>> maybe LOG_DB_COPY or LOG_FILE_COPY? >> Yeah, maybe. But it's not clear to me with that kind of naming whether >> TRUE or FALSE would be the existing behavior? One version logs a >> single record for the whole database, and the other logs a record per >> database block. Neither version logs per file. LOG_COPIED_BLOCKS, >> maybe? > Yes, I like BLOCKS more than FILE. I'm not really sure any single parameter name is going to capture the subtlety involved here. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Fri, Feb 11, 2022 at 3:40 PM Andrew Dunstan <andrew@dunslane.net> wrote: > I'm not really sure any single parameter name is going to capture the > subtlety involved here. I mean to some extent that's inevitable, but it's not a reason not to do the best we can. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Feb 11, 2022 at 1:32 PM Bruce Momjian <bruce@momjian.us> wrote: > > Yeah, maybe. But it's not clear to me with that kind of naming whether > > TRUE or FALSE would be the existing behavior? One version logs a > > single record for the whole database, and the other logs a record per > > database block. Neither version logs per file. LOG_COPIED_BLOCKS, > > maybe? > > Yes, I like BLOCKS more than FILE. Cool. -- Robert Haas EDB: http://www.enterprisedb.com
On 2022-Feb-11, Robert Haas wrote: > What I find difficult about doing that is that this is all a bunch of > technical details that users may have difficulty understanding. If we > say WAL_LOG or WAL_LOG_DATA, a reasonably but not incredibly > well-informed user will assume that skipping WAL is not really an > option. If we say CHECKPOINT, a reasonably but not incredibly > well-informed user will presume they don't want one (I think). > CHECKPOINT also seems like it's naming the switch by the unwanted side > effect, which doesn't seem too flattering to the existing method. It seems you're thinking of deciding what to do based on an option that gets a boolean argument. But what about making the argument be an enum? For example CREATE DATABASE ... WITH (STRATEGY = LOG); -- default if option is omitted CREATE DATABASE ... WITH (STRATEGY = CHECKPOINT); So the user has to think about it in terms of some strategy to choose, rather than enabling or disabling some flag with nontrivial implications. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "[PostgreSQL] is a great group; in my opinion it is THE best open source development communities in existence anywhere." (Lamar Owen)
On 2/11/22 15:47, Robert Haas wrote: > On Fri, Feb 11, 2022 at 3:40 PM Andrew Dunstan <andrew@dunslane.net> wrote: >> I'm not really sure any single parameter name is going to capture the >> subtlety involved here. > I mean to some extent that's inevitable, but it's not a reason not to > do the best we can. True. I do think we should be wary of any name starting with "LOG", though. Long experience tells us that's something that confuses users when it refers to the WAL. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
On Fri, Feb 11, 2022 at 4:08 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > It seems you're thinking deciding what to do based on an option that > gets a boolean argument. But what about making the argument be an enum? > For example > > CREATE DATABASE ... WITH (STRATEGY = LOG); -- default if option is omitted > CREATE DATABASE ... WITH (STRATEGY = CHECKPOINT); > > So the user has to think about it in terms of some strategy to choose, > rather than enabling or disabling some flag with nontrivial > implications. I don't like those particular strategy names very much, but in general I think that could be a way to go, too. I somewhat hope we never end up with THREE strategies for creating a new database, but now that I think about it, we might. Somebody might want to use a fancy FS primitive that clones a directory at the FS level, or something. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2022-02-11 16:19:12 -0500, Robert Haas wrote: > I somewhat hope we never end up with THREE strategies for creating a new > database, but now that I think about it, we might. Somebody might want to > use a fancy FS primitive that clones a directory at the FS level, or > something. I think that'd be a great, and pretty easy to implement, feature. But it seems like it'd be mostly orthogonal to the "WAL log data" vs "checkpoint data" question? On the primary / single node system using "WAL log data" with "COW file copy" would work well. I bet using COW file copies would speed up our own regression tests noticeably - on slower systems we spend a fair bit of time and space creating template0 and postgres, with the bulk of the data never changing. Template databases are also fairly commonly used by application developers to avoid the cost of rerunning all the setup DDL & initial data loading for different tests. Making that measurably cheaper would be a significant win. Greetings, Andres Freund
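To make the file-system-primitive idea Andres mentions concrete, here is a hedged sketch of what a clone step could look like on Linux using the FICLONE ioctl (the reflink mechanism that Btrfs and XFS expose). The function name and the fallback convention are assumptions for illustration, not anything from the thread's patches:

    /*
     * Hypothetical COW clone of one file.  FICLONE shares extents in
     * O(1) on filesystems that support reflink; on others it fails
     * with EOPNOTSUPP (or EXDEV across filesystems), and the caller
     * would fall back to an ordinary block-by-block copy.
     */
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>       /* FICLONE */
    #include <stdbool.h>

    static bool
    clone_file_cow(const char *src, const char *dst)
    {
        int     srcfd = open(src, O_RDONLY);
        int     dstfd = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600);
        bool    ok = false;

        if (srcfd >= 0 && dstfd >= 0)
            ok = (ioctl(dstfd, FICLONE, srcfd) == 0);   /* share extents */

        if (srcfd >= 0)
            close(srcfd);
        if (dstfd >= 0)
            close(dstfd);
        return ok;              /* false => caller copies the long way */
    }

Note that, as Andres says, this would be orthogonal to the WAL-log-vs-checkpoint question: the cloned files would still need the same crash-safety treatment as a plain file copy.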
On Sat, Feb 12, 2022 at 2:38 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2022-Feb-11, Robert Haas wrote: > > > What I find difficult about doing that is that this is all a bunch of > > technical details that users may have difficulty understanding. If we > > say WAL_LOG or WAL_LOG_DATA, a reasonably but not incredibly > > well-informed user will assume that skipping WAL is not really an > > option. If we say CHECKPOINT, a reasonably but not incredibly > > well-informed user will presume they don't want one (I think). > > CHECKPOINT also seems like it's naming the switch by the unwanted side > > effect, which doesn't seem too flattering to the existing method. > > It seems you're thinking of deciding what to do based on an option that > gets a boolean argument. But what about making the argument be an enum? > For example > > CREATE DATABASE ... WITH (STRATEGY = LOG); -- default if option is omitted > CREATE DATABASE ... WITH (STRATEGY = CHECKPOINT); > > So the user has to think about it in terms of some strategy to choose, > rather than enabling or disabling some flag with nontrivial > implications. Yeah I think being explicit about giving the strategy to the user looks like a better option. Now they can choose whether they want it to create using WAL log or using CHECKPOINT. Otherwise, if we give a flag then we will have to give an explanation that if they choose not to WAL log then we will have to do a checkpoint internally. So I think giving LOG vs CHECKPOINT as an explicit option looks better to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sun, Feb 13, 2022 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I have done performance testing with different template DB sizes and different amounts of dirty shared buffers and I think as expected the bigger the dirty shared buffer the checkpoint approach becomes costly and OTOH the larger the template DB size the WAL log approach takes more time. I think it is very common to have larger shared buffers and of course, if somebody has configured such a large shared buffer then a good % of it will be dirty most of the time. So IMHO in the future, the WAL log approach is going to be more usable in general. However, this is just my opinion, and others may have completely different thoughts and anyhow we are keeping options for both the approaches so no worry. Next, I am planning to do some more tests, where we are having pgbench running and concurrently we do CREATEDB maybe every 1 minute and see what is the CREATEDB time as well as what is the impact on pgbench performance. Because currently I have only measured CREATEDB time but we must be knowing the impact of createdb on the other system as well.

Test setup:
max_wal_size=64GB
checkpoint_timeout=15min
- CREATE base TABLE of size of Shared Buffers
- CREATE template database and table in it of varying sizes (as per test)
- CHECKPOINT (write out dirty buffers)
- UPDATE 70% of tuple in base table (dirty 70% of shared buffers)
- CREATE database using template db. (Actual test target)

test1:
1 GB shared buffers, template DB size = 6MB, dirty shared buffer=70%
Head: 2341.665 ms
Patch: 85.229 ms

test2:
1 GB shared buffers, template DB size = 1GB, dirty shared buffer=70%
Head: 4044 ms
Patch: 8376 ms

test3:
8 GB shared buffers, template DB size = 1GB, dirty shared buffer=70%
Head: 21398 ms
Patch: 9834 ms

test4:
8 GB shared buffers, template DB size = 10GB, dirty shared buffer=95%
Head: 38574 ms
Patch: 77160 ms

test4:
32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70%
Head: 47656 ms
Patch: 79767 ms

test5:
64 GB shared buffers, template DB size = 1GB, dirty shared buffer=70%
Head: 59151 ms
Patch: 8742 ms

test6:
64 GB shared buffers, template DB size = 50GB, dirty shared buffer=50%
Head: 171614 ms
Patch: 406040 ms

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sun, Feb 13, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > test4: > 32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70% > Head: 47656 ms > Patch: 79767 ms This seems like the most surprising result of the bunch. Here, the template DB is both small enough to fit in shared_buffers and small enough not to trigger a checkpoint all by itself, and yet the patch loses. Did you checkpoint between one test and the next, or might this test have been done after a bunch of WAL had already been written since the last checkpoint so that the 10GB pushed it over the edge? BTW, you have test4 twice in your list of results. -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, Feb 13, 2022 at 9:56 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Sun, Feb 13, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > test4: > > 32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70% > > Head: 47656 ms > > Patch: 79767 ms > > This seems like the most surprising result of the bunch. Here, the > template DB is both small enough to fit in shared_buffers and small > enough not to trigger a checkpoint all by itself, and yet the patch > loses. Well this is not really surprising to me because what I have noticed is that with the new approach the createdb time is completely dependent upon the template db size. So if the source db size is 10GB it is taking around 80sec and the shared buffers size does not have a major impact. Maybe a very small shared buffer can have more impact so I will test that as well. > > Did you checkpoint between one test and the next, or might this test > have been done after a bunch of WAL had already been written since the > last checkpoint so that the 10GB pushed it over the edge? Not really, I am running each test with a new initdb so that could not be an issue. > BTW, you have test4 twice in your list of results. My bad, those are different tests. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sun, Feb 13, 2022 at 12:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sun, Feb 13, 2022 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Next, I am planning to do some more tests, where we are having pgbench > running and concurrently we do CREATEDB maybe every 1 minute and see > what is the CREATEDB time as well as what is the impact on pgbench > performance. Because currently I have only measured CREATEDB time but > we must be knowing the impact of createdb on the other system as well. I have done tests with the pgbench as well. So basically I did not notice any significant difference in the TPS; I was expecting there should be some difference due to the checkpoint on the head, so maybe I need to test with more backends. And in createdb time there is a huge difference. I think this is because the template1 db is very small so the patch is getting completed in no time whereas head is taking a huge time because of high dirty shared buffers (due to concurrent pgbench).

config:
echo "logging_collector=on" >> data/postgresql.conf
echo "port = 5432" >> data/postgresql.conf
echo "max_wal_size=64GB" >> data/postgresql.conf
echo "checkpoint_timeout=15min" >> data/postgresql.conf
echo "shared_buffers=32GB" >> data/postgresql.conf

Test:
./pgbench -i -s 1000 postgres
./pgbench -c 32 -j 32 -T 1200 -M prepared postgres >> result.txt
-- Concurrently run below script every 1 mins
CREATE DATABASE mydb log_copied_blocks=true/false;

Results:
- Pgbench TPS: Did not observe any difference head vs patch
- Create db time (very small template):
head: 21000 ms to 42000 ms (at different times)
patch: 80 ms

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 14, 2022 at 10:31 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sun, Feb 13, 2022 at 9:56 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Sun, Feb 13, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > test4: > > > 32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70% > > > Head: 47656 ms > > > Patch: 79767 ms > > > > This seems like the most surprising result of the bunch. Here, the > > template DB is both small enough to fit in shared_buffers and small > > enough not to trigger a checkpoint all by itself, and yet the patch > > loses. > > Well this is not really surprising to me because what I have noticed > is that with the new approach the createdb time is completely > dependent upon the template db size. So if the source db size is 10GB > it is taking around 80sec and the shared buffers size does not have a > major impact. Maybe a very small shared buffer can have more impact > so I will test that as well. I have done some more experiments just to understand where we are spending most of the time. First I tried with synchronous commit and fsync off and the creation time dropped from 80s to 70s; then I just removed the log_newpage and the time further dropped to 50s. I have also tried with different shared buffer sizes and observed that reducing or increasing the shared buffer size does not have much impact on the createdb time with the new approach. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 14, 2022 at 12:01 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Well this is not really surprising to me because what I have noticed > is that with the new approach the createdb time is completely > dependent upon the template db size. So if the source db size is 10GB > it is taking around 80sec and the shared buffers size does not have a > major impact. Maybe a very small shared buffer can have more impact > so I will test that as well. OK. Well, then this approach is somewhat worse than I expected for moderately large template databases. But it seems very good for small template databases, especially when there is other work in progress on the system. -- Robert Haas EDB: http://www.enterprisedb.com
Hi Dilip, On Sun, Feb 13, 2022 at 12:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sun, Feb 13, 2022 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have done performance testing with different template DB sizes and > different amounts of dirty shared buffers and I think as expected the > bigger the dirty shared buffer the checkpoint approach becomes costly > and OTOH the larger the template DB size the WAL log approach takes > more time. > > I think it is very common to have larger shared buffers and of course, > if somebody has configured such a large shared buffer then a good % of > it will be dirty most of the time. So IMHO in the future, the WAL log > approach is going to be more usable in general. However, this is just > my opinion, and others may have completely different thoughts and > anyhow we are keeping options for both the approaches so no worry. > > Next, I am planning to do some more tests, where we are having pgbench > running and concurrently we do CREATEDB maybe every 1 minute and see > what is the CREATEDB time as well as what is the impact on pgbench > performance. Because currently I have only measured CREATEDB time but > we must be knowing the impact of createdb on the other system as well. > > Test setup: > max_wal_size=64GB > checkpoint_timeout=15min > - CREATE base TABLE of size of Shared Buffers > - CREATE template database and table in it of varying sizes (as per test) > - CHECKPOINT (write out dirty buffers) > - UPDATE 70% of tuple in base table (dirty 70% of shared buffers) > - CREATE database using template db. (Actual test target) > > test1: > 1 GB shared buffers, template DB size = 6MB, dirty shared buffer=70% > Head: 2341.665 ms > Patch: 85.229 ms > > test2: > 1 GB shared buffers, template DB size = 1GB, dirty shared buffer=70% > Head: 4044 ms > Patch: 8376 ms > > test3: > 8 GB shared buffers, template DB size = 1GB, dirty shared buffer=70% > Head: 21398 ms > Patch: 9834 ms > > test4: > 8 GB shared buffers, template DB size = 10GB, dirty shared buffer=95% > Head: 38574 ms > Patch: 77160 ms > > test4: > 32 GB shared buffers, template DB size = 10GB, dirty shared buffer=70% > Head: 47656 ms > Patch: 79767 ms > Is it possible to see the WAL size generated by these two statements: UPDATE 70% of the tuple in the base table (dirty 70% of the shared buffers) && CREATE database using template DB (Actual test target). Just wanted to know if it can exceed the max_wal_size of 64GB. Also, is it possible to try with minimal wal_level? Sorry for asking you this, I could try it myself but I don't have any high-end system to try it. -- With Regards, Ashutosh Sharma.
On Mon, Feb 14, 2022 at 9:17 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > Is it possible to see the WAL size generated by these two statements: > UPDATE 70% of the tuple in the base table (dirty 70% of the shared > buffers) && CREATE database using template DB (Actual test target). > Just wanted to know if it can exceed the max_wal_size of 64GB. I think we already know the wal size generated by creating a db with the old and new approaches. With the old approach it is just one WAL log and with the new approach it is going to log every single block of the database. Yeah, updating 70% of the database could have some impact, but for verification purposes I tested without the update and the create db with WAL log still takes almost the same time. But anyway, when I test next time I will verify again that no forced checkpoint is triggered. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sun, Feb 13, 2022 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Sat, Feb 12, 2022 at 2:38 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > It seems you're thinking of deciding what to do based on an option that > > gets a boolean argument. But what about making the argument be an enum? > > For example > > > > CREATE DATABASE ... WITH (STRATEGY = LOG); -- default if option is omitted > > CREATE DATABASE ... WITH (STRATEGY = CHECKPOINT); > > > > So the user has to think about it in terms of some strategy to choose, > > rather than enabling or disabling some flag with nontrivial > > implications. > > > Yeah I think being explicit about giving the strategy to the user > looks like a better option. Now they can choose whether they want it > to create using WAL log or using CHECKPOINT. Otherwise, if we give a > flag then we will have to give an explanation that if they choose not > to WAL log then we will have to do a checkpoint internally. So I > think giving LOG vs CHECKPOINT as an explicit option looks better to > me. So do we have consensus to use (STRATEGY = LOG/CHECKPOINT), or do we think that keeping it a bool, i.e. LOG_COPIED_BLOCKS, is a better option? Once we have consensus on this I will make this change and documentation as well along with the other changes suggested by Robert. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 14, 2022 at 11:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > So do we have consensus to use (STRATEGY = LOG/CHECKPOINT), or do we > think that keeping it a bool, i.e. LOG_COPIED_BLOCKS, is a better option? > Once we have consensus on this I will make this change and > documentation as well along with the other changes suggested by > Robert. I think we have consensus on STRATEGY. I'm not sure if we have consensus on what the option values should be. If we had an option to use fs-based cloning, that would also need to issue a checkpoint, which makes me think that CHECKPOINT is not the best name. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Feb 14, 2022 at 12:27:10PM -0500, Robert Haas wrote: > On Mon, Feb 14, 2022 at 11:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > So do we have consensus to use (STRATEGY = LOG/CHECKPOINT), or do we > > think that keeping it a bool, i.e. LOG_COPIED_BLOCKS, is a better option? > > Once we have consensus on this I will make this change and > > documentation as well along with the other changes suggested by > > Robert. > > I think we have consensus on STRATEGY. I'm not sure if we have > consensus on what the option values should be. If we had an option to > use fs-based cloning, that would also need to issue a checkpoint, > which makes me think that CHECKPOINT is not the best name. I think if we want LOG, it has to be WAL_LOG instead of just LOG. Was there discussion that the user _has_ to specify an option instead of using a default? That doesn't seem good. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Mon, Feb 14, 2022 at 1:58 PM Bruce Momjian <bruce@momjian.us> wrote: > > I think we have consensus on STRATEGY. I'm not sure if we have > > consensus on what the option values should be. If we had an option to > > use fs-based cloning, that would also need to issue a checkpoint, > > which makes me think that CHECKPOINT is not the best name. > > I think if we want LOG, it has to be WAL_LOG instead of just LOG. Was > there discussion that the user _has_ to specify an option instead of > using a default? That doesn't seem good. I agree. I think we can set a default, which can be either whatever we think will be best on average, or maybe it can be conditional based on the database size or something. -- Robert Haas EDB: http://www.enterprisedb.com
Andrew made a good case above for avoiding LOG:
>I do think we should be wary of any name starting with "LOG", though.
>Long experience tells us that's something that confuses users when it
>refers to the WAL.
On Tue, Feb 15, 2022 at 2:01 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote: > Here is the updated version of the patch, the changes are 1) Fixed review comments given by Robert and one open comment from Ashutosh. 2) Preserved the old create db method. 3) As agreed upthread for now we are using the new strategy only for createdb not for movedb so I have removed the changes in ForgetDatabaseSyncRequests() and DropDatabaseBuffers(). 4) Provided a database creation strategy option as of now I have kept it as below. CREATE DATABASE ... WITH (STRATEGY = WAL_LOG); -- default if option is omitted CREATE DATABASE ... WITH (STRATEGY = FILE_COPY); I have updated the document but I was not sure how much internal information to be exposed to the user so I will work on that based on feedback from others. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 15, 2022 at 6:49 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Here is the updated version of the patch, the changes are 1) Fixed > review comments given by Robert and one open comment from Ashutosh. > 2) Preserved the old create db method. 3) As agreed upthread for now > we are using the new strategy only for createdb not for movedb so I > have removed the changes in ForgetDatabaseSyncRequests() and > DropDatabaseBuffers(). 4) Provided a database creation strategy > option as of now I have kept it as below. > > CREATE DATABASE ... WITH (STRATEGY = WAL_LOG); -- default if > option is omitted > CREATE DATABASE ... WITH (STRATEGY = FILE_COPY); All right. I think there have been two design-level objections to this patch, and this resolves one of them. The other one is trickier, because AFAICT it's basically an opinion question: is accessing pg_class in the template database from some backend that is connected to another database too ugly to be acceptable? Several people have expressed concerns about that, but it's not clear to me whether they are essentially saying "that is not what I would do if I were doing this project" or more like "if you commit something that does it that way I will be enraged and demand an immediate revert and the removal of your commit bit." If it's the former, I think it's possible to clean up various details of these patches to make them look nicer than they do at present and get something committed for PostgreSQL 15. But if it is the latter then there's really no point to that kind of cleanup work and we should probably just give up now. So, Andres, Heikki, and anybody else with a strong opinion, can you clarify how vigorously you hate this design, or don't? My own opinion is that this is actually rather elegant. It just makes sense to me that the right way to figure out what relations are in a database is to get that list from the database rather than from the filesystem. Nobody would accept the idea of making \d work by listing out the directory contents rather than by walking pg_class, and so the only reason we ought to accept that in the case of cloning a database is if the code is too ugly any other way. But is it really? It's true that this patch set does some refactoring of interfaces in order to make that work, and there are a few things about how it does that that I think could be improved, but on the whole, it seems like a remarkably small amount of code to do something that we've long considered absolutely taboo. Now, it's nowhere close to being something that could be used to allow fully general cross-database access, and there are severe problems with the idea of allowing any such thing. In particular, there are various places that test for connections to a database, and aren't going to be expecting processes not connected to the database to be touching it. My belief is that a heavyweight lock on the database is a suitable surrogate, because we actually take such a lock when connecting to a database, and that forms part of the regular interlock. Taking such locks routinely for short periods would be expensive and might create other problems, but doing it for a maintenance operation seems OK. Also, if we wanted to actually support full cross-database access, locking wouldn't be the only problem by far. We'd have to deal with things like the relcache and the catcache, which would be hard, and might increase the cost of very common things that we need to be cheap. 
But none of that is implicated in this patch, which only generalizes code paths that are not so commonly taken as to pose a problem, and manages to reuse quite a bit of code rather than introducing entirely new code to do the same things. It does introduce some new code here and there, though: there isn't zero duplication. The biggest chunk of that FWICS is in 0006, in GetDatabaseRelationList and GetRelListFromPage. I just can't get excited about that. It's literally about two screens' worth of code. We're not talking about duplicating src/backend/access/heapam or something like that. I do think it would be a good idea to split it up just a bit more: I think the code inside GetRelListFromPage that is guarded by HeapTupleSatisfiesVisibility() could be moved into a separate subroutine, and I think that would likely look a bit nicer. But fundamentally I just don't see a huge issue here. That is not to say there isn't a huge issue here: just that I don't see it. Comments? -- Robert Haas EDB: http://www.enterprisedb.com
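For concreteness, a hedged sketch of the shape such a per-page helper might take: read one pg_class block of the source database without any relcache entry and collect the relfilenode of every visible tuple. The function name, the List-of-OIDs representation, and the pre-patch five-argument ReadBufferWithoutRelcache() signature are assumptions for illustration, not the patch's actual code:

    /*
     * Visibility checks are safe cross-database because CLOG is
     * cluster-wide, and decoding the tuples is safe because pg_class
     * has the same column layout in every database.
     */
    #include "postgres.h"
    #include "access/heapam.h"
    #include "access/htup_details.h"
    #include "catalog/pg_class.h"
    #include "nodes/pg_list.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "storage/relfilenode.h"
    #include "utils/snapshot.h"

    static void
    collect_visible_relfilenodes(RelFileNode srcrnode, BlockNumber blkno,
                                 Snapshot snapshot, List **rnodelist)
    {
        Buffer          buf;
        Page            page;
        OffsetNumber    offnum,
                        maxoff;

        buf = ReadBufferWithoutRelcache(srcrnode, MAIN_FORKNUM, blkno,
                                        RBM_NORMAL, NULL);
        LockBuffer(buf, BUFFER_LOCK_SHARE);
        page = BufferGetPage(buf);

        maxoff = PageGetMaxOffsetNumber(page);
        for (offnum = FirstOffsetNumber; offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            ItemId          itemid = PageGetItemId(page, offnum);
            HeapTupleData   tuple;

            if (!ItemIdIsNormal(itemid))
                continue;

            tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
            tuple.t_len = ItemIdGetLength(itemid);
            tuple.t_tableOid = RelationRelationId;
            ItemPointerSet(&tuple.t_self, blkno, offnum);

            if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buf))
            {
                Form_pg_class classForm = (Form_pg_class) GETSTRUCT(&tuple);

                /* mapped or storage-less relations have no relfilenode */
                if (OidIsValid(classForm->relfilenode))
                    *rnodelist = lappend_oid(*rnodelist,
                                             classForm->relfilenode);
            }
        }

        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        ReleaseBuffer(buf);
    }

Note that this reads the page but never modifies it, which is what keeps it clear of HOT pruning and the xid-horizon concerns discussed below.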
Hi, On 2022-02-17 14:27:09 -0500, Robert Haas wrote: > The other one is trickier, because AFAICT it's basically an opinion > question: is accessing pg_class in the template database from some backend > that is connected to another database too ugly to be acceptable? Several > people have expressed concerns about that, but it's not clear to me whether > they are essentially saying "that is not what I would do if I were doing > this project" or more like "if you commit something that does it that way I > will be enraged and demand an immediate revert and the removal of your > commit bit." If it's the former, I think it's possible to clean up various > details of these patches to make them look nicer than they do at present and > get something committed for PostgreSQL 15. Could you or Dilip outline how it now works, and what exactly makes it safe etc (e.g. around locking, invalidation processing, snapshots, xid horizons)? I just scrolled through the patchset without finding such an explanation, so it's a bit hard to judge. > But if it is the latter then there's really no point to that kind of cleanup > work and we should probably just give up now. This thread is long. Could you summarize what led you to consider other approaches (e.g. looking in the filesystem for relfilenodes) as not feasible / too ugly / ...? Greetings, Andres Freund
On Thu, Feb 17, 2022 at 4:13 PM Andres Freund <andres@anarazel.de> wrote: > Could you or Dilip outline how it now works, and what exactly makes it safe > etc (e.g. around locking, invalidation processing, snapshots, xid horizons)? > > I just scrolled through the patchset without finding such an explanation, so > it's a bit hard to judge. That's a good question and it's making me think about a few things I hadn't considered before. Dilip can add more here, but my impression is that most problems are prevented because CREATE DATABASE, with or without this patch, starts by acquiring a ShareLock on the database, preventing new connections, and then making sure there are no existing connections. That means nothing in the template database can be changing, which I think makes a lot of the stuff on your list a non-issue. Any problems that remain have to be the result of something that CREATE DATABASE does having a bad interaction either with something that is completed beforehand or something that begins afterward. There just can't be overlap, and I think that rules out most problems. Now you pointed out earlier one problem that it doesn't fix: unlike the current method, this method involves reading buffers from the template database into shared_buffers, and those buffers, once read, stick around even after the operation finishes. That's not an intrinsic problem, though, because a connection to the database could do the same thing. However, again as you pointed out, it is a problem if we do it with less locking than a real database connection would have done. It seems to me that if there are other problems here, they have to be basically of the same sort: they have to leave the system in some state which is otherwise impossible. Do you see some other kind of hazard, or more examples of how that could happen? I think the leftover buffers in shared_buffers have to be basically the only thing, because apart from that, how is this any different than a file copy? The only other kind of hazard I can think of is: could it be unsafe to try to interpret the contents of a database to which no one else is connected at present due to any of the issues you mention? But what's the hazard exactly? It can't be a problem if we've failed to process sinval messages for the target database, because we're not using any caches anyway. We can't. It can't be unsafe to test visibility of XIDs for that database, because in an alternate universe some backend could have connected to that database and seen the same XIDs. One thing we COULD be doing wrong is using the wrong snapshot to test the visibility of XIDs. The patch uses GetActiveSnapshot(), and I'm thinking that is probably wrong. Shouldn't it be GetLatestSnapshot()? And do we need to worry about snapshots being database-specific? Maybe so. > > But if it is the latter then there's really no point to that kind of cleanup > > work and we should probably just give up now. > > This thread is long. Could you summarize what led you to consider other > approaches (e.g. looking in the filesystem for relfilenodes) as not feasible / > too ugly / ...? I don't think it's infeasible to look at the filesystem for files and just copy whatever files we find. It's a plausible alternate design. I just don't like it as well. I think that relying on the filesystem contents to tell us what's going on is kind of hacky. The only technical issue I see there is that the WAL logging might require more kludgery, since that mechanism is kind of intertwined with shared_buffers. 
You'd have to get the right block references into the WAL record, and you have to make sure that checkpoints don't move the redo pointer at an inopportune moment. -- Robert Haas EDB: http://www.enterprisedb.com
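For reference, the per-block step in the buffer-based approach already has a clean shape using existing facilities; a hedged sketch (the function name and the assumption that the caller has read the page through the buffer manager are mine):

    #include "postgres.h"
    #include "access/xloginsert.h"
    #include "storage/bufpage.h"
    #include "storage/smgr.h"

    static void
    copy_one_block(SMgrRelation dst, ForkNumber forknum,
                   BlockNumber blkno, Page page)
    {
        /*
         * WAL-log a full-page image first; log_newpage() also stamps the
         * page with the new record's LSN, so the data page can never look
         * newer than the WAL that recreates it.  'true' = standard layout.
         */
        log_newpage(&dst->smgr_rnode.node, forknum, blkno, page, true);

        /* recompute the checksum after the LSN change, then write it out */
        PageSetChecksumInplace(page, blkno);
        smgrextend(dst, forknum, blkno, (char *) page, false);
    }

The block references come for free here because the WAL record is built from the destination relfilenode and block number, which is exactly the bookkeeping that becomes kludgier when copying raw files outside the buffer manager.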
Hi, On 2022-02-17 18:00:19 -0500, Robert Haas wrote: > Now you pointed out earlier one problem that it doesn't fix: unlike > the current method, this method involves reading buffers from the > template database into shared_buffers, and those buffers, once read, > stick around even after the operation finishes. Yea, I don't see a problem with that. A concurrent DROP DATABASE or such would be problematic, but the locking prevents that. > The only other kind of hazard I can think of is: could it be unsafe to > try to interpret the contents of a database to which no one else is > connected at present due to any of the issues you mention? But what's > the hazard exactly? I don't quite know. But I don't think it's impossible to run into trouble in this area. E.g. xid horizons are computed in a database specific way. If the routine reading pg_class did hot pruning, you could end up removing data that's actually needed for a logical slot in the other database because the backend local horizon state was computed for the "local" database? Could there be problems because other backends wouldn't see the backend accessing the other database as being connected to that database (PGPROC->databaseId)? Or what if somebody optimized snapshots to disregard readonly transactions in other databases? > It can't be a problem if we've failed to process sinval messages for the > target database, because we're not using any caches anyway. Could you end up with an outdated relmap entry? Probably not. > We can't. It can't be unsafe to test visibility of XIDs for that database, > because in an alternate universe some backend could have connected to that > database and seen the same XIDs. That's a weak argument, because in that alternative universe a PGPROC entry with the PGPROC->databaseId = template_databases_oid would exist. Greetings, Andres Freund
On Fri, Feb 18, 2022 at 4:30 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > This thread is long. Could you summarize what led you to consider other > > approaches (e.g. looking in the filesystem for relfilenodes) as not feasible / > > too ugly / ...? > > I don't think it's infeasible to look at the filesystem for files and > just copy whatever files we find. It's a plausible alternate design. I > just don't like it as well. I think that relying on the filesystem > contents to tell us what's going on is kind of hacky. The only > technical issue I see there is that the WAL logging might require more > kludgery, since that mechanism is kind of intertwined with > shared_buffers. You'd have to get the right block references into the > WAL record, and you have to make sure that checkpoints don't move the > redo pointer at an inopportune moment. Actually based on the previous discussion, I also tried to write the POC with the file system scanning approach to identify the relations to be copied, see patch 0007 in this thread [1]. And later we identified one issue [2], i.e. while directly scanning the disk files we will only know the relfilenode but we cannot identify the relation oid, which means we cannot lock the relation. Now, I am not saying that there is no way to work around that issue but that was also one of the reasons for not pursuing that approach. [1] https://www.postgresql.org/message-id/CAFiTN-v1KYsVAhq_fOWFa27LZiw9uK4n4cz5XmQJxJpsVcfq1w%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAFiTN-v%3DU58by_BeiZruNhykxk1q9XUxF%2BqLzD2LZAsEn2EBkg%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
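A minimal sketch of the limitation Dilip describes, using the server's directory-scanning primitives (the function name is illustrative): a directory entry yields only a relfilenode, never the pg_class OID that LockRelationOid() would need.

    #include "postgres.h"
    #include "storage/fd.h"
    #include <dirent.h>

    static void
    scan_source_db_dir(const char *srcpath)
    {
        DIR        *dirdesc = AllocateDir(srcpath);
        struct dirent *de;

        while ((de = ReadDir(dirdesc, srcpath)) != NULL)
        {
            Oid     relfilenode;

            /* skip ".", "..", and anything not starting with digits */
            if (strspn(de->d_name, "0123456789") == 0)
                continue;

            /* "16384" is a main fork; "16384_fsm" etc. are other forks */
            relfilenode = atooid(de->d_name);

            /*
             * We now know a file that would need copying, but not which
             * pg_class OID it belongs to, so no heavyweight lock can be
             * taken on the source relation from here.
             */
            (void) relfilenode;
        }
        FreeDir(dirdesc);
    }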
On Fri, Feb 18, 2022 at 5:09 AM Andres Freund <andres@anarazel.de> wrote: Thanks a lot, Andres, for taking the time to read the thread and the patch. > > The only other kind of hazard I can think of is: could it be unsafe to > > try to interpret the contents of a database to which no one else is > > connected at present due to any of the issues you mention? But what's > > the hazard exactly? > > I don't quite know. But I don't think it's impossible to run into trouble in > this area. E.g. xid horizons are computed in a database specific way. If the > routine reading pg_class did hot pruning, you could end up removing data > that's actually needed for a logical slot in the other database because the > backend local horizon state was computed for the "local" database? I agree that while computing the xid horizon (ComputeXidHorizons()), we only consider the backends which are connected to the same database to which we are connected. But we don't need to worry here, because we know that there can be absolutely no backends connected to the database we are trying to copy, so we don't need to worry about pruning tuples that are visible to other backends. Now, if we are worried about replication slots, we also consider the xmin horizon from the replication slots, so I don't think we have any problem here either. And we also consider walsenders when computing the xid horizon. > Could there be problems because other backends wouldn't see the backend > accessing the other database as being connected to that database > (PGPROC->databaseId)? You mean that other backends will not consider this backend (which is copying the database) as connected to the source database? Yeah, that is correct, but what is the problem in that? Other backends cannot connect to the source database, so what problem can they create for the backend that is copying the database? > Or what if somebody optimized snapshots to disregard readonly transactions in > other databases? Can you elaborate on this point? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Feb 17, 2022 at 6:39 PM Andres Freund <andres@anarazel.de> wrote: > > The only other kind of hazard I can think of is: could it be unsafe to > > try to interpret the contents of a database to which no one else is > > connected at present due to any of the issues you mention? But what's > > the hazard exactly? > > I don't quite know. But I don't think it's impossible to run into trouble in > this area. E.g. xid horizons are computed in a database specific way. If the > routine reading pg_class did hot pruning, you could end up removing data > that's actually needed for a logical slot in the other database because the > backend local horizon state was computed for the "local" database? Yeah, but it doesn't -- and shouldn't. There's no HeapScanDesc here, so we can't accidentally wander into heap_page_prune_opt(). We should document the things we're thinking about here in the comments to prevent future mistakes, but I think for the moment we are OK. > Could there be problems because other backends wouldn't see the backend > accessing the other database as being connected to that database > (PGPROC->databaseId)? I think that if there's any hazard here, it must be related to snapshots, which brings us to your next point: > Or what if somebody optimized snapshots to disregard readonly transactions in > other databases? So there are two related questions here. One is whether the snapshot that we're using to access the template database can be unsafe, and the other is whether the read-only access that we're performing inside the template database could mess up somebody else's snapshot. Let's deal with the second point first: nobody else knows that we're reading from the template database, and nobody else is reading from the template database except possibly for someone who is doing exactly what we're doing. Therefore, I think this hazard can be ruled out. On the first point, a key point in my opinion is that there can be no in-flight transactions in the template database, because nobody is connected to it, and prepared transactions in a template database are verboten. It therefore can't matter if we include too few XIDs in our snapshot, or if our xmin is too new. The reverse case can matter, though: if the xmin of our snapshot were too old, or if we had extra XIDs in our snapshot, then we might think that a transaction is still in progress when it isn't. Therefore, I think the patch is wrong to use GetActiveSnapshot() and must use GetLatestSnapshot() *after* it's finished making sure that nobody is using the template database. I don't think there's a hazard beyond that, though. Let's consider the two ways in which things could go wrong: 1. Extra XIDs in the snapshot. Any current or future optimization of snapshots would presumably be trying to make them smaller by removing XIDs from the snapshot, not making them bigger by adding XIDs to the snapshot. I guess in theory you can imagine an optimization that tests for the presence of XIDs by some method other than scanning through an array, and which feels free to add XIDs to the snapshot if they "can't matter," but I think it's up to the author of that hypothetical future patch to make sure they don't break anything in so doing -- especially because it's entirely possible for our session to see XIDs used by a backend in some other database, because they could show up in shared catalogs. I think that's why, as far as I can tell, we only use the database ID when setting pruning thresholds, and not for snapshots. 2. xmin of snapshot too new. 
There are no in-progress transactions in the template database, so how can this even happen? If we set the xmin "in the future," then we could get confused about what's visible due to wraparound, but that seems crazy. I don't see how there can be a problem here. > > It can't be a problem if we've failed to process sinval messages for the > > target database, because we're not using any caches anyway. > > Could you end up with an outdated relmap entry? Probably not. Again, we're not relying on caching -- we read the file. > > We can't. It can't be unsafe to test visibility of XIDs for that database, > > because in an alternate universe some backend could have connected to that > > database and seen the same XIDs. > > That's a weak argument, because in that alternative universe a PGPROC entry > with the PGPROC->databaseId = template_databases_oid would exist. So what? As I also argue above, I don't think that affects snapshot generation, and if it did it wouldn't matter anyway, because it could only remove in-progress transactions from the snapshot, and there aren't any in the template database anyhow. Another way of looking at this is: we could just as well use SnapshotSelf or (if it still existed) SnapshotNow to test visibility. In a world where there are no transactions in flight, it's the same thing. In fact, maybe we should do it that way, just to make it clearer what's happening. I think these are really good questions you are raising, so I'm not trying to be dismissive. But after some thought I'm not yet seeing any problems (other than the use of GetActiveSnapshot). -- Robert Haas EDB: http://www.enterprisedb.com
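Put as code, the ordering Robert is describing might look like the following hedged sketch; the helper name and surrounding variable names are assumptions, while LockSharedObject(), CountOtherDBBackends(), and GetLatestSnapshot() are the existing server facilities:

    #include "postgres.h"
    #include "catalog/pg_database.h"
    #include "storage/lmgr.h"
    #include "storage/procarray.h"
    #include "utils/snapmgr.h"

    static Snapshot
    take_template_snapshot(Oid src_dboid, const char *dbtemplate)
    {
        int     notherbackends;
        int     npreparedxacts;

        /* the same interlock a real connection to the database would take */
        LockSharedObject(DatabaseRelationId, src_dboid, 0, ShareLock);

        if (CountOtherDBBackends(src_dboid, &notherbackends, &npreparedxacts))
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_IN_USE),
                     errmsg("source database \"%s\" is being accessed by other users",
                            dbtemplate)));

        /*
         * Taken only after the interlock, so the snapshot cannot have an
         * xmin older than any transaction that touched the template, and
         * no in-flight template XIDs can be missing from it.
         */
        return GetLatestSnapshot();
    }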
I'm not sure about the current status, but found it while playing around with the latest changes a bit, so thought of sharing it here. + <varlistentry> + <term><replaceable class="parameter">strategy</replaceable></term> + <listitem> + <para> + This is used for copying the database directory. Currently, we have + two strategies the <literal>WAL_LOG_BLOCK</literal> and the Isn't it wal_log instead of wal_log_block? I think when users input wrong strategy with createdb command, we should provide a hint message showing allowed values for strategy types along with an error message. This will be helpful for the users. On Tue, Feb 15, 2022 at 5:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Tue, Feb 15, 2022 at 2:01 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote: > > > Here is the updated version of the patch, the changes are 1) Fixed > review comments given by Robert and one open comment from Ashutosh. > 2) Preserved the old create db method. 3) As agreed upthread for now > we are using the new strategy only for createdb not for movedb so I > have removed the changes in ForgetDatabaseSyncRequests() and > DropDatabaseBuffers(). 4) Provided a database creation strategy > option as of now I have kept it as below. > > CREATE DATABASE ... WITH (STRATEGY = WAL_LOG); -- default if > option is omitted > CREATE DATABASE ... WITH (STRATEGY = FILE_COPY); > > I have updated the document but I was not sure how much internal > information to be exposed to the user so I will work on that based on > feedback from others. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 22, 2022 at 8:27 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> I'm not sure about the current status, but found it while playing
> around with the latest changes a bit, so thought of sharing it here.
>
> + <varlistentry>
> + <term><replaceable class="parameter">strategy</replaceable></term>
> + <listitem>
> + <para>
> + This is used for copying the database directory. Currently, we have
> + two strategies the <literal>WAL_LOG_BLOCK</literal> and the
>
> Isn't it wal_log instead of wal_log_block?
>
> I think when users input a wrong strategy with the createdb command, we
> should provide a hint message showing the allowed values for the strategy
> type along with the error message. This will be helpful for users.

I will fix these two comments while posting the next version.
On Tue, Mar 1, 2022 at 5:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Tue, Feb 22, 2022 at 8:27 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> > I'm not sure about the current status, but found it while playing
> > around with the latest changes a bit, so thought of sharing it here.
> >
> > + <varlistentry>
> > + <term><replaceable class="parameter">strategy</replaceable></term>
> > + <listitem>
> > + <para>
> > + This is used for copying the database directory. Currently, we have
> > + two strategies the <literal>WAL_LOG_BLOCK</literal> and the
> >
> > Isn't it wal_log instead of wal_log_block?
> >
> > I think when users input a wrong strategy with the createdb command, we
> > should provide a hint message showing the allowed values for the strategy
> > type along with the error message. This will be helpful for users.
>
> I will fix these two comments while posting the next version.
The new version of the patch fixes these two comments pointed out by Ashutosh, splits the GetRelListFromPage() function as suggested by Robert, and uses the latest snapshot for scanning pg_class instead of the active snapshot, as also pointed out by Robert. I haven't yet added a test case for creating a database using this new strategy option, so if we are okay with these two options, FILE_COPY and WAL_LOG, then I will add test cases for them.
Attachment
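For reference, per the option syntax quoted earlier in the thread, the strategy is selected like this (database name is illustrative):

    CREATE DATABASE testdb WITH (STRATEGY = WAL_LOG);    -- default if omitted
    CREATE DATABASE testdb WITH (STRATEGY = FILE_COPY);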
On Thu, Mar 3, 2022 at 11:22 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > The new version of the patch fixes these two comments pointed out by Ashutosh, splits the GetRelListFromPage() function as suggested by Robert, and uses the latest snapshot for scanning pg_class instead of the active snapshot, as also pointed out by Robert. I haven't yet added a test case for creating a database using this new strategy option, so if we are okay with these two options, FILE_COPY and WAL_LOG, then I will add test cases for them. Reviewing 0001, the boundaries of the critical section move slightly, but only over a memcpy, which can't fail, so that seems fine. But this comment looks ominous: * Note: we're cheating a little bit here by assuming that mapped files * are either in pg_global or the database's default tablespace. It's not clear to me how the code that follows relies on this assumption, but the overall patch set would make that not true any more, so there's some kind of an issue to think about there. It's a little asymmetric that load_relmap_file() gets a subroutine read_relmap_file() while write_relmap_file() gets a subroutine write_relmap_file_internal(). Perhaps we could call the functions {load,write}_named_relmap_file() or something of that sort. Reviewing 0002, your comment updates in relmap_redo() are not complete. Note that there's an unmodified comment that says "Write out the new map and send sinval" just above where you modify the code to only conditionally send sinval. I'm somewhat uncomfortable with the shape of this logic, too. It looks weird to be sometimes calling write_relmap_file and sometimes write_relmap_file_internal. You'd expect functions with those names to be called at different abstraction levels, rather than at parallel call sites. The renaming I proposed would help with this but it's not just a cosmetic issue: the logic to construct mapfilename is in this function in one case, but in the called function in the other case. I can't help but think that the write_relmap_file()/write_relmap_file_internal() split isn't entirely the right thing. I think part of the confusion here is that, pre-patch, write_relmap_file() gets called during both recovery and normal running, and the updates to shared_map or local_map are actually nonsense during recovery, because the local map at least is local to whatever our database is, and we don't have a database connection if we're the startup process. After your patch, we're still going through write_relmap_file in recovery in some cases, but really those map updates don't seem like things that should be happening at all. And on the other hand it's not clear to me why the CRC stuff isn't needed in all cases, but that's only going to happen when we go through the non-internal version of the function. You've probably spent more time looking at this code than I have, but I'm wondering if the division should be like this: we have one function that does the actual update, and another function that does the update plus sets global variables. Recovery always uses the first one, and outside of recovery we use the first one for the create-database case and the second one otherwise. Thoughts? Regarding 0003, my initial thought was to like the fact that you'd avoided duplicating code by using a function parameter, but as I look at it a bit more, it's not clear to me that it's enough code that we really care about not duplicating it. I would not expect to find a function called RelationCopyAllFork() in tablecmds.c.
I'd expect to find it in storage.c, I think. And I think I'd be surprised to find out that it doesn't actually know anything about copying; it's basically just a loop over the forks to which you can supply your own copy-function. And the fact that it's got an argument of type copy_relation_storage and the argument name is copy_storage and the value is sometimes RelationCopyStorage is a terminological muddle, too. So I feel like perhaps this needs more thought. -- Robert Haas EDB: http://www.enterprisedb.com
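For context, the function being criticized is essentially the following loop (a simplified sketch of the shape under review, not the final code; copy_storage is the callback parameter mentioned above):

    ForkNumber  forkNum;

    /* Copy every fork that exists, delegating the per-block work to
     * the caller-supplied copy callback. */
    for (forkNum = MAIN_FORKNUM; forkNum <= MAX_FORKNUM; forkNum++)
    {
        if (smgrexists(src, forkNum))
        {
            smgrcreate(dst, forkNum, false);
            copy_storage(src, dst, forkNum, relpersistence);
        }
    }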
On Wed, Mar 9, 2022 at 3:12 AM Robert Haas <robertmhaas@gmail.com> wrote: Thanks for reviewing and the valuable feedback. > Reviewing 0001, the boundaries of the critical section move slightly, > but only over a memcpy, which can't fail, so that seems fine. But this > comment looks ominous: > > * Note: we're cheating a little bit here by assuming that mapped files > * are either in pg_global or the database's default tablespace. > > It's not clear to me how the code that follows relies on this > assumption, but the overall patch set would make that not true any > more, so there's some kind of an issue to think about there. I think the comment is w.r.t. choosing the file path, because here we assume either it is in the global tablespace or in the default tablespace of the database. Here also the comment is partially true, because we also assume that it will be in the default tablespace of the database (we do not need to worry about the shared relations). But I think this comment can move to the caller function where we are creating the file path. if (shared) { snprintf(mapfilename, sizeof(mapfilename), "global/%s", RELMAPPER_FILENAME); } else { snprintf(mapfilename, sizeof(mapfilename), "%s/%s", dbpath, RELMAPPER_FILENAME); } > It's a little asymmetric that load_relmap_file() gets a subroutine > read_relmap_file() while write_relmap_file() gets a subroutine > write_relmap_file_internal(). Perhaps we could call the functions > {load,write}_named_relmap_file() or something of that sort. Yeah, this should be changed. > Reviewing 0002, your comment updates in relmap_redo() are not > complete. Note that there's an unmodified comment that says "Write out > the new map and send sinval" just above where you modify the code to > only conditionally send sinval. I'm somewhat uncomfortable with the > shape of this logic, too. It looks weird to be sometimes calling > write_relmap_file and sometimes write_relmap_file_internal. You'd > expect functions with those names to be called at different > abstraction levels, rather than at parallel call sites. The renaming I > proposed would help with this but it's not just a cosmetic issue: the > logic to construct mapfilename is in this function in one case, but in > the called function in the other case. I can't help but think that the > write_relmap_file()/write_relmap_file_internal() split isn't entirely > the right thing. > > I think part of the confusion here is that, pre-patch, > write_relmap_file() gets called during both recovery and normal > running, and the updates to shared_map or local_map are actually > nonsense during recovery, because the local map at least is local to > whatever our database is, and we don't have a database connection if > we're the startup process. Yeah, you are correct about the local map, but I am not sure whether we can rely on not updating the shared map in the startup process, because how can we guarantee that now or in the future the startup process will never look into the map? I agree that it is not connected to the database, so it doesn't make sense to look into the local map, but how are we going to ensure that for the shared map? That said, I think there are only 3 functions which look at these maps directly, RelationMapOidToFilenode(), RelationMapFilenodeToOid() and RelationMapUpdateMap(), and these functions are called from very few places, and I don't think they should be called during recovery. So probably we can put an elog saying they should never be called during recovery?
> After your patch, we're still going through > write_relmap_file in recovery in some cases, but really those map > updates don't seem like things that should be happening at all. And on > the other hand it's not clear to me why the CRC stuff isn't needed in > all cases, but that's only going to happen when we go through the > non-internal version of the function. You've probably spent more time > looking at this code than I have, but I'm wondering if the division > should be like this: we have one function that does the actual update, > and another function that does the update plus sets global variables. > Recovery always uses the first one, and outside of recovery we use the > first one for the create-database case and the second one otherwise. > Thoughts? Right, in fact even now if you see the logic, the write_relmap_file_internal() is taking care of the actual update and the write_relmap_file() is doing the update + setting the global variables. So yeah, we can rename them as you suggested in 0001, and here also we can change the logic as you suggested so that recovery and createdb will only call the first function, which just does the update. > Regarding 0003, my initial thought was to like the fact that you'd > avoided duplicating code by using a function parameter, but as I look > at it a bit more, it's not clear to me that it's enough code that we > really care about not duplicating it. I would not expect to find a > function called RelationCopyAllFork() in tablecmds.c. Okay, actually I see this logic of copying the forks in a few different places, like index_copy_data() in tablecmds.c and heapam_relation_copy_data() in heapam_handler.c. So I was not sure what the right place for this function would be, so I kept it in the same file (tablecmds.c) because I split it out of a function in this file. > I'd expect to > find it in storage.c, I think. And I think I'd be surprised to find > out that it doesn't actually know anything about copying; it's > basically just a loop over the forks to which you can supply your own > copy-function. Yeah, but it eventually expects a function pointer that copies storage, so we can not really say that it knows nothing about the copy? > And the fact that it's got an argument of type > copy_relation_storage and the argument name is copy_storage and the > value is sometimes RelationCopyStorage is a terminological muddle, too. > So I feel like perhaps this needs more thought. One option is that we can duplicate this loop in dbcommands.c as well, like we already have it in tablecmds.c and heapam_handler.c? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 9, 2022 at 6:07 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Yeah, you are correct about the local map, but I am not sure whether we > can rely on not updating the shared map in the startup process, because > how can we guarantee that now or in the future the startup process will > never look into the map? I agree that it is not connected to the > database, so it doesn't make sense to look into the local map, but how > are we going to ensure that for the shared map? That said, I think there > are only 3 functions which look at these maps directly, > RelationMapOidToFilenode(), RelationMapFilenodeToOid() and > RelationMapUpdateMap(), and these functions are called from very few > places, and I don't think they should be called during recovery. So > probably we can put an elog saying they should never be called during > recovery? Yeah, that seems reasonable. > Right, in fact even now if you see the logic, the > write_relmap_file_internal() is taking care of the actual update and > the write_relmap_file() is doing the update + setting the global > variables. So yeah, we can rename them as you suggested in 0001, and > here also we can change the logic as you suggested so that recovery and > createdb will only call the first function, which just does the > update. But I think we want the path construction to be managed by the function rather than the caller, too. > > I'd expect to > > find it in storage.c, I think. And I think I'd be surprised to find > > out that it doesn't actually know anything about copying; it's > > basically just a loop over the forks to which you can supply your own > > copy-function. > > Yeah, but it eventually expects a function pointer that copies storage, > so we can not really say that it knows nothing about the copy? Sure, I guess. It's just not obvious why the argument would actually need to be a function that copies storage, or why there's more than one way to copy storage. I'd rather keep all the code paths unified, if we can, and set behavior via flags or something, maybe. I'm not sure whether that's realistic, though. > > And the fact that it's got an argument of type > > copy_relation_storage and the argument name is copy_storage and the > > value is sometimes RelationCopyStorage is a terminological muddle, too. > > So I feel like perhaps this needs more thought. > > One option is that we can duplicate this loop in dbcommands.c as well, > like we already have it in tablecmds.c and heapam_handler.c? Yeah, I think this is also worth considering. -- Robert Haas EDB: http://www.enterprisedb.com
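Concretely, the guard agreed on here could be as small as this at the top of each of those three functions (a sketch, not necessarily the final wording):

    /* The relation map state is backend-local and is never valid in
     * the startup process, so refuse to consult it during recovery. */
    if (RecoveryInProgress())
        elog(ERROR, "cannot access relation mapping during recovery");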
On Wed, Mar 9, 2022 at 6:44 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Right, in fact even now if you see the logic, the > > write_relmap_file_internal() is taking care of the actual update and > > the write_relmap_file() is doing the update + setting the global > > variables. So yeah, we can rename them as you suggested in 0001, and > > here also we can change the logic as you suggested so that recovery and > > createdb will only call the first function, which just does the > > update. > But I think we want the path construction to be managed by the > function rather than the caller, too. I have completely changed the logic for this refactoring. Basically, write_relmap_file(), is already having parameters to control whether to write wal, send inval and we are already passing the dbpath. Instead of making a new function I just pass one additional parameter to this function itself about whether we are creating a new map or not and I think with that changes are very less and this looks cleaner to me. Similarly for load_relmap_file() also I just had to pass the dbpath and memory for destination map. Please have a look and let me know your thoughts. > Sure, I guess. It's just not obvious why the argument would actually > need to be a function that copies storage, or why there's more than > one way to copy storage. I'd rather keep all the code paths unified, > if we can, and set behavior via flags or something, maybe. I'm not > sure whether that's realistic, though. I tried considering that; I think it doesn't look good to make it flag based. One of the main problems I noticed is that for copying we need to call either RelationCopyStorage() or RelationCopyStorageUsingBuffer() based on the input flag, but if we move the main copy function to storage.c then storage.c will have a dependency on bufmgr functions, because I don't think we can keep RelationCopyStorageUsingBuffer() inside storage.c. So for now, I have duplicated the loop which is already there in index_copy_data() and heapam_relation_copy_data(), kept that in bufmgr.c, and also moved RelationCopyStorageUsingBuffer() into bufmgr.c. I think bufmgr.c already has functions which deal with smgr things, so I feel this is the right place for the function. Other changes: 1. 0001 and 0002 are merged, because now we are not really refactoring these functions; since we are just passing additional arguments to them, it makes sense to combine the changes. 2. Same with 0003: now we are not refactoring existing functions but providing new interfaces, so I merged it with 0004 (which was 0006 previously). I think we should also write test cases for the create database strategy. But I do not see any existing test case that tests the existing create database options. So I am wondering whether we should add a test case only for the new option we are providing, or create a separate patch which tests the new option as well as the existing options. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Here are some review comments on the latest patch (v11-0004-WAL-logged-CREATE-DATABASE.patch). I actually did the review of the v10 patch but it applies to this latest version as well. + /* Now errors are fatal ... */ + START_CRIT_SECTION(); Did you mean PANIC instead of FATAL? == + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("invalid create strategy %s", strategy), + errhint("Valid strategies are \"wal_log\", and \"file_copy\"."))); + } Should this be - "invalid createdb strategy" instead of "invalid create strategy"? == + /* + * In case of ALTER DATABASE SET TABLESPACE we don't need to do + * anything for the object which are not in the source db's default + * tablespace. The source and destination dboid will be same in + * case of ALTER DATABASE SET TABLESPACE. + */ + else if (src_dboid == dst_dboid) + continue; + else + dstrnode.spcNode = srcrnode.spcNode; Is this change still required? Do we support the WAL_COPY strategy for ALTER DATABASE? == + /* Open the source and the destination relation at smgr level. */ + src_smgr = smgropen(srcrnode, InvalidBackendId); + dst_smgr = smgropen(dstrnode, InvalidBackendId); + + /* Copy relation storage from source to the destination. */ + CreateAndCopyRelationData(src_smgr, dst_smgr, relinfo->relpersistence); Do we need to do smgropen for the destination relfilenode here? Aren't we already doing that inside RelationCreateStorage? == + use_wal = XLogIsNeeded() && + (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork); + + /* Get number of blocks in the source relation. */ + nblocks = smgrnblocks(src, forkNum); What if the number of blocks in a source relation is ZERO? Should we check for that and return immediately? We have already done smgrcreate. == + /* We don't need to copy the shared objects to the target. */ + if (classForm->reltablespace == GLOBALTABLESPACE_OID) + return NULL; + + /* + * If the object doesn't have the storage then nothing to be + * done for that object so just ignore it. + */ + if (!RELKIND_HAS_STORAGE(classForm->relkind)) + return NULL; We can probably club together the above two if-checks. == + <varlistentry> + <term><replaceable class="parameter">strategy</replaceable></term> + <listitem> + <para> + This is used for copying the database directory. Currently, we have + two strategies the <literal>WAL_LOG</literal> and the + <literal>FILE_COPY</literal>. If <literal>WAL_LOG</literal> strategy + is used then the database will be copied block by block and it will + also WAL log each copied block. Otherwise, if <literal>FILE_COPY I think we need to mention the default strategy in the documentation page. -- With Regards, Ashutosh Sharma.
On Thu, Mar 10, 2022 at 7:22 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Here are some review comments on the latest patch > (v11-0004-WAL-logged-CREATE-DATABASE.patch). I actually did the review > of the v10 patch but it applies to this latest version as well. > > + /* Now errors are fatal ... */ > + START_CRIT_SECTION(); > > Did you mean PANIC instead of FATAL? I think here "fatal" didn't really mean the error level FATAL; it means critical, and I have seen it used that way in other places also. But I really don't think we need this comment, so I have removed it to avoid any confusion. > == > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("invalid create > strategy %s", strategy), > + errhint("Valid strategies are > \"wal_log\", and \"file_copy\"."))); > + } > > Should this be - "invalid createdb strategy" instead of "invalid > create strategy"? Changed. > == > > + /* > + * In case of ALTER DATABASE SET TABLESPACE we don't need to do > + * anything for the object which are not in the source > db's default > + * tablespace. The source and destination dboid will be same in > + * case of ALTER DATABASE SET TABLESPACE. > + */ > + else if (src_dboid == dst_dboid) > + continue; > + else > + dstrnode.spcNode = srcrnode.spcNode; > > Is this change still required? Do we support the WAL_COPY strategy for > ALTER DATABASE? Yeah, now it is unreachable code, so I removed it. > == > > + /* Open the source and the destination relation at > smgr level. */ > + src_smgr = smgropen(srcrnode, InvalidBackendId); > + dst_smgr = smgropen(dstrnode, InvalidBackendId); > + > + /* Copy relation storage from source to the destination. */ > + CreateAndCopyRelationData(src_smgr, dst_smgr, > relinfo->relpersistence); > > Do we need to do smgropen for the destination relfilenode here? Aren't we > already doing that inside RelationCreateStorage? Yeah, I have changed the complete logic, removed the smgropen for both source and destination, and moved it inside CreateAndCopyRelationData; please check the updated code. > == > > + use_wal = XLogIsNeeded() && > + (relpersistence == RELPERSISTENCE_PERMANENT || > copying_initfork); > + > + /* Get number of blocks in the source relation. */ > + nblocks = smgrnblocks(src, forkNum); > > What if the number of blocks in a source relation is ZERO? Should we check > for that and return immediately? We have already done smgrcreate. Yeah, it makes sense to optimize this; with that we will not have to get the buffer access strategy either, so done. > == > > + /* We don't need to copy the shared objects to the target. */ > + if (classForm->reltablespace == GLOBALTABLESPACE_OID) > + return NULL; > + > + /* > + * If the object doesn't have the storage then nothing to be > + * done for that object so just ignore it. > + */ > + if (!RELKIND_HAS_STORAGE(classForm->relkind)) > + return NULL; > > We can probably club together the above two if-checks. Done. > == > > + <varlistentry> > + <term><replaceable class="parameter">strategy</replaceable></term> > + <listitem> > + <para> > + This is used for copying the database directory. Currently, we have > + two strategies the <literal>WAL_LOG</literal> and the > + <literal>FILE_COPY</literal>. If <literal>WAL_LOG</literal> strategy > + is used then the database will be copied block by block and it will > + also WAL log each copied block. Otherwise, if <literal>FILE_COPY > > I think we need to mention the default strategy in the documentation page. Done. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
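The zero-block optimization discussed above amounts to an early exit in the block-copying routine, along these lines (a sketch; the surrounding code is per the quoted hunk):

    /* Get number of blocks in the source relation. */
    nblocks = smgrnblocks(src, forkNum);

    /* If the source fork is empty there is nothing to copy, so return
     * before setting up a buffer access strategy. */
    if (nblocks == 0)
        return;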
On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I have completely changed the logic for this refactoring. Basically, > write_relmap_file(), is already having parameters to control whether > to write wal, send inval and we are already passing the dbpath. > Instead of making a new function I just pass one additional parameter > to this function itself about whether we are creating a new map or not > and I think with that changes are very less and this looks cleaner to > me. Similarly for load_relmap_file() also I just had to pass the > dbpath and memory for destination map. Please have a look and let me > know your thoughts. It's not terrible, but how about something like the attached instead? I think this has the effect of reducing the number of cases that the low-level code needs to know about from 2 to 1, instead of making it go up from 2 to 3. > I think we should also write the test cases for create database > strategy. But I do not see any test case for create database for > testing the existing options. So I am wondering whether we should add > the test case only for the new option we are providing or we should > create a separate path which tests the new option as well as the > existing options. FWIW, src/bin/scripts/t/020_createdb.pl does a little bit of testing of this kind. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
Thanks Dilip for working on the review comments. I'll take a look at the new version of the patch and let you know my comments, if any. -- With Regards, Ashutosh Sharma.
On Thu, Mar 10, 2022 at 10:18 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have completely changed the logic for this refactoring. Basically, > > write_relmap_file(), is already having parameters to control whether > > to write wal, send inval and we are already passing the dbpath. > > Instead of making a new function I just pass one additional parameter > > to this function itself about whether we are creating a new map or not > > and I think with that changes are very less and this looks cleaner to > > me. Similarly for load_relmap_file() also I just had to pass the > > dbpath and memory for destination map. Please have a look and let me > > know your thoughts. > > It's not terrible, but how about something like the attached instead? > I think this has the effect of reducing the number of cases that the > low-level code needs to know about from 2 to 1, instead of making it > go up from 2 to 3. > Looks better, but why do you want to pass elevel to the load_relmap_file()? Are we calling this function from somewhere other than read_relmap_file()? If not, do we have any plans to call this function directly bypassing read_relmap_file for any upcoming patch? -- With Regards, Ashutosh Sharma.
On Thu, Mar 10, 2022 at 10:18 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have completely changed the logic for this refactoring. Basically, > > write_relmap_file(), is already having parameters to control whether > > to write wal, send inval and we are already passing the dbpath. > > Instead of making a new function I just pass one additional parameter > > to this function itself about whether we are creating a new map or not > > and I think with that changes are very less and this looks cleaner to > > me. Similarly for load_relmap_file() also I just had to pass the > > dbpath and memory for destination map. Please have a look and let me > > know your thoughts. > > It's not terrible, but how about something like the attached instead? > I think this has the effect of reducing the number of cases that the > low-level code needs to know about from 2 to 1, instead of making it > go up from 2 to 3. Yeah this looks cleaner, I will rebase the remaining patch. > > I think we should also write the test cases for create database > > strategy. But I do not see any test case for create database for > > testing the existing options. So I am wondering whether we should add > > the test case only for the new option we are providing or we should > > create a separate path which tests the new option as well as the > > existing options. > > FWIW, src/bin/scripts/t/020_createdb.pl does a little bit of testing > of this kind. Okay, I think we need to support the strategy in createdb bin as well. I will do that. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 11, 2022 at 11:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Thu, Mar 10, 2022 at 10:18 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Thu, Mar 10, 2022 at 6:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > I have completely changed the logic for this refactoring. Basically, > > > write_relmap_file(), is already having parameters to control whether > > > to write wal, send inval and we are already passing the dbpath. > > > Instead of making a new function I just pass one additional parameter > > > to this function itself about whether we are creating a new map or not > > > and I think with that changes are very less and this looks cleaner to > > > me. Similarly for load_relmap_file() also I just had to pass the > > > dbpath and memory for destination map. Please have a look and let me > > > know your thoughts. > > > > It's not terrible, but how about something like the attached instead? > > I think this has the effect of reducing the number of cases that the > > low-level code needs to know about from 2 to 1, instead of making it > > go up from 2 to 3. > > Yeah this looks cleaner, I will rebase the remaining patch. Here is the updated version of the patch set. Changes: 1) it takes Robert's patch as the first refactoring patch, 2) rebases the other new relmapper APIs on top of that in 0002, 3) some code refactoring in the main patch 0005 and also one problem fix: earlier, in the wal_log method, I had removed ForceSyncCommit(), but IMHO that is equally valid whether we use file_copy or wal_log because in both cases we are creating the disk files, and 4) support for the strategy in the createdb tool, plus a test case, as part of 0006. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
You may also need to add documentation to app-createdb.sgml. Currently you have only added it to create_database.sgml. Also, I had a quick look at the new changes done in v13-0005-WAL-logged-CREATE-DATABASE.patch and they seemed fine to me, although I haven't put much emphasis on the comments and other cosmetic stuff. -- With Regards, Ashutosh Sharma.
On Fri, Mar 11, 2022 at 12:15 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > Looks better, but why do you want to pass elevel to the > load_relmap_file()? Are we calling this function from somewhere other > than read_relmap_file()? If not, do we have any plans to call this > function directly bypassing read_relmap_file for any upcoming patch? If it fails during CREATE DATABASE, it should be ERROR, not FATAL. In that case, we only need to stop trying to create a database; we don't need to terminate the session. On the other hand if we can't read our own database's relmap files, that's an unrecoverable error, because we will not be able to run any queries at all, so FATAL is appropriate. -- Robert Haas EDB: http://www.enterprisedb.com
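In other words, the severity becomes the caller's choice, roughly like this at the two kinds of call sites (a sketch; the parameter list follows the patch under discussion and is not final):

    /* createdb(): a bad source relmap should abort only the command */
    read_relmap_file(&map, srcdbpath, false, ERROR);

    /* loading our own database's map: unrecoverable, so end the session */
    read_relmap_file(&map, dbpath, false, FATAL);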
On Fri, Mar 11, 2022 at 8:21 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Mar 11, 2022 at 12:15 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Looks better, but why do you want to pass elevel to the > > load_relmap_file()? Are we calling this function from somewhere other > > than read_relmap_file()? If not, do we have any plans to call this > > function directly bypassing read_relmap_file for any upcoming patch? > > If it fails during CREATE DATABASE, it should be ERROR, not FATAL. In > that case, we only need to stop trying to create a database; we don't > need to terminate the session. On the other hand if we can't read our > own database's relmap files, that's an unrecoverable error, because we > will not be able to run any queries at all, so FATAL is appropriate. > OK. I can see it being used in the v13 patch. In the previous patches it was hard-coded with FATAL. Also, we simply error out when doing file copy as I can see in the copy_file function. So yes FATAL is not the right option to use when creating a database. Thanks. -- With Regards, Ashutosh Sharma.
On Fri, Mar 11, 2022 at 5:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Changes: 1) it takes Robert's patch as the first refactoring patch, > 2) rebases the other new relmapper APIs on top of that in 0002, 3) some > code refactoring in the main patch 0005 and also one problem fix: > earlier, in the wal_log method, I had removed ForceSyncCommit(), but > IMHO that is equally valid whether we use file_copy or wal_log because > in both cases we are creating the disk files, and 4) support for the > strategy in the createdb tool, plus a test case, as part of 0006. I don't think you've adequately considered temporary relations here. It seems to me that ReadBufferWithoutRelcache() could not be safe on a temprel, because we'd need a BackendId to access the underlying storage. So I think that ReadBufferWithoutRelcache can only accept unlogged or permanent, and maybe the argument ought to be a Boolean instead of a relpersistence value. I thought that this problem might be only cosmetic, but I checked the code that actually does the copy, and there's no filter there on relpersistence either. And I think there should be. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Mar 11, 2022 at 1:10 PM Robert Haas <robertmhaas@gmail.com> wrote: > I don't think you've adequately considered temporary relations here. > It seems to be that ReadBufferWithoutRelcache() could not be safe on a > temprel, because we'd need a BackendId to access the underlying > storage. So I think that ReadBufferWithoutRelcache can only accept > unlogged or permanent, and maybe the argument ought to be a Boolean > instead of a relpersistence value. I thought that this problem might > be only cosmetic, but I checked the code that actually does the copy, > and there's no filter there on relpersistence either. And I think > there should be. I hit "send" too quickly there: rhaas=# create database fudge; CREATE DATABASE rhaas=# \c fudge You are now connected to database "fudge" as user "rhaas". fudge=# create temp table q (); CREATE TABLE fudge=# ^Z [2]+ Stopped psql [rhaas Downloads]$ pg_ctl stop -mi waiting for server to shut down.... done server stopped [rhaas Downloads]$ %% psql \c You are now connected to database "fudge" as user "rhaas". fudge=# select * from pg_class where relpersistence='t'; oid | relname | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallvisible | reltoastrelid | relhasindex | relisshared | relpersistence | relkind | relnatts | relchecks | relhasrules | relhastriggers | relhassubclass | relrowsecurity | relforcerowsecurity | relispopulated | relreplident | relispartition | relrewrite | relfrozenxid | relminmxid | relacl | reloptions | relpartbound -------+---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+----------------+---------+----------+-----------+-------------+----------------+----------------+----------------+---------------------+----------------+--------------+----------------+------------+--------------+------------+--------+------------+-------------- 16388 | q | 16386 | 16390 | 0 | 10 | 2 | 16388 | 0 | 0 | -1 | 0 | 0 | f | f | t | r | 0 | 0 | f | f | f | f | f | t | d | f | 0 | 721 | 1 | | | (1 row) fudge=# \c rhaas You are now connected to database "rhaas" as user "rhaas". rhaas=# alter database fudge is_template true; ALTER DATABASE rhaas=# create database cookies template fudge; CREATE DATABASE rhaas=# \c cookies You are now connected to database "cookies" as user "rhaas". cookies=# select count(*) from pg_class where relpersistence='t'; count ------- 1 (1 row) You have to be quick, because autovacuum will drop the orphaned temp table when it notices it, but it is possible. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Mar 11, 2022 at 5:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Changes: 1) it takes Robert's patch as the first refactoring patch, > 2) rebases the other new relmapper APIs on top of that in 0002, 3) some > code refactoring in the main patch 0005 and also one problem fix: > earlier, in the wal_log method, I had removed ForceSyncCommit(), but > IMHO that is equally valid whether we use file_copy or wal_log because > in both cases we are creating the disk files, and 4) support for the > strategy in the createdb tool, plus a test case, as part of 0006. I think there's something wrong with what this patch is doing with the XLOG records. It adds XLOG_DBASE_CREATEDIR, but the only new XLogInsert() calls in the patch are passing XLOG_DBASE_CREATE, and no existing references are adjusted. Similarly with xl_dbase_create_rec and xl_dbase_createdir_rec. Why would we introduce a new record type and not use it? Let's not call the functions for the different strategies CopyDatabase() and CopyDatabaseWithWal() but rather something that matches up to the strategy names e.g. create_database_using_wal_log() and create_database_using_file_copy(). There's something a little funny about the names wal_log and file_copy ... they're not quite parallel grammatically. But it's probably OK. The changes to createdb_failure_params make me a little nervous. I think we'd be in real trouble if we failed before completing both DropDatabaseBuffers() and ForgetDatabaseSyncRequests(). However, it looks to me like those are both intended to be no-fail operations, so I don't see an actual hazard here. But, hmm, what about on the recovery side? Suppose that we start copying the database block by block and then either (a) the standby is promoted before the copy is finished or (b) the copy fails. Now the standby has data in shared_buffers for a database that does not exist. If that's not bad, then why does createdb_failure_params need to DropDatabaseBuffers()? But I bet it is bad. I wonder if we should be using RelationCopyStorage() rather than this new function RelationCopyStorageUsingBuffer(). That would avoid having the buffers in shared_buffers, dodging the problem. But then we have a problem with checkpoint interlocking: we could begin replay from a checkpoint and find that the pages that were supposed to get copied prior to the checkpoint were actually not copied, because the checkpoint record could be written after we've logged a page being copied and before we actually write the page. Or, we could crash after writing a whole lot of pages and a checkpoint record, but before RelationCopyStorage() fsyncs the destination fork. It doesn't seem advisable to hold off checkpoints for the time it takes to copy an entire relation fork, so the solution is apparently to keep the data in shared buffers after all. But that brings us right back to square one. Have you thought through this whole problem carefully? It seems like a total mess to me at the moment, but maybe I'm missing something. There seems to be no reason to specify specific values for the members of enum CreateDBStrategy. I think the naming of some of the new functions might need work, in particular GetRelInfoFromTuple, GetRelListFromPage, and GetDatabaseRelationList. The names seem kind of generic for what they're doing. Maybe ScanSourceDatabasePgClass, ScanSourceDatabasePgClassPage, ScanSourceDatabasePgClassTuple? -- Robert Haas EDB: http://www.enterprisedb.com
Responding to this specific issue: On Sat, Mar 12, 2022 at 1:55 AM Robert Haas <robertmhaas@gmail.com> wrote: > The changes to createdb_failure_params make me a little nervous. I > think we'd be in real trouble if we failed before completing both > DropDatabaseBuffers() and ForgetDatabaseSyncRequests(). However, it > looks to me like those are both intended to be no-fail operations, so > I don't see an actual hazard here. I might be missing something, but even without that I do not see a real problem here. The reason we are dropping the database buffers and pending sync requests is that right after this we remove the underlying files, and if we just remove the files without dropping the buffers from the buffer cache then the checkpointer will fail while trying to flush them. > But, hmm, what about on the > recovery side? Suppose that we start copying the database block by > block and then either (a) the standby is promoted before the copy is > finished or (b) the copy fails. I think the above logic will be valid in the case of the standby as well, because we are not really deleting the underlying files. > Now the standby has data in > shared_buffers for a database that does not exist. If that's not bad, > then why does createdb_failure_params need to DropDatabaseBuffers()? > But I bet it is bad. I wonder if we should be using > RelationCopyStorage() rather than this new function > RelationCopyStorageUsingBuffer(). I am not sure how RelationCopyStorage() will help on the standby side, because then also we will log the same WAL (XLOG_FPI) for each page, and on the standby side we will use a buffer to apply this FPI, so if you think that there is a problem then it will be the same with RelationCopyStorage(), at least on the standby side. In fact, while rewriting the relation during vacuum full we also call log_newpage() under RelationCopyStorage(), and if the standby gets promoted we may have some buffers in the buffer pool with the new relfilenode. So I think our case is also the same. So here my stand is that we need to drop the database buffers and remove pending sync requests because we are deleting the underlying files; in some extreme cases where we do not delete the files, there is no need to drop the buffers or remove the pending sync requests, and the worst consequence would be a waste of disk space. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 11, 2022 at 11:51 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Mar 11, 2022 at 1:10 PM Robert Haas <robertmhaas@gmail.com> wrote: > > I don't think you've adequately considered temporary relations here. > > It seems to me that ReadBufferWithoutRelcache() could not be safe on a > > temprel, because we'd need a BackendId to access the underlying > > storage. So I think that ReadBufferWithoutRelcache can only accept > > unlogged or permanent, and maybe the argument ought to be a Boolean > > instead of a relpersistence value. I thought that this problem might > > be only cosmetic, but I checked the code that actually does the copy, > > and there's no filter there on relpersistence either. And I think > > there should be. Yeah, right; for now this API can only support unlogged or permanent. I will fix this. > I hit "send" too quickly there: > > cookies=# select count(*) from pg_class where relpersistence='t'; > count > ------- > 1 > (1 row) I think this is not the right example to show the problem, no? Because you are showing the pg_class entry, and pg_class is not a temp relation, so even if we avoid copying the temp relation, pg_class will be copied, right? So you will still see this uncleaned temp relation entry. I could reproduce exactly the same issue without my patch as well. So I agree we need to avoid copying temp relations, but with that the above behavior will not change. Am I missing something? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
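The fix being agreed on here is a filter in the pg_class-scanning code, something like this sketch (as noted above, it stops the storage copy, while the copied pg_class itself will still show the orphaned temp table entry):

    /* We cannot read another backend's temporary relations through
     * ReadBufferWithoutRelcache(), so skip copying their storage. */
    if (classForm->relpersistence == RELPERSISTENCE_TEMP)
        return NULL;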
On Sat, Mar 12, 2022 at 11:06 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > In fact, while rewriting the relation during vacuum full we also call > log_newpage() under RelationCopyStorage(), and if the standby gets > promoted we may have some buffers in the buffer pool with the new > relfilenode. So I think our case is also the same. > > So here my stand is that we need to drop the database buffers and remove > pending sync requests because we are deleting the underlying files; in > some extreme cases where we do not delete the files, there is no need to > drop the buffers or remove the pending sync requests, and the worst > consequence would be a waste of disk space. So other than this open point I have fixed the other comments given by you, which include: - Avoid copying temp relfilenodes - Rename of functions CopyDatabase* -> CreateDatabaseUsing* - GetDatabaseRelationList and friends to ScanSourceDatabasePgClass* - Removed the unused structure and macro, because we are using the same WAL record for copying the database using the old method and for creating the directory and version file for the new method. Do you think we should introduce a new WAL record for that instead of using the same one? Other than that, I have fixed some mistakes in comments and supported tab completion for the new options. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v14-0001-Refactor-relmap-load-and-relmap-write-functions.patch
- v14-0002-Extend-relmap-interfaces.patch
- v14-0003-Allow-ReadBufferWithoutRelcache-to-support-unlog.patch
- v14-0004-New-interface-to-lock-relation-id.patch
- v14-0005-WAL-logged-CREATE-DATABASE.patch
- v14-0006-Support-create-database-strategy-in-createdb-too.patch
On Sat, Mar 12, 2022 at 12:36 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > So here my stand is that we need to drop the database buffers and remove > pending sync requests because we are deleting the underlying files; in > some extreme cases where we do not delete the files, there is no need to > drop the buffers or remove the pending sync requests, and the worst > consequence would be a waste of disk space. Hmm, I guess you're right. On Mon, Mar 14, 2022 at 7:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > - Removed the unused structure and macro, because we are using the same WAL > record for copying the database using the old method and for creating the > directory and version file for the new method. Do you think we should > introduce a new WAL record for that instead of using the same one? I think it would make sense to have two different WAL records, e.g. XLOG_DBASE_CREATE_WAL_LOG and XLOG_DBASE_CREATE_FILE_COPY. Then it's easy to see how this could be generalized to other strategies in the future. -- Robert Haas EDB: http://www.enterprisedb.com
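That would mean replacing the single record type with two, along these lines (record names as suggested above; the info-bit values are illustrative):

    /* WAL record types for the two CREATE DATABASE strategies */
    #define XLOG_DBASE_CREATE_FILE_COPY 0x00
    #define XLOG_DBASE_CREATE_WAL_LOG   0x10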
On Mon, Mar 14, 2022 at 7:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Other than that, I have fixed some mistakes in comments and supported > tab completion for the new options. I was looking at 0001 and 0002 again and realized that I swapped the names load_relmap_file() and read_relmap_file() from what I should have done. Here's a revised version. With this, read_relmap_file() and write_relmap_file() become functions that just read and write the file without touching any global variables, and load_relmap_file() is the function that reads data from the file and puts it into a global variable, which seems more sensible than the way I had it before. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Mon, Mar 14, 2022 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 14, 2022 at 7:51 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Other than that, I have fixed some mistakes in comments and supported > > tab completion for the new options. > > I was looking at 0001 and 0002 again and realized that I swapped the > names load_relmap_file() and read_relmap_file() from what I should > have done. Here's a revised version. With this, read_relmap_file() and > write_relmap_file() become functions that just read and write the file > without touching any global variables, and load_relmap_file() is the > function that reads data from the file and puts it into a global > variable, which seems more sensible than the way I had it before. Regarding 0003 and 0005, I'm not a fan of 'bool isunlogged'. I think 'bool permanent' would be better (note BM_PERMANENT). This would involve reversing true and false. Regarding 0004, I can't really see a reason for this function to take a LockRelId as a parameter rather than two separate OIDs. I also can't entirely see why it should be called LockRelationId. Maybe LockRelationInDatabaseById(Oid dbid, Oid relid, LOCKMODE lockmode)? Note that neither caller actually has a LockRelId available; both have to construct one. Regarding 0005: + CREATEDB_WAL_LOG = 0, + CREATEDB_FILE_COPY = 1 I still think you don't need = 0 and = 1 here. I'll probably go through and do a pass over the comments once you post the next version of this. There seems to be work needed in a bunch of places, but it probably makes more sense for me to go through and adjust the things that seem to need it rather than listing a bunch of changes for you to make. -- Robert Haas EDB: http://www.enterprisedb.com
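For concreteness, the two smaller review points above amount to something like the following sketch (based on the existing buffer-manager API; not the patch itself):

/* The strategy enum, without the redundant explicit values: */
typedef enum CreateDBStrategy
{
    CREATEDB_WAL_LOG,
    CREATEDB_FILE_COPY
} CreateDBStrategy;

/*
 * And the relcache-free buffer read taking 'permanent' (matching the
 * sense of BM_PERMANENT) instead of 'isunlogged':
 */
extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
                                        ForkNumber forkNum,
                                        BlockNumber blockNum,
                                        ReadBufferMode mode,
                                        BufferAccessStrategy strategy,
                                        bool permanent);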
On Mon, Mar 14, 2022 at 10:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Regarding 0004, I can't really see a reason for this function to take > a LockRelId as a parameter rather than two separate OIDs. I also can't > entirely see why it should be called LockRelationId. Maybe > LockRelationInDatabaseById(Oid dbid, Oid relid, LOCKMODE lockmode)? > Note that neither caller actually has a LockRelId available; both have > to construct one. Actually we already have an existing function UnlockRelationId(LockRelId *relid, LOCKMODE lockmode) so it makes more sense to have a parallel lock function. Do you still think we should change? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Mar 14, 2022 at 12:44 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Mon, Mar 14, 2022 at 10:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Regarding 0004, I can't really see a reason for this function to take > > a LockRelId as a parameter rather than two separate OIDs. I also can't > > entirely see why it should be called LockRelationId. Maybe > > LockRelationInDatabaseById(Oid dbid, Oid relid, LOCKMODE lockmode)? > > Note that neither caller actually has a LockRelId available; both have > > to construct one. > > Actually we already have an existing function > UnlockRelationId(LockRelId *relid, LOCKMODE lockmode) so it makes more > sense to have a parallel lock function. Do you still think we should > change? Oh! OK, well, then what you did makes sense, for consistency. Didn't realize that. -- Robert Haas EDB: http://www.enterprisedb.com
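For reference, the symmetry being aimed at looks like this in use; the variable names here are made up for illustration:

/*
 * Illustrative usage of the new LockRelationId(), mirroring the
 * existing UnlockRelationId().
 */
LockRelId   relid;

relid.dbId = src_dboid;          /* database the relation lives in */
relid.relId = relation_oid;      /* the relation to lock */

LockRelationId(&relid, AccessShareLock);
/* ... read the relation's blocks without a relcache entry ... */
UnlockRelationId(&relid, AccessShareLock);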
On Mon, Mar 14, 2022 at 10:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > I think it would make sense to have two different WAL records e.g. > XLOG_DBASE_CREATE_WAL_LOG and XLOG_DBASE_CREATE_FILE_COPY. Then it's > easy to see how this could be generalized to other strategies in the > future. Done that way. In dbase_desc(), for XLOG_DBASE_CREATE_FILE_COPY I have kept the older description i.e. "copy dir" and for XLOG_DBASE_CREATE_WAL_LOG it is "create dir", because logically the first one is actually copying and the second one is just creating the directory. Do you think we should be using "copy dir file_copy" and "copy dir wal_log" in the description as well? > On Mon, Mar 14, 2022 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > > I was looking at 0001 and 0002 again and realized that I swapped the > > names load_relmap_file() and read_relmap_file() from what I should > > have done. Here's a revised version. With this, read_relmap_file() and > > write_relmap_file() become functions that just read and write the file > > without touching any global variables, and load_relmap_file() is the > > function that reads data from the file and puts it into a global > > variable, which seems more sensible than the way I had it before. Okay, I have included this patch and rebased other patches on top of that. > Regarding 0003 and 0005, I'm not a fan of 'bool isunlogged'. I think > 'bool permanent' would be better (note BM_PERMANENT). This would > involve reversing true and false. Okay changed. > Regarding 0005: > > + CREATEDB_WAL_LOG = 0, > + CREATEDB_FILE_COPY = 1 > > I still think you don't need = 0 and = 1 here. Done > I'll probably go through and do a pass over the comments once you post > the next version of this. There seems to be work needed in a bunch of > places, but it probably makes more sense for me to go through and > adjust the things that seem to need it rather than listing a bunch of > changes for you to make. Sure, thanks. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v15-0001-Refactor-relmap-load-and-relmap-write-functions.patch
- v15-0003-Allow-ReadBufferWithoutRelcache-to-support-unlog.patch
- v15-0005-WAL-logged-CREATE-DATABASE.patch
- v15-0004-New-interface-to-lock-relation-id.patch
- v15-0002-Extend-relmap-interfaces.patch
- v15-0006-Support-create-database-strategy-in-createdb-too.patch
Few comments on the latest patch: - /* We need to construct the pathname for this database */ - dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid); + if (xlrec->dbid != InvalidOid) + dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid); + else + dbpath = pstrdup("global"); Do we really need this change? Is GetDatabasePath() alone not capable of handling it? == +static CreateDBRelInfo *ScanSourceDatabasePgClassTuple(HeapTupleData *tuple, + Oid tbid, Oid dbid, + char *srcpath); +static List *ScanSourceDatabasePgClassPage(Page page, Buffer buf, Oid tbid, + Oid dbid, char *srcpath, + List *rnodelist, Snapshot snapshot); +static List *ScanSourceDatabasePgClass(Oid srctbid, Oid srcdbid, char *srcpath); I think we can shorten these function names to probably ScanSourceDBPgClassRel(), ScanSourceDBPgClassTuple() and likewise? -- With Regards, Ashutosh Sharma. On Tue, Mar 15, 2022 at 3:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Mar 14, 2022 at 10:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > I think it would make sense to have two different WAL records e.g. > > XLOG_DBASE_CREATE_WAL_LOG and XLOG_DBASE_CREATE_FILE_COPY. Then it's > > easy to see how this could be generalized to other strategies in the > > future. > > Done that way. In dbase_desc(), for XLOG_DBASE_CREATE_FILE_COPY I > have kept the older description i.e. "copy dir" and for > XLOG_DBASE_CREATE_WAL_LOG it is "create dir", because logically the > first one is actually copying and the second one is just creating the > directory. Do you think we should be using "copy dir file_copy" and > "copy dir wal_log" in the description as well? > > > On Mon, Mar 14, 2022 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > I was looking at 0001 and 0002 again and realized that I swapped the > > > names load_relmap_file() and read_relmap_file() from what I should > > > have done. Here's a revised version. With this, read_relmap_file() and > > > write_relmap_file() become functions that just read and write the file > > > without touching any global variables, and load_relmap_file() is the > > > function that reads data from the file and puts it into a global > > > variable, which seems more sensible than the way I had it before. > > Okay, I have included this patch and rebased other patches on top of that. > > > Regarding 0003 and 0005, I'm not a fan of 'bool isunlogged'. I think > > 'bool permanent' would be better (note BM_PERMANENT). This would > > involve reversing true and false. > > Okay changed. > > > Regarding 0005: > > > > + CREATEDB_WAL_LOG = 0, > > + CREATEDB_FILE_COPY = 1 > > > > I still think you don't need = 0 and = 1 here. > > Done > > > I'll probably go through and do a pass over the comments once you post > > the next version of this. There seems to be work needed in a bunch of > > places, but it probably makes more sense for me to go through and > > adjust the things that seem to need it rather than listing a bunch of > > changes for you to make. > > Sure, thanks. > > -- > Regards, > Dilip Kumar > EnterpriseDB: http://www.enterprisedb.com
On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Few comments on the latest patch: > > - /* We need to construct the pathname for this database */ > - dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid); > + if (xlrec->dbid != InvalidOid) > + dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid); > + else > + dbpath = pstrdup("global"); > > Do we really need this change? Is GetDatabasePath() alone not capable > of handling it? Well, I mean, that function has a special case for GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid is 0. > I think we can shorten these function names to probably > ScanSourceDBPgClassRel(), ScanSourceDBPgClassTuple() and likewise? We could, but I don't think it's an improvement. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Mar 15, 2022 at 10:17 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > Few comments on the latest patch: > > > > - /* We need to construct the pathname for this database */ > > - dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid); > > + if (xlrec->dbid != InvalidOid) > > + dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid); > > + else > > + dbpath = pstrdup("global"); > > > > Do we really need this change? Is GetDatabasePath() alone not capable > > of handling it? > > Well, I mean, that function has a special case for > GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid > is 0. > Wouldn't this be true only in case of a shared map file (when dbOid is Invalid and tblspcOid is globaltablespace_oid) or am I missing something? -- With Regards, Ashutosh Sharma.
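For reference, the relevant logic in GetDatabasePath() (paraphrased from src/common/relpath.c) branches on the tablespace OID alone, and the only case with an InvalidOid database OID is the shared map file, whose tablespace is the global one, so it is already handled without the special case quoted above:

/* Paraphrased from src/common/relpath.c: */
char *
GetDatabasePath(Oid dbNode, Oid spcNode)
{
    if (spcNode == GLOBALTABLESPACE_OID)
    {
        /* Shared system relations live in {datadir}/global */
        Assert(dbNode == 0);
        return pstrdup("global");
    }
    else if (spcNode == DEFAULTTABLESPACE_OID)
        return psprintf("base/%u", dbNode);
    else
        return psprintf("pg_tblspc/%u/%s/%u",
                        spcNode, TABLESPACE_VERSION_DIRECTORY, dbNode);
}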
On Tue, Mar 15, 2022 at 1:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > Few comments on the latest patch: > > > > > > - /* We need to construct the pathname for this database */ > > > - dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid); > > > + if (xlrec->dbid != InvalidOid) > > > + dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid); > > > + else > > > + dbpath = pstrdup("global"); > > > > > > Do we really need this change? Is GetDatabasePath() alone not capable > > > of handling it? > > > > Well, I mean, that function has a special case for > > GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid > > is 0. > > > > Wouldn't this be true only in case of a shared map file (when dbOid is > Invalid and tblspcOid is globaltablespace_oid) or am I missing > something? *facepalm* Good catch, sorry that I'm slow on the uptake today. v3 attached. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Tue, Mar 15, 2022 at 11:09 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Mar 15, 2022 at 1:26 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > On Tue, Mar 15, 2022 at 12:30 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > Few comments on the latest patch: > > > > > > > > - /* We need to construct the pathname for this database */ > > > > - dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid); > > > > + if (xlrec->dbid != InvalidOid) > > > > + dbpath = GetDatabasePath(xlrec->dbid, xlrec->tsid); > > > > + else > > > > + dbpath = pstrdup("global"); > > > > > > > > Do we really need this change? Is GetDatabasePath() alone not capable > > > > of handling it? > > > > > > Well, I mean, that function has a special case for > > > GLOBALTABLESPACE_OID, but GLOBALTABLESPACE_OID is 1664, and InvalidOid > > > is 0. > > > > > > > Wouldn't this be true only in case of a shared map file (when dbOid is > > Invalid and tblspcOid is globaltablespace_oid) or am I missing > > something? > > *facepalm* > > Good catch, sorry that I'm slow on the uptake today. > > v3 attached. Thanks, Ashutosh and Robert. The other patches applied cleanly on top of this one; I have still generated a new version so that all the patches can be found together. There are no other changes. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
- v16-0002-Extend-relmap-interfaces.patch
- v16-0001-Refactor-relmap-load-and-relmap-write-functions.patch
- v16-0003-Allow-ReadBufferWithoutRelcache-to-support-unlog.patch
- v16-0004-New-interface-to-lock-relation-id.patch
- v16-0005-WAL-logged-CREATE-DATABASE.patch
- v16-0006-Support-create-database-strategy-in-createdb-too.patch
On Wed, Mar 16, 2022 at 12:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Thanks, Ashutosh and Robert. The other patches applied cleanly on top > of this one; I have still generated a new version so that all the > patches can be found together. There are no other changes. I committed v3 of my refactoring patch, 0001 in this series. I'm working over the comments in the rest of the patch series and will post an updated version when I get done. I think I will likely merge all the remaining patches together just to make it simpler to manage; we can split things out again if we need to do that. One question that occurred to me when looking this over is whether, or why, it's safe against concurrent smgr invalidations. It seems to me that every loop in the new CREATE DATABASE code needs to CHECK_FOR_INTERRUPTS() -- some do already -- and when they do that, I think we might receive an invalidation message that causes us to smgrclose() some or all of the things where we previously did smgropen(). I don't quite see why that can't cause problems here. I tried running the src/bin/scripts regression tests with debug_discard_caches=1 and none of the tests failed, so there may very well be a reason why this is actually totally fine, but I don't know what it is. On the other hand, it may be that things went horribly wrong and the tests just aren't smart enough to catch it, or maybe there's a problematic scenario which those tests just don't hit. I don't know. Thoughts? -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Mar 18, 2022 at 1:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Mar 16, 2022 at 12:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Thanks, Ashutosh and Robert. The other patches applied cleanly on top > > of this one; I have still generated a new version so that all the > > patches can be found together. There are no other changes. > > I committed v3 of my refactoring patch, 0001 in this series. > > I'm working over the comments in the rest of the patch series and will > post an updated version when I get done. I think I will likely merge > all the remaining patches together just to make it simpler to manage; > we can split things out again if we need to do that. Thanks for the effort. > One question that occurred to me when looking this over is whether, or > why, it's safe against concurrent smgr invalidations. We are only accessing the smgr of the source database and the destination database. And there is no one else that can be connected to the source db and the destination db is not visible to anyone. So do we really need to worry about the concurrent smgr invalidation? What am I missing? > It seems to me that every loop in the new CREATE DATABASE code needs to > CHECK_FOR_INTERRUPTS() -- some do already -- and when they do that, Yes, the pg_class reading code is missing this check, so we need to add it. But the copying code like CreateDatabaseUsingWalLog() has it inside the innermost loop in RelationCopyStorageUsingBuffer(), and similarly CreateDatabaseUsingFileCopy() has it in copydir(). Maybe we should put it in all the loops so that we do not skip the check due to some condition. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
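To make this concrete, here is a sketch of the pg_class block-scan loop with the missing check added at the top; the loop shape follows the patch's function names, but the surrounding declarations are omitted and the exact code is illustrative:

/* Sketch only: interrupt check added to the pg_class scan loop. */
for (blkno = 0; blkno < nblocks; blkno++)
{
    CHECK_FOR_INTERRUPTS();     /* the check that was missing */

    buf = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blkno,
                                    RBM_NORMAL, NULL, true);
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    rnodelist = ScanSourceDatabasePgClassPage(BufferGetPage(buf), buf,
                                              tbid, dbid, srcpath,
                                              rnodelist, snapshot);
    UnlockReleaseBuffer(buf);
}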
On Fri, Mar 18, 2022 at 12:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > One question that occurred to me when looking this over is whether, or > > why, it's safe against concurrent smgr invalidations. > > We are only accessing the smgr of the source database and the > destination database. And there is no one else that can be connected > to the source db and the destination db is not visible to anyone. So > do we really need to worry about the concurrent smgr invalidation? > What am I missing? A sinval reset can occur at any moment due to an overflow of the queue. That acts as a universal reset of everything. So you can't reason on the basis of what somebody might be sending. -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, Mar 20, 2022 at 12:03 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Mar 18, 2022 at 12:39 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > One question that occurred to me when looking this over is whether, or > > > why, it's safe against concurrent smgr invalidations. > > > > We are only accessing the smgr of the source database and the > > destination database. And there is no one else that can be connected > > to the source db and the destination db is not visible to anyone. So > > do we really need to worry about the concurrent smgr invalidation? > > What am I missing? > > A sinval reset can occur at any moment due to an overflow of the > queue. That acts as a universal reset of everything. So you can't > reason on the basis of what somebody might be sending. I thought that way because IIUC, when we are locking the database tuple we are ensuring that we are calling ReceiveSharedInvalidMessages() right? And IIUC ReceiveSharedInvalidMessages(), is designed such a way that it will consume all the outstanding messages and that's the reason it loops multiple times if it identifies that the queue is full. And if my assumption here is correct then I think it is also correct that now we only need to worry about anyone generating new invalidations and that is not possible in this case. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sun, Mar 20, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I thought that way because IIUC, when we are locking the database > tuple we are ensuring that we are calling > ReceiveSharedInvalidMessages() right? And IIUC > ReceiveSharedInvalidMessages(), is designed such a way that it will > consume all the outstanding messages and that's the reason it loops > multiple times if it identifies that the queue is full. And if my > assumption here is correct then I think it is also correct that now we > only need to worry about anyone generating new invalidations and that > is not possible in this case. Well, I don't see how that chain of logic addresses my concern about sinval reset. Mind you, I'm not sure there's an actual problem here, because I tried testing the patch with debug_discard_caches=1 and nothing failed. But I still don't understand WHY nothing failed. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Mar 21, 2022 at 7:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Sun, Mar 20, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I thought that way because IIUC, when we are locking the database > > tuple we are ensuring that we are calling > > ReceiveSharedInvalidMessages() right? And IIUC > > ReceiveSharedInvalidMessages(), is designed such a way that it will > > consume all the outstanding messages and that's the reason it loops > > multiple times if it identifies that the queue is full. And if my > > assumption here is correct then I think it is also correct that now we > > only need to worry about anyone generating new invalidations and that > > is not possible in this case. > > Well, I don't see how that chain of logic addresses my concern about > sinval reset. > > Mind you, I'm not sure there's an actual problem here, because I tried > testing the patch with debug_discard_caches=1 and nothing failed. But > I still don't understand WHY nothing failed. Okay, I see what you are saying. Yeah this looks like a problem to me as well. I will try to reproduce this issue. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Mar 21, 2022 at 8:29 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Mar 21, 2022 at 7:07 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Sun, Mar 20, 2022 at 1:34 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > I thought that way because IIUC, when we are locking the database > > > tuple we are ensuring that we are calling > > > ReceiveSharedInvalidMessages() right? And IIUC > > > ReceiveSharedInvalidMessages(), is designed such a way that it will > > > consume all the outstanding messages and that's the reason it loops > > > multiple times if it identifies that the queue is full. And if my > > > assumption here is correct then I think it is also correct that now we > > > only need to worry about anyone generating new invalidations and that > > > is not possible in this case. > > > > Well, I don't see how that chain of logic addresses my concern about > > sinval reset. > > > > Mind you, I'm not sure there's an actual problem here, because I tried > > testing the patch with debug_discard_caches=1 and nothing failed. But > > I still don't understand WHY nothing failed. > > Okay, I see what you are saying. Yeah this looks like a problem to me > as well. I will try to reproduce this issue. I tried to debug the case, but I realized that CHECK_FOR_INTERRUPTS() is not calling AcceptInvalidationMessages(), and I could not find any such call while looking through the code either. While debugging I noticed that AcceptInvalidationMessages() is called multiple times, but only through LockRelationId(); and by the time we lock the relation we have already closed the previous smgr, because we keep only one smgr open at a time. That's the reason it is not hitting the issue we thought it could. Is there any condition under which AcceptInvalidationMessages() is called through CHECK_FOR_INTERRUPTS()? I could not find one either while debugging or in the code. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Mar 21, 2022 at 11:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I tried to debug the case but I realized that somehow > CHECK_FOR_INTERRUPTS() is not calling the > AcceptInvalidationMessages() and I could not find the same while > looking into the code as well. While debugging I noticed that > AcceptInvalidationMessages() is called multiple times but that is only > through LockRelationId() but while locking the relation we had already > closed the previous smgr because at a time we keep only one smgr open. > And that's the reason it is not hitting the issue which we think it > could. Is there any condition under which it will call > AcceptInvalidationMessages() through CHECK_FOR_INTERRUPTS() ? because > I could not see while debugging as well as in code. Yeah, I think the reason you can't find it is that it's not there. I was confused in what I wrote earlier. I think we only process sinval catchups when we're idle, not at every CHECK_FOR_INTERRUPTS(). And I think the reason for that is precisely that it would be hard to write correct code otherwise, since invalidations might then get processed in a lot more places. So ... I guess all we really need to do here is avoid assuming that the results of smgropen() are valid across any code that might acquire relation locks. Which I think the code already does. But on a related note, why doesn't CreateDatabaseUsingWalLog() acquire locks on both the source and destination relations? It looks like you're only taking locks for the source template database ... but I thought the intention here was to make sure that we didn't pull pages into shared_buffers without holding a lock on the relation and/or the database? I suppose the point is that while the template database might be concurrently dropped, nobody can be doing anything concurrently to the target database because nobody knows that it exists yet. Still, I think that this would be the only case where we let pages into shared_buffers without a relation or database lock, though maybe I'm confused about this point, too. If not, perhaps we should consider locking the target database OID and each relation OID as we are copying it? I guess I'm imagining that there might be more code pathways in the future that want to ensure that there are no remaining buffers for some particular database or relation OID. It seems natural to want to be able to take some lock that prevents buffers from being added, and then go and get rid of all the ones that are there already. But I admit I can't quite think of a concrete case where we'd want to do something like this where the patch as coded would be a problem. I'm just thinking perhaps taking locks is fairly harmless and might avoid some hypothetical problem later. Thoughts? -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Mar 21, 2022 at 11:53 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Mar 21, 2022 at 11:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I tried to debug the case but I realized that somehow > > CHECK_FOR_INTERRUPTS() is not calling the > > AcceptInvalidationMessages() and I could not find the same while > > looking into the code as well. While debugging I noticed that > > AcceptInvalidationMessages() is called multiple times but that is only > > through LockRelationId() but while locking the relation we had already > > closed the previous smgr because at a time we keep only one smgr open. > > And that's the reason it is not hitting the issue which we think it > > could. Is there any condition under which it will call > > AcceptInvalidationMessages() through CHECK_FOR_INTERRUPTS() ? because > > I could not see while debugging as well as in code. > > Yeah, I think the reason you can't find it is that it's not there. I > was confused in what I wrote earlier. I think we only process sinval > catchups when we're idle, not at every CHECK_FOR_INTERRUPTS(). And I > think the reason for that is precisely that it would be hard to write > correct code otherwise, since invalidations might then get processed > in a lot more places. So ... I guess all we really need to do here is > avoid assuming that the results of smgropen() are valid across any > code that might acquire relation locks. Which I think the code already > does. > > But on a related note, why doesn't CreateDatabaseUsingWalLog() acquire > locks on both the source and destination relations? It looks like > you're only taking locks for the source template database ... but I > thought the intention here was to make sure that we didn't pull pages > into shared_buffers without holding a lock on the relation and/or the > database? I suppose the point is that while the template database > might be concurrently dropped, nobody can be doing anything > concurrently to the target database because nobody knows that it > exists yet. Still, I think that this would be the only case where we > let pages into shared_buffers without a relation or database lock, > though maybe I'm confused about this point, too. If not, perhaps we > should consider locking the target database OID and each relation OID > as we are copying it? > > I guess I'm imagining that there might be more code pathways in the > future that want to ensure that there are no remaining buffers for > some particular database or relation OID. It seems natural to want to > be able to take some lock that prevents buffers from being added, and > then go and get rid of all the ones that are there already. But I > admit I can't quite think of a concrete case where we'd want to do > something like this where the patch as coded would be a problem. I'm > just thinking perhaps taking locks is fairly harmless and might avoid > some hypothetical problem later. > > Thoughts? I think this makes sense. I haven't changed the original patch, since you said you were improving some comments; so, to avoid conflicts, I have created this add-on patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Mar 22, 2022 at 10:28 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > I think this makes sense. I haven't changed the original patch, since > you said you were improving some comments; so, to avoid conflicts, I > have created this add-on patch. > In my previous patch I mistakenly used src_dboid instead of dest_dboid; fixed in this version. For the destination db I have used AccessShareLock as the lock mode. Access-wise, we don't want anyone else to be accessing that db, but that is anyway protected because it is not visible to anyone else. So I think AccessShareLock should be correct here, because we are taking this lock only because we are accessing pages of this database's relations in shared buffers. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
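To illustrate the locking being described, a minimal sketch follows; whether LockSharedObject() is the right helper here, and the exact lock modes, are assumptions rather than the patch's final form:

/*
 * Sketch of database-level locking around the copy; helper and modes
 * are assumptions.
 */
/* The template database could be dropped concurrently, so lock it. */
LockSharedObject(DatabaseRelationId, src_dboid, 0, AccessShareLock);

/*
 * The target is invisible to everyone else; we lock it only because
 * we are about to pull its pages into shared buffers.
 */
LockSharedObject(DatabaseRelationId, dest_dboid, 0, AccessShareLock);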
On Tue, Mar 22, 2022 at 5:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > In my previous patch I mistakenly used src_dboid instead of > dest_dboid; fixed in this version. For the destination db I have used > AccessShareLock as the lock mode. Access-wise, we don't want anyone > else to be accessing that db, but that is anyway protected because it > is not visible to anyone else. So I think AccessShareLock should be > correct here, because we are taking this lock only because we are > accessing pages of this database's relations in shared buffers. Here's my worked-over version of your previous patch. I haven't tried to incorporate your incremental patch that you just posted. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Tue, Mar 22, 2022 at 11:23 AM Robert Haas <robertmhaas@gmail.com> wrote: > Here's my worked-over version of your previous patch. I haven't tried > to incorporate your incremental patch that you just posted. Also, please have a look at the XXX comments that I added in a few places where I think you need to make further changes. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2022-03-22 11:23:16 -0400, Robert Haas wrote: > From 116bcdb6174a750b7ef7ae05ef6f39cebaf9bcf5 Mon Sep 17 00:00:00 2001 > From: Robert Haas <rhaas@postgresql.org> > Date: Tue, 22 Mar 2022 11:22:26 -0400 > Subject: [PATCH v1] Add new block-by-block strategy for CREATE DATABASE. I might have missed it because I just skimmed the patch. But I still think it should contain a comment detailing why accessing catalogs from another database is safe in this instance, and perhaps a comment or three in places that could break it (e.g. snapshot computation, horizon stuff). Greetings, Andres Freund
On Tue, Mar 22, 2022 at 11:42 AM Andres Freund <andres@anarazel.de> wrote: > I might have missed it because I just skimmed the patch. But I still think it > should contain a comment detailing why accessing catalogs from another > database is safe in this instance, and perhaps a comment or three in places > that could break it (e.g. snapshot computation, horizon stuff). Please see the function header comment for ScanSourceDatabasePgClass. I don't quite see how changes in those places would break this, but if you want to be more specific perhaps I will see the light? -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Mar 21, 2022 at 2:23 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 21, 2022 at 11:21 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I tried to debug the case but I realized that somehow > > CHECK_FOR_INTERRUPTS() is not calling the > > AcceptInvalidationMessages() and I could not find the same while > > looking into the code as well. While debugging I noticed that > > AcceptInvalidationMessages() is called multiple times but that is only > > through LockRelationId() but while locking the relation we had already > > closed the previous smgr because at a time we keep only one smgr open. > > And that's the reason it is not hitting the issue which we think it > > could. Is there any condition under which it will call > > AcceptInvalidationMessages() through CHECK_FOR_INTERRUPTS() ? because > > I could not see while debugging as well as in code. > > Yeah, I think the reason you can't find it is that it's not there. I > was confused in what I wrote earlier. I think we only process sinval > catchups when we're idle, not at every CHECK_FOR_INTERRUPTS(). And I > think the reason for that is precisely that it would be hard to write > correct code otherwise, since invalidations might then get processed > in a lot more places. So ... I guess all we really need to do here is > avoid assuming that the results of smgropen() are valid across any > code that might acquire relation locks. Which I think the code already > does. So I talked to Andres and Thomas about this and they told me that I was right to worry about this problem. Over on the thread about "wrong fds used for refilenodes after pg_upgrade relfilenode changes Reply-To:" there is a plan to make use ProcSignalBarrier to make smgr objects disappear, and ProcSignalBarrier can be processed at any CHECK_FOR_INTERRUPTS(), so then we'd have a problem here. Commit f10f0ae420ee62400876ab34dca2c09c20dcd030 established a policy that you should always re-fetch the smgr object instead of reusing one you've already got, and even before that it was known to be unsafe to keep them around for any period of time, because anything that opened a relation, including a syscache lookup, could potentially accept invalidations. So most of our code is already hardened against the possibility of smgr objects disappearing. I have a feeling there may be some that isn't, but it would be good if this patch didn't introduce more such code at the same time that patch is trying to introduce more ways to get rid of smgr objects. It was suggested to me that what this patch ought to be doing is calling CreateFakeRelcacheEntry() and then using RelationGetSmgr(fakerel) every time we need the SmgrRelation, without ever keeping it around for any amount of code. That way, if the smgr relation gets closed out from under us at a CHECK_FOR_INTERRUPTS(), we'll just recreate it at the next RelationGetSmgr() call. Andres also noted that he thinks the patch performs redundant cleanup, because of the fact that it uses RelationCreateStorage. That will arrange to remove files on abort, but createdb() also has its own mechanism for that. It doesn't seem like a thing to do twice in two different ways. -- Robert Haas EDB: http://www.enterprisedb.com
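To make the suggested pattern concrete, here is a minimal sketch of a block-copy loop written that way, using the existing CreateFakeRelcacheEntry()/RelationGetSmgr() APIs; the surrounding variables are illustrative:

/*
 * Minimal sketch: never cache the SMgrRelation across a possible
 * interrupt point; re-derive it via RelationGetSmgr() on a fake
 * relcache entry at every use, so an smgrclose() triggered at a
 * CHECK_FOR_INTERRUPTS() cannot leave a dangling pointer behind.
 */
Relation        rel = CreateFakeRelcacheEntry(rnode);
PGAlignedBlock  buf;
BlockNumber     nblocks;
BlockNumber     blkno;

nblocks = smgrnblocks(RelationGetSmgr(rel), forkNum);

for (blkno = 0; blkno < nblocks; blkno++)
{
    CHECK_FOR_INTERRUPTS();

    /* Fetch the smgr handle afresh; do not hoist it out of the loop. */
    smgrread(RelationGetSmgr(rel), forkNum, blkno, buf.data);

    /* ... write the block out to the destination relation ... */
}

FreeFakeRelcacheEntry(rel);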
On Tue, Mar 22, 2022 at 8:53 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Mar 22, 2022 at 5:00 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > In my previous patch I mistakenly used src_dboid instead of > > dest_dboid; fixed in this version. For the destination db I have used > > AccessShareLock as the lock mode. Access-wise, we don't want anyone > > else to be accessing that db, but that is anyway protected because it > > is not visible to anyone else. So I think AccessShareLock should be > > correct here, because we are taking this lock only because we are > > accessing pages of this database's relations in shared buffers. > > Here's my worked-over version of your previous patch. I haven't tried > to incorporate your incremental patch that you just posted. Thanks for working on the comments. Please find the updated version, which includes the changes below: - Worked on the XXX comments added by you. - Added a database-level lock for the target database as well. - Used a fake relcache entry and removed direct access to the smgr. I think it was not really necessary in ScanSourceDatabasePgClass(), because we use the smgr there only for a very short period, but I have changed it anyway; let me know if you think creating the fake relcache entry there is unnecessary. - Removed an extra space in createdb.c and fixed a test case. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Mar 23, 2022 at 2:14 AM Robert Haas <robertmhaas@gmail.com> wrote: > > So I talked to Andres and Thomas about this and they told me that I > was right to worry about this problem. Over on the thread about "wrong > fds used for refilenodes after pg_upgrade relfilenode changes > Reply-To:" there is a plan to make use ProcSignalBarrier to make smgr > objects disappear, and ProcSignalBarrier can be processed at any > CHECK_FOR_INTERRUPTS(), so then we'd have a problem here. Commit > f10f0ae420ee62400876ab34dca2c09c20dcd030 established a policy that you > should always re-fetch the smgr object instead of reusing one you've > already got, and even before that it was known to be unsafe to keep > them around for any period of time, because anything that opened a > relation, including a syscache lookup, could potentially accept > invalidations. So most of our code is already hardened against the > possibility of smgr objects disappearing. I have a feeling there may > be some that isn't, but it would be good if this patch didn't > introduce more such code at the same time that patch is trying to > introduce more ways to get rid of smgr objects. It was suggested to me > that what this patch ought to be doing is calling > CreateFakeRelcacheEntry() and then using RelationGetSmgr(fakerel) > every time we need the SmgrRelation, without ever keeping it around > for any amount of code. That way, if the smgr relation gets closed out > from under us at a CHECK_FOR_INTERRUPTS(), we'll just recreate it at > the next RelationGetSmgr() call. Okay, I have changed this in my latest version of the patch. > Andres also noted that he thinks the patch performs redundant cleanup, > because of the fact that it uses RelationCreateStorage. That will > arrange to remove files on abort, but createdb() also has its own > mechanism for that. It doesn't seem like a thing to do twice in two > different ways. Okay this is an interesting point. So one option is that in case of failure while using the wal log strategy we do not remove the database directory, because an abort transaction will take care of removing the relation file. But then in failure case we will leave the orphaned database directory with version file and the relmap file. Another option is to do the redundant cleanup as we are doing now. Any other options? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 23, 2022 at 4:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Okay this is an interesting point. So one option is that in case of > failure while using the wal log strategy we do not remove the database > directory, because an abort transaction will take care of removing the > relation file. But then in failure case we will leave the orphaned > database directory with version file and the relmap file. Another > option is to do the redundant cleanup as we are doing now. Any other > options? I think our overriding goal should be to get everything using one mechanism. It doesn't look straightforward to get everything to go through the PendingRelDelete mechanism, because as you say, it can't handle non-relation files or directories. However, what if we opt out of that mechanism? We could do that either by not using RelationCreateStorage() in the first place and directly calling smgrcreate(), or by using RelationPreserveStorage() afterwards to yank the file back out of the list. -- Robert Haas EDB: http://www.enterprisedb.com
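In code, the two opt-out routes would look roughly like this; which one is preferable is exactly the open question, and the argument semantics of RelationPreserveStorage() below are my assumption (see the comments in storage.c):

/* Route 1: bypass the pending-delete machinery entirely. */
SMgrRelation srel = smgropen(rnode, InvalidBackendId);

smgrcreate(srel, MAIN_FORKNUM, false);
log_smgrcreate(&rnode, MAIN_FORKNUM);

/*
 * Route 2: create through the usual path, then yank the entry back
 * out of the pending-delete list (second argument assumed here).
 */
RelationCreateStorage(rnode, relpersistence);
RelationPreserveStorage(rnode, true);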
On Wed, Mar 23, 2022 at 5:54 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Mar 23, 2022 at 4:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Okay this is an interesting point. So one option is that in case of > > failure while using the wal log strategy we do not remove the database > > directory, because an abort transaction will take care of removing the > > relation file. But then in failure case we will leave the orphaned > > database directory with version file and the relmap file. Another > > option is to do the redundant cleanup as we are doing now. Any other > > options? > > I think our overriding goal should be to get everything using one > mechanism. It doesn't look straightforward to get everything to go > through the PendingRelDelete mechanism, because as you say, it can't > handle non-relation files or directories. However, what if we opt out > of that mechanism? We could do that either by not using > RelationCreateStorage() in the first place and directly calling > smgrcreate(), or by using RelationPreserveStorage() afterwards to yank > the file back out of the list. I think directly using smgrcreate() is a better idea than first registering and then unregistering. I have made that change in the attached patch. After this change we could also merge the creation of MAIN_FORKNUM into the loop below where we create the other forks[1], with one extra condition, but I think the current code is more in sync with other code that does similar things, so I have not merged it into the loop. Please let me know if you think otherwise. [1] + /* + * Create and copy all forks of the relation. We are not using + * RelationCreateStorage() as it is registering the cleanup for the + * underlying relation storage on the transaction abort. But during create + * database failure, we have a separate cleanup mechanism for the whole + * database directory. Therefore, we don't need to register cleanup for + * each individual relation storage. + */ + smgrcreate(RelationGetSmgr(dst_rel), MAIN_FORKNUM, false); + if (permanent) + log_smgrcreate(&dst_rnode, MAIN_FORKNUM); + + /* copy main fork. */ + RelationCopyStorageUsingBuffer(src_rel, dst_rel, MAIN_FORKNUM, permanent); + + /* copy those extra forks that exist */ + for (ForkNumber forkNum = MAIN_FORKNUM + 1; + forkNum <= MAX_FORKNUM; forkNum++) + { + if (smgrexists(RelationGetSmgr(src_rel), forkNum)) + { + smgrcreate(RelationGetSmgr(dst_rel), forkNum, false); + -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Mar 23, 2022 at 9:19 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I think directly using smgrcreate() is a better idea instead of first > registering and then unregistering it. I have made that change in > the attached patch. After this change now we can merge creating the > MAIN_FORKNUM also in the loop below where we are creating other > fork[1] with one extra condition but I think current code is in more > sync with the other code where we are doing the similar things so I > have not merged it in the loop. Please let me know if you think > otherwise. Generally I think our practice is that we do the main fork unconditionally (because it should always be there) and the others only if they exist. I suggest that you make this consistent with that, but you could do it like if (forkNum != MAIN_FORKNUM && !smgrexists(...)) continue if that seems nicer. Do you think that this version handles pending syncs correctly? I think perhaps that is overlooked. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Mar 23, 2022 at 7:03 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Mar 23, 2022 at 9:19 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I think directly using smgrcreate() is a better idea instead of first > > registering and then unregistering it. I have made that change in > > the attached patch. After this change now we can merge creating the > > MAIN_FORKNUM also in the loop below where we are creating other > > fork[1] with one extra condition but I think current code is in more > > sync with the other code where we are doing the similar things so I > > have not merged it in the loop. Please let me know if you think > > otherwise. > > Generally I think our practice is that we do the main fork > unconditionally (because it should always be there) and the others > only if they exist. I suggest that you make this consistent with that, > but you could do it like if (forkNum != MAIN_FORKNUM && > !smgrexists(...)) continue if that seems nicer. Maybe we can do that. > Do you think that this version handles pending syncs correctly? I > think perhaps that is overlooked. Yeah, I missed that. So the options are either to go back to the other approach and call RelationPreserveStorage() after RelationCreateStorage(), or to expose the AddPendingSync() function from the storage layer and use it conditionally. If we are going to expose that API, then we had better rename it to RelationAddPendingSync(). Honestly, I do not have any specific preference here. I can try both approaches and send both if you or anyone else have no preference here? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi, On 2022-03-23 18:49:11 +0530, Dilip Kumar wrote: > I think directly using smgrcreate() is a better idea instead of first > registering and then unregistering it. I have made that change in > the attached patch. After this change now we can merge creating the > MAIN_FORKNUM also in the loop below where we are creating other > fork[1] with one extra condition but I think current code is in more > sync with the other code where we are doing the similar things so I > have not merged it in the loop. Please let me know if you think > otherwise. FWIW, this fails tests: https://cirrus-ci.com/build/4929662173315072 https://cirrus-ci.com/task/6651773434724352?logs=test_bin#L121 https://cirrus-ci.com/task/6088823481303040?logs=test_world#L2377 Greetings, Andres Freund
On Wed, Mar 23, 2022 at 9:13 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-03-23 18:49:11 +0530, Dilip Kumar wrote: > > I think directly using smgrcreate() is a better idea instead of first > > registering and then unregistering it. I have made that change in > > the attached patch. After this change now we can merge creating the > > MAIN_FORKNUM also in the loop below where we are creating other > > fork[1] with one extra condition but I think current code is in more > > sync with the other code where we are doing the similar things so I > > have not merged it in the loop. Please let me know if you think > > otherwise. > > FWIW, this fails tests: https://cirrus-ci.com/build/4929662173315072 > https://cirrus-ci.com/task/6651773434724352?logs=test_bin#L121 > https://cirrus-ci.com/task/6088823481303040?logs=test_world#L2377 It is strange that these changes cause a failure in the file_copy strategy[1], because we only made changes related to the wal_log strategy. However, I will look into it. Thanks. [1] Failed test 'createdb -T foobar2 foobar5 -S file_copy exit code 0' -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 23, 2022 at 9:05 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Mar 23, 2022 at 7:03 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Wed, Mar 23, 2022 at 9:19 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > I think directly using smgrcreate() is a better idea instead of first > > > registering and then unregistering it. I have made that change in > > > the attached patch. After this change now we can merge creating the > > > MAIN_FORKNUM also in the loop below where we are creating other > > > fork[1] with one extra condition but I think current code is in more > > > sync with the other code where we are doing the similar things so I > > > have not merged it in the loop. Please let me know if you think > > > otherwise. > > > > Generally I think our practice is that we do the main fork > > unconditionally (because it should always be there) and the others > > only if they exist. I suggest that you make this consistent with that, > > but you could do it like if (forkNum != MAIN_FORKNUM && > > !smgrexists(...)) continue if that seems nicer. > > Maybe we can do that. > > > Do you think that this version handles pending syncs correctly? I > > think perhaps that is overlooked. > > Yeah, I missed that. So the options are either to go back to the other > approach and call RelationPreserveStorage() after > RelationCreateStorage(), Here is the patch with this approach. I am not sending both patches with the different approaches in the same mail, since I think cfbot might generate a conflict while applying them, so I will send the other one in a separate mail. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Mar 23, 2022 at 9:25 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Mar 23, 2022 at 9:13 PM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2022-03-23 18:49:11 +0530, Dilip Kumar wrote: > > > I think directly using smgrcreate() is a better idea instead of first > > > registering and then unregistering it. I have made that change in > > > the attached patch. After this change now we can merge creating the > > > MAIN_FORKNUM also in the loop below where we are creating other > > > fork[1] with one extra condition but I think current code is in more > > > sync with the other code where we are doing the similar things so I > > > have not merged it in the loop. Please let me know if you think > > > otherwise. > > > > FWIW, this fails tests: https://cirrus-ci.com/build/4929662173315072 > > https://cirrus-ci.com/task/6651773434724352?logs=test_bin#L121 > > https://cirrus-ci.com/task/6088823481303040?logs=test_world#L2377 > > Strange to see that these changes are making a failure in the > file_copy strategy[1] because we made changes only related to the > wal_log strategy. However I will look into this. Thanks. > [1] > Failed test 'createdb -T foobar2 foobar5 -S file_copy exit code 0' I could not see any reason for it to fail, and I could not reproduce it either. Is it possible to access the server log for this cfbot failure? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi, On 2022-03-23 22:29:40 +0530, Dilip Kumar wrote: > I could not see any reason for it to fail, and I could not reproduce > it either. Is it possible to access the server log for this cfbot > failure? Yes, near the top, below the cpu / memory graphs, there's a file navigator. Should have all files ending with *.log or starting with regress_log_*. Greetings, Andres Freund
On Wed, Mar 23, 2022 at 10:37 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2022-03-23 22:29:40 +0530, Dilip Kumar wrote: > > I could not see any reason for it to fail, and I could not reproduce > > it either. Is it possible to access the server log for this cfbot > > failure? > > Yes, near the top, below the cpu / memory graphs, there's a file > navigator. Should have all files ending with *.log or starting with > regress_log_*. Okay, I think I have found the reason for this failure: as the logs below show, the second statement fails because foobar5 already exists, and that is because one of the earlier test cases conditionally generates the same name. So the fix is to use a different name. 2022-03-23 13:53:54.554 UTC [32647][client backend] [020_createdb.pl][3/12:0] LOG: statement: CREATE DATABASE foobar5 TEMPLATE template0 LOCALE_PROVIDER icu ICU_LOCALE 'en'; ...... 2022-03-23 13:53:55.374 UTC [32717][client backend] [020_createdb.pl][3/46:0] LOG: statement: CREATE DATABASE foobar5 STRATEGY file_copy TEMPLATE foobar2; 2022-03-23 13:53:55.390 UTC [32717][client backend] [020_createdb.pl][3/46:0] ERROR: database "foobar5" already exists -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 23, 2022 at 10:50 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Mar 23, 2022 at 10:37 PM Andres Freund <andres@anarazel.de> wrote: > > > > Hi, > > > > On 2022-03-23 22:29:40 +0530, Dilip Kumar wrote: > > > I could not see any reason for it to fail, and I could not reproduce > > > it either. Is it possible to access the server log for this cfbot > > > failure? > > > > Yes, near the top, below the cpu / memory graphs, there's a file > > navigator. Should have all files ending with *.log or starting with > > regress_log_*. > > Okay, I think I have found the reason for this failure: as the logs > below show, the second statement fails because foobar5 already exists, > and that is because one of the earlier test cases conditionally > generates the same name. So the fix is to use a different name. In the latest version I have fixed this issue by using a non-conflicting name: when the server is compiled with ICU, foobar5 is already in use, and that is why we were seeing the failure. Apart from this, I have fixed the duplicate-cleanup problem by passing an extra parameter to RelationCreateStorage which decides whether or not to register the on-abort delete, and I have added comments for the same. IMHO this looks like the cleanest way to do it; please check the patch and let me know your thoughts. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
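For clarity, the signature change being described amounts to roughly the following; the parameter name and callers shown are illustrative:

/*
 * Extended signature: 'register_delete' decides whether to register
 * the relation for deletion at transaction abort.
 */
extern SMgrRelation RelationCreateStorage(RelFileNode rnode,
                                          char relpersistence,
                                          bool register_delete);

/* Ordinary DDL keeps the old behavior: */
RelationCreateStorage(rel->rd_node, relpersistence, true);

/* createdb() relies on its own whole-directory cleanup instead: */
RelationCreateStorage(dst_rnode, relpersistence, false);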
On Thu, Mar 24, 2022 at 1:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > In the latest version I have fixed this issue by using a non > conflicting name, because when it was compiled with-icu the foobar5 > was already used and we were seeing failure. Apart from this I have > fixed the duplicate cleanup problem by passing an extra parameter to > RelationCreateStorage, which decides whether to register for on-abort > delete or not and added the comments for the same. IMHO this looks > the most cleaner way to do it, please check the patch and let me know > your thoughts. I think that might be an OK way to do it. I think if we were starting from scratch we'd probably want to come up with some better system, but that's true of a lot of things. I went over your version and changed some comments. I also added documentation for the new wait event. Here's a new version. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Thu, Mar 24, 2022 at 9:29 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Mar 24, 2022 at 1:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > In the latest version I have fixed this issue by using a non > > conflicting name, because when it was compiled with-icu the foobar5 > > was already used and we were seeing failure. Apart from this I have > > fixed the duplicate cleanup problem by passing an extra parameter to > > RelationCreateStorage, which decides whether to register for on-abort > > delete or not and added the comments for the same. IMHO this looks > > the most cleaner way to do it, please check the patch and let me know > > your thoughts. > > I think that might be an OK way to do it. I think if we were starting > from scratch we'd probably want to come up with some better system, > but that's true of a lot of things. Right. > I went over your version and changed some comments. I also added > documentation for the new wait event. Here's a new version. > Thanks, I have gone through your changes in comments and docs and those LGTM. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 24, 2022 at 12:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Thanks, I have gone through your changes in comments and docs and those LGTM. It looks like this patch will need to be updated for Alvaro's commit 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. The newly added test 029_replay_tsp_drops.pl fails with this patch applied. The standby log shows: 2022-03-25 10:00:10.022 EDT [38209] LOG: entering standby mode 2022-03-25 10:00:10.024 EDT [38209] LOG: redo starts at 0/3000028 2022-03-25 10:00:10.062 EDT [38209] FATAL: could not create directory "pg_tblspc/16385/PG_15_202203241/16390": No such file or directory 2022-03-25 10:00:10.062 EDT [38209] CONTEXT: WAL redo at 0/43EBD88 for Database/CREATE_WAL_LOG: create dir 16385/16390 On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need to mirror some of the logic that was added to the replay code for the existing strategy, but I haven't figured out the details. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Mar 25, 2022 at 7:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Mar 24, 2022 at 12:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Thanks, I have gone through your changes in comments and docs and those LGTM.
>
> It looks like this patch will need to be updated for Alvaro's commit 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. The newly added test 029_replay_tsp_drops.pl fails with this patch applied. The standby log shows:
>
> 2022-03-25 10:00:10.022 EDT [38209] LOG: entering standby mode
> 2022-03-25 10:00:10.024 EDT [38209] LOG: redo starts at 0/3000028
> 2022-03-25 10:00:10.062 EDT [38209] FATAL: could not create directory "pg_tblspc/16385/PG_15_202203241/16390": No such file or directory
> 2022-03-25 10:00:10.062 EDT [38209] CONTEXT: WAL redo at 0/43EBD88 for Database/CREATE_WAL_LOG: create dir 16385/16390
>
> On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need to mirror some of the logic that was added to the replay code for the existing strategy, but I haven't figured out the details.

Yeah, I think I got it: for XLOG_DBASE_CREATE_WAL_LOG we will now have to handle the missing parent directory case, as Alvaro handled it for the XLOG_DBASE_CREATE(_FILE_COPY) case.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Fri, Mar 25, 2022 at 8:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need to mirror some of the logic that was added to the replay code for the existing strategy, but I haven't figured out the details.
>
> Yeah, I think I got it: for XLOG_DBASE_CREATE_WAL_LOG we will now have to handle the missing parent directory case, as Alvaro handled it for the XLOG_DBASE_CREATE(_FILE_COPY) case.

I have updated the patch so that we now skip XLOG_DBASE_CREATE_WAL_LOG as well if the tablespace directory is missing. But with our new wal_log method there will be other follow-up WAL records: XLOG_RELMAP_UPDATE, XLOG_SMGR_CREATE and XLOG_FPI.

I have added similar logic for the relmap_update WAL replay as well, but we don't need it for smgr_create or FPI, because mdcreate() takes care of creating a missing directory via TablespaceCreateDbspace(), and an FPI is only logged after we create the new smgr, at least in the case of CREATE DATABASE.

Now, is it possible to get an FPI without an smgr_create WAL record in other cases? If so, that problem is orthogonal to this patch, but anyway I could not find any such scenario.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
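In outline, the replay-side skip being described would look something like this. This is a hypothetical sketch only: the record struct name follows the patch, the reporting helper is assumed, and the logic depended on Alvaro's commit, which was later reverted before the final version was committed.

    /* Hypothetical sketch of dbase_redo() handling for the new record. */
    if (info == XLOG_DBASE_CREATE_WAL_LOG)
    {
        xl_dbase_create_wal_log_rec *xlrec =
            (xl_dbase_create_wal_log_rec *) XLogRecGetData(record);
        char        *dbpath;
        char        *parent_path;
        struct stat  st;

        dbpath = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);

        /*
         * If the parent (tablespace) directory is missing, the tablespace
         * must get dropped again later in the WAL stream, so skip the
         * whole operation instead of failing recovery -- mirroring what
         * the FILE_COPY replay path does.
         */
        parent_path = pstrdup(dbpath);
        get_parent_directory(parent_path);
        if (stat(parent_path, &st) != 0 || !S_ISDIR(st.st_mode))
            report_missing_dir_and_skip(parent_path);   /* assumed helper */
        else
        {
            /* create the directory and write PG_VERSION into it */
        }
        pfree(parent_path);
        pfree(dbpath);
    }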
On Sat, Mar 26, 2022 at 5:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Mar 25, 2022 at 8:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > > On a quick look, I'm guessing that XLOG_DBASE_CREATE_WAL_LOG will need to mirror some of the logic that was added to the replay code for the existing strategy, but I haven't figured out the details.
> >
> > Yeah, I think I got it: for XLOG_DBASE_CREATE_WAL_LOG we will now have to handle the missing parent directory case, as Alvaro handled it for the XLOG_DBASE_CREATE(_FILE_COPY) case.
>
> I have updated the patch so that we now skip XLOG_DBASE_CREATE_WAL_LOG as well if the tablespace directory is missing. But with our new wal_log method there will be other follow-up WAL records: XLOG_RELMAP_UPDATE, XLOG_SMGR_CREATE and XLOG_FPI.
>
> I have added similar logic for the relmap_update WAL replay as well,

There was a mistake in the last patch: for the relmap update I had checked for the missing tablespace directory, but I should have checked for the missing database directory, so I have fixed that.

> Now, is it possible to get an FPI without an smgr_create WAL record in other cases? If so, that problem is orthogonal to this patch, but anyway I could not find any such scenario.

I have dug further into it and tried manually removing the directory before XLOG_FPI, but I noticed that during FPI replay XLogReadBufferExtended() also takes care of creating the missing files using smgrcreate(), and that in turn takes care of creating the missing directory, so I don't think we have any problem here.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
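The FPI behavior Dilip describes lives in XLogReadBufferExtended(); the relevant step, paraphrased (comments mine, details may vary slightly by version):

    /* In XLogReadBufferExtended(), before the block is read: */
    smgr = smgropen(rnode, InvalidBackendId);

    /*
     * Create the target file if it doesn't already exist.  With
     * isRedo = true this descends into mdcreate(), which also creates
     * any missing tablespace/database directory, so replaying a stray
     * FPI cannot fail on a missing path.
     */
    smgrcreate(smgr, forknum, true);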
On Mon, Mar 28, 2022 at 2:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I have added similar logic for the relmap_update WAL replay as well,
>
> There was a mistake in the last patch: for the relmap update I had checked for the missing tablespace directory, but I should have checked for the missing database directory, so I have fixed that.
>
> > Now, is it possible to get an FPI without an smgr_create WAL record in other cases? If so, that problem is orthogonal to this patch, but anyway I could not find any such scenario.
>
> I have dug further into it and tried manually removing the directory before XLOG_FPI, but I noticed that during FPI replay XLogReadBufferExtended() also takes care of creating the missing files using smgrcreate(), and that in turn takes care of creating the missing directory, so I don't think we have any problem here.

I don't understand whether XLOG_RELMAP_UPDATE should be just doing smgrcreate() as we would for most WAL records or whether it should be adopting the new system introduced by 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern over here:

http://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com

But apart from that question your adaptations here look reasonable to me.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Mar 29, 2022 at 12:38 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Mar 28, 2022 at 2:18 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > I have added similar logic for the relmap_update WAL replay as well,
> >
> > There was a mistake in the last patch: for the relmap update I had checked for the missing tablespace directory, but I should have checked for the missing database directory, so I have fixed that.
> >
> > > Now, is it possible to get an FPI without an smgr_create WAL record in other cases? If so, that problem is orthogonal to this patch, but anyway I could not find any such scenario.
> >
> > I have dug further into it and tried manually removing the directory before XLOG_FPI, but I noticed that during FPI replay XLogReadBufferExtended() also takes care of creating the missing files using smgrcreate(), and that in turn takes care of creating the missing directory, so I don't think we have any problem here.
>
> I don't understand whether XLOG_RELMAP_UPDATE should be just doing smgrcreate()

XLOG_RELMAP_UPDATE is for the complete database, so for which relnode would it create an smgr? I think you probably meant TablespaceCreateDbspace()?

> as we would for most WAL records or whether it should be adopting the new system introduced by 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern over here:

Okay, thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
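The directory-creation path Dilip is pointing at, paraphrased from md.c (heavily trimmed; comments mine):

    void
    mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo)
    {
        char   *path;

        if (isRedo && reln->md_num_open_segs[forknum] > 0)
            return;             /* already created and opened */

        /*
         * During replay, this creates any missing tablespace and
         * database directories before touching the relation file, which
         * is why an XLOG_SMGR_CREATE record needs no extra
         * missing-directory handling in this patch.
         */
        TablespaceCreateDbspace(reln->smgr_rnode.node.spcNode,
                                reln->smgr_rnode.node.dbNode,
                                isRedo);

        path = relpath(reln->smgr_rnode, forknum);
        /* ... open (or, in redo, re-open) the first segment file ... */
    }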
On Mon, Mar 28, 2022 at 3:08 PM Robert Haas <robertmhaas@gmail.com> wrote: > smgrcreate() as we would for most WAL records or whether it should be > adopting the new system introduced by > 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern > over here: > > http://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com > > But apart from that question your adaptations here look reasonable to me. That commit having been reverted, I committed v6 instead. Let's see what breaks... -- Robert Haas EDB: http://www.enterprisedb.com
On 2022-03-29 11:55:05 -0400, Robert Haas wrote: > That commit having been reverted, I committed v6 instead. Let's see > what breaks... It fails in CI (for the mirror of the postgres repo on github): https://cirrus-ci.com/task/6279465603956736?logs=test_bin#L121 tap test log: https://api.cirrus-ci.com/v1/artifact/task/6279465603956736/log/src/bin/scripts/tmp_check/log/regress_log_020_createdb postmaster log: https://api.cirrus-ci.com/v1/artifact/task/6279465603956736/log/src/bin/scripts/tmp_check/log/020_createdb_main.log recent versions failed similarly on cfbot: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/37/3192 https://cirrus-ci.com/task/5217140407009280?logs=test_bin#L121 # Running: createdb -T foobar2 foobar6 -S wal_log createdb: error: too many command-line arguments (first is "wal_log") Try "createdb --help" for more information. not ok 31 - createdb -T foobar2 foobar6 -S wal_log exit code 0 # Failed test 'createdb -T foobar2 foobar6 -S wal_log exit code 0' # at t/020_createdb.pl line 117. not ok 32 - create database with WAL_LOG strategy: SQL found in server log # Failed test 'create database with WAL_LOG strategy: SQL found in server log' # at t/020_createdb.pl line 117. # '' # doesn't match '(?^:statement: CREATE DATABASE foobar6 STRATEGY wal_log TEMPLATE foobar2)' # Running: createdb -T foobar2 foobar7 -S file_copy createdb: error: too many command-line arguments (first is "file_copy") Try "createdb --help" for more information. not ok 33 - createdb -T foobar2 foobar7 -S file_copy exit code 0 # Failed test 'createdb -T foobar2 foobar7 -S file_copy exit code 0' # at t/020_createdb.pl line 122. not ok 34 - create database with FILE_COPY strategy: SQL found in server log # Failed test 'create database with FILE_COPY strategy: SQL found in server log' # at t/020_createdb.pl line 122. # '' # doesn't match '(?^:statement: CREATE DATABASE foobar7 STRATEGY file_copy TEMPLATE foobar2)' Looks like there's some problem with commandline parsing? Greetings, Andres Freund
On Tue, Mar 29, 2022 at 1:35 PM Andres Freund <andres@anarazel.de> wrote: > # Running: createdb -T foobar2 foobar6 -S wal_log > createdb: error: too many command-line arguments (first is "wal_log") > Try "createdb --help" for more information. > not ok 31 - createdb -T foobar2 foobar6 -S wal_log exit code 0 > > Looks like there's some problem with commandline parsing? Apparently getopt_long() is fussier on Windows. I have committed a fix. -- Robert Haas EDB: http://www.enterprisedb.com
Andres Freund <andres@anarazel.de> writes: > Looks like there's some problem with commandline parsing? That test script is expecting glibc-like laxness of switch parsing. Put the switches before the non-switch arguments. regards, tom lane
On Tue, Mar 29, 2022 at 1:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@anarazel.de> writes: > > Looks like there's some problem with commandline parsing? > > That test script is expecting glibc-like laxness of switch > parsing. Put the switches before the non-switch arguments. I just did that. :-) -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Mar 29, 2022 at 1:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> That test script is expecting glibc-like laxness of switch >> parsing. Put the switches before the non-switch arguments. > I just did that. :-) Yup, you pushed while I was typing. FWIW, I don't think it's "Windows" enforcing this, it's our own src/port/getopt[_long].c. If there were a well-defined spec for what glibc does with such cases, it might be interesting to try to make our version bug-compatible with theirs. But AFAIK it's some random algorithm that they probably feel at liberty to change. regards, tom lane
On Tue, Mar 29, 2022 at 2:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Tue, Mar 29, 2022 at 1:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> That test script is expecting glibc-like laxness of switch > >> parsing. Put the switches before the non-switch arguments. > > > I just did that. :-) > > Yup, you pushed while I was typing. > > FWIW, I don't think it's "Windows" enforcing this, it's our own > src/port/getopt[_long].c. If there were a well-defined spec > for what glibc does with such cases, it might be interesting to > try to make our version bug-compatible with theirs. But AFAIK > it's some random algorithm that they probably feel at liberty > to change. I guess that characterization surprises me. The man page for getopt_long() says this, and has for a long time at least on systems I've used: ENVIRONMENT POSIXLY_CORRECT If set, option processing stops when the first non- option is found and a leading `-' or `+' in the optstring is ignored. And also this: BUGS The argv argument is not really const as its elements may be permuted (unless POSIXLY_CORRECT is set). Doesn't that make it pretty clear what the GNU version is doing? -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Mar 29, 2022 at 2:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> it's some random algorithm that they probably feel at liberty >> to change. > I guess that characterization surprises me. The man page for > getopt_long() says this, and has for a long time at least on systems > I've used: Yeah, they say they follow the POSIX spec when you set POSIXLY_CORRECT. What they don't spell out in any detail is what they do when you don't. We know that it involves rearranging the argv[] array behind the application's back, but not what the rules are for doing that. In particular, they must have some undocumented and probably not very safe method for deciding which arguments are neither switches nor switch arguments. (Actually, if I recall previous discussions properly, another stumbling block to doing anything here is that we'd also have to change all our documentation to explain it. Fixing the command line synopses would be a mess already, and explaining the rules would be worse.) regards, tom lane
On Tue, Mar 29, 2022 at 2:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Tue, Mar 29, 2022 at 2:17 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> it's some random algorithm that they probably feel at liberty > >> to change. > > > I guess that characterization surprises me. The man page for > > getopt_long() says this, and has for a long time at least on systems > > I've used: > > Yeah, they say they follow the POSIX spec when you set POSIXLY_CORRECT. > What they don't spell out in any detail is what they do when you don't. > We know that it involves rearranging the argv[] array behind the > application's back, but not what the rules are for doing that. In > particular, they must have some undocumented and probably not very safe > method for deciding which arguments are neither switches nor switch > arguments. I mean, I think of an option as something that starts with '-'. The documentation contains a caveat that says: "The special argument ‘--’ forces in all cases the end of option scanning." So I think I would expect it just looks for arguments starting with '-' that do not follow an argument that is exactly "--". <looks around for the source code> https://github.com/gcc-mirror/gcc/blob/master/libiberty/getopt.c If an element of ARGV starts with '-', and is not exactly "-" or "--", then it is an option element. The characters of this element (aside from the initial '-') are option characters. If `getopt' is called repeatedly, it returns successively each of the option characters from each of the option elements. OK - so I was off slightly. Either "-" or "--" terminates the options list. Apart from that anything starting with "-" is an option. I think you're overestimating the level of mystery that's present here, as well as the likelihood that the rules could ever be changed. -- Robert Haas EDB: http://www.enterprisedb.com
Hi,

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> I committed v6 instead.

Just noticed that it makes initdb a bit slower / the cluster a bit bigger, because now there's WAL traffic from creating the databases. There's an optimization (albeit insufficient) to reduce WAL traffic in bootstrap mode, but not for single-user mode when the CREATE DATABASEs happen.

In an optimized build, with wal-segsize 1 (the most extreme case) using FILE_COPY vs WAL_LOG:

perf stat ~/build/postgres/dev-optimize/install/bin/initdb /tmp/initdb/ --wal-segsize=1

WAL_LOG:
          487.58 msec task-clock                #    0.848 CPUs utilized
           2,874      context-switches          #    5.894 K/sec
               0      cpu-migrations            #    0.000 /sec
          10,209      page-faults               #   20.938 K/sec
   1,550,483,095      cycles                    #    3.180 GHz
   2,537,618,094      instructions              #    1.64  insn per cycle
     492,780,121      branches                  #    1.011 G/sec
       7,384,884      branch-misses             #    1.50% of all branches

     0.575213800 seconds time elapsed
     0.349812000 seconds user
     0.133225000 seconds sys

FILE_COPY:
          476.54 msec task-clock                #    0.854 CPUs utilized
           3,005      context-switches          #    6.306 K/sec
               0      cpu-migrations            #    0.000 /sec
          10,050      page-faults               #   21.090 K/sec
   1,516,058,200      cycles                    #    3.181 GHz
   2,504,126,907      instructions              #    1.65  insn per cycle
     488,042,856      branches                  #    1.024 G/sec
       7,327,364      branch-misses             #    1.50% of all branches

     0.557934976 seconds time elapsed
     0.360473000 seconds user
     0.112109000 seconds sys

The numbers are similar if repeated.

du -s /tmp/initdb/
WAL_LOG: 35112
FILE_COPY: 29288

So it seems we should specify a strategy in initdb? It kind of makes sense - we're not going to read anything from those databases. And because of the ringbuffer of 256kB, we'll not even reduce IO meaningfully.

- Andres
On Wed, Mar 30, 2022 at 6:47 AM Andres Freund <andres@anarazel.de> wrote:
>
> du -s /tmp/initdb/
> WAL_LOG: 35112
> FILE_COPY: 29288
>
> So it seems we should specify a strategy in initdb? It kind of makes sense - we're not going to read anything from those databases. And because of the ringbuffer of 256kB, we'll not even reduce IO meaningfully.

I think this makes sense. So do you mean that with initdb we will always use file_copy, or do we want to provide a command-line option for initdb?

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, Mar 29, 2022 at 9:25 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Mar 28, 2022 at 3:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > smgrcreate() as we would for most WAL records or whether it should be adopting the new system introduced by 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern over here:
> >
> > http://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com
> >
> > But apart from that question your adaptations here look reasonable to me.
>
> That commit having been reverted, I committed v6 instead. Let's see what breaks...

There was a duplicate error check for the invalid createdb strategy option in the test case. Although it would not cause any problem, it is a duplicate, so I have fixed it in the attached patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
Hi,

On 2022-03-30 09:28:58 +0530, Dilip Kumar wrote:
> On Wed, Mar 30, 2022 at 6:47 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > du -s /tmp/initdb/
> > WAL_LOG: 35112
> > FILE_COPY: 29288
> >
> > So it seems we should specify a strategy in initdb? It kind of makes sense - we're not going to read anything from those databases. And because of the ringbuffer of 256kB, we'll not even reduce IO meaningfully.
>
> I think this makes sense. So do you mean that with initdb we will always use file_copy, or do we want to provide a command-line option for initdb?

Don't see a need for a command-line option / a situation where using WAL_LOG would be preferable for initdb.

Greetings,

Andres Freund
Hi,

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> I committed v6 instead.

I was just discussing the WAL prefetching patch with Thomas. A question in that discussion made me look at the coverage of REDO for CREATE DATABASE:
https://coverage.postgresql.org/src/backend/commands/dbcommands.c.gcov.html

Seems there's currently nothing hitting the REDO for XLOG_DBASE_CREATE_FILE_COPY (currently line 3019). I think it'd be good to keep coverage for that. How about adding a
  CREATE DATABASE ... STRATEGY file_copy
to 001_stream_rep.pl?

Might be worth adding a test for ALTER DATABASE ... SET TABLESPACE at the same time; this patch did affect that path in some minor ways. And, somewhat shockingly, we don't have a single test for it.

- Andres
On Thu, Mar 31, 2022 at 5:07 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> > I committed v6 instead.
>
> I was just discussing the WAL prefetching patch with Thomas. A question in that discussion made me look at the coverage of REDO for CREATE DATABASE:
> https://coverage.postgresql.org/src/backend/commands/dbcommands.c.gcov.html
>
> Seems there's currently nothing hitting the REDO for XLOG_DBASE_CREATE_FILE_COPY (currently line 3019). I think it'd be good to keep coverage for that. How about adding a
>   CREATE DATABASE ... STRATEGY file_copy
> to 001_stream_rep.pl?
>
> Might be worth adding a test for ALTER DATABASE ... SET TABLESPACE at the same time; this patch did affect that path in some minor ways. And, somewhat shockingly, we don't have a single test for it.

I will add tests for both of these cases and send the patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Mar 31, 2022 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Mar 31, 2022 at 5:07 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> > > I committed v6 instead.
> >
> > I was just discussing the WAL prefetching patch with Thomas. A question in that discussion made me look at the coverage of REDO for CREATE DATABASE:
> > https://coverage.postgresql.org/src/backend/commands/dbcommands.c.gcov.html
> >
> > Seems there's currently nothing hitting the REDO for XLOG_DBASE_CREATE_FILE_COPY (currently line 3019). I think it'd be good to keep coverage for that. How about adding a
> >   CREATE DATABASE ... STRATEGY file_copy
> > to 001_stream_rep.pl?
> >
> > Might be worth adding a test for ALTER DATABASE ... SET TABLESPACE at the same time; this patch did affect that path in some minor ways. And, somewhat shockingly, we don't have a single test for it.
>
> I will add tests for both of these cases and send the patch.

0001 changes the strategy to file_copy during initdb, and 0002 adds test cases for both of these.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Mar 31, 2022 at 3:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> 0001 changes the strategy to file_copy during initdb, and 0002 adds test cases for both of these.

IMHO, 0001 looks fine, except for needing some adjustments to the wording. I'm less sure about 0002. It's testing the stuff Andres mentioned, but I'm not sure how good the tests are. Andres, thoughts? Do you want me to polish and commit 0001?

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2022-03-31 13:22:24 +0530, Dilip Kumar wrote:
> 0001 changes the strategy to file_copy during initdb, and 0002 adds test cases for both of these.

Thanks!

> From 4a997e2a95074a520777cd2b369f9c728b360969 Mon Sep 17 00:00:00 2001
> From: Dilip Kumar <dilipkumar@localhost.localdomain>
> Date: Thu, 31 Mar 2022 10:43:16 +0530
> Subject: [PATCH 1/2] Use file_copy strategy during initdb
>
> Because skipping the checkpoint during initdb will not result
> in significant savings, so there is no point in using wal_log
> as that will simply increase the cluster size by generating
> extra wal.
> ---
>  src/bin/initdb/initdb.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
> index 5e36943..1256082 100644
> --- a/src/bin/initdb/initdb.c
> +++ b/src/bin/initdb/initdb.c
> @@ -1856,6 +1856,11 @@ make_template0(FILE *cmdfd)
>      * it would fail. To avoid that, assign a fixed OID to template0 rather
>      * than letting the server choose one.
>      *
> +    * Using file_copy strategy is preferable over wal_log here because
> +    * skipping the checkpoint during initdb will not result in significant
> +    * savings, so there is no point in using wal_log as that will simply
> +    * increase the cluster size by generating extra wal.

It's not just the increase in size, it's also the increase in time due to WAL logging.

>      * (Note that, while the user could have dropped and recreated these
>      * objects in the old cluster, the problem scenario only exists if the OID
>      * that is in use in the old cluster is also used in the new cluster - and
> @@ -1863,7 +1868,7 @@ make_template0(FILE *cmdfd)
>      */
>     static const char *const template0_setup[] = {
>         "CREATE DATABASE template0 IS_TEMPLATE = true ALLOW_CONNECTIONS = false OID = "
> -       CppAsString2(Template0ObjectId) ";\n\n",
> +       CppAsString2(Template0ObjectId) " STRATEGY = file_copy;\n\n",

I'd perhaps break this into a separate line, but...

> From d0759bcfc4fed674e938e4a03159f5953ca9718d Mon Sep 17 00:00:00 2001
> From: Dilip Kumar <dilipkumar@localhost.localdomain>
> Date: Thu, 31 Mar 2022 12:07:19 +0530
> Subject: [PATCH 2/2] Create database test coverage
>
> Test create database strategy wal replay and alter database
> set tablespace.
> ---
>  src/test/modules/test_misc/t/002_tablespace.pl | 12 ++++++++++++
>  src/test/recovery/t/001_stream_rep.pl          | 24 ++++++++++++++++++++++++
>  2 files changed, 36 insertions(+)
>
> diff --git a/src/test/modules/test_misc/t/002_tablespace.pl b/src/test/modules/test_misc/t/002_tablespace.pl
> index 04e5439..f3bbddc 100644
> --- a/src/test/modules/test_misc/t/002_tablespace.pl
> +++ b/src/test/modules/test_misc/t/002_tablespace.pl
> @@ -83,7 +83,19 @@ $result = $node->psql('postgres',
>     "ALTER TABLE t SET tablespace regress_ts1");
>  ok($result == 0, 'move table in-place->abs');
>
> +# Test ALTER DATABASE SET TABLESPACE
> +$result = $node->psql('postgres',
> +    "CREATE DATABASE testdb TABLESPACE regress_ts1");
> +ok($result == 0, 'create database in tablespace 1');
> +$result = $node->psql('testdb',
> +    "CREATE TABLE t ()");
> +ok($result == 0, 'create table in testdb database');
> +$result = $node->psql('postgres',
> +    "ALTER DATABASE testdb SET TABLESPACE regress_ts2");
> +ok($result == 0, 'move database to tablespace 2');

This just tests the command doesn't fail, but not whether it actually did something useful. Seems we should at least insert a row or two into the table, and verify they can be accessed?
> +# Create database with different strategies and check its presence in standby
> +$node_primary->safe_psql('postgres',
> +    "CREATE DATABASE testdb1 STRATEGY = FILE_COPY; ");
> +$node_primary->safe_psql('testdb1',
> +    "CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
> +$node_primary->safe_psql('postgres',
> +    "CREATE DATABASE testdb2 STRATEGY = WAL_LOG; ");
> +$node_primary->safe_psql('testdb2',
> +    "CREATE TABLE tab_int AS SELECT generate_series(1,10) AS a");
> +
> +# Wait for standbys to catch up
> +$primary_lsn = $node_primary->lsn('write');
> +$node_primary->wait_for_catchup($node_standby_1, 'replay', $primary_lsn);
> +
> +$result =
> +  $node_standby_1->safe_psql('testdb1', "SELECT count(*) FROM tab_int");
> +print "standby 1: $result\n";
> +is($result, qq(10), 'check streamed content on standby 1');
> +
> +$result =
> +  $node_standby_1->safe_psql('testdb2', "SELECT count(*) FROM tab_int");
> +print "standby 1: $result\n";
> +is($result, qq(10), 'check streamed content on standby 1');
> +
>  # Check that only READ-only queries can run on standbys
>  is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
>     3, 'read-only queries on standby 1');

I'd probably add a function for creating database / table and then testing it, with a strategy parameter. That way we can afterwards add more tests verifying that everything worked.

Greetings,

Andres Freund
Hi, On 2022-03-31 10:05:10 -0400, Robert Haas wrote: > On Thu, Mar 31, 2022 at 3:52 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > 0001 is changing the strategy to file copy during initdb and 0002 > > patch adds the test cases for both these cases. > > IMHO, 0001 looks fine, except for needing some adjustments to the wording. Agreed. > I'm less sure about 0002. It's testing the stuff Andres mentioned, but > I'm not sure how good the tests are. I came to a similar conclusion. It's still better than nothing, but it's just a small bit of additional effort to do some basic testing that e.g. the move actually worked... > Andres, thoughts? Do you want me to polish and commit 0001? Yes please! FWIW, once the freeze is done I'm planning to set up scripting to see which parts of the code we whacked around don't have test coverage... Greetings, Andres Freund
On Thu, Mar 31, 2022 at 12:25 PM Andres Freund <andres@anarazel.de> wrote: > > Andres, thoughts? Do you want me to polish and commit 0001? > > Yes please! Here is a polished version. Comments? > FWIW, once the freeze is done I'm planning to set up scripting to see which > parts of the code we whacked around don't have test coverage... Sounds terrifying. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On 2022-03-31 14:31:43 -0400, Robert Haas wrote: > On Thu, Mar 31, 2022 at 12:25 PM Andres Freund <andres@anarazel.de> wrote: > > > Andres, thoughts? Do you want me to polish and commit 0001? > > > > Yes please! > > Here is a polished version. Comments? LGTM.
On Thu, Mar 31, 2022 at 2:44 PM Andres Freund <andres@anarazel.de> wrote: > On 2022-03-31 14:31:43 -0400, Robert Haas wrote: > > On Thu, Mar 31, 2022 at 12:25 PM Andres Freund <andres@anarazel.de> wrote: > > > > Andres, thoughts? Do you want me to polish and commit 0001? > > > > > > Yes please! > > > > Here is a polished version. Comments? > > LGTM. Committed. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Mar 31, 2022 at 9:52 PM Andres Freund <andres@anarazel.de> wrote:
> > +$result = $node->psql('postgres',
> > +    "ALTER DATABASE testdb SET TABLESPACE regress_ts2");
> > +ok($result == 0, 'move database to tablespace 2');
>
> This just tests the command doesn't fail, but not whether it actually did something useful. Seems we should at least insert a row or two into the table, and verify they can be accessed?

Now I have added some tuples and verified them.

> >  # Check that only READ-only queries can run on standbys
> >  is($node_standby_1->psql('postgres', 'INSERT INTO tab_int VALUES (1)'),
> >     3, 'read-only queries on standby 1');
>
> I'd probably add a function for creating database / table and then testing it, with a strategy parameter. That way we can afterwards add more tests verifying that everything worked.

I have created a function that creates a database and a table and verifies the content in it. Another option would be to keep only the database and table creation inside the function and the verification outside it, so that a future test case that wants to create some extra content and verify it can do so. But with the current tests in mind, the approach in the attached patch has less duplicate code, so I preferred it this way.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
Hi,

On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> I committed v6 instead.

Coverity complains that this patch added GetDatabasePath() calls without freeing the return value. Normally that'd be easy to dismiss, due to memory contexts, but there are no granular resets in CreateDatabaseUsingFileCopy(). And obviously there can be a lot of relations in one database - we shouldn't hold onto the same path over and over again.

The case in recovery is worse, because there we don't have a memory context to reset afaics. Oddly enough, it sure looks like we have an existing version of this bug in the file-copy path?

Greetings,

Andres Freund
On Sun, Apr 3, 2022 at 9:52 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-03-29 11:55:05 -0400, Robert Haas wrote:
> > I committed v6 instead.
>
> Coverity complains that this patch added GetDatabasePath() calls without freeing the return value. Normally that'd be easy to dismiss, due to memory contexts, but there are no granular resets in CreateDatabaseUsingFileCopy(). And obviously there can be a lot of relations in one database - we shouldn't hold onto the same path over and over again.
>
> The case in recovery is worse, because there we don't have a memory context to reset afaics. Oddly enough, it sure looks like we have an existing version of this bug in the file-copy path?

Yeah, I see that createdb() and dbase_redo() had this existing problem, and with this patch we have created a few more such occurrences. The attached patch fixes them.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
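The fix follows this pattern -- a sketch of the idea rather than the exact committed hunk, with variable names assumed -- freeing each palloc'd path explicitly, since neither createdb() nor dbase_redo() runs this code in a context that is reset per call:

    {
        char   *srcpath;
        char   *dstpath;

        srcpath = GetDatabasePath(src_dboid, src_tsoid);    /* palloc'd */
        dstpath = GetDatabasePath(dst_dboid, dst_tsoid);    /* palloc'd */

        copydir(srcpath, dstpath, false);

        /* Release the paths instead of leaking one pair per call. */
        pfree(srcpath);
        pfree(dstpath);
    }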
On Tue, Mar 29, 2022 at 11:55:05AM -0400, Robert Haas wrote:
> On Mon, Mar 28, 2022 at 3:08 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > smgrcreate() as we would for most WAL records or whether it should be adopting the new system introduced by 49d9cfc68bf4e0d32a948fe72d5a0ef7f464944e. I wrote about this concern over here:
> >
> > http://postgr.es/m/CA+TgmoYcUPL+WOJL2ZzhH=zmrhj0iOQ=iCFM0SuYqBbqZEamEg@mail.gmail.com
> >
> > But apart from that question your adaptations here look reasonable to me.
>
> That commit having been reverted, I committed v6 instead. Let's see what breaks...

There's a crash

2022-07-31 01:22:51.437 CDT client backend[13362] [unknown] PANIC: could not open critical system index 2662

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007efe27999801 in __GI_abort () at abort.c:79
#2  0x00005583891941dc in errfinish (filename=<optimized out>, filename@entry=0x558389420437 "relcache.c", lineno=lineno@entry=4328, funcname=funcname@entry=0x558389421680 <__func__.33178> "load_critical_index") at elog.c:675
#3  0x00005583891713ef in load_critical_index (indexoid=indexoid@entry=2662, heapoid=heapoid@entry=1259) at relcache.c:4328
#4  0x0000558389172667 in RelationCacheInitializePhase3 () at relcache.c:4103
#5  0x00005583891b93a4 in InitPostgres (in_dbname=in_dbname@entry=0x55838a50d468 "a", dboid=dboid@entry=0, username=username@entry=0x55838a50d448 "pryzbyj", useroid=useroid@entry=0, load_session_libraries=<optimized out>, override_allow_connections=override_allow_connections@entry=false, out_dbname=0x0) at postinit.c:1087
#6  0x0000558388daa7bb in PostgresMain (dbname=0x55838a50d468 "a", username=username@entry=0x55838a50d448 "pryzbyj") at postgres.c:4081
#7  0x0000558388b9f423 in BackendRun (port=port@entry=0x55838a505dd0) at postmaster.c:4490
#8  0x0000558388ba6e07 in BackendStartup (port=port@entry=0x55838a505dd0) at postmaster.c:4218
#9  0x0000558388ba747f in ServerLoop () at postmaster.c:1808
#10 0x0000558388ba8f93 in PostmasterMain (argc=7, argv=<optimized out>) at postmaster.c:1480
#11 0x0000558388840e1f in main (argc=7, argv=0x55838a4dc000) at main.c:197

while :; do psql -qh /tmp postgres -c "DROP DATABASE a" -c "CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log"; done
# Run this for a few loops and then ^C or hold down ^C until it stops,
# and then connect to postgres and try to connect to 'a':

postgres=# \c a
2022-07-31 01:22:51.437 CDT client backend[13362] [unknown] PANIC: could not open critical system index 2662

Unfortunately, that isn't very consistent, and you have to run it a bunch of times...

I don't know if it's an issue of any significance that CREATE DATABASE / ^C leaves behind a broken database, but it is an issue that the cluster crashes.

While struggling to reproduce that problem, I also hit this warning, which may or may not be the same. I added an abort() after WARNING in aset.c to get a backtrace.

WARNING: problem in alloc set PortalContext: bogus aset link in block 0x55a63f2f9d60, chunk 0x55a63f2fb138

Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No existe el archivo o el directorio.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f81144f1801 in __GI_abort () at abort.c:79
#2  0x000055a63c834c5d in AllocSetCheck (context=context@entry=0x55a63f26fea0) at aset.c:1491
#3  0x000055a63c835b09 in AllocSetDelete (context=0x55a63f26fea0) at aset.c:638
#4  0x000055a63c854322 in MemoryContextDelete (context=0x55a63f26fea0) at mcxt.c:252
#5  0x000055a63c8591d5 in PortalDrop (portal=portal@entry=0x55a63f2bb7a0, isTopCommit=isTopCommit@entry=false) at portalmem.c:596
#6  0x000055a63c3e4a7b in exec_simple_query (query_string=query_string@entry=0x55a63f24db90 "CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log ;") at postgres.c:1253
#7  0x000055a63c3e7fc1 in PostgresMain (dbname=<optimized out>, username=username@entry=0x55a63f279448 "pryzbyj") at postgres.c:4505
#8  0x000055a63c1dc423 in BackendRun (port=port@entry=0x55a63f271dd0) at postmaster.c:4490
#9  0x000055a63c1e3e07 in BackendStartup (port=port@entry=0x55a63f271dd0) at postmaster.c:4218
#10 0x000055a63c1e447f in ServerLoop () at postmaster.c:1808
#11 0x000055a63c1e5f93 in PostmasterMain (argc=7, argv=<optimized out>) at postmaster.c:1480
#12 0x000055a63be7de1f in main (argc=7, argv=0x55a63f248000) at main.c:197

I reproduced that by running this a couple dozen times in an interactive psql. It doesn't seem to affect STRATEGY=file_copy.

SET statement_timeout=0;
DROP DATABASE a;
SET statement_timeout='60ms';
CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log;
\c a
\c postgres

Also, if I understand correctly, this patch seems to assume that nobody is connected to the source database. But what's actually enforced is just that nobody *else* is connected. Is it any issue that the current DB can be used as a source? Anyway, both of the above problems are reproducible using a different database.

|postgres=# CREATE DATABASE new TEMPLATE postgres STRATEGY wal_log;
|CREATE DATABASE

--
Justin
On Tue, Aug 2, 2022 at 1:50 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Unfortunately, that isn't very consistent, and you have to run it a bunch of times...

I was eventually able to reproduce this in part by using the interactive psql method you describe. It didn't crash, but it did spit out a bunch of funny error messages:

postgres=# SET statement_timeout=0; DROP DATABASE a; SET statement_timeout='60ms'; CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log ; \c a \c postgres
SET
ERROR: database "a" does not exist
SET
ERROR: canceling statement due to statement timeout
WARNING: problem in alloc set PortalContext: req size > alloc size for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad size 0 for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad single-chunk 0x7f9950891208 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: found inconsistent memory block 0x7f9950890800
WARNING: problem in alloc set PortalContext: req size > alloc size for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad size 0 for chunk 0x7f99508911f0 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: bad single-chunk 0x7f9950891208 in block 0x7f9950890800
WARNING: problem in alloc set PortalContext: found inconsistent memory block 0x7f9950890800
connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL: database "a" does not exist
Previous connection kept

postgres=# select * from pg_database;
 oid | datname   | datdba | encoding | datlocprovider | datistemplate | datallowconn | datconnlimit | datfrozenxid | datminmxid | dattablespace | datcollate  | datctype    | daticulocale | datcollversion | datacl
-----+-----------+--------+----------+----------------+---------------+--------------+--------------+--------------+------------+---------------+-------------+-------------+--------------+----------------+----------------------------
   5 | postgres  |     10 |        6 | c              | f             | t            |           -1 |          718 |          1 |          1663 | en_US.UTF-8 | en_US.UTF-8 |              |                |
   1 | template1 |     10 |        6 | c              | t             | t            |           -1 |          718 |          1 |          1663 | en_US.UTF-8 | en_US.UTF-8 |              |                | {=c/rhaas,rhaas=CTc/rhaas}
   4 | template0 |     10 |        6 | c              | t             | f            |           -1 |          718 |          1 |          1663 | en_US.UTF-8 | en_US.UTF-8 |              |                | {=c/rhaas,rhaas=CTc/rhaas}
(3 rows)

I then set backtrace_functions='AllocSetCheck' and reproduced it again, which led to stack traces like this:

2022-08-02 16:50:32.490 EDT [98814] WARNING: problem in alloc set PortalContext: bad single-chunk 0x7f9950886608 in block 0x7f9950885c00
2022-08-02 16:50:32.490 EDT [98814] BACKTRACE:
2 postgres 0x000000010cd37ef5 AllocSetCheck + 549
3 postgres 0x000000010cd37730 AllocSetReset + 48
4 postgres 0x000000010cd3f6f1 MemoryContextResetOnly + 81
5 postgres 0x000000010cd378b9 AllocSetDelete + 73
6 postgres 0x000000010cd41e09 PortalDrop + 425
7 postgres 0x000000010cd427bb AtCleanup_Portals + 203
8 postgres 0x000000010c86476d CleanupTransaction + 29
9 postgres 0x000000010c865d4f AbortCurrentTransaction + 63
10 postgres 0x000000010cba1395 PostgresMain + 885
11 postgres 0x000000010caf5472 PostmasterMain + 7586
12 postgres 0x000000010ca31e3d main + 2205
13 libdyld.dylib 0x00007fff699afcc9 start + 1
14 ??? 0x0000000000000001 0x0 + 1

I recompiled with -O0 and hacked the code that emits the BACKTRACE: bit to go into an infinite loop if it's hit, which enabled me to hook up a debugger at the point of the failure.
The debugger says:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x000000010e98a157 postgres`send_message_to_server_log(edata=0x000000010ec0f658) at elog.c:2916:4
    frame #1: 0x000000010e9866d6 postgres`EmitErrorReport at elog.c:1537:3
    frame #2: 0x000000010e986016 postgres`errfinish(filename="aset.c", lineno=1470, funcname="AllocSetCheck") at elog.c:592:2
    frame #3: 0x000000010e9c8465 postgres`AllocSetCheck(context=0x00007ff77c80d200) at aset.c:1469:5
    frame #4: 0x000000010e9c7c05 postgres`AllocSetDelete(context=0x00007ff77c80d200) at aset.c:638:2
    frame #5: 0x000000010e9d368b postgres`MemoryContextDelete(context=0x00007ff77c80d200) at mcxt.c:252:2
  * frame #6: 0x000000010e9d705b postgres`PortalDrop(portal=0x00007ff77e028920, isTopCommit=false) at portalmem.c:596:2
    frame #7: 0x000000010e9d7e0e postgres`AtCleanup_Portals at portalmem.c:907:3
    frame #8: 0x000000010e22030d postgres`CleanupTransaction at xact.c:2890:2
    frame #9: 0x000000010e2219da postgres`AbortCurrentTransaction at xact.c:3328:4
    frame #10: 0x000000010e763237 postgres`PostgresMain(dbname="postgres", username="rhaas") at postgres.c:4232:3
    frame #11: 0x000000010e6625aa postgres`BackendRun(port=0x00007ff77c1042c0) at postmaster.c:4490:2
    frame #12: 0x000000010e661b18 postgres`BackendStartup(port=0x00007ff77c1042c0) at postmaster.c:4218:3
    frame #13: 0x000000010e66088a postgres`ServerLoop at postmaster.c:1808:7
    frame #14: 0x000000010e65def2 postgres`PostmasterMain(argc=1, argv=0x00007ff77ae05cf0) at postmaster.c:1480:11
    frame #15: 0x000000010e50521f postgres`main(argc=1, argv=0x00007ff77ae05cf0) at main.c:197:3
    frame #16: 0x00007fff699afcc9 libdyld.dylib`start + 1
(lldb) fr sel 6
frame #6: 0x000000010e9d705b postgres`PortalDrop(portal=0x00007ff77e028920, isTopCommit=false) at portalmem.c:596:2
   593      MemoryContextDelete(portal->holdContext);
   594
   595      /* release subsidiary storage */
-> 596      MemoryContextDelete(portal->portalContext);
   597
   598      /* release portal struct (it's in TopPortalContext) */
   599      pfree(portal);
(lldb) fr sel 3
frame #3: 0x000000010e9c8465 postgres`AllocSetCheck(context=0x00007ff77c80d200) at aset.c:1469:5
   1466     * Check chunk size
   1467     */
   1468     if (dsize > chsize)
-> 1469         elog(WARNING, "problem in alloc set %s: req size > alloc size for chunk %p in block %p",
   1470              name, chunk, block);
   1471     if (chsize < (1 << ALLOC_MINBITS))
   1472         elog(WARNING, "problem in alloc set %s: bad size %zu for chunk %p in block %p",
(lldb) p dsize
(Size) $3 = 20
(lldb) p chsize
(Size) $4 = 0

It seems like CreateDatabaseUsingWalLog() must be doing something that corrupts PortalContext, but at the moment I'm not sure what that thing could be.

--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > WARNING: problem in alloc set PortalContext: req size > alloc size > for chunk 0x7f99508911f0 in block 0x7f9950890800 > WARNING: problem in alloc set PortalContext: bad size 0 for chunk > 0x7f99508911f0 in block 0x7f9950890800 > WARNING: problem in alloc set PortalContext: bad single-chunk > 0x7f9950891208 in block 0x7f9950890800 > WARNING: problem in alloc set PortalContext: found inconsistent > memory block 0x7f9950890800 > WARNING: problem in alloc set PortalContext: req size > alloc size > for chunk 0x7f99508911f0 in block 0x7f9950890800 > WARNING: problem in alloc set PortalContext: bad size 0 for chunk > 0x7f99508911f0 in block 0x7f9950890800 > WARNING: problem in alloc set PortalContext: bad single-chunk > 0x7f9950891208 in block 0x7f9950890800 > WARNING: problem in alloc set PortalContext: found inconsistent > memory block 0x7f9950890800 This looks like nothing so much as the fallout from something scribbling past the end of an allocated palloc chunk, or perhaps writing on already-freed space. Perhaps running the test case under valgrind would help to finger the culprit. regards, tom lane
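For readers unfamiliar with these warnings: in a MEMORY_CONTEXT_CHECKING build, aset.c keeps a small header in front of every chunk, so writing past the end of one palloc'd chunk scribbles on a neighboring chunk's header, and the damage is only detected later when the set is checked. A contrived sketch of the failure mode, for backend code (do not do this):

    char   *a = palloc(16);
    char   *b = palloc(16);

    memset(a, 'x', 32);     /* overruns a; likely clobbers the header of
                             * the neighboring chunk */

    pfree(b);               /* a later AllocSetCheck() (e.g. at context
                             * reset or delete) then reports "bad size" /
                             * "req size > alloc size" style warnings */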
On Tue, Aug 02, 2022 at 05:46:34PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > WARNING: problem in alloc set PortalContext: req size > alloc size for chunk 0x7f99508911f0 in block 0x7f9950890800
>
> This looks like nothing so much as the fallout from something scribbling past the end of an allocated palloc chunk, or perhaps writing on already-freed space. Perhaps running the test case under valgrind would help to finger the culprit.

Yeah, but my test case is so poor that it's a chore ... (Sorry for that, but it took me 2 days to be able to reproduce the problem, so I sent it sooner rather than looking for a better way ... )

I got this interesting looking thing.

==11628== Invalid write of size 8
==11628==    at 0x1D12B3A: smgrsetowner (smgr.c:213)
==11628==    by 0x1C7C224: RelationGetSmgr (rel.h:572)
==11628==    by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
==11628==    by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
==11628==    by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)
==11628==    by 0x1D27A7C: PortalRun (pquery.c:791)
==11628==    by 0x1D1E33D: exec_simple_query (postgres.c:1243)
==11628==    by 0x1D218BC: PostgresMain (postgres.c:4505)
==11628==  Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 free'd
==11628==    at 0x4033A3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==11628==    by 0x217D7C2: AllocSetReset (aset.c:608)
==11628==    by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
==11628==    by 0x217DBD5: AllocSetDelete (aset.c:654)
==11628==    by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
==11628==    by 0x21A109F: PortalDrop (portalmem.c:596)
==11628==    by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
==11628==    by 0x11FEAB1: CleanupTransaction (xact.c:2890)
==11628==    by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
==11628==    by 0x1D2158C: PostgresMain (postgres.c:4232)
==11628==    by 0x1B15DB5: BackendRun (postmaster.c:4490)
==11628==    by 0x1B1D799: BackendStartup (postmaster.c:4218)
==11628==  Block was alloc'd at
==11628==    at 0x40327F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==11628==    by 0x217F0DC: AllocSetAlloc (aset.c:920)
==11628==    by 0x219E4D2: palloc (mcxt.c:1082)
==11628==    by 0x14A14BE: ScanSourceDatabasePgClassTuple (dbcommands.c:444)
==11628==    by 0x14A1CD8: ScanSourceDatabasePgClassPage (dbcommands.c:384)
==11628==    by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
==11628==    by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)

--
Justin
On 2022-08-02 17:04:16 -0500, Justin Pryzby wrote:
> I got this interesting looking thing.
>
> ==11628== Invalid write of size 8
> ==11628==    at 0x1D12B3A: smgrsetowner (smgr.c:213)
> ==11628==    by 0x1C7C224: RelationGetSmgr (rel.h:572)
> ==11628==    by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
> ==11628==    by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
> ==11628==    by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
> ==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
> ==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> ==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
> ==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> ==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)
> ==11628==    by 0x1D27A7C: PortalRun (pquery.c:791)
> ==11628==    by 0x1D1E33D: exec_simple_query (postgres.c:1243)
> ==11628==    by 0x1D218BC: PostgresMain (postgres.c:4505)
> ==11628==  Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 free'd
> ==11628==    at 0x4033A3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==11628==    by 0x217D7C2: AllocSetReset (aset.c:608)
> ==11628==    by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
> ==11628==    by 0x217DBD5: AllocSetDelete (aset.c:654)
> ==11628==    by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
> ==11628==    by 0x21A109F: PortalDrop (portalmem.c:596)
> ==11628==    by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
> ==11628==    by 0x11FEAB1: CleanupTransaction (xact.c:2890)
> ==11628==    by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
> ==11628==    by 0x1D2158C: PostgresMain (postgres.c:4232)
> ==11628==    by 0x1B15DB5: BackendRun (postmaster.c:4490)
> ==11628==    by 0x1B1D799: BackendStartup (postmaster.c:4218)
> ==11628==  Block was alloc'd at
> ==11628==    at 0x40327F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==11628==    by 0x217F0DC: AllocSetAlloc (aset.c:920)
> ==11628==    by 0x219E4D2: palloc (mcxt.c:1082)
> ==11628==    by 0x14A14BE: ScanSourceDatabasePgClassTuple (dbcommands.c:444)
> ==11628==    by 0x14A1CD8: ScanSourceDatabasePgClassPage (dbcommands.c:384)
> ==11628==    by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
> ==11628==    by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
> ==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
> ==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> ==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
> ==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> ==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)

Ick. That looks like somehow we end up with smgr entries still pointing to fake relcache entries, created in a prior attempt at create database.

Looks like you'd need error trapping to call FreeFakeRelcacheEntry() (or just smgrclearowner()) in case of error.

Or perhaps we can instead prevent the fake relcache entry being set as the owner in the first place?

Why do we even need fake relcache entries here? Looks like all that they're used for is a bunch of RelationGetSmgr() calls? Can't we instead just pass the rnode to smgropen()? Given that we're doing that once for every buffer in the body of RelationCopyStorageUsingBuffer(), doing it in a bunch of other less-frequent places can't be a problem.

Greetings,

Andres Freund
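For context, RelationGetSmgr() -- the frame in the trace above -- looks roughly like this (PG 15 field naming; the master branch of the time had the RelFileLocator renames). It is what installs the relcache entry as the smgr owner, and that owner back-pointer is what smgrsetowner() is updating inside the already-freed fake entry:

    static inline SMgrRelation
    RelationGetSmgr(Relation rel)
    {
        if (unlikely(rel->rd_smgr == NULL))
            smgrsetowner(&(rel->rd_smgr),
                         smgropen(rel->rd_node, rel->rd_backend));
        return rel->rd_smgr;
    }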
On Wed, Aug 3, 2022 at 3:53 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2022-08-02 17:04:16 -0500, Justin Pryzby wrote:
> > I got this interesting looking thing.
> >
> > ==11628== Invalid write of size 8
> > ==11628==    at 0x1D12B3A: smgrsetowner (smgr.c:213)
> > ==11628==    by 0x1C7C224: RelationGetSmgr (rel.h:572)
> > ==11628==    by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
> > ==11628==    by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
> > ==11628==    by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
> > ==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
> > ==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> > ==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
> > ==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> > ==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)
> > ==11628==    by 0x1D27A7C: PortalRun (pquery.c:791)
> > ==11628==    by 0x1D1E33D: exec_simple_query (postgres.c:1243)
> > ==11628==    by 0x1D218BC: PostgresMain (postgres.c:4505)
> > ==11628==  Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 free'd
> > ==11628==    at 0x4033A3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> > ==11628==    by 0x217D7C2: AllocSetReset (aset.c:608)
> > ==11628==    by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
> > ==11628==    by 0x217DBD5: AllocSetDelete (aset.c:654)
> > ==11628==    by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
> > ==11628==    by 0x21A109F: PortalDrop (portalmem.c:596)
> > ==11628==    by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
> > ==11628==    by 0x11FEAB1: CleanupTransaction (xact.c:2890)
> > ==11628==    by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
> > ==11628==    by 0x1D2158C: PostgresMain (postgres.c:4232)
> > ==11628==    by 0x1B15DB5: BackendRun (postmaster.c:4490)
> > ==11628==    by 0x1B1D799: BackendStartup (postmaster.c:4218)
> > ==11628==  Block was alloc'd at
> > ==11628==    at 0x40327F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> > ==11628==    by 0x217F0DC: AllocSetAlloc (aset.c:920)
> > ==11628==    by 0x219E4D2: palloc (mcxt.c:1082)
> > ==11628==    by 0x14A14BE: ScanSourceDatabasePgClassTuple (dbcommands.c:444)
> > ==11628==    by 0x14A1CD8: ScanSourceDatabasePgClassPage (dbcommands.c:384)
> > ==11628==    by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
> > ==11628==    by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
> > ==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
> > ==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> > ==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
> > ==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> > ==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)
>
> Ick. That looks like somehow we end up with smgr entries still pointing to fake relcache entries, created in a prior attempt at create database.

The surprising thing is how the smgr entry survived the transaction abort; AtEOXact_SMgr should have closed the smgr and removed it from the smgr cache.

> Looks like you'd need error trapping to call FreeFakeRelcacheEntry() (or just smgrclearowner()) in case of error.
>
> Or perhaps we can instead prevent the fake relcache entry being set as the owner in the first place?
>
> Why do we even need fake relcache entries here? Looks like all that they're used for is a bunch of RelationGetSmgr() calls? Can't we instead just pass the rnode to smgropen()?
> Given that we're doing that once for every buffer in the body of RelationCopyStorageUsingBuffer(), doing it in a bunch of other less-frequent places can't be a problem.

I think in some of the previous versions of the patch we were using smgropen(), but we changed it so that we do not reuse the smgr after it gets removed during interrupt processing; see the discussion here[1].

[1] https://www.postgresql.org/message-id/CA%2BTgmoYKovODW2Y7rQmmRFaKu445p9uAahjpgfbY8eyeL07BXA%40mail.gmail.com

From the Valgrind report, it is clear that we are getting an smgr entry whose smgr->smgr_owner points into the fake relcache entry. So I am investigating further how it is possible for an smgr created during a previous CREATE DATABASE attempt to survive beyond the transaction abort.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 3, 2022 at 11:28 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Aug 3, 2022 at 3:53 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2022-08-02 17:04:16 -0500, Justin Pryzby wrote:
> > > I got this interesting looking thing.
> > >
> > > ==11628== Invalid write of size 8
> > > ==11628==    at 0x1D12B3A: smgrsetowner (smgr.c:213)
> > > ==11628==    by 0x1C7C224: RelationGetSmgr (rel.h:572)
> > > ==11628==    by 0x1C7C224: RelationCopyStorageUsingBuffer (bufmgr.c:3725)
> > > ==11628==    by 0x1C7C7A6: CreateAndCopyRelationData (bufmgr.c:3817)
> > > ==11628==    by 0x14A4518: CreateDatabaseUsingWalLog (dbcommands.c:221)
> > > ==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
> > > ==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> > > ==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
> > > ==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> > > ==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)
> > > ==11628==    by 0x1D27A7C: PortalRun (pquery.c:791)
> > > ==11628==    by 0x1D1E33D: exec_simple_query (postgres.c:1243)
> > > ==11628==    by 0x1D218BC: PostgresMain (postgres.c:4505)
> > > ==11628==  Address 0x1025bc18 is 2,712 bytes inside a block of size 8,192 free'd
> > > ==11628==    at 0x4033A3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> > > ==11628==    by 0x217D7C2: AllocSetReset (aset.c:608)
> > > ==11628==    by 0x219B57A: MemoryContextResetOnly (mcxt.c:181)
> > > ==11628==    by 0x217DBD5: AllocSetDelete (aset.c:654)
> > > ==11628==    by 0x219C1EC: MemoryContextDelete (mcxt.c:252)
> > > ==11628==    by 0x21A109F: PortalDrop (portalmem.c:596)
> > > ==11628==    by 0x21A269C: AtCleanup_Portals (portalmem.c:907)
> > > ==11628==    by 0x11FEAB1: CleanupTransaction (xact.c:2890)
> > > ==11628==    by 0x120A74C: AbortCurrentTransaction (xact.c:3328)
> > > ==11628==    by 0x1D2158C: PostgresMain (postgres.c:4232)
> > > ==11628==    by 0x1B15DB5: BackendRun (postmaster.c:4490)
> > > ==11628==    by 0x1B1D799: BackendStartup (postmaster.c:4218)
> > > ==11628==  Block was alloc'd at
> > > ==11628==    at 0x40327F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
> > > ==11628==    by 0x217F0DC: AllocSetAlloc (aset.c:920)
> > > ==11628==    by 0x219E4D2: palloc (mcxt.c:1082)
> > > ==11628==    by 0x14A14BE: ScanSourceDatabasePgClassTuple (dbcommands.c:444)
> > > ==11628==    by 0x14A1CD8: ScanSourceDatabasePgClassPage (dbcommands.c:384)
> > > ==11628==    by 0x14A20BF: ScanSourceDatabasePgClass (dbcommands.c:322)
> > > ==11628==    by 0x14A4348: CreateDatabaseUsingWalLog (dbcommands.c:177)
> > > ==11628==    by 0x14AB009: createdb (dbcommands.c:1393)
> > > ==11628==    by 0x1D2B9AF: standard_ProcessUtility (utility.c:776)
> > > ==11628==    by 0x1D2C46A: ProcessUtility (utility.c:530)
> > > ==11628==    by 0x1D265F5: PortalRunUtility (pquery.c:1158)
> > > ==11628==    by 0x1D27089: PortalRunMulti (pquery.c:1315)
> >
> > Ick. That looks like somehow we end up with smgr entries still pointing to fake relcache entries, created in a prior attempt at create database.
>
> The surprising thing is how the smgr entry survived the transaction abort; AtEOXact_SMgr should have closed the smgr and removed it from the smgr cache.

Okay, so AtEOXact_SMgr will only get rid of unowned smgrs, and ours are owned by a fake relcache entry whose lifetime is just the portal memory context, which goes away at the transaction end. So, as Andres suggested, the options could be: a) we catch the error and do FreeFakeRelcacheEntry, or
b) directly use smgropen instead of RelationGetSmgr because actually, we do not want the owner to be set for these smgrs. I think option b) looks better to me, I will prepare a patch and test whether the error goes away with that or not. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 3, 2022 at 12:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Okay, so AtEOXact_SMgr will only get rid of unowned smgr and ours are > owned by a fake relcache and whose lifetime is just portal memory > context which will go away at the transaction end. So as Andres > suggested options could be that a) we catch the error and do > FreeFakeRelcacheEntry. b) directly use smgropen instead of > RelationGetSmgr because actually, we do not want the owner to be set > for these smgrs. > > I think option b) looks better to me, I will prepare a patch and test > whether the error goes away with that or not. > Here is the patch which directly uses smgropen instead of using fake relcache entry. We don't preserve the smgr pointer and whenever required we again call the smgropen. With this patch it resolved the problem for me at least what I was able to reproduce. I was able to reproduce the WARNING messages that Robert got as well as the valgrind error that Justin got and with this patch both are resolved. @Justin can you help in verifying the original issue? Another alternative could be that continue using fake relcache entry but instead of RelationGetSmgr() create some new function which doesn't set the owner in the smgr. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Aug 3, 2022 at 1:41 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Wed, Aug 3, 2022 at 12:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > Okay, so AtEOXact_SMgr will only get rid of unowned smgr and ours are > > owned by a fake relcache and whose lifetime is just portal memory > > context which will go away at the transaction end. So as Andres > > suggested options could be that a) we catch the error and do > > FreeFakeRelcacheEntry. b) directly use smgropen instead of > > RelationGetSmgr because actually, we do not want the owner to be set > > for these smgrs. > > > > I think option b) looks better to me, I will prepare a patch and test > > whether the error goes away with that or not. > > > > Here is the patch which directly uses smgropen instead of using fake > relcache entry. We don't preserve the smgr pointer and whenever > required we again call the smgropen. > > With this patch it resolved the problem for me at least what I was > able to reproduce. I was able to reproduce the WARNING messages that > Robert got as well as the valgrind error that Justin got and with this > patch both are resolved. Another version of the patch which closes the smgr at the end using smgrcloserellocator() and I have also added a commit message. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Another version of the patch which closes the smgr at the end using > smgrcloserellocator() and I have also added a commit message. Hmm, but didn't we decide against doing it that way intentionally? The comment you're deleting says "If we didn't do this and used the smgr layer directly, we would have to worry about invalidations." -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, 3 Aug 2022 at 9:22 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Another version of the patch which closes the smgr at the end using
> smgrcloserellocator() and I have also added a commit message.
Hmm, but didn't we decide against doing it that way intentionally? The
comment you're deleting says "If we didn't do this and used the smgr
layer directly, we would have to worry about invalidations."
I think we only need to worry if we keep the smgr reference around and try to reuse it. But in this patch I am not keeping the reference to the smgr.
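Concretely, instead of caching the pointer, the patch re-resolves it on every use, along these lines (a minimal sketch; smgropen() simply recreates the cache entry if interrupt processing closed it):

    /* look the smgr up again each time instead of keeping a reference */
    nblocks = smgrnblocks(smgropen(rlocator, InvalidBackendId), MAIN_FORKNUM);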
—
Dilip
On Wed, Aug 03, 2022 at 04:45:23PM +0530, Dilip Kumar wrote: > Another version of the patch which closes the smgr at the end using > smgrcloserellocator() and I have also added a commit message. Thanks for providing a patch. This seems to fix the second problem with accessing freed memory. But I reproduced the first problem with a handful of tries interrupting the while loop: 2022-08-03 10:39:50.129 CDT client backend[5530] [unknown] PANIC: could not open critical system index 2662 In the failure, when trying to connect to the new "a" DB, it does this: [pid 10700] openat(AT_FDCWD, "base/17003/pg_filenode.map", O_RDONLY) = 11 [pid 10700] read(11, "\27'Y\0\21\0\0\0\353\4\0\0\353\4\0\0\341\4\0\0\341\4\0\0\347\4\0\0\347\4\0\0\337\4\0\0\337\4\0\0\24\v\0\0\24\v\0\0\25\v\0\0\25\v\0\0K\20\0\0K\20\0\0L\20\0\0L\20\0\0\202\n\0\0\202\n\0\0\203\n\0\0\203\n\0\0\217\n\0\0\217\n\0\0\220\n\0\0\220\n\0\0b\n\0\0b\n\0\0c\n\0\0c\n\0\0f\n\0\0f\n\0\0g\n\0\0g\n\0\0\177\r\0\0\177\r\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\362\366\252\337", 524)= 524 [pid 10700] close(11) = 0 [pid 10700] openat(AT_FDCWD, "base/17003/pg_internal.init", O_RDONLY) = -1 ENOENT (No such file or directory) [pid 10700] openat(AT_FDCWD, "base/17003/1259", O_RDWR) = 11 [pid 10700] lseek(11, 0, SEEK_END) = 106496 [pid 10700] lseek(11, 0, SEEK_END) = 106496 And then reads nothing but zero bytes from FD 11 (rel 1259/pg_class) So far, I haven't succeeded in eliciting anything useful from valgrind. -- Justin
On Wed, Aug 3, 2022 at 9:32 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Wed, Aug 03, 2022 at 04:45:23PM +0530, Dilip Kumar wrote: > > Another version of the patch which closes the smgr at the end using > > smgrcloserellocator() and I have also added a commit message. > > Thanks for providing a patch. > This seems to fix the second problem with accessing freed memory. Thanks for the confirmation. > But I reproduced the first problem with a handful of tries interrupting the > while loop: > > 2022-08-03 10:39:50.129 CDT client backend[5530] [unknown] PANIC: could not open critical system index 2662 > > In the failure, when trying to connect to the new "a" DB, it does this: > > [pid 10700] openat(AT_FDCWD, "base/17003/pg_filenode.map", O_RDONLY) = 11 > [pid 10700] read(11, "\27'Y\0\21\0\0\0\353\4\0\0\353\4\0\0\341\4\0\0\341\4\0\0\347\4\0\0\347\4\0\0\337\4\0\0\337\4\0\0\24\v\0\0\24\v\0\0\25\v\0\0\25\v\0\0K\20\0\0K\20\0\0L\20\0\0L\20\0\0\202\n\0\0\202\n\0\0\203\n\0\0\203\n\0\0\217\n\0\0\217\n\0\0\220\n\0\0\220\n\0\0b\n\0\0b\n\0\0c\n\0\0c\n\0\0f\n\0\0f\n\0\0g\n\0\0g\n\0\0\177\r\0\0\177\r\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\362\366\252\337", 524)= 524 > [pid 10700] close(11) = 0 > [pid 10700] openat(AT_FDCWD, "base/17003/pg_internal.init", O_RDONLY) = -1 ENOENT (No such file or directory) > [pid 10700] openat(AT_FDCWD, "base/17003/1259", O_RDWR) = 11 > [pid 10700] lseek(11, 0, SEEK_END) = 106496 > [pid 10700] lseek(11, 0, SEEK_END) = 106496 > > And then reads nothing but zero bytes from FD 11 (rel 1259/pg_class) > > So far, I haven't succeeded in eliciting anything useful from valgrind. I tried multiple times but had no luck with reproducing this issue. Do you have some logs to know that just before ^C what was the last query executed and whether it got canceled or executed completely? Because theoretically, if the create database failed anywhere in between then it should at least clean the directory of that newly created database but seems there are some corrupted data in that directory so seems it is not symptoms of just the create database failure but some combination of multiple things. I will put more thought into this tomorrow. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 03, 2022 at 11:02:00AM -0500, Justin Pryzby wrote:
> But I reproduced the first problem with a handful of tries interrupting the
> while loop:
>
> 2022-08-03 10:39:50.129 CDT client backend[5530] [unknown] PANIC: could not open critical system index 2662
...
> So far, I haven't succeeded in eliciting anything useful from valgrind.

Now, I've reproduced the problem under valgrind, but it doesn't show anything useful:

pryzbyj@pryzbyj:~$ while :; do psql -h /tmp template1 -c "DROP DATABASE a" -c "CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log"; done
ERROR: database "a" does not exist
CREATE DATABASE
^CCancel request sent
ERROR: canceling statement due to user request
ERROR: database "a" already exists
^C
pryzbyj@pryzbyj:~$ ^C
pryzbyj@pryzbyj:~$ ^C
pryzbyj@pryzbyj:~$ ^C
pryzbyj@pryzbyj:~$ psql -h /tmp a -c ''
2022-08-03 11:57:39.178 CDT client backend[31321] [unknown] PANIC: could not open critical system index 2662
psql: error: falló la conexión al servidor en el socket «/tmp/.s.PGSQL.5432»: PANIC: could not open critical system index 2662

On the server process, nothing interesting but the backtrace (the error was before this, while writing the relation file, but there's nothing suspicious).

2022-08-03 11:08:06.628 CDT client backend[2841] [unknown] PANIC: could not open critical system index 2662
==2841==
==2841== Process terminating with default action of signal 6 (SIGABRT)
==2841== at 0x5FBBE97: raise (raise.c:51)
==2841== by 0x5FBD800: abort (abort.c:79)
==2841== by 0x2118DEF: errfinish (elog.c:675)
==2841== by 0x20F6002: load_critical_index (relcache.c:4328)
==2841== by 0x20F727A: RelationCacheInitializePhase3 (relcache.c:4103)
==2841== by 0x213DFA5: InitPostgres (postinit.c:1087)
==2841== by 0x1D20D72: PostgresMain (postgres.c:4081)
==2841== by 0x1B15CFD: BackendRun (postmaster.c:4490)
==2841== by 0x1B1D6E1: BackendStartup (postmaster.c:4218)
==2841== by 0x1B1DD59: ServerLoop (postmaster.c:1808)
==2841== by 0x1B1F86D: PostmasterMain (postmaster.c:1480)
==2841== by 0x17B7150: main (main.c:197)

Below, I reproduced it without valgrind (and without LANG):

pryzbyj@pryzbyj:~/src/postgres$ while :; do psql -qh /tmp template1 -c "DROP DATABASE a" -c "CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log"; done
2022-08-03 11:59:52.675 CDT checkpointer[1881] LOG: checkpoint starting: immediate force wait
2022-08-03 11:59:52.862 CDT checkpointer[1881] LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.045 s, sync=0.038 s, total=0.188 s; sync files=3, longest=0.019 s, average=0.013 s; distance=3 kB, estimate=3 kB; lsn=0/24862508, redo lsn=0/248624D0
2022-08-03 11:59:53.213 CDT checkpointer[1881] LOG: checkpoint starting: immediate force wait
2022-08-03 11:59:53.409 CDT checkpointer[1881] LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.030 s, sync=0.054 s, total=0.196 s; sync files=4, longest=0.029 s, average=0.014 s; distance=4042 kB, estimate=4042 kB; lsn=0/24C54D88, redo lsn=0/24C54D50
^CCancel request sent
2022-08-03 11:59:53.750 CDT checkpointer[1881] LOG: checkpoint starting: immediate force wait
2022-08-03 11:59:53.930 CDT checkpointer[1881] LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.029 s, sync=0.027 s, total=0.181 s; sync files=4, longest=0.022 s, average=0.007 s; distance=4042 kB, estimate=4042 kB; lsn=0/250476D0, redo lsn=0/25047698
2022-08-03 11:59:54.270 CDT checkpointer[1881] LOG: checkpoint starting: immediate force wait
^C2022-08-03 11:59:54.294 CDT client backend[1903] psql ERROR: canceling statement due to user request
2022-08-03 11:59:54.294 CDT client backend[1903] psql STATEMENT: DROP DATABASE a
Cancel request sent
ERROR: canceling statement due to user request
2022-08-03 11:59:54.296 CDT client backend[1903] psql ERROR: database "a" already exists
2022-08-03 11:59:54.296 CDT client backend[1903] psql STATEMENT: CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log
ERROR: database "a" already exists
^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ 2022-08-03 11:59:54.427 CDT checkpointer[1881] LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.024 s, sync=0.036 s, total=0.158 s; sync files=4, longest=0.027 s, average=0.009 s; distance=4042 kB, estimate=4042 kB; lsn=0/2543A108, redo lsn=0/2543A0A8
^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ ^C
pryzbyj@pryzbyj:~/src/postgres$ psql -h /tmp a -c ''
2022-08-03 11:59:56.617 CDT client backend[1914] [unknown] PANIC: could not open critical system index 2662
Hi,

On 2022-08-03 12:01:18 -0500, Justin Pryzby wrote:
> Now, I've reproduced the problem under valgrind, but it doesn't show anything
> useful

Yea, that looks like an issue on a different level.

> pryzbyj@pryzbyj:~$ while :; do psql -h /tmp template1 -c "DROP DATABASE a" -c "CREATE DATABASE a TEMPLATE postgres STRATEGY wal_log"; done
> ERROR: database "a" does not exist
> CREATE DATABASE
> ^CCancel request sent
> ERROR: canceling statement due to user request
> ERROR: database "a" already exists
> ^C

Hm. This looks more like an issue of DROP DATABASE not being interruptible. I suspect this isn't actually related to STRATEGY wal_log and could likely be reproduced in older versions too.

It's pretty obvious that dropdb() isn't safe against being interrupted. We delete the data before we have committed the deletion of the pg_database entry.

Seems like we should hold interrupts across the remove_dbtablespaces() until *after* we've committed the transaction?

Greetings, Andres Freund
On Wed, Aug 03, 2022 at 11:26:43AM -0700, Andres Freund wrote: > Hm. This looks more like an issue of DROP DATABASE not being interruptible. I > suspect this isn't actually related to STRATEGY wal_log and could likely be > reproduced in older versions too. I couldn't reproduce it with file_copy, but my recipe isn't exactly reliable. That may just mean that it's easier to hit now. -- Justin
On Thu, Aug 4, 2022 at 12:18 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Wed, Aug 03, 2022 at 11:26:43AM -0700, Andres Freund wrote:
> > Hm. This looks more like an issue of DROP DATABASE not being interruptible. I
> > suspect this isn't actually related to STRATEGY wal_log and could likely be
> > reproduced in older versions too.
>
> I couldn't reproduce it with file_copy, but my recipe isn't exactly reliable.
> That may just mean that it's easier to hit now.

I think this looks like a problem with DROP DATABASE, but IMHO you are seeing this behavior only when the database was created using WAL_LOG, because in this strategy we use shared buffers to write the destination database pages, so some dirty buffers and sync requests might still be pending. Now, when we try to drop the database, it drops all the dirty buffers and all pending sync requests, and then, before it actually removes the directory, it gets interrupted; you are left with a database directory on disk which is partially corrupted. See the sequence in dropdb() below:

dropdb()
{
...
    DropDatabaseBuffers(db_id);
...
    ForgetDatabaseSyncRequests(db_id);
...
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);

    WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
    -- inside this it can process the cancel query and get interrupted
    remove_dbtablespaces(db_id);
..
}

I reproduced the same error by inducing an error just before WaitForProcSignalBarrier.

postgres[14968]=# CREATE DATABASE a STRATEGY WAL_LOG ; drop database a;
CREATE DATABASE
ERROR: XX000: test error
LOCATION: dropdb, dbcommands.c:1684
postgres[14968]=# \c a
connection to server on socket "/tmp/.s.PGSQL.5432" failed: PANIC: could not open critical system index 2662
Previous connection kept
postgres[14968]=#

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
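For reference, the error injection behind the session above is nothing more than a hypothetical elog() placed in dropdb(), roughly like this (a sketch for testing only, not a proposed change):

    /* inside dropdb(), after the buffers and sync requests are discarded */
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
    elog(ERROR, "test error");      /* injected to simulate an interrupt here */
    WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
    remove_dbtablespaces(db_id);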
On Thu, Aug 4, 2022 at 9:41 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Aug 4, 2022 at 12:18 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> >
> > On Wed, Aug 03, 2022 at 11:26:43AM -0700, Andres Freund wrote:
> > > Hm. This looks more like an issue of DROP DATABASE not being interruptible. I
> > > suspect this isn't actually related to STRATEGY wal_log and could likely be
> > > reproduced in older versions too.
> >
> > I couldn't reproduce it with file_copy, but my recipe isn't exactly reliable.
> > That may just mean that it's easier to hit now.
>
> I think this looks like a problem with DROP DATABASE, but IMHO you are
> seeing this behavior only when the database was created using WAL_LOG,
> because in this strategy we use shared buffers to write the destination
> database pages, so some dirty buffers and sync requests might still be
> pending. Now, when we try to drop the database, it drops all the dirty
> buffers and all pending sync requests, and then, before it actually
> removes the directory, it gets interrupted; you are left with a database
> directory on disk which is partially corrupted. See the sequence in
> dropdb() below:
>
> dropdb()
> {
> ...
>     DropDatabaseBuffers(db_id);
> ...
>     ForgetDatabaseSyncRequests(db_id);
> ...
>     RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
>
>     WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
>     -- inside this it can process the cancel query and get interrupted
>     remove_dbtablespaces(db_id);
> ..
> }
>
> I reproduced the same error by inducing an error just before
> WaitForProcSignalBarrier.
>
> postgres[14968]=# CREATE DATABASE a STRATEGY WAL_LOG ; drop database a;
> CREATE DATABASE
> ERROR: XX000: test error
> LOCATION: dropdb, dbcommands.c:1684
> postgres[14968]=# \c a
> connection to server on socket "/tmp/.s.PGSQL.5432" failed: PANIC:
> could not open critical system index 2662
> Previous connection kept
> postgres[14968]=#

So basically, from this we can say it is entirely a problem with DROP DATABASE; I can produce any of these behaviors by interrupting it:

1. If we created some tables/inserted data and the DROP DATABASE got cancelled, the database directory might still be there, but those objects are lost.
2. Or the database directory can get removed and then the command gets cancelled before deleting the pg_database entry; then too you end up with a corrupted database, no matter whether it was created with WAL_LOG or FILE_COPY.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Another version of the patch which closes the smgr at the end using > smgrcloserellocator() and I have also added a commit message. I have reviewed this patch and I don't see a problem with it. However, it would be nice if Andres or someone else who understands this area well (Tom? Thomas?) would also review it, because I also reviewed what's in the tree now and that turns out to be buggy, which leads me to conclude that I don't understand this area as well as would be desirable. I'm inclined to hold off on committing this until next week, not only for that reason, but also because there's a wrap planned on Monday, and committing anything now seems like it might have too much of a risk of upsetting that plan. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Aug 04, 2022 at 04:07:01PM -0400, Robert Haas wrote: > I'm inclined to hold off on committing this until next week, not only +1 I don't see any reason to hurry to fix problems that occur when DROP DATABASE is interrupted. Sorry to beat up your patches so much and for such crappy test cases^C -- Justin
Justin Pryzby <pryzby@telsasoft.com> writes: > On Thu, Aug 04, 2022 at 04:07:01PM -0400, Robert Haas wrote: >> I'm inclined to hold off on committing this until next week, not only > +1 +1 ... there are some other v15 open items that I don't think we'll see fixed for beta3, either. regards, tom lane
Hi, On 2022-08-03 16:45:23 +0530, Dilip Kumar wrote: > Another version of the patch which closes the smgr at the end using > smgrcloserellocator() and I have also added a commit message. What's the motivation behind the explicit close? > @@ -258,8 +258,8 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath) > Page page; > List *rlocatorlist = NIL; > LockRelId relid; > - Relation rel; > Snapshot snapshot; > + SMgrRelation smgr; > BufferAccessStrategy bstrategy; > > /* Get pg_class relfilenumber. */ > @@ -276,16 +276,9 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath) > rlocator.dbOid = dbid; > rlocator.relNumber = relfilenumber; > > - /* > - * We can't use a real relcache entry for a relation in some other > - * database, but since we're only going to access the fields related to > - * physical storage, a fake one is good enough. If we didn't do this and > - * used the smgr layer directly, we would have to worry about > - * invalidations. > - */ > - rel = CreateFakeRelcacheEntry(rlocator); > - nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM); > - FreeFakeRelcacheEntry(rel); > + smgr = smgropen(rlocator, InvalidBackendId); > + nblocks = smgrnblocks(smgr, MAIN_FORKNUM); > + smgrclose(smgr); Why are you opening and then closing again? Part of the motivation for the question is that a local SMgrRelation variable may lead to it being used further, opening up interrupt processing issues. > + rlocator.locator = src_rlocator; > + smgrcloserellocator(rlocator); > + > + rlocator.locator = dst_rlocator; > + smgrcloserellocator(rlocator); As mentioned above, it's not clear to me why this is now done... Otherwise looks good to me. Greetings, Andres Freund
Hi, On 2022-08-04 16:07:01 -0400, Robert Haas wrote: > On Wed, Aug 3, 2022 at 7:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Another version of the patch which closes the smgr at the end using > > smgrcloserellocator() and I have also added a commit message. > > I have reviewed this patch and I don't see a problem with it. However, > it would be nice if Andres or someone else who understands this area > well (Tom? Thomas?) would also review it, because I also reviewed > what's in the tree now and that turns out to be buggy, which leads me > to conclude that I don't understand this area as well as would be > desirable. I don't think this issue is something I'd have caught "originally" either. It's probably more of a "fake relcache entry" issue (or undocumented limitation) than a bug in the new code. I did a quick review upthread - some minor quibbles aside, I think it looks good. > I'm inclined to hold off on committing this until next week, not only > for that reason, but also because there's a wrap planned on Monday, > and committing anything now seems like it might have too much of a > risk of upsetting that plan. Makes sense. Unlikely to be a blocker for anybody. Greetings, Andres Freund
Hi, On 2022-08-04 16:38:35 +0530, Dilip Kumar wrote: > So basically, from this we can say it is completely a problem with > drop databases, I mean I can produce any behavior by interrupting drop > database > 1. If we created some tables/inserted data and the drop database got > cancelled, it might have a database directory and those objects are > lost. > 2. Or you can even drop the database directory and then get cancelled > before deleting the pg_database entry then also you will end up with a > corrupted database, doesn't matter whether you created it with WAL LOG > or FILE COPY. Yea. I think at the very least we need to start holding interrupts before the DropDatabaseBuffers() and do so until commit. That's probably best done by doing the transaction commit inside dropdb. Greetings, Andres Freund
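A minimal sketch of that shape (an assumption about a possible fix, not a tested patch) would wrap the destructive steps of dropdb() like this:

    /* sketch only: keep the data-discarding steps uninterruptible */
    HOLD_INTERRUPTS();
    DropDatabaseBuffers(db_id);
    ForgetDatabaseSyncRequests(db_id);
    RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT);
    WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
    remove_dbtablespaces(db_id);
    /* ... commit the transaction internally here, so no cancel can land
     * between discarding the data and making the catalog change durable */
    RESUME_INTERRUPTS();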
Robert Haas <robertmhaas@gmail.com> writes: > I have reviewed this patch and I don't see a problem with it. However, > it would be nice if Andres or someone else who understands this area > well (Tom? Thomas?) would also review it, because I also reviewed > what's in the tree now and that turns out to be buggy, which leads me > to conclude that I don't understand this area as well as would be > desirable. FWIW, I approve of getting rid of the use of CreateFakeRelcacheEntry here, because I do not think that mechanism is meant to be used outside of WAL replay. However, this patch fails to remove it from CreateAndCopyRelationData, which seems likely to be just as much at risk. The "invalidation" comment bothered me for awhile, but I think it's fine: we know that no other backend can connect to the source DB because we have it locked, and we know that no other backend can connect to the destination DB because it doesn't exist yet according to the catalogs, so nothing could possibly occur to invalidate our idea of where the physical files are. It would be nice to document these assumptions, though, rather than merely remove all the relevant commentary. While I'm at it, I would like to strenuously object to the current framing of CreateAndCopyRelationData as a general-purpose copying mechanism. Because of the above assumptions, I think it's utterly unsafe to use anywhere except in CREATE DATABASE. The header comment fails to warn about that at all, and placing it in bufmgr.c rather than static in dbcommands.c is just an invitation to future misuse. Perhaps I'm overly sensitive to that because I just finished cleaning up somebody's misuse of non-general-purpose code (1aa8dad41), but as this stands I think it's positively dangerous. regards, tom lane
Andres Freund <andres@anarazel.de> writes: > Yea. I think at the very least we need to start holding interrupts before the > DropDatabaseBuffers() and do so until commit. That's probably best done by > doing the transaction commit inside dropdb. We've talked before about ignoring interrupts across commit, but I find the idea a bit scary. In any case, DROP DATABASE is far from the only place with a problem. regards, tom lane
On Thu, Aug 04, 2022 at 06:02:50PM -0400, Tom Lane wrote: > The "invalidation" comment bothered me for awhile, but I think it's > fine: we know that no other backend can connect to the source DB > because we have it locked, About that - is it any problem that the currently-connected db can be used as a template? It's no issue for 2-phase commit, because "create database" cannot run in an txn. -- Justin
Hi, On 2022-08-04 18:05:25 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Yea. I think at the very least we need to start holding interrupts before the > > DropDatabaseBuffers() and do so until commit. That's probably best done by > > doing the transaction commit inside dropdb. > > We've talked before about ignoring interrupts across commit, but > I find the idea a bit scary. I'm not actually suggesting to do so across commit, just until the commit. Maintaining that seems easiest if dropdb() does the commit internally. > In any case, DROP DATABASE is far from the only place with a problem. What other place has a database corrupting potential of this magnitude just because interrupts are accepted? We throw valid s_b contents away and then accept interrupts before committing - with predictable results. We also accept interrupts as part of deleting the db data dir (due to catalog access). Greetings, Andres Freund
I wrote:
> While I'm at it, I would like to strenuously object to the current
> framing of CreateAndCopyRelationData as a general-purpose copying
> mechanism.

And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer not completely broken?

        /* Read block from source relation. */
        srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
                                           RBM_NORMAL, bstrategy_src,
                                           permanent);
        srcPage = BufferGetPage(srcBuf);
        if (PageIsNew(srcPage) || PageIsEmpty(srcPage))
        {
            ReleaseBuffer(srcBuf);
            continue;
        }

        /* Use P_NEW to extend the destination relation. */
        dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
                                           RBM_NORMAL, bstrategy_dst,
                                           permanent);

You can't skip pages just because they are empty. Well, maybe you could if you were doing something to ensure that you zero-fill the corresponding blocks on the destination side. But this isn't doing that. It's using P_NEW for dstBuf, which will have the effect of silently collapsing out such pages. Maybe in isolation a heap could withstand that, but its indexes won't be happy (and I guess t_ctid chain links won't either).

I think you should just lose the if() stanza. There's no optimization to be had here that's worth any extra complication.

(This seems worth fixing before beta3, as it looks like a rather nasty data corruption hazard.)

regards, tom lane
I wrote: > And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer > not completely broken? [pile^2] Also, what is the rationale for locking the target buffer but not the source buffer? That seems pretty hard to justify from here, even granting the assumption that we don't expect any other processes to be interested in these buffers (which I don't grant, because checkpointer). regards, tom lane
Hi, On 2022-08-04 19:01:06 -0400, Tom Lane wrote: > And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer > not completely broken? > > /* Read block from source relation. */ > srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno, > RBM_NORMAL, bstrategy_src, > permanent); > srcPage = BufferGetPage(srcBuf); > if (PageIsNew(srcPage) || PageIsEmpty(srcPage)) > { > ReleaseBuffer(srcBuf); > continue; > } > > /* Use P_NEW to extend the destination relation. */ > dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW, > RBM_NORMAL, bstrategy_dst, > permanent); > > You can't skip pages just because they are empty. Well, maybe you could > if you were doing something to ensure that you zero-fill the corresponding > blocks on the destination side. But this isn't doing that. It's using > P_NEW for dstBuf, which will have the effect of silently collapsing out > such pages. Maybe in isolation a heap could withstand that, but its > indexes won't be happy (and I guess t_ctid chain links won't either). > > I think you should just lose the if() stanza. There's no optimization to > be had here that's worth any extra complication. > > (This seems worth fixing before beta3, as it looks like a rather > nasty data corruption hazard.) Ugh, yes. And even with this fixed I think this should grow at least an assertion that the block numbers match, probably even an elog. Greetings, Andres
Andres Freund <andres@anarazel.de> writes: > On 2022-08-04 18:05:25 -0400, Tom Lane wrote: >> In any case, DROP DATABASE is far from the only place with a problem. > What other place has a database corrupting potential of this magnitude just > because interrupts are accepted? We throw valid s_b contents away and then > accept interrupts before committing - with predictable results. We also accept > interrupts as part of deleting the db data dir (due to catalog access). Those things would be better handled by moving the data-discarding steps to post-commit. Maybe that argues for having an internal commit halfway through DROP DATABASE: remove pg_database row, commit, start new transaction, clean up. regards, tom lane
Andres Freund <andres@anarazel.de> writes: > On 2022-08-04 19:01:06 -0400, Tom Lane wrote: >> (This seems worth fixing before beta3, as it looks like a rather >> nasty data corruption hazard.) > Ugh, yes. And even with this fixed I think this should grow at least an > assertion that the block numbers match, probably even an elog. Yeah, the assumption that P_NEW would automatically match the source block was making me itchy too. An explicit test-and-elog seems worth the cycles. regards, tom lane
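Putting these points together — never skipping source pages, share-locking the source buffer, and cross-checking the destination block number — the loop body would look roughly like this (a sketch along the lines discussed, not necessarily the committed text):

    /* Read block from source relation; copy it even if it looks empty. */
    srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno,
                                       RBM_NORMAL, bstrategy_src,
                                       permanent);
    LockBuffer(srcBuf, BUFFER_LOCK_SHARE);
    srcPage = BufferGetPage(srcBuf);

    /* Use P_NEW to extend the destination relation, but verify it. */
    dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW,
                                       RBM_NORMAL, bstrategy_dst,
                                       permanent);
    if (BufferGetBlockNumber(dstBuf) != blkno)
        elog(ERROR, "unexpected block number %u in destination (expected %u)",
             BufferGetBlockNumber(dstBuf), blkno);
    LockBuffer(dstBuf, BUFFER_LOCK_EXCLUSIVE);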
Hi,

On August 4, 2022 4:11:13 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>I wrote:
>> And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer
>> not completely broken?
>
>[pile^2] Also, what is the rationale for locking the target buffer
>but not the source buffer? That seems pretty hard to justify from
>here, even granting the assumption that we don't expect any other
>processes to be interested in these buffers (which I don't grant,
>because checkpointer).

I'm not arguing it's good or should stay that way, but it's probably okayish that checkpointer / bgwriter have access, given that they will never modify buffers. They just take a lock to prevent concurrent modifications, which RelationCopyStorageUsingBuffer hopefully doesn't do.

Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Hi,

On August 4, 2022 4:20:16 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Yeah, the assumption that P_NEW would automatically match the source block
>was making me itchy too. An explicit test-and-elog seems worth the
>cycles.

Is there a good reason to rely on P_NEW at all? Both from an efficiency and robustness POV it seems like it'd be better to use smgrextend to bulk extend just after smgrcreate() and then fill s_b using RBM_ZERO (or whatever it is called). That bulk smgrextend would later be a good point to use fallocate so the FS can immediately size the file correctly, without a lot of pointless writes of zeroes.

Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes:
> On August 4, 2022 4:11:13 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> [pile^2] Also, what is the rationale for locking the target buffer
>> but not the source buffer? That seems pretty hard to justify from
>> here, even granting the assumption that we don't expect any other
>> processes to be interested in these buffers (which I don't grant,
>> because checkpointer).

> I'm not arguing it's good or should stay that way, but it's probably okayish that checkpointer / bgwriter have access, given that they will never modify buffers. They just take a lock to prevent concurrent modifications, which RelationCopyStorageUsingBuffer hopefully doesn't do.

I'm not arguing that it's actively broken today --- but AFAIR, every other access to a shared buffer takes a buffer lock. It does not seem to me to be very future-proof for this code to decide it's exempt from that rule, without so much as a comment justifying it. Furthermore, what's the gain? We aren't expecting contention here, I think. If we were, then it probably *would* be actively broken.

regards, tom lane
Andres Freund <andres@anarazel.de> writes:
> Is there a good reason to rely on P_NEW at all? Both from an efficiency
> and robustness POV it seems like it'd be better to use smgrextend to
> bulk extend just after smgrcreate() and then fill s_b using RBM_ZERO
> (or whatever it is called). That bulk smgrextend would later be a good
> point to use fallocate so the FS can immediately size the file
> correctly, without a lot of pointless writes of zeroes.

Hmm ... makes sense. We'd be using smgrextend to write just the last page of the file, relying on its API spec "Note that we assume writing a block beyond current EOF causes intervening file space to become filled with zeroes". I don't know that we're using that assumption aggressively today, but as long as it doesn't confuse the kernel it'd probably be fine.

regards, tom lane
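In code, that reading of the API spec amounts to something like this (a sketch; dst_smgr and nblocks stand for the destination smgr and the source relation's length, and are names assumed here for illustration):

    PGAlignedBlock zerobuf;

    /* Writing only the last block sizes the whole fork at once; the
     * intervening space reads back as zeroes per the smgrextend() spec. */
    memset(zerobuf.data, 0, BLCKSZ);
    smgrextend(dst_smgr, forkNum, nblocks - 1, zerobuf.data, false);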
On Thu, Aug 4, 2022 at 6:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > I have reviewed this patch and I don't see a problem with it. However, > > it would be nice if Andres or someone else who understands this area > > well (Tom? Thomas?) would also review it, because I also reviewed > > what's in the tree now and that turns out to be buggy, which leads me > > to conclude that I don't understand this area as well as would be > > desirable. > > FWIW, I approve of getting rid of the use of CreateFakeRelcacheEntry > here, because I do not think that mechanism is meant to be used > outside of WAL replay. However, this patch fails to remove it from > CreateAndCopyRelationData, which seems likely to be just as much > at risk. It looks to me like it does? > The "invalidation" comment bothered me for awhile, but I think it's > fine: we know that no other backend can connect to the source DB > because we have it locked, and we know that no other backend can > connect to the destination DB because it doesn't exist yet according > to the catalogs, so nothing could possibly occur to invalidate our > idea of where the physical files are. It would be nice to document > these assumptions, though, rather than merely remove all the relevant > commentary. I don't think that's the point. We could always suffer a sinval reset or a PROCSIGNAL_BARRIER_SMGRRELEASE. But since the code avoids ever reusing the smgr, it should be OK. I think. > While I'm at it, I would like to strenuously object to the current > framing of CreateAndCopyRelationData as a general-purpose copying > mechanism. Because of the above assumptions, I think it's utterly > unsafe to use anywhere except in CREATE DATABASE. The header comment > fails to warn about that at all, and placing it in bufmgr.c rather > than static in dbcommands.c is just an invitation to future misuse. > Perhaps I'm overly sensitive to that because I just finished cleaning > up somebody's misuse of non-general-purpose code (1aa8dad41), but > as this stands I think it's positively dangerous. OK. No objection to you revising the comments however you feel is appropriate. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Aug 4, 2022 at 7:01 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer > not completely broken? Ouch. That's pretty bad. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Aug 4, 2022 at 7:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > [pile^2] Also, what is the rationale for locking the target buffer > but not the source buffer? That seems pretty hard to justify from > here, even granting the assumption that we don't expect any other > processes to be interested in these buffers (which I don't grant, > because checkpointer). Ooph. I agree. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Aug 5, 2022 at 4:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wrote: > > While I'm at it, I would like to strenuously object to the current > > framing of CreateAndCopyRelationData as a general-purpose copying > > mechanism. > > And while I'm piling on, how is this bit in RelationCopyStorageUsingBuffer > not completely broken? > > /* Read block from source relation. */ > srcBuf = ReadBufferWithoutRelcache(src->rd_locator, forkNum, blkno, > RBM_NORMAL, bstrategy_src, > permanent); > srcPage = BufferGetPage(srcBuf); > if (PageIsNew(srcPage) || PageIsEmpty(srcPage)) > { > ReleaseBuffer(srcBuf); > continue; > } > > /* Use P_NEW to extend the destination relation. */ > dstBuf = ReadBufferWithoutRelcache(dst->rd_locator, forkNum, P_NEW, > RBM_NORMAL, bstrategy_dst, > permanent); > > You can't skip pages just because they are empty. Well, maybe you could > if you were doing something to ensure that you zero-fill the corresponding > blocks on the destination side. But this isn't doing that. It's using > P_NEW for dstBuf, which will have the effect of silently collapsing out > such pages. Maybe in isolation a heap could withstand that, but its > indexes won't be happy (and I guess t_ctid chain links won't either). > > I think you should just lose the if() stanza. There's no optimization to > be had here that's worth any extra complication. > > (This seems worth fixing before beta3, as it looks like a rather > nasty data corruption hazard.) Yeah this is broken. -- Dilip -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 5, 2022 at 5:36 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On August 4, 2022 4:20:16 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >Yeah, the assumption that P_NEW would automatically match the source block
> >was making me itchy too. An explicit test-and-elog seems worth the
> >cycles.
>
> Is there a good reason to rely on P_NEW at all?

I think there were two arguments for using the bufmgr instead of smgrextend for the destination database:
1) (Comment from Andres) The big benefit would be that the *target* database does not have to be written out / shared buffer is immediately populated. [1]
2) (Comment from Robert) We wanted to avoid writing new code which bypasses the shared buffers.

[1] https://www.postgresql.org/message-id/20210905202800.ji4fnfs3xzhvo7l6%40alap3.anarazel.de

> Both from an efficiency and robustness POV it seems like it'd be better
> to use smgrextend to bulk extend just after smgrcreate() and then fill
> s_b using RBM_ZERO (or whatever it is called). That bulk smgrextend
> would later be a good point to use fallocate so the FS can immediately
> size the file correctly, without a lot of pointless writes of zeroes.

Yeah okay, so you mean that since we already know the nblocks of the source file, we can do the smgrextend in bulk before the copy loop and then copy block by block through the bufmgr with the proper blkno instead of P_NEW. I think this looks like a good optimization, and it takes care of the two points I mentioned above: we still get the target database pages into shared buffers, and we are not bypassing shared buffers either.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 5, 2022 at 2:59 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-08-03 16:45:23 +0530, Dilip Kumar wrote:
> > Another version of the patch which closes the smgr at the end using
> > smgrcloserellocator() and I have also added a commit message.
>
> What's the motivation behind the explicit close?
>
> > @@ -258,8 +258,8 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
> >   Page page;
> >   List *rlocatorlist = NIL;
> >   LockRelId relid;
> > - Relation rel;
> >   Snapshot snapshot;
> > + SMgrRelation smgr;
> >   BufferAccessStrategy bstrategy;
> >
> >   /* Get pg_class relfilenumber. */
> > @@ -276,16 +276,9 @@ ScanSourceDatabasePgClass(Oid tbid, Oid dbid, char *srcpath)
> >   rlocator.dbOid = dbid;
> >   rlocator.relNumber = relfilenumber;
> >
> > - /*
> > -  * We can't use a real relcache entry for a relation in some other
> > -  * database, but since we're only going to access the fields related to
> > -  * physical storage, a fake one is good enough. If we didn't do this and
> > -  * used the smgr layer directly, we would have to worry about
> > -  * invalidations.
> > -  */
> > - rel = CreateFakeRelcacheEntry(rlocator);
> > - nblocks = smgrnblocks(RelationGetSmgr(rel), MAIN_FORKNUM);
> > - FreeFakeRelcacheEntry(rel);
> > + smgr = smgropen(rlocator, InvalidBackendId);
> > + nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
> > + smgrclose(smgr);
>
> Why are you opening and then closing again? Part of the motivation for the
> question is that a local SMgrRelation variable may lead to it being used
> further, opening up interrupt processing issues.

Yeah okay, I think there is no reason to close; in the previous version I had it like below, and I think that's a better idea.

nblocks = smgrnblocks(smgropen(rlocator, InvalidBackendId), MAIN_FORKNUM)

> > + rlocator.locator = src_rlocator;
> > + smgrcloserellocator(rlocator);
> > +
> > + rlocator.locator = dst_rlocator;
> > + smgrcloserellocator(rlocator);
>
> As mentioned above, it's not clear to me why this is now done...
>
> Otherwise looks good to me.

Yeah maybe it is not necessary to close as these unowned smgr will automatically get closed on the transaction end. Actually the previous version of the patch had both of these comments addressed. The reason for explicitly closing it is that I have noticed that in most of the places where we do smgropen we explicitly close the smgr, e.g. index_copy_data(), heapam_relation_copy_data(); OTOH in some places we don't close it, e.g. IssuePendingWritebacks(). So now I think that in our case it is better we do not close it, because I do not like this specific code at the end just to close the smgr.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Yeah maybe it is not necessary to close as these unowned smgr will
> automatically get closed on the transaction end. Actually the previous
> version of the patch had both of these comments addressed. The reason
> for explicitly closing it is that I have noticed that in most of the
> places where we do smgropen we explicitly close the smgr, e.g.
> index_copy_data(), heapam_relation_copy_data(); OTOH in some places we
> don't close it, e.g. IssuePendingWritebacks(). So now I think that in
> our case it is better we do not close it, because I do not like this
> specific code at the end just to close the smgr.

PFA patches for the different problems discussed in the thread:

0001 - Fix the problem of skipping the empty block and buffer lock on source buffer
0002 - Remove fake relcache entry (same as 0001-BugfixInWalLogCreateDB.patch)
0003 - Optimization to avoid extending block by block

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
Hi, On 2022-08-04 19:14:08 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2022-08-04 18:05:25 -0400, Tom Lane wrote: > >> In any case, DROP DATABASE is far from the only place with a problem. > > > What other place has a database corrupting potential of this magnitude just > > because interrupts are accepted? We throw valid s_b contents away and then > > accept interrupts before committing - with predictable results. We also accept > > interrupts as part of deleting the db data dir (due to catalog access). > > Those things would be better handled by moving the data-discarding > steps to post-commit. Maybe that argues for having an internal > commit halfway through DROP DATABASE: remove pg_database row, > commit, start new transaction, clean up. That'd still require holding interrupts, I think. We shouldn't accept interrupts until the on-disk data is actually deleted. In theory I think we should have a pg_database column indicating whether the database is valid or not. For database creation, insert the pg_database row with valid=false, commit, then do the filesystem operation, then mark as valid, commit. For database drop, mark as invalid, commit, remove filesystem stuff, delete row, commit. With dropdb allowed against an invalid database, but obviously nothing else. But clearly this isn't a short term / backpatchable thing. Greetings, Andres Freund
Dilip Kumar <dilipbalaut@gmail.com> writes:
> PFA patches for the different problems discussed in the thread:
> 0001 - Fix the problem of skipping the empty block and buffer lock on
> source buffer
> 0002 - Remove fake relcache entry (same as 0001-BugfixInWalLogCreateDB.patch)
> 0003 - Optimization to avoid extending block by block

I pushed 0001, because it seems fairly critical to get that in before beta3. The others can stand more leisurely discussion.

I note from https://coverage.postgresql.org/src/backend/storage/buffer/bufmgr.c.gcov.html that the block-skipping path is actually taken in our tests (this won't be visible there for very much longer of course). So we actually *are* making a corrupt copy, and we haven't noticed. This is perhaps not too surprising, because the only test case that I can find is in 020_createdb.pl:

    $node->issues_sql_like(
        [ 'createdb', '-T', 'foobar2', '-S', 'wal_log', 'foobar6' ],
        qr/statement: CREATE DATABASE foobar6 STRATEGY wal_log TEMPLATE foobar2/,
        'create database with WAL_LOG strategy');

which is, um, not exactly a robust test of whether anything happened at all, let alone whether it was correct. I'm not real sure that this test would even notice if the CREATE reported failure.

regards, tom lane
Dilip Kumar <dilipbalaut@gmail.com> writes: > On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: >> Yeah maybe it is not necessary to close as these unowned smgr will >> automatically get closed on the transaction end. I do not think this is a great idea for the per-relation smgrs created during RelationCopyStorageUsingBuffer. Yeah, they'll be mopped up at transaction end, but that doesn't mean that creating possibly tens of thousands of transient smgrs isn't going to cause performance issues. I think RelationCopyStorageUsingBuffer needs to open and then close the smgrs it uses, which means that ReadBufferWithoutRelcache is not the appropriate API for it to use, either; need to go down another level. regards, tom lane
On Sat, Aug 6, 2022 at 9:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Dilip Kumar <dilipbalaut@gmail.com> writes:
> > On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >> Yeah maybe it is not necessary to close as these unowned smgr will
> >> automatically get closed on the transaction end.
>
> I do not think this is a great idea for the per-relation smgrs created
> during RelationCopyStorageUsingBuffer. Yeah, they'll be mopped up at
> transaction end, but that doesn't mean that creating possibly tens of
> thousands of transient smgrs isn't going to cause performance issues.

Okay, so for that we can simply call smgrcloserellocator(rlocator); before exiting the RelationCopyStorageUsingBuffer() right?

> I think RelationCopyStorageUsingBuffer needs to open and then close
> the smgrs it uses, which means that ReadBufferWithoutRelcache is not the
> appropriate API for it to use, either; need to go down another level.

Not sure how going down another level would help, the whole point is that we don't want to keep the reference of the smgr for a long time especially in the loop which is interruptible. So every time we need the smgr we can call smgropen, and if it is already in the smgr cache then we will get it from there. So I think it makes sense that when we are exiting the function we just call smgrcloserellocator(), so that if it is opened it will be closed, and otherwise it will do nothing.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hi, On 2022-08-07 09:24:40 +0530, Dilip Kumar wrote: > On Sat, Aug 6, 2022 at 9:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > Dilip Kumar <dilipbalaut@gmail.com> writes: > > > On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > >> Yeah maybe it is not necessary to close as these unowned smgr will > > >> automatically get closed on the transaction end. > > > > I do not think this is a great idea for the per-relation smgrs created > > during RelationCopyStorageUsingBuffer. Yeah, they'll be mopped up at > > transaction end, but that doesn't mean that creating possibly tens of > > thousands of transient smgrs isn't going to cause performance issues. I was assuming that the files would get reopened at the end of the transaction anyway, but it looks like that's not the case, unless wal_level=minimal. Hm. CreateAndCopyRelationData() calls RelationCreateStorage() with register_delete = false, which is ok because createdb_failure_callback will clean things up. But that's another thing that's not great for a routine with a general name... > Okay, so for that we can simply call smgrcloserellocator(rlocator); > before exiting the RelationCopyStorageUsingBuffer() right? Yea, I think so. > > I think RelationCopyStorageUsingBuffer needs to open and then close > > the smgrs it uses, which means that ReadBufferWithoutRelcache is not the > > appropriate API for it to use, either; need to go down another level. > > Not sure how going down another level would help, the whole point is > that we don't want to keep the reference of the smgr for a long time > especially in the loop which is interruptible. Yea, I'm not following either. Greetings, Andres Freund
On Sun, Aug 7, 2022 at 9:47 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-08-07 09:24:40 +0530, Dilip Kumar wrote:
> > On Sat, Aug 6, 2022 at 9:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >
> > > Dilip Kumar <dilipbalaut@gmail.com> writes:
> > > > On Fri, Aug 5, 2022 at 10:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >> Yeah maybe it is not necessary to close as these unowned smgr will
> > > >> automatically get closed on the transaction end.
> > >
> > > I do not think this is a great idea for the per-relation smgrs created
> > > during RelationCopyStorageUsingBuffer. Yeah, they'll be mopped up at
> > > transaction end, but that doesn't mean that creating possibly tens of
> > > thousands of transient smgrs isn't going to cause performance issues.
>
> I was assuming that the files would get reopened at the end of the transaction
> anyway, but it looks like that's not the case, unless wal_level=minimal.
>
> Hm. CreateAndCopyRelationData() calls RelationCreateStorage() with
> register_delete = false, which is ok because createdb_failure_callback will
> clean things up. But that's another thing that's not great for a routine with
> a general name...
>
> > Okay, so for that we can simply call smgrcloserellocator(rlocator);
> > before exiting the RelationCopyStorageUsingBuffer() right?
>
> Yea, I think so.

Done. Along with that, I have also restored the smgropen and smgrclose hunk in ScanSourceDatabasePgClass() which I had in the v1 patch[1]. Because here we do not want to reuse the smgr of pg_class again, instead of closing at the end with smgrcloserellocator() we can just keep the smgr reference and close it immediately after getting the number of blocks. Whereas in CreateAndCopyRelationData() and RelationCopyStorageUsingBuffer() we are using the smgr of the source and destination relations multiple times, so it makes sense not to close it immediately; we can close it while exiting the function with smgrcloserellocator().

[1]
+ smgr = smgropen(rlocator, InvalidBackendId);
+ nblocks = smgrnblocks(smgr, MAIN_FORKNUM);
+ smgrclose(smgr);

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Aug 10, 2022 at 1:01 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Done. Along with that, I have also restored the smgropen and smgrclose
> hunk in ScanSourceDatabasePgClass() which I had in the v1 patch[1].
> Because here we do not want to reuse the smgr of pg_class again,
> instead of closing at the end with smgrcloserellocator() we can just
> keep the smgr reference and close it immediately after getting the
> number of blocks. Whereas in CreateAndCopyRelationData() and
> RelationCopyStorageUsingBuffer() we are using the smgr of the source
> and destination relations multiple times, so it makes sense not to
> close it immediately; we can close it while exiting the function with
> smgrcloserellocator().

As far as I know, this 0001 addresses all outstanding comments and fixes the reported bug.

Does anyone think otherwise?

-- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Aug 11, 2022 at 2:15 PM Robert Haas <robertmhaas@gmail.com> wrote: > As far as I know, this 0001 addresses all outstanding comments and > fixes the reported bug. > > Does anyone think otherwise? If they do, they're keeping quiet, so I committed this and back-patched it to v15. Regarding 0002 -- should it, perhaps, use PGAlignedBlock? Although 0002 is formally a performance optimization, I'm inclined to think that if we're going to commit it, it should also be back-patched into v15, because letting the code diverge when we're not even out of beta yet seems painful. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Aug 12, 2022 at 6:33 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Aug 11, 2022 at 2:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > As far as I know, this 0001 addresses all outstanding comments and
> > fixes the reported bug.
> >
> > Does anyone think otherwise?
>
> If they do, they're keeping quiet, so I committed this and
> back-patched it to v15.
>
> Regarding 0002 -- should it, perhaps, use PGAlignedBlock?

Yes, we can do that. Although here we are not using this buffer directly as a "Page", so we do not have any real alignment issue, I do not see any problem in using PGAlignedBlock either, so I have changed that.

> Although 0002 is formally a performance optimization, I'm inclined to
> think that if we're going to commit it, it should also be back-patched
> into v15, because letting the code diverge when we're not even out of
> beta yet seems painful.

Yeah that makes sense to me.

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Wed, Aug 17, 2022 at 12:02 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Although 0002 is formally a performance optimization, I'm inclined to
> > think that if we're going to commit it, it should also be back-patched
> > into v15, because letting the code diverge when we're not even out of
> > beta yet seems painful.
>
> Yeah, that makes sense to me.

OK, done.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Aug 02, 2022 at 12:50:43PM -0500, Justin Pryzby wrote:
> Also, if I understand correctly, this patch seems to assume that nobody is
> connected to the source database. But what's actually enforced is just
> that nobody *else* is connected. Is it any issue that the current DB can
> be used as a source? Anyway, both of the above problems are reproducible
> using a different database.
>
> |postgres=# CREATE DATABASE new TEMPLATE postgres STRATEGY wal_log;
> |CREATE DATABASE

On Thu, Aug 04, 2022 at 05:16:04PM -0500, Justin Pryzby wrote:
> On Thu, Aug 04, 2022 at 06:02:50PM -0400, Tom Lane wrote:
> > The "invalidation" comment bothered me for awhile, but I think it's
> > fine: we know that no other backend can connect to the source DB
> > because we have it locked,
>
> About that - is it any problem that the currently-connected db can be
> used as a template? It's no issue for 2-phase commit, because "create
> database" cannot run in a txn.

Would anybody want to comment on this?
Is it okay that the *current* DB can be used as a template?
On Fri, Sep 2, 2022 at 5:25 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> On Tue, Aug 02, 2022 at 12:50:43PM -0500, Justin Pryzby wrote:
> > Also, if I understand correctly, this patch seems to assume that nobody
> > is connected to the source database. But what's actually enforced is
> > just that nobody *else* is connected. Is it any issue that the current
> > DB can be used as a source? Anyway, both of the above problems are
> > reproducible using a different database.
> >
> > |postgres=# CREATE DATABASE new TEMPLATE postgres STRATEGY wal_log;
> > |CREATE DATABASE
>
> On Thu, Aug 04, 2022 at 05:16:04PM -0500, Justin Pryzby wrote:
> > On Thu, Aug 04, 2022 at 06:02:50PM -0400, Tom Lane wrote:
> > > The "invalidation" comment bothered me for awhile, but I think it's
> > > fine: we know that no other backend can connect to the source DB
> > > because we have it locked,
> >
> > About that - is it any problem that the currently-connected db can be
> > used as a template? It's no issue for 2-phase commit, because "create
> > database" cannot run in a txn.
>
> Would anybody want to comment on this?
> Is it okay that the *current* DB can be used as a template?

I don't think there should be any problem with that. The main concern
would be that, since we are reading the pg_class tuples block by block,
there could be an issue if someone concurrently modified pg_class, or if
there were tuples inserted by a prepared transaction. But the same backend
cannot have an open prepared transaction while creating a database, and
that backend of course cannot perform any parallel operation either.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
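For context, the reason concurrent changes are not a concern here is the
snapshot-based visibility test applied to every pg_class tuple the scan
decodes. A simplified sketch of that check's shape (not the verbatim
committed code; page, blkno, buffer, and snapshot are illustrative names):

/*
 * Simplified sketch of the per-tuple visibility test used while
 * scanning one page of the source database's pg_class.
 */
OffsetNumber offnum;
OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

for (offnum = FirstOffsetNumber; offnum <= maxoff;
     offnum = OffsetNumberNext(offnum))
{
    ItemId          itemid = PageGetItemId(page, offnum);
    HeapTupleData   tuple;

    if (!ItemIdIsNormal(itemid))
        continue;               /* unused, dead, or redirect pointer */

    tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
    tuple.t_len = ItemIdGetLength(itemid);
    tuple.t_tableOid = RelationRelationId;
    ItemPointerSet(&tuple.t_self, blkno, offnum);

    /*
     * CLOG is cluster-wide, so an MVCC snapshot can judge tuples in
     * another database's pg_class; tuples from uncommitted or prepared
     * transactions are simply not visible and get skipped.
     */
    if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buffer))
    {
        /* deform the tuple and remember its relfilenode for copying */
    }
}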
From: "Yoshikazu Imai (Fujitsu)"
Hi. While investigating the code in RelationCopyStorageUsingBuffer(), I
wondered whether there is any reason to use RBM_NORMAL when acquiring the
destination buffer. I think we can use RBM_ZERO_AND_LOCK here instead of
RBM_NORMAL.

With RBM_NORMAL, the destination block is read by smgrread() and its
content is verified with PageIsVerifiedExtended() in ReadBuffer_common().
Page verification normally succeeds because the destination blocks are
zero-filled by smgrextend(), but can we trust that they are always
zero-filled?

On Fri, Aug 5, 2022 at 00:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hmm ... makes sense. We'd be using smgrextend to write just the last page
> of the file, relying on its API spec "Note that we assume writing a block
> beyond current EOF causes intervening file space to become filled with
> zeroes". I don't know that we're using that assumption aggressively
> today, but as long as it doesn't confuse the kernel it'd probably be fine.

If there is a block which is not zero-filled, page verification would fail
and eventually CREATE DATABASE would fail. (I also think that we don't
really need page verification for the destination blocks here, because
they are simply overwritten with the contents of the source buffer.)

Also, from a performance POV, I think it is better to use
RBM_ZERO_AND_LOCK. With RBM_NORMAL, each destination block is read from
disk by smgrread(), but with RBM_ZERO_AND_LOCK we only zero-fill the
buffer with MemSet and never touch the disk, which is faster.

If we can expect the destination blocks to always be zero-filled, we can
use RBM_ZERO_AND_LOCK.

Apart from the above, there seems to be a stale comment left over from the
old code that acquired the destination buffer using P_NEW:

"/* Use P_NEW to extend the destination relation. */"

--
Yoshikazu Imai
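A minimal sketch of the suggested change, assuming the destination read in
RelationCopyStorageUsingBuffer() looks roughly like the current coding
(dstlocator, bstrategy_dst, and permanent are illustrative names):

/*
 * Hedged sketch, not a tested patch: acquire the destination block
 * with RBM_ZERO_AND_LOCK instead of RBM_NORMAL.  This skips the
 * smgrread() and page verification entirely and returns the buffer
 * already exclusive-locked, so the explicit LockBuffer() call that the
 * RBM_NORMAL coding needs must be dropped.
 */
dstBuf = ReadBufferWithoutRelcache(dstlocator, forkNum, blkno,
                                   RBM_ZERO_AND_LOCK, bstrategy_dst,
                                   permanent);

/* the buffer comes back exclusive-locked; copy the source page in,
 * WAL-log it, mark it dirty, then unlock and release as before */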