Thread: Speedup of relation deletes during recovery
Hi, When multiple relations are deleted at the same transaction, the files of those relations are deleted by one call to smgrdounlinkall(), which leads to scan whole shared_buffers only one time. OTOH, during recovery, smgrdounlink() (not smgrdounlinkall()) is called for each file to delete, which leads to scan shared_buffers multiple times. Obviously this can cause to increase the WAL replay time very much especially when shared_buffers is huge. To alleviate this situation, I'd like to propose to change the recovery so that it also calls smgrdounlinkall() only one time to delete multiple relation files. Patch attached. Thought? Regards, -- Fujii Masao
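(For readers skimming the archive: the change being proposed boils down to the pattern sketched below, built from the existing smgr API. The loop structure and variable names are simplified for illustration; the real change is in the attached patch.)

    /* Before: during redo, each relation gets its own scan of shared_buffers. */
    for (i = 0; i < nrels; i++)
    {
        SMgrRelation srel = smgropen(rnodes[i], InvalidBackendId);

        smgrdounlink(srel, true);       /* one full shared_buffers scan each */
        smgrclose(srel);
    }

    /* After: open all relations first, unlink them with a single call, and
     * close them afterwards, so shared_buffers is scanned only once. */
    SMgrRelation *srels = palloc(sizeof(SMgrRelation) * nrels);

    for (i = 0; i < nrels; i++)
        srels[i] = smgropen(rnodes[i], InvalidBackendId);

    smgrdounlinkall(srels, nrels, true);

    for (i = 0; i < nrels; i++)
        smgrclose(srels[i]);
    pfree(srels);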
Hello. At Fri, 30 Mar 2018 08:31:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com> > Hi, > > When multiple relations are deleted at the same transaction, > the files of those relations are deleted by one call to smgrdounlinkall(), > which leads to scan whole shared_buffers only one time. OTOH, > during recovery, smgrdounlink() (not smgrdounlinkall()) is called > for each file to delete, which leads to scan shared_buffers multiple times. > Obviously this can cause to increase the WAL replay time very much > especially when shared_buffers is huge. > > To alleviate this situation, I'd like to propose to change the recovery > so that it also calls smgrdounlinkall() only one time to delete multiple > relation files. Patch attached. Thought? It is obviously a left-over of 279628a0a7. This patch applies the same change as that commit and looks fine to me. Note that FinishPreparedTransaction has the same loop over smgrdounlink. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Mar 30, 2018 at 11:19:58AM +0900, Kyotaro HORIGUCHI wrote: > At Fri, 30 Mar 2018 08:31:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com> >> When multiple relations are deleted at the same transaction, >> the files of those relations are deleted by one call to smgrdounlinkall(), >> which leads to scan whole shared_buffers only one time. OTOH, >> during recovery, smgrdounlink() (not smgrdounlinkall()) is called >> for each file to delete, which leads to scan shared_buffers multiple times. >> Obviously this can cause to increase the WAL replay time very much >> especially when shared_buffers is huge. >> >> To alleviate this situation, I'd like to propose to change the recovery >> so that it also calls smgrdounlinkall() only one time to delete multiple >> relation files. Patch attached. Thought? > > It is obviously a left-over of 279628a0a7. This patch applies the > same change with the patch and looks fine for me. Note that > FinishPreparedTransaction has the same loop over smgrdounlink. Yeah, I was just going to say the same after looking at Fujii-san's patch. This would also cause smgrdounlink() to become unused in the core code. So this could just be... Removed? /me Preparing shelter against incoming wraith. That's not v11 material at this point in any case. -- Michael
From: Fujii Masao [mailto:masao.fujii@gmail.com] > When multiple relations are deleted at the same transaction, the files of > those relations are deleted by one call to smgrdounlinkall(), which leads > to scan whole shared_buffers only one time. OTOH, during recovery, > smgrdounlink() (not smgrdounlinkall()) is called for each file to delete, > which leads to scan shared_buffers multiple times. > Obviously this can cause to increase the WAL replay time very much especially > when shared_buffers is huge. > > To alleviate this situation, I'd like to propose to change the recovery > so that it also calls smgrdounlinkall() only one time to delete multiple > relation files. Patch attached. Thought? Nice catch. As Horiguchi-san and Michael already commented, the patch looks good. As a related improvement, the following proposal is effective for shortening WAL replay time of DROP TABLE (and possibly TRUNCATE as well). How should we proceed with this? https://www.postgresql.org/message-id/A1CF58A8CBA14341B3F3AC6A468F18454545E4F3@g01jpexmbyt23 Furthermore, TRUNCATE has a similar and worse issue. While DROP TABLE scans the shared buffers once for each table, TRUNCATE does that for each fork, resulting in three scans per table. How about changing the following functions

  smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
  DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
                         BlockNumber firstDelBlock)

to

  smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber *nblocks, int nforks)
  DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                         BlockNumber *firstDelBlock, int nforks)

to perform the scan only once per table? If there doesn't seem to be a problem, I think I'll submit the patch next month. I'd welcome it if anyone could do that. Regards Takayuki Tsunakawa
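As an illustration of the array-based interface proposed above, here is a rough sketch of how a caller such as RelationTruncate() might collect the forks and then drop their buffers in one pass. The smgrexists() checks and the new_*_nblocks variables are assumptions made for the example; the in-tree functions still take a single fork at this point.

    /* Collect every fork to be truncated, then scan shared_buffers once. */
    ForkNumber  forks[MAX_FORKNUM + 1];
    BlockNumber blocks[MAX_FORKNUM + 1];
    int         nforks = 0;

    forks[nforks] = MAIN_FORKNUM;
    blocks[nforks++] = nblocks;

    if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
    {
        forks[nforks] = FSM_FORKNUM;
        blocks[nforks++] = new_fsm_nblocks;    /* assumed precomputed */
    }
    if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM))
    {
        forks[nforks] = VISIBILITYMAP_FORKNUM;
        blocks[nforks++] = new_vm_nblocks;     /* assumed precomputed */
    }

    /* Proposed interface: one call, one pass over shared_buffers. */
    smgrtruncate(rel->rd_smgr, forks, blocks, nforks);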
Fujii Masao <masao.fujii@gmail.com> writes: > Hi, > > When multiple relations are deleted at the same transaction, > the files of those relations are deleted by one call to smgrdounlinkall(), > which leads to scan whole shared_buffers only one time. OTOH, > during recovery, smgrdounlink() (not smgrdounlinkall()) is called > for each file to delete, which leads to scan shared_buffers multiple times. > Obviously this can cause to increase the WAL replay time very much > especially when shared_buffers is huge. Wonder if this is the case for streaming standbys replaying truncates also? Just a couple of days ago I was running a pg_upgrade test scenario but did not reach the point of upgrade yet. We made snapshots of our large reporting system, putting them into a master and 1-slave streaming replication configuration. There are hundreds of unlogged tables that we wish to trunc before the upgrade to save time during the rsyncing of standbys, since a standard rsync replicates all data, which is time-consuming and useless to send to the standbys. Indeed the tables are already trunc'd on the test master since it too was a recovered system upon initial start, but the truncates took place anyway since it's part of our framework. It ran in just a few seconds on the master. The standby however was $slow replaying these truncates, which we noticed because the upgrade will not proceed until the master and 1 or more standbys are confirmed to all be at the same checkpoint. I straced the standby's startup process to find it unlinking lots of tables, getting -1 on the unlink syscall since the non-init fork files were already missing. I can't describe just how slow it was, but it took minutes, and the lines being output by strace were *not* blowing up my display as happens generally when any busy process is straced. And, the test systems were config'd with $large shared buffers of 64G. Please advise > > To alleviate this situation, I'd like to propose to change the recovery > so that it also calls smgrdounlinkall() only one time to delete multiple > relation files. Patch attached. Thought? > > Regards, -- Jerry Sievers Postgres DBA/Development Consulting e: postgres.consulting@comcast.net p: 312.241.7800
Fujii Masao <masao.fujii@gmail.com> writes: > Hi, > > When multiple relations are deleted at the same transaction, > the files of those relations are deleted by one call to smgrdounlinkall(), > which leads to scan whole shared_buffers only one time. OTOH, > during recovery, smgrdounlink() (not smgrdounlinkall()) is called > for each file to delete, which leads to scan shared_buffers multiple times. > Obviously this can cause to increase the WAL replay time very much > especially when shared_buffers is huge. Forgot to mention version in my TLDR prev reply :-)

                                                                version
------------------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.5.12 on x86_64-pc-linux-gnu (Ubuntu 9.5.12-1.pgdg16.04+1), compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, 64-bit
(1 row)

> > To alleviate this situation, I'd like to propose to change the recovery > so that it also calls smgrdounlinkall() only one time to delete multiple > relation files. Patch attached. Thought? > > Regards, -- Jerry Sievers Postgres DBA/Development Consulting e: postgres.consulting@comcast.net p: 312.241.7800
From: Jerry Sievers [mailto:gsievers19@comcast.net] > Wonder if this is the case for streaming standbys replaying truncates > also? Yes. As I wrote in my previous mail, TRUNCATE is worse than DROP TABLE. Regards Takayuki Tsunakawa
On Fri, Mar 30, 2018 at 11:46 AM, Michael Paquier <michael@paquier.xyz> wrote: > On Fri, Mar 30, 2018 at 11:19:58AM +0900, Kyotaro HORIGUCHI wrote: >> At Fri, 30 Mar 2018 08:31:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in <CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com> >>> When multiple relations are deleted at the same transaction, >>> the files of those relations are deleted by one call to smgrdounlinkall(), >>> which leads to scan whole shared_buffers only one time. OTOH, >>> during recovery, smgrdounlink() (not smgrdounlinkall()) is called >>> for each file to delete, which leads to scan shared_buffers multiple times. >>> Obviously this can cause to increase the WAL replay time very much >>> especially when shared_buffers is huge. >>> >>> To alleviate this situation, I'd like to propose to change the recovery >>> so that it also calls smgrdounlinkall() only one time to delete multiple >>> relation files. Patch attached. Thought? >> >> It is obviously a left-over of 279628a0a7. This patch applies the >> same change with the patch and looks fine for me. Note that >> FinishPreparedTransaction has the same loop over smgrdounlink. Thanks for the review! I also changed FinishPreparedTransaction() so that it uses smgrdounlinkall(). Patch attached. > Yeah, I was just going to say the same after looking at Fujii-san's > patch. This would also cause smgrdounlink() to become unused in the > core code. So this could just be... Removed? Yes, I think. And, I found that smgrdounlinkfork() is also dead code. Per the discussion [1], this unused function was left intentionally. But it's still dead code since 2012, so I'd like to remove it. Patch attached. [1] https://www.postgresql.org/message-id/1471.1339106082@sss.pgh.pa.us Regards, -- Fujii Masao
On Fri, Mar 30, 2018 at 12:18 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > From: Fujii Masao [mailto:masao.fujii@gmail.com] >> When multiple relations are deleted at the same transaction, the files of >> those relations are deleted by one call to smgrdounlinkall(), which leads >> to scan whole shared_buffers only one time. OTOH, during recovery, >> smgrdounlink() (not smgrdounlinkall()) is called for each file to delete, >> which leads to scan shared_buffers multiple times. >> Obviously this can cause to increase the WAL replay time very much especially >> when shared_buffers is huge. >> >> To alleviate this situation, I'd like to propose to change the recovery >> so that it also calls smgrdounlinkall() only one time to delete multiple >> relation files. Patch attached. Thought? > > Nice catch. As Horiguchi-san and Michael already commented, the patch looks good. > > As a related improvement, the following proposal is effective for shortening WAL replay time of DROP TABLE (and possibly TRUNCATE as well). How should we proceed with this? > > https://www.postgresql.org/message-id/A1CF58A8CBA14341B3F3AC6A468F18454545E4F3@g01jpexmbyt23 > > > Furthermore, TRUNCATE has a similar and worse issue. While DROP TABLE scans the shared buffers once for each table, TRUNCATE does that for each fork, resulting in three scans per table. How about changing the following functions > > smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks) > DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum, > BlockNumber firstDelBlock) > > to > > smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber *nblocks, int nforks) > DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum, > BlockNumber *firstDelBlock, int nforks) > > to perform the scan only once per table? If there doesn't seem to be a problem, I think I'll submit the patch next month. I'd welcome it if anyone could do that. Yeah, it's worth working on this problem. To decrease the number of scans of shared_buffers, you would need to change the order of truncations of files and WAL logging. In RelationTruncate(), currently WAL is logged after FSM and VM are truncated. IOW, with the patch, FSM and VM would need to be truncated after WAL logging. You would need to check whether this reordering is valid. Regards, -- Fujii Masao
On Wed, Apr 18, 2018 at 12:46:58AM +0900, Fujii Masao wrote: > Yes, I think. And, I found that smgrdounlinkfork() is also dead code. > Per the discussion [1], this unused function was left intentionally. > But it's still dead code since 2012, so I'd like to remove it. Patch attached. Indeed, it's close to six years and the status is the same. So let's drop it. I have been looking around the area to see if any modules actually use those, particularly on github, but I could not find callers. The patch looks logically fine to me. In your first message, you mentioned that the replay time increased a lot. Do you have numbers to share with some large settings of shared_buffers? It would be better to wait for the v12 branch to open before pushing anything, as the focus is on stabilizing things on v11. -- Michael
On Wed, Apr 18, 2018 at 10:44 AM, Michael Paquier <michael@paquier.xyz> wrote: > On Wed, Apr 18, 2018 at 12:46:58AM +0900, Fujii Masao wrote: >> Yes, I think. And, I found that smgrdounlinkfork() is also dead code. >> Per the discussion [1], this unused function was left intentionally. >> But it's still dead code since 2012, so I'd like to remove it. Patch attached. > > Indeed, it's close to six years and the status is the same. So let's > drop it. I have been looking around the area to see if any modules > actually use those, particularly on github, but I could not find > callers. > > The patch looks logically fine to me. In your first message, you > mentioned that the replay time increased a lot. Do you have numbers to > share with some large settings of shared_buffers? No. But after my colleague truncated more than one hundred tables on the server with shared_buffers = 300GB, the recovery could not finish even 10 minutes after the start of recovery. So I had to shut down the server immediately, temporarily set shared_buffers to a very small value, and start the server so that the recovery would finish quickly. > It would be better to wait for the v12 branch to open before pushing > anything, as the focus is on stabilizing things on v11. Yes, so I added this to the next CommitFest. Regards, -- Fujii Masao
On Thu, Apr 19, 2018 at 01:52:26AM +0900, Fujii Masao wrote: > No. But after my colleague truncated more than one hundred tables on > the server with shared_buffers = 300GB, the recovery could not finish > even 10 minutes after the start of recovery. So I had to shut down the > server immediately, temporarily set shared_buffers to a very small value, > and start the server so that the recovery would finish quickly. Hm, OK. Here are simple functions to create and drop many relations in a single transaction:

create or replace function create_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
    query_string := 'create table tab_' || i::text || ' (a int);';
    execute query_string;
  end loop;
end;
$$ language plpgsql;

create or replace function drop_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
    query_string := 'drop table tab_' || i::text;
    execute query_string;
  end loop;
end;
$$ language plpgsql;

With a server running 8GB of shared buffers (you likely need to increase max_locks_per_transaction) and 10k relations created and dropped, it took 50 seconds to finish redo of roughly 4 segments:

2018-04-19 11:17:15 JST [7472]: [14-1] db=,user=,app=,client= LOG: redo starts at 0/15DF2E8
2018-04-19 11:18:05 JST [7472]: [15-1] db=,user=,app=,client= LOG: invalid record length at 0/4E46160: wanted 24, got 0
2018-04-19 11:18:05 JST [7472]: [16-1] db=,user=,app=,client= LOG: redo done at 0/4A7C6E

Then with your patch it took... Barely 1 second.

2018-04-19 11:20:33 JST [11690]: [14-1] db=,user=,app=,client= LOG: redo starts at 0/15DF2E8
2018-04-19 11:20:34 JST [11690]: [15-1] db=,user=,app=,client= LOG: invalid record length at 0/4E299B0: wanted 24, got 0
2018-04-19 11:20:34 JST [11690]: [16-1] db=,user=,app=,client= LOG: redo done at 0/4E29978

Looking at profiles with perf I can also see that 95% of the time is spent in DropRelFileNodesAllBuffers, which is where the startup process spends most of its CPU. So HEAD is booh. And your patch is excellent in this context. -- Michael
From: Fujii Masao [mailto:masao.fujii@gmail.com] > Yeah, it's worth working on this problem. To decrease the number of scans > of > shared_buffers, you would need to change the order of truncations of files > and > WAL logging. In RelationTruncate(), currently WAL is logged after FSM and > VM > are truncated. IOW, with the patch, FSM and VM would need to be truncated > after > WAL logging. You would need to check whether this reordering is valid. Sure, thank you for the advice. Takayuki Tsunakawa
Hi, We just had a customer hit this issue. I kind of wonder whether this shouldn't be backpatched: Currently the execution on the primary is O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the standby - with a lot higher constants to boot. That means it's very easy to get into situations where the standby starts to lag behind very significantly.

> --- a/src/backend/access/transam/twophase.c
> +++ b/src/backend/access/transam/twophase.c
> @@ -1445,6 +1445,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
>     int         ndelrels;
>     SharedInvalidationMessage *invalmsgs;
>     int         i;
> +   SMgrRelation *srels = NULL;
>
>     /*
>      * Validate the GID, and lock the GXACT to ensure that two backends do not
> @@ -1534,13 +1535,16 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
>         delrels = abortrels;
>         ndelrels = hdr->nabortrels;
>     }
> +
> +   srels = palloc(sizeof(SMgrRelation) * ndelrels);
>     for (i = 0; i < ndelrels; i++)
> -   {
> -       SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
> +       srels[i] = smgropen(delrels[i], InvalidBackendId);
>
> -       smgrdounlink(srel, false);
> -       smgrclose(srel);
> -   }
> +   smgrdounlinkall(srels, ndelrels, false);
> +
> +   for (i = 0; i < ndelrels; i++)
> +       smgrclose(srels[i]);
> +   pfree(srels);

This code is now duplicated three times - shouldn't we just add a function that encapsulates dropping relations in a commit/abort record? Greetings, Andres Freund
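To put the complexity claim in concrete terms, the primary's batched path works roughly like the sketch below: one pass over shared_buffers, with a binary search per buffer over the sorted list of relations being dropped. This is only an illustrative sketch of what DropRelFileNodesAllBuffers() does - the nodes/nnodes variables are assumed to be prepared by the caller, and locking, pinned-buffer handling and the small-list fast path are omitted.

    /* Sort the relations being dropped, then make a single pass over
     * shared_buffers; each buffer is checked with a binary search, giving
     * O(NBuffers * log(ndrels)) rather than one full scan per relation. */
    qsort(nodes, nnodes, sizeof(RelFileNode), rnode_comparator);

    for (i = 0; i < NBuffers; i++)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(i);

        if (bsearch(&bufHdr->tag.rnode, nodes, nnodes,
                    sizeof(RelFileNode), rnode_comparator) != NULL)
            InvalidateBuffer(bufHdr);   /* header locking elided in this sketch */
    }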
> We just had a customer hit this issue. I kind of wonder whether this > shouldn't be backpatched: Currently the execution on the primary is > O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the > standby - with a lot higher constants to boot. That means it's very > easy to get into situations where the standby starts to lag behind very significantly. +1, we faced that too -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Hi, On 2018-06-15 10:45:04 -0700, Andres Freund wrote: > > + > > + srels = palloc(sizeof(SMgrRelation) * ndelrels); > > for (i = 0; i < ndelrels; i++) > > - { > > - SMgrRelation srel = smgropen(delrels[i], InvalidBackendId); > > + srels[i] = smgropen(delrels[i], InvalidBackendId); > > > > - smgrdounlink(srel, false); > > - smgrclose(srel); > > - } > > + smgrdounlinkall(srels, ndelrels, false); > > + > > + for (i = 0; i < ndelrels; i++) > > + smgrclose(srels[i]); > > + pfree(srels); Hm. This will probably cause another complexity issue. If we just smgropen() the relation will be unowned. Which means smgrclose() will call remove_from_unowned_list(), which is O(open-relations). Which means this change afaict creates a new O(ndrels^2) behaviour? It seems like the easiest fix for that would be to have a local SMgrRelationData in that loop, that we assign the relations to? That's a bit hacky, but afaict should work? Greetings, Andres Freund
On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund <andres@anarazel.de> wrote: > Hi, > > On 2018-06-15 10:45:04 -0700, Andres Freund wrote: >> > + >> > + srels = palloc(sizeof(SMgrRelation) * ndelrels); >> > for (i = 0; i < ndelrels; i++) >> > - { >> > - SMgrRelation srel = smgropen(delrels[i], InvalidBackendId); >> > + srels[i] = smgropen(delrels[i], InvalidBackendId); >> > >> > - smgrdounlink(srel, false); >> > - smgrclose(srel); >> > - } >> > + smgrdounlinkall(srels, ndelrels, false); >> > + >> > + for (i = 0; i < ndelrels; i++) >> > + smgrclose(srels[i]); >> > + pfree(srels); > > Hm. This will probably cause another complexity issue. If we just > smgropen() the relation will be unowned. Which means smgrclose() will > call remove_from_unowned_list(), which is O(open-relations). Which means > this change afaict creates a new O(ndrels^2) behaviour? > > It seems like the easiest fix for that would be to have a local > SMgrRelationData in that loop, that we assign the relations to? That's > a bit hacky, but afaict should work? The easier (but tricky) fix for that is to call smgrclose() for each SMgrRelation in the reverse order. That is, we should do

for (i = ndelrels - 1; i >= 0; i--)
    smgrclose(srels[i]);

instead of

for (i = 0; i < ndelrels; i++)
    smgrclose(srels[i]);

Since add_to_unowned_list() always adds the entry at the head of the list, the SMgrRelation entries are added to the "unowned" list in descending order of their identifiers, i.e., an SMgrRelation entry with a larger identifier is placed ahead of one with a smaller identifier. So if we call remove_from_unowned_list() in descending order of the SMgrRelation entries' identifiers, the performance of remove_from_unowned_list()'s search should be O(1). IOW, we can immediately find the SMgrRelation entry to remove at the head of the list. Regards, -- Fujii Masao
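To spell out why closing in reverse order makes each removal O(1), here is a rough sketch of what a removal from a singly linked unowned list has to do (the variable and field names are illustrative, not copied from smgr.c):

    /* The removal must walk the list to find the predecessor of the entry
     * being removed, so its cost is proportional to the entry's distance
     * from the head. */
    SMgrRelation cur;
    SMgrRelation prev = NULL;

    for (cur = unowned_head; cur != NULL; cur = cur->next_unowned)
    {
        if (cur == reln)
        {
            if (prev == NULL)
                unowned_head = cur->next_unowned;   /* entry at the head: O(1) */
            else
                prev->next_unowned = cur->next_unowned;
            break;
        }
        prev = cur;
    }

Because smgropen() pushes each newly opened, unowned entry onto the head of the list, closing the batch in reverse opening order means the entry to remove is always the current head, so every removal takes the O(1) branch.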
On Sat, Jun 16, 2018 at 2:54 AM, Teodor Sigaev <teodor@sigaev.ru> wrote: >> We just had a customer hit this issue. I kind of wonder whether this >> shouldn't be backpatched: Currently the execution on the primary is >> O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the >> standby - with a lot higher constants to boot. That means it's very >> easy to get into situations where the standby starts to lag behind very >> significantly. > > +1, we faced that too +1 to back-patch. As Horiguchi-san pointed out, this is basically a fix for an oversight in commit 279628a0a7, not a new feature. Regards, -- Fujii Masao
On 2018-06-19 03:06:54 +0900, Fujii Masao wrote: > On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > > > On 2018-06-15 10:45:04 -0700, Andres Freund wrote: > >> > + > >> > + srels = palloc(sizeof(SMgrRelation) * ndelrels); > >> > for (i = 0; i < ndelrels; i++) > >> > - { > >> > - SMgrRelation srel = smgropen(delrels[i], InvalidBackendId); > >> > + srels[i] = smgropen(delrels[i], InvalidBackendId); > >> > > >> > - smgrdounlink(srel, false); > >> > - smgrclose(srel); > >> > - } > >> > + smgrdounlinkall(srels, ndelrels, false); > >> > + > >> > + for (i = 0; i < ndelrels; i++) > >> > + smgrclose(srels[i]); > >> > + pfree(srels); > > > > Hm. This will probably cause another complexity issue. If we just > > smgropen() the relation will be unowned. Which means smgrclose() will > > call remove_from_unowned_list(), which is O(open-relations). Which means > > this change afaict creates a new O(ndrels^2) behaviour? > > > > It seems like the easiest fix for that would be to have a local > > SMgrRelationData in that loop, that we assign the relations to? That's > > a bit hacky, but afaict should work? > > The easier (but tricky) fix for that is to call smgrclose() for > each SMgrRelation in the reverse order. That is, we should do > > for (i = ndelrels - 1; i >= 0; i--) > smgrclose(srels[i]); > > instead of > > for (i = 0; i < ndelrels; i++) > smgrclose(srels[i]); We could do that - but add_to_unowned_list() is actually a bottleneck in other places during recovery too. We pretty much never (outside of dropping relations / databases) close opened relations during recovery - which is obviously problematic since nearly all of the entries are unowned. I've come to the conclusion that we should have a global variable that just disables adding anything to the global lists. Greetings, Andres Freund
On Tue, Jun 19, 2018 at 6:13 AM, Fujii Masao <masao.fujii@gmail.com> wrote: > On Sat, Jun 16, 2018 at 2:54 AM, Teodor Sigaev <teodor@sigaev.ru> wrote: >>> We just had a customer hit this issue. I kind of wonder whether this >>> shouldn't be backpatched: Currently the execution on the primary is >>> O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the >>> standby - with a lot higher constants to boot. That means it's very >>> easy to get into situations where the standby starts to lag behind very >>> significantly. >> >> +1, we faced that too > > +1 to back-patch. As Horiguchi-san pointed out, this is basically > a fix for an oversight in commit 279628a0a7, not a new feature. +1 -- Thomas Munro http://www.enterprisedb.com
On 2018-06-18 11:13:47 -0700, Andres Freund wrote: > On 2018-06-19 03:06:54 +0900, Fujii Masao wrote: > > On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund <andres@anarazel.de> wrote: > > > Hi, > > > > > > On 2018-06-15 10:45:04 -0700, Andres Freund wrote: > > >> > + > > >> > + srels = palloc(sizeof(SMgrRelation) * ndelrels); > > >> > for (i = 0; i < ndelrels; i++) > > >> > - { > > >> > - SMgrRelation srel = smgropen(delrels[i], InvalidBackendId); > > >> > + srels[i] = smgropen(delrels[i], InvalidBackendId); > > >> > > > >> > - smgrdounlink(srel, false); > > >> > - smgrclose(srel); > > >> > - } > > >> > + smgrdounlinkall(srels, ndelrels, false); > > >> > + > > >> > + for (i = 0; i < ndelrels; i++) > > >> > + smgrclose(srels[i]); > > >> > + pfree(srels); > > > > > > Hm. This will probably cause another complexity issue. If we just > > > smgropen() the relation will be unowned. Which means smgrclose() will > > > call remove_from_unowned_list(), which is O(open-relations). Which means > > > this change afaict creates a new O(ndrels^2) behaviour? > > > > > > It seems like the easiest fix for that would be to have a local > > > SMgrRelationData in that loop, that we assign the relations to? That's > > > a bit hacky, but afaict should work? > > > > The easier (but tricky) fix for that is to call smgrclose() for > > each SMgrRelation in the reverse order. That is, we should do > > > > for (i = ndelrels - 1; i >= 0; i--) > > smgrclose(srels[i]); > > > > instead of > > > > for (i = 0; i < ndelrels; i++) > > smgrclose(srels[i]); > > We could do that - but add_to_unowned_list() is actually a bottleneck in > other places during recovery too. We pretty much never (outside of > dropping relations / databases) close opened relations during recovery - > which is obviously problematic since nearly all of the entries are > unowned. I've come to the conclusion that we should have a global > variable that just disables adding anything to the global lists. On second thought: I think we should your approach in the back branches, and something like I'm suggesting in master once open. Greetings, Andres Freund
On Wed, Jun 20, 2018 at 08:43:11PM -0700, Andres Freund wrote: > On 2018-06-18 11:13:47 -0700, Andres Freund wrote: >> We could do that - but add_to_unowned_list() is actually a bottleneck in >> other places during recovery too. We pretty much never (outside of >> dropping relations / databases) close opened relations during recovery - >> which is obviously problematic since nearly all of the entries are >> unowned. I've come to the conclusion that we should have a global >> variable that just disables adding anything to the global lists. > > On second thought: I think we should your approach in the back branches, > and something like I'm suggesting in master once open. +1. Let's also make sure that the removal of smgrdounlink and smgrdounlinkfork happens only on master. -- Michael
On 2018-06-21 14:40:58 +0900, Michael Paquier wrote: > On Wed, Jun 20, 2018 at 08:43:11PM -0700, Andres Freund wrote: > > On 2018-06-18 11:13:47 -0700, Andres Freund wrote: > >> We could do that - but add_to_unowned_list() is actually a bottleneck in > >> other places during recovery too. We pretty much never (outside of > >> dropping relations / databases) close opened relations during recovery - > >> which is obviously problematic since nearly all of the entries are > >> unowned. I've come to the conclusion that we should have a global > >> variable that just disables adding anything to the global lists. > > > > On second thought: I think we should your approach in the back branches, > > and something like I'm suggesting in master once open. > > +1. Let's also make sure that the removal of smgrdounlink and > smgrdounlinkfork happens only on master. FWIW, I'd just skip removing them. They seem useful functionality and don't hurt. I don't see the point in mucking around with them. Greetings, Andres Freund
On Fri, Jun 22, 2018 at 5:41 AM, Andres Freund <andres@anarazel.de> wrote: > On 2018-06-21 14:40:58 +0900, Michael Paquier wrote: >> On Wed, Jun 20, 2018 at 08:43:11PM -0700, Andres Freund wrote: >> > On 2018-06-18 11:13:47 -0700, Andres Freund wrote: >> >> We could do that - but add_to_unowned_list() is actually a bottleneck in >> >> other places during recovery too. We pretty much never (outside of >> >> dropping relations / databases) close opened relations during recovery - >> >> which is obviously problematic since nearly all of the entries are >> >> unowned. I've come to the conclusion that we should have a global >> >> variable that just disables adding anything to the global lists. >> > >> > On second thought: I think we should your approach in the back branches, >> > and something like I'm suggesting in master once open. I think we should take the hint in the comments and make it O(1) anyway. See attached draft patch. + SMgrRelation *srels = NULL; There seems to be no reason to initialise the variable to NULL in the three places where you do that. It is always assigned a value before use. There is probably a way to share a bit more code among those three places, but that sort of refactoring should be for later. -- Thomas Munro http://www.enterprisedb.com
On Wed, Jun 27, 2018 at 12:16 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > I think we should take the hint in the comments and make it O(1) > anyway. See attached draft patch. Alternatively, here is a shorter and sweeter dlist version (I did the open-coded one thinking of theoretical back-patchability). -- Thomas Munro http://www.enterprisedb.com
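For anyone not opening the attachment, the dlist variant amounts to fragments roughly like the sketch below (the field name and other details are assumptions; see the attached draft patch for the real thing). dlist_delete() only unlinks the node through its own prev/next pointers, which is what makes the removal O(1).

    #include "lib/ilist.h"

    /* In SMgrRelationData: a doubly linked list node instead of a next pointer. */
    dlist_node  node;

    /* In smgr.c: */
    static dlist_head unowned_relns = DLIST_STATIC_INIT(unowned_relns);

    static void
    add_to_unowned_list(SMgrRelation reln)
    {
        dlist_push_head(&unowned_relns, &reln->node);
    }

    static void
    remove_from_unowned_list(SMgrRelation reln)
    {
        dlist_delete(&reln->node);      /* O(1), no list walk required */
    }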
On Wed, Jun 27, 2018 at 1:13 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Jun 27, 2018 at 12:16 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> I think we should take the hint in the comments and make it O(1) >> anyway. See attached draft patch. > > Alternatively, here is a shorter and sweeter dlist version (I did the > open-coded one thinking of theoretical back-patchability). ... though, on second thoughts, the dlist version steam-rolls over the possibility that it might not be in the list (mentioned in the comments, though it's not immediately clear how that would happen). On further reflection, on the basis that it's the most conservative change, +1 for Fujii-san's close-in-reverse-order idea. We should reconsider that data structure for 12; there doesn't seem to be a good reason to carry all those comments warning about performance when the O(1) version is shorter than the comments. -- Thomas Munro http://www.enterprisedb.com
On 2018-06-27 13:44:03 +1200, Thomas Munro wrote: > On further reflection, on the basis that it's the most conservative > change, +1 for Fujii-san's close-in-reverse-order idea. We should > reconsider that data structure for 12; there doesn't seem to be a > good reason to carry all those comments warning about performance when > the O(1) version is shorter than the comments. Agreed on this. I like the dlist version more than the earlier one, so let's fix it up, for v12+. But regardless, I'd argue that we should consider disabling that infrastructure while in recovery - it's just unnecessary. Greetings, Andres Freund
On Wed, Jun 27, 2018 at 1:46 PM, Andres Freund <andres@anarazel.de> wrote: > On 2018-06-27 13:44:03 +1200, Thomas Munro wrote: >> On further reflection, on the basis that it's the most conservative >> change, +1 for Fujii-san's close-in-reverse-order idea. We should >> reconsider that data structure for 12; there doesn't seem to be a >> good reason to carry all those comments warning about performance when >> the O(1) version is shorter than the comments. > > Agreed on this. I like the dlist version more than the earlier one, so > let's fix it up, for v12+. But regardless, I'd argue that we should consider > disabling that infrastructure while in recovery - it's just unnecessary. I tested this patch using the functions that Michael provided + an extra variant truncate_tables(), using the -v2 patch + the adjustment to close in reverse order. I set shared_buffers to 4GB and max_locks_per_transaction to 30000. I tested three CREATE/DROP code paths, and also two different TRUNCATE paths:

SELECT create_tables(10000);
SELECT drop_tables(10000);

BEGIN;
SELECT create_tables(10000);
ROLLBACK;

SELECT create_tables(10000);
BEGIN;
SELECT drop_tables(10000);
PREPARE TRANSACTION 'x';
COMMIT PREPARED 'x';

SELECT create_tables(10000);
SELECT truncate_tables(10000);

In all these cases my startup process would eat 100% CPU for ~20 seconds, and with the patch this dropped to ~1-2 seconds (I didn't measure precisely, I just watched htop). I also tried back-patching to 9.3 and the results were the same. Based on this and positive feedback from other reviewers, I will mark this "Ready for Committer" (meaning v2 + close-in-reverse-order). I did have one cosmetic remark a few messages up (redundant variable initialisation). By the way, as noted by others already, there is still a way to reach a horrible case with or without the patch:

BEGIN;
SELECT create_tables(10000);
SELECT truncate_tables(10000);

That's a different code path that eats a lot of CPU on the *primary*, because:

        /*
         * Normally, we need a transaction-safe truncation here.  However, if
         * the table was either created in the current (sub)transaction or has
         * a new relfilenode in the current (sub)transaction, then we can just
         * truncate it in-place, because a rollback would cause the whole
         * table or the current physical file to be thrown away anyway.
         */
        if (rel->rd_createSubid == mySubid ||
            rel->rd_newRelfilenodeSubid == mySubid)
        {
            /* Immediate, non-rollbackable truncation is OK */
            heap_truncate_one_rel(rel);
        }
        else

Without range-scannable buffer mapping (Andres's radix tree thing), that bet doesn't work out too well when you do it more than once. Hmm... we could just... not do that? (Has anyone ever looked into a lazier approach to dropping buffers?) -- Thomas Munro http://www.enterprisedb.com
Hi, On 2018-06-27 15:56:58 +1200, Thomas Munro wrote: > That's a different code path that eats a lot of CPU on the *primary*, because: While that's obviously not great, I think it's far less dangerous than the standby having worse complexity than the primary. There's an obvious backpressure if commands on the primary take a while, but there's normally none for commands with high complexity on the standby... > /* > * Normally, we need a transaction-safe truncation > here. However, if > * the table was either created in the current > (sub)transaction or has > * a new relfilenode in the current (sub)transaction, > then we can just > * truncate it in-place, because a rollback would > cause the whole > * table or the current physical file to be thrown away anyway. > */ > if (rel->rd_createSubid == mySubid || > rel->rd_newRelfilenodeSubid == mySubid) > { > /* Immediate, non-rollbackable truncation is OK */ > heap_truncate_one_rel(rel); > } > else > > Without range-scannable buffer mapping (Andres's radix tree thing), > that bet doesn't work out too well when you do it more than once. > Hmm... we could just... not do that? That'd probably hurt noticeably too... > (Has anyone ever looked into a lazier approach to dropping buffers?) What precisely are you thinking of? We kinda now do something lazy-ish at EOXact... Greetings, Andres Freund
On Wed, Jun 27, 2018 at 4:17 PM, Andres Freund <andres@anarazel.de> wrote: > On 2018-06-27 15:56:58 +1200, Thomas Munro wrote: >> Without range-scannable buffer mapping (Andres's radix tree thing), >> that bet doesn't work out too well when you do it more than once. >> Hmm... we could just... not do that? > > That'd probably hurt noticeably too... Allow the optimisation only once per transaction? >> (Has anyone ever looked into a lazier approach to dropping buffers?) > > What precisely are you thinking of? We kinda now do something lazy-ish > at EOXact... I mean at the buffer level. If we did nothing at all, the problem would be dirty buffers that you'd eventually try to write out to non-existent files. What if... you just remembered recently dropped relations, and then whenever writing dirty buffers (either because of stealing or checkpointing) you could check if they belong to recently dropped relations and just mark them clean? To garbage collect the recently dropped list, you could somehow make use of the knowledge that checkpoints and full clock hand cycles must drop all such buffers. I'm not proposing anything, just musing and wondering if anyone has looked into this sort of thing... -- Thomas Munro http://www.enterprisedb.com
On Wed, Jun 27, 2018 at 10:44 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Jun 27, 2018 at 1:13 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Wed, Jun 27, 2018 at 12:16 PM, Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >>> I think we should take the hint in the comments and make it O(1) >>> anyway. See attached draft patch. >> >> Alternatively, here is a shorter and sweeter dlist version (I did the >> open-coded one thinking of theoretical back-patchability). > > ... though, on second thoughts, the dlist version steam-rolls over the > possibility that it might not be in the list (mentioned in the > comments, though it's not immediately clear how that would happen). > > On further reflection, on the basis that it's the most conservative > change, +1 for Fujii-san's close-in-reverse-order idea. We should > reconsider that data structure for 12 +1 Attached is v3 patch which I implemented close-in-reverse-order idea on v2. Regards, -- Fujii Masao
On 2018-06-28 03:21:51 +0900, Fujii Masao wrote: > On Wed, Jun 27, 2018 at 10:44 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > On Wed, Jun 27, 2018 at 1:13 PM, Thomas Munro > > <thomas.munro@enterprisedb.com> wrote: > >> On Wed, Jun 27, 2018 at 12:16 PM, Thomas Munro > >> <thomas.munro@enterprisedb.com> wrote: > >>> I think we should take the hint in the comments and make it O(1) > >>> anyway. See attached draft patch. > >> > >> Alternatively, here is a shorter and sweeter dlist version (I did the > >> open-coded one thinking of theoretical back-patchability). > > > > ... though, on second thoughts, the dlist version steam-rolls over the > > possibility that it might not be in the list (mentioned in the > > comments, though it's not immediately clear how that would happen). > > > > On further reflection, on the basis that it's the most conservative > > change, +1 for Fujii-san's close-in-reverse-order idea. We should > > reconsider that data structure for 12 > > +1 > > Attached is v3 patch which I implemented close-in-reverse-order idea > on v2. > > Regards, > > -- > Fujii Masao > --- a/src/backend/access/transam/twophase.c > +++ b/src/backend/access/transam/twophase.c > @@ -1457,6 +1457,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit) > int ndelrels; > SharedInvalidationMessage *invalmsgs; > int i; > + SMgrRelation *srels; > > /* > * Validate the GID, and lock the GXACT to ensure that two backends do not > @@ -1549,13 +1550,22 @@ FinishPreparedTransaction(const char *gid, bool isCommit) > delrels = abortrels; > ndelrels = hdr->nabortrels; > } > + > + srels = palloc(sizeof(SMgrRelation) * ndelrels); > for (i = 0; i < ndelrels; i++) > - { > - SMgrRelation srel = smgropen(delrels[i], InvalidBackendId); > + srels[i] = smgropen(delrels[i], InvalidBackendId); > > - smgrdounlink(srel, false); > - smgrclose(srel); > - } > + smgrdounlinkall(srels, ndelrels, false); > + > + /* > + * Call smgrclose() in reverse order as when smgropen() is called. > + * This trick enables remove_from_unowned_list() in smgrclose() > + * to search the SMgrRelation from the unowned list, > + * in O(1) performance. > + */ > + for (i = ndelrels - 1; i >= 0; i--) > + smgrclose(srels[i]); > + pfree(srels); > > /* > * Handle cache invalidation messages. > --- a/src/backend/access/transam/xact.c > +++ b/src/backend/access/transam/xact.c > @@ -5618,6 +5618,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed, > /* Make sure files supposed to be dropped are dropped */ > if (parsed->nrels > 0) > { > + SMgrRelation *srels; > + > /* > * First update minimum recovery point to cover this WAL record. Once > * a relation is deleted, there's no going back. The buffer manager > @@ -5635,6 +5637,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed, > */ > XLogFlush(lsn); > > + srels = palloc(sizeof(SMgrRelation) * parsed->nrels); > for (i = 0; i < parsed->nrels; i++) > { > SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId); > @@ -5642,9 +5645,20 @@ xact_redo_commit(xl_xact_parsed_commit *parsed, > > for (fork = 0; fork <= MAX_FORKNUM; fork++) > XLogDropRelation(parsed->xnodes[i], fork); > - smgrdounlink(srel, true); > - smgrclose(srel); > + srels[i] = srel; > } > + > + smgrdounlinkall(srels, parsed->nrels, true); > + > + /* > + * Call smgrclose() in reverse order as when smgropen() is called. > + * This trick enables remove_from_unowned_list() in smgrclose() > + * to search the SMgrRelation from the unowned list, > + * in O(1) performance. 
> + */ > + for (i = parsed->nrels - 1; i >= 0; i--) > + smgrclose(srels[i]); > + pfree(srels); > } > > void Can you please move the repeated stanza to a separate function? Repeating the same code and comment three times isn't good. Greetings, Andres Freund
On Thu, Jun 28, 2018 at 3:23 AM, Andres Freund <andres@anarazel.de> wrote: > On 2018-06-28 03:21:51 +0900, Fujii Masao wrote: >> On Wed, Jun 27, 2018 at 10:44 AM, Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >> > On Wed, Jun 27, 2018 at 1:13 PM, Thomas Munro >> > <thomas.munro@enterprisedb.com> wrote: >> >> On Wed, Jun 27, 2018 at 12:16 PM, Thomas Munro >> >> <thomas.munro@enterprisedb.com> wrote: >> >>> I think we should take the hint in the comments and make it O(1) >> >>> anyway. See attached draft patch. >> >> >> >> Alternatively, here is a shorter and sweeter dlist version (I did the >> >> open-coded one thinking of theoretical back-patchability). >> > >> > ... though, on second thoughts, the dlist version steam-rolls over the >> > possibility that it might not be in the list (mentioned in the >> > comments, though it's not immediately clear how that would happen). >> > >> > On further reflection, on the basis that it's the most conservative >> > change, +1 for Fujii-san's close-in-reverse-order idea. We should >> > reconsider that data structure for 12 >> >> +1 >> >> Attached is v3 patch which I implemented close-in-reverse-order idea >> on v2. >> >> Regards, >> >> -- >> Fujii Masao > >> --- a/src/backend/access/transam/twophase.c >> +++ b/src/backend/access/transam/twophase.c >> @@ -1457,6 +1457,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit) >> int ndelrels; >> SharedInvalidationMessage *invalmsgs; >> int i; >> + SMgrRelation *srels; >> >> /* >> * Validate the GID, and lock the GXACT to ensure that two backends do not >> @@ -1549,13 +1550,22 @@ FinishPreparedTransaction(const char *gid, bool isCommit) >> delrels = abortrels; >> ndelrels = hdr->nabortrels; >> } >> + >> + srels = palloc(sizeof(SMgrRelation) * ndelrels); >> for (i = 0; i < ndelrels; i++) >> - { >> - SMgrRelation srel = smgropen(delrels[i], InvalidBackendId); >> + srels[i] = smgropen(delrels[i], InvalidBackendId); >> >> - smgrdounlink(srel, false); >> - smgrclose(srel); >> - } >> + smgrdounlinkall(srels, ndelrels, false); >> + >> + /* >> + * Call smgrclose() in reverse order as when smgropen() is called. >> + * This trick enables remove_from_unowned_list() in smgrclose() >> + * to search the SMgrRelation from the unowned list, >> + * in O(1) performance. >> + */ >> + for (i = ndelrels - 1; i >= 0; i--) >> + smgrclose(srels[i]); >> + pfree(srels); >> >> /* >> * Handle cache invalidation messages. >> --- a/src/backend/access/transam/xact.c >> +++ b/src/backend/access/transam/xact.c >> @@ -5618,6 +5618,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed, >> /* Make sure files supposed to be dropped are dropped */ >> if (parsed->nrels > 0) >> { >> + SMgrRelation *srels; >> + >> /* >> * First update minimum recovery point to cover this WAL record. Once >> * a relation is deleted, there's no going back. The buffer manager >> @@ -5635,6 +5637,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed, >> */ >> XLogFlush(lsn); >> >> + srels = palloc(sizeof(SMgrRelation) * parsed->nrels); >> for (i = 0; i < parsed->nrels; i++) >> { >> SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId); >> @@ -5642,9 +5645,20 @@ xact_redo_commit(xl_xact_parsed_commit *parsed, >> >> for (fork = 0; fork <= MAX_FORKNUM; fork++) >> XLogDropRelation(parsed->xnodes[i], fork); >> - smgrdounlink(srel, true); >> - smgrclose(srel); >> + srels[i] = srel; >> } >> + >> + smgrdounlinkall(srels, parsed->nrels, true); >> + >> + /* >> + * Call smgrclose() in reverse order as when smgropen() is called. 
>> + * This trick enables remove_from_unowned_list() in smgrclose() >> + * to search the SMgrRelation from the unowned list, >> + * in O(1) performance. >> + */ >> + for (i = parsed->nrels - 1; i >= 0; i--) >> + smgrclose(srels[i]); >> + pfree(srels); >> } >> >> /* >> @@ -5685,6 +5699,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid) >> { >> int i; >> TransactionId max_xid; >> + SMgrRelation *srels; >> >> Assert(TransactionIdIsValid(xid)); >> >> @@ -5748,6 +5763,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid) >> } >> >> /* Make sure files supposed to be dropped are dropped */ >> + srels = palloc(sizeof(SMgrRelation) * parsed->nrels); >> for (i = 0; i < parsed->nrels; i++) >> { >> SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId); >> @@ -5755,9 +5771,20 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid) >> >> for (fork = 0; fork <= MAX_FORKNUM; fork++) >> XLogDropRelation(parsed->xnodes[i], fork); >> - smgrdounlink(srel, true); >> - smgrclose(srel); >> + srels[i] = srel; >> } >> + >> + smgrdounlinkall(srels, parsed->nrels, true); >> + >> + /* >> + * Call smgrclose() in reverse order as when smgropen() is called. >> + * This trick enables remove_from_unowned_list() in smgrclose() >> + * to search the SMgrRelation from the unowned list, >> + * in O(1) performance. >> + */ >> + for (i = parsed->nrels - 1; i >= 0; i--) >> + smgrclose(srels[i]); >> + pfree(srels); >> } >> >> void > > Can you please move the repeated stanzy to a separate function? > Repeating the same code and comment three times isn't good. OK, so what about the attached patch? Regards, -- Fujii Masao
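To make the requested refactoring concrete, the repeated stanza could be wrapped up roughly as below. The function name DropRelationFiles() and where it lives are guesses for illustration, and the attached patch may differ in details (for instance, the redo callers also issue XLogDropRelation() for each fork before unlinking).

    /*
     * Unlink the physical files of the given relations and close their smgr
     * handles, scanning shared_buffers only once for the whole batch.
     */
    void
    DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo)
    {
        SMgrRelation *srels;
        int           i;

        srels = palloc(sizeof(SMgrRelation) * ndelrels);
        for (i = 0; i < ndelrels; i++)
            srels[i] = smgropen(delrels[i], InvalidBackendId);

        smgrdounlinkall(srels, ndelrels, isRedo);

        /*
         * Close in reverse order of opening, so each entry is found at the
         * head of the unowned list and smgrclose() stays O(1).
         */
        for (i = ndelrels - 1; i >= 0; i--)
            smgrclose(srels[i]);

        pfree(srels);
    }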
On Tue, Jul 03, 2018 at 04:13:15AM +0900, Fujii Masao wrote: > OK, so what about the attached patch? I have been looking at this patch, and this looks in good shape to me (please indent!).

+ * Call smgrclose() in reverse order as when smgropen() is called.
+ * This trick enables remove_from_unowned_list() in smgrclose()
+ * to search the SMgrRelation from the unowned list,
+ * in O(1) performance.

A nit here: with O(1) performance. Could it be possible to add an assertion so that isRedo is never enabled out of recovery? -- Michael
On Tue, Jul 3, 2018 at 11:28 AM, Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Jul 03, 2018 at 04:13:15AM +0900, Fujii Masao wrote: >> OK, so what about the attached patch? > > I have been looking at this patch, and this looks in good shape to me Thanks for the review! So, committed. > (please indent!). Hmm.. I failed to find an indent issue in my patch... But anyway, a future run of pgindent will fix that even if it exists. > + * Call smgrclose() in reverse order as when smgropen() is called. > + * This trick enables remove_from_unowned_list() in smgrclose() > + * to search the SMgrRelation from the unowned list, > + * in O(1) performance. > > A nit here: with O(1) performance. I changed the patch accordingly. > Could it be possible to add an assertion so that isRedo is never enabled > out of recovery? I'm not sure if adding such an assertion to only the function that the patch added is helpful. There are many functions that have something like isRedo as an argument. Regards, -- Fujii Masao
On Thu, Jul 05, 2018 at 03:10:33AM +0900, Fujii Masao wrote: > On Tue, Jul 3, 2018 at 11:28 AM, Michael Paquier <michael@paquier.xyz> wrote: > Thanks for the review! So, committed. Thanks. >> (please indent!). > > Hmm.. I failed to find an indent issue in my patch... But anyway, > a future run of pgindent will fix that even if it exists. md.c is telling a rather different story ;) Well, spurious noise when touching the same files for a different patch is annoying... I have had that experience already with a couple of commits. -- Michael
Hello Fujii-san, On April 18, 2018, Fujii Masao wrote: > On Fri, Mar 30, 2018 at 12:18 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: >> Furthermore, TRUNCATE has a similar and worse issue. While DROP TABLE >> scans the shared buffers once for each table, TRUNCATE does that for >> each fork, resulting in three scans per table. How about changing the >> following functions >> >> smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber >> nblocks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum, >> BlockNumber firstDelBlock) >> >> to >> >> smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber >> *nblocks, int nforks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum, >> BlockNumber *firstDelBlock, >> int nforks) >> >> to perform the scan only once per table? If there doesn't seem to be a problem, >> I think I'll submit the patch next month. I'd welcome it if anyone could do that. > > Yeah, it's worth working on this problem. To decrease the number of scans of > shared_buffers, you would need to change the order of truncations of files and WAL > logging. In RelationTruncate(), currently WAL is logged after FSM and VM are truncated. > IOW, with the patch, FSM and VM would need to be truncated after WAL logging. You would > need to check whether this reordering is valid. I know it's been almost a year since the previous message, but if you could still recall your suggestion above, I'd like to ask a question. Could you explain your idea a bit further, on how the reordering of WAL writing and file truncation could reduce the number of shared_buffers scans? Thanks a lot in advance. Regards, Kirk Jamison
On Tue, Apr 16, 2019 at 10:48 AM Jamison, Kirk <k.jamison@jp.fujitsu.com> wrote: > > Hello Fujii-san, > > On April 18, 2018, Fujii Masao wrote: > > > On Fri, Mar 30, 2018 at 12:18 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > >> Furthermore, TRUNCATE has a similar and worse issue. While DROP TABLE > >> scans the shared buffers once for each table, TRUNCATE does that for > >> each fork, resulting in three scans per table. How about changing the > >> following functions > >> > >> smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber > >> nblocks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum, > >> BlockNumber firstDelBlock) > >> > >> to > >> > >> smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber > >> *nblocks, int nforks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum, > >> BlockNumber *firstDelBlock, > >> int nforks) > >> > >> to perform the scan only once per table? If there doesn't seem to be a problem, > >> I think I'll submit the patch next month. I'd welcome it if anyone could do that. > > > > Yeah, it's worth working on this problem. To decrease the number of scans of > > shared_buffers, you would need to change the order of truncations of files and WAL > > logging. In RelationTruncate(), currently WAL is logged after FSM and VM are truncated. > > IOW, with the patch, FSM and VM would need to be truncated after WAL logging. You would > > need to check whether this reordering is valid. > > I know it's been almost a year since the previous message, but if you could still > recall your suggestion above, I'd like to ask a question. > Could you explain your idea a bit further, on how the reordering of WAL writing and > file truncation could reduce the number of shared_buffers scans? Sorry, I forgot the details of that comment of mine. But anyway, you want to modify smgr_redo(info = XLOG_SMGR_TRUNCATE) so that the number of scans of shared_buffers is decreased to one. Right? IMO it's worth considering changing smgrtruncate(MAIN_FORK), FreeSpaceMapTruncateRel() and visibilitymap_truncate() so that they just mark the relation and blocks as to-be-deleted, and then scan shared_buffers only once to invalidate those blocks at the end of smgr_redo(). Regards, -- Fujii Masao
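Combined with the array-based DropRelFileNodeBuffers() signature proposed earlier in the thread, the redo side could then look roughly like this. This is only a sketch under that assumption, not the final design: the new_fsm_nblocks/new_vm_nblocks values are presumed to have been computed already, and rnode stands for the relation's RelFileNodeBackend.

    /* In smgr_redo(), for XLOG_SMGR_TRUNCATE: gather the truncation point of
     * every fork first, then invalidate their buffers in a single pass over
     * shared_buffers. */
    ForkNumber  forks[] = {MAIN_FORKNUM, FSM_FORKNUM, VISIBILITYMAP_FORKNUM};
    BlockNumber firstDelBlock[3];

    firstDelBlock[0] = xlrec->blkno;        /* new length of the main fork */
    firstDelBlock[1] = new_fsm_nblocks;     /* assumed precomputed */
    firstDelBlock[2] = new_vm_nblocks;      /* assumed precomputed */

    DropRelFileNodeBuffers(rnode, forks, firstDelBlock, 3);    /* one scan */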