Thread: Speedup of relation deletes during recovery

Speedup of relation deletes during recovery

From
Fujii Masao
Date:
Hi,

When multiple relations are deleted at the same transaction,
the files of those relations are deleted by one call to smgrdounlinkall(),
which leads to scan whole shared_buffers only one time. OTOH,
during recovery, smgrdounlink() (not smgrdounlinkall()) is called
for each file to delete, which leads to scan shared_buffers multiple times.
Obviously this can cause to increase the WAL replay time very much
especially when shared_buffers is huge.

To alleviate this situation, I'd like to propose to change the recovery
so that it also calls smgrdounlinkall() only one time to delete multiple
relation files. Patch attached. Thought?

Regards,

-- 
Fujii Masao

Attachment

Re: Speedup of relation deletes during recovery

From
Kyotaro HORIGUCHI
Date:
Hello.

At Fri, 30 Mar 2018 08:31:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com>
> Hi,
> 
> When multiple relations are deleted at the same transaction,
> the files of those relations are deleted by one call to smgrdounlinkall(),
> which leads to scan whole shared_buffers only one time. OTOH,
> during recovery, smgrdounlink() (not smgrdounlinkall()) is called
> for each file to delete, which leads to scan shared_buffers multiple times.
> Obviously this can cause to increase the WAL replay time very much
> especially when shared_buffers is huge.
> 
> To alleviate this situation, I'd like to propose to change the recovery
> so that it also calls smgrdounlinkall() only one time to delete multiple
> relation files. Patch attached. Thought?

It is obviously a left-over of 279628a0a7. This patch applies the
same change with the patch and looks fine for me. Note that
FinishPreparedTransaction has the same loop over smgrdounlink.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Speedup of relation deletes during recovery

From
Michael Paquier
Date:
On Fri, Mar 30, 2018 at 11:19:58AM +0900, Kyotaro HORIGUCHI wrote:
> At Fri, 30 Mar 2018 08:31:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com>
>> When multiple relations are deleted at the same transaction,
>> the files of those relations are deleted by one call to smgrdounlinkall(),
>> which leads to scan whole shared_buffers only one time. OTOH,
>> during recovery, smgrdounlink() (not smgrdounlinkall()) is called
>> for each file to delete, which leads to scan shared_buffers multiple times.
>> Obviously this can cause to increase the WAL replay time very much
>> especially when shared_buffers is huge.
>>
>> To alleviate this situation, I'd like to propose to change the recovery
>> so that it also calls smgrdounlinkall() only one time to delete multiple
>> relation files. Patch attached. Thought?
>
> It is obviously a left-over of 279628a0a7. This patch applies the
> same change with the patch and looks fine for me. Note that
> FinishPreparedTransaction has the same loop over smgrdounlink.

Yeah, I was just going to say the same after looking at Fujii-san's
patch.  This would also cause smgrdounlink() to become unused in the
core code.  So this could just be... Removed?

/me Preparing shelter against incoming wraith.

That's not v11 material at this point in any case.
--
Michael

Attachment

RE: Speedup of relation deletes during recovery

From
"Tsunakawa, Takayuki"
Date:
From: Fujii Masao [mailto:masao.fujii@gmail.com]
> When multiple relations are deleted at the same transaction, the files of
> those relations are deleted by one call to smgrdounlinkall(), which leads
> to scan whole shared_buffers only one time. OTOH, during recovery,
> smgrdounlink() (not smgrdounlinkall()) is called for each file to delete,
> which leads to scan shared_buffers multiple times.
> Obviously this can cause to increase the WAL replay time very much especially
> when shared_buffers is huge.
> 
> To alleviate this situation, I'd like to propose to change the recovery
> so that it also calls smgrdounlinkall() only one time to delete multiple
> relation files. Patch attached. Thought?

Nice catch.  As Horiguchi-san and Michael already commented, the patch looks good.

As a related improvement, the following proposal is effective for shortening WAL replay time of DROP TABLE (and
possiblyTRUNCATE as well.)  How should we proceed with this?
 

https://www.postgresql.org/message-id/A1CF58A8CBA14341B3F3AC6A468F18454545E4F3@g01jpexmbyt23


Furthermore, TRUNCATE has a similar and worse issue.  While DROP TABLE scans the shared buffers once for each table,
TRUNCATEdoes that for each fork, resulting in three scans per table.  How about changing the following functions
 

smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
                       BlockNumber firstDelBlock)

to

smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber *nblocks, int nforks)
DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
                       BlockNumber *firstDelBlock, int nforks)

to perform the scan only once per table?  If there doesn't seem to be a problem, I think I'll submit the patch next
month. I'd welcome if anyone could do that.
 


Regards
Takayuki Tsunakawa




Re: Speedup of relation deletes during recovery

From
Jerry Sievers
Date:
Fujii Masao <masao.fujii@gmail.com> writes:

> Hi,
>
> When multiple relations are deleted at the same transaction,
> the files of those relations are deleted by one call to smgrdounlinkall(),
> which leads to scan whole shared_buffers only one time. OTOH,
> during recovery, smgrdounlink() (not smgrdounlinkall()) is called
> for each file to delete, which leads to scan shared_buffers multiple times.
> Obviously this can cause to increase the WAL replay time very much
> especially when shared_buffers is huge.

Wonder if this is the case for streaming standbys replaying truncates
also?

Just couple days ago I was running a pg_upgrade test scenario but did
not reach the point of upgrade yet.

We made snapshots of our large reporting system putting them into a
master and 1-slave streaming replication configuration.

There are hundreds of unlogged tables that we wish to trunc before the
upgrade to save time during the rsyncing of standbys, since a standard
rsync replicates all data which is timeconsumeing and useless sending to
the standbys.

Indeed the tables are already trunc'd on the test master since it too
was a recovered system upon initial start but the truncates took place
anyway since it's part of our framework.  It ran in just a few seconds
on master.

The standby however was $slow replaying these truncates which we noticed
because the upgrade wil not proceed until master and 1 or more standbys
are confirmed all at same checkpoint.

I straced the standby's startup process to find it unlinking lots of
tables, getting -1 on the unlink syscall since the non-init fork files
were already missing.

I can't describe just how slow if was but took minutes and the lines
being output by strace were *not* blowing up my display as happens
generally when any busy process is straced.

And, the test systems were config'd with $large shared buffers of 64G.

Please advise


>
> To alleviate this situation, I'd like to propose to change the recovery
> so that it also calls smgrdounlinkall() only one time to delete multiple
> relation files. Patch attached. Thought?
>
> Regards,

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800


Re: Speedup of relation deletes during recovery

From
Jerry Sievers
Date:
Fujii Masao <masao.fujii@gmail.com> writes:

> Hi,
>
> When multiple relations are deleted at the same transaction,
> the files of those relations are deleted by one call to smgrdounlinkall(),
> which leads to scan whole shared_buffers only one time. OTOH,
> during recovery, smgrdounlink() (not smgrdounlinkall()) is called
> for each file to delete, which leads to scan shared_buffers multiple times.
> Obviously this can cause to increase the WAL replay time very much
> especially when shared_buffers is huge.

Forgot to mention version in my TLDR prev reply :-)

                                                                    version
                       
 

------------------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.5.12 on x86_64-pc-linux-gnu (Ubuntu 9.5.12-1.pgdg16.04+1), compiled by gcc (Ubuntu
5.4.0-6ubuntu1~16.04.9)5.4.0 20160609, 64-bit
 
(1 row)


>
> To alleviate this situation, I'd like to propose to change the recovery
> so that it also calls smgrdounlinkall() only one time to delete multiple
> relation files. Patch attached. Thought?
>
> Regards,

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consulting@comcast.net
p: 312.241.7800


RE: Speedup of relation deletes during recovery

From
"Tsunakawa, Takayuki"
Date:
From: Jerry Sievers [mailto:gsievers19@comcast.net]
> Wonder if this is the case for streaming standbys replaying truncates
> also?

Yes, As I wrote in my previous mail, TRUNCATE is worse than DROP TABLE.

Regards
Takayuki Tsunakawa




Re: Speedup of relation deletes during recovery

From
Fujii Masao
Date:
On Fri, Mar 30, 2018 at 11:46 AM, Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, Mar 30, 2018 at 11:19:58AM +0900, Kyotaro HORIGUCHI wrote:
>> At Fri, 30 Mar 2018 08:31:29 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwHVQkdfDqtvGVkty+19cQakAydXn1etGND3X0PHbZ3+6w@mail.gmail.com>
>>> When multiple relations are deleted at the same transaction,
>>> the files of those relations are deleted by one call to smgrdounlinkall(),
>>> which leads to scan whole shared_buffers only one time. OTOH,
>>> during recovery, smgrdounlink() (not smgrdounlinkall()) is called
>>> for each file to delete, which leads to scan shared_buffers multiple times.
>>> Obviously this can cause to increase the WAL replay time very much
>>> especially when shared_buffers is huge.
>>>
>>> To alleviate this situation, I'd like to propose to change the recovery
>>> so that it also calls smgrdounlinkall() only one time to delete multiple
>>> relation files. Patch attached. Thought?
>>
>> It is obviously a left-over of 279628a0a7. This patch applies the
>> same change with the patch and looks fine for me. Note that
>> FinishPreparedTransaction has the same loop over smgrdounlink.

Thanks for the review! I also changed FinishPreparedTransaction() so that
it uses smgrdounlinkall(). Patch attached.

> Yeah, I was just going to say the same after looking at Fujii-san's
> patch.  This would also cause smgrdounlink() to become unused in the
> core code.  So this could just be... Removed?

Yes, I think. And, I found that smgrdounlinkfork() is also dead code.
Per the discussion [1], this unused function was left intentionally.
But it's still dead code since 2012, so I'd like to remove it. Patch attached.

[1]
https://www.postgresql.org/message-id/1471.1339106082@sss.pgh.pa.us

Regards,

-- 
Fujii Masao

Attachment

Re: Speedup of relation deletes during recovery

From
Fujii Masao
Date:
On Fri, Mar 30, 2018 at 12:18 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> From: Fujii Masao [mailto:masao.fujii@gmail.com]
>> When multiple relations are deleted at the same transaction, the files of
>> those relations are deleted by one call to smgrdounlinkall(), which leads
>> to scan whole shared_buffers only one time. OTOH, during recovery,
>> smgrdounlink() (not smgrdounlinkall()) is called for each file to delete,
>> which leads to scan shared_buffers multiple times.
>> Obviously this can cause to increase the WAL replay time very much especially
>> when shared_buffers is huge.
>>
>> To alleviate this situation, I'd like to propose to change the recovery
>> so that it also calls smgrdounlinkall() only one time to delete multiple
>> relation files. Patch attached. Thought?
>
> Nice catch.  As Horiguchi-san and Michael already commented, the patch looks good.
>
> As a related improvement, the following proposal is effective for shortening WAL replay time of DROP TABLE (and
possiblyTRUNCATE as well.)  How should we proceed with this?
 
>
> https://www.postgresql.org/message-id/A1CF58A8CBA14341B3F3AC6A468F18454545E4F3@g01jpexmbyt23
>
>
> Furthermore, TRUNCATE has a similar and worse issue.  While DROP TABLE scans the shared buffers once for each table,
TRUNCATEdoes that for each fork, resulting in three scans per table.  How about changing the following functions
 
>
> smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
> DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
>                                            BlockNumber firstDelBlock)
>
> to
>
> smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber *nblocks, int nforks)
> DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
>                                            BlockNumber *firstDelBlock, int nforks)
>
> to perform the scan only once per table?  If there doesn't seem to be a problem, I think I'll submit the patch next
month. I'd welcome if anyone could do that.
 

Yeah, it's worth working on this problem. To decrease the number of scans of
shared_buffers, you would need to change the order of truncations of files and
WAL logging. In RelationTruncate(), currently WAL is logged after FSM and VM
are truncated. IOW, with the patch, FSM and VM would need to be truncated after
WAL logging. You would need to check whether this reordering is valid.

Regards,

-- 
Fujii Masao


Re: Speedup of relation deletes during recovery

From
Michael Paquier
Date:
On Wed, Apr 18, 2018 at 12:46:58AM +0900, Fujii Masao wrote:
> Yes, I think. And, I found that smgrdounlinkfork() is also dead code.
> Per the discussion [1], this unused function was left intentionally.
> But it's still dead code since 2012, so I'd like to remove it. Patch attached.

Indeed, it's close to six years and the status is the same.  So let's
drop it.  I have been surrounding the area to see if any modules
actually use those, particularly on github, but I could not find
callers.

The patch looks logically fine to me.  In your first message, you
mentioned that the replay time increased a lot.  Do you have numbers to
share with some large settings of shared_buffers?

It would be better to wait for v12 branch to open before pushing
anything, as the focus is on stabililizing things on v11.
--
Michael

Attachment

Re: Speedup of relation deletes during recovery

From
Fujii Masao
Date:
On Wed, Apr 18, 2018 at 10:44 AM, Michael Paquier <michael@paquier.xyz> wrote:
> On Wed, Apr 18, 2018 at 12:46:58AM +0900, Fujii Masao wrote:
>> Yes, I think. And, I found that smgrdounlinkfork() is also dead code.
>> Per the discussion [1], this unused function was left intentionally.
>> But it's still dead code since 2012, so I'd like to remove it. Patch attached.
>
> Indeed, it's close to six years and the status is the same.  So let's
> drop it.  I have been surrounding the area to see if any modules
> actually use those, particularly on github, but I could not find
> callers.
>
> The patch looks logically fine to me.  In your first message, you
> mentioned that the replay time increased a lot.  Do you have numbers to
> share with some large settings of shared_buffers?

No. But after my colleague truncated more than one hundred tables on
the server with shared_buffers = 300GB, the recovery could not finish
even after 10 minutes since the startup of the recovery. So I had to
shutdown the server immediately, set shared_buffers to very small
temporarily and start the server to cause the recovery to finish soon.

> It would be better to wait for v12 branch to open before pushing
> anything, as the focus is on stabililizing things on v11.

Yes, so I added this to next CommitFest.

Regards,

-- 
Fujii Masao


Re: Speedup of relation deletes during recovery

From
Michael Paquier
Date:
On Thu, Apr 19, 2018 at 01:52:26AM +0900, Fujii Masao wrote:
> No. But after my colleague truncated more than one hundred tables on
> the server with shared_buffers = 300GB, the recovery could not finish
> even after 10 minutes since the startup of the recovery. So I had to
> shutdown the server immediately, set shared_buffers to very small
> temporarily and start the server to cause the recovery to finish soon.

Hm, OK.  Here are simple functions to create and drop many relations in
a single transaction:
create or replace function create_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
    query_string := 'create table tab_' || i::text || ' (a int);';
    execute query_string;
  end loop;
end;
$$ language plpgsql;

create or replace function drop_tables(numtabs int)
returns void as $$
declare query_string text;
begin
  for i in 1..numtabs loop
    query_string := 'drop table tab_' || i::text;
    execute query_string;
  end loop;
end;
$$ language plpgsql;

With a server running 8GB of shared buffers (you likely need to increase
max_locks_per_transaction) and 10k relations created and dropped, it
took 50 seconds to finish redo of roughly 4 segments:
2018-04-19 11:17:15 JST [7472]: [14-1] db=,user=,app=,client= LOG:  redo
starts at 0/15DF2E8
2018-04-19 11:18:05 JST [7472]: [15-1] db=,user=,app=,client= LOG:
invalid record length at 0/4E46160: wanted 24, got 0
2018-04-19 11:18:05 JST [7472]: [16-1] db=,user=,app=,client= LOG:  redo
done at 0/4A7C6E

Then with your patch it took... Barely 1 second.
2018-04-19 11:20:33 JST [11690]: [14-1] db=,user=,app=,client= LOG:
redo starts at 0/15DF2E8
2018-04-19 11:20:34 JST [11690]: [15-1] db=,user=,app=,client= LOG:
invalid record length at 0/4E299B0: wanted 24, got 0
2018-04-19 11:20:34 JST [11690]: [16-1] db=,user=,app=,client= LOG:
redo done at 0/4E29978

Looking at profiles with perf I can also see that 95% of the time is
spent in DropRelFileNodesAllBuffers which is where the startup process
spends most of its CPU.  So HEAD is booh.  And your patch is excellent
in this context.
--
Michael

Attachment

RE: Speedup of relation deletes during recovery

From
"Tsunakawa, Takayuki"
Date:
From: Fujii Masao [mailto:masao.fujii@gmail.com]
> Yeah, it's worth working on this problem. To decrease the number of scans
> of
> shared_buffers, you would need to change the order of truncations of files
> and
> WAL logging. In RelationTruncate(), currently WAL is logged after FSM and
> VM
> are truncated. IOW, with the patch, FSM and VM would need to be truncated
> after
> WAL logging. You would need to check whether this reordering is valid.

Sure, thank you for advice.

Takayuki Tsunakawa


Re: Speedup of relation deletes during recovery

From
Andres Freund
Date:
Hi,

We just had a customer hit this issue. I kind of wonder whether this
shouldn't be backpatched: Currently the execution on the primary is
O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
standby - with a lot higher constants to boot.  That means it's very
easy to get into situations where the standy starts to lag behind very significantly.

> --- a/src/backend/access/transam/twophase.c
> +++ b/src/backend/access/transam/twophase.c
> @@ -1445,6 +1445,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
>      int            ndelrels;
>      SharedInvalidationMessage *invalmsgs;
>      int            i;
> +    SMgrRelation *srels = NULL;
>  
>      /*
>       * Validate the GID, and lock the GXACT to ensure that two backends do not
> @@ -1534,13 +1535,16 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
>          delrels = abortrels;
>          ndelrels = hdr->nabortrels;
>      }
> +
> +    srels = palloc(sizeof(SMgrRelation) * ndelrels);
>      for (i = 0; i < ndelrels; i++)
> -    {
> -        SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
> +        srels[i] = smgropen(delrels[i], InvalidBackendId);
>  
> -        smgrdounlink(srel, false);
> -        smgrclose(srel);
> -    }
> +    smgrdounlinkall(srels, ndelrels, false);
> +
> +    for (i = 0; i < ndelrels; i++)
> +        smgrclose(srels[i]);
> +    pfree(srels);

This code is now duplicated three times - shouldn't we just add a
function that encapsulates dropping relations in a commit/abort record?

Greetings,

Andres Freund


Re: Speedup of relation deletes during recovery

From
Teodor Sigaev
Date:
> We just had a customer hit this issue. I kind of wonder whether this
> shouldn't be backpatched: Currently the execution on the primary is
> O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
> standby - with a lot higher constants to boot.  That means it's very
> easy to get into situations where the standy starts to lag behind very significantly.
+1, we faced with that too
-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: Speedup of relation deletes during recovery

From
Andres Freund
Date:
Hi,

On 2018-06-15 10:45:04 -0700, Andres Freund wrote:
> > +
> > +    srels = palloc(sizeof(SMgrRelation) * ndelrels);
> >      for (i = 0; i < ndelrels; i++)
> > -    {
> > -        SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
> > +        srels[i] = smgropen(delrels[i], InvalidBackendId);
> >  
> > -        smgrdounlink(srel, false);
> > -        smgrclose(srel);
> > -    }
> > +    smgrdounlinkall(srels, ndelrels, false);
> > +
> > +    for (i = 0; i < ndelrels; i++)
> > +        smgrclose(srels[i]);
> > +    pfree(srels);

Hm. This will probably cause another complexity issue. If we just
smgropen() the relation will be unowned. Which means smgrclose() will
call remove_from_unowned_list(), which is O(open-relations). Which means
this change afaict creates a new O(ndrels^2) behaviour?

It seems like the easiest fix for that would be to have a local
SMgrRelationData in that loop, that we assign the relations to?  That's
a bit hacky, but afaict should work?

Greetings,

Andres Freund


Re: Speedup of relation deletes during recovery

From
Fujii Masao
Date:
On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2018-06-15 10:45:04 -0700, Andres Freund wrote:
>> > +
>> > +   srels = palloc(sizeof(SMgrRelation) * ndelrels);
>> >     for (i = 0; i < ndelrels; i++)
>> > -   {
>> > -           SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
>> > +           srels[i] = smgropen(delrels[i], InvalidBackendId);
>> >
>> > -           smgrdounlink(srel, false);
>> > -           smgrclose(srel);
>> > -   }
>> > +   smgrdounlinkall(srels, ndelrels, false);
>> > +
>> > +   for (i = 0; i < ndelrels; i++)
>> > +           smgrclose(srels[i]);
>> > +   pfree(srels);
>
> Hm. This will probably cause another complexity issue. If we just
> smgropen() the relation will be unowned. Which means smgrclose() will
> call remove_from_unowned_list(), which is O(open-relations). Which means
> this change afaict creates a new O(ndrels^2) behaviour?
>
> It seems like the easiest fix for that would be to have a local
> SMgrRelationData in that loop, that we assign the relations to?  That's
> a bit hacky, but afaict should work?

The easier (but tricky) fix for that is to call smgrclose() for
each SMgrRelation in the reverse order. That is, we should do

   for (i = ndelrels - 1; i >= 0; i--)
           smgrclose(srels[i]);

instead of

   for (i = 0; i < ndelrels; i++)
           smgrclose(srels[i]);

Since add_to_unowned_list() always adds the entry into the head of
the list, each SMgrRelation entry is added into the "unowned" list in
descending order of its identifier, i.e., SMgrRelation entry with larger
identifier should be placed ahead of one with smaller identifier.
So if we calls remove_from_unowed_list() in descending order of
SMgrRelation entry's identifier, the performance of
remove_from_unowned_list()'s search should O(1). IOW, we can
immediately find the SMgrRelation entry to remove, at the head of
the list.

Regards,

-- 
Fujii Masao


Re: Speedup of relation deletes during recovery

From
Fujii Masao
Date:
On Sat, Jun 16, 2018 at 2:54 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>> We just had a customer hit this issue. I kind of wonder whether this
>> shouldn't be backpatched: Currently the execution on the primary is
>> O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
>> standby - with a lot higher constants to boot.  That means it's very
>> easy to get into situations where the standy starts to lag behind very
>> significantly.
>
> +1, we faced with that too

+1 to back-patch. As Horiguchi-san pointed out, this is basically
the fix for oversight of commit 279628a0a7, not new feature.

Regards,

-- 
Fujii Masao


Re: Speedup of relation deletes during recovery

From
Andres Freund
Date:
On 2018-06-19 03:06:54 +0900, Fujii Masao wrote:
> On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund <andres@anarazel.de> wrote:
> > Hi,
> >
> > On 2018-06-15 10:45:04 -0700, Andres Freund wrote:
> >> > +
> >> > +   srels = palloc(sizeof(SMgrRelation) * ndelrels);
> >> >     for (i = 0; i < ndelrels; i++)
> >> > -   {
> >> > -           SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
> >> > +           srels[i] = smgropen(delrels[i], InvalidBackendId);
> >> >
> >> > -           smgrdounlink(srel, false);
> >> > -           smgrclose(srel);
> >> > -   }
> >> > +   smgrdounlinkall(srels, ndelrels, false);
> >> > +
> >> > +   for (i = 0; i < ndelrels; i++)
> >> > +           smgrclose(srels[i]);
> >> > +   pfree(srels);
> >
> > Hm. This will probably cause another complexity issue. If we just
> > smgropen() the relation will be unowned. Which means smgrclose() will
> > call remove_from_unowned_list(), which is O(open-relations). Which means
> > this change afaict creates a new O(ndrels^2) behaviour?
> >
> > It seems like the easiest fix for that would be to have a local
> > SMgrRelationData in that loop, that we assign the relations to?  That's
> > a bit hacky, but afaict should work?
> 
> The easier (but tricky) fix for that is to call smgrclose() for
> each SMgrRelation in the reverse order. That is, we should do
> 
>    for (i = ndelrels - 1; i >= 0; i--)
>            smgrclose(srels[i]);
> 
> instead of
> 
>    for (i = 0; i < ndelrels; i++)
>            smgrclose(srels[i]);

We could do that - but add_to_unowned_list() is actually a bottleneck in
other places during recovery too. We pretty much never (outside of
dropping relations / databases) close opened relations during recovery -
which is obviously problematic since nearly all of the entries are
unowned.  I've come to the conclusion that we should have a global
variable that just disables adding anything to the global lists.

Greetings,

Andres Freund


Re: Speedup of relation deletes during recovery

From
Thomas Munro
Date:
On Tue, Jun 19, 2018 at 6:13 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Sat, Jun 16, 2018 at 2:54 AM, Teodor Sigaev <teodor@sigaev.ru> wrote:
>>> We just had a customer hit this issue. I kind of wonder whether this
>>> shouldn't be backpatched: Currently the execution on the primary is
>>> O(NBuffers * log(ndrels)) whereas it's O(NBuffers * ndrels) on the
>>> standby - with a lot higher constants to boot.  That means it's very
>>> easy to get into situations where the standy starts to lag behind very
>>> significantly.
>>
>> +1, we faced with that too
>
> +1 to back-patch. As Horiguchi-san pointed out, this is basically
> the fix for oversight of commit 279628a0a7, not new feature.

+1

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Speedup of relation deletes during recovery

From
Andres Freund
Date:
On 2018-06-18 11:13:47 -0700, Andres Freund wrote:
> On 2018-06-19 03:06:54 +0900, Fujii Masao wrote:
> > On Sat, Jun 16, 2018 at 3:28 AM, Andres Freund <andres@anarazel.de> wrote:
> > > Hi,
> > >
> > > On 2018-06-15 10:45:04 -0700, Andres Freund wrote:
> > >> > +
> > >> > +   srels = palloc(sizeof(SMgrRelation) * ndelrels);
> > >> >     for (i = 0; i < ndelrels; i++)
> > >> > -   {
> > >> > -           SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
> > >> > +           srels[i] = smgropen(delrels[i], InvalidBackendId);
> > >> >
> > >> > -           smgrdounlink(srel, false);
> > >> > -           smgrclose(srel);
> > >> > -   }
> > >> > +   smgrdounlinkall(srels, ndelrels, false);
> > >> > +
> > >> > +   for (i = 0; i < ndelrels; i++)
> > >> > +           smgrclose(srels[i]);
> > >> > +   pfree(srels);
> > >
> > > Hm. This will probably cause another complexity issue. If we just
> > > smgropen() the relation will be unowned. Which means smgrclose() will
> > > call remove_from_unowned_list(), which is O(open-relations). Which means
> > > this change afaict creates a new O(ndrels^2) behaviour?
> > >
> > > It seems like the easiest fix for that would be to have a local
> > > SMgrRelationData in that loop, that we assign the relations to?  That's
> > > a bit hacky, but afaict should work?
> > 
> > The easier (but tricky) fix for that is to call smgrclose() for
> > each SMgrRelation in the reverse order. That is, we should do
> > 
> >    for (i = ndelrels - 1; i >= 0; i--)
> >            smgrclose(srels[i]);
> > 
> > instead of
> > 
> >    for (i = 0; i < ndelrels; i++)
> >            smgrclose(srels[i]);
> 
> We could do that - but add_to_unowned_list() is actually a bottleneck in
> other places during recovery too. We pretty much never (outside of
> dropping relations / databases) close opened relations during recovery -
> which is obviously problematic since nearly all of the entries are
> unowned.  I've come to the conclusion that we should have a global
> variable that just disables adding anything to the global lists.

On second thought: I think we should your approach in the back branches,
and something like I'm suggesting in master once open.

Greetings,

Andres Freund


Re: Speedup of relation deletes during recovery

From
Michael Paquier
Date:
On Wed, Jun 20, 2018 at 08:43:11PM -0700, Andres Freund wrote:
> On 2018-06-18 11:13:47 -0700, Andres Freund wrote:
>> We could do that - but add_to_unowned_list() is actually a bottleneck in
>> other places during recovery too. We pretty much never (outside of
>> dropping relations / databases) close opened relations during recovery -
>> which is obviously problematic since nearly all of the entries are
>> unowned.  I've come to the conclusion that we should have a global
>> variable that just disables adding anything to the global lists.
>
> On second thought: I think we should your approach in the back branches,
> and something like I'm suggesting in master once open.

+1.  Let's also make sure that the removal of smgrdounlink and
smgrdounlinkfork happens only on master.
--
Michael

Attachment

Re: Speedup of relation deletes during recovery

From
Andres Freund
Date:
On 2018-06-21 14:40:58 +0900, Michael Paquier wrote:
> On Wed, Jun 20, 2018 at 08:43:11PM -0700, Andres Freund wrote:
> > On 2018-06-18 11:13:47 -0700, Andres Freund wrote:
> >> We could do that - but add_to_unowned_list() is actually a bottleneck in
> >> other places during recovery too. We pretty much never (outside of
> >> dropping relations / databases) close opened relations during recovery -
> >> which is obviously problematic since nearly all of the entries are
> >> unowned.  I've come to the conclusion that we should have a global
> >> variable that just disables adding anything to the global lists.
> > 
> > On second thought: I think we should your approach in the back branches,
> > and something like I'm suggesting in master once open.
> 
> +1.  Let's also make sure that the removal of smgrdounlink and
> smgrdounlinkfork happens only on master.

FWIW, I'd just skip removing them. They seem useful functionality and
don't hurt. I don't see the point in mucking around with them.

Greetings,

Andres Freund


Re: Speedup of relation deletes during recovery

From
Thomas Munro
Date:
On Fri, Jun 22, 2018 at 5:41 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-06-21 14:40:58 +0900, Michael Paquier wrote:
>> On Wed, Jun 20, 2018 at 08:43:11PM -0700, Andres Freund wrote:
>> > On 2018-06-18 11:13:47 -0700, Andres Freund wrote:
>> >> We could do that - but add_to_unowned_list() is actually a bottleneck in
>> >> other places during recovery too. We pretty much never (outside of
>> >> dropping relations / databases) close opened relations during recovery -
>> >> which is obviously problematic since nearly all of the entries are
>> >> unowned.  I've come to the conclusion that we should have a global
>> >> variable that just disables adding anything to the global lists.
>> >
>> > On second thought: I think we should your approach in the back branches,
>> > and something like I'm suggesting in master once open.

I think we should take the hint in the comments and make it O(1)
anyway.  See attached draft patch.

+       SMgrRelation *srels = NULL;

There seems to be no reason to initialise the variable to NULL in the
three places where you do that.  It is always assigned a value before
use.

There is probably a way to share a bit more code among those three
places, but that sort of refactoring should be for later.

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Speedup of relation deletes during recovery

From
Thomas Munro
Date:
On Wed, Jun 27, 2018 at 12:16 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I think we should take the hint in the comments and make it O(1)
> anyway.  See attached draft patch.

Alternatively, here is a shorter and sweeter dlist version (I did the
open-coded one thinking of theoretical back-patchability).

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Speedup of relation deletes during recovery

From
Thomas Munro
Date:
On Wed, Jun 27, 2018 at 1:13 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Jun 27, 2018 at 12:16 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> I think we should take the hint in the comments and make it O(1)
>> anyway.  See attached draft patch.
>
> Alternatively, here is a shorter and sweeter dlist version (I did the
> open-coded one thinking of theoretical back-patchability).

... though, on second thoughts, the dlist version steam-rolls over the
possibility that it might not be in the list (mentioned in the
comments, though it's not immediately clear how that would happen).

On further reflection, on the basis that it's the most conservative
change, +1 for Fujii-san's close-in-reverse-order idea.  We should
reconsider that data structure for 12; there doesn't seems to be a
good reason to carry all those comments warning about performance when
the O(1) version is shorter than the comments.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Speedup of relation deletes during recovery

From
Andres Freund
Date:
On 2018-06-27 13:44:03 +1200, Thomas Munro wrote:
> On further reflection, on the basis that it's the most conservative
> change, +1 for Fujii-san's close-in-reverse-order idea.  We should
> reconsider that data structure for 12; there doesn't seems to be a
> good reason to carry all those comments warning about performance when
> the O(1) version is shorter than the comments.

Agreed on this.  I like the dlist version more than the earlier one, so
let's fix it up, for v12+.  But regardless I'd argue that we consider
disabling that infrastructure while in recovery - it's just unnecessary.

Greetings,

Andres Freund


Re: Speedup of relation deletes during recovery

From
Thomas Munro
Date:
On Wed, Jun 27, 2018 at 1:46 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-06-27 13:44:03 +1200, Thomas Munro wrote:
>> On further reflection, on the basis that it's the most conservative
>> change, +1 for Fujii-san's close-in-reverse-order idea.  We should
>> reconsider that data structure for 12; there doesn't seems to be a
>> good reason to carry all those comments warning about performance when
>> the O(1) version is shorter than the comments.
>
> Agreed on this.  I like the dlist version more than the earlier one, so
> let's fix it up, for v12+.  But regardless I'd argue that we consider
> disabling that infrastructure while in recovery - it's just unnecessary.

I tested this patch using the functions that Michael provided + an
extra variant truncate_tables(), using the -v2 patch + the adjustment
to close in reverse order.  I set shared_buffers to 4GB and
max_locks_per_transaction to 30000.  I tested three CREATE/DROP code
paths, and also two different TRUNCATE paths:

SELECT create_tables(10000);
SELECT drop_tables(10000);

BEGIN;
SELECT create_tables(10000);
ROLLBACK;

SELECT create_tables(10000);
BEGIN;
SELECT drop_tables(10000);
PREPARE TRANSACTION 'x';
COMMIT PREPARED 'x';

SELECT create_tables(10000);
SELECT truncate_tables(10000);

In all these cases my startup process would eat 100% CPU for ~20
seconds, and with the patch this dropped to ~1-2 second (I didn't
measure precisely, I just watched htop) with the patch.  I also tried
back-patching to 9.3 and the results were the same.

Based on this and positive feedback from other reviewers, I will mark
this "Ready for Committer" (meaning v2 + close-in-reverse-order).  I
did have one cosmetic remark a few messages up (redundant variable
initialisation).

By the way, as noted by others already, there is still a way to reach
a horrible case with or without the patch:

BEGIN;
SELECT create_tables(10000);
SELECT truncate_tables(10000);

That's a different code path that eats a lot of CPU on the *primary*, because:

                /*
                 * Normally, we need a transaction-safe truncation
here.  However, if
                 * the table was either created in the current
(sub)transaction or has
                 * a new relfilenode in the current (sub)transaction,
then we can just
                 * truncate it in-place, because a rollback would
cause the whole
                 * table or the current physical file to be thrown away anyway.
                 */
                if (rel->rd_createSubid == mySubid ||
                        rel->rd_newRelfilenodeSubid == mySubid)
                {
                        /* Immediate, non-rollbackable truncation is OK */
                        heap_truncate_one_rel(rel);
                }
                else

Without range-scannable buffer mapping (Andres's radix tree thing),
that bet doesn't work out too well when you do it more than once.
Hmm... we could just... not do that?

(Has anyone ever looked into a lazier approach to dropping buffers?)

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Speedup of relation deletes during recovery

From
Andres Freund
Date:
Hi,

On 2018-06-27 15:56:58 +1200, Thomas Munro wrote:
> That's a different code path that eats a lot of CPU on the *primary*, because:

While that's obviously not great, I think it's far less dangerous than
the standby having worse complexity than the primary. There's an obvious
backpressure if commands on the primary take a while, but there's
normally none for commands with high complexity on the standby...


>                 /*
>                  * Normally, we need a transaction-safe truncation
> here.  However, if
>                  * the table was either created in the current
> (sub)transaction or has
>                  * a new relfilenode in the current (sub)transaction,
> then we can just
>                  * truncate it in-place, because a rollback would
> cause the whole
>                  * table or the current physical file to be thrown away anyway.
>                  */
>                 if (rel->rd_createSubid == mySubid ||
>                         rel->rd_newRelfilenodeSubid == mySubid)
>                 {
>                         /* Immediate, non-rollbackable truncation is OK */
>                         heap_truncate_one_rel(rel);
>                 }
>                 else
> 
> Without range-scannable buffer mapping (Andres's radix tree thing),
> that bet doesn't work out too well when you do it more than once.
> Hmm... we could just... not do that?

That'd probably hurt noticably too...


> (Has anyone ever looked into a lazier approach to dropping buffers?)

What precisely are you thinking of? We kinda now do something lazy-ish
at EOXact...

Greetings,

Andres Freund


Re: Speedup of relation deletes during recovery

From
Thomas Munro
Date:
On Wed, Jun 27, 2018 at 4:17 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-06-27 15:56:58 +1200, Thomas Munro wrote:
>> Without range-scannable buffer mapping (Andres's radix tree thing),
>> that bet doesn't work out too well when you do it more than once.
>> Hmm... we could just... not do that?
>
> That'd probably hurt noticably too...

Allow the optimisation only once per transaction?

>> (Has anyone ever looked into a lazier approach to dropping buffers?)
>
> What precisely are you thinking of? We kinda now do something lazy-ish
> at EOXact...

I mean at the buffer level.  If we did nothing at all, the problem
would be dirty buffers that you'd eventually try to write out to
non-existant files.  What if... you just remembered recently dropped
relations, and then whenever writing dirty buffers (either because of
stealing or checkpointing) you could check if they belong to recently
dropped relations and just mark them clean?  To garbage collect the
recently dropped list, you could somehow make use of the knowledge
that checkpoints and full clock hand cycles must drop all such
buffers.  I'm not proposing anything, just musing and wondering if
anyone has looked into this sort of thing...

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Speedup of relation deletes during recovery

From
Fujii Masao
Date:
On Wed, Jun 27, 2018 at 10:44 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Jun 27, 2018 at 1:13 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Wed, Jun 27, 2018 at 12:16 PM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> I think we should take the hint in the comments and make it O(1)
>>> anyway.  See attached draft patch.
>>
>> Alternatively, here is a shorter and sweeter dlist version (I did the
>> open-coded one thinking of theoretical back-patchability).
>
> ... though, on second thoughts, the dlist version steam-rolls over the
> possibility that it might not be in the list (mentioned in the
> comments, though it's not immediately clear how that would happen).
>
> On further reflection, on the basis that it's the most conservative
> change, +1 for Fujii-san's close-in-reverse-order idea.  We should
> reconsider that data structure for 12

+1

Attached is v3 patch which I implemented close-in-reverse-order idea
on v2.

Regards,

-- 
Fujii Masao

Attachment

Re: Speedup of relation deletes during recovery

From
Andres Freund
Date:
On 2018-06-28 03:21:51 +0900, Fujii Masao wrote:
> On Wed, Jun 27, 2018 at 10:44 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > On Wed, Jun 27, 2018 at 1:13 PM, Thomas Munro
> > <thomas.munro@enterprisedb.com> wrote:
> >> On Wed, Jun 27, 2018 at 12:16 PM, Thomas Munro
> >> <thomas.munro@enterprisedb.com> wrote:
> >>> I think we should take the hint in the comments and make it O(1)
> >>> anyway.  See attached draft patch.
> >>
> >> Alternatively, here is a shorter and sweeter dlist version (I did the
> >> open-coded one thinking of theoretical back-patchability).
> >
> > ... though, on second thoughts, the dlist version steam-rolls over the
> > possibility that it might not be in the list (mentioned in the
> > comments, though it's not immediately clear how that would happen).
> >
> > On further reflection, on the basis that it's the most conservative
> > change, +1 for Fujii-san's close-in-reverse-order idea.  We should
> > reconsider that data structure for 12
> 
> +1
> 
> Attached is v3 patch which I implemented close-in-reverse-order idea
> on v2.
> 
> Regards,
> 
> -- 
> Fujii Masao

> --- a/src/backend/access/transam/twophase.c
> +++ b/src/backend/access/transam/twophase.c
> @@ -1457,6 +1457,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
>      int            ndelrels;
>      SharedInvalidationMessage *invalmsgs;
>      int            i;
> +    SMgrRelation *srels;
>  
>      /*
>       * Validate the GID, and lock the GXACT to ensure that two backends do not
> @@ -1549,13 +1550,22 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
>          delrels = abortrels;
>          ndelrels = hdr->nabortrels;
>      }
> +
> +    srels = palloc(sizeof(SMgrRelation) * ndelrels);
>      for (i = 0; i < ndelrels; i++)
> -    {
> -        SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
> +        srels[i] = smgropen(delrels[i], InvalidBackendId);
>  
> -        smgrdounlink(srel, false);
> -        smgrclose(srel);
> -    }
> +    smgrdounlinkall(srels, ndelrels, false);
> +
> +    /*
> +     * Call smgrclose() in reverse order as when smgropen() is called.
> +     * This trick enables remove_from_unowned_list() in smgrclose()
> +     * to search the SMgrRelation from the unowned list,
> +     * in O(1) performance.
> +     */
> +    for (i = ndelrels - 1; i >= 0; i--)
> +        smgrclose(srels[i]);
> +    pfree(srels);
>  
>      /*
>       * Handle cache invalidation messages.
> --- a/src/backend/access/transam/xact.c
> +++ b/src/backend/access/transam/xact.c
> @@ -5618,6 +5618,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
>      /* Make sure files supposed to be dropped are dropped */
>      if (parsed->nrels > 0)
>      {
> +        SMgrRelation *srels;
> +
>          /*
>           * First update minimum recovery point to cover this WAL record. Once
>           * a relation is deleted, there's no going back. The buffer manager
> @@ -5635,6 +5637,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
>           */
>          XLogFlush(lsn);
>  
> +        srels = palloc(sizeof(SMgrRelation) * parsed->nrels);
>          for (i = 0; i < parsed->nrels; i++)
>          {
>              SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
> @@ -5642,9 +5645,20 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
>  
>              for (fork = 0; fork <= MAX_FORKNUM; fork++)
>                  XLogDropRelation(parsed->xnodes[i], fork);
> -            smgrdounlink(srel, true);
> -            smgrclose(srel);
> +            srels[i] = srel;
>          }
> +
> +        smgrdounlinkall(srels, parsed->nrels, true);
> +
> +        /*
> +         * Call smgrclose() in reverse order as when smgropen() is called.
> +         * This trick enables remove_from_unowned_list() in smgrclose()
> +         * to search the SMgrRelation from the unowned list,
> +         * in O(1) performance.
> +         */
> +        for (i = parsed->nrels - 1; i >= 0; i--)
> +            smgrclose(srels[i]);
> +        pfree(srels);
>      }
>  
>      /*
> @@ -5685,6 +5699,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
>  {
>      int            i;
>      TransactionId max_xid;
> +    SMgrRelation *srels;
>  
>      Assert(TransactionIdIsValid(xid));
>  
> @@ -5748,6 +5763,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
>      }
>  
>      /* Make sure files supposed to be dropped are dropped */
> +    srels = palloc(sizeof(SMgrRelation) * parsed->nrels);
>      for (i = 0; i < parsed->nrels; i++)
>      {
>          SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
> @@ -5755,9 +5771,20 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
>  
>          for (fork = 0; fork <= MAX_FORKNUM; fork++)
>              XLogDropRelation(parsed->xnodes[i], fork);
> -        smgrdounlink(srel, true);
> -        smgrclose(srel);
> +        srels[i] = srel;
>      }
> +
> +    smgrdounlinkall(srels, parsed->nrels, true);
> +
> +    /*
> +     * Call smgrclose() in reverse order as when smgropen() is called.
> +     * This trick enables remove_from_unowned_list() in smgrclose()
> +     * to search the SMgrRelation from the unowned list,
> +     * in O(1) performance.
> +     */
> +    for (i = parsed->nrels - 1; i >= 0; i--)
> +        smgrclose(srels[i]);
> +    pfree(srels);
>  }
>  
>  void

Can you please move the repeated stanzy to a separate function?
Repeating the same code and comment three times isn't good.

Greetings,

Andres Freund


Re: Speedup of relation deletes during recovery

From
Fujii Masao
Date:
On Thu, Jun 28, 2018 at 3:23 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-06-28 03:21:51 +0900, Fujii Masao wrote:
>> On Wed, Jun 27, 2018 at 10:44 AM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>> > On Wed, Jun 27, 2018 at 1:13 PM, Thomas Munro
>> > <thomas.munro@enterprisedb.com> wrote:
>> >> On Wed, Jun 27, 2018 at 12:16 PM, Thomas Munro
>> >> <thomas.munro@enterprisedb.com> wrote:
>> >>> I think we should take the hint in the comments and make it O(1)
>> >>> anyway.  See attached draft patch.
>> >>
>> >> Alternatively, here is a shorter and sweeter dlist version (I did the
>> >> open-coded one thinking of theoretical back-patchability).
>> >
>> > ... though, on second thoughts, the dlist version steam-rolls over the
>> > possibility that it might not be in the list (mentioned in the
>> > comments, though it's not immediately clear how that would happen).
>> >
>> > On further reflection, on the basis that it's the most conservative
>> > change, +1 for Fujii-san's close-in-reverse-order idea.  We should
>> > reconsider that data structure for 12
>>
>> +1
>>
>> Attached is v3 patch which I implemented close-in-reverse-order idea
>> on v2.
>>
>> Regards,
>>
>> --
>> Fujii Masao
>
>> --- a/src/backend/access/transam/twophase.c
>> +++ b/src/backend/access/transam/twophase.c
>> @@ -1457,6 +1457,7 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
>>       int                     ndelrels;
>>       SharedInvalidationMessage *invalmsgs;
>>       int                     i;
>> +     SMgrRelation *srels;
>>
>>       /*
>>        * Validate the GID, and lock the GXACT to ensure that two backends do not
>> @@ -1549,13 +1550,22 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
>>               delrels = abortrels;
>>               ndelrels = hdr->nabortrels;
>>       }
>> +
>> +     srels = palloc(sizeof(SMgrRelation) * ndelrels);
>>       for (i = 0; i < ndelrels; i++)
>> -     {
>> -             SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
>> +             srels[i] = smgropen(delrels[i], InvalidBackendId);
>>
>> -             smgrdounlink(srel, false);
>> -             smgrclose(srel);
>> -     }
>> +     smgrdounlinkall(srels, ndelrels, false);
>> +
>> +     /*
>> +      * Call smgrclose() in reverse order as when smgropen() is called.
>> +      * This trick enables remove_from_unowned_list() in smgrclose()
>> +      * to search the SMgrRelation from the unowned list,
>> +      * in O(1) performance.
>> +      */
>> +     for (i = ndelrels - 1; i >= 0; i--)
>> +             smgrclose(srels[i]);
>> +     pfree(srels);
>>
>>       /*
>>        * Handle cache invalidation messages.
>> --- a/src/backend/access/transam/xact.c
>> +++ b/src/backend/access/transam/xact.c
>> @@ -5618,6 +5618,8 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
>>       /* Make sure files supposed to be dropped are dropped */
>>       if (parsed->nrels > 0)
>>       {
>> +             SMgrRelation *srels;
>> +
>>               /*
>>                * First update minimum recovery point to cover this WAL record. Once
>>                * a relation is deleted, there's no going back. The buffer manager
>> @@ -5635,6 +5637,7 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
>>                */
>>               XLogFlush(lsn);
>>
>> +             srels = palloc(sizeof(SMgrRelation) * parsed->nrels);
>>               for (i = 0; i < parsed->nrels; i++)
>>               {
>>                       SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
>> @@ -5642,9 +5645,20 @@ xact_redo_commit(xl_xact_parsed_commit *parsed,
>>
>>                       for (fork = 0; fork <= MAX_FORKNUM; fork++)
>>                               XLogDropRelation(parsed->xnodes[i], fork);
>> -                     smgrdounlink(srel, true);
>> -                     smgrclose(srel);
>> +                     srels[i] = srel;
>>               }
>> +
>> +             smgrdounlinkall(srels, parsed->nrels, true);
>> +
>> +             /*
>> +              * Call smgrclose() in reverse order as when smgropen() is called.
>> +              * This trick enables remove_from_unowned_list() in smgrclose()
>> +              * to search the SMgrRelation from the unowned list,
>> +              * in O(1) performance.
>> +              */
>> +             for (i = parsed->nrels - 1; i >= 0; i--)
>> +                     smgrclose(srels[i]);
>> +             pfree(srels);
>>       }
>>
>>       /*
>> @@ -5685,6 +5699,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
>>  {
>>       int                     i;
>>       TransactionId max_xid;
>> +     SMgrRelation *srels;
>>
>>       Assert(TransactionIdIsValid(xid));
>>
>> @@ -5748,6 +5763,7 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
>>       }
>>
>>       /* Make sure files supposed to be dropped are dropped */
>> +     srels = palloc(sizeof(SMgrRelation) * parsed->nrels);
>>       for (i = 0; i < parsed->nrels; i++)
>>       {
>>               SMgrRelation srel = smgropen(parsed->xnodes[i], InvalidBackendId);
>> @@ -5755,9 +5771,20 @@ xact_redo_abort(xl_xact_parsed_abort *parsed, TransactionId xid)
>>
>>               for (fork = 0; fork <= MAX_FORKNUM; fork++)
>>                       XLogDropRelation(parsed->xnodes[i], fork);
>> -             smgrdounlink(srel, true);
>> -             smgrclose(srel);
>> +             srels[i] = srel;
>>       }
>> +
>> +     smgrdounlinkall(srels, parsed->nrels, true);
>> +
>> +     /*
>> +      * Call smgrclose() in reverse order as when smgropen() is called.
>> +      * This trick enables remove_from_unowned_list() in smgrclose()
>> +      * to search the SMgrRelation from the unowned list,
>> +      * in O(1) performance.
>> +      */
>> +     for (i = parsed->nrels - 1; i >= 0; i--)
>> +             smgrclose(srels[i]);
>> +     pfree(srels);
>>  }
>>
>>  void
>
> Can you please move the repeated stanzy to a separate function?
> Repeating the same code and comment three times isn't good.

OK, so what about the attached patch?

Regards,

-- 
Fujii Masao

Attachment

Re: Speedup of relation deletes during recovery

From
Michael Paquier
Date:
On Tue, Jul 03, 2018 at 04:13:15AM +0900, Fujii Masao wrote:
> OK, so what about the attached patch?

I have been looking at this patch, and this looks in good shape to me
(please indent!).

+    * Call smgrclose() in reverse order as when smgropen() is called.
+    * This trick enables remove_from_unowned_list() in smgrclose()
+    * to search the SMgrRelation from the unowned list,
+    * in O(1) performance.

A nit here: with O(1) performance.

Could it be possible to add an assertion so as isRedo is never enabled
out of recovery?
--
Michael

Attachment

Re: Speedup of relation deletes during recovery

From
Fujii Masao
Date:
On Tue, Jul 3, 2018 at 11:28 AM, Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Jul 03, 2018 at 04:13:15AM +0900, Fujii Masao wrote:
>> OK, so what about the attached patch?
>
> I have been looking at this patch, and this looks in good shape to me

Thanks for the review! So, committed.

> (please indent!).

Hmm.. I failed to find indent issue in my patch... But anyway
future execution of pgindent will fix that even if it exists.

> +    * Call smgrclose() in reverse order as when smgropen() is called.
> +    * This trick enables remove_from_unowned_list() in smgrclose()
> +    * to search the SMgrRelation from the unowned list,
> +    * in O(1) performance.
>
> A nit here: with O(1) performance.

I changed the patch accordingly.

> Could it be possible to add an assertion so as isRedo is never enabled
> out of recovery?

I'm not sure if adding such assertion into only the function that
the patch added is helpful. There are many functions having something
like isRedo as an argument.

Regards,

-- 
Fujii Masao


Re: Speedup of relation deletes during recovery

From
Michael Paquier
Date:
On Thu, Jul 05, 2018 at 03:10:33AM +0900, Fujii Masao wrote:
> On Tue, Jul 3, 2018 at 11:28 AM, Michael Paquier <michael@paquier.xyz> wrote:
> Thanks for the review! So, committed.

Thanks.

>> (please indent!).
>
> Hmm.. I failed to find indent issue in my patch... But anyway
> future execution of pgindent will fix that even if it exists.

md.c is telling a rather different story ;)

Well, spurious noise when touching the same files for a different patch
is annoying...  I have made the experience already for a couple of
commits.
--
Michael

Attachment

RE: Speedup of relation deletes during recovery

From
"Jamison, Kirk"
Date:
Hello Fujii-san, 

On April 18, 2018, Fujii Masao wrote: 

> On Fri, Mar 30, 2018 at 12:18 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
>> Furthermore, TRUNCATE has a similar and worse issue.  While DROP TABLE 
>> scans the shared buffers once for each table, TRUNCATE does that for 
>> each fork, resulting in three scans per table.  How about changing the 
>> following functions
>>
>> smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber 
>> nblocks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
>>                                            BlockNumber firstDelBlock)
>>
>> to
>>
>> smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber 
>> *nblocks, int nforks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
>>                                            BlockNumber *firstDelBlock, 
>> int nforks)
>>
>> to perform the scan only once per table?  If there doesn't seem to be a problem,
>> I think I'll submit the patch next month.  I'd welcome if anyone could do that.
>
> Yeah, it's worth working on this problem. To decrease the number of scans of
> shared_buffers, you would need to change the order of truncations of files and WAL
> logging. In RelationTruncate(), currently WAL is logged after FSM and VM are truncated.
> IOW, with the patch, FSM and VM would need to be truncated after WAL logging. You would
> need to check whether this reordering is valid.

I know it's been almost a year since the previous message, but if you could still
recall your suggestion above, I'd like to ask question.
Could you explain your idea a bit further on how would the reordering of WAL writing and
file truncation possibly reduce the number of shared_buffers scan?
Thanks a lot in advance.

Regards,
Kirk Jamison

Re: Speedup of relation deletes during recovery

From
Fujii Masao
Date:
On Tue, Apr 16, 2019 at 10:48 AM Jamison, Kirk <k.jamison@jp.fujitsu.com> wrote:
>
> Hello Fujii-san,
>
> On April 18, 2018, Fujii Masao wrote:
>
> > On Fri, Mar 30, 2018 at 12:18 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote:
> >> Furthermore, TRUNCATE has a similar and worse issue.  While DROP TABLE
> >> scans the shared buffers once for each table, TRUNCATE does that for
> >> each fork, resulting in three scans per table.  How about changing the
> >> following functions
> >>
> >> smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber
> >> nblocks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber forkNum,
> >>                                            BlockNumber firstDelBlock)
> >>
> >> to
> >>
> >> smgrtruncate(SMgrRelation reln, ForkNumber *forknum, BlockNumber
> >> *nblocks, int nforks) DropRelFileNodeBuffers(RelFileNodeBackend rnode, ForkNumber *forkNum,
> >>                                            BlockNumber *firstDelBlock,
> >> int nforks)
> >>
> >> to perform the scan only once per table?  If there doesn't seem to be a problem,
> >> I think I'll submit the patch next month.  I'd welcome if anyone could do that.
> >
> > Yeah, it's worth working on this problem. To decrease the number of scans of
> > shared_buffers, you would need to change the order of truncations of files and WAL
> > logging. In RelationTruncate(), currently WAL is logged after FSM and VM are truncated.
> > IOW, with the patch, FSM and VM would need to be truncated after WAL logging. You would
> > need to check whether this reordering is valid.
>
> I know it's been almost a year since the previous message, but if you could still
> recall your suggestion above, I'd like to ask question.
> Could you explain your idea a bit further on how would the reordering of WAL writing and
> file truncation possibly reduce the number of shared_buffers scan?

Sorry, I forgot the detail of that my comment. But anyway, you want to
modify smgr_redo(info = XLOG_SMGR_TRUNCATE) so that the number of
scans of shared_buffers to be decreased to one. Right?

IMO it's worth thinking to change smgrtruncate(MAIN_FORK),
FreeSpaceMapTruncateRel() and visibilitymap_truncate() so that
they just mark the relation and blocks as to be deleted, and then
so that they scan shared_buffers at once to invalidate the blocks
at the end of smgr_redo().

Regards,

-- 
Fujii Masao