Thread: [HACKERS] Moving relation extension locks out of heavyweight lock manager


From: Masahiko Sawada
Hi all,

Currently, the relation extension lock is implemented using the
heavyweight lock manager, and almost all functions (except for
brin_page_cleanup) that call LockRelationForExtension use it in
ExclusiveLock mode. But it doesn't actually need multiple lock modes,
deadlock detection, or any of the other functionality that the
heavyweight lock manager provides. I think it's enough to use
something like an LWLock. So I'd like to propose changing relation
extension lock management so that it works using LWLocks instead.

The attached draft patch makes relation extension locks use LWLocks
rather than the heavyweight lock manager, by way of a shared hash
table that stores the relation extension lock information. The basic
idea is that we add a hash table in shared memory for relation
extension locks, and each hash entry is an LWLock struct. Whenever a
process wants to acquire a relation extension lock, it looks up the
appropriate LWLock entry in the hash table and acquires it. The
process can remove a hash entry when unlocking it if nobody else is
holding or waiting for it.
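That entry lifecycle (create on first use, reclaim when the last
holder releases with no waiters) can be sketched with a small model.
This is illustrative Python standing in for the proposed C code;
names such as RelExtLockTable are invented here, not taken from the
patch:

```python
import threading

class RelExtLockEntry:
    """One hash entry for a relation's extension lock."""
    def __init__(self):
        self.holding = 0   # processes currently holding the lock
        self.waiting = 0   # processes currently waiting for it

class RelExtLockTable:
    """Model of the shared hash table keyed by relation id."""
    def __init__(self):
        self._table = {}                       # relid -> RelExtLockEntry
        self._lock = threading.Lock()          # models the shared-memory lock
        self._cond = threading.Condition(self._lock)

    def acquire(self, relid):
        with self._cond:
            entry = self._table.setdefault(relid, RelExtLockEntry())
            while entry.holding > 0:           # exclusive mode only: wait
                entry.waiting += 1
                self._cond.wait()
                entry.waiting -= 1
            entry.holding = 1

    def release(self, relid):
        with self._cond:
            entry = self._table[relid]
            entry.holding = 0
            if entry.waiting == 0:
                del self._table[relid]         # nobody holds or waits: reclaim
            else:
                self._cond.notify_all()
```

Several backends (threads in this model) can contend on the same
relation id, and the entry disappears from the table once the last
holder releases it with no waiters pending.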

This work would be helpful not only for existing workloads but also
for future work such as parallel utility commands, which are being
discussed on other threads[1]. At least for parallel vacuum, this
feature helps to solve an issue that the parallel vacuum
implementation has.

I ran pgbench for 10 minutes three times (scale factor 5000); here is
the performance measurement result.

clients   TPS(HEAD)   TPS(Patched)
4           2092.612       2031.277
8           3153.732       3046.789
16         4562.072       4625.419
32         6439.391       6479.526
64         7767.364       7779.636
100       7917.173       7906.567

* 16 core Xeon E5620 2.4GHz
* 32 GB RAM
* ioDrive

With the current implementation, there seems to be no performance degradation so far.
Please give me feedback.

[1]
* Block level parallel vacuum WIP
   <https://www.postgresql.org/message-id/CAD21AoD1xAqp4zK-Vi1cuY3feq2oO8HcpJiz32UDUfe0BE31Xw%40mail.gmail.com>
* CREATE TABLE with parallel workers, 10.0?
  <https://www.postgresql.org/message-id/CAFBoRzeoDdjbPV4riCE%2B2ApV%2BY8nV4HDepYUGftm5SuKWna3rQ%40mail.gmail.com>
* utility commands benefiting from parallel plan
  <https://www.postgresql.org/message-id/CAJrrPGcY3SZa40vU%2BR8d8dunXp9JRcFyjmPn2RF9_4cxjHd7uA%40mail.gmail.com>

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment
On Wed, May 10, 2017 at 8:39 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Currently, the relation extension lock is implemented using
> heavyweight lock manager and almost functions (except for
> brin_page_cleanup) using LockRelationForExntesion use it with
> ExclusiveLock mode. But actually it doesn't need multiple lock modes
> or deadlock detection or any of the other functionality that the
> heavyweight lock manager provides. I think It's enough to use
> something like LWLock. So I'd like to propose to change relation
> extension lock management so that it works using LWLock instead.

That's not a good idea because it'll make the code that executes while
holding that lock noninterruptible.

Possibly something based on condition variables would work better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, May 10, 2017 at 8:39 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> ... I'd like to propose to change relation
>> extension lock management so that it works using LWLock instead.

> That's not a good idea because it'll make the code that executes while
> holding that lock noninterruptible.

Is that really a problem?  We typically only hold it over one kernel call,
which ought to be noninterruptible anyway.  Also, the CheckpointLock is
held for far longer, and we've not heard complaints about that one.

I'm slightly suspicious of the claim that we don't need deadlock
detection.  There are places that e.g. touch FSM while holding this
lock.  It might be all right but it needs close review, not just an
assertion that it's not a problem.
        regards, tom lane



On Thu, May 11, 2017 at 6:09 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> This work would be helpful not only for existing workload but also
> future works like some parallel utility commands, which is discussed
> on other threads[1]. At least for parallel vacuum, this feature helps
> to solve issue that the implementation of parallel vacuum has.
>
> I ran pgbench for 10 min three times(scale factor is 5000), here is a
> performance measurement result.
>
> clients   TPS(HEAD)   TPS(Patched)
> 4           2092.612       2031.277
> 8           3153.732       3046.789
> 16         4562.072       4625.419
> 32         6439.391       6479.526
> 64         7767.364       7779.636
> 100       7917.173       7906.567
>
> * 16 core Xeon E5620 2.4GHz
> * 32 GB RAM
> * ioDrive
>
> In current implementation, it seems there is no performance degradation so far.
>

I think it is good to check pgbench, but we should also do bulk load
tests, as this lock is stressed during such a workload.  Some of the
tests we did when we improved the performance of bulk loading can be
found in an e-mail [1].

[1] -
https://www.postgresql.org/message-id/CAFiTN-tkX6gs-jL8VrPxg6OG9VUAKnObUq7r7pWQqASzdF5OwA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Fri, May 12, 2017 at 9:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, May 10, 2017 at 8:39 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> ... I'd like to propose to change relation
>>> extension lock management so that it works using LWLock instead.
>
>> That's not a good idea because it'll make the code that executes while
>> holding that lock noninterruptible.
>
> Is that really a problem?  We typically only hold it over one kernel call,
> which ought to be noninterruptible anyway.
>

During parallel bulk load operations, I think we hold it over multiple
kernel calls.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From: Masahiko Sawada
On Sat, May 13, 2017 at 8:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, May 11, 2017 at 6:09 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> This work would be helpful not only for existing workload but also
>> future works like some parallel utility commands, which is discussed
>> on other threads[1]. At least for parallel vacuum, this feature helps
>> to solve issue that the implementation of parallel vacuum has.
>>
>> I ran pgbench for 10 min three times(scale factor is 5000), here is a
>> performance measurement result.
>>
>> clients   TPS(HEAD)   TPS(Patched)
>> 4           2092.612       2031.277
>> 8           3153.732       3046.789
>> 16         4562.072       4625.419
>> 32         6439.391       6479.526
>> 64         7767.364       7779.636
>> 100       7917.173       7906.567
>>
>> * 16 core Xeon E5620 2.4GHz
>> * 32 GB RAM
>> * ioDrive
>>
>> In current implementation, it seems there is no performance degradation so far.
>>
>
> I think it is good to check pgbench, but we should do tests of the
> bulk load as this lock is stressed during such a workload.  Some of
> the tests we have done when we have improved the performance of bulk
> load can be found in an e-mail [1].
>

Thank you for sharing.

I've measured performance using the two test scripts attached on that thread. Here are the results.

* Copy test script
Client    HEAD     Patched
4          452.60     455.53
8          561.74     561.09
16        592.50     592.21
32        602.53     599.53
64        605.01     606.42

* Insert test script
Client    HEAD     Patched
4          159.04     158.44
8          169.41     169.69
16        177.11     178.14
32        182.14     181.99
64        182.11     182.73

It seems there is no performance degradation so far.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



On Sat, May 13, 2017 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, May 12, 2017 at 9:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Wed, May 10, 2017 at 8:39 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> ... I'd like to propose to change relation
>>>> extension lock management so that it works using LWLock instead.
>>
>>> That's not a good idea because it'll make the code that executes while
>>> holding that lock noninterruptible.
>>
>> Is that really a problem?  We typically only hold it over one kernel call,
>> which ought to be noninterruptible anyway.
>
> During parallel bulk load operations, I think we hold it over multiple
> kernel calls.

We do.  Also, RelationGetNumberOfBlocks() is not necessarily only one
kernel call, no?  Nor is vm_extend.

Also, it's not just the backend doing the filesystem operation that's
non-interruptible, but also any waiters, right?

Maybe this isn't a big problem, but it does seem to me that it would
be better to avoid it if we can.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From: Masahiko Sawada
On Wed, May 17, 2017 at 1:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, May 13, 2017 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, May 12, 2017 at 9:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Robert Haas <robertmhaas@gmail.com> writes:
>>>> On Wed, May 10, 2017 at 8:39 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>> ... I'd like to propose to change relation
>>>>> extension lock management so that it works using LWLock instead.
>>>
>>>> That's not a good idea because it'll make the code that executes while
>>>> holding that lock noninterruptible.
>>>
>>> Is that really a problem?  We typically only hold it over one kernel call,
>>> which ought to be noninterruptible anyway.
>>
>> During parallel bulk load operations, I think we hold it over multiple
>> kernel calls.
>
> We do.  Also, RelationGetNumberOfBlocks() is not necessarily only one
> kernel call, no?  Nor is vm_extend.

Yeah, these functions could make more than one kernel call while
holding the extension lock.

> Also, it's not just the backend doing the filesystem operation that's
> non-interruptible, but also any waiters, right?
>
> Maybe this isn't a big problem, but it does seem to be that it would
> be better to avoid it if we can.
>

I agree with changing it to be interruptible, for extra safety.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From: Masahiko Sawada
On Fri, May 19, 2017 at 11:12 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, May 17, 2017 at 1:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, May 13, 2017 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Fri, May 12, 2017 at 9:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> Robert Haas <robertmhaas@gmail.com> writes:
>>>>> On Wed, May 10, 2017 at 8:39 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>> ... I'd like to propose to change relation
>>>>>> extension lock management so that it works using LWLock instead.
>>>>
>>>>> That's not a good idea because it'll make the code that executes while
>>>>> holding that lock noninterruptible.
>>>>
>>>> Is that really a problem?  We typically only hold it over one kernel call,
>>>> which ought to be noninterruptible anyway.
>>>
>>> During parallel bulk load operations, I think we hold it over multiple
>>> kernel calls.
>>
>> We do.  Also, RelationGetNumberOfBlocks() is not necessarily only one
>> kernel call, no?  Nor is vm_extend.
>
> Yeah, these functions could call more than one kernel calls while
> holding extension lock.
>
>> Also, it's not just the backend doing the filesystem operation that's
>> non-interruptible, but also any waiters, right?
>>
>> Maybe this isn't a big problem, but it does seem to be that it would
>> be better to avoid it if we can.
>>
>
> I agree to change it to be interruptible for more safety.
>

Attached is an updated version of the patch. To use a lock mechanism
similar to LWLock but interruptible, I introduced a new lock manager
for extension locks. A lot of the code, especially locking and
unlocking, is inspired by LWLock, but it uses condition variables to
wait when acquiring a lock. The other parts are unchanged from the
previous patch. This is still a PoC patch and lacks documentation.
The following are the measurement results with the same test scripts
I used before.

* Copy test script
     HEAD    Patched
4    436.6   436.1
8    561.8   561.8
16   580.7   579.4
32   588.5   597.0
64   596.1   599.0

* Insert test script
     HEAD    Patched
4    156.5   156.0
8    167.0   167.9
16   176.2   175.6
32   181.1   181.0
64   181.5   183.0

Since I replaced a heavyweight lock with a lightweight one, I expected
performance to improve slightly over HEAD, but the results were almost
the same. I'll continue to look into it in more detail.
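The wait behavior of that condition-variable-based lock manager can be
modeled roughly as follows. This is a hedged Python sketch: a
check_interrupts callback stands in for CHECK_FOR_INTERRUPTS(), and
the real patch uses PostgreSQL's ConditionVariable API rather than
Python's threading primitives:

```python
import threading

class InterruptibleExtLock:
    """Exclusive lock whose waiters can service interrupts while sleeping."""
    def __init__(self):
        self._cond = threading.Condition()
        self._held = False

    def acquire(self, check_interrupts=None, poll_interval=0.05):
        with self._cond:
            while self._held:
                # Wake up periodically so the waiter can run its
                # interrupt check, instead of sleeping uninterruptibly
                # the way an LWLock waiter would.
                self._cond.wait(timeout=poll_interval)
                if check_interrupts is not None:
                    check_interrupts()      # may raise to abandon the wait
            self._held = True

    def release(self):
        with self._cond:
            self._held = False
            self._cond.notify_all()
```

Unlike an LWLock waiter, a waiter here regains control periodically
and can abandon the wait, which is the interruptibility being
discussed upthread.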

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From: Masahiko Sawada
On Thu, Jun 22, 2017 at 12:03 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, May 19, 2017 at 11:12 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Wed, May 17, 2017 at 1:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Sat, May 13, 2017 at 7:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> On Fri, May 12, 2017 at 9:14 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>>> Robert Haas <robertmhaas@gmail.com> writes:
>>>>>> On Wed, May 10, 2017 at 8:39 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>>>>> ... I'd like to propose to change relation
>>>>>>> extension lock management so that it works using LWLock instead.
>>>>>
>>>>>> That's not a good idea because it'll make the code that executes while
>>>>>> holding that lock noninterruptible.
>>>>>
>>>>> Is that really a problem?  We typically only hold it over one kernel call,
>>>>> which ought to be noninterruptible anyway.
>>>>
>>>> During parallel bulk load operations, I think we hold it over multiple
>>>> kernel calls.
>>>
>>> We do.  Also, RelationGetNumberOfBlocks() is not necessarily only one
>>> kernel call, no?  Nor is vm_extend.
>>
>> Yeah, these functions could call more than one kernel calls while
>> holding extension lock.
>>
>>> Also, it's not just the backend doing the filesystem operation that's
>>> non-interruptible, but also any waiters, right?
>>>
>>> Maybe this isn't a big problem, but it does seem to be that it would
>>> be better to avoid it if we can.
>>>
>>
>> I agree to change it to be interruptible for more safety.
>>
>
> Attached updated version patch. To use the lock mechanism similar to
> LWLock but interrupt-able, I introduced new lock manager for extension
> lock. A lot of code especially locking and unlocking, is inspired by
> LWLock but it uses the condition variables to wait for acquiring lock.
> Other part is not changed from previous patch. This is still a PoC
> patch, lacks documentation. The following is the measurement result
> with test script same as I used before.
>
> * Copy test script
>      HEAD    Patched
> 4    436.6   436.1
> 8    561.8   561.8
> 16   580.7   579.4
> 32   588.5   597.0
> 64   596.1   599.0
>
> * Insert test script
>      HEAD    Patched
> 4    156.5   156.0
> 8    167.0   167.9
> 16   176.2   175.6
> 32   181.1   181.0
> 64   181.5   183.0
>
> Since I replaced heavyweight lock with lightweight lock I expected the
> performance slightly improves from HEAD but it was almost same result.
> I'll continue to look at more detail.
>

Since the previous patch conflicted with current HEAD, I've rebased it
to current HEAD.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From: Thomas Munro
On Wed, Aug 16, 2017 at 2:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> The previous patch conflicts with current HEAD, I rebased the patch to
> current HEAD.

Hi Masahiko-san,

FYI this doesn't build anymore.  I think it's just because the wait
event enumerators were re-alphabetised in pgstat.h:

../../../../src/include/pgstat.h:820:2: error: redeclaration of
enumerator ‘WAIT_EVENT_LOGICAL_SYNC_DATA’
  WAIT_EVENT_LOGICAL_SYNC_DATA,
  ^
../../../../src/include/pgstat.h:806:2: note: previous definition of
‘WAIT_EVENT_LOGICAL_SYNC_DATA’ was here
  WAIT_EVENT_LOGICAL_SYNC_DATA,
  ^
../../../../src/include/pgstat.h:821:2: error: redeclaration of
enumerator ‘WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE’
  WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
  ^
../../../../src/include/pgstat.h:807:2: note: previous definition of
‘WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE’ was here
  WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
  ^

--
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From: Thomas Munro
On Fri, Sep 8, 2017 at 10:24 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Aug 16, 2017 at 2:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> The previous patch conflicts with current HEAD, I rebased the patch to
>> current HEAD.
>
> Hi Masahiko-san,

Hi Sawada-san,

I have just learned from a colleague who is knowledgeable about
Japanese customs and kind enough to correct me that the appropriate
term of address for our colleagues in Japan on this mailing list is
<lastname>-san.  I was confused about that -- apologies for my
clumsiness.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From: Masahiko Sawada
On Fri, Sep 8, 2017 at 8:25 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Fri, Sep 8, 2017 at 10:24 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Wed, Aug 16, 2017 at 2:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> The previous patch conflicts with current HEAD, I rebased the patch to
>>> current HEAD.
>>
>> Hi Masahiko-san,
>
> Hi Sawada-san,
>
> I have just learned from a colleague who is knowledgeable about
> Japanese customs and kind enough to correct me that the appropriate
> term of address for our colleagues in Japan on this mailing list is
> <lastname>-san.  I was confused about that -- apologies for my
> clumsiness.

Don't worry about it; either is OK. In Japan there is a custom of
writing <lastname>-san, but <firstname>-san is not incorrect either :-)
(also, I think it's hard to tell the last name from the first name in
Japanese names).

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From: Masahiko Sawada
On Fri, Sep 8, 2017 at 7:24 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Aug 16, 2017 at 2:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> The previous patch conflicts with current HEAD, I rebased the patch to
>> current HEAD.
>
> Hi Masahiko-san,
>
> FYI this doesn't build anymore.  I think it's just because the wait
> event enumerators were re-alphabetised in pgstat.h:
>
> ../../../../src/include/pgstat.h:820:2: error: redeclaration of
> enumerator ‘WAIT_EVENT_LOGICAL_SYNC_DATA’
>   WAIT_EVENT_LOGICAL_SYNC_DATA,
>   ^
> ../../../../src/include/pgstat.h:806:2: note: previous definition of
> ‘WAIT_EVENT_LOGICAL_SYNC_DATA’ was here
>   WAIT_EVENT_LOGICAL_SYNC_DATA,
>   ^
> ../../../../src/include/pgstat.h:821:2: error: redeclaration of
> enumerator ‘WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE’
>   WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
>   ^
> ../../../../src/include/pgstat.h:807:2: note: previous definition of
> ‘WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE’ was here
>   WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
>   ^
>

Thank you for the information! Attached is a rebased patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From: Masahiko Sawada
On Fri, Sep 8, 2017 at 4:32 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Sep 8, 2017 at 7:24 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Wed, Aug 16, 2017 at 2:13 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> The previous patch conflicts with current HEAD, I rebased the patch to
>>> current HEAD.
>>
>> Hi Masahiko-san,
>>
>> FYI this doesn't build anymore.  I think it's just because the wait
>> event enumerators were re-alphabetised in pgstat.h:
>>
>> ../../../../src/include/pgstat.h:820:2: error: redeclaration of
>> enumerator ‘WAIT_EVENT_LOGICAL_SYNC_DATA’
>>   WAIT_EVENT_LOGICAL_SYNC_DATA,
>>   ^
>> ../../../../src/include/pgstat.h:806:2: note: previous definition of
>> ‘WAIT_EVENT_LOGICAL_SYNC_DATA’ was here
>>   WAIT_EVENT_LOGICAL_SYNC_DATA,
>>   ^
>> ../../../../src/include/pgstat.h:821:2: error: redeclaration of
>> enumerator ‘WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE’
>>   WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
>>   ^
>> ../../../../src/include/pgstat.h:807:2: note: previous definition of
>> ‘WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE’ was here
>>   WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
>>   ^
>>
>
> Thank you for the information! Attached rebased patch.
>

Since the previous patch conflicts with current HEAD, I've attached an
updated patch for the next CF.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From: Robert Haas
On Thu, Oct 26, 2017 at 12:36 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Since the previous patch conflicts with current HEAD, I attached the
> updated patch for next CF.

I think we should back up here and ask ourselves a couple of questions:

1. What are we trying to accomplish here?

2. Is this the best way to accomplish it?

To the first question, the problem as I understand it is as follows:
Heavyweight locks don't conflict between members of a parallel group.
However, this is wrong for LOCKTAG_RELATION_EXTENSION, LOCKTAG_PAGE,
LOCKTAG_TUPLE, and LOCKTAG_SPECULATIVE_TOKEN.  Currently, those cases
don't arise, because parallel operations are strictly read-only
(except for inserts by the leader into a just-created table, when only
one member of the group can be taking the lock anyway).  However, once
we allow writes, they become possible, so some solution is needed.

To the second question, there are a couple of ways we could fix this.
First, we could continue to allow these locks to be taken in the
heavyweight lock manager, but make them conflict even between members
of the same lock group.  This is, however, complicated.  A significant
problem (or so I think) is that the deadlock detector logic, which is
already quite hard to test, will become even more complicated, since
wait edges between members of a lock group need to exist at some times
and not other times.  Moreover, to the best of my knowledge, the
increased complexity would have no benefit, because it doesn't look to
me like we ever take any other heavyweight lock while holding one of
these four kinds of locks.  Therefore, no deadlock can occur: if we're
waiting for one of these locks, the process that holds it is not
waiting for any other heavyweight lock.  This gives rise to a second
idea: move these locks out of the heavyweight lock manager and handle
them with separate code that does not have deadlock detection and
doesn't need as many lock modes.  I think that idea is basically
sound, although it's possibly not the only sound idea.

However, that makes me wonder whether we shouldn't be a bit more
aggressive with this patch: why JUST relation extension locks?  Why
not all four types of locks listed above?  Actually, tuple locks are a
bit sticky, because they have four lock modes.  The other three kinds
are very similar -- all you can do is "take it" (implicitly, in
exclusive mode), "try to take it" (again, implicitly, in exclusive
mode), or "wait for it to be released" (i.e. share lock and then
release).  Another idea is to try to handle those three types and
leave the tuple locking problem for another day.

I suggest that a good thing to do more or less immediately, regardless
of when this patch ends up being ready, would be to insert an
assertion that LockAcquire() is never called while holding a lock of
one of these types.  If that assertion ever fails, then the whole
theory that these lock types don't need deadlock detection is wrong,
and we'd like to find out about that sooner rather than later.

On the details of the patch, it appears that RelExtLockAcquire()
executes the wait-for-lock code with the partition lock held, and then
continues to hold the partition lock for the entire time that the
relation extension lock is held.  That not only makes all code that
runs while holding the lock non-interruptible but makes a lot of the
rest of this code pointless.  How is any of this atomics code going to
be reached by more than one process at the same time if the entire
bucket is exclusive-locked?  I would guess that the concurrency is not
very good here for the same reason.  Of course, just releasing the
bucket lock wouldn't be right either, because then ext_lock might go
away while we've got a pointer to it, which wouldn't be good.  I think
you could make this work if each lock had both a locker count and a
pin count, and the object can only be removed when the pin_count is 0.
So the lock algorithm would look like this:

- Acquire the partition LWLock.
- Find the item of interest, creating it if necessary.  If out of
memory for more elements, sweep through the table and reclaim
0-pin-count entries, then retry.
- Increment the pin count.
- Attempt to acquire the lock atomically; if we succeed, release the
partition lock and return.
- If this was a conditional-acquire, then decrement the pin count,
release the partition lock and return.
- Release the partition lock.
- Sleep on the condition variable until we manage to atomically
acquire the lock.

The unlock algorithm would just decrement the pin count and, if the
resulting value is non-zero, broadcast on the condition variable.
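Put together, the algorithm above might look like the following. This
is a Python model under the stated rules: the partition LWLock becomes
a mutex, the "atomic" lock word is a flag guarded by it (so the
atomics are only simulated), and all names are invented for
illustration:

```python
import threading

class ExtLockEntry:
    def __init__(self):
        self.locked = False   # models the atomic lock word
        self.pin_count = 0    # entry may be reclaimed only at pin_count == 0

class ExtLockManager:
    def __init__(self):
        self._partition = threading.Lock()   # models the partition LWLock
        self._cv = threading.Condition(self._partition)  # models the CV
        self._table = {}                     # relid -> ExtLockEntry

    def acquire(self, relid, conditional=False):
        with self._cv:
            # Find the item of interest, creating it if necessary.
            entry = self._table.setdefault(relid, ExtLockEntry())
            entry.pin_count += 1             # take a pin
            while entry.locked:
                if conditional:
                    entry.pin_count -= 1     # give up without waiting
                    return False
                self._cv.wait()              # sleep until a release broadcasts
            entry.locked = True              # "atomically" acquire the lock
            return True

    def release(self, relid):
        with self._cv:
            entry = self._table[relid]
            entry.locked = False
            entry.pin_count -= 1
            if entry.pin_count > 0:
                self._cv.notify_all()        # wake waiters still pinning it

    def reclaim(self):
        """Sweep out 0-pin-count entries (the out-of-memory path)."""
        with self._partition:
            for relid in [r for r, e in self._table.items()
                          if e.pin_count == 0]:
                del self._table[relid]
```

An entry stays in the table after release here; the reclaim() sweep
models the path that evicts 0-pin-count entries when the table runs
out of room.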

Although I think this will work, I'm not sure this is actually a great
algorithm.  Every lock acquisition has to take and release the
partition lock, use at least two more atomic ops (to take the pin and
the lock), and search a hash table.  I don't think that's going to be
staggeringly fast.  Maybe it's OK.  It's not that much worse, possibly
not any worse, than what the main lock manager does now.  However,
especially if we implement a solution specific to relation locks, it
seems like it would be better if we could somehow optimize based on
the facts that (1) many relation locks will not conflict and (2) it's
very common for the same backend to take and release the same
extension lock over and over again.  I don't have a specific proposal
right now.

Whatever we end up with, I think we should write some kind of a test
harness to benchmark the number of acquire/release cycles per second
that we can do with the current relation extension lock system vs. the
proposed new system.  Ideally, we'd be faster, since we're proposing a
more specialized mechanism.  But at least we should not be slower.
pgbench isn't a good test because the relation extension lock will
barely be taken let alone contended; we need to check something like
parallel copies into the same table to see any effect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From: Masahiko Sawada
On Fri, Oct 27, 2017 at 12:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Oct 26, 2017 at 12:36 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Since the previous patch conflicts with current HEAD, I attached the
>> updated patch for next CF.
>
> I think we should back up here and ask ourselves a couple of questions:

Thank you for summarizing the purpose and discussion of this patch.

> 1. What are we trying to accomplish here?
>
> 2. Is this the best way to accomplish it?
>
> To the first question, the problem as I understand it as follows:
> Heavyweight locks don't conflict between members of a parallel group.
> However, this is wrong for LOCKTAG_RELATION_EXTENSION, LOCKTAG_PAGE,
> LOCKTAG_TUPLE, and LOCKTAG_SPECULATIVE_TOKEN.  Currently, those cases
> don't arise, because parallel operations are strictly read-only
> (except for inserts by the leader into a just-created table, when only
> one member of the group can be taking the lock anyway).  However, once
> we allow writes, they become possible, so some solution is needed.
>
> To the second question, there are a couple of ways we could fix this.
> First, we could continue to allow these locks to be taken in the
> heavyweight lock manager, but make them conflict even between members
> of the same lock group.  This is, however, complicated.  A significant
> problem (or so I think) is that the deadlock detector logic, which is
> already quite hard to test, will become even more complicated, since
> wait edges between members of a lock group need to exist at some times
> and not other times.  Moreover, to the best of my knowledge, the
> increased complexity would have no benefit, because it doesn't look to
> me like we ever take any other heavyweight lock while holding one of
> these four kinds of locks.  Therefore, no deadlock can occur: if we're
> waiting for one of these locks, the process that holds it is not
> waiting for any other heavyweight lock.  This gives rise to a second
> idea: move these locks out of the heavyweight lock manager and handle
> them with separate code that does not have deadlock detection and
> doesn't need as many lock modes.  I think that idea is basically
> sound, although it's possibly not the only sound idea.

I'm on the same page.

>
> However, that makes me wonder whether we shouldn't be a bit more
> aggressive with this patch: why JUST relation extension locks?  Why
> not all four types of locks listed above?  Actually, tuple locks are a
> bit sticky, because they have four lock modes.  The other three kinds
> are very similar -- all you can do is "take it" (implicitly, in
> exclusive mode), "try to take it" (again, implicitly, in exclusive
> mode), or "wait for it to be released" (i.e. share lock and then
> release).  Another idea is to try to handle those three types and
> leave the tuple locking problem for another day.
>
> I suggest that a good thing to do more or less immediately, regardless
> of when this patch ends up being ready, would be to insert an
> assertion that LockAcquire() is never called while holding a lock of
> one of these types.  If that assertion ever fails, then the whole
> theory that these lock types don't need deadlock detection is wrong,
> and we'd like to find out about that sooner or later.

I understood. I'll check that first. If this direction turns out to be
workable and we change these three lock types to use the new lock
mechanism, we will no longer be able to hold more than one of them at
the same time. Since that imposes a limitation on future development,
we should think about it carefully. We could reimplement deadlock
detection for the new mechanism later, but that would defeat the purpose.

>
> On the details of the patch, it appears that RelExtLockAcquire()
> executes the wait-for-lock code with the partition lock held, and then
> continues to hold the partition lock for the entire time that the
> relation extension lock is held.  That not only makes all code that
> runs while holding the lock non-interruptible but makes a lot of the
> rest of this code pointless.  How is any of this atomics code going to
> be reached by more than one process at the same time if the entire
> bucket is exclusive-locked?  I would guess that the concurrency is not
> very good here for the same reason.  Of course, just releasing the
> bucket lock wouldn't be right either, because then ext_lock might go
> away while we've got a pointer to it, which wouldn't be good.  I think
> you could make this work if each lock had both a locker count and a
> pin count, and the object can only be removed when the pin_count is 0.
> So the lock algorithm would look like this:
>
> - Acquire the partition LWLock.
> - Find the item of interest, creating it if necessary.  If out of
> memory for more elements, sweep through the table and reclaim
> 0-pin-count entries, then retry.
> - Increment the pin count.
> - Attempt to acquire the lock atomically; if we succeed, release the
> partition lock and return.
> - If this was a conditional-acquire, then decrement the pin count,
> release the partition lock and return.
> - Release the partition lock.
> - Sleep on the condition variable until we manage to atomically
> acquire the lock.
>
> The unlock algorithm would just decrement the pin count and, if the
> resulting value is non-zero, broadcast on the condition variable.

Thank you for the suggestion!

> Although I think this will work, I'm not sure this is actually a great
> algorithm.  Every lock acquisition has to take and release the
> partition lock, use at least two more atomic ops (to take the pin and
> the lock), and search a hash table.  I don't think that's going to be
> staggeringly fast.  Maybe it's OK.  It's not that much worse, possibly
> not any worse, than what the main lock manager does now.  However,
> especially if we implement a solution specific to relation locks, it
> seems like it would be better if we could somehow optimize based on
> the facts that (1) many relation locks will not conflict and (2) it's
> very common for the same backend to take and release the same
> extension lock over and over again.  I don't have a specific proposal
> right now.

Yeah, we can optimize based on the purpose of the solution. In either
case I should answer the above question first.

>
> Whatever we end up with, I think we should write some kind of a test
> harness to benchmark the number of acquire/release cycles per second
> that we can do with the current relation extension lock system vs. the
> proposed new system.  Ideally, we'd be faster, since we're proposing a
> more specialized mechanism.  But at least we should not be slower.
> pgbench isn't a good test because the relation extension lock will
> barely be taken let alone contended; we need to check something like
> parallel copies into the same table to see any effect.
>

I did a benchmark using a custom script that always updates the
primary key (disabling HOT updates). But parallel copies into the same
table would also be a good test. Thank you.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Mon, Oct 30, 2017 at 3:17 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Oct 27, 2017 at 12:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Oct 26, 2017 at 12:36 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> Since the previous patch conflicts with current HEAD, I attached the
>>> updated patch for next CF.
>>
>> I think we should back up here and ask ourselves a couple of questions:
>
> Thank you for summarizing of the purpose and discussion of this patch.
>
>> 1. What are we trying to accomplish here?
>>
>> 2. Is this the best way to accomplish it?
>>
>> To the first question, the problem as I understand it is as follows:
>> Heavyweight locks don't conflict between members of a parallel group.
>> However, this is wrong for LOCKTAG_RELATION_EXTENSION, LOCKTAG_PAGE,
>> LOCKTAG_TUPLE, and LOCKTAG_SPECULATIVE_TOKEN.  Currently, those cases
>> don't arise, because parallel operations are strictly read-only
>> (except for inserts by the leader into a just-created table, when only
>> one member of the group can be taking the lock anyway).  However, once
>> we allow writes, they become possible, so some solution is needed.
>>
>> To the second question, there are a couple of ways we could fix this.
>> First, we could continue to allow these locks to be taken in the
>> heavyweight lock manager, but make them conflict even between members
>> of the same lock group.  This is, however, complicated.  A significant
>> problem (or so I think) is that the deadlock detector logic, which is
>> already quite hard to test, will become even more complicated, since
>> wait edges between members of a lock group need to exist at some times
>> and not other times.  Moreover, to the best of my knowledge, the
>> increased complexity would have no benefit, because it doesn't look to
>> me like we ever take any other heavyweight lock while holding one of
>> these four kinds of locks.  Therefore, no deadlock can occur: if we're
>> waiting for one of these locks, the process that holds it is not
>> waiting for any other heavyweight lock.  This gives rise to a second
>> idea: move these locks out of the heavyweight lock manager and handle
>> them with separate code that does not have deadlock detection and
>> doesn't need as many lock modes.  I think that idea is basically
>> sound, although it's possibly not the only sound idea.
>
> I'm on the same page.
>
>>
>> However, that makes me wonder whether we shouldn't be a bit more
>> aggressive with this patch: why JUST relation extension locks?  Why
>> not all four types of locks listed above?  Actually, tuple locks are a
>> bit sticky, because they have four lock modes.  The other three kinds
>> are very similar -- all you can do is "take it" (implicitly, in
>> exclusive mode), "try to take it" (again, implicitly, in exclusive
>> mode), or "wait for it to be released" (i.e. share lock and then
>> release).  Another idea is to try to handle those three types and
>> leave the tuple locking problem for another day.
>>
>> I suggest that a good thing to do more or less immediately, regardless
>> of when this patch ends up being ready, would be to insert an
>> assertion that LockAcquire() is never called while holding a lock of
>> one of these types.  If that assertion ever fails, then the whole
>> theory that these lock types don't need deadlock detection is wrong,
>> and we'd like to find out about that sooner or later.
>
> I understood. I'll check that first.

I've checked whether LockAcquire() is called while holding a lock of
one of the four types: LOCKTAG_RELATION_EXTENSION, LOCKTAG_PAGE,
LOCKTAG_TUPLE, and LOCKTAG_SPECULATIVE_TOKEN. In summary, I think that
we cannot move all four lock types out of the heavyweight lock manager
together, but we can move just the relation extension lock with some
tricks.

Here are the details of the survey.

* LOCKTAG_RELATION_EXTENSION
There is a path where LockRelationForExtension() can be called while
holding another relation extension lock. In brin_getinsertbuffer(), we
acquire a relation extension lock on an index relation and may
initialize a new buffer (brin_initialize_empty_new_buffer()). While
initializing the new buffer, we call RecordPageWithFreeSpace(), which
can eventually call fsm_readbuf(rel, addr, true), where the third
argument means "extend". We can deal with this problem by keeping a
list (or local hash) of acquired locks and skipping the acquisition if
we already hold the lock. For the other call paths that call
LockRelationForExtension(), I don't see any problem.

* LOCKTAG_PAGE, LOCKTAG_TUPLE, LOCKTAG_SPECULATIVE_TOKEN
There are paths where we can acquire a relation extension lock while
holding one of these locks.
For LOCKTAG_PAGE, in ginInsertCleanup() we acquire a page lock on the
meta page and process the pending list, which can acquire a relation
extension lock on an index relation. For LOCKTAG_TUPLE, in
heap_update() we acquire a tuple lock and may call
RelationGetBufferForTuple(). For LOCKTAG_SPECULATIVE_TOKEN, in
ExecInsert() we acquire a speculative insertion lock and call
heap_insert() and ExecInsertIndexTuples(). In each case, the operation
performed while holding the lock can acquire a relation extension lock.

Also, the following is the list of places where we call LockAcquire()
with the four lock types (result of git grep "XXX"). I've done the
check based on this list.

* LockRelationForExtension()
contrib/bloom/blutils.c:
LockRelationForExtension(index, ExclusiveLock);
contrib/pgstattuple/pgstattuple.c:
LockRelationForExtension(rel, ExclusiveLock);
src/backend/access/brin/brin_pageops.c:
LockRelationForExtension(idxrel, ShareLock);
src/backend/access/brin/brin_pageops.c:
LockRelationForExtension(irel, ExclusiveLock);
src/backend/access/brin/brin_revmap.c:
LockRelationForExtension(irel, ExclusiveLock);
src/backend/access/gin/ginutil.c:
LockRelationForExtension(index, ExclusiveLock);
src/backend/access/gin/ginvacuum.c:
LockRelationForExtension(index, ExclusiveLock);
src/backend/access/gin/ginvacuum.c:
LockRelationForExtension(index, ExclusiveLock);
src/backend/access/gist/gistutil.c:
LockRelationForExtension(r, ExclusiveLock);
src/backend/access/gist/gistvacuum.c:
LockRelationForExtension(rel, ExclusiveLock);
src/backend/access/gist/gistvacuum.c:
LockRelationForExtension(rel, ExclusiveLock);
src/backend/access/heap/hio.c:
LockRelationForExtension(relation, ExclusiveLock);
src/backend/access/heap/hio.c:
LockRelationForExtension(relation, ExclusiveLock);
src/backend/access/heap/visibilitymap.c:
LockRelationForExtension(rel, ExclusiveLock);
src/backend/access/nbtree/nbtpage.c:
LockRelationForExtension(rel, ExclusiveLock);
src/backend/access/nbtree/nbtree.c:
LockRelationForExtension(rel, ExclusiveLock);
src/backend/access/spgist/spgutils.c:
LockRelationForExtension(index, ExclusiveLock);
src/backend/access/spgist/spgvacuum.c:
LockRelationForExtension(index, ExclusiveLock);
src/backend/commands/vacuumlazy.c:
LockRelationForExtension(onerel, ExclusiveLock);
src/backend/storage/freespace/freespace.c:
LockRelationForExtension(rel, ExclusiveLock);
src/backend/storage/lmgr/lmgr.c:LockRelationForExtension(Relation relation, LOCKMODE lockmode)

* ConditionalLockRelationForExtension
src/backend/access/heap/hio.c:          else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
src/backend/storage/lmgr/lmgr.c:ConditionalLockRelationForExtension(Relation relation, LOCKMODE lockmode)

* LockPage
src/backend/access/gin/ginfast.c:               LockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock);

* ConditionalLockPage
src/backend/access/gin/ginfast.c:               if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))

* LockTuple
src/backend/access/heap/heapam.c:       LockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)

* ConditionalLockTuple
src/backend/access/heap/heapam.c:       ConditionalLockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)
src/backend/storage/lmgr/lmgr.c:ConditionalLockTuple(Relation relation, ItemPointer tid, LOCKMODE lockmode)

* SpeculativeInsertionLockAcquire
src/backend/executor/nodeModifyTable.c:                 specToken = SpeculativeInsertionLockAcquire(GetCurrentTransactionId());

> If this direction turns out to be workable and we change these three
> lock types to use the new lock mechanism, we will no longer be able
> to hold more than one of them at the same time. Since that imposes a
> limitation on future development, we should think about it carefully.
> We could reimplement deadlock detection for the new mechanism later,
> but that would defeat the purpose.
>
>>
>> On the details of the patch, it appears that RelExtLockAcquire()
>> executes the wait-for-lock code with the partition lock held, and then
>> continues to hold the partition lock for the entire time that the
>> relation extension lock is held.  That not only makes all code that
>> runs while holding the lock non-interruptible but makes a lot of the
>> rest of this code pointless.  How is any of this atomics code going to
>> be reached by more than one process at the same time if the entire
>> bucket is exclusive-locked?  I would guess that the concurrency is not
>> very good here for the same reason.  Of course, just releasing the
>> bucket lock wouldn't be right either, because then ext_lock might go
>> away while we've got a pointer to it, which wouldn't be good.  I think
>> you could make this work if each lock had both a locker count and a
>> pin count, and the object can only be removed when the pin_count is 0.
>> So the lock algorithm would look like this:
>>
>> - Acquire the partition LWLock.
>> - Find the item of interest, creating it if necessary.  If out of
>> memory for more elements, sweep through the table and reclaim
>> 0-pin-count entries, then retry.
>> - Increment the pin count.
>> - Attempt to acquire the lock atomically; if we succeed, release the
>> partition lock and return.
>> - If this was a conditional-acquire, then decrement the pin count,
>> release the partition lock and return.
>> - Release the partition lock.
>> - Sleep on the condition variable until we manage to atomically
>> acquire the lock.
>>
>> The unlock algorithm would just decrement the pin count and, if the
>> resulting value is non-zero, broadcast on the condition variable.
>
> Thank you for the suggestion!
>
>> Although I think this will work, I'm not sure this is actually a great
>> algorithm.  Every lock acquisition has to take and release the
>> partition lock, use at least two more atomic ops (to take the pin and
>> the lock), and search a hash table.  I don't think that's going to be
>> staggeringly fast.  Maybe it's OK.  It's not that much worse, possibly
>> not any worse, than what the main lock manager does now.  However,
>> especially if we implement a solution specific to relation locks, it
>> seems like it would be better if we could somehow optimize based on
>> the facts that (1) many relation locks will not conflict and (2) it's
>> very common for the same backend to take and release the same
>> extension lock over and over again.  I don't have a specific proposal
>> right now.
>
> Yeah, we can optimize based on the purpose of the solution. In either
> case I should answer the above question first.
>
>>
>> Whatever we end up with, I think we should write some kind of a test
>> harness to benchmark the number of acquire/release cycles per second
>> that we can do with the current relation extension lock system vs. the
>> proposed new system.  Ideally, we'd be faster, since we're proposing a
>> more specialized mechanism.  But at least we should not be slower.
>> pgbench isn't a good test because the relation extension lock will
>> barely be taken let alone contended; we need to check something like
>> parallel copies into the same table to see any effect.
>>
>
> I did a benchmark using a custom script that always updates the
> primary key (disabling HOT updates). But parallel copies into the same
> table would also be a good test. Thank you.
>

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Mon, Nov 6, 2017 at 4:42 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> I suggest that a good thing to do more or less immediately, regardless
>>> of when this patch ends up being ready, would be to insert an
>>> assertion that LockAcquire() is never called while holding a lock of
>>> one of these types.  If that assertion ever fails, then the whole
>>> theory that these lock types don't need deadlock detection is wrong,
>>> and we'd like to find out about that sooner or later.
>>
>> I understood. I'll check that first.
>
> I've checked whether LockAcquire() is called while holding a lock of
> one of the four types: LOCKTAG_RELATION_EXTENSION, LOCKTAG_PAGE,
> LOCKTAG_TUPLE, and LOCKTAG_SPECULATIVE_TOKEN. In summary, I think that
> we cannot move all four lock types out of the heavyweight lock manager
> together, but we can move just the relation extension lock with some
> tricks.
>
> Here are the details of the survey.

Thanks for these details, but I'm not sure I fully understand.

> * LOCKTAG_RELATION_EXTENSION
> There is a path where LockRelationForExtension() can be called while
> holding another relation extension lock. In brin_getinsertbuffer(), we
> acquire a relation extension lock on an index relation and may
> initialize a new buffer (brin_initialize_empty_new_buffer()). While
> initializing the new buffer, we call RecordPageWithFreeSpace(), which
> can eventually call fsm_readbuf(rel, addr, true), where the third
> argument means "extend". We can deal with this problem by keeping a
> list (or local hash) of acquired locks and skipping the acquisition if
> we already hold the lock. For the other call paths that call
> LockRelationForExtension(), I don't see any problem.

Does calling fsm_readbuf(rel,addr,true) take some heavyweight lock?

Basically, what matters here in the end is whether we can articulate a
deadlock-proof rule around the order in which these locks are
acquired.  The simplest such rule would be "you can only acquire one
lock of any of these types at a time, and you can't subsequently
acquire a heavyweight lock".  But a more complicated rule would be OK
too, e.g. "you can acquire as many heavyweight locks as you want, and
after that you can optionally acquire one page, tuple, or speculative
token lock, and after that you can acquire a relation extension lock".
The latter rule, although more complex, is still deadlock-proof,
because the heavyweight locks still use the deadlock detector, and the
rest has a consistent order of lock acquisition that precludes one
backend taking A then B while another backend takes B then A.  I'm not
entirely clear whether your survey leads us to a place where we can
articulate such a deadlock-proof rule.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Wed, Nov 8, 2017 at 5:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 6, 2017 at 4:42 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> I suggest that a good thing to do more or less immediately, regardless
>>>> of when this patch ends up being ready, would be to insert an
>>>> assertion that LockAcquire() is never called while holding a lock of
>>>> one of these types.  If that assertion ever fails, then the whole
>>>> theory that these lock types don't need deadlock detection is wrong,
>>>> and we'd like to find out about that sooner or later.
>>>
>>> I understood. I'll check that first.
>>
>> I've checked whether LockAcquire() is called while holding a lock of
>> one of the four types: LOCKTAG_RELATION_EXTENSION, LOCKTAG_PAGE,
>> LOCKTAG_TUPLE, and LOCKTAG_SPECULATIVE_TOKEN. In summary, I think that
>> we cannot move all four lock types out of the heavyweight lock manager
>> together, but we can move just the relation extension lock with some
>> tricks.
>>
>> Here are the details of the survey.
>
> Thanks for these details, but I'm not sure I fully understand.
>
>> * LOCKTAG_RELATION_EXTENSION
>> There is a path where LockRelationForExtension() can be called while
>> holding another relation extension lock. In brin_getinsertbuffer(), we
>> acquire a relation extension lock on an index relation and may
>> initialize a new buffer (brin_initialize_empty_new_buffer()). While
>> initializing the new buffer, we call RecordPageWithFreeSpace(), which
>> can eventually call fsm_readbuf(rel, addr, true), where the third
>> argument means "extend". We can deal with this problem by keeping a
>> list (or local hash) of acquired locks and skipping the acquisition if
>> we already hold the lock. For the other call paths that call
>> LockRelationForExtension(), I don't see any problem.
>
> Does calling fsm_readbuf(rel,addr,true) take some heavyweight lock?

No, I meant that fsm_readbuf(rel, addr, true) can acquire a relation
extension lock. So it's not a problem.

> Basically, what matters here in the end is whether we can articulate a
> deadlock-proof rule around the order in which these locks are
> acquired.

You're right; my survey was not enough to make a decision.

As far as acquiring these four lock types goes, there are two kinds of
call paths that acquire one of them while holding another. The first
is acquiring a relation extension lock and then acquiring a relation
extension lock on the same relation again. As explained before, this
can be resolved by remembering the lock we hold (perhaps remembering
only the last one is enough). The second is acquiring a tuple lock, a
page lock, or a speculative insertion lock and then acquiring a
relation extension lock. In this case we always acquire the two locks
in the same order: one of the three lock types first, then the
extension lock. So it's not a problem if we apply the rule that we
disallow acquiring any of these three lock types while holding a
relation extension lock. Also, as far as I surveyed, there is no path
that acquires a relation lock while holding any of the other three
lock types.

>  The simplest such rule would be "you can only acquire one
> lock of any of these types at a time, and you can't subsequently
> acquire a heavyweight lock".  But a more complicated rule would be OK
> too, e.g. "you can acquire as many heavyweight locks as you want, and
> after that you can optionally acquire one page, tuple, or speculative
> token lock, and after that you can acquire a relation extension lock".
> The latter rule, although more complex, is still deadlock-proof,
> because the heavyweight locks still use the deadlock detector, and the
> rest has a consistent order of lock acquisition that precludes one
> backend taking A then B while another backend takes B then A.  I'm not
> entirely clear whether your survey leads us to a place where we can
> articulate such a deadlock-proof rule.

As for the interaction between these four lock types and heavyweight
locks, there obviously are call paths that acquire one of the four
lock types while holding a heavyweight lock. In reverse, there are
also call paths that acquire a heavyweight lock while holding one of
the four lock types. The path I found is in heap_delete(): we acquire
a tuple lock and call XactLockTableWait() or MultiXactIdWait(), which
can eventually acquire LOCKTAG_TRANSACTION in order to wait for
concurrent transactions to finish. But IIUC, since these functions
take the lock on the concurrent transaction's transaction id,
deadlocks don't happen.
However, there might be other similar call paths that I'm missing. For
example, some operations might acquire heavyweight locks other than
LOCKTAG_TRANSACTION while holding a page lock (in ginInsertCleanup())
or a speculative insertion lock (in nodeModifyTable.c).

In summary, I think we can impose the following rules in order to move
the four lock types out of the heavyweight lock manager:

1. Do not acquire a tuple lock, a page lock, or a speculative
insertion lock while holding an extension lock.
2. Do not acquire any heavyweight lock except for LOCKTAG_TRANSACTION
while holding any of these four lock types.

I'm also concerned that this imposes rules on developers that are
difficult to check statically. We can put several assertions into the
source code, but it's hard to cover all possible paths with regression
tests.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Wed, Nov 8, 2017 at 9:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Speaking of the acquiring these four lock types and heavy weight lock,
> there obviously is a call path to acquire any of four lock types while
> holding a heavy weight lock. In reverse, there also is a call path
> that we acquire a heavy weight lock while holding any of four lock
> types. The call path I found is that in heap_delete we acquire a tuple
> lock and call XactLockTableWait or MultiXactIdWait which eventually
> could acquire LOCKTAG_TRANSACTION in order to wait for the concurrent
> transactions finish. But IIUC since these functions acquire the lock
> for the concurrent transaction's transaction id, deadlocks doesn't
> happen.

No, that's not right.  Now that you mention it, I realize that tuple
locks can definitely cause deadlocks.  Example:

setup:
rhaas=# create table foo (a int, b text);
CREATE TABLE
rhaas=# create table bar (a int, b text);
CREATE TABLE
rhaas=# insert into foo values (1, 'hoge');
INSERT 0 1

session 1:
rhaas=# begin;
BEGIN
rhaas=# update foo set b = 'hogehoge' where a = 1;
UPDATE 1

session 2:
rhaas=# begin;
BEGIN
rhaas=# update foo set b = 'quux' where a = 1;

session 3:
rhaas=# begin;
BEGIN
rhaas=# lock bar;
LOCK TABLE
rhaas=# update foo set b = 'blarfle' where a = 1;

back to session 1:
rhaas=# select * from bar;
ERROR:  deadlock detected
LINE 1: select * from bar;
                      ^
DETAIL:  Process 88868 waits for AccessShareLock on relation 16391 of
database 16384; blocked by process 88845.
Process 88845 waits for ExclusiveLock on tuple (0,1) of relation 16385
of database 16384; blocked by process 88840.
Process 88840 waits for ShareLock on transaction 1193; blocked by process 88868.
HINT:  See server log for query details.

So what I said before was wrong: we definitely cannot exclude tuple
locks from deadlock detection.  However, we might be able to handle
the problem in another way: introduce a separate, parallel-query
specific mechanism to avoid having two participants try to update
and/or delete the same tuple at the same time - e.g. advertise the
BufferTag + offset within the page in DSM, and if somebody else
already has that same combination advertised, wait until they no
longer do.  That shouldn't ever deadlock, because the other worker
shouldn't be able to find itself waiting for us while it's busy
updating a tuple.

After some further study, speculative insertion locks look problematic
too.  I'm worried about the code path ExecInsert() [taking the
speculative insertion lock] -> heap_insert -> heap_prepare_insert ->
toast_insert_or_update -> toast_save_datum ->
heap_open(rel->rd_rel->reltoastrelid, RowExclusiveLock).  That sure
looks like we can end up waiting for a relation lock while holding a
speculative insertion lock, which seems to mean that speculative
insertion locks are subject to at least theoretical deadlock hazards
as well.  Note that even if we were guaranteed to be holding the lock
on the toast relation already at this point, it wouldn't fix the
problem, because we might still have to build or refresh a relcache
entry at this point, which could end up scanning (and thus locking)
system catalogs.  Any syscache lookup can theoretically take a lock,
even though most of the time it doesn't, and thus taking a lock that
has been removed from the deadlock detector (or, say, an lwlock) and
then performing a syscache lookup with it held is not OK.  So I don't
think we can remove speculative insertion locks from the deadlock
detector either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> No, that's not right.  Now that you mention it, I realize that tuple
> locks can definitely cause deadlocks.  Example:

Yeah.  Foreign-key-related tuple locks are another rich source of
examples.

> ... So I don't
> think we can remove speculative insertion locks from the deadlock
> detector either.

That scares me too.  I think that relation extension can safely
be transferred to some lower-level mechanism, because what has to
be done while holding the lock is circumscribed and below the level
of database operations (which might need other locks).  These other
ideas seem a lot riskier.

(But see recent conversation where I discouraged Alvaro from holding
extension locks across BRIN summarization activity.  We'll need to look
and make sure that nobody else has had creative ideas like that.)
        regards, tom lane



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
Thank you for pointing out and comments.

On Fri, Nov 10, 2017 at 12:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> No, that's not right.  Now that you mention it, I realize that tuple
>> locks can definitely cause deadlocks.  Example:
>
> Yeah.  Foreign-key-related tuple locks are another rich source of
> examples.
>
>> ... So I don't
>> think we can remove speculative insertion locks from the deadlock
>> detector either.
>
> That scares me too.  I think that relation extension can safely
> be transferred to some lower-level mechanism, because what has to
> be done while holding the lock is circumscribed and below the level
> of database operations (which might need other locks).  These other
> ideas seem a lot riskier.
>
> (But see recent conversation where I discouraged Alvaro from holding
> extension locks across BRIN summarization activity.  We'll need to look
> and make sure that nobody else has had creative ideas like that.)
>

It seems that we should focus on transferring only relation extension
locks as a first step. Page locks would probably also be safe, but
that might require some fundamental changes related to GIN fast
insertion, which is being discussed on another thread[1]. Also, in
this case I think it's better to focus on relation extension locks so
that we can optimize the lower-level lock mechanism for them.

So I'll update the patch based on the comments I got from Robert earlier.

[1] https://www.postgresql.org/message-id/CAD21AoBLUSyiYKnTYtSAbC%2BF%3DXDjiaBrOUEGK%2BzUXdQ8owfPKw%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Tue, Nov 14, 2017 at 4:36 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for pointing out and comments.
>
> On Fri, Nov 10, 2017 at 12:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> No, that's not right.  Now that you mention it, I realize that tuple
>>> locks can definitely cause deadlocks.  Example:
>>
>> Yeah.  Foreign-key-related tuple locks are another rich source of
>> examples.
>>
>>> ... So I don't
>>> think we can remove speculative insertion locks from the deadlock
>>> detector either.
>>
>> That scares me too.  I think that relation extension can safely
>> be transferred to some lower-level mechanism, because what has to
>> be done while holding the lock is circumscribed and below the level
>> of database operations (which might need other locks).  These other
>> ideas seem a lot riskier.
>>
>> (But see recent conversation where I discouraged Alvaro from holding
>> extension locks across BRIN summarization activity.  We'll need to look
>> and make sure that nobody else has had creative ideas like that.)
>>
>
> It seems that we should focus on transferring only relation extension
> locks as a first step. The page locks would also be safe but it might
> require some fundamental changes related to fast insertion, which is
> discussed on other thread[1]. Also in this case I think it's better to
> focus on relation extension locks so that we can optimize the
> lower-level lock mechanism for it.
>
> So I'll update the patch based on the comment I got from Robert before.
>

Attached is an updated version of the patch. I've moved only relation
extension locks out of the heavyweight lock manager, as per the
discussion so far.

I've done a write-heavy benchmark on my laptop: loading 24kB of data
into one table using COPY with 1 client for 10 seconds. The throughput
of the patched version is 10% better than current HEAD. The results of
5 runs are as follows.

----- PATCHED -----
tps = 178.791515 (excluding connections establishing)
tps = 176.522693 (excluding connections establishing)
tps = 168.705442 (excluding connections establishing)
tps = 158.158009 (excluding connections establishing)
tps = 161.145709 (excluding connections establishing)

----- HEAD -----
tps = 147.079803 (excluding connections establishing)
tps = 149.079540 (excluding connections establishing)
tps = 149.082275 (excluding connections establishing)
tps = 148.255376 (excluding connections establishing)
tps = 145.542552 (excluding connections establishing)

Also I've done a micro-benchmark: calling LockRelationForExtension and
UnlockRelationForExtension in a tight loop in order to measure the
number of lock/unlock cycles per second. The result is:
PATCHED = 3.95892e+06 (cycles/sec)
HEAD = 1.15284e+06 (cycles/sec)
The patched version is about 3 times faster than current HEAD.

Attached are the updated patch and the function I used for the
micro-benchmark. Please review them.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Mon, Nov 20, 2017 at 5:19 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached updated version patch. I've moved only relation extension
> locks out of heavy-weight lock as per discussion so far.
>
> I've done a write-heavy benchmark on my laptop; loading 24kB data to
> one table using COPY by 1 client, for 10 seconds. The through-put of
> patched is 10% better than current HEAD. The result of 5 times is the
> following.
>
> ----- PATCHED -----
> tps = 178.791515 (excluding connections establishing)
> tps = 176.522693 (excluding connections establishing)
> tps = 168.705442 (excluding connections establishing)
> tps = 158.158009 (excluding connections establishing)
> tps = 161.145709 (excluding connections establishing)
>
> ----- HEAD -----
> tps = 147.079803 (excluding connections establishing)
> tps = 149.079540 (excluding connections establishing)
> tps = 149.082275 (excluding connections establishing)
> tps = 148.255376 (excluding connections establishing)
> tps = 145.542552 (excluding connections establishing)
>
> Also I've done a micro-benchmark; calling LockRelationForExtension and
> UnlockRelationForExtension tightly in order to measure the number of
> lock/unlock cycles per second. The result is,
> PATCHED = 3.95892e+06 (cycles/sec)
> HEAD = 1.15284e+06 (cycles/sec)
> The patched is 3 times faster than current HEAD.
>
> Attached updated patch and the function I used for micro-benchmark.
> Please review it.

That's a nice speed-up.

How about a preliminary patch that asserts that we never take another
heavyweight lock while holding a relation extension lock?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Wed, Nov 22, 2017 at 5:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 20, 2017 at 5:19 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Attached updated version patch. I've moved only relation extension
>> locks out of heavy-weight lock as per discussion so far.
>>
>> I've done a write-heavy benchmark on my laptop; loading 24kB data to
>> one table using COPY by 1 client, for 10 seconds. The through-put of
>> patched is 10% better than current HEAD. The result of 5 times is the
>> following.
>>
>> ----- PATCHED -----
>> tps = 178.791515 (excluding connections establishing)
>> tps = 176.522693 (excluding connections establishing)
>> tps = 168.705442 (excluding connections establishing)
>> tps = 158.158009 (excluding connections establishing)
>> tps = 161.145709 (excluding connections establishing)
>>
>> ----- HEAD -----
>> tps = 147.079803 (excluding connections establishing)
>> tps = 149.079540 (excluding connections establishing)
>> tps = 149.082275 (excluding connections establishing)
>> tps = 148.255376 (excluding connections establishing)
>> tps = 145.542552 (excluding connections establishing)
>>
>> Also I've done a micro-benchmark; calling LockRelationForExtension and
>> UnlockRelationForExtension tightly in order to measure the number of
>> lock/unlock cycles per second. The result is,
>> PATCHED = 3.95892e+06 (cycles/sec)
>> HEAD = 1.15284e+06 (cycles/sec)
>> The patched is 3 times faster than current HEAD.
>>
>> Attached updated patch and the function I used for micro-benchmark.
>> Please review it.
>
> That's a nice speed-up.
>
> How about a preliminary patch that asserts that we never take another
> heavyweight lock while holding a relation extension lock?
>

Agreed. Also, since we disallow holding locks on more than one relation
at once, I'll add an assertion for that as well.

I think we no longer need to pass the lock level to
UnlockRelationForExtension(). Now that relation extension locks are
simple, we can release the lock in the same mode in which we acquired
it, as with LWLocks.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Wed, Nov 22, 2017 at 11:32 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Nov 22, 2017 at 5:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Nov 20, 2017 at 5:19 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> Attached updated version patch. I've moved only relation extension
>>> locks out of heavy-weight lock as per discussion so far.
>>>
>>> I've done a write-heavy benchmark on my laptop; loading 24kB data to
>>> one table using COPY by 1 client, for 10 seconds. The through-put of
>>> patched is 10% better than current HEAD. The result of 5 times is the
>>> following.
>>>
>>> ----- PATCHED -----
>>> tps = 178.791515 (excluding connections establishing)
>>> tps = 176.522693 (excluding connections establishing)
>>> tps = 168.705442 (excluding connections establishing)
>>> tps = 158.158009 (excluding connections establishing)
>>> tps = 161.145709 (excluding connections establishing)
>>>
>>> ----- HEAD -----
>>> tps = 147.079803 (excluding connections establishing)
>>> tps = 149.079540 (excluding connections establishing)
>>> tps = 149.082275 (excluding connections establishing)
>>> tps = 148.255376 (excluding connections establishing)
>>> tps = 145.542552 (excluding connections establishing)
>>>
>>> Also I've done a micro-benchmark; calling LockRelationForExtension and
>>> UnlockRelationForExtension tightly in order to measure the number of
>>> lock/unlock cycles per second. The result is,
>>> PATCHED = 3.95892e+06 (cycles/sec)
>>> HEAD = 1.15284e+06 (cycles/sec)
>>> The patched is 3 times faster than current HEAD.
>>>
>>> Attached updated patch and the function I used for micro-benchmark.
>>> Please review it.
>>
>> That's a nice speed-up.
>>
>> How about a preliminary patch that asserts that we never take another
>> heavyweight lock while holding a relation extension lock?
>>
>
> Agreed. Also, since we disallow to holding more than one locks of
> different relations at once I'll add an assertion for it as well.
>
> I think we no longer need to pass the lock level to
> UnloclRelationForExtension(). Now that relation extension lock will be
> simple we can release the lock in the mode that we used to acquire
> like LWLock.
>

Attached latest patch incorporated all comments so far. Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Sun, Nov 26, 2017 at 9:33 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached latest patch incorporated all comments so far. Please review it.

I think you only need RelExtLockReleaseAll() where we currently have
LockReleaseAll(DEFAULT_LOCKMETHOD, ...) not where we have
LockReleaseAll(USER_LOCKMETHOD, ...).  That's because relation
extension locks use the default lock method, not USER_LOCKMETHOD.

You need to update the table of wait events in the documentation.
Please be sure to actually build the documentation afterwards and make
sure it looks OK.  Maybe the wait event name should be
RelationExtensionLock rather than just RelationExtension; we are not
waiting for the extension itself.

You have a typo/thinko in lmgr/README: confliction is not a word.
Maybe you mean "When conflicts occur, lock waits are implemented using
condition variables."

Instead of having shared and exclusive locks, how about just having
exclusive locks and introducing a new primitive operation that waits
for the lock to be free and returns without acquiring it? That is
essentially what brin_pageops.c is doing by taking and releasing the
shared lock, and it's the only caller that takes anything but an
exclusive lock.  This seems like it would permit a considerable
simplification of the locking mechanism, since there would then be
only two possible states: 1 (locked) and 0 (not locked).
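A minimal user-space sketch of this single-bit design, using C11
atomics and a busy-wait in place of the patch's condition variables
(all names here are stand-ins, not the patch's actual code):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RELEXT_LOCK_BIT ((uint32_t) 1 << 25)    /* single exclusive bit */

typedef struct
{
    _Atomic uint32_t state;     /* lock bit; low bits could count waiters */
} RelExtLock;

/* Try once to take the exclusive lock; true on success. */
static bool
rel_ext_try_lock(RelExtLock *lock)
{
    uint32_t oldstate = atomic_fetch_or(&lock->state, RELEXT_LOCK_BIT);

    return (oldstate & RELEXT_LOCK_BIT) == 0;
}

static void
rel_ext_unlock(RelExtLock *lock)
{
    atomic_fetch_and(&lock->state, ~RELEXT_LOCK_BIT);
    /* real code would broadcast on a condition variable here */
}

/* The proposed primitive: wait until the lock is free without taking
 * it.  Real code would sleep on a condition variable, not spin. */
static void
rel_ext_wait_for_free(RelExtLock *lock)
{
    while (atomic_load(&lock->state) & RELEXT_LOCK_BIT)
        ;       /* stand-in for ConditionVariableSleep() */
}
```

With only one lock mode, acquire/release reduce to a single
fetch-or/fetch-and pair, which is what makes the two-state (1 locked, 0
not locked) scheme attractive.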

In RelExtLockAcquire, I can't endorse this sort of coding:

+        if (relid == held_relextlock.lock->relid &&
+            lockmode == held_relextlock.mode)
+        {
+            held_relextlock.nLocks++;
+            return true;
+        }
+        else
+            Assert(false);    /* cannot happen */

Either convert the Assert() to an elog(), or change the if-statement
to an Assert() of the same condition.  I'd probably vote for the first
one.  As it is, if that Assert(false) is ever hit, chaos will (maybe)
ensue.  Let's make sure we nip any such problems in the bud.

"successed" is not a good variable name; that's not an English word.

+        /* Could not got the lock, return iff in conditional locking */
+        if (mustwait && conditional)

Comment contradicts code.  The comment is right; the code need not
test mustwait, as that's already been done.

The way this is hooked into the shared-memory initialization stuff
looks strange in a number of ways:

- Apparently, you're initializing enough space for as many
relation extension locks as the size of the main heavyweight lock
table, but that seems like overkill.  I'm not sure how much space we
actually need for relation extension locks, but I bet it's a lot less
than we need for regular heavyweight locks.
- The error message emitted when you run out of space also claims that
you can fix the issue by raising max_pred_locks_per_transaction, but
that has no effect on the size of the main lock table or this table.
- The changes to LockShmemSize() suppose that the hash table elements
have a size equal to the size of an LWLock, but the actual size is
sizeof(RELEXTLOCK).
- I don't really know why the code for this should be daisy-chained
off of the lock.c code instead of being called from
CreateSharedMemoryAndSemaphores() just like (almost) all of the other
subsystems.

This code ignores the existence of multiple databases; RELEXTLOCK
contains a relid, but no database OID.  That's easy enough to fix, but
it actually causes no problem unless, by bad luck, you have two
relations with the same OID in different databases that are both being
rapidly extended at the same time -- and even then, it's only a
performance problem, not a correctness problem.  In fact, I wonder if
we shouldn't go further: instead of creating these RELEXTLOCK
structures dynamically, let's just have a fixed number of them, say
1024.  When we get a request to take a lock, hash <dboid, reloid> and
take the result modulo 1024; lock the RELEXTLOCK at that offset in the
array.
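The fixed-array mapping described above might look roughly like the
following. The struct and function names are illustrative, and a simple
mixing function stands in for PostgreSQL's tag_hash():

```c
#include <stdint.h>

#define N_RELEXTLOCK_ENTS 1024

typedef struct
{
    uint32_t dbid;      /* database OID */
    uint32_t relid;     /* relation OID */
} RelExtLockTag;

/* Stand-in for tag_hash(); any reasonable mixer works for the sketch. */
static uint32_t
relextlock_hash(const RelExtLockTag *tag)
{
    uint32_t h = tag->dbid * 0x9e3779b9u ^ tag->relid * 0x85ebca6bu;

    h ^= h >> 16;
    return h;
}

/* Map a <dboid, reloid> pair onto one of the fixed array slots.  Two
 * different relations may collide, which causes only false contention,
 * never incorrect behavior, since the lock at that slot is still a
 * valid exclusive lock. */
static uint32_t
relextlock_index(uint32_t dbid, uint32_t relid)
{
    RelExtLockTag tag = { dbid, relid };

    return relextlock_hash(&tag) % N_RELEXTLOCK_ENTS;
}
```

Because the array is fixed-size and every pair always maps to the same
slot, no dynamic allocation, hash table, or pin counting is needed.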

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Michael Paquier
Date:
On Wed, Nov 29, 2017 at 5:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 26, 2017 at 9:33 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Attached latest patch incorporated all comments so far. Please review it.
>
> I think you only need RelExtLockReleaseAllI() where we currently have
> LockReleaseAll(DEFAULT_LOCKMETHOD, ...) not where we have
> LockReleaseAll(USER_LOCKMETHOD, ...).  That's because relation
> extension locks use the default lock method, not USER_LOCKMETHOD.

Latest review is fresh. I am moving this to next CF with "waiting on
author" as status.
-- 
Michael


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Thu, Nov 30, 2017 at 10:52 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Nov 29, 2017 at 5:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Nov 26, 2017 at 9:33 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> Attached latest patch incorporated all comments so far. Please review it.
>>
>> I think you only need RelExtLockReleaseAllI() where we currently have
>> LockReleaseAll(DEFAULT_LOCKMETHOD, ...) not where we have
>> LockReleaseAll(USER_LOCKMETHOD, ...).  That's because relation
>> extension locks use the default lock method, not USER_LOCKMETHOD.
>
> Latest review is fresh. I am moving this to next CF with "waiting on
> author" as status.

Thank you Michael-san, I'll submit an updated patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Wed, Nov 29, 2017 at 5:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 26, 2017 at 9:33 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Attached latest patch incorporated all comments so far. Please review it.
>
> I think you only need RelExtLockReleaseAllI() where we currently have
> LockReleaseAll(DEFAULT_LOCKMETHOD, ...) not where we have
> LockReleaseAll(USER_LOCKMETHOD, ...).  That's because relation
> extension locks use the default lock method, not USER_LOCKMETHOD.

Fixed.

> You need to update the table of wait events in the documentation.
> Please be sure to actually build the documentation afterwards and make
> sure it looks OK.  Maybe the way event name should be
> RelationExtensionLock rather than just RelationExtension; we are not
> waiting for the extension itself.

Fixed. I added both a new wait_event and a new wait_event_type for the
relext lock. I also checked that the documentation builds.

>
> You have a typo/thinko in lmgr/README: confliction is not a word.
> Maybe you mean "When conflicts occur, lock waits are implemented using
> condition variables."

Fixed.

>
> Instead of having shared and exclusive locks, how about just having
> exclusive locks and introducing a new primitive operation that waits
> for the lock to be free and returns without acquiring it? That is
> essentially what brin_pageops.c is doing by taking and releasing the
> shared lock, and it's the only caller that takes anything but an
> exclusive lock.  This seems like it would permit a considerable
> simplification of the locking mechanism, since there would then be
> only two possible states: 1 (locked) and 0 (not locked).

I think it's a good idea. With this change, the concurrency of
executing brin_page_cleanup() would be reduced, but since
brin_page_cleanup() is so far called only during vacuum, that's not a
problem. I think we can handle the code in vacuumlazy.c in the same
manner as well. I've changed the patch so that it has only exclusive
locks and introduces a WaitForRelationExtensionLockToBeFree() function
to wait for the lock to be free.

Also, now that we got rid of shared locks, I gathered the lock state
and pin count into an atomic uint32.

> In RelExtLockAcquire, I can't endorse this sort of coding:
>
> +        if (relid == held_relextlock.lock->relid &&
> +            lockmode == held_relextlock.mode)
> +        {
> +            held_relextlock.nLocks++;
> +            return true;
> +        }
> +        else
> +            Assert(false);    /* cannot happen */
>
> Either convert the Assert() to an elog(), or change the if-statement
> to an Assert() of the same condition.  I'd probably vote for the first
> one.  As it is, if that Assert(false) is ever hit, chaos will (maybe)
> ensue.  Let's make sure we nip any such problems in the bud.

Agreed, fixed.

>
> "successed" is not a good variable name; that's not an English word.

Fixed.

> +        /* Could not got the lock, return iff in conditional locking */
> +        if (mustwait && conditional)
>
> Comment contradicts code.  The comment is right; the code need not
> test mustwait, as that's already been done.

Fixed.

> The way this is hooked into the shared-memory initialization stuff
> looks strange in a number of ways:
>
> - Apparently, you're making initialize enough space for as many
> relation extension locks as the save of the main heavyweight lock
> table, but that seems like overkill.  I'm not sure how much space we
> actually need for relation extension locks, but I bet it's a lot less
> than we need for regular heavyweight locks.

Agreed. The maximum number of relext locks is the number of relations
in a database cluster; it's not related to the number of clients.
Currently NLOCKENTS() counts the number of locks including relation
extension locks. One idea is to introduce a new GUC to control the
memory size, although the total memory size for locks would increase.
Probably we can make it behave similarly to
max_pred_locks_per_relation. Or, in order not to change the total
memory size for locks even after moving them out of the heavyweight
lock manager, we could divide NLOCKENTS() between heavyweight locks and
relation extension locks (for example, 80% for heavyweight locks and
20% for relation extension locks). But the latter would make parameter
tuning hard. I'd vote for the first option to keep it simple. Any
ideas? This part is not fixed in the patch yet.

> - The error message emitted when you run out of space also claims that
> you can fix the issue by raising max_pred_locks_per_transaction, but
> that has no effect on the size of the main lock table or this table.

Fixed.

> - The changes to LockShmemSize() suppose that the hash table elements
> have a size equal to the size of an LWLock, but the actual size is
> sizeof(RELEXTLOCK).

Fixed.

> - I don't really know why the code for this should be daisy-chained
> off of the lock.c code inside of being called from
> CreateSharedMemoryAndSemaphores() just like (almost) all of the other
> subsystems.

Fixed.

>
> This code ignores the existence of multiple databases; RELEXTLOCK
> contains a relid, but no database OID.  That's easy enough to fix, but
> it actually causes no problem unless, by bad luck, you have two
> relations with the same OID in different databases that are both being
> rapidly extended at the same time -- and even then, it's only a
> performance problem, not a correctness problem.  In fact, I wonder if
> we shouldn't go further: instead of creating these RELEXTLOCK
> structures dynamically, let's just have a fixed number of them, say
> 1024.  When we get a request to take a lock, hash <dboid, reloid> and
> take the result modulo 1024; lock the RELEXTLOCK at that offset in the
> array.
>

Attached the latest patch incorporated comments except for the fix of
the memory size for relext lock.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Thu, Nov 30, 2017 at 6:20 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> This code ignores the existence of multiple databases; RELEXTLOCK
>> contains a relid, but no database OID.  That's easy enough to fix, but
>> it actually causes no problem unless, by bad luck, you have two
>> relations with the same OID in different databases that are both being
>> rapidly extended at the same time -- and even then, it's only a
>> performance problem, not a correctness problem.  In fact, I wonder if
>> we shouldn't go further: instead of creating these RELEXTLOCK
>> structures dynamically, let's just have a fixed number of them, say
>> 1024.  When we get a request to take a lock, hash <dboid, reloid> and
>> take the result modulo 1024; lock the RELEXTLOCK at that offset in the
>> array.
>
> Attached the latest patch incorporated comments except for the fix of
> the memory size for relext lock.

It doesn't do anything about the comment of mine quoted above.  Since
it's only possible to hold one relation extension lock at a time, we
don't really need the hash table here at all. We can just have an
array of 1024 or so locks and map every <db,relid> pair onto one of
them by hashing.  The worst thing we'll get is some false contention,
but that doesn't seem awful, and it would permit considerable further
simplification of this code -- and maybe make it faster in the
process, because we'd no longer need the hash table, or the pin count,
or the extra LWLocks that protect the hash table.  All we would have
is atomic operations manipulating the lock state, which seems like it
would be quite a lot faster and simpler.

BTW, I think RelExtLockReleaseAll is broken because it shouldn't
HOLD_INTERRUPTS(); I also think it's kind of silly to loop here when
we know we can only hold one lock.  Maybe RelExtLockRelease can take
bool force and do if (force) held_relextlock.nLocks = 0; else
held_relextlock.nLocks--.  Or, better yet, have the caller adjust that
value and then only call RelExtLockRelease() if we needed to release
the lock in shared memory.  That avoids needless branching.  On a
related note, is there any point in having both held_relextlock.nLocks
and num_held_relextlocks?
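The caller-adjusts-the-count convention suggested above can be sketched
as follows. The bodies are toy stand-ins (the real RelExtLockRelease
manipulates shared memory and broadcasts on a condition variable); only
the control flow is the point:

```c
#include <stdint.h>

typedef struct
{
    uint32_t relid;
    int      nLocks;    /* local reference count for the one held lock */
} HeldRelExtLock;

static HeldRelExtLock held_relextlock;
static int shared_release_calls;    /* stands in for shared-memory work */

static void
RelExtLockRelease(uint32_t relid)
{
    /* real code clears the lock bit and wakes any waiters */
    (void) relid;
    shared_release_calls++;
}

static void
UnlockRelationForExtension(uint32_t relid)
{
    /*
     * Caller-side decrement: only touch shared memory when the last
     * local reference goes away, avoiding a branch inside the release
     * routine itself.
     */
    if (--held_relextlock.nLocks == 0)
        RelExtLockRelease(relid);
}
```

Since at most one relation extension lock can be held at a time, there
is no loop to write at all; release-all reduces to the same check.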

I think RelationExtensionLock should be a new type of IPC wait event,
rather than a whole new category.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Fri, Dec 1, 2017 at 10:26 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Dec 1, 2017 at 3:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Nov 30, 2017 at 6:20 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>>> This code ignores the existence of multiple databases; RELEXTLOCK
>>>> contains a relid, but no database OID.  That's easy enough to fix, but
>>>> it actually causes no problem unless, by bad luck, you have two
>>>> relations with the same OID in different databases that are both being
>>>> rapidly extended at the same time -- and even then, it's only a
>>>> performance problem, not a correctness problem.  In fact, I wonder if
>>>> we shouldn't go further: instead of creating these RELEXTLOCK
>>>> structures dynamically, let's just have a fixed number of them, say
>>>> 1024.  When we get a request to take a lock, hash <dboid, reloid> and
>>>> take the result modulo 1024; lock the RELEXTLOCK at that offset in the
>>>> array.
>>>
>>> Attached the latest patch incorporated comments except for the fix of
>>> the memory size for relext lock.
>>
>> It doesn't do anything about the comment of mine quoted above.
>
> Sorry I'd missed the comment.
>
>>   Since it's only possible to hold one relation extension lock at a time, we
>> don't really need the hash table here at all. We can just have an
>> array of 1024 or so locks and map every <db,relid> pair on to one of
>> them by hashing.  The worst thing we'll get it some false contention,
>> but that doesn't seem awful, and it would permit considerable further
>> simplification of this code -- and maybe make it faster in the
>> process, because we'd no longer need the hash table, or the pin count,
>> or the extra LWLocks that protect the hash table.  All we would have
>> is atomic operations manipulating the lock state, which seems like it
>> would be quite a lot faster and simpler.
>
> Agreed. With this change, we will have an array of the struct that has
> lock state and cv. The lock state has the wait count as well as the
> status of lock.
>
>> BTW, I think RelExtLockReleaseAll is broken because it shouldn't
>> HOLD_INTERRUPTS(); I also think it's kind of silly to loop here when
>> we know we can only hold one lock.  Maybe RelExtLockRelease can take
>> bool force and do if (force) held_relextlock.nLocks = 0; else
>> held_relextlock.nLocks--.  Or, better yet, have the caller adjust that
>> value and then only call RelExtLockRelease() if we needed to release
>> the lock in shared memory.  That avoids needless branching.
>
> Agreed. I'd vote for the latter.
>
>>  On a
>> related note, is there any point in having both held_relextlock.nLocks
>> and num_held_relextlocks?
>
> num_held_relextlocks is actually unnecessary, will be removed.
>
>> I think RelationExtensionLock should be a new type of IPC wait event,
>> rather than a whole new category.
>
> Hmm, I thought the wait event types of IPC seemed related to events
> that communicate with other processes for the same purpose, for example
> parallel query, sync replication, etc. On the other hand, the relation
> extension locks are one kind of lock mechanism. That's why I added
> a new category. But maybe it can fit into the IPC wait events.
>

Attached is an updated patch. I've done the performance measurement
again on the same configuration as before, since the acquiring/releasing
procedures have changed.

----- PATCHED -----
tps = 162.579320 (excluding connections establishing)
tps = 162.144352 (excluding connections establishing)
tps = 160.659403 (excluding connections establishing)
tps = 161.213995 (excluding connections establishing)
tps = 164.560460 (excluding connections establishing)
----- HEAD -----
tps = 157.738645 (excluding connections establishing)
tps = 146.178575 (excluding connections establishing)
tps = 143.788961 (excluding connections establishing)
tps = 144.886594 (excluding connections establishing)
tps = 145.496337 (excluding connections establishing)

* micro-benchmark
PATCHED = 1.61757e+07 (cycles/sec)
HEAD = 1.48685e+06 (cycles/sec)
The patched version is about 10 times faster than current HEAD.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Fri, Dec 1, 2017 at 10:14 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> The patched is 10 times faster than current HEAD.

Nifty.

The first hunk in monitoring.sgml looks unnecessary.

The second hunk breaks the formatting of the documentation; you need
to adjust the "morerows" value from 9 to 8 here:

         <entry morerows="9"><literal>Lock</literal></entry>

And similarly make this one 18:

         <entry morerows="17"><literal>IPC</literal></entry>

+* Relation extension locks.  The relation extension lock manager is
+specialized in relation extensions. In PostgreSQL 11 relation extension
+lock has been moved out of regular lock. It's similar to regular locks
+but doesn't have full dead lock detection, group locking and multiple
+lock modes. When conflicts occur, lock waits are implemented using
+condition variables.

Higher up, it says that "Postgres uses four types of interprocess
locks", but because you added this, it's now a list of five items.

I suggest moving the section on relation locks to the end and
rewriting the text here as follows: Only one process can extend a
relation at a time; we use a specialized lock manager for this
purpose, which is much simpler than the regular lock manager.  It is
similar to the lightweight lock mechanism, but is even simpler because
there is only one lock mode and only one lock can be taken at a time.
A process holding a relation extension lock is interruptible, unlike a
process holding an LWLock.

+#define RelExtLockTargetTagToIndex(relextlock_tag) \
+    (tag_hash((const void *) relextlock_tag, sizeof(RelExtLockTag)) \
+        % N_RELEXTLOCK_ENTS)

How about using a static inline function for this?

+#define SET_RELEXTLOCK_TAG(locktag, d, r) \
+    ((locktag).dbid = (d), \
+     (locktag).relid = (r))

How about getting rid of this and just doing the assignments instead?

+#define RELEXT_VAL_LOCK     ((uint32) ((1 << 25)))
+#define RELEXT_LOCK_MASK    ((uint32) ((1 << 25)))

It seems confusing to have two macros for the same value and an
almost-interchangeable purpose.  Maybe just call it RELEXT_LOCK_BIT?

+RelationExtensionLockWaiterCount(Relation relation)

Hmm.  This is sort of problematic, because with the new design we
have no guarantee that the return value is actually accurate.  I don't
think that's a functional problem, but the optics aren't great.

+    if (held_relextlock.nLocks > 0)
+    {
+        RelExtLockRelease(held_relextlock.relid, true);
+    }

Excess braces.

+int
+RelExtLockHoldingLockCount(void)
+{
+    return held_relextlock.nLocks;
+}

Maybe IsAnyRelationExtensionLockHeld(), returning bool?

+    /* If the lock is held by me, no need to wait */

If we already hold the lock, no need to wait.

+     * Luckily if we're trying to acquire the same lock as what we
+     * had held just before, we don't need to get the entry from the
+     * array by hashing.

We're not trying to acquire a lock here.  "If the last relation
extension lock we touched is the same one for which we now need to
wait, we can use our cached pointer to the lock instead of recomputing
it."

+            registered_wait_list = true;

Isn't it really registered_wait_count?  The only list here is
encapsulated in the CV.

+    /* Before retuning, decrement the wait count if we had been waiting */

returning -> returning, but I'd rewrite this as "Release any wait
count we hold."

+ * Acquire the relation extension lock. If we're trying to acquire the same
+ * lock as what already held, we just increment nLock locally and return
+ * without touching the RelExtLock array.

"Acquire a relation extension lock."  I think you can forget the rest
of this; it duplicates comments in the function body.

+     * Since we don't support dead lock detection for relation extension
+     * lock and don't control the order of lock acquisition, it cannot not
+     * happen that trying to take a new lock while holding an another lock.

Since we don't do deadlock detection, caller must not try to take a
new relation extension lock while already holding them.

+        if (relid == held_relextlock.relid)
+        {
+            held_relextlock.nLocks++;
+            return true;
+        }
+        else
+            elog(ERROR,
+                 "cannot acquire relation extension locks for
multiple relations at the same");

I'd prefer if (relid != held_relextlock.relid) elog(ERROR, ...) to
save a level of indentation for the rest.

+     * If we're trying to acquire the same lock as what we just released
+     * we don't need to get the entry from the array by hashing. we expect
+     * to happen this case because it's a common case in acquisition of
+     * relation extension locks.

"If the last relation extension lock we touched is the same one we now
need to acquire, we can use our cached pointer to the lock instead
of recomputing it.  This is likely to be a common case in practice."

+        /* Could not got the lock, return iff in conditional locking */

"locking conditionally"

+        ConditionVariableSleep(&(relextlock->cv),
WAIT_EVENT_RELATION_EXTENSION_LOCK);

Break line at comma

+    /* Decrement wait count if we had been waiting */

"Release any wait count we hold."

+    /* Always return true if not conditional lock */

"We got the lock!"

+    /* If force releasing, release all locks we're holding */
+    if (force)
+        held_relextlock.nLocks = 0;
+    else
+        held_relextlock.nLocks--;
+
+    Assert(held_relextlock.nLocks >= 0);
+
+    /* Return if we're still holding the lock even after computation */
+    if (held_relextlock.nLocks > 0)
+        return;

I thought you were going to have the caller adjust nLocks?

+    /* Get RelExtLock entry from the array */
+    SET_RELEXTLOCK_TAG(tag, MyDatabaseId, relid);
+    relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];

This seems to make no sense in RelExtLockRelease -- isn't the cache
guaranteed valid?

+    /* Wake up waiters if there is someone looking at this lock */

"If there may be waiters, wake them up."

+     * We allow to take a relation extension lock after took a
+     * heavy-weight lock. However, since we don't have dead lock
+     * detection mechanism between heavy-weight lock and relation
+     * extension lock it's not allowed taking an another heavy-weight
+     * lock while holding a relation extension lock.

"Relation extension locks don't participate in deadlock detection, so
make sure we don't try to acquire a heavyweight lock while holding
one."

+    /* Release relation extension locks */

"If we hold a relation extension lock, release it."

+/* Number of partitions the shared relation extension lock tables are
divided into */
+#define LOG2_NUM_RELEXTLOCK_PARTITIONS 4
+#define NUM_RELEXTLOCK_PARTITIONS      (1 << LOG2_NUM_RELEXTLOCK_PARTITIONS)

Dead code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Fri, Dec 1, 2017 at 1:28 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> [ lots of minor comments ]

When I took a break from sitting at the computer, I realized that I
think this has a more serious problem: won't it permanently leak
reference counts if someone hits ^C or an error occurs while the lock
is held?  I think it will -- it probably needs to do cleanup at the
places where we do LWLockReleaseAll() that includes decrementing the
shared refcount if necessary, rather than doing cleanup at the places
we release heavyweight locks.

I might be wrong about the details here -- this is off the top of my head.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Sat, Dec 2, 2017 at 3:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 1, 2017 at 10:14 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> The patched is 10 times faster than current HEAD.
>
> Nifty.

Thank you for your dedicated review of the patch.

> The first hunk in monitoring.sgml looks unnecessary.

You meant the following hunk?

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8d461c8..7aa7981 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -669,8 +669,8 @@ postgres   27093  0.0  0.0  30096  2752 ?
Ss   11:34   0:00 postgres: ser
           Heavyweight locks, also known as lock manager locks or simply locks,
           primarily protect SQL-visible objects such as tables.  However,
           they are also used to ensure mutual exclusion for certain internal
-          operations such as relation extension.
<literal>wait_event</literal> will
-          identify the type of lock awaited.
+          operations such as waiting for a transaction to finish.
+          <literal>wait_event</literal> will identify the type of lock awaited.
          </para>
         </listitem>
         <listitem>

I think that since the extension locks are no longer a part of
heavyweight locks we should change the explanation.

> The second hunk breaks the formatting of the documentation; you need
> to adjust the "morerows" value from 9 to 8 here:
>
>          <entry morerows="9"><literal>Lock</literal></entry>
>
> And similarly make this one 18:
>
>          <entry morerows="17"><literal>IPC</literal></entry>

Fixed.

> +* Relation extension locks.  The relation extension lock manager is
> +specialized in relation extensions. In PostgreSQL 11 relation extension
> +lock has been moved out of regular lock. It's similar to regular locks
> +but doesn't have full dead lock detection, group locking and multiple
> +lock modes. When conflicts occur, lock waits are implemented using
> +condition variables.
>
> Higher up, it says that "Postgres uses four types of interprocess
> locks", but because you added this, it's now a list of five items.

Fixed.

> I suggest moving the section on relation locks to the end and
> rewriting the text here as follows: Only one process can extend a
> relation at a time; we use a specialized lock manager for this
> purpose, which is much simpler than the regular lock manager.  It is
> similar to the lightweight lock mechanism, but is even simpler because
> there is only one lock mode and only one lock can be taken at a time.
> A process holding a relation extension lock is interruptible, unlike a
> process holding an LWLock.

Agreed and fixed.

> +#define RelExtLockTargetTagToIndex(relextlock_tag) \
> +    (tag_hash((const void *) relextlock_tag, sizeof(RelExtLockTag)) \
> +        % N_RELEXTLOCK_ENTS)
>
> How about using a static inline function for this?

Fixed.

> +#define SET_RELEXTLOCK_TAG(locktag, d, r) \
> +    ((locktag).dbid = (d), \
> +     (locktag).relid = (r))
>
> How about getting rid of this and just doing the assignments instead?

Fixed.

> +#define RELEXT_VAL_LOCK     ((uint32) ((1 << 25)))
> +#define RELEXT_LOCK_MASK    ((uint32) ((1 << 25)))
>
> It seems confusing to have two macros for the same value and an
> almost-interchangeable purpose.  Maybe just call it RELEXT_LOCK_BIT?

Fixed.

>
> +RelationExtensionLockWaiterCount(Relation relation)
>
> Hmm.  This is sort of problematic, because with the new design we
> have no guarantee that the return value is actually accurate.  I don't
> think that's a functional problem, but the optics aren't great.

Yeah, with this patch we could overestimate it and then add extra
blocks to the relation. Since the number of extra blocks is capped at
512, I think it would not become a serious problem.
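For context, that cap comes from the extension heuristic in hio.c's RelationAddExtraBlocks(); the multiplier and cap below match what that function does, though the function name here is just illustrative:

```c
/*
 * Sketch of the relation-extension heuristic: extend by more blocks when
 * more backends are waiting, capped at 512, so an overestimated waiter
 * count only wastes a bounded amount of extra space.
 */
static inline int
extra_blocks_for_waiters(int lockWaiters)
{
	int			extraBlocks = lockWaiters * 20;

	return extraBlocks > 512 ? 512 : extraBlocks;
}
```

Because the result is clamped, an inflated waiter count from the new lock design costs at most 512 surplus blocks per extension.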

> +    if (held_relextlock.nLocks > 0)
> +    {
> +        RelExtLockRelease(held_relextlock.relid, true);
> +    }
>
> Excess braces.

Fixed.

>
> +int
> +RelExtLockHoldingLockCount(void)
> +{
> +    return held_relextlock.nLocks;
> +}
>
> Maybe IsAnyRelationExtensionLockHeld(), returning bool?

Fixed.

> +    /* If the lock is held by me, no need to wait */
>
> If we already hold the lock, no need to wait.

Fixed.

> +     * Luckily if we're trying to acquire the same lock as what we
> +     * had held just before, we don't need to get the entry from the
> +     * array by hashing.
>
> We're not trying to acquire a lock here.  "If the last relation
> extension lock we touched is the same one for which we now need to
> wait, we can use our cached pointer to the lock instead of recomputing
> it."

Fixed.

> +            registered_wait_list = true;
>
> Isn't it really registered_wait_count?  The only list here is
> encapsulated in the CV.

Changed to "waiting".

>
> +    /* Before retuning, decrement the wait count if we had been waiting */
>
> returning -> returning, but I'd rewrite this as "Release any wait
> count we hold."

Fixed.

> + * Acquire the relation extension lock. If we're trying to acquire the same
> + * lock as what already held, we just increment nLock locally and return
> + * without touching the RelExtLock array.
>
> "Acquire a relation extension lock."  I think you can forget the rest
> of this; it duplicates comments in the function body.

Fixed.

> +     * Since we don't support dead lock detection for relation extension
> +     * lock and don't control the order of lock acquisition, it cannot not
> +     * happen that trying to take a new lock while holding an another lock.
>
> Since we don't do deadlock detection, caller must not try to take a
> new relation extension lock while already holding them.

Fixed.

>
> +        if (relid == held_relextlock.relid)
> +        {
> +            held_relextlock.nLocks++;
> +            return true;
> +        }
> +        else
> +            elog(ERROR,
> +                 "cannot acquire relation extension locks for
> multiple relations at the same");
>
> I'd prefer if (relid != held_relextlock.relid) elog(ERROR, ...) to
> save a level of indentation for the rest.

Fixed.

>
> +     * If we're trying to acquire the same lock as what we just released
> +     * we don't need to get the entry from the array by hashing. we expect
> +     * to happen this case because it's a common case in acquisition of
> +     * relation extension locks.
>
> "If the last relation extension lock we touched is the same one we now
> need to acquire, we can use our cached pointer to the lock instead
> of recomputing it.  This is likely to be a common case in practice."

Fixed.

>
> +        /* Could not got the lock, return iff in conditional locking */
>
> "locking conditionally"

Fixed.

> +        ConditionVariableSleep(&(relextlock->cv),
> WAIT_EVENT_RELATION_EXTENSION_LOCK);
> Break line at comma
>

Fixed.

> +    /* Decrement wait count if we had been waiting */
>
> "Release any wait count we hold."

Fixed.

> +    /* Always return true if not conditional lock */
>
> "We got the lock!"

Fixed.

> +    /* If force releasing, release all locks we're holding */
> +    if (force)
> +        held_relextlock.nLocks = 0;
> +    else
> +        held_relextlock.nLocks--;
> +
> +    Assert(held_relextlock.nLocks >= 0);
> +
> +    /* Return if we're still holding the lock even after computation */
> +    if (held_relextlock.nLocks > 0)
> +        return;
>
> I thought you were going to have the caller adjust nLocks?

Yeah, I meant to change it that way, but since we always release either
one lock or all relext locks, I thought it'd be better to pass a bool
rather than an int.

> +    /* Get RelExtLock entry from the array */
> +    SET_RELEXTLOCK_TAG(tag, MyDatabaseId, relid);
> +    relextlock = &RelExtLockArray[RelExtLockTargetTagToIndex(&tag)];
>
> This seems to make no sense in RelExtLockRelease -- isn't the cache
> guaranteed valid?

Right, fixed.

>
> +    /* Wake up waiters if there is someone looking at this lock */
>
> "If there may be waiters, wake them up."

Fixed.

> +     * We allow to take a relation extension lock after took a
> +     * heavy-weight lock. However, since we don't have dead lock
> +     * detection mechanism between heavy-weight lock and relation
> +     * extension lock it's not allowed taking an another heavy-weight
> +     * lock while holding a relation extension lock.
>
> "Relation extension locks don't participate in deadlock detection, so
> make sure we don't try to acquire a heavyweight lock while holding
> one."

Fixed.

> +    /* Release relation extension locks */
>
> "If we hold a relation extension lock, release it."

Fixed.

> +/* Number of partitions the shared relation extension lock tables are
> divided into */
> +#define LOG2_NUM_RELEXTLOCK_PARTITIONS 4
> +#define NUM_RELEXTLOCK_PARTITIONS      (1 << LOG2_NUM_RELEXTLOCK_PARTITIONS)
>
> Dead code.

Fixed.

> When I took a break from sitting at the computer, I realized that I
> think this has a more serious problem: won't it permanently leak
> reference counts if someone hits ^C or an error occurs while the lock
> is held?  I think it will -- it probably needs to do cleanup at the
> places where we do LWLockReleaseAll() that includes decrementing the
> shared refcount if necessary, rather than doing cleanup at the places
> we release heavyweight locks.
> I might be wrong about the details here -- this is off the top of my head.

Good catch. It can leak reference counts if someone hits ^C or an
error occurs while waiting. Fixed in the latest patch. But since
RelExtLockReleaseAll() is called even in such situations, I think we
don't need to change the place where we release all the relext locks.
We just moved it from heavyweight locks. Am I missing something?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Fri, Dec 8, 2017 at 3:20 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> The first hunk in monitoring.sgml looks unnecessary.
>
> You meant the following hunk?
>
> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index 8d461c8..7aa7981 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -669,8 +669,8 @@ postgres   27093  0.0  0.0  30096  2752 ?
> Ss   11:34   0:00 postgres: ser
>            Heavyweight locks, also known as lock manager locks or simply locks,
>            primarily protect SQL-visible objects such as tables.  However,
>            they are also used to ensure mutual exclusion for certain internal
> -          operations such as relation extension.
> <literal>wait_event</literal> will
> -          identify the type of lock awaited.
> +          operations such as waiting for a transaction to finish.
> +          <literal>wait_event</literal> will identify the type of lock awaited.
>           </para>
>          </listitem>
>          <listitem>
>
> I think that since the extension locks are no longer a part of
> heavyweight locks we should change the explanation.

Yes, you are right.

>> +RelationExtensionLockWaiterCount(Relation relation)
>>
>> Hmm.  This is sort of problematic, because with the new design we
>> have no guarantee that the return value is actually accurate.  I don't
>> think that's a functional problem, but the optics aren't great.
>
> Yeah, with this patch we could overestimate it and then add extra
> blocks to the relation. Since the number of extra blocks is capped at
> 512, I think it would not become a serious problem.

How about renaming it EstimateNumberOfExtensionLockWaiters?

>> +    /* If force releasing, release all locks we're holding */
>> +    if (force)
>> +        held_relextlock.nLocks = 0;
>> +    else
>> +        held_relextlock.nLocks--;
>> +
>> +    Assert(held_relextlock.nLocks >= 0);
>> +
>> +    /* Return if we're still holding the lock even after computation */
>> +    if (held_relextlock.nLocks > 0)
>> +        return;
>>
>> I thought you were going to have the caller adjust nLocks?
>
> Yeah, I meant to change it that way, but since we always release either
> one lock or all relext locks, I thought it'd be better to pass a bool
> rather than an int.

I don't see why you need to pass either one.  The caller can set
held_relextlock.nLocks either with -- or = 0, and then call
RelExtLockRelease() only if the resulting value is 0.

>> When I took a break from sitting at the computer, I realized that I
>> think this has a more serious problem: won't it permanently leak
>> reference counts if someone hits ^C or an error occurs while the lock
>> is held?  I think it will -- it probably needs to do cleanup at the
>> places where we do LWLockReleaseAll() that includes decrementing the
>> shared refcount if necessary, rather than doing cleanup at the places
>> we release heavyweight locks.
>> I might be wrong about the details here -- this is off the top of my head.
>
> Good catch. It can leak reference counts if someone hits ^C or an
> error occurs while waiting. Fixed in the latest patch. But since
> RelExtLockReleaseAll() is called even in such situations, I think we
> don't need to change the place where we release all the relext locks.
> We just moved it from heavyweight locks. Am I missing something?

Hmm, that might be an OK way to handle it.  I don't see a problem off
the top of my head.  It might be clearer to rename it to
RelExtLockCleanup() though, since it is not just releasing the lock
but also any wait count we hold.

+/* Must be greater than MAX_BACKENDS - which is 2^23-1, so we're fine. */
+#define RELEXT_WAIT_COUNT_MASK    ((uint32) ((1 << 24) - 1))

Let's drop the comment here and instead add a StaticAssertStmt() that
checks this.

I am slightly puzzled, though.  If I read this correctly, bits 0-23
are used for the waiter count, bit 24 is always 0, bit 25 indicates
the presence or absence of an exclusive lock, and bits 26+ are always
0.  That seems slightly odd.  Shouldn't we either use the highest
available bit for the locker (bit 31) or the lowest one (bit 24)?  The
former seems better, in case MAX_BACKENDS changes later.  We could
make RELEXT_WAIT_COUNT_MASK bigger too, just in case.
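The layout being proposed can be sketched as follows. MAX_BACKENDS matches PostgreSQL's 2^23 - 1; the negative-array-size trick stands in for StaticAssertStmt(), and the accessor names are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_BACKENDS			0x7FFFFF	/* 2^23 - 1, as in PostgreSQL */

/* Lock bit in the highest bit; wait count in all the bits below it */
#define RELEXT_LOCK_BIT			((uint32_t) 1 << 31)
#define RELEXT_WAIT_COUNT_MASK	(RELEXT_LOCK_BIT - 1)

/* Compile-time check replacing the comment (StaticAssertStmt in PostgreSQL) */
typedef char relext_mask_covers_max_backends
			[RELEXT_WAIT_COUNT_MASK >= MAX_BACKENDS ? 1 : -1];

static inline bool
relext_is_locked(uint32_t state)
{
	return (state & RELEXT_LOCK_BIT) != 0;
}

static inline uint32_t
relext_wait_count(uint32_t state)
{
	return state & RELEXT_WAIT_COUNT_MASK;
}
```

Putting the lock bit at the top and defining the mask as RELEXT_LOCK_BIT - 1 leaves no dead bits, so the wait count can grow if MAX_BACKENDS ever does.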

+        /* Make a lock tag */
+        tag.dbid = MyDatabaseId;
+        tag.relid = relid;

What about shared relations?  I bet we need to use 0 in that case.
Otherwise, if backends in two different databases try to extend the
same shared relation at the same time, we'll (probably) fail to notice
that they conflict.
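A sketch of the fix being suggested: shared relations build their tag with dbid = InvalidOid, so every database maps them to the same array slot. The helper name MakeRelExtLockTag is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;
#define InvalidOid	((Oid) 0)

typedef struct RelExtLockTag
{
	Oid			dbid;
	Oid			relid;
} RelExtLockTag;

/*
 * Build the lock tag.  Shared relations (pg_database, pg_authid, ...) must
 * conflict across databases, so use InvalidOid as their dbid; normal
 * relations use the current database's OID.
 */
static inline void
MakeRelExtLockTag(RelExtLockTag *tag, Oid mydbid, Oid relid, bool relisshared)
{
	tag->dbid = relisshared ? InvalidOid : mydbid;
	tag->relid = relid;
}
```

With this, two backends in different databases extending the same shared catalog compute identical tags and therefore the same RelExtLockArray index.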

+ * To avoid unnecessary recomputations of the hash code, we try to do this
+ * just once per function, and then pass it around as needed.  we can
+ * extract the index number of RelExtLockArray.

This is just a copy-and-paste from lock.c, but actually we have a more
sophisticated scheme here.  I think you can just drop this comment
altogether, really.

+    return (tag_hash((const void *) locktag, sizeof(RelExtLockTag))
+            % N_RELEXTLOCK_ENTS);

I would drop the outermost set of parentheses.  Is the cast to (const
void *) really doing anything?

+                 "cannot acquire relation extension locks for
multiple relations at the same");

cannot simultaneously acquire more than one distinct relation lock?
As you have it, you'd have to add the word "time" at the end, but my
version is shorter.

+        /* Sleep until the lock is released */

Really, there's no guarantee that the lock will be released when we
wake up.  I think just /* Sleep until something happens, then recheck
*/

+        lock_free = (oldstate & RELEXT_LOCK_BIT) == 0;
+        if (lock_free)
+            desired_state += RELEXT_LOCK_BIT;
+
+        if (pg_atomic_compare_exchange_u32(&relextlock->state,
+                                           &oldstate, desired_state))
+        {
+            if (lock_free)
+                return false;
+            else
+                return true;
+        }

Hmm.  If the lock is not free, we attempt to compare-and-swap anyway,
but then return false?  Why not just lock_free = (oldstate &
RELEXT_LOCK_BIT) == 0; if (!lock_free) return true; if
(pg_atomic_compare_exchange(&relextlock->state, &oldstate, oldstate |
RELEXT_LOCK_BIT)) return false;
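The simplification being sketched looks roughly like this, with C11 atomics standing in for pg_atomic_compare_exchange_u32 and the same return polarity as the patch (true means the caller must wait):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RELEXT_LOCK_BIT ((uint32_t) 1 << 31)

typedef struct RelExtLock
{
	_Atomic uint32_t state;		/* lock bit plus wait count */
} RelExtLock;

/* Returns true if the lock was held (caller must wait), false if acquired */
static bool
RelExtLockAttemptLock(RelExtLock *relextlock)
{
	uint32_t	oldstate = atomic_load(&relextlock->state);

	for (;;)
	{
		if (oldstate & RELEXT_LOCK_BIT)
			return true;		/* held by someone else; go wait */

		/* On CAS failure, oldstate is reloaded and we retry */
		if (atomic_compare_exchange_weak(&relextlock->state, &oldstate,
										 oldstate | RELEXT_LOCK_BIT))
			return false;		/* we took the lock */
	}
}
```

Bailing out as soon as the lock bit is seen set avoids a doomed compare-and-swap on the contended path, which is exactly the point of the review comment.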

+    Assert(IsAnyRelationExtensionLockHeld() == 0);

Since this is return bool now, it should just be
Assert(!IsAnyRelationExtensionLockHeld()).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Sat, Dec 9, 2017 at 2:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 8, 2017 at 3:20 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> The first hunk in monitoring.sgml looks unnecessary.
>>
>> You meant the following hunk?
>>
>> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
>> index 8d461c8..7aa7981 100644
>> --- a/doc/src/sgml/monitoring.sgml
>> +++ b/doc/src/sgml/monitoring.sgml
>> @@ -669,8 +669,8 @@ postgres   27093  0.0  0.0  30096  2752 ?
>> Ss   11:34   0:00 postgres: ser
>>            Heavyweight locks, also known as lock manager locks or simply locks,
>>            primarily protect SQL-visible objects such as tables.  However,
>>            they are also used to ensure mutual exclusion for certain internal
>> -          operations such as relation extension.
>> <literal>wait_event</literal> will
>> -          identify the type of lock awaited.
>> +          operations such as waiting for a transaction to finish.
>> +          <literal>wait_event</literal> will identify the type of lock awaited.
>>           </para>
>>          </listitem>
>>          <listitem>
>>
>> I think that since the extension locks are no longer a part of
>> heavyweight locks we should change the explanation.
>
> Yes, you are right.
>
>>> +RelationExtensionLockWaiterCount(Relation relation)
>>>
>>> Hmm.  This is sort of problematic, because with the new design we
>>> have no guarantee that the return value is actually accurate.  I don't
>>> think that's a functional problem, but the optics aren't great.
>>
>> Yeah, with this patch we could overestimate it and then add extra
>> blocks to the relation. Since the number of extra blocks is capped at
>> 512, I think it would not become a serious problem.
>
> How about renaming it EstimateNumberOfExtensionLockWaiters?

Agreed, fixed.

>
>>> +    /* If force releasing, release all locks we're holding */
>>> +    if (force)
>>> +        held_relextlock.nLocks = 0;
>>> +    else
>>> +        held_relextlock.nLocks--;
>>> +
>>> +    Assert(held_relextlock.nLocks >= 0);
>>> +
>>> +    /* Return if we're still holding the lock even after computation */
>>> +    if (held_relextlock.nLocks > 0)
>>> +        return;
>>>
>>> I thought you were going to have the caller adjust nLocks?
>>
>> Yeah, I meant to change it that way, but since we always release either
>> one lock or all relext locks, I thought it'd be better to pass a bool
>> rather than an int.
>
> I don't see why you need to pass either one.  The caller can set
> held_relextlock.nLocks either with -- or = 0, and then call
> RelExtLockRelease() only if the resulting value is 0.

Fixed.

>
>>> When I took a break from sitting at the computer, I realized that I
>>> think this has a more serious problem: won't it permanently leak
>>> reference counts if someone hits ^C or an error occurs while the lock
>>> is held?  I think it will -- it probably needs to do cleanup at the
>>> places where we do LWLockReleaseAll() that includes decrementing the
>>> shared refcount if necessary, rather than doing cleanup at the places
>>> we release heavyweight locks.
>>> I might be wrong about the details here -- this is off the top of my head.
>>
>> Good catch. It can leak reference counts if someone hits ^C or an
>> error occurs while waiting. Fixed in the latest patch. But since
>> RelExtLockReleaseAll() is called even in such situations, I think we
>> don't need to change the place where we release all the relext locks.
>> We just moved it from heavyweight locks. Am I missing something?
>
> Hmm, that might be an OK way to handle it.  I don't see a problem off
> the top of my head.  It might be clearer to rename it to
> RelExtLockCleanup() though, since it is not just releasing the lock
> but also any wait count we hold.

Yeah, it seems better. Fixed.

> +/* Must be greater than MAX_BACKENDS - which is 2^23-1, so we're fine. */
> +#define RELEXT_WAIT_COUNT_MASK    ((uint32) ((1 << 24) - 1))
>
> Let's drop the comment here and instead add a StaticAssertStmt() that
> checks this.

Fixed. I added StaticAssertStmt() to InitRelExtLocks().

>
> I am slightly puzzled, though.  If I read this correctly, bits 0-23
> are used for the waiter count, bit 24 is always 0, bit 25 indicates
> the presence or absence of an exclusive lock, and bits 26+ are always
> 0.  That seems slightly odd.  Shouldn't we either use the highest
> available bit for the locker (bit 31) or the lowest one (bit 24)?  The
> former seems better, in case MAX_BACKENDS changes later.  We could
> make RELEXT_WAIT_COUNT_MASK bigger too, just in case.

I agree with the former. Fixed.

> +        /* Make a lock tag */
> +        tag.dbid = MyDatabaseId;
> +        tag.relid = relid;
>
> What about shared relations?  I bet we need to use 0 in that case.
> Otherwise, if backends in two different databases try to extend the
> same shared relation at the same time, we'll (probably) fail to notice
> that they conflict.
>

You're right. I changed it so that we set InvalidOid as tag.dbid if
the relation is a shared relation.


> + * To avoid unnecessary recomputations of the hash code, we try to do this
> + * just once per function, and then pass it around as needed.  we can
> + * extract the index number of RelExtLockArray.
>
> This is just a copy-and-paste from lock.c, but actually we have a more
> sophisticated scheme here.  I think you can just drop this comment
> altogether, really.

Fixed.

>
> +    return (tag_hash((const void *) locktag, sizeof(RelExtLockTag))
> +            % N_RELEXTLOCK_ENTS);
>
> I would drop the outermost set of parentheses.  Is the cast to (const
> void *) really doing anything?
>

Fixed.

> +                 "cannot acquire relation extension locks for
> multiple relations at the same");
>
> cannot simultaneously acquire more than one distinct relation lock?
> As you have it, you'd have to add the word "time" at the end, but my
> version is shorter.

I meant: cannot acquire relation extension locks for multiple
relations at the same "time". Fixed.

>
> +        /* Sleep until the lock is released */
>
> Really, there's no guarantee that the lock will be released when we
> wake up.  I think just /* Sleep until something happens, then recheck
> */

Fixed.

> +        lock_free = (oldstate & RELEXT_LOCK_BIT) == 0;
> +        if (lock_free)
> +            desired_state += RELEXT_LOCK_BIT;
> +
> +        if (pg_atomic_compare_exchange_u32(&relextlock->state,
> +                                           &oldstate, desired_state))
> +        {
> +            if (lock_free)
> +                return false;
> +            else
> +                return true;
> +        }
>
> Hmm.  If the lock is not free, we attempt to compare-and-swap anyway,
> but then return false?  Why not just lock_free = (oldstate &
> RELEXT_LOCK_BIT) == 0; if (!lock_free) return true; if
> (pg_atomic_compare_exchange(&relextlock->state, &oldstate, oldstate |
> RELEXT_LOCK_BIT)) return false;

Fixed.

>
> +    Assert(IsAnyRelationExtensionLockHeld() == 0);
>
> Since this is return bool now, it should just be
> Assert(!IsAnyRelationExtensionLockHeld()).

Fixed.

Attached updated version patch. Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Sun, Dec 10, 2017 at 11:51 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached updated version patch. Please review it.

I went over this today; please find attached an updated version which
I propose to commit.

Changes:

- Various formatting fixes, including running pgindent.

- Various comment updates.

- Make RELEXT_WAIT_COUNT_MASK equal RELEXT_LOCK_BIT - 1 rather than
some unnecessarily smaller number.

- In InitRelExtLocks, don't bother using mul_size; we already know it
won't overflow, because we did the same thing in RelExtLockShmemSize.

- When we run into an error trying to release a lock, log it as a
WARNING and don't mark it as translatable.  Follows lock.c.  An ERROR
here probably just recurses infinitely.

- Don't bother passing OID to RelExtLockRelease.

- Reorder functions a bit for (IMHO) better clarity.

- Make UnlockRelationForExtension just use a single message for both
failure modes.  They are closely-enough related that I think that's
fine.

- Make WaitForRelationExtensionLockToBeFree complain if we already
hold an extension lock.

- In RelExtLockCleanup, clear held_relextlock.waiting.  This would've
made for a nasty bug.

- Also in that function, assert that we don't hold both a lock and a wait count.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Andres Freund
Date:
Hi,

On 2017-12-11 15:15:50 -0500, Robert Haas wrote:
> +* Relation extension locks. Only one process can extend a relation at
> +a time; we use a specialized lock manager for this purpose, which is
> +much simpler than the regular lock manager.  It is similar to the
> +lightweight lock mechanism, but is even simpler because there is only
> +one lock mode and only one lock can be taken at a time. A process holding
> +a relation extension lock is interruptible, unlike a process holding an
> +LWLock.


> +/*-------------------------------------------------------------------------
> + *
> + * extension_lock.c
> + *      Relation extension lock manager
> + *
> + * This specialized lock manager is used only for relation extension
> + * locks.  Unlike the heavyweight lock manager, it doesn't provide
> + * deadlock detection or group locking.  Unlike lwlock.c, extension lock
> + * waits are interruptible.  Unlike both systems, there is only one lock
> + * mode.
> + *
> + * False sharing is possible.  We have a fixed-size array of locks, and
> + * every database OID/relation OID combination is mapped to a slot in
> + * the array.  Therefore, if two processes try to extend relations that
> + * map to the same array slot, they will contend even though it would
> + * be OK to let both proceed at once.  Since these locks are typically
> + * taken only for very short periods of time, this doesn't seem likely
> + * to be a big problem in practice.  If it is, we could make the array
> + * bigger.

For me "very short periods of time" and journaled metadata-changing
filesystem operations don't quite mesh.  Language lawyering aside, this
seems quite likely to bite us down the road.

It's imo perfectly fine to say that there's only a limited number of
file extension locks, but a far-from-negligible chance of conflict
even when the array isn't full doesn't seem nice. I think this needs
to use some open-addressing-style conflict handling or something
similar.

Greetings,

Andres Freund


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Mon, Dec 11, 2017 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:
> For me, "very short periods of time" and journaled, metadata-changing
> filesystem operations don't quite mesh.  Language lawyering aside, this
> seems quite likely to bite us down the road.
>
> It's imo perfectly fine to say that there's only a limited number of
> file extension locks, but a far-from-negligible chance of conflict even
> when the array is not full doesn't seem nice. I think this needs to use
> some open-addressing-like conflict handling or something similar.

I guess we could consider that, but I'm not really convinced that it's
solving a real problem.  Right now, you start having meaningful chance
of lock-manager lock contention when the number of concurrent
processes in the system requesting heavyweight locks is still in the
single digits, because there are only 16 lock-manager locks.  With
this, there are effectively 1024 partitions.

Now I realize you're going to point out, not wrongly, that we're
contending on the locks themselves rather than the locks protecting
the locks, and that this makes everything worse because the hold time
is much longer.  Fair enough.  On the other hand, what workload would
actually be harmed?  I think you basically have to imagine a lot of
relations being extended simultaneously, like a parallel bulk load,
and an underlying filesystem which performs individual operations
slowly but scales really well.  I'm slightly skeptical that's how
real-world filesystems behave.

It might be a good idea, though, to test how parallel bulk loading
behaves with this patch applied, maybe even after reducing
N_RELEXTLOCK_ENTS to simulate an unfortunate number of collisions.

This isn't a zero-sum game.  If we add collision resolution, we're
going to slow down the ordinary uncontended case; the bookkeeping will
get significantly more complicated.  That is only worth doing if the
current behavior produces pathological cases on workloads that are
actually somewhat realistic.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Tue, Dec 12, 2017 at 5:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Dec 10, 2017 at 11:51 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Attached updated version patch. Please review it.
>
> I went over this today; please find attached an updated version which
> I propose to commit.
>
> Changes:
>
> - Various formatting fixes, including running pgindent.
>
> - Various comment updates.
>
> - Make RELEXT_WAIT_COUNT_MASK equal RELEXT_LOCK_BIT - 1 rather than
> some unnecessarily smaller number.
>
> - In InitRelExtLocks, don't bother using mul_size; we already know it
> won't overflow, because we did the same thing in RelExtLockShmemSize.
>
> - When we run into an error trying to release a lock, log it as a
> WARNING and don't mark it as translatable.  Follows lock.c.  An ERROR
> here probably just recurses infinitely.
>
> - Don't bother passing OID to RelExtLockRelease.
>
> - Reorder functions a bit for (IMHO) better clarity.
>
> - Make UnlockRelationForExtension just use a single message for both
> failure modes.  They are closely-enough related that I think that's
> fine.
>
> - Make WaitForRelationExtensionLockToBeFree complain if we already
> hold an extension lock.
>
> - In RelExtLockCleanup, clear held_relextlock.waiting.  This would've
> made for a nasty bug.
>
> - Also in that function, assert that we don't hold both a lock and a wait count.
>

Thank you for updating the patch. Here are two minor comments.

+ * we acquire the same relation extension lock repeatedly.  nLocks is 0 is the
+ * number of times we've acquired that lock;

Should it be "nLocks is the number of times we've acquired that lock:"?

+    /* Remember lock held by this backend */
+    held_relextlock.relid = relid;
+    held_relextlock.lock = relextlock;
+    held_relextlock.nLocks = 1;

We set held_relextlock.relid and held_relextlock.lock again. Can we remove them?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Andres Freund
Date:
On 2017-12-11 15:55:42 -0500, Robert Haas wrote:
> On Mon, Dec 11, 2017 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:
> > For me "very short periods of time" and journaled metadatachanging
> > filesystem operations don't quite mesh.  Language lawyering aside, this
> > seems quite likely to bite us down the road.
> >
> > It's imo perfectly fine to say that there's only a limited number of
> > file extension locks, but that there's a far from neglegible chance of
> > conflict even without the array being full doesn't seem nice. Think this
> > needs use some open addressing like conflict handling or something
> > alike.
> 
> I guess we could consider that, but I'm not really convinced that it's
> solving a real problem.  Right now, you start having meaningful chance
> of lock-manager lock contention when the number of concurrent
> processes in the system requesting heavyweight locks is still in the
> single digits, because there are only 16 lock-manager locks.  With
> this, there are effectively 1024 partitions.
> 
> Now I realize you're going to point out, not wrongly, that we're
> contending on the locks themselves rather than the locks protecting
> the locks, and that this makes everything worse because the hold time
> is much longer.

Indeed.


> Fair enough.  On the other hand, what workload would actually be
> harmed?  I think you basically have to imagine a lot of relations
> being extended simultaneously, like a parallel bulk load, and an
> underlying filesystem which performs individual operations slowly but
> scales really well.  I'm slightly skeptical that's how real-world
> filesystems behave.

Or just two independent relations on two different filesystems.


> It might be a good idea, though, to test how parallel bulk loading
> behaves with this patch applied, maybe even after reducing
> N_RELEXTLOCK_ENTS to simulate an unfortunate number of collisions.

Yea, that sounds like a good plan. Measure two COPYs to relations on
different filesystems, reduce N_RELEXTLOCK_ENTS to 1, and measure
performance. Then increase the concurrency of the copies to each
relation.


> This isn't a zero-sum game.  If we add collision resolution, we're
> going to slow down the ordinary uncontended case; the bookkeeping will
> get significantly more complicated.  That is only worth doing if the
> current behavior produces pathological cases on workloads that are
> actually somewhat realistic.

Yea, measuring sounds like a good plan.

Greetings,

Andres Freund


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Mon, Dec 11, 2017 at 4:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for updating the patch. Here are two minor comments.
>
> + * we acquire the same relation extension lock repeatedly.  nLocks is 0 is the
> + * number of times we've acquired that lock;
>
> Should it be "nLocks is the number of times we've acquired that lock:"?

Yes.

> +    /* Remember lock held by this backend */
> +    held_relextlock.relid = relid;
> +    held_relextlock.lock = relextlock;
> +    held_relextlock.nLocks = 1;
>
> We set held_relextlock.relid and held_relextlock.lock again. Can we remove them?

Yes.

Can you also try the experiment Andres mentions: "Measure two COPYs to
relations on different filesystems, reduce N_RELEXTLOCK_ENTS to 1, and
measure performance. Then increase the concurrency of the copies to
each relation."  We want to see whether and how much this regresses
performance in that case.  It simulates the case of a hash collision.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Wed, Dec 13, 2017 at 12:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 11, 2017 at 4:10 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Thank you for updating the patch. Here is two minor comments.
>>
>> + * we acquire the same relation extension lock repeatedly.  nLocks is 0 is the
>> + * number of times we've acquired that lock;
>>
>> Should it be "nLocks is the number of times we've acquired that lock:"?
>
> Yes.
>
>> +    /* Remember lock held by this backend */
>> +    held_relextlock.relid = relid;
>> +    held_relextlock.lock = relextlock;
>> +    held_relextlock.nLocks = 1;
>>
>> We set held_relextlock.relid and held_relextlock.lock again. Can we remove them?
>
> Yes.
>
> Can you also try the experiment Andres mentions: "Measure two COPYs to
> relations on different filesystems, reduce N_RELEXTLOCK_ENTS to 1, and
> measure performance.

Yes. I'll measure the performance in such an environment.

> Then increase the concurrency of the copies to
> each relation."  We want to see whether and how much this regresses
> performance in that case.  It simulates the case of a hash collision.
>

When we add extra blocks to a relation, do we access the disk? I
guess we just call lseek and write, and don't access the disk. If so,
the performance regression might not be much.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Andres Freund
Date:
On 2017-12-13 16:02:45 +0900, Masahiko Sawada wrote:
> When we add extra blocks to a relation, do we access the disk? I
> guess we just call lseek and write, and don't access the disk. If so,
> the performance regression might not be much.

Usually changes in the file size require the filesystem to perform
metadata operations, which in turn requires journaling on most
FSs. Which'll often result in synchronous disk writes.

Greetings,

Andres Freund


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Wed, Dec 13, 2017 at 4:30 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-12-13 16:02:45 +0900, Masahiko Sawada wrote:
>> When we add extra blocks on a relation do we access to the disk? I
>> guess we just call lseek and write and don't access to the disk. If so
>> the performance degradation regression might not be much.
>
> Usually changes in the file size require the filesystem to perform
> metadata operations, which in turn requires journaling on most
> FSs. Which'll often result in synchronous disk writes.
>

Thank you. I understood the reason why this measurement should use two
different filesystems.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Wed, Dec 13, 2017 at 5:57 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Wed, Dec 13, 2017 at 4:30 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2017-12-13 16:02:45 +0900, Masahiko Sawada wrote:
>>> When we add extra blocks on a relation do we access to the disk? I
>>> guess we just call lseek and write and don't access to the disk. If so
>>> the performance degradation regression might not be much.
>>
>> Usually changes in the file size require the filesystem to perform
>> metadata operations, which in turn requires journaling on most
>> FSs. Which'll often result in synchronous disk writes.
>>
>
> Thank you. I understood the reason why this measurement should use two
> different filesystems.
>

Here is the result.
I've measured the throughput in several cases on my virtual machine.
Each client loads a 48k file into its own relation, located on
either an xfs or an ext4 filesystem, for 30 sec.

Case 1: COPYs to relations on different filesystems (xfs and ext4), and
N_RELEXTLOCK_ENTS is 1024

clients = 2, avg = 296.2068
clients = 5, avg = 372.0707
clients = 10, avg = 389.8850
clients = 50, avg = 428.8050

Case 2: COPYs to relations on different filesystems (xfs and ext4), and
N_RELEXTLOCK_ENTS is 1

clients = 2, avg = 294.3633
clients = 5, avg = 358.9364
clients = 10, avg = 383.6945
clients = 50, avg = 424.3687

And the result of current HEAD is following.

clients = 2, avg = 284.9976
clients = 5, avg = 356.1726
clients = 10, avg = 375.9856
clients = 50, avg = 429.5745

In case 2, the throughput decreased compared to case 1, but it seems
to be almost the same as current HEAD. Because acquiring and releasing
the extension lock is about 10x faster than on current HEAD, as I
mentioned before, the degradation may be smaller than I expected even
in case 2.
Since my machine doesn't have enough resources, the result for clients
= 50 might not be valid.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Thu, Dec 14, 2017 at 5:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Here is the result.
> I've measured the throughput in several cases on my virtual machine.
> Each client loads a 48k file into its own relation, located on
> either an xfs or an ext4 filesystem, for 30 sec.
>
> Case 1: COPYs to relations on different filesystems (xfs and ext4), and
> N_RELEXTLOCK_ENTS is 1024
>
> clients = 2, avg = 296.2068
> clients = 5, avg = 372.0707
> clients = 10, avg = 389.8850
> clients = 50, avg = 428.8050
>
> Case 2: COPYs to relations on different filesystems (xfs and ext4), and
> N_RELEXTLOCK_ENTS is 1
>
> clients = 2, avg = 294.3633
> clients = 5, avg = 358.9364
> clients = 10, avg = 383.6945
> clients = 50, avg = 424.3687
>
> And the result of current HEAD is following.
>
> clients = 2, avg = 284.9976
> clients = 5, avg = 356.1726
> clients = 10, avg = 375.9856
> clients = 50, avg = 429.5745
>
> In case 2, the throughput decreased compared to case 1, but it seems
> to be almost the same as current HEAD. Because acquiring and releasing
> the extension lock is about 10x faster than on current HEAD, as I
> mentioned before, the degradation may be smaller than I expected even
> in case 2.
> Since my machine doesn't have enough resources, the result for clients
> = 50 might not be valid.

I have to admit that result is surprising to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Sun, Dec 17, 2017 at 12:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 14, 2017 at 5:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Here is the result.
>> I've measured the through-put with some cases on my virtual machine.
>> Each client loads 48k file to each different relations located on
>> either xfs filesystem or ext4 filesystem, for 30 sec.
>>
>> Case 1: COPYs to relations on different filessystems(xfs and ext4) and
>> N_RELEXTLOCK_ENTS is 1024
>>
>> clients = 2, avg = 296.2068
>> clients = 5, avg = 372.0707
>> clients = 10, avg = 389.8850
>> clients = 50, avg = 428.8050
>>
>> Case 2: COPYs to relations on different filessystems(xfs and ext4) and
>> N_RELEXTLOCK_ENTS is 1
>>
>> clients = 2, avg = 294.3633
>> clients = 5, avg = 358.9364
>> clients = 10, avg = 383.6945
>> clients = 50, avg = 424.3687
>>
>> And the result of current HEAD is following.
>>
>> clients = 2, avg = 284.9976
>> clients = 5, avg = 356.1726
>> clients = 10, avg = 375.9856
>> clients = 50, avg = 429.5745
>>
>> In case2, the through-put got decreased compare to case 1 but it seems
>> to be almost same as current HEAD. Because the speed of acquiring and
>> releasing extension lock got x10 faster than current HEAD as I
>> mentioned before, the performance degradation may not have gotten
>> decreased than I expected even in case 2.
>> Since my machine doesn't have enough resources the result of clients =
>> 50 might not be a valid result.
>
> I have to admit that result is surprising to me.
>

I think the environment I used for the performance measurement did not
have enough resources. I will do the same benchmark on another
environment to see whether it was a valid result, and will share it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Mon, Dec 18, 2017 at 2:04 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sun, Dec 17, 2017 at 12:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Dec 14, 2017 at 5:45 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> Here is the result.
>>> I've measured the through-put with some cases on my virtual machine.
>>> Each client loads 48k file to each different relations located on
>>> either xfs filesystem or ext4 filesystem, for 30 sec.
>>>
>>> Case 1: COPYs to relations on different filessystems(xfs and ext4) and
>>> N_RELEXTLOCK_ENTS is 1024
>>>
>>> clients = 2, avg = 296.2068
>>> clients = 5, avg = 372.0707
>>> clients = 10, avg = 389.8850
>>> clients = 50, avg = 428.8050
>>>
>>> Case 2: COPYs to relations on different filessystems(xfs and ext4) and
>>> N_RELEXTLOCK_ENTS is 1
>>>
>>> clients = 2, avg = 294.3633
>>> clients = 5, avg = 358.9364
>>> clients = 10, avg = 383.6945
>>> clients = 50, avg = 424.3687
>>>
>>> And the result of current HEAD is following.
>>>
>>> clients = 2, avg = 284.9976
>>> clients = 5, avg = 356.1726
>>> clients = 10, avg = 375.9856
>>> clients = 50, avg = 429.5745
>>>
>>> In case2, the through-put got decreased compare to case 1 but it seems
>>> to be almost same as current HEAD. Because the speed of acquiring and
>>> releasing extension lock got x10 faster than current HEAD as I
>>> mentioned before, the performance degradation may not have gotten
>>> decreased than I expected even in case 2.
>>> Since my machine doesn't have enough resources the result of clients =
>>> 50 might not be a valid result.
>>
>> I have to admit that result is surprising to me.
>>
>
> I think the environment I used for the performance measurement did not
> have enough resources. I will do the same benchmark on another
> environment to see whether it was a valid result, and will share it.
>

I did a performance measurement on a different environment, which has
4 cores and two physically separated disk volumes. Also, I've changed
the benchmark so that each COPY loads only 300 integer tuples, which do
not fit within a single page, and changed the tables to unlogged tables
to observe the overhead of locking/unlocking relext locks.

Case 1: COPYs to relations on different filessystems(xfs and ext4) and
N_RELEXTLOCK_ENTS is 1024

clients = 1, avg = 3033.8933
clients = 2, avg = 5992.9077
clients = 4, avg = 8055.9515
clients = 8, avg = 8468.9306
clients = 16, avg = 7718.6879

Case 2: COPYs to relations on different filessystems(xfs and ext4) and
N_RELEXTLOCK_ENTS is 1

clients = 1, avg = 3012.4993
clients = 2, avg = 5854.9966
clients = 4, avg = 7380.6082
clients = 8, avg = 7091.8367
clients = 16, avg = 7573.2904

And the result of current HEAD is following.

clients = 1, avg = 2962.2416
clients = 2, avg = 5856.9774
clients = 4, avg = 7561.1376
clients = 8, avg = 7252.0192
clients = 16, avg = 7916.7651

As per the above results, compared with current HEAD, the throughput
in case 1 increased by up to 17%. On the other hand, the throughput in
case 2 decreased by 2%~5%.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


On Tue, Dec 19, 2017 at 5:52 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Dec 18, 2017 at 2:04 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Sun, Dec 17, 2017 at 12:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>
>>> I have to admit that result is surprising to me.
>>
>> I think the environment I used for performance measurement did not
>> have enough resources. I will do the same benchmark on an another
>> environment to see if it was a valid result, and will share it.
>>
> I did a performance measurement on a different environment, which has
> 4 cores and two physically separated disk volumes. Also, I've changed
> the benchmark so that each COPY loads only 300 integer tuples, which do
> not fit within a single page, and changed the tables to unlogged tables
> to observe the overhead of locking/unlocking relext locks.

I ran the same test as asked by Robert; it is just an extension of the
tests [1] pointed out by Amit Kapila.

Machine : cthulhu
------------------------
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
Stepping:              2
CPU MHz:               1064.000
CPU max MHz:           2129.0000
CPU min MHz:           1064.0000
BogoMIPS:              4266.59
Virtualization:        VT-x
Hypervisor vendor:     vertical
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              24576K
NUMA node0 CPU(s):     0-7,64-71
NUMA node1 CPU(s):     8-15,72-79
NUMA node2 CPU(s):     16-23,80-87
NUMA node3 CPU(s):     24-31,88-95
NUMA node4 CPU(s):     32-39,96-103
NUMA node5 CPU(s):     40-47,104-111
NUMA node6 CPU(s):     48-55,112-119
NUMA node7 CPU(s):     56-63,120-127

It has 2 disks with different filesystems, as below:
/dev/mapper/vg_mag-data2        ext4      5.1T  3.6T  1.2T  76% /mnt/data-mag2
/dev/mapper/vg_mag-data1        xfs       5.1T  1.6T  3.6T  31% /mnt/data-mag

I have created 2 tables, one on each of the above filesystems.

test_size_copy.sh --> automated script to run the copy test.
copy_script1, copy_script2 --> pgbench copy scripts used by
test_size_copy.sh to load 2 different tables.

To run the above copy scripts in parallel, I ran them with equal weights as below.
./pgbench -c $threads -j $threads -f copy_script1@1 -f copy_script2@1
-T 120 postgres >> test_results.txt


Results :
-----------

Clients        HEAD-TPS
---------        ---------------
1                84.460734
2                121.359035
4                175.886335
8                268.764828
16              369.996667
32              439.032756
64              482.185392


Clients    N_RELEXTLOCK_ENTS = 1024    %diff with HEAD
----------------------------------------------------------------------------------
1        87.165777        3.20272258112273
2        131.094037        8.02165409439848
4        181.667104        3.2866504381935
8        267.412856        -0.503031594595423
16        376.118671        1.65461058058666
32        460.756357        4.94805927419228
64        492.723975        2.18558736428913

Not much of an improvement over HEAD.

Clients    N_RELEXTLOCK_ENTS = 1    %diff with HEAD
-----------------------------------------------------------------------------
1        86.288574        2.16412990206786
2        131.398667        8.27266960387414
4        168.681079        -4.09654109854526
8        245.841999        -8.52895416806549
16        321.972147        -12.9797169226933
32        375.783299        -14.4065462395703
64        360.134531        -25.3120196142317


So in the case of N_RELEXTLOCK_ENTS = 1, we can see a regression as high as 25%?


[1]https://www.postgresql.org/message-id/CAFiTN-tkX6gs-jL8VrPxg6OG9VUAKnObUq7r7pWQqASzdF5OwA%40mail.gmail.com
-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Tue, Jan 2, 2018 at 1:09 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> So in the case of N_RELEXTLOCK_ENTS = 1, we can see a regression as high as 25%?

So now the question is: what do these results mean for this patch?

I think that the chances of someone simultaneously bulk-loading 16 or
more relations that all happen to hash to the same relation extension
lock bucket is pretty darn small.  Most people aren't going to be
running 16 bulk loads at the same time in the first place, and if they
are, then there's a good chance that at least some of those loads are
either actually to the same relation, or that many or all of the loads
are targeting the same filesystem and the bottleneck will occur at
that level, or that the loads are to relations which hash to different
buckets.  Now, if we want to reduce the chances of hash collisions, we
could boost the default value of N_RELEXTLOCK_ENTS to 2048 or 4096.

However, if we take the position that no hash collision probability is
low enough and that we must eliminate all chance of false collisions,
except perhaps when the table is full, then we have to make this
locking mechanism a whole lot more complicated.  We can no longer
compute the location of the lock we need without first taking some
other kind of lock that protects the mapping from {db_oid, rel_oid} ->
{memory address of the relevant lock}.  We can no longer cache the
location where we found the lock last time so that we can retake it.
If we do that, we're adding extra cycles and extra atomics and extra
code that can harbor bugs to every relation extension to guard against
something which I'm not sure is really going to happen.  Something
that's 3-8% faster in a case that occurs all the time and as much as
25% slower in a case that virtually never arises seems like it might
be a win overall.

However, it's quite possible that I'm not seeing the whole picture
here.  Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Fri, Jan 5, 2018 at 1:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 2, 2018 at 1:09 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>> So in case of  N_RELEXTLOCK_ENTS = 1 we can see regression as high 25%. ?

Thank you for the performance measurement!

> So now the question is: what do these results mean for this patch?
>
> I think that the chances of someone simultaneously bulk-loading 16 or
> more relations that all happen to hash to the same relation extension
> lock bucket is pretty darn small.  Most people aren't going to be
> running 16 bulk loads at the same time in the first place, and if they
> are, then there's a good chance that at least some of those loads are
> either actually to the same relation, or that many or all of the loads
> are targeting the same filesystem and the bottleneck will occur at
> that level, or that the loads are to relations which hash to different
> buckets.  Now, if we want to reduce the chances of hash collisions, we
> could boost the default value of N_RELEXTLOCK_ENTS to 2048 or 4096.
>
> However, if we take the position that no hash collision probability is
> low enough and that we must eliminate all chance of false collisions,
> except perhaps when the table is full, then we have to make this
> locking mechanism a whole lot more complicated.  We can no longer
> compute the location of the lock we need without first taking some
> other kind of lock that protects the mapping from {db_oid, rel_oid} ->
> {memory address of the relevant lock}.  We can no longer cache the
> location where we found the lock last time so that we can retake it.
> If we do that, we're adding extra cycles and extra atomics and extra
> code that can harbor bugs to every relation extension to guard against
> something which I'm not sure is really going to happen.  Something
> that's 3-8% faster in a case that occurs all the time and as much as
> 25% slower in a case that virtually never arises seems like it might
> be a win overall.
>
> However, it's quite possible that I'm not seeing the whole picture
> here.  Thoughts?
>

I agree that the chances of a case where throughput gets worse are
pretty small, and we get a performance improvement in common cases.
Also, false collisions could make us mistakenly overestimate the
number of blocks we need to add; performance might then get worse and
we would extend a relation more than necessary, but I think the
chances of that are small. Considering further parallel operations
(e.g. parallel loading, parallel index creation, etc.), multiple
processes will be taking a relext lock on the same relation. With that
in mind, the benefit of this patch, which improves the speed of
acquiring/releasing the lock, would be significant.

In short, I personally think the current patch is simple and the
result is not bad. But if the community cannot accept these
degradations, we have to deal with the problem. For example, we could
make the length of the relext lock array configurable by users. That
way, users could reduce the possibility of collisions. Or we could
improve the relext lock manager to eliminate false collisions by
changing it to an open-addressing hash table. The code would get more
complex, but false collisions wouldn't happen unless the array is
full.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Sun, Jan 7, 2018 at 11:26 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Jan 5, 2018 at 1:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Jan 2, 2018 at 1:09 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
>>> So in case of  N_RELEXTLOCK_ENTS = 1 we can see regression as high 25%. ?
>
> Thank you for the performance measurement!
>
>> So now the question is: what do these results mean for this patch?
>>
>> I think that the chances of someone simultaneously bulk-loading 16 or
>> more relations that all happen to hash to the same relation extension
>> lock bucket is pretty darn small.  Most people aren't going to be
>> running 16 bulk loads at the same time in the first place, and if they
>> are, then there's a good chance that at least some of those loads are
>> either actually to the same relation, or that many or all of the loads
>> are targeting the same filesystem and the bottleneck will occur at
>> that level, or that the loads are to relations which hash to different
>> buckets.  Now, if we want to reduce the chances of hash collisions, we
>> could boost the default value of N_RELEXTLOCK_ENTS to 2048 or 4096.
>>
>> However, if we take the position that no hash collision probability is
>> low enough and that we must eliminate all chance of false collisions,
>> except perhaps when the table is full, then we have to make this
>> locking mechanism a whole lot more complicated.  We can no longer
>> compute the location of the lock we need without first taking some
>> other kind of lock that protects the mapping from {db_oid, rel_oid} ->
>> {memory address of the relevant lock}.  We can no longer cache the
>> location where we found the lock last time so that we can retake it.
>> If we do that, we're adding extra cycles and extra atomics and extra
>> code that can harbor bugs to every relation extension to guard against
>> something which I'm not sure is really going to happen.  Something
>> that's 3-8% faster in a case that occurs all the time and as much as
>> 25% slower in a case that virtually never arises seems like it might
>> be a win overall.
>>
>> However, it's quite possible that I'm not seeing the whole picture
>> here.  Thoughts?
>>
>
> I agree that the chances of the case where throughput gets worse are
> pretty small, and that we can get a performance improvement in common
> cases. Also, false collisions could make us mistakenly overestimate
> the number of blocks we need to add; thereby performance might get
> worse and we might extend a relation more than necessary, but I think
> the chances of that are small too. Considering future parallel
> operations (e.g. parallel loading, parallel index creation, etc.),
> multiple processes will be taking the relext lock of the same
> relation. Thinking of that, the benefit of this patch, which improves
> the speed of acquiring/releasing the lock, would be effective.
>
> In short, I personally think the current patch is simple and the
> result is not bad. But if the community cannot accept these
> degradations we have to deal with the problem. For example, we could
> make the length of the relext lock array configurable by users. That
> way, users can reduce the possibility of collisions. Or we could
> improve the relext lock manager to eliminate false collisions by
> changing it to an open-addressing hash table. The code would get more
> complex, but false collisions wouldn't happen unless the array is
> full.
>

On second thought, perhaps we should also do a performance measurement
with the patch that uses an HTAB instead of a fixed array. The
performance with that patch will probably be close to current HEAD,
hopefully not worse. In addition, if the performance degradation from
false collisions doesn't happen, or if we can avoid it by increasing a
GUC parameter, I think that's better than the current fixed array
approach. Thoughts?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Andres Freund
Date:
Hi,

On 2018-01-04 11:39:40 -0500, Robert Haas wrote:
> On Tue, Jan 2, 2018 at 1:09 AM, Mithun Cy <mithun.cy@enterprisedb.com> wrote:
> > So in case of N_RELEXTLOCK_ENTS = 1 we can see regression as high as 25%?
> 
> So now the question is: what do these results mean for this patch?

> I think that the chances of someone simultaneously bulk-loading 16 or
> more relations that all happen to hash to the same relation extension
> lock bucket is pretty darn small.

I'm not convinced that that's true. Especially with partitioning in the
mix.

Also, the birthday paradox and all that make collisions not that
unlikely. And you really don't need a 16-way conflict to feel pain;
IMO you'll feel it earlier.

I think bumping up the size a bit would make that less likely. Not sure
it actually addresses the issue.


> However, if we take the position that no hash collision probability is
> low enough and that we must eliminate all chance of false collisions,
> except perhaps when the table is full, then we have to make this
> locking mechanism a whole lot more complicated.  We can no longer
> compute the location of the lock we need without first taking some
> other kind of lock that protects the mapping from {db_oid, rel_oid} ->
> {memory address of the relevant lock}.

Hm, that's not necessarily true, is it? While not trivial, it also
doesn't seem impossible?

Greetings,

Andres Freund


On Thu, Mar 1, 2018 at 2:17 PM, Andres Freund <andres@anarazel.de> wrote:
>> However, if we take the position that no hash collision probability is
>> low enough and that we must eliminate all chance of false collisions,
>> except perhaps when the table is full, then we have to make this
>> locking mechanism a whole lot more complicated.  We can no longer
>> compute the location of the lock we need without first taking some
>> other kind of lock that protects the mapping from {db_oid, rel_oid} ->
>> {memory address of the relevant lock}.
>
> Hm, that's not necessarily true, is it? While not trivial, it also
> doesn't seem impossible?

You can't both store every lock at a fixed address and at the same
time put locks at a different address if the one they would have used
is already occupied.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Andres Freund
Date:
On 2018-03-01 15:37:17 -0500, Robert Haas wrote:
> On Thu, Mar 1, 2018 at 2:17 PM, Andres Freund <andres@anarazel.de> wrote:
> >> However, if we take the position that no hash collision probability is
> >> low enough and that we must eliminate all chance of false collisions,
> >> except perhaps when the table is full, then we have to make this
> >> locking mechanism a whole lot more complicated.  We can no longer
> >> compute the location of the lock we need without first taking some
> >> other kind of lock that protects the mapping from {db_oid, rel_oid} ->
> >> {memory address of the relevant lock}.
> >
> > Hm, that's not necessarily true, is it? While not trivial, it also
> > doesn't seem impossible?
> 
> You can't both store every lock at a fixed address and at the same
> time put locks at a different address if the one they would have used
> is already occupied.

Right, but why does that require a lock?

Greetings,

Andres Freund


On Thu, Mar 1, 2018 at 3:40 PM, Andres Freund <andres@anarazel.de> wrote:
>> You can't both store every lock at a fixed address and at the same
>> time put locks at a different address if the one they would have used
>> is already occupied.
>
> Right, but why does that require a lock?

Maybe I'm being dense here but ... how could it not?

If the lock for relation X is always at pointer P, then I can compute
the address for the lock and assume it will be there, because that's
where it *always is*.

If the lock for relation X can be at any of various addresses
depending on other system activity, then I cannot assume that an
address that I compute for it remains valid except for so long as I
hold a lock strong enough to keep it from being moved.

Concretely, I imagine that if you put the lock at different addresses
at different times, you would implement that by reclaiming unused
entries to make room for new entries that you need to allocate.  So if
I hold the lock at 0x1000, I can probably assume it will stay
there for as long as I hold it.  But the instant I release it, even
for a moment, somebody might garbage-collect the entry and reallocate
it for something else.  Now the next time I need it, it will be
elsewhere.  I'll have to search for it, I presume, while holding some
analogue of the buffer-mapping lock.  In the patch as proposed, that's
not needed.  Once you know that the lock for relation 123 is at
0x1000, you can just keep locking it at that same address without
checking anything, which is quite appealing given that the same
backend extending the same relation many times in a row is a pretty
common pattern.

If you have a clever idea how to make this work with as few atomic
operations as the current patch uses while at the same time reducing
the possibility of contention, I'm all ears.  But I don't see how to
do that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Michael Paquier
Date:
On Thu, Mar 01, 2018 at 04:01:28PM -0500, Robert Haas wrote:
> If you have a clever idea how to make this work with as few atomic
> operations as the current patch uses while at the same time reducing
> the possibility of contention, I'm all ears.  But I don't see how to
> do that.

This thread has no activity since the beginning of the commit fest, and
it seems that it would be hard to reach something committable for v11,
so I am marking it as returned with feedback.
--
Michael

Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Fri, Mar 30, 2018 at 4:43 PM, Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Mar 01, 2018 at 04:01:28PM -0500, Robert Haas wrote:
>> If you have a clever idea how to make this work with as few atomic
>> operations as the current patch uses while at the same time reducing
>> the possibility of contention, I'm all ears.  But I don't see how to
>> do that.
>
> This thread has no activity since the beginning of the commit fest, and
> it seems that it would be hard to reach something committable for v11,
> so I am marking it as returned with feedback.

Thank you.

The probability of performance degradation can be reduced by
increasing N_RELEXTLOCK_ENTS. But as Robert mentioned, while keeping
the implementation fast and simple (acquiring the lock with a few
atomic operations) it's hard to improve, or at least maintain, the
current performance in all cases. I was thinking that this patch is
needed by parallel DML operations and parallel vacuum, but if the
community cannot accept this approach it might be better to mark it as
"Rejected" and then I should reconsider the design of parallel vacuum.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


On Tue, Apr 10, 2018 at 5:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> The probability of performance degradation can be reduced by
> increasing N_RELEXTLOCK_ENTS. But as Robert mentioned, while keeping
> fast and simple implementation like acquiring lock by a few atomic
> operation it's hard to improve or at least keep the current
> performance on all cases. I was thinking that this patch is necessary
> by parallel DML operations and vacuum but if the community cannot
> accept this approach it might be better to mark it as "Rejected" and
> then I should reconsider the design of parallel vacuum.

I'm sorry that I didn't get time to work further on this during the
CommitFest.  In terms of moving forward, I'd still like to hear what
Andres has to say about the comments I made on March 1st.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Wed, Apr 11, 2018 at 1:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 10, 2018 at 5:40 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> The probability of performance degradation can be reduced by
>> increasing N_RELEXTLOCK_ENTS. But as Robert mentioned, while keeping
>> fast and simple implementation like acquiring lock by a few atomic
>> operation it's hard to improve or at least keep the current
>> performance on all cases. I was thinking that this patch is necessary
>> by parallel DML operations and vacuum but if the community cannot
>> accept this approach it might be better to mark it as "Rejected" and
>> then I should reconsider the design of parallel vacuum.
>
> I'm sorry that I didn't get time to work further on this during the
> CommitFest.

Never mind. There were a lot of items, especially at the last CommitFest.

> In terms of moving forward, I'd still like to hear what
> Andres has to say about the comments I made on March 1st.

Yeah, agreed.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Never mind. There was a lot of items especially at the last CommitFest.
>
>> In terms of moving forward, I'd still like to hear what
>> Andres has to say about the comments I made on March 1st.
>
> Yeah, agreed.

$ ping -n andres.freund
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2
Request timeout for icmp_seq 3
Request timeout for icmp_seq 4
^C
--- andres.freund ping statistics ---
6 packets transmitted, 0 packets received, 100.0% packet loss

Meanwhile, https://www.postgresql.org/message-id/4c171ffe-e3ee-acc5-9066-a40d52bc5ae9@postgrespro.ru
shows that this patch has some benefits for other cases, which is a
point in favor IMHO.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Thu, Apr 26, 2018 at 3:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Never mind. There was a lot of items especially at the last CommitFest.
>>
>>> In terms of moving forward, I'd still like to hear what
>>> Andres has to say about the comments I made on March 1st.
>>
>> Yeah, agreed.
>
> $ ping -n andres.freund
> Request timeout for icmp_seq 0
> Request timeout for icmp_seq 1
> Request timeout for icmp_seq 2
> Request timeout for icmp_seq 3
> Request timeout for icmp_seq 4
> ^C
> --- andres.freund ping statistics ---
> 6 packets transmitted, 0 packets received, 100.0% packet loss
>
> Meanwhile, https://www.postgresql.org/message-id/4c171ffe-e3ee-acc5-9066-a40d52bc5ae9@postgrespro.ru
> shows that this patch has some benefits for other cases, which is a
> point in favor IMHO.

Thank you for sharing. That's good to know.

Andres pointed out the performance degradation due to hash collisions
during simultaneous loading. I think the point is that it happens
where users can't see it.  Therefore, even if we make
N_RELEXTLOCK_ENTS a configurable parameter, users don't know when hash
collisions occur and so don't know when they should tune it.

So it's just an idea, but how about adding an SQL-callable function
that returns the estimated number of lock waiters for a given
relation? Since the user knows how many processes are loading into the
relation, if the value returned by the function is greater than
expected, the user can detect the hash collision and start to consider
increasing N_RELEXTLOCK_ENTS.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


On Thu, Apr 26, 2018 at 2:10 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for sharing. That's good to know.
>
> Andres pointed out the performance degradation due to hash collision
> when multiple loading. I think the point is that it happens at where
> users don't know.  Therefore even if we make N_RELEXTLOCK_ENTS
> configurable parameter, since users don't know the hash collision they
> don't know when they should tune it.
>
> So it's just an idea but how about adding an SQL-callable function
> that returns the estimated number of lock waiters of the given
> relation? Since user knows how many processes are loading to the
> relation, if a returned value by the function is greater than the
> expected value user  can know hash collision and will be able to start
> to consider to increase N_RELEXTLOCK_ENTS.

I don't think that's a very useful suggestion.  Changing
N_RELEXTLOCK_ENTS requires a recompile, which is going to be
impractical for most users.  Even if we made it a GUC, we don't want
users to have to tune stuff like this.  If we actually think this is
going to be a problem, we'd probably better rethink the design.

I think the real question is whether the scenario is common enough to
worry about.  In practice, you'd have to be extremely unlucky to be
doing many bulk loads at the same time that all happened to hash to
the same bucket.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Andres Freund
Date:
Hi,

On 2018-04-26 15:08:24 -0400, Robert Haas wrote:
> I don't think that's a very useful suggestion.  Changing
> N_RELEXTLOCK_ENTS requires a recompile, which is going to be
> impractical for most users.  Even if we made it a GUC, we don't want
> users to have to tune stuff like this.  If we actually think this is
> going to be a problem, we'd probably better rethink the design.

Agreed.


> I think the real question is whether the scenario is common enough to
> worry about.  In practice, you'd have to be extremely unlucky to be
> doing many bulk loads at the same time that all happened to hash to
> the same bucket.

With a bunch of parallel bulkloads into partitioned tables that really
doesn't seem that unlikely?

Greetings,

Andres Freund


On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
>> I think the real question is whether the scenario is common enough to
>> worry about.  In practice, you'd have to be extremely unlucky to be
>> doing many bulk loads at the same time that all happened to hash to
>> the same bucket.
>
> With a bunch of parallel bulkloads into partitioned tables that really
> doesn't seem that unlikely?

It increases the likelihood of collisions, but probably decreases the
number of cases where the contention gets really bad.

For example, suppose each table has 100 partitions and you are
bulk-loading 10 of them at a time.  It's virtually certain that you
will have some collisions, but the amount of contention within each
bucket will remain fairly low because each backend spends only 1% of
its time in the bucket corresponding to any given partition.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




-----Original Message-----
From: Robert Haas <robertmhaas@gmail.com>
Sent: Thursday, April 26, 2018 10:25 PM
To: Andres Freund <andres@anarazel.de>
Cc: Masahiko Sawada <sawada.mshk@gmail.com>; Michael Paquier <michael@paquier.xyz>; Mithun Cy <mithun.cy@enterprisedb.com>; Tom Lane <tgl@sss.pgh.pa.us>; Thomas Munro <thomas.munro@enterprisedb.com>; Amit Kapila <amit.kapila16@gmail.com>; PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
>> I think the real question is whether the scenario is common enough to
>> worry about.  In practice, you'd have to be extremely unlucky to be
>> doing many bulk loads at the same time that all happened to hash to
>> the same bucket.
>
> With a bunch of parallel bulkloads into partitioned tables that really
> doesn't seem that unlikely?

It increases the likelihood of collisions, but probably decreases the number of cases where the contention gets really bad.

For example, suppose each table has 100 partitions and you are bulk-loading 10 of them at a time.  It's virtually certain that you will have some collisions, but the amount of contention within each bucket will remain fairly low because each backend spends only 1% of its time in the bucket corresponding to any given partition.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

Hello!
I want to try testing this patch on a 302-core (704 with hyper-threading) machine.

Patching master (commit 81256cd05f0745353c6572362155b57250a0d2a0) went fine, but I got errors while compiling:

gistvacuum.c: In function ‘gistvacuumcleanup’:
gistvacuum.c:92:3: error: too many arguments to function ‘LockRelationForExtension’
   LockRelationForExtension(rel, ExclusiveLock);
   ^
In file included from gistvacuum.c:21:0:
../../../../src/include/storage/extension_lock.h:30:13: note: declared here
 extern void LockRelationForExtension(Relation relation);
             ^
gistvacuum.c:95:3: error: too many arguments to function ‘UnlockRelationForExtension’
   UnlockRelationForExtension(rel, ExclusiveLock);
   ^
In file included from gistvacuum.c:21:0:
../../../../src/include/storage/extension_lock.h:31:13: note: declared here
 extern void UnlockRelationForExtension(Relation relation);


--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





-----Original Message-----
From: Alex Ignatov <a.ignatov@postgrespro.ru>
Sent: Monday, May 21, 2018 6:00 PM
To: 'Robert Haas' <robertmhaas@gmail.com>; 'Andres Freund' <andres@anarazel.de>
Cc: 'Masahiko Sawada' <sawada.mshk@gmail.com>; 'Michael Paquier' <michael@paquier.xyz>; 'Mithun Cy' <mithun.cy@enterprisedb.com>; 'Tom Lane' <tgl@sss.pgh.pa.us>; 'Thomas Munro' <thomas.munro@enterprisedb.com>; 'Amit Kapila' <amit.kapila16@gmail.com>; 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Subject: RE: [HACKERS] Moving relation extension locks out of heavyweight lock manager





Sorry, I forgot to mention that the patch version is extension-lock-v12.patch.

--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com The Russian Postgres Company



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Tue, May 22, 2018 at 12:05 AM, Alex Ignatov <a.ignatov@postgrespro.ru> wrote:
> Hello!
> I want to try to test this patch on 302(704 ht) core machine.
>
> Patching master (commit 81256cd05f0745353c6572362155b57250a0d2a0) went fine, but I got errors while compiling:

Thank you for reporting.
Attached is a rebased patch against current HEAD.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Konstantin Knizhnik
Date:

On 26.04.2018 09:10, Masahiko Sawada wrote:
We at Postgres Professional ran into the relation extension lock
contention problem at two more customers and tried to use this patch
(v13) to address the issue.
Unfortunately, replacing the heavyweight lock with an LWLock couldn't
completely eliminate the contention; now most backends are blocked on
a condition variable:
0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
#0  0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x00000000007024ee in WaitEventSetWait ()
#2  0x0000000000718fa6 in ConditionVariableSleep ()
#3  0x000000000071954d in RelExtLockAcquire ()
#4  0x00000000004ba99d in RelationGetBufferForTuple ()
#5  0x00000000004b3f18 in heap_insert ()
#6  0x00000000006109c8 in ExecInsert ()
#7  0x0000000000611a49 in ExecModifyTable ()
#8  0x00000000005ef97a in standard_ExecutorRun ()
#9  0x000000000072440a in ProcessQuery ()
#10 0x0000000000724631 in PortalRunMulti ()
#11 0x00000000007250ec in PortalRun ()
#12 0x0000000000721287 in exec_simple_query ()
#13 0x0000000000722532 in PostgresMain ()
#14 0x000000000047a9eb in ServerLoop ()
#15 0x00000000006b9fe9 in PostmasterMain ()
#16 0x000000000047b431 in main ()

Obviously there is nothing surprising here: if a lot of processes try
to acquire the same exclusive lock, then high contention is expected.
I just want to note that this patch is not able to completely
eliminate the problem with a large number of concurrent inserts into
the same table.

The second problem we observed was even more critical: if a backend is
granted the relation extension lock and then hits some error before
releasing the lock, aborting the current transaction doesn't release
the lock (unlike a heavyweight lock) and the relation stays locked.
So the database is effectively stalled and the server has to be
restarted.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Andres Freund
Date:
Hi,

On 2018-06-04 16:47:29 +0300, Konstantin Knizhnik wrote:
> We in PostgresProc were faced with lock extension contention problem at two
> more customers and tried to use this patch (v13) to address this issue.
> Unfortunately replacing heavy lock with lwlock couldn't completely eliminate
> contention, now most of backends are blocked on conditional variable:
> 
> 0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #0  0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00000000007024ee in WaitEventSetWait ()
> #2  0x0000000000718fa6 in ConditionVariableSleep ()
> #3  0x000000000071954d in RelExtLockAcquire ()

That doesn't necessarily mean that the postgres code is at fault
here. It's entirely possible that the filesystem or storage is the
bottleneck.  Could you briefly describe the workload & hardware?


> Second problem we observed was even more critical: if a backend is granted
> relation extension lock and then got some error before releasing this lock,
> then abort of the current transaction doesn't release this lock (unlike
> heavy weight lock) and the relation is kept locked.
> So database is actually stalled and server has to be restarted.

That obviously needs to be fixed...

Greetings,

Andres Freund


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Mon, Jun 4, 2018 at 10:47 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
>
>
> On 26.04.2018 09:10, Masahiko Sawada wrote:
> We at PostgresPro were faced with the lock extension contention problem at two
> more customers and tried to use this patch (v13) to address the issue.
> Unfortunately, replacing the heavyweight lock with an lwlock couldn't completely
> eliminate contention; now most backends are blocked on a condition variable:
>
> 0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #0  0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00000000007024ee in WaitEventSetWait ()
> #2  0x0000000000718fa6 in ConditionVariableSleep ()
> #3  0x000000000071954d in RelExtLockAcquire ()
> #4  0x00000000004ba99d in RelationGetBufferForTuple ()
> #5  0x00000000004b3f18 in heap_insert ()
> #6  0x00000000006109c8 in ExecInsert ()
> #7  0x0000000000611a49 in ExecModifyTable ()
> #8  0x00000000005ef97a in standard_ExecutorRun ()
> #9  0x000000000072440a in ProcessQuery ()
> #10 0x0000000000724631 in PortalRunMulti ()
> #11 0x00000000007250ec in PortalRun ()
> #12 0x0000000000721287 in exec_simple_query ()
> #13 0x0000000000722532 in PostgresMain ()
> #14 0x000000000047a9eb in ServerLoop ()
> #15 0x00000000006b9fe9 in PostmasterMain ()
> #16 0x000000000047b431 in main ()
>
> Obviously there is nothing surprising here: if a lot of processes try to
> acquire the same exclusive lock, then high contention is expected.
> I just want to note that this patch is not able to completely eliminate
> the problem with a large number of concurrent inserts into the same table.
>
> Second problem we observed was even more critical: if a backend is granted
> the relation extension lock and then hits some error before releasing it,
> then abort of the current transaction doesn't release this lock (unlike a
> heavyweight lock) and the relation is kept locked.
> So the database is actually stalled and the server has to be restarted.
>

Thank you for reporting.

Regarding the second problem, I tried to reproduce that bug with the
latest version of the patch (v13) but could not. When a transaction aborts, we
call ResourceOwnerRelease()->ResourceOwnerReleaseInternal()->ProcReleaseLocks()->RelExtLockCleanup()
and clear whichever relext lock bits we are holding or waiting on. If we
raised an error after adding a relext lock bit but before incrementing
its holding count, then the relext lock would remain, but I couldn't
find any code that raises an error between those two points. Could you
please share concrete steps to reproduce the database stall, if
possible?
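The abort-time cleanup path described above can be modeled roughly as follows. This is an illustrative sketch only, not the patch's actual code: the per-backend bookkeeping names (relext_held_slot, relext_wait_slot, relext_lock_cleanup) are invented here and merely stand in for what RelExtLockCleanup() does with its real data structures.

```c
/* Sketch: each backend remembers at most one relext lock slot it holds
 * or waits on; the abort path clears whichever state is set.  All names
 * here are hypothetical, loosely mirroring the v13 patch. */
#include <assert.h>
#include <stdint.h>

#define N_RELEXTLOCK_ENTS 1024

static uint32_t lock_word[N_RELEXTLOCK_ENTS];   /* holder count per slot */
static uint32_t wait_count[N_RELEXTLOCK_ENTS];  /* waiter count per slot */

/* Per-backend state: which slot we hold / wait on, -1 if none. */
static int relext_held_slot = -1;
static int relext_wait_slot = -1;

static void relext_acquire(int slot)
{
    lock_word[slot]++;          /* increment the holding count */
    relext_held_slot = slot;    /* remember it for abort cleanup */
}

static void relext_start_wait(int slot)
{
    wait_count[slot]++;         /* register as a waiter */
    relext_wait_slot = slot;
}

/* Analogue of RelExtLockCleanup(), reached via ResourceOwnerRelease()
 * on transaction abort: clear held and waiting state, whichever is set. */
static void relext_lock_cleanup(void)
{
    if (relext_held_slot >= 0)
    {
        lock_word[relext_held_slot]--;
        relext_held_slot = -1;
    }
    if (relext_wait_slot >= 0)
    {
        wait_count[relext_wait_slot]--;
        relext_wait_slot = -1;
    }
}
```

The point of the sketch is that as long as every error path runs the cleanup function, a backend cannot leave a slot permanently held; a stall would require an error raised between setting the slot state and the matching count update.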

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Konstantin Knizhnik
Date:


On 04.06.2018 21:42, Andres Freund wrote:
Hi,

On 2018-06-04 16:47:29 +0300, Konstantin Knizhnik wrote:
We at PostgresPro were faced with the lock extension contention problem at two
more customers and tried to use this patch (v13) to address the issue.
Unfortunately, replacing the heavyweight lock with an lwlock couldn't completely
eliminate contention; now most backends are blocked on a condition variable:

0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
#0  0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x00000000007024ee in WaitEventSetWait ()
#2  0x0000000000718fa6 in ConditionVariableSleep ()
#3  0x000000000071954d in RelExtLockAcquire ()
That doesn't necessarily mean that the postgres code is at fault
here. It's entirely possible that the filesystem or storage is the
bottleneck.  Could you briefly describe the workload & hardware?

Workload is a combination of inserts and selects.
It looks like the shared locks obtained by selects cause starvation of inserts trying to get the exclusive relation extension lock.
The problem is fixed by the fair lwlock patch implemented by Alexander Korotkov. This patch prevents granting of a shared lock if the wait queue is not empty.
Maybe we should use this patch or find some other way to prevent starvation of writers on relation extension locks for such workloads.
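For illustration, the grant policy just described — refuse new shared lockers whenever a writer is queued — can be modeled as a small non-blocking sketch. This is not Alexander's actual patch; all names are invented, and real blocking/wakeup is replaced by try-style calls so the fairness rule is easy to see.

```c
/* Sketch of a writer-preferring reader-writer lock policy. */
#include <assert.h>
#include <stdbool.h>

typedef struct
{
    int  readers;          /* active shared holders */
    bool writer;           /* exclusive holder present? */
    int  writers_waiting;  /* writers queued for the lock */
} FairRWLock;

/* A reader may enter only if no writer holds the lock AND none is
 * queued -- the rule that stops a reader stream starving writers. */
bool fair_try_shared(FairRWLock *l)
{
    if (l->writer || l->writers_waiting > 0)
        return false;
    l->readers++;
    return true;
}

/* A writer either gets the lock immediately or joins the queue. */
bool fair_try_exclusive(FairRWLock *l)
{
    if (l->writer || l->readers > 0)
    {
        l->writers_waiting++;
        return false;
    }
    l->writer = true;
    return true;
}

void fair_release_shared(FairRWLock *l)
{
    l->readers--;
}

/* Once the last reader leaves, a queued writer is admitted before any
 * new readers, since fair_try_shared keeps failing while the queue is
 * non-empty. */
bool fair_grant_queued_writer(FairRWLock *l)
{
    if (l->writers_waiting > 0 && !l->writer && l->readers == 0)
    {
        l->writers_waiting--;
        l->writer = true;
        return true;
    }
    return false;
}
```

The trade-off the later messages discuss follows directly from this rule: readers lose throughput (they now wait behind queued writers), which is why such a change needs careful benchmarking.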



Second problem we observed was even more critical: if a backend is granted
the relation extension lock and then hits some error before releasing it,
then abort of the current transaction doesn't release this lock (unlike a
heavyweight lock) and the relation is kept locked.
So the database is actually stalled and the server has to be restarted.
That obviously needs to be fixed...

Sorry, it looks like the problem is more obscure than I expected.
What we have observed is that all backends are blocked on an lwlock (sorry, the stack trace is not complete):
#0  0x00007ff5a9c566d6 in futex_abstimed_wait_cancelable (private=128, abstime=0x0, expected=0, futex_word=0x7ff3c57b9b38) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#1  do_futex_wait (sem=sem@entry=0x7ff3c57b9b38, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007ff5a9c567c8 in __new_sem_wait_slow (sem=sem@entry=0x7ff3c57b9b38, abstime=0x0) at sem_waitcommon.c:181
#3  0x00007ff5a9c56839 in __new_sem_wait (sem=sem@entry=0x7ff3c57b9b38) at sem_wait.c:42
#4  0x000056290c901582 in PGSemaphoreLock (sema=0x7ff3c57b9b38) at pg_sema.c:310
#5  0x000056290c97923c in LWLockAcquire (lock=0x7ff3c7038c64, mode=LW_SHARED) at ./build/../src/backend/storage/lmgr/lwlock.c:1233


It happens after an error in a disk write operation. Unfortunately we do not have core files and are not able to reproduce the problem.
All LW-locks should be cleared by LWLockReleaseAll but ... for some reason that doesn't happen.
We will continue the investigation and try to reproduce the problem.
I will let you know if we find the cause of the problem.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Konstantin Knizhnik
Date:

On 05.06.2018 07:22, Masahiko Sawada wrote:
> On Mon, Jun 4, 2018 at 10:47 PM, Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>>
>> On 26.04.2018 09:10, Masahiko Sawada wrote:
>>> On Thu, Apr 26, 2018 at 3:30 AM, Robert Haas <robertmhaas@gmail.com>
>>> wrote:
>>>> On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada <sawada.mshk@gmail.com>
>>>> wrote:
>>>>> Never mind. There was a lot of items especially at the last CommitFest.
>>>>>
>>>>>> In terms of moving forward, I'd still like to hear what
>>>>>> Andres has to say about the comments I made on March 1st.
>>>>> Yeah, agreed.
>>>> $ ping -n andres.freund
>>>> Request timeout for icmp_seq 0
>>>> Request timeout for icmp_seq 1
>>>> Request timeout for icmp_seq 2
>>>> Request timeout for icmp_seq 3
>>>> Request timeout for icmp_seq 4
>>>> ^C
>>>> --- andres.freund ping statistics ---
>>>> 6 packets transmitted, 0 packets received, 100.0% packet loss
>>>>
>>>> Meanwhile,
>>>> https://www.postgresql.org/message-id/4c171ffe-e3ee-acc5-9066-a40d52bc5ae9@postgrespro.ru
>>>> shows that this patch has some benefits for other cases, which is a
>>>> point in favor IMHO.
>>> Thank you for sharing. That's good to know.
>>>
>>> Andres pointed out the performance degradation due to hash collision
>>> when multiple loading. I think the point is that it happens at where
>>> users don't know.  Therefore even if we make N_RELEXTLOCK_ENTS
>>> configurable parameter, since users don't know the hash collision they
>>> don't know when they should tune it.
>>>
>>> So it's just an idea but how about adding an SQL-callable function
>>> that returns the estimated number of lock waiters of the given
>>> relation? Since user knows how many processes are loading to the
>>> relation, if a returned value by the function is greater than the
>>> expected value user  can know hash collision and will be able to start
>>> to consider to increase N_RELEXTLOCK_ENTS.
>>>
>>> Regards,
>>>
>>> --
>>> Masahiko Sawada
>>> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
>>> NTT Open Source Software Center
>>>
>> We at PostgresPro were faced with the lock extension contention problem at two
>> more customers and tried to use this patch (v13) to address the issue.
>> Unfortunately, replacing the heavyweight lock with an lwlock couldn't completely
>> eliminate contention; now most backends are blocked on a condition variable:
>>
>> 0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
>> #0  0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
>> #1  0x00000000007024ee in WaitEventSetWait ()
>> #2  0x0000000000718fa6 in ConditionVariableSleep ()
>> #3  0x000000000071954d in RelExtLockAcquire ()
>> #4  0x00000000004ba99d in RelationGetBufferForTuple ()
>> #5  0x00000000004b3f18 in heap_insert ()
>> #6  0x00000000006109c8 in ExecInsert ()
>> #7  0x0000000000611a49 in ExecModifyTable ()
>> #8  0x00000000005ef97a in standard_ExecutorRun ()
>> #9  0x000000000072440a in ProcessQuery ()
>> #10 0x0000000000724631 in PortalRunMulti ()
>> #11 0x00000000007250ec in PortalRun ()
>> #12 0x0000000000721287 in exec_simple_query ()
>> #13 0x0000000000722532 in PostgresMain ()
>> #14 0x000000000047a9eb in ServerLoop ()
>> #15 0x00000000006b9fe9 in PostmasterMain ()
>> #16 0x000000000047b431 in main ()
>>
>> Obviously there is nothing surprising here: if a lot of processes try to
>> acquire the same exclusive lock, then high contention is expected.
>> I just want to note that this patch is not able to completely eliminate
>> the problem with a large number of concurrent inserts into the same table.
>>
>> Second problem we observed was even more critical: if a backend is granted
>> the relation extension lock and then hits some error before releasing it,
>> then abort of the current transaction doesn't release this lock (unlike a
>> heavyweight lock) and the relation is kept locked.
>> So the database is actually stalled and the server has to be restarted.
>>
> Thank you for reporting.
>
> Regarding the second problem, I tried to reproduce that bug with the
> latest version of the patch (v13) but could not. When a transaction aborts, we
> call ResourceOwnerRelease()->ResourceOwnerReleaseInternal()->ProcReleaseLocks()->RelExtLockCleanup()
> and clear whichever relext lock bits we are holding or waiting on. If we
> raised an error after adding a relext lock bit but before incrementing
> its holding count, then the relext lock would remain, but I couldn't
> find any code that raises an error between those two points. Could you please
> share the concrete reproduction

Sorry, my original guess that LW-locks are not released in case of
transaction abort is not correct.
There really was a situation where all backends were blocked on the relation
extension lock, and it looks like it happened after a disk write error,
but as it happened at a customer's site, we had no time for
investigation and are not able to reproduce this problem locally.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Alexander Korotkov
Date:
On Tue, Jun 5, 2018 at 12:48 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> Workload is a combination of inserts and selects.
> Looks like shared locks obtained by selects cause starvation of inserts trying to get the exclusive relation extension lock.
> The problem is fixed by the fair lwlock patch implemented by Alexander Korotkov. This patch prevents granting of a shared lock if the wait queue is not empty.
> Maybe we should use this patch or find some other way to prevent starvation of writers on relation extension locks for such workloads.

The fair lwlock patch really did fix starvation of exclusive lwlock waiters.
But that starvation happens not on the relation extension lock – selects
don't take a shared relation extension lock.  The real issue there was
not the relation extension lock itself, but the time spent inside this
lock.  It appears that buffer replacement happening inside the relation
extension lock is affected by starvation on exclusive buffer mapping
lwlocks and buffer content lwlocks, caused by many concurrent shared
lockers.  So the fair lwlock patch has no direct influence on the relation
extension lock, which with this patch is naturally not even an lwlock...

I'll post the fair lwlock patch in a separate thread.  It requires detailed
consideration and benchmarking, because there is a risk of regression
on specific workloads.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Tue, Jun 5, 2018 at 6:47 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
>
>
> On 05.06.2018 07:22, Masahiko Sawada wrote:
>>
>> On Mon, Jun 4, 2018 at 10:47 PM, Konstantin Knizhnik
>> <k.knizhnik@postgrespro.ru> wrote:
>>>
>>>
>>> On 26.04.2018 09:10, Masahiko Sawada wrote:
>>>>
>>>> On Thu, Apr 26, 2018 at 3:30 AM, Robert Haas <robertmhaas@gmail.com>
>>>> wrote:
>>>>>
>>>>> On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada
>>>>> <sawada.mshk@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Never mind. There was a lot of items especially at the last
>>>>>> CommitFest.
>>>>>>
>>>>>>> In terms of moving forward, I'd still like to hear what
>>>>>>> Andres has to say about the comments I made on March 1st.
>>>>>>
>>>>>> Yeah, agreed.
>>>>>
>>>>> $ ping -n andres.freund
>>>>> Request timeout for icmp_seq 0
>>>>> Request timeout for icmp_seq 1
>>>>> Request timeout for icmp_seq 2
>>>>> Request timeout for icmp_seq 3
>>>>> Request timeout for icmp_seq 4
>>>>> ^C
>>>>> --- andres.freund ping statistics ---
>>>>> 6 packets transmitted, 0 packets received, 100.0% packet loss
>>>>>
>>>>> Meanwhile,
>>>>>
>>>>> https://www.postgresql.org/message-id/4c171ffe-e3ee-acc5-9066-a40d52bc5ae9@postgrespro.ru
>>>>> shows that this patch has some benefits for other cases, which is a
>>>>> point in favor IMHO.
>>>>
>>>> Thank you for sharing. That's good to know.
>>>>
>>>> Andres pointed out the performance degradation due to hash collision
>>>> when multiple loading. I think the point is that it happens at where
>>>> users don't know.  Therefore even if we make N_RELEXTLOCK_ENTS
>>>> configurable parameter, since users don't know the hash collision they
>>>> don't know when they should tune it.
>>>>
>>>> So it's just an idea but how about adding an SQL-callable function
>>>> that returns the estimated number of lock waiters of the given
>>>> relation? Since user knows how many processes are loading to the
>>>> relation, if a returned value by the function is greater than the
>>>> expected value user  can know hash collision and will be able to start
>>>> to consider to increase N_RELEXTLOCK_ENTS.
>>>>
>>>> Regards,
>>>>
>>>> --
>>>> Masahiko Sawada
>>>> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
>>>> NTT Open Source Software Center
>>>>
>>> We at PostgresPro were faced with the lock extension contention problem at
>>> two more customers and tried to use this patch (v13) to address the issue.
>>> Unfortunately, replacing the heavyweight lock with an lwlock couldn't
>>> completely eliminate contention; now most backends are blocked on a
>>> condition variable:
>>>
>>> 0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
>>> #0  0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
>>> #1  0x00000000007024ee in WaitEventSetWait ()
>>> #2  0x0000000000718fa6 in ConditionVariableSleep ()
>>> #3  0x000000000071954d in RelExtLockAcquire ()
>>> #4  0x00000000004ba99d in RelationGetBufferForTuple ()
>>> #5  0x00000000004b3f18 in heap_insert ()
>>> #6  0x00000000006109c8 in ExecInsert ()
>>> #7  0x0000000000611a49 in ExecModifyTable ()
>>> #8  0x00000000005ef97a in standard_ExecutorRun ()
>>> #9  0x000000000072440a in ProcessQuery ()
>>> #10 0x0000000000724631 in PortalRunMulti ()
>>> #11 0x00000000007250ec in PortalRun ()
>>> #12 0x0000000000721287 in exec_simple_query ()
>>> #13 0x0000000000722532 in PostgresMain ()
>>> #14 0x000000000047a9eb in ServerLoop ()
>>> #15 0x00000000006b9fe9 in PostmasterMain ()
>>> #16 0x000000000047b431 in main ()
>>>
>>> Obviously there is nothing surprising here: if a lot of processes try to
>>> acquire the same exclusive lock, then high contention is expected.
>>> I just want to note that this patch is not able to completely eliminate
>>> the problem with a large number of concurrent inserts into the same table.
>>>
>>> Second problem we observed was even more critical: if a backend is granted
>>> the relation extension lock and then hits some error before releasing it,
>>> then abort of the current transaction doesn't release this lock (unlike a
>>> heavyweight lock) and the relation is kept locked.
>>> So the database is actually stalled and the server has to be restarted.
>>>
>> Thank you for reporting.
>>
>> Regarding the second problem, I tried to reproduce that bug with the
>> latest version of the patch (v13) but could not. When a transaction aborts, we
>> call
>> ResourceOwnerRelease()->ResourceOwnerReleaseInternal()->ProcReleaseLocks()->RelExtLockCleanup()
>> and clear whichever relext lock bits we are holding or waiting on. If we
>> raised an error after adding a relext lock bit but before incrementing
>> its holding count, then the relext lock would remain, but I couldn't
>> find any code that raises an error between those two points. Could you please
>> share the concrete reproduction
>
>
> Sorry, my original guess that LW-locks are not released in case of
> transaction abort is not correct.
> There really was a situation where all backends were blocked on the relation
> extension lock, and it looks like it happened after a disk write error,

You're saying that it's not correct that LWLocks are not released, but
that it is correct that all backends were blocked on the relext lock;
yet in your other mail you say the opposite. Which is correct?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Konstantin Knizhnik
Date:

On 05.06.2018 13:29, Masahiko Sawada wrote:
> On Tue, Jun 5, 2018 at 6:47 PM, Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>>
>> On 05.06.2018 07:22, Masahiko Sawada wrote:
>>> On Mon, Jun 4, 2018 at 10:47 PM, Konstantin Knizhnik
>>> <k.knizhnik@postgrespro.ru> wrote:
>>>>
>>>> On 26.04.2018 09:10, Masahiko Sawada wrote:
>>>>> On Thu, Apr 26, 2018 at 3:30 AM, Robert Haas <robertmhaas@gmail.com>
>>>>> wrote:
>>>>>> On Tue, Apr 10, 2018 at 9:08 PM, Masahiko Sawada
>>>>>> <sawada.mshk@gmail.com>
>>>>>> wrote:
>>>>>>> Never mind. There was a lot of items especially at the last
>>>>>>> CommitFest.
>>>>>>>
>>>>>>>> In terms of moving forward, I'd still like to hear what
>>>>>>>> Andres has to say about the comments I made on March 1st.
>>>>>>> Yeah, agreed.
>>>>>> $ ping -n andres.freund
>>>>>> Request timeout for icmp_seq 0
>>>>>> Request timeout for icmp_seq 1
>>>>>> Request timeout for icmp_seq 2
>>>>>> Request timeout for icmp_seq 3
>>>>>> Request timeout for icmp_seq 4
>>>>>> ^C
>>>>>> --- andres.freund ping statistics ---
>>>>>> 6 packets transmitted, 0 packets received, 100.0% packet loss
>>>>>>
>>>>>> Meanwhile,
>>>>>>
>>>>>> https://www.postgresql.org/message-id/4c171ffe-e3ee-acc5-9066-a40d52bc5ae9@postgrespro.ru
>>>>>> shows that this patch has some benefits for other cases, which is a
>>>>>> point in favor IMHO.
>>>>> Thank you for sharing. That's good to know.
>>>>>
>>>>> Andres pointed out the performance degradation due to hash collision
>>>>> when multiple loading. I think the point is that it happens at where
>>>>> users don't know.  Therefore even if we make N_RELEXTLOCK_ENTS
>>>>> configurable parameter, since users don't know the hash collision they
>>>>> don't know when they should tune it.
>>>>>
>>>>> So it's just an idea but how about adding an SQL-callable function
>>>>> that returns the estimated number of lock waiters of the given
>>>>> relation? Since user knows how many processes are loading to the
>>>>> relation, if a returned value by the function is greater than the
>>>>> expected value user  can know hash collision and will be able to start
>>>>> to consider to increase N_RELEXTLOCK_ENTS.
>>>>>
>>>>> Regards,
>>>>>
>>>>> --
>>>>> Masahiko Sawada
>>>>> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
>>>>> NTT Open Source Software Center
>>>>>
>>>> We at PostgresPro were faced with the lock extension contention problem at
>>>> two more customers and tried to use this patch (v13) to address the issue.
>>>> Unfortunately, replacing the heavyweight lock with an lwlock couldn't
>>>> completely eliminate contention; now most backends are blocked on a
>>>> condition variable:
>>>>
>>>> 0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
>>>> #0  0x00007fb03a318903 in __epoll_wait_nocancel () from /lib64/libc.so.6
>>>> #1  0x00000000007024ee in WaitEventSetWait ()
>>>> #2  0x0000000000718fa6 in ConditionVariableSleep ()
>>>> #3  0x000000000071954d in RelExtLockAcquire ()
>>>> #4  0x00000000004ba99d in RelationGetBufferForTuple ()
>>>> #5  0x00000000004b3f18 in heap_insert ()
>>>> #6  0x00000000006109c8 in ExecInsert ()
>>>> #7  0x0000000000611a49 in ExecModifyTable ()
>>>> #8  0x00000000005ef97a in standard_ExecutorRun ()
>>>> #9  0x000000000072440a in ProcessQuery ()
>>>> #10 0x0000000000724631 in PortalRunMulti ()
>>>> #11 0x00000000007250ec in PortalRun ()
>>>> #12 0x0000000000721287 in exec_simple_query ()
>>>> #13 0x0000000000722532 in PostgresMain ()
>>>> #14 0x000000000047a9eb in ServerLoop ()
>>>> #15 0x00000000006b9fe9 in PostmasterMain ()
>>>> #16 0x000000000047b431 in main ()
>>>>
>>>> Obviously there is nothing surprising here: if a lot of processes try to
>>>> acquire the same exclusive lock, then high contention is expected.
>>>> I just want to note that this patch is not able to completely eliminate
>>>> the problem with a large number of concurrent inserts into the same table.
>>>>
>>>> Second problem we observed was even more critical: if a backend is granted
>>>> the relation extension lock and then hits some error before releasing it,
>>>> then abort of the current transaction doesn't release this lock (unlike a
>>>> heavyweight lock) and the relation is kept locked.
>>>> So the database is actually stalled and the server has to be restarted.
>>>>
>>> Thank you for reporting.
>>>
>>> Regarding the second problem, I tried to reproduce that bug with the
>>> latest version of the patch (v13) but could not. When a transaction aborts, we
>>> call
>>> ResourceOwnerRelease()->ResourceOwnerReleaseInternal()->ProcReleaseLocks()->RelExtLockCleanup()
>>> and clear whichever relext lock bits we are holding or waiting on. If we
>>> raised an error after adding a relext lock bit but before incrementing
>>> its holding count, then the relext lock would remain, but I couldn't
>>> find any code that raises an error between those two points. Could you please
>>> share the concrete reproduction
>>
>> Sorry, my original guess that LW-locks are not released in case of
>> transaction abort is not correct.
>> There really was a situation where all backends were blocked on the relation
>> extension lock, and it looks like it happened after a disk write error,
> You're saying that it is not correct that LWlock are not released but
> it's correct that all backends were blocked in relext lock, but in
> other your mail you're saying something opposite. Which is correct?
I am sorry for the confusion. I have not investigated the core files myself and
just shared information received from our engineer.
It looks like this problem may not be related to relation extension locks at all.
Sorry for the false alarm.



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Andres Freund
Date:
On 2018-06-05 13:09:08 +0300, Alexander Korotkov wrote:
> On Tue, Jun 5, 2018 at 12:48 PM Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
> > Workload is a combination of inserts and selects.
> > Looks like shared locks obtained by selects cause starvation of inserts trying to get the exclusive relation extension lock.
> > The problem is fixed by the fair lwlock patch implemented by Alexander Korotkov. This patch prevents granting of a shared lock if the wait queue is not empty.
> > May be we should use this patch or find some other way to prevent starvation of writers on relation extension locks for such workloads.
> 
> The fair lwlock patch really did fix starvation of exclusive lwlock waiters.
> But that starvation happens not on the relation extension lock – selects
> don't take a shared relation extension lock.  The real issue there was
> not the relation extension lock itself, but the time spent inside this
> lock.

Yea, that makes a lot more sense to me.


> It appears that buffer replacement happening inside the relation
> extension lock is affected by starvation on exclusive buffer mapping
> lwlocks and buffer content lwlocks, caused by many concurrent shared
> lockers.  So the fair lwlock patch has no direct influence on the relation
> extension lock, which with this patch is naturally not even an lwlock...

Yea, that makes sense. I wonder how much the fix here is to "pre-clear"
a victim buffer, and how much is a saner buffer replacement
implementation (either by going away from O(NBuffers), or by having a
queue of clean victim buffers like my bgwriter replacement).
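The "queue of clean victim buffers" idea mentioned above might look roughly like the sketch below: a background writer pre-cleans buffers and publishes their ids, and a backend extending a relation pops a ready victim instead of running the clock sweep while holding the extension lock. This is purely illustrative; all names are invented and locking around the ring is omitted.

```c
/* Sketch: a fixed-size FIFO ring of pre-cleaned victim buffer ids. */
#include <assert.h>
#include <stdbool.h>

#define VICTIM_QUEUE_SIZE 128

static int victim_ring[VICTIM_QUEUE_SIZE];
static int vq_head, vq_tail, vq_count;

/* bgwriter side: offer a buffer it has already written out */
bool victim_queue_push(int buf_id)
{
    if (vq_count == VICTIM_QUEUE_SIZE)
        return false;               /* queue full; caller keeps the buffer */
    victim_ring[vq_tail] = buf_id;
    vq_tail = (vq_tail + 1) % VICTIM_QUEUE_SIZE;
    vq_count++;
    return true;
}

/* backend side: grab a clean victim, or -1 to fall back to the sweep */
int victim_queue_pop(void)
{
    int buf_id;

    if (vq_count == 0)
        return -1;
    buf_id = victim_ring[vq_head];
    vq_head = (vq_head + 1) % VICTIM_QUEUE_SIZE;
    vq_count--;
    return buf_id;
}
```

The benefit in this context is that the expensive part of buffer replacement (finding and writing a dirty victim) moves outside the relation extension lock's critical section.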


> I'll post fair lwlock path in a separate thread.  It requires detailed
> consideration and benchmarking, because there is a risk of regression
> on specific workloads.

I bet that doing it naively will regress massively in a number of cases.

Greetings,

Andres Freund


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Alexander Korotkov
Date:
On Tue, Jun 5, 2018 at 4:02 PM Andres Freund <andres@anarazel.de> wrote:
> On 2018-06-05 13:09:08 +0300, Alexander Korotkov wrote:
> > It appears that buffer replacement happening inside the relation
> > extension lock is affected by starvation on exclusive buffer mapping
> > lwlocks and buffer content lwlocks, caused by many concurrent shared
> > lockers.  So the fair lwlock patch has no direct influence on the relation
> > extension lock, which with this patch is naturally not even an lwlock...
>
> Yea, that makes sense. I wonder how much the fix here is to "pre-clear"
> a victim buffer, and how much is a saner buffer replacement
> implementation (either by going away from O(NBuffers), or by having a
> queue of clean victim buffers like my bgwriter replacement).

The particular thing I observed in our environment is BufferAlloc()
waiting hours on a buffer partition lock.  Increasing NUM_BUFFER_PARTITIONS
didn't give any significant help.  It appears that a very hot page (the root page
of some frequently used index) resides in that partition, so this partition was
continuously under shared lock.  So, in order to resolve this without changing
LWLock, we should probably move our buffer hash table to something
lockless.

> > I'll post fair lwlock path in a separate thread.  It requires detailed
> > consideration and benchmarking, because there is a risk of regression
> > on specific workloads.
>
> I bet that doing it naively will regress massively in a number of cases.

Yes, I suspect the same.  However, I tend to think that something is wrong
with LWLock itself.  It seems to be the only one of our locks that subjects
some lockers to almost infinite starvation under certain workloads.  In
contrast, even our SpinLock gives all the waiting processes nearly the same
chance to acquire it.  So, I think the idea of improving LWLock in this
respect deserves at least further investigation.
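The fairness property being contrasted here — every waiter eventually gets its turn, in arrival order — is the guarantee a ticket lock makes explicit. A minimal sketch for illustration (PostgreSQL's actual s_lock.c spinlock is a plain test-and-set, not a ticket lock; the names below are invented):

```c
/* Sketch: a ticket lock grants the lock strictly in FIFO ticket order,
 * so no waiter can be starved indefinitely. */
#include <assert.h>
#include <stdatomic.h>

typedef struct
{
    atomic_uint next_ticket;    /* drawn by each arriving locker */
    atomic_uint now_serving;    /* ticket currently allowed in */
} TicketLock;

void ticket_lock_init(TicketLock *l)
{
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

/* Draw a ticket, then spin until it is our turn.  Returns the ticket,
 * which is also the locker's position in the arrival order. */
unsigned ticket_lock_acquire(TicketLock *l)
{
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);

    while (atomic_load(&l->now_serving) != my)
        ;                       /* spin; waiters are released in order */
    return my;
}

void ticket_lock_release(TicketLock *l)
{
    atomic_fetch_add(&l->now_serving, 1);   /* admit the next ticket */
}
```

The contrast with LWLock's shared/exclusive grant policy is that a ticket lock has no lock-mode shortcut a continuous stream of one class of lockers could exploit.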

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


On Tue, Jun 5, 2018 at 7:35 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Tue, Jun 5, 2018 at 4:02 PM Andres Freund <andres@anarazel.de> wrote:
> On 2018-06-05 13:09:08 +0300, Alexander Korotkov wrote:
> > It appears that buffer replacement happening inside the relation
> > extension lock is affected by starvation on exclusive buffer mapping
> > lwlocks and buffer content lwlocks, caused by many concurrent shared
> > lockers.  So the fair lwlock patch has no direct influence on the relation
> > extension lock, which with this patch is naturally not even an lwlock...
>
> Yea, that makes sense. I wonder how much the fix here is to "pre-clear"
> a victim buffer, and how much is a saner buffer replacement
> implementation (either by going away from O(NBuffers), or by having a
> queue of clean victim buffers like my bgwriter replacement).

The particular thing I observed in our environment is BufferAlloc()
waiting hours on a buffer partition lock.  Increasing NUM_BUFFER_PARTITIONS
didn't give any significant help.  It appears that a very hot page (the root page
of some frequently used index) resides in that partition, so this partition was
continuously under shared lock.  So, in order to resolve this without changing
LWLock, we should probably move our buffer hash table to something
lockless.


I think Robert's chash stuff [1] might be helpful to reduce the contention you are seeing.



--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
>>> I think the real question is whether the scenario is common enough to
>>> worry about.  In practice, you'd have to be extremely unlucky to be
>>> doing many bulk loads at the same time that all happened to hash to
>>> the same bucket.
>>
>> With a bunch of parallel bulkloads into partitioned tables that really
>> doesn't seem that unlikely?
>
> It increases the likelihood of collisions, but probably decreases the
> number of cases where the contention gets really bad.
>
> For example, suppose each table has 100 partitions and you are
> bulk-loading 10 of them at a time.  It's virtually certain that you
> will have some collisions, but the amount of contention within each
> bucket will remain fairly low because each backend spends only 1% of
> its time in the bucket corresponding to any given partition.
>

I'd like to share another performance evaluation result comparing current
HEAD and current HEAD with the v13 patch (N_RELEXTLOCK_ENTS = 1024).

Type of table: normal table, unlogged table
Number of child tables: 16, 64 (all tables are located in the same tablespace)
Number of clients: 32
Number of trials: 100
Duration: 180 seconds for each trial

The hardware spec of the server is Intel Xeon 2.4GHz (HT 160 cores), 256GB
RAM, NVMe SSD 1.5TB.
Each client loads 10kB of random data across all partitioned tables.

Here is the result.

 childs |   type   | target  |  avg_tps   | diff with HEAD
--------+----------+---------+------------+------------------
     16 | normal   | HEAD    |   1643.833 |
     16 | normal   | Patched |  1619.5404 |      0.985222
     16 | unlogged | HEAD    |  9069.3543 |
     16 | unlogged | Patched |  9368.0263 |      1.032932
     64 | normal   | HEAD    |   1598.698 |
     64 | normal   | Patched |  1587.5906 |      0.993052
     64 | unlogged | HEAD    |  9629.7315 |
     64 | unlogged | Patched | 10208.2196 |      1.060073
(8 rows)

For normal tables, loading tps decreased 1% ~ 2% with this patch
whereas it increased 3% ~ 6% for unlogged tables. There were
collisions at 0 ~ 5 relation extension lock slots between 2 relations
in the 64 child tables case but it didn't seem to affect the tps.
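The collision odds Robert describes above can be estimated with a quick
birthday-problem calculation. This is only an illustrative sketch: it assumes
the patch's default of 1024 slots (N_RELEXTLOCK_ENTS) and a uniformly
distributed hash, not the actual tag_hash behavior.

```python
# Rough estimate of hash-bucket collision odds for relation extension
# locks.  N_SLOTS mirrors the patch's default N_RELEXTLOCK_ENTS; a
# uniform hash is assumed, so the figures are illustrative only.

N_SLOTS = 1024

def p_no_collision(k: int, n: int = N_SLOTS) -> float:
    """Probability that k concurrently-extended relations all hash to
    distinct slots (the classic birthday problem)."""
    p = 1.0
    for i in range(k):
        p *= (n - i) / n
    return p

for k in (2, 10, 64):
    print(f"{k:3d} relations: P(some collision) = {1 - p_no_collision(k):.4f}")
```

With 10 concurrent bulk loads the chance of any collision is only a few
percent, and even a collision only means two relations share one lock slot,
which matches Robert's point that contention within a shared bucket stays low.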

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Michael Paquier
Date:
On Wed, Jun 06, 2018 at 07:03:47PM +0530, Amit Kapila wrote:
> I think Robert's chash stuff [1] might be helpful to reduce the contention
> you are seeing.

Latest patch available does not apply, so I moved it to next CF.  The
thread has died a bit as well...
--
Michael

Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Dmitry Dolgov
Date:
> On Mon, Oct 1, 2018 at 8:54 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Jun 06, 2018 at 07:03:47PM +0530, Amit Kapila wrote:
> > I think Robert's chash stuff [1] might be helpful to reduce the contention
> > you are seeing.
>
> Latest patch available does not apply, so I moved it to next CF.  The
> thread has died a bit as well...

Unfortunately, the patch still needs to be rebased. Could you do this, and are
there any plans for the patch?


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Fri, Nov 30, 2018 at 1:17 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Mon, Oct 1, 2018 at 8:54 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Wed, Jun 06, 2018 at 07:03:47PM +0530, Amit Kapila wrote:
> > > I think Robert's chash stuff [1] might be helpful to reduce the contention
> > > you are seeing.
> >
> > Latest patch available does not apply, so I moved it to next CF.  The
> > thread has died a bit as well...
>
> Unfortunately, the patch still needs to be rebased. Could you do this, and are
> there any plans for the patch?

I have a plan, but it's for the future. This patch is for the parallel
vacuum patch. As I mentioned on that thread[1], I'm focusing only on
parallel index vacuum, which does not require the relation extension
lock improvements for now. Therefore, I want to withdraw this patch
and reactivate it when we need this enhancement.

So I think we can mark it as 'Returned with feedback'.

[1] https://www.postgresql.org/message-id/CAD21AoDhAutvKbQ37Btf4taMVbQaOaSvOpxpLgu814T1-OqYGg%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Amit Kapila
Date:
On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
> >>> I think the real question is whether the scenario is common enough to
> >>> worry about.  In practice, you'd have to be extremely unlucky to be
> >>> doing many bulk loads at the same time that all happened to hash to
> >>> the same bucket.
> >>
> >> With a bunch of parallel bulkloads into partitioned tables that really
> >> doesn't seem that unlikely?
> >
> > It increases the likelihood of collisions, but probably decreases the
> > number of cases where the contention gets really bad.
> >
> > For example, suppose each table has 100 partitions and you are
> > bulk-loading 10 of them at a time.  It's virtually certain that you
> > will have some collisions, but the amount of contention within each
> > bucket will remain fairly low because each backend spends only 1% of
> > its time in the bucket corresponding to any given partition.
> >
>
> I share another result of performance evaluation between current HEAD
> and current HEAD with v13 patch(N_RELEXTLOCK_ENTS = 1024).
>
> Type of table: normal table, unlogged table
> Number of child tables : 16, 64 (all tables are located on the same tablespace)
> Number of clients : 32
> Number of trials : 100
> Duration: 180 seconds for each trials
>
> The hardware spec of server is Intel Xeon 2.4GHz (HT 160cores), 256GB
> RAM, NVMe SSD 1.5TB.
> Each clients load 10kB random data across all partitioned tables.
>
> Here is the result.
>
>  childs |   type   | target  |  avg_tps   | diff with HEAD
> --------+----------+---------+------------+------------------
>      16 | normal   | HEAD    |   1643.833 |
>      16 | normal   | Patched |  1619.5404 |      0.985222
>      16 | unlogged | HEAD    |  9069.3543 |
>      16 | unlogged | Patched |  9368.0263 |      1.032932
>      64 | normal   | HEAD    |   1598.698 |
>      64 | normal   | Patched |  1587.5906 |      0.993052
>      64 | unlogged | HEAD    |  9629.7315 |
>      64 | unlogged | Patched | 10208.2196 |      1.060073
> (8 rows)
>
> For normal tables, loading tps decreased 1% ~ 2% with this patch
> whereas it increased 3% ~ 6% for unlogged tables. There were
> collisions at 0 ~ 5 relation extension lock slots between 2 relations
> in the 64 child tables case but it didn't seem to affect the tps.
>

AFAIU, this resembles the workload that Andres was worried about.  I
think we should run this test once in a different environment, but
assuming it is correct and repeatable, where do we go with this patch,
especially when we know it improves many workloads [1] as well?  We
know that a pathological case constructed by Mithun [2] causes a
regression.  I am not sure the test done by Mithun really mimics any
real-world workload, as he tested with N_RELEXTLOCK_ENTS = 1 to hit
the worst case.

Sawada-San, if you have a script or data for the test done by you,
then please share it so that others can also try to reproduce it.

[1] - https://www.postgresql.org/message-id/4c171ffe-e3ee-acc5-9066-a40d52bc5ae9%40postgrespro.ru
[2] - https://www.postgresql.org/message-id/CAD__Oug52j%3DDQMoP2b%3DVY7wZb0S9wMNu4irXOH3-ZjFkzWZPGg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Mon, Feb 3, 2020 at 8:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
> > >>> I think the real question is whether the scenario is common enough to
> > >>> worry about.  In practice, you'd have to be extremely unlucky to be
> > >>> doing many bulk loads at the same time that all happened to hash to
> > >>> the same bucket.
> > >>
> > >> With a bunch of parallel bulkloads into partitioned tables that really
> > >> doesn't seem that unlikely?
> > >
> > > It increases the likelihood of collisions, but probably decreases the
> > > number of cases where the contention gets really bad.
> > >
> > > For example, suppose each table has 100 partitions and you are
> > > bulk-loading 10 of them at a time.  It's virtually certain that you
> > > will have some collisions, but the amount of contention within each
> > > bucket will remain fairly low because each backend spends only 1% of
> > > its time in the bucket corresponding to any given partition.
> > >
> >
> > I share another result of performance evaluation between current HEAD
> > and current HEAD with v13 patch(N_RELEXTLOCK_ENTS = 1024).
> >
> > Type of table: normal table, unlogged table
> > Number of child tables : 16, 64 (all tables are located on the same tablespace)
> > Number of clients : 32
> > Number of trials : 100
> > Duration: 180 seconds for each trials
> >
> > The hardware spec of server is Intel Xeon 2.4GHz (HT 160cores), 256GB
> > RAM, NVMe SSD 1.5TB.
> > Each clients load 10kB random data across all partitioned tables.
> >
> > Here is the result.
> >
> >  childs |   type   | target  |  avg_tps   | diff with HEAD
> > --------+----------+---------+------------+------------------
> >      16 | normal   | HEAD    |   1643.833 |
> >      16 | normal   | Patched |  1619.5404 |      0.985222
> >      16 | unlogged | HEAD    |  9069.3543 |
> >      16 | unlogged | Patched |  9368.0263 |      1.032932
> >      64 | normal   | HEAD    |   1598.698 |
> >      64 | normal   | Patched |  1587.5906 |      0.993052
> >      64 | unlogged | HEAD    |  9629.7315 |
> >      64 | unlogged | Patched | 10208.2196 |      1.060073
> > (8 rows)
> >
> > For normal tables, loading tps decreased 1% ~ 2% with this patch
> > whereas it increased 3% ~ 6% for unlogged tables. There were
> > collisions at 0 ~ 5 relation extension lock slots between 2 relations
> > in the 64 child tables case but it didn't seem to affect the tps.
> >
>
> AFAIU, this resembles the workload that Andres was worried about.   I
> think we should once run this test in a different environment, but
> considering this to be correct and repeatable, where do we go with
> this patch especially when we know it improves many workloads [1] as
> well.  We know that on a pathological case constructed by Mithun [2],
> this causes regression as well.  I am not sure if the test done by
> Mithun really mimics any real-world workload as he has tested by
> making N_RELEXTLOCK_ENTS = 1 to hit the worst case.
>
> Sawada-San, if you have a script or data for the test done by you,
> then please share it so that others can also try to reproduce it.

Unfortunately, the environment I used for performance verification is
no longer available.

I agree with running this test in a different environment. I've attached
the rebased version of the patch. I'm measuring the performance
with/without the patch and will share the results.

Regards,

--
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Mahendra Singh Thalor
Date:
On Wed, 5 Feb 2020 at 12:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Feb 3, 2020 at 8:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > > On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
> > > >>> I think the real question is whether the scenario is common enough to
> > > >>> worry about.  In practice, you'd have to be extremely unlucky to be
> > > >>> doing many bulk loads at the same time that all happened to hash to
> > > >>> the same bucket.
> > > >>
> > > >> With a bunch of parallel bulkloads into partitioned tables that really
> > > >> doesn't seem that unlikely?
> > > >
> > > > It increases the likelihood of collisions, but probably decreases the
> > > > number of cases where the contention gets really bad.
> > > >
> > > > For example, suppose each table has 100 partitions and you are
> > > > bulk-loading 10 of them at a time.  It's virtually certain that you
> > > > will have some collisions, but the amount of contention within each
> > > > bucket will remain fairly low because each backend spends only 1% of
> > > > its time in the bucket corresponding to any given partition.
> > > >
> > >
> > > I share another result of performance evaluation between current HEAD
> > > and current HEAD with v13 patch(N_RELEXTLOCK_ENTS = 1024).
> > >
> > > Type of table: normal table, unlogged table
> > > Number of child tables : 16, 64 (all tables are located on the same tablespace)
> > > Number of clients : 32
> > > Number of trials : 100
> > > Duration: 180 seconds for each trials
> > >
> > > The hardware spec of server is Intel Xeon 2.4GHz (HT 160cores), 256GB
> > > RAM, NVMe SSD 1.5TB.
> > > Each clients load 10kB random data across all partitioned tables.
> > >
> > > Here is the result.
> > >
> > >  childs |   type   | target  |  avg_tps   | diff with HEAD
> > > --------+----------+---------+------------+------------------
> > >      16 | normal   | HEAD    |   1643.833 |
> > >      16 | normal   | Patched |  1619.5404 |      0.985222
> > >      16 | unlogged | HEAD    |  9069.3543 |
> > >      16 | unlogged | Patched |  9368.0263 |      1.032932
> > >      64 | normal   | HEAD    |   1598.698 |
> > >      64 | normal   | Patched |  1587.5906 |      0.993052
> > >      64 | unlogged | HEAD    |  9629.7315 |
> > >      64 | unlogged | Patched | 10208.2196 |      1.060073
> > > (8 rows)
> > >
> > > For normal tables, loading tps decreased 1% ~ 2% with this patch
> > > whereas it increased 3% ~ 6% for unlogged tables. There were
> > > collisions at 0 ~ 5 relation extension lock slots between 2 relations
> > > in the 64 child tables case but it didn't seem to affect the tps.
> > >
> >
> > AFAIU, this resembles the workload that Andres was worried about.   I
> > think we should once run this test in a different environment, but
> > considering this to be correct and repeatable, where do we go with
> > this patch especially when we know it improves many workloads [1] as
> > well.  We know that on a pathological case constructed by Mithun [2],
> > this causes regression as well.  I am not sure if the test done by
> > Mithun really mimics any real-world workload as he has tested by
> > making N_RELEXTLOCK_ENTS = 1 to hit the worst case.
> >
> > Sawada-San, if you have a script or data for the test done by you,
> > then please share it so that others can also try to reproduce it.
>
> Unfortunately the environment I used for performance verification is
> no longer available.
>
> I agree to run this test in a different environment. I've attached the
> rebased version patch. I'm measuring the performance with/without
> patch, so will share the results.
>

Thanks Sawada-san for patch.

For the last few days, I have been reading this thread and reviewing the v13 patch. To debug and test, I rebased the v13 patch and compared my rebased patch with the v14 patch. I think the ordering of header files is not alphabetical in the v14 patch. (I haven't reviewed the v14 patch fully because, before reviewing, I wanted to test false sharing.) While debugging, I didn't notice any hang or lock-related issue.

I did some testing for false sharing (bulk insert, COPY data, bulk insert into partitioned tables). Below is the testing summary.

Test setup(Bulk insert into partition tables):
autovacuum=off
shared_buffers=512MB -c max_wal_size=20GB -c checkpoint_timeout=12min

Basically, I created a table with 13 partitions. Using pgbench, I inserted bulk data. I used below pgbench command:
./pgbench -c $threads -j $threads -T 180 -f insert1.sql@1 -f insert2.sql@1 -f insert3.sql@1 -f insert4.sql@1 postgres

I took scripts from previous mails and modified them. For reference, I am attaching the test scripts. I tested with the default 1024 slots (N_RELEXTLOCK_ENTS = 1024).

Clients          HEAD (tps)                     With v14 patch (tps)      %change      (time: 180s)
1                    92.979796                        100.877446                     +8.49 %
32                   392.881863                      388.470622                    -1.12 %
56                   551.753235                       528.018852                   -4.30 %
60                   648.273767                       653.251507                   +0.76 %
64                   645.975124                       671.322140                   +3.92 %
66                   662.728010                       673.399762                   +1.61 %            
70                   647.103183                       660.694914                   +2.10 %
74                   648.824027                       676.487622                  +4.26 %        

From the above results, we can see that in most cases TPS is slightly increased with the v14 patch. I am still testing and will post more results.

I want to stress the extension lock by blocking use of the FSM (use_fsm = false in the code). I think that if we block use of the FSM, the load on the extension lock will increase. Is this the correct way to test?

Please let me know if you have any specific testing scenario.

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Amit Kapila
Date:
On Thu, Feb 6, 2020 at 1:57 AM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
>
> On Wed, 5 Feb 2020 at 12:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Feb 3, 2020 at 8:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > > > On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
> > > > >>> I think the real question is whether the scenario is common enough to
> > > > >>> worry about.  In practice, you'd have to be extremely unlucky to be
> > > > >>> doing many bulk loads at the same time that all happened to hash to
> > > > >>> the same bucket.
> > > > >>
> > > > >> With a bunch of parallel bulkloads into partitioned tables that really
> > > > >> doesn't seem that unlikely?
> > > > >
> > > > > It increases the likelihood of collisions, but probably decreases the
> > > > > number of cases where the contention gets really bad.
> > > > >
> > > > > For example, suppose each table has 100 partitions and you are
> > > > > bulk-loading 10 of them at a time.  It's virtually certain that you
> > > > > will have some collisions, but the amount of contention within each
> > > > > bucket will remain fairly low because each backend spends only 1% of
> > > > > its time in the bucket corresponding to any given partition.
> > > > >
> > > >
> > > > I share another result of performance evaluation between current HEAD
> > > > and current HEAD with v13 patch(N_RELEXTLOCK_ENTS = 1024).
> > > >
> > > > Type of table: normal table, unlogged table
> > > > Number of child tables : 16, 64 (all tables are located on the same tablespace)
> > > > Number of clients : 32
> > > > Number of trials : 100
> > > > Duration: 180 seconds for each trials
> > > >
> > > > The hardware spec of server is Intel Xeon 2.4GHz (HT 160cores), 256GB
> > > > RAM, NVMe SSD 1.5TB.
> > > > Each clients load 10kB random data across all partitioned tables.
> > > >
> > > > Here is the result.
> > > >
> > > >  childs |   type   | target  |  avg_tps   | diff with HEAD
> > > > --------+----------+---------+------------+------------------
> > > >      16 | normal   | HEAD    |   1643.833 |
> > > >      16 | normal   | Patched |  1619.5404 |      0.985222
> > > >      16 | unlogged | HEAD    |  9069.3543 |
> > > >      16 | unlogged | Patched |  9368.0263 |      1.032932
> > > >      64 | normal   | HEAD    |   1598.698 |
> > > >      64 | normal   | Patched |  1587.5906 |      0.993052
> > > >      64 | unlogged | HEAD    |  9629.7315 |
> > > >      64 | unlogged | Patched | 10208.2196 |      1.060073
> > > > (8 rows)
> > > >
> > > > For normal tables, loading tps decreased 1% ~ 2% with this patch
> > > > whereas it increased 3% ~ 6% for unlogged tables. There were
> > > > collisions at 0 ~ 5 relation extension lock slots between 2 relations
> > > > in the 64 child tables case but it didn't seem to affect the tps.
> > > >
> > >
> > > AFAIU, this resembles the workload that Andres was worried about.   I
> > > think we should once run this test in a different environment, but
> > > considering this to be correct and repeatable, where do we go with
> > > this patch especially when we know it improves many workloads [1] as
> > > well.  We know that on a pathological case constructed by Mithun [2],
> > > this causes regression as well.  I am not sure if the test done by
> > > Mithun really mimics any real-world workload as he has tested by
> > > making N_RELEXTLOCK_ENTS = 1 to hit the worst case.
> > >
> > > Sawada-San, if you have a script or data for the test done by you,
> > > then please share it so that others can also try to reproduce it.
> >
> > Unfortunately the environment I used for performance verification is
> > no longer available.
> >
> > I agree to run this test in a different environment. I've attached the
> > rebased version patch. I'm measuring the performance with/without
> > patch, so will share the results.
> >
>
> Thanks Sawada-san for patch.
>
> From last few days, I was reading this thread and was reviewing v13 patch.  To debug and test, I did re-base of v13
patch.I compared my re-based patch and v14 patch. I think,  ordering of header files is not alphabetically in v14
patch.(I haven't reviewed v14 patch fully because before review, I wanted to test false sharing).  While debugging, I
didn'tnoticed any hang or lock related issue. 
>
> I did some testing to test false sharing(bulk insert, COPY data, bulk insert into partitions tables).  Below is the
> testing summary.
>
> Test setup(Bulk insert into partition tables):
> autovacuum=off
> shared_buffers=512MB -c max_wal_size=20GB -c checkpoint_timeout=12min
>
> Basically, I created a table with 13 partitions. Using pgbench, I inserted bulk data. I used below pgbench command:
> ./pgbench -c $threads -j $threads -T 180 -f insert1.sql@1 -f insert2.sql@1 -f insert3.sql@1 -f insert4.sql@1 postgres
>
> I took scripts from previews mails and modified. For reference, I am attaching test scripts.  I tested with default
> 1024 slots(N_RELEXTLOCK_ENTS = 1024).
>
> Clients          HEAD (tps)                     With v14 patch (tps)      %change      (time: 180s)
> 1                    92.979796                        100.877446                     +8.49 %
> 32                   392.881863                      388.470622                    -1.12 %
> 56                   551.753235                       528.018852                   -4.30 %
> 60                   648.273767                       653.251507                   +0.76 %
> 64                   645.975124                       671.322140                   +3.92 %
> 66                   662.728010                       673.399762                   +1.61 %
> 70                   647.103183                       660.694914                   +2.10 %
> 74                   648.824027                       676.487622                  +4.26 %
>
> From above results, we can see that in most cases, TPS is slightly increased with v14 patch. I am still testing and
> will post my results.
>

The numbers at the 56 and 74 client counts seem slightly suspicious.  Can
you please repeat those tests?  Basically, I am not able to come up
with a theory for why at 56 clients the performance with the patch is a
bit lower and then at 74 it is higher.

> I want to test extension lock by blocking use of fsm(use_fsm=false in code).  I think, if we block use of fsm, then
> load will increase into extension lock.  Is this correct way to test?
>

Hmm, I think instead of directly hacking the code, you might want to
use the operation (probably cluster or vacuum full) where we set
HEAP_INSERT_SKIP_FSM.  I think along with this you can try with
unlogged tables because that might stress the extension lock.

In the above test, you might want to test with a higher number of
partitions (say up to 100) as well.  Also, see if you want to use the
COPY command.

> Please let me know if you have any specific testing scenario.
>

Can you test the scenario mentioned by Konstantin Knizhnik [1] where
this patch has shown significant gain?  You might want to use a higher
core count machine to test it.

One thing we can do is to somehow measure the collisions on each bucket.

[1] - https://www.postgresql.org/message-id/ef81da49-d491-db86-3ef6-5138d091fe91%40postgrespro.ru

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Amit Kapila
Date:
On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> Type of table: normal table, unlogged table
> Number of child tables : 16, 64 (all tables are located on the same tablespace)
> Number of clients : 32
> Number of trials : 100
> Duration: 180 seconds for each trials
>
> The hardware spec of server is Intel Xeon 2.4GHz (HT 160cores), 256GB
> RAM, NVMe SSD 1.5TB.
> Each clients load 10kB random data across all partitioned tables.
>
> Here is the result.
>
>  childs |   type   | target  |  avg_tps   | diff with HEAD
> --------+----------+---------+------------+------------------
>      16 | normal   | HEAD    |   1643.833 |
>      16 | normal   | Patched |  1619.5404 |      0.985222
>      16 | unlogged | HEAD    |  9069.3543 |
>      16 | unlogged | Patched |  9368.0263 |      1.032932
>      64 | normal   | HEAD    |   1598.698 |
>      64 | normal   | Patched |  1587.5906 |      0.993052
>      64 | unlogged | HEAD    |  9629.7315 |
>      64 | unlogged | Patched | 10208.2196 |      1.060073
> (8 rows)
>
> For normal tables, loading tps decreased 1% ~ 2% with this patch
> whereas it increased 3% ~ 6% for unlogged tables. There were
> collisions at 0 ~ 5 relation extension lock slots between 2 relations
> in the 64 child tables case but it didn't seem to affect the tps.
>

How did you measure the collisions in this test?  I think it is better
if Mahendra also uses the same technique to measure that count.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Thu, 6 Feb 2020 at 13:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > Type of table: normal table, unlogged table
> > Number of child tables : 16, 64 (all tables are located on the same tablespace)
> > Number of clients : 32
> > Number of trials : 100
> > Duration: 180 seconds for each trials
> >
> > The hardware spec of server is Intel Xeon 2.4GHz (HT 160cores), 256GB
> > RAM, NVMe SSD 1.5TB.
> > Each clients load 10kB random data across all partitioned tables.
> >
> > Here is the result.
> >
> >  childs |   type   | target  |  avg_tps   | diff with HEAD
> > --------+----------+---------+------------+------------------
> >      16 | normal   | HEAD    |   1643.833 |
> >      16 | normal   | Patched |  1619.5404 |      0.985222
> >      16 | unlogged | HEAD    |  9069.3543 |
> >      16 | unlogged | Patched |  9368.0263 |      1.032932
> >      64 | normal   | HEAD    |   1598.698 |
> >      64 | normal   | Patched |  1587.5906 |      0.993052
> >      64 | unlogged | HEAD    |  9629.7315 |
> >      64 | unlogged | Patched | 10208.2196 |      1.060073
> > (8 rows)
> >
> > For normal tables, loading tps decreased 1% ~ 2% with this patch
> > whereas it increased 3% ~ 6% for unlogged tables. There were
> > collisions at 0 ~ 5 relation extension lock slots between 2 relations
> > in the 64 child tables case but it didn't seem to affect the tps.
> >
>
> How did you measure the collisions in this test?  I think it is better
> if Mahendra can also use the same technique in measuring that count.
>

I created a SQL function that returns the hash value of the lock tag,
which is tag_hash(locktag, sizeof(RelExtLockTag)) %
N_RELEXTLOCK_ENTS, and examined the hash values of all tables.
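The bucketing scheme Sawada-san describes can be sketched as follows. This is
a hedged illustration only: Python's hashlib stands in for PostgreSQL's
tag_hash, and the (database OID, relation OID) tag layout and the OID values
below are hypothetical, so the concrete slot numbers will differ from the
server's; only the hash-then-modulo bucketing logic is the same.

```python
# Sketch of mapping a relation extension lock tag to one of a fixed
# number of slots, mimicking tag_hash(locktag, ...) % N_RELEXTLOCK_ENTS.
# blake2b is a stand-in hash; real slot numbers differ from PostgreSQL's.

import hashlib
from collections import Counter

N_RELEXTLOCK_ENTS = 1024

def relext_slot(dboid: int, reloid: int) -> int:
    """Map a (database OID, relation OID) lock tag to a fixed slot."""
    tag = dboid.to_bytes(4, "little") + reloid.to_bytes(4, "little")
    h = int.from_bytes(hashlib.blake2b(tag, digest_size=4).digest(), "little")
    return h % N_RELEXTLOCK_ENTS

# Check 64 child tables (hypothetical consecutive OIDs) for shared
# slots, as in the 64-partition test discussed above.
slots = Counter(relext_slot(13000, 16384 + i) for i in range(64))
shared = {s: c for s, c in slots.items() if c > 1}
print(f"{len(shared)} slot(s) shared by more than one relation")
```

Counting how many slots hold more than one relation, as done here, is one way
to measure the per-bucket collisions mentioned earlier in the thread.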

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Mahendra Singh Thalor
Date:
On Thu, 6 Feb 2020 at 09:44, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Feb 6, 2020 at 1:57 AM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
> >
> > On Wed, 5 Feb 2020 at 12:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Mon, Feb 3, 2020 at 8:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > >
> > > > > On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > > > > On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
> > > > > >>> I think the real question is whether the scenario is common enough to
> > > > > >>> worry about.  In practice, you'd have to be extremely unlucky to be
> > > > > >>> doing many bulk loads at the same time that all happened to hash to
> > > > > >>> the same bucket.
> > > > > >>
> > > > > >> With a bunch of parallel bulkloads into partitioned tables that really
> > > > > >> doesn't seem that unlikely?
> > > > > >
> > > > > > It increases the likelihood of collisions, but probably decreases the
> > > > > > number of cases where the contention gets really bad.
> > > > > >
> > > > > > For example, suppose each table has 100 partitions and you are
> > > > > > bulk-loading 10 of them at a time.  It's virtually certain that you
> > > > > > will have some collisions, but the amount of contention within each
> > > > > > bucket will remain fairly low because each backend spends only 1% of
> > > > > > its time in the bucket corresponding to any given partition.
> > > > > >
> > > > >
> > > > > I share another result of performance evaluation between current HEAD
> > > > > and current HEAD with v13 patch(N_RELEXTLOCK_ENTS = 1024).
> > > > >
> > > > > Type of table: normal table, unlogged table
> > > > > Number of child tables : 16, 64 (all tables are located on the same tablespace)
> > > > > Number of clients : 32
> > > > > Number of trials : 100
> > > > > Duration: 180 seconds for each trials
> > > > >
> > > > > The hardware spec of server is Intel Xeon 2.4GHz (HT 160cores), 256GB
> > > > > RAM, NVMe SSD 1.5TB.
> > > > > Each clients load 10kB random data across all partitioned tables.
> > > > >
> > > > > Here is the result.
> > > > >
> > > > >  childs |   type   | target  |  avg_tps   | diff with HEAD
> > > > > --------+----------+---------+------------+------------------
> > > > >      16 | normal   | HEAD    |   1643.833 |
> > > > >      16 | normal   | Patched |  1619.5404 |      0.985222
> > > > >      16 | unlogged | HEAD    |  9069.3543 |
> > > > >      16 | unlogged | Patched |  9368.0263 |      1.032932
> > > > >      64 | normal   | HEAD    |   1598.698 |
> > > > >      64 | normal   | Patched |  1587.5906 |      0.993052
> > > > >      64 | unlogged | HEAD    |  9629.7315 |
> > > > >      64 | unlogged | Patched | 10208.2196 |      1.060073
> > > > > (8 rows)
> > > > >
> > > > > For normal tables, loading tps decreased 1% ~ 2% with this patch
> > > > > whereas it increased 3% ~ 6% for unlogged tables. There were
> > > > > collisions at 0 ~ 5 relation extension lock slots between 2 relations
> > > > > in the 64 child tables case but it didn't seem to affect the tps.
> > > > >
> > > >
> > > > AFAIU, this resembles the workload that Andres was worried about.   I
> > > > think we should once run this test in a different environment, but
> > > > considering this to be correct and repeatable, where do we go with
> > > > this patch especially when we know it improves many workloads [1] as
> > > > well.  We know that on a pathological case constructed by Mithun [2],
> > > > this causes regression as well.  I am not sure if the test done by
> > > > Mithun really mimics any real-world workload as he has tested by
> > > > making N_RELEXTLOCK_ENTS = 1 to hit the worst case.
> > > >
> > > > Sawada-San, if you have a script or data for the test done by you,
> > > > then please share it so that others can also try to reproduce it.
> > >
> > > Unfortunately the environment I used for performance verification is
> > > no longer available.
> > >
> > > I agree to run this test in a different environment. I've attached the
> > > rebased version patch. I'm measuring the performance with/without
> > > patch, so will share the results.
> > >
> >
> > Thanks Sawada-san for patch.
> >
> > From the last few days, I was reading this thread and reviewing the v13 patch.  To debug and test, I did a re-base of the v13 patch. I compared my re-based patch and the v14 patch. I think the ordering of header files is not alphabetical in the v14 patch. (I haven't reviewed the v14 patch fully because, before review, I wanted to test false sharing.)  While debugging, I didn't notice any hang or lock related issue.
> >
> > I did some testing to test false sharing(bulk insert, COPY data, bulk insert into partitions tables).  Below is the testing summary.
> >
> > Test setup(Bulk insert into partition tables):
> > autovacuum=off
> > shared_buffers=512MB -c max_wal_size=20GB -c checkpoint_timeout=12min
> >
> > Basically, I created a table with 13 partitions. Using pgbench, I inserted bulk data. I used below pgbench command:
> > ./pgbench -c $threads -j $threads -T 180 -f insert1.sql@1 -f insert2.sql@1 -f insert3.sql@1 -f insert4.sql@1 postgres
> >
> > I took scripts from previous mails and modified them. For reference, I am attaching the test scripts.  I tested with the default 1024 slots (N_RELEXTLOCK_ENTS = 1024).
> >
> > Clients          HEAD (tps)                     With v14 patch (tps)      %change      (time: 180s)
> > 1                    92.979796                        100.877446                     +8.49 %
> > 32                   392.881863                      388.470622                    -1.12 %
> > 56                   551.753235                       528.018852                   -4.30 %
> > 60                   648.273767                       653.251507                   +0.76 %
> > 64                   645.975124                       671.322140                   +3.92 %
> > 66                   662.728010                       673.399762                   +1.61 %
> > 70                   647.103183                       660.694914                   +2.10 %
> > 74                   648.824027                       676.487622                  +4.26 %
> >
> > From above results, we can see that in most cases, TPS is slightly increased with v14 patch. I am still testing and will post my results.
> >
>
> The numbers at 56 and 74 client counts seem slightly suspicious.   Can
> you please repeat those tests?  Basically, I am not able to come up
> with a theory why at 56 clients the performance with the patch is a
> bit lower and then at 74 it is higher.

Okay. I will repeat test.

>
> > I want to test extension lock by blocking use of fsm(use_fsm=false in code).  I think, if we block use of fsm, then load will increase into extension lock.  Is this correct way to test?
> >
>
> Hmm, I think instead of directly hacking the code, you might want to
> use the operation (probably cluster or vacuum full) where we set
> HEAP_INSERT_SKIP_FSM.  I think along with this you can try with
> unlogged tables because that might stress the extension lock.

Okay. I will test.

>
> In the above test, you might want to test with a higher number of
> partitions (say up to 100) as well.  Also, see if you want to use the
> Copy command.

Okay. I will test.

>
> > Please let me know if you have any specific testing scenario.
> >
>
> Can you test the scenario mentioned by Konstantin Knizhnik [1] where
> this patch has shown significant gain?  You might want to use a higher
> core count machine to test it.

I followed Konstantin Knizhnik's steps and tested inserts with a high core count. Below is the test summary:

Test setup:
autovacuum =  off
max_connections = 1000

My testing machine:
$ lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                192
On-line CPU(s) list:   0-191
Thread(s) per core:    8
Core(s) per socket:    1
Socket(s):             24
NUMA node(s):          4
Model:                 IBM,8286-42A
L1d cache:             64K
L1i cache:             32K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-47
NUMA node1 CPU(s):     48-95
NUMA node2 CPU(s):     96-143
NUMA node3 CPU(s):     144-191

create table test (i int, md5 text);

insert.sql:
begin;
insert into test select i, md5(i::text) from generate_series(1,1000) AS i;
end;

pgbench command:
./pgbench postgres  -c 1000 -j 36 -T 180 -P 10 -f insert.sql >> results.txt

I tested with 1000 clients. Below is the tps:
TPS on HEAD:
Run 1) : 608.908721
Run 2) : 599.962863
Run 3) : 606.378819
Run 4) : 607.174076
Run 5) : 598.531958

TPS with v14 patch: ( N_RELEXTLOCK_ENTS = 1024)
Run 1) : 649.488472
Run 2) : 657.902261
Run 3) : 654.478580
Run 4) : 648.085126
Run 5) : 647.171482

%change = +7.10 %

Apart from above test, I did some more tests (N_RELEXTLOCK_ENTS = 1024):
1) bulk insert into 1 table for T = 180s, 3600s,  clients-100,1000, table- logged,unlogged
2) copy command
3) bulk load into table having 13 partitions

In all the cases, I can see 4-9% improvement in TPS as compared to HEAD.

@Konstantin Knizhnik, if you remember, then please let me know how much tps gain was observed in your insert test. Is it close to my results?

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Mahendra Singh Thalor
Date:
On Sat, 8 Feb 2020 at 00:27, Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
>
> On Thu, 6 Feb 2020 at 09:44, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Feb 6, 2020 at 1:57 AM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
> > >
> > > On Wed, 5 Feb 2020 at 12:07, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Mon, Feb 3, 2020 at 8:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > On Tue, Jun 26, 2018 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Apr 27, 2018 at 4:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > > > > > > On Thu, Apr 26, 2018 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
> > > > > > >>> I think the real question is whether the scenario is common enough to
> > > > > > >>> worry about.  In practice, you'd have to be extremely unlucky to be
> > > > > > >>> doing many bulk loads at the same time that all happened to hash to
> > > > > > >>> the same bucket.
> > > > > > >>
> > > > > > >> With a bunch of parallel bulkloads into partitioned tables that really
> > > > > > >> doesn't seem that unlikely?
> > > > > > >
> > > > > > > It increases the likelihood of collisions, but probably decreases the
> > > > > > > number of cases where the contention gets really bad.
> > > > > > >
> > > > > > > For example, suppose each table has 100 partitions and you are
> > > > > > > bulk-loading 10 of them at a time.  It's virtually certain that you
> > > > > > > will have some collisions, but the amount of contention within each
> > > > > > > bucket will remain fairly low because each backend spends only 1% of
> > > > > > > its time in the bucket corresponding to any given partition.
> > > > > > >
> > > > > >
> > > > > > I share another result of performance evaluation between current HEAD
> > > > > > and current HEAD with v13 patch(N_RELEXTLOCK_ENTS = 1024).
> > > > > >
> > > > > > Type of table: normal table, unlogged table
> > > > > > Number of child tables : 16, 64 (all tables are located on the same tablespace)
> > > > > > Number of clients : 32
> > > > > > Number of trials : 100
> > > > > > Duration: 180 seconds for each trials
> > > > > >
> > > > > > The hardware spec of server is Intel Xeon 2.4GHz (HT 160cores), 256GB
> > > > > > RAM, NVMe SSD 1.5TB.
> > > > > > Each clients load 10kB random data across all partitioned tables.
> > > > > >
> > > > > > Here is the result.
> > > > > >
> > > > > >  childs |   type   | target  |  avg_tps   | diff with HEAD
> > > > > > --------+----------+---------+------------+------------------
> > > > > >      16 | normal   | HEAD    |   1643.833 |
> > > > > >      16 | normal   | Patched |  1619.5404 |      0.985222
> > > > > >      16 | unlogged | HEAD    |  9069.3543 |
> > > > > >      16 | unlogged | Patched |  9368.0263 |      1.032932
> > > > > >      64 | normal   | HEAD    |   1598.698 |
> > > > > >      64 | normal   | Patched |  1587.5906 |      0.993052
> > > > > >      64 | unlogged | HEAD    |  9629.7315 |
> > > > > >      64 | unlogged | Patched | 10208.2196 |      1.060073
> > > > > > (8 rows)
> > > > > >
> > > > > > For normal tables, loading tps decreased 1% ~ 2% with this patch
> > > > > > whereas it increased 3% ~ 6% for unlogged tables. There were
> > > > > > collisions at 0 ~ 5 relation extension lock slots between 2 relations
> > > > > > in the 64 child tables case but it didn't seem to affect the tps.
> > > > > >
> > > > >
> > > > > AFAIU, this resembles the workload that Andres was worried about.   I
> > > > > think we should once run this test in a different environment, but
> > > > > considering this to be correct and repeatable, where do we go with
> > > > > this patch especially when we know it improves many workloads [1] as
> > > > > well.  We know that on a pathological case constructed by Mithun [2],
> > > > > this causes regression as well.  I am not sure if the test done by
> > > > > Mithun really mimics any real-world workload as he has tested by
> > > > > making N_RELEXTLOCK_ENTS = 1 to hit the worst case.
> > > > >
> > > > > Sawada-San, if you have a script or data for the test done by you,
> > > > > then please share it so that others can also try to reproduce it.
> > > >
> > > > Unfortunately the environment I used for performance verification is
> > > > no longer available.
> > > >
> > > > I agree to run this test in a different environment. I've attached the
> > > > rebased version patch. I'm measuring the performance with/without
> > > > patch, so will share the results.
> > > >
> > >
> > > Thanks Sawada-san for patch.
> > >
> > > From last few days, I was reading this thread and was reviewing v13 patch.  To debug and test, I did re-base of v13 patch. I compared my re-based patch and v14 patch. I think,  ordering of header files is not alphabetically in v14 patch. (I haven't reviewed v14 patch fully because before review, I wanted to test false sharing).  While debugging, I didn't noticed any hang or lock related issue.
> > >
> > > I did some testing to test false sharing(bulk insert, COPY data, bulk insert into partitions tables).  Below is the testing summary.
> > >
> > > Test setup(Bulk insert into partition tables):
> > > autovacuum=off
> > > shared_buffers=512MB -c max_wal_size=20GB -c checkpoint_timeout=12min
> > >
> > > Basically, I created a table with 13 partitions. Using pgbench, I inserted bulk data. I used below pgbench command:
> > > ./pgbench -c $threads -j $threads -T 180 -f insert1.sql@1 -f insert2.sql@1 -f insert3.sql@1 -f insert4.sql@1 postgres
> > >
> > > I took scripts from previous mails and modified them. For reference, I am attaching the test scripts.  I tested with the default 1024 slots (N_RELEXTLOCK_ENTS = 1024).
> > >
> > > Clients          HEAD (tps)                     With v14 patch (tps)      %change      (time: 180s)
> > > 1                    92.979796                        100.877446                     +8.49 %
> > > 32                   392.881863                      388.470622                    -1.12 %
> > > 56                   551.753235                       528.018852                   -4.30 %
> > > 60                   648.273767                       653.251507                   +0.76 %
> > > 64                   645.975124                       671.322140                   +3.92 %
> > > 66                   662.728010                       673.399762                   +1.61 %
> > > 70                   647.103183                       660.694914                   +2.10 %
> > > 74                   648.824027                       676.487622                  +4.26 %
> > >
> > > From above results, we can see that in most cases, TPS is slightly increased with v14 patch. I am still testing and will post my results.
> > >
> >
> > The numbers at 56 and 74 client counts seem slightly suspicious.   Can
> > you please repeat those tests?  Basically, I am not able to come up
> > with a theory why at 56 clients the performance with the patch is a
> > bit lower and then at 74 it is higher.
>
> Okay. I will repeat test.

I re-tested on a different machine because on the previous machine, the results were inconsistent.

My testing machine:
$ lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                192
On-line CPU(s) list:   0-191
Thread(s) per core:    8
Core(s) per socket:    1
Socket(s):             24
NUMA node(s):          4
Model:                 IBM,8286-42A
L1d cache:             64K
L1i cache:             32K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-47
NUMA node1 CPU(s):     48-95
NUMA node2 CPU(s):     96-143
NUMA node3 CPU(s):     144-191

./pgbench -c $threads -j $threads -T 180 -f insert1.sql@1 -f insert2.sql@1 -f insert3.sql@1 -f insert4.sql@1 postgres

Clients        HEAD(tps)            With v14 patch(tps)        %change       (time: 180s)
1                 41.491486              41.375532                  -0.27%
32              335.138568            330.028739                 -1.52%
56             353.783930             360.883710                  +2.00%
60             341.741925             359.028041                 +5.05%
64             338.521730             356.511423                  +5.13%
66             339.838921             352.761766                  +3.80%
70            339.305454              353.658425                +4.23%
74            332.016217              348.809042                 +5.05%       

From above results, it seems that there is very little regression with the patch (+-5%), which could be run-to-run variation.

>
> >
> > > I want to test extension lock by blocking use of fsm(use_fsm=false in code).  I think, if we block use of fsm, then load will increase into extension lock.  Is this correct way to test?
> > >
> >
> > Hmm, I think instead of directly hacking the code, you might want to
> > use the operation (probably cluster or vacuum full) where we set
> > HEAP_INSERT_SKIP_FSM.  I think along with this you can try with
> > unlogged tables because that might stress the extension lock.
>
> Okay. I will test.

I tested with unlogged tables as well.  There too I was getting a 3-6% gain in tps.

>
> >
> > In the above test, you might want to test with a higher number of
> > partitions (say up to 100) as well.  Also, see if you want to use the
> > Copy command.
>
> Okay. I will test.

I tested with 500, 1000, 2000 partitions. I observed a max +5% regress in the tps and there was no performance degradation.

For example:
I created a table with 2000 partitions and then I checked false sharing.
 Slot Number | Slot Freq. | Slot Number | Slot Freq. | Slot Number | Slot Freq.
-------------+------------+-------------+------------+-------------+------------
         156 |         13 |         973 |         11 |         446 |         10
         627 |         13 |          52 |         10 |         488 |         10
         782 |         12 |         103 |         10 |         501 |         10
         812 |         12 |         113 |         10 |         701 |         10
         192 |         11 |         175 |         10 |         737 |         10
         221 |         11 |         235 |         10 |         754 |         10
         367 |         11 |         254 |         10 |         781 |         10
         546 |         11 |         314 |         10 |         790 |         10
         814 |         11 |         419 |         10 |         833 |         10
         917 |         11 |         424 |         10 |         888 |         10

From the above table, we can see that a total of 13 child tables fall in the same bucket (slot 156), so I did bulk-loading only in those 13 child tables to check tps under false sharing, but I noticed that there was no performance degradation.

--
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Amit Kapila
Date:
On Mon, Feb 10, 2020 at 10:28 PM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:
>
> On Sat, 8 Feb 2020 at 00:27, Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
> >
> > On Thu, 6 Feb 2020 at 09:44, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > The numbers at 56 and 74 client counts seem slightly suspicious.   Can
> > > you please repeat those tests?  Basically, I am not able to come up
> > > with a theory why at 56 clients the performance with the patch is a
> > > bit lower and then at 74 it is higher.
> >
> > Okay. I will repeat test.
>
> I re-tested in different machine because in previous machine, results are  in-consistent
>

Thanks for doing detailed tests.

> My testing machine:
> $ lscpu
> Architecture:          ppc64le
> Byte Order:            Little Endian
> CPU(s):                192
> On-line CPU(s) list:   0-191
> Thread(s) per core:    8
> Core(s) per socket:    1
> Socket(s):             24
> NUMA node(s):          4
> Model:                 IBM,8286-42A
> L1d cache:             64K
> L1i cache:             32K
> L2 cache:              512K
> L3 cache:              8192K
> NUMA node0 CPU(s):     0-47
> NUMA node1 CPU(s):     48-95
> NUMA node2 CPU(s):     96-143
> NUMA node3 CPU(s):     144-191
>
> ./pgbench -c $threads -j $threads -T 180 -f insert1.sql@1 -f insert2.sql@1 -f insert3.sql@1 -f insert4.sql@1
postgres
>
> Clients        HEAD(tps)            With v14 patch(tps)        %change       (time: 180s)
> 1                 41.491486              41.375532                  -0.27%
> 32              335.138568            330.028739                 -1.52%
> 56             353.783930             360.883710                  +2.00%
> 60             341.741925             359.028041                 +5.05%
> 64             338.521730             356.511423                  +5.13%
> 66             339.838921             352.761766                  +3.80%
> 70            339.305454              353.658425                +4.23%
> 74            332.016217              348.809042                 +5.05%
>
> From above results, it seems that there is very little regression with the patch (+-5%), which could be run-to-run variation.
>

Hmm, I don't see 5% regression, rather it is a performance gain of ~5%
with the patch?  When we use regression, that indicates with the patch
performance (TPS) is reduced, but I don't see that in the above
numbers.  Kindly clarify.

> >
> > >
> > > > I want to test extension lock by blocking use of fsm (use_fsm=false in code).  I think, if we block use of fsm, then load will increase into extension lock.  Is this correct way to test?
> > > >
> > >
> > > Hmm, I think instead of directly hacking the code, you might want to
> > > use the operation (probably cluster or vacuum full) where we set
> > > HEAP_INSERT_SKIP_FSM.  I think along with this you can try with
> > > unlogged tables because that might stress the extension lock.
> >
> > Okay. I will test.
>
> I tested with unlogged tables as well.  There too I was getting a 3-6% gain in tps.
>
> >
> > >
> > > In the above test, you might want to test with a higher number of
> > > partitions (say up to 100) as well.  Also, see if you want to use the
> > > Copy command.
> >
> > Okay. I will test.
>
> I tested with 500, 1000, 2000 partitions. I observed a max +5% regress in the tps and there was no performance degradation.
>

Again, I am not sure if you see performance dip here.  I think your
usage of the word 'regression' is not correct or at least confusing.

> For example:
> I created a table with 2000 partitions and then I checked false sharing.
>  Slot Number | Slot Freq. | Slot Number | Slot Freq. | Slot Number | Slot Freq.
> -------------+------------+-------------+------------+-------------+------------
>          156 |         13 |         973 |         11 |         446 |         10
>          627 |         13 |          52 |         10 |         488 |         10
>          782 |         12 |         103 |         10 |         501 |         10
>          812 |         12 |         113 |         10 |         701 |         10
>          192 |         11 |         175 |         10 |         737 |         10
>          221 |         11 |         235 |         10 |         754 |         10
>          367 |         11 |         254 |         10 |         781 |         10
>          546 |         11 |         314 |         10 |         790 |         10
>          814 |         11 |         419 |         10 |         833 |         10
>          917 |         11 |         424 |         10 |         888 |         10
>
> From the above table, we can see that a total of 13 child tables fall in the same bucket (slot 156), so I did bulk-loading only in those 13 child tables to check tps under false sharing, but I noticed that there was no performance degradation.
>

Okay.  Is it possible to share these numbers and scripts?

Thanks for doing the detailed tests for this patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Amit Kapila
Date:
On Wed, Feb 5, 2020 at 12:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
>
> Unfortunately the environment I used for performance verification is
> no longer available.
>
> I agree to run this test in a different environment. I've attached the
> rebased version patch. I'm measuring the performance with/without
> patch, so will share the results.
>

Did you get a chance to run these tests?  Lately, Mahendra has done a
lot of performance testing of this patch and shared his results.  I
don't see much downside with the patch, rather there is a performance
increase of 3-9% in various scenarios.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



I took a brief look through this patch.  I agree with the fundamental
idea that we shouldn't need to use the heavyweight lock manager for
relation extension, since deadlock is not a concern and no backend
should ever need to hold more than one such lock at once.  But it feels
to me like this particular solution is rather seriously overengineered.
I would like to suggest that we do something similar to Robert Haas'
excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
that is,

* Create some predetermined number N of LWLocks for relation extension.
* When we want to extend some relation R, choose one of those locks
  (say, R's relfilenode number mod N) and lock it.

1. As long as all backends agree on the relation-to-lock mapping, this
provides full security against concurrent extensions of the same
relation.

2. Occasionally a backend will be blocked when it doesn't need to be,
because of false sharing of a lock between two relations that need to
be extended at the same time.  But as long as N is large enough (and
I doubt that it needs to be very large), that will be a negligible
penalty.

3. Aside from being a lot simpler than the proposed extension_lock.c,
this approach involves absolutely negligible overhead beyond the raw
LWLockAcquire and LWLockRelease calls.  I suspect therefore that in
typical noncontended cases it will be faster.  It also does not require
any new resource management overhead, thus eliminating this patch's
small but real penalty on transaction exit/cleanup.

We'd need to do a bit of performance testing to choose a good value
for N.  I think that with N comparable to MaxBackends, the odds of
false sharing being a problem would be quite negligible ... but it
could be that we could get away with a lot less than that.

            regards, tom lane



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Wed, 12 Feb 2020 at 00:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I took a brief look through this patch.  I agree with the fundamental
> idea that we shouldn't need to use the heavyweight lock manager for
> relation extension, since deadlock is not a concern and no backend
> should ever need to hold more than one such lock at once.  But it feels
> to me like this particular solution is rather seriously overengineered.
> I would like to suggest that we do something similar to Robert Haas'
> excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
> that is,
>
> * Create some predetermined number N of LWLocks for relation extension.

My original proposal used LWLocks and hash tables for relation
extension but there was a discussion that using LWLocks is not good
because it's not interruptible [1]. For this reason, and because we
don't need two lock levels (shared, exclusive) for the relation
extension lock, we ended up implementing a dedicated lock manager
for extension locks. I think we will have that problem again if we use LWLocks.

Regards,

[1] https://www.postgresql.org/message-id/CA%2BTgmoZnWYQvmeqeGyY%2B0j-Tfmx8cTzRadfxJQwK9A-nCQ7GkA%40mail.gmail.com



--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Andres Freund
Date:
Hi,

On 2020-02-11 08:01:34 +0530, Amit Kapila wrote:
> I don't see much downside with the patch, rather there is a
> performance increase of 3-9% in various scenarios.

As I wrote in [1] I started to look at this patch. My problem with it is
that it just seems like the wrong direction architecturally to
me. There's two main aspects to this:

1) It basically builds another, more lightweight but less capable,
   lock manager that can lock more objects than we can have distinct
   locks for.  It is faster because it uses *one* hashtable without
   conflict handling, because it has fewer lock modes, and because it
   doesn't support detecting deadlocks. And probably some other things.

2) A lot of the contention around file extension comes from us doing
   multiple expensive things under one lock (determining current
   relation size, searching victim buffer, extending file), and in tiny
   increments (growing a 1TB table by 8kb). This patch doesn't address
   that at all.

I've focused on 1) in the email referenced above ([1]). Here I'll focus
on 2).

To quantify my concerns I instrumented postgres to measure the time for
various operations that are part of extending a file (all per
process). The hardware is a pretty fast nvme, with unlogged tables, on a
20/40 core/threads machine. The workload is copying a scale 10
pgbench_accounts into an unindexed, unlogged table using pgbench.

Here are the instrumentations for various client counts, when just
measuring 20s:

1 client:
LOG:  extension time: lock wait: 0.00 lock held: 3.19 filesystem: 1.29 buffersearch: 1.58

2 clients:
LOG:  extension time: lock wait: 0.47 lock held: 2.99 filesystem: 1.24 buffersearch: 1.43
LOG:  extension time: lock wait: 0.60 lock held: 3.05 filesystem: 1.23 buffersearch: 1.50

4 clients:
LOG:  extension time: lock wait: 3.92 lock held: 2.69 filesystem: 1.10 buffersearch: 1.29
LOG:  extension time: lock wait: 4.40 lock held: 2.02 filesystem: 0.81 buffersearch: 0.93
LOG:  extension time: lock wait: 3.86 lock held: 2.59 filesystem: 1.06 buffersearch: 1.22
LOG:  extension time: lock wait: 4.00 lock held: 2.65 filesystem: 1.08 buffersearch: 1.26

8 clients:
LOG:  extension time: lock wait: 6.94 lock held: 1.74 filesystem: 0.70 buffersearch: 0.80
LOG:  extension time: lock wait: 7.16 lock held: 1.81 filesystem: 0.73 buffersearch: 0.82
LOG:  extension time: lock wait: 6.93 lock held: 1.95 filesystem: 0.80 buffersearch: 0.89
LOG:  extension time: lock wait: 7.08 lock held: 1.87 filesystem: 0.76 buffersearch: 0.86
LOG:  extension time: lock wait: 6.95 lock held: 1.95 filesystem: 0.80 buffersearch: 0.89
LOG:  extension time: lock wait: 6.88 lock held: 2.01 filesystem: 0.83 buffersearch: 0.93
LOG:  extension time: lock wait: 6.94 lock held: 2.02 filesystem: 0.82 buffersearch: 0.93
LOG:  extension time: lock wait: 7.02 lock held: 1.95 filesystem: 0.80 buffersearch: 0.89

16 clients:
LOG:  extension time: lock wait: 10.37 lock held: 0.88 filesystem: 0.36 buffersearch: 0.39
LOG:  extension time: lock wait: 10.53 lock held: 0.90 filesystem: 0.37 buffersearch: 0.40
LOG:  extension time: lock wait: 10.72 lock held: 1.01 filesystem: 0.42 buffersearch: 0.45
LOG:  extension time: lock wait: 10.45 lock held: 1.25 filesystem: 0.52 buffersearch: 0.55
LOG:  extension time: lock wait: 10.66 lock held: 0.94 filesystem: 0.38 buffersearch: 0.41
LOG:  extension time: lock wait: 10.50 lock held: 1.27 filesystem: 0.53 buffersearch: 0.56
LOG:  extension time: lock wait: 10.53 lock held: 1.19 filesystem: 0.49 buffersearch: 0.53
LOG:  extension time: lock wait: 10.57 lock held: 1.22 filesystem: 0.50 buffersearch: 0.53
LOG:  extension time: lock wait: 10.72 lock held: 1.17 filesystem: 0.48 buffersearch: 0.52
LOG:  extension time: lock wait: 10.67 lock held: 1.32 filesystem: 0.55 buffersearch: 0.58
LOG:  extension time: lock wait: 10.95 lock held: 0.92 filesystem: 0.38 buffersearch: 0.40
LOG:  extension time: lock wait: 10.81 lock held: 1.24 filesystem: 0.51 buffersearch: 0.56
LOG:  extension time: lock wait: 10.62 lock held: 1.27 filesystem: 0.53 buffersearch: 0.56
LOG:  extension time: lock wait: 11.14 lock held: 0.94 filesystem: 0.38 buffersearch: 0.41
LOG:  extension time: lock wait: 11.20 lock held: 0.96 filesystem: 0.39 buffersearch: 0.42
LOG:  extension time: lock wait: 10.75 lock held: 1.41 filesystem: 0.58 buffersearch: 0.63
in *none* of these cases the drive gets even close to being
saturated. Like not even 1/3.


If you consider the total time with the lock held, and the total time of
the test, it becomes very quickly obvious that pretty quickly we spend
the majority of the total time with the lock held.
client count 1: 3.18/20 = 0.16
client count 2: 6.04/20 = 0.30
client count 4: 9.95/20 = 0.50
client count 8: 15.30/20 = 0.76
client count 16: 17.89/20 = 0.89

In other words, the reason that relation extension scales terribly
isn't, to a significant degree, because the locking is slow. It's
because we hold locks for the majority of the benchmark's time starting
even at just 4 clients.  Focusing on making the locking faster is just
optimizing for the wrong thing.  Amdahl's law will just restrict the
benefits to a pretty small amount.

Looking at a CPU time profile (i.e. it'll not include the time spent
waiting for a lock, once sleeping in the kernel) for time spent within
RelationGetBufferForTuple():

-   19.16%     0.29%  postgres  postgres            [.] RelationGetBufferForTuple
   - 18.88% RelationGetBufferForTuple
      - 13.18% ReadBufferExtended
         - ReadBuffer_common
            + 5.02% mdextend
            + 4.77% FlushBuffer.part.0
            + 0.61% BufTableLookup
              0.52% __memset_avx2_erms
      + 1.65% PageInit
      - 1.18% LockRelationForExtension
         - 1.16% LockAcquireExtended
            - 1.07% WaitOnLock
               - 1.01% ProcSleep
                  - 0.88% WaitLatchOrSocket
                       0.52% WaitEventSetWait
        0.65% RecordAndGetPageWithFreeSpace

the same workload using an assert enabled build, to get a simpler to
interpret profile:
-   13.28%     0.19%  postgres  postgres            [.] RelationGetBufferForTuple
   - 13.09% RelationGetBufferForTuple
      - 8.35% RelationAddExtraBlocks
         - 7.67% ReadBufferBI
            - 7.54% ReadBufferExtended
               - 7.52% ReadBuffer_common
                  - 3.64% BufferAlloc
                     + 2.39% FlushBuffer
                     + 0.27% BufTableLookup
                     + 0.24% BufTableDelete
                     + 0.15% LWLockAcquire
                       0.14% StrategyGetBuffer
                     + 0.13% BufTableHashCode
                  - 2.96% smgrextend
                     + mdextend
                  + 0.52% __memset_avx2_erms
                  + 0.14% smgrnblocks
                    0.11% __GI___clock_gettime (inlined)
         + 0.57% RecordPageWithFreeSpace
      - 1.23% RecordAndGetPageWithFreeSpace
         - 1.03% fsm_set_and_search
            + 0.50% fsm_readbuf
            + 0.20% LockBuffer
            + 0.18% UnlockReleaseBuffer
              0.11% fsm_set_avail
           0.19% fsm_search
      - 0.86% ReadBufferBI
         - 0.72% ReadBufferExtended
            - ReadBuffer_common
               - 0.58% BufferAlloc
                  + 0.20% BufTableLookup
                  + 0.10% LWLockAcquire
      + 0.81% PageInit
      - 0.67% LockRelationForExtension
         - 0.67% LockAcquire
            - LockAcquireExtended
               + 0.60% WaitOnLock


Which, I think, pretty clearly shows a few things:

1) It's crucial to move acquiring a victim buffer to the outside of the
   extension lock, as for COPY, acquiring the victim buffer will commonly
   cause a buffer to have to be written out, due to the ringbuffer. This
   is even more crucial when using a logged table, as the writeout will
   then often also trigger a WAL flush.

   While doing so will sometimes add a round of acquiring the buffer
   mapping locks, having to do the FlushBuffer while holding the
   extension lock is a huge problem.

   This'd also move a good bit of the cost of finding (i.e. clock sweep
   / ringbuffer replacement) and invalidating the old buffer mapping out
   of the lock.

2) We need to make the smgrwrite more efficient, it is costing a lot of
   time. A small additional experiment shows the cost of doing 8kb
   writes:

   I wrote a small program that just iteratively writes a 32GB file:

   pwrite using 8kb blocks:
   0.24user 17.88system 0:18.16 elapsed 99%CPU

   pwrite using 128kb blocks:
   0.00user 16.71system 0:17.01 elapsed 98%CPU

   pwrite using 256kb blocks:
   0.00user 15.95system 0:16.03 elapsed 99%CPU

   pwritev() using 16 8kb blocks to write 128kb at once:
   0.02user 15.94system 0:16.09 elapsed 99%CPU

   pwritev() using 32 8kb blocks to write 256kb at once:
   0.01user 14.90system 0:14.93 elapsed 99%CPU

   pwritev() using 128 8kb blocks to write 1MB at once:
   0.00user 13.96system 0:13.96 elapsed 99%CPU


   if I instead just use posix_fallocate() with 8kb blocks:
   0.28user 23.49system 0:23.78elapsed 99%CPU (0avgtext+0avgdata 1212maxresident)k
   0inputs+0outputs (0major+66minor)pagefaults 0swaps

   if I instead just use posix_fallocate() with 32 8kb blocks:
   0.01user 1.18system 0:01.19elapsed 99%CPU (0avgtext+0avgdata 1200maxresident)k
   0inputs+0outputs (0major+67minor)pagefaults 0swaps

   obviously fallocate doesn't quite have the same behaviour, and may incur
   somewhat higher overhead for a later write.


   using a version that instead uses O_DIRECT + async IO, I get (but
   only when also doing posix_fallocate in larger chunks):
   0.05user 5.53system 0:12.53 elapsed 44%CPU

   So we get considerably higher write throughput, at a considerably
   lower CPU usage (because DMA replaces the CPU doing a memcpy()).


   So it looks like extending the file with posix_fallocate() might be a
   winner, but only if we actually can do so in larger chunks than 8kb
   at once.


   Alternatively it could be worthwhile to rejigger things so we don't
   extend the files with zeroes first, just to then immediately overwrite
   them with actual content. For some users it's probably possible to
   pre-generate a page with contents when extending the file (would need
   fiddling with block numbers etc).


3) We should move the PageInit() that's currently done with the
   extension lock held, to the outside. Since we get the buffer with
   RBM_ZERO_AND_LOCK these days, that should be safe.  Also, we don't
   need to zero the entire buffer both in RelationGetBufferForTuple()'s
   PageInit(), and in ReadBuffer_common() before calling smgrextend().
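The posix_fallocate() numbers in point 2 suggest a bulk-extension helper along these lines. This is a standalone sketch, not PostgreSQL code: extend_in_bulk() and its use of fstat() to find the current end of file are my own assumptions, and real code would have to handle ENOSPC and interrupted calls.

```c
/*
 * Sketch: extend a file by many blocks at once with posix_fallocate(),
 * so one syscall covers the whole range instead of one 8kb write of
 * zeroes per block.
 */
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Extend fd by nblocks at its current end; returns 0 on success. */
static int
extend_in_bulk(int fd, unsigned nblocks)
{
	struct stat st;

	if (fstat(fd, &st) != 0)
		return -1;
	return posix_fallocate(fd, st.st_size, (off_t) nblocks * BLCKSZ);
}
```

As Andres notes, fallocate doesn't behave identically to writing zeroes, so a later overwrite of the allocated range may cost a bit more.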


Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20200211042229.msv23badgqljrdg2%40alap3.anarazel.de



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Tue, 11 Feb 2020 at 11:31, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 5, 2020 at 12:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> >
> > Unfortunately the environment I used for performance verification is
> > no longer available.
> >
> > I agree to run this test in a different environment. I've attached the
> > rebased version patch. I'm measuring the performance with/without
> > patch, so will share the results.
> >
>
> Did you get a chance to run these tests?  Lately, Mahendra has done a
> lot of performance testing of this patch and shared his results.  I
> don't see much downside with the patch, rather there is a performance
> increase of 3-9% in various scenarios.

I've done performance tests on my laptop while changing the number of
partitions. 4 clients concurrently insert 32 tuples into randomly
selected partitions in a transaction, so changing the number of
partitions also changes the contention on the relation extension
lock. All tables are unlogged tables and N_RELEXTLOCK_ENTS is 1024.

Here is my test results:

* HEAD
nchilds = 64 tps = 33135
nchilds = 128 tps = 31249
nchilds = 256 tps = 29356

* Patched
nchilds = 64 tps = 32057
nchilds = 128 tps = 32426
nchilds = 256 tps = 29483

The performance has been slightly improved by the patch in two cases.
I've also attached the shell script I used to test.

When I set N_RELEXTLOCK_ENTS to 1, so that all relation extension locks
conflict, the result is:

nchilds = 64 tps = 30887
nchilds = 128 tps = 30015
nchilds = 256 tps = 27837

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Amit Kapila
Date:
On Wed, Feb 12, 2020 at 7:36 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Wed, 12 Feb 2020 at 00:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > I took a brief look through this patch.  I agree with the fundamental
> > idea that we shouldn't need to use the heavyweight lock manager for
> > relation extension, since deadlock is not a concern and no backend
> > should ever need to hold more than one such lock at once.  But it feels
> > to me like this particular solution is rather seriously overengineered.
> > I would like to suggest that we do something similar to Robert Haas'
> > excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
> > that is,
> >
> > * Create some predetermined number N of LWLocks for relation extension.
>
> My original proposal used LWLocks and hash tables for relation
> extension, but there was a discussion that using LWLocks is not good
> because they're not interruptible[1]. For that reason, and because we
> don't need two lock levels (shared, exclusive) for relation extension
> locks, we ended up implementing a dedicated lock manager for extension
> locks. I think we will have that problem if we use LWLocks.
>

Hmm, but we use LWLocks for (a) WALWrite/Flush (see the usage of
WALWriteLock), (b) writing the shared buffer contents (see
io_in_progress lock and its usage in FlushBuffer), and maybe a few
other similar things.  Many times those take more time than extending a
block in a relation, especially when we combine the WAL write for
multiple commits.  So, if this is a problem for the relation extension
lock, then the same thing holds true there also.  Now, there are cases
like when we extend the relation with multiple blocks, finding a victim
buffer under this lock, etc. where this can be equally or more
costly, but I think we can improve some of those cases (some of this
is even pointed out by Andres in his email) if we agree on the
fundamental idea of using LWLocks as proposed by Tom.  I am not saying
that we should implement Tom's idea without weighing its pros and
cons, but it has an appeal due to its simplicity.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Amit Kapila
Date:
On Wed, Feb 12, 2020 at 10:24 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-02-11 08:01:34 +0530, Amit Kapila wrote:
> > I don't see much downside with the patch, rather there is a
> > performance increase of 3-9% in various scenarios.
>
> As I wrote in [1] I started to look at this patch. My problem with it is
> that it just seems like the wrong direction architecturally to
> me. There are two main aspects to this:
>
> 1) It basically builds another, more lightweight but less capable, lock
>    manager that can lock more objects than we can have distinct
>    locks for.  It is faster because it uses *one* hashtable without
>    conflict handling, because it has fewer lock modes, and because it
>    doesn't support detecting deadlocks. And probably some other things.
>
> 2) A lot of the contention around file extension comes from us doing
>    multiple expensive things under one lock (determining current
>    relation size, searching victim buffer, extending file), and in tiny
>    increments (growing a 1TB table by 8kb). This patch doesn't address
>    that at all.
>

It seems to me that both points address the performance angle of the
patch, but our actual intention here was to make this lock block among
parallel workers so that we can implement/improve some of the parallel
write operations (like vacuuming the heap or index in parallel,
parallel bulk load, etc.).  Both are independently worth
accomplishing, but not w.r.t. parallel writes.  Here, we were doing
some benchmarking to confirm that we haven't regressed performance in
any case.

> I've focused on 1) in the email referenced above ([1]). Here I'll focus
> on 2).
>
>
>
> Which, I think, pretty clearly shows a few things:
>

I agree with all your below observations.

> 1) It's crucial to move acquiring a victim buffer to the outside of the
>    extension lock, as for copy acquiring the victim buffer will commonly
>    cause a buffer having to be written out, due to the ringbuffer. This
>    is even more crucial when using a logged table, as the writeout then
>    also will often also trigger a WAL flush.
>
>    While doing so will sometimes add a round of acquiring the buffer
>    mapping locks, having to do the FlushBuffer while holding the
>    extension lock is a huge problem.
>
>    This'd also move a good bit of the cost of finding (i.e. clock sweep
>    / ringbuffer replacement) and invalidating the old buffer mapping out
>    of the lock.
>

I think this is mostly because of the way the code is currently
arranged to extend a block via the ReadBuffer* API.  IIUC, currently
the main operations under the relation extension lock are as follows:
a. get the block number for extension via smgrnblocks.
b. find victim buffer
c. associate buffer with the block no. found in step-a.
d. initialize the block with zeros
e. write the block
f.  PageInit

I think if we can rearrange things so that steps b and c are done after
e or f, then we don't need to hold the extension lock to find the
victim buffer.
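Sketched as pseudocode (the function names approximate the real ones, and concurrency details, such as another backend looking up the new block before its buffer exists, are glossed over), the rearrangement might look like:

```
LockRelationForExtension(rel, ExclusiveLock);
blockNum = smgrnblocks(rel);           /* a: pick the new block number */
memset(page, 0, BLCKSZ);               /* d: zero a local page image   */
smgrextend(rel, blockNum, page);       /* e: write the new block       */
UnlockRelationForExtension(rel, ExclusiveLock);

buf = <find victim buffer>;            /* b: now outside the lock      */
<associate buf with blockNum>;         /* c: buffer-mapping update     */
PageInit(BufferGetPage(buf), ...);     /* f: initialize the page       */
```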

> 2) We need to make the smgrwrite more efficient, it is costing a lot of
>    time. A small additional experiment shows the cost of doing 8kb
>    writes:
>
>    I wrote a small program that just iteratively writes a 32GB file:
>
..
>
>
>    So it looks like extending the file with posix_fallocate() might be a
>    winner, but only if we actually can do so in larger chunks than 8kb
>    at once.
>

A good experiment and sounds like worth doing.

>
>
> 3) We should move the PageInit() that's currently done with the
>    extension lock held, to the outside. Since we get the buffer with
>    RBM_ZERO_AND_LOCK these days, that should be safe.  Also, we don't
>    need to zero the entire buffer both in RelationGetBufferForTuple()'s
>    PageInit(), and in ReadBuffer_common() before calling smgrextend().
>

Agreed.

I feel all three are independent improvements and can be done separately.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Amit Kapila <amit.kapila16@gmail.com> writes:
> On Wed, Feb 12, 2020 at 7:36 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
>> On Wed, 12 Feb 2020 at 00:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I would like to suggest that we do something similar to Robert Haas'
>>> excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,

>> My original proposal used LWLocks and hash tables for relation
>> extension but there was a discussion that using LWLocks is not good
>> because it's not interruptible[1].

> Hmm, but we use LWLocks for (a) WALWrite/Flush (see the usage of
> WALWriteLock), (b) writing the shared buffer contents (see
> io_in_progress lock and its usage in FlushBuffer) and might be for few
> other similar stuff.  Many times those take more time than extending a
> block in relation especially when we combine the WAL write for
> multiple commits.  So, if this is a problem for relation extension
> lock, then the same thing holds true there also.

Yeah.  I would say a couple more things:

* I see no reason to think that a relation extension lock would ever
be held long enough for noninterruptibility to be a real issue.  Our
expectations for query cancel response time are in the tens to
hundreds of msec anyway.

* There are other places where an LWLock can be held for a *long* time,
notably the CheckpointLock.  If we do think this is an issue, we could
devise a way to not insist on noninterruptibility.  The easiest fix
is just to do a matching RESUME_INTERRUPTS after getting the lock and
HOLD_INTERRUPTS again before releasing it; though maybe it'd be worth
offering some slightly cleaner way.  Point here is that LWLockAcquire
only does that because it's useful to the majority of callers, not
because it's graven in stone that it must be like that.
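The pairing Tom suggests, mocked here so it is self-contained: in PostgreSQL these macros adjust the real InterruptHoldoffCount, and the hold/resume marked as happening inside LWLockAcquire/LWLockRelease is what the lock code already does today.

```c
/*
 * Mock of the HOLD_INTERRUPTS/RESUME_INTERRUPTS dance: a plain counter
 * stands in for InterruptHoldoffCount so the pairing is visible.
 */
static int InterruptHoldoffCount = 0;	/* mock of the real counter */

#define HOLD_INTERRUPTS()	(InterruptHoldoffCount++)
#define RESUME_INTERRUPTS()	(InterruptHoldoffCount--)

static void
long_operation_under_lock(void)
{
	/* LWLockAcquire() does this internally before returning: */
	HOLD_INTERRUPTS();

	RESUME_INTERRUPTS();	/* caller undoes the hold ... */
	/* ... long stretch of work that can now notice interrupts ... */
	HOLD_INTERRUPTS();		/* ... and restores it before releasing */

	/* LWLockRelease() does this internally: */
	RESUME_INTERRUPTS();
}
```

The counter nets out to zero across the call, which is the invariant the matching RESUME/HOLD pair has to preserve.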

In general, if we think there are issues with LWLock, it seems to me
we'd be better off to try to fix them, not to invent a whole new
single-purpose lock manager that we'll have to debug and maintain.
I do not see anything about this problem that suggests that that would
provide a major win.  As Andres has noted, there are lots of other
aspects of it that are likely to be more useful to spend effort on.

            regards, tom lane



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Amit Kapila
Date:
On Wed, Feb 12, 2020 at 10:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > On Wed, Feb 12, 2020 at 7:36 AM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> >> On Wed, 12 Feb 2020 at 00:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> I would like to suggest that we do something similar to Robert Haas'
> >>> excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
>
> >> My original proposal used LWLocks and hash tables for relation
> >> extension but there was a discussion that using LWLocks is not good
> >> because it's not interruptible[1].
>
> > Hmm, but we use LWLocks for (a) WALWrite/Flush (see the usage of
> > WALWriteLock), (b) writing the shared buffer contents (see
> > io_in_progress lock and its usage in FlushBuffer) and might be for few
> > other similar stuff.  Many times those take more time than extending a
> > block in relation especially when we combine the WAL write for
> > multiple commits.  So, if this is a problem for relation extension
> > lock, then the same thing holds true there also.
>
> Yeah.  I would say a couple more things:
>
> * I see no reason to think that a relation extension lock would ever
> be held long enough for noninterruptibility to be a real issue.  Our
> expectations for query cancel response time are in the tens to
> hundreds of msec anyway.
>
> * There are other places where an LWLock can be held for a *long* time,
> notably the CheckpointLock.  If we do think this is an issue, we could
> devise a way to not insist on noninterruptibility.  The easiest fix
> is just to do a matching RESUME_INTERRUPTS after getting the lock and
> HOLD_INTERRUPTS again before releasing it; though maybe it'd be worth
> offering some slightly cleaner way.
>

Yeah, this sounds like the better answer for the noninterruptibility
aspect of this design.  One idea that occurred to me was to pass a
parameter to the LWLock acquire/release APIs to indicate whether to
hold/resume interrupts, but I don't know if that is any better than
doing it at the required places.  I am not sure all callers are
careful about whether they really want to hold interrupts, so if we
provide a new parameter, at least new users of the API will think
about it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Amit Kapila
Date:
On Tue, Feb 11, 2020 at 9:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I took a brief look through this patch.  I agree with the fundamental
> idea that we shouldn't need to use the heavyweight lock manager for
> relation extension, since deadlock is not a concern and no backend
> should ever need to hold more than one such lock at once.  But it feels
> to me like this particular solution is rather seriously overengineered.
> I would like to suggest that we do something similar to Robert Haas'
> excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
> that is,
>
> * Create some predetermined number N of LWLocks for relation extension.
> * When we want to extend some relation R, choose one of those locks
>   (say, R's relfilenode number mod N) and lock it.
>

I am imagining something along the lines of BufferIOLWLockArray (here
it will be RelExtLWLockArray).  The size (N) could be MaxBackends or
some percentage of it (depending on testing) and indexing into the
array could be as suggested (R's relfilenode number mod N).  We need to
initialize this during shared memory initialization.  Then, to extend
the relation with multiple blocks at-a-time (as we do in
RelationAddExtraBlocks), we can either use the already-proven
group-clear-xid technique (see ProcArrayGroupClearXid) or have an
additional state in the RelExtLWLockArray which will keep the count of
waiters (as done in the latest patch of Sawada-san [1]).  We might want
to experiment with both approaches and see which yields better results.

[1] - https://www.postgresql.org/message-id/CAD21AoADkWhkLEB_%3DkjLZeZ_ML9_hSQqNBWz%2Bd821QHf%3DO9LJQ%40mail.gmail.com
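The mapping described above can be sketched as follows. Pthread mutexes stand in for LWLocks, and all names here (RelExtLockArray, N_RELEXT_LOCKS) are illustrative, not taken from any patch:

```c
/*
 * Sketch of a fixed pool of N locks for relation extension, indexed by
 * relfilenode mod N.  Two relations may map to the same slot (false
 * sharing); that is harmless for correctness, since the lock only
 * serializes extension, and Tom argues it is unlikely to matter for
 * throughput given lower-level filesystem bottlenecks.
 */
#include <pthread.h>

#define N_RELEXT_LOCKS 1024		/* predetermined pool size N */

static pthread_mutex_t RelExtLockArray[N_RELEXT_LOCKS];

static unsigned
relext_lock_index(unsigned relfilenode)
{
	return relfilenode % N_RELEXT_LOCKS;
}

static void
lock_for_extension(unsigned relfilenode)
{
	pthread_mutex_lock(&RelExtLockArray[relext_lock_index(relfilenode)]);
}

static void
unlock_for_extension(unsigned relfilenode)
{
	pthread_mutex_unlock(&RelExtLockArray[relext_lock_index(relfilenode)]);
}
```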
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Dilip Kumar
Date:
On Thu, Feb 13, 2020 at 9:46 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 11, 2020 at 9:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > I took a brief look through this patch.  I agree with the fundamental
> > idea that we shouldn't need to use the heavyweight lock manager for
> > relation extension, since deadlock is not a concern and no backend
> > should ever need to hold more than one such lock at once.  But it feels
> > to me like this particular solution is rather seriously overengineered.
> > I would like to suggest that we do something similar to Robert Haas'
> > excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
> > that is,
> >
> > * Create some predetermined number N of LWLocks for relation extension.
> > * When we want to extend some relation R, choose one of those locks
> >   (say, R's relfilenode number mod N) and lock it.
> >
>
> I am imagining something on the lines of BufferIOLWLockArray (here it
> will be RelExtLWLockArray).  The size (N) could MaxBackends or some
> percentage of it (depending on testing) and indexing into an array
> could be as suggested (R's relfilenode number mod N).  We need to
> initialize this during shared memory initialization.  Then, to extend
> the relation with multiple blocks at-a-time (as we do in
> RelationAddExtraBlocks), we can either use the already proven
> technique of group clear xid mechanism (see ProcArrayGroupClearXid) or
> have an additional state in the RelExtLWLockArray which will keep the
> count of waiters (as done in latest patch of Sawada-san [1]).  We
> might want to experiment with both approaches and see which yields
> better results.

IMHO, in this case there is no point in using the "group clear" type
of mechanism, mainly for two reasons: 1) it will unnecessarily make
the PGPROC structure heavier;
2) in our case, we don't need any specific pieces of information from
the other waiters, we just need the count.


Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Mahendra Singh Thalor
Date:
On Thu, 13 Feb 2020 at 09:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 11, 2020 at 9:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > I took a brief look through this patch.  I agree with the fundamental
> > idea that we shouldn't need to use the heavyweight lock manager for
> > relation extension, since deadlock is not a concern and no backend
> > should ever need to hold more than one such lock at once.  But it feels
> > to me like this particular solution is rather seriously overengineered.
> > I would like to suggest that we do something similar to Robert Haas'
> > excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
> > that is,
> >
> > * Create some predetermined number N of LWLocks for relation extension.
> > * When we want to extend some relation R, choose one of those locks
> >   (say, R's relfilenode number mod N) and lock it.
> >
>
> I am imagining something on the lines of BufferIOLWLockArray (here it
> will be RelExtLWLockArray).  The size (N) could MaxBackends or some
> percentage of it (depending on testing) and indexing into an array
> could be as suggested (R's relfilenode number mod N).  We need to
> initialize this during shared memory initialization.  Then, to extend
> the relation with multiple blocks at-a-time (as we do in
> RelationAddExtraBlocks), we can either use the already proven
> technique of group clear xid mechanism (see ProcArrayGroupClearXid) or
> have an additional state in the RelExtLWLockArray which will keep the
> count of waiters (as done in latest patch of Sawada-san [1]).  We
> might want to experiment with both approaches and see which yields
> better results.

Thanks all for the suggestions. I have started working on the
implementation based on the suggestions.  I will post a patch for this
in a few days.

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Thu, 13 Feb 2020 at 13:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 11, 2020 at 9:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > I took a brief look through this patch.  I agree with the fundamental
> > idea that we shouldn't need to use the heavyweight lock manager for
> > relation extension, since deadlock is not a concern and no backend
> > should ever need to hold more than one such lock at once.  But it feels
> > to me like this particular solution is rather seriously overengineered.
> > I would like to suggest that we do something similar to Robert Haas'
> > excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
> > that is,
> >
> > * Create some predetermined number N of LWLocks for relation extension.
> > * When we want to extend some relation R, choose one of those locks
> >   (say, R's relfilenode number mod N) and lock it.
> >
>
> I am imagining something on the lines of BufferIOLWLockArray (here it
> will be RelExtLWLockArray).  The size (N) could MaxBackends or some
> percentage of it (depending on testing) and indexing into an array
> could be as suggested (R's relfilenode number mod N).

I'm not sure it's good for the contention on LWLock slots to depend on
MaxBackends, because it means that the larger MaxBackends is, the less
the LWLock slots conflict, even if the same number of backends are
actually connecting. Normally we don't want to increase MaxBackends
unnecessarily, for security reasons. In the current patch we defined a
fixed-length array for extension locks, but I agree that we need to
determine which approach is best through testing.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Amit Kapila
Date:
On Fri, Feb 14, 2020 at 11:42 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Thu, 13 Feb 2020 at 13:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Feb 11, 2020 at 9:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >
> > > I took a brief look through this patch.  I agree with the fundamental
> > > idea that we shouldn't need to use the heavyweight lock manager for
> > > relation extension, since deadlock is not a concern and no backend
> > > should ever need to hold more than one such lock at once.  But it feels
> > > to me like this particular solution is rather seriously overengineered.
> > > I would like to suggest that we do something similar to Robert Haas'
> > > excellent hack (daa7527af) for the !HAVE_SPINLOCK case in lmgr/spin.c,
> > > that is,
> > >
> > > * Create some predetermined number N of LWLocks for relation extension.
> > > * When we want to extend some relation R, choose one of those locks
> > >   (say, R's relfilenode number mod N) and lock it.
> > >
> >
> > I am imagining something on the lines of BufferIOLWLockArray (here it
> > will be RelExtLWLockArray).  The size (N) could MaxBackends or some
> > percentage of it (depending on testing) and indexing into an array
> > could be as suggested (R's relfilenode number mod N).
>
> I'm not sure it's good that the contention of LWLock slot depends on
> MaxBackends. Because it means that the more MaxBackends is larger, the
> less the LWLock slot conflicts, even if the same number of backends
> actually connecting. Normally we don't want to increase unnecessarily
> MaxBackends for security reasons. In the current patch we defined a
> fixed length of array for extension lock but I agree that we need to
> determine what approach is the best depending on testing.
>

I think MaxBackends will generally limit the number of different
relations that can simultaneously extend, but maybe tables with many
partitions might change the situation.  You are right that some tests
might suggest a good number; let Mahendra write a patch and then we
can test it.  Do you have any better idea?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Amit Kapila <amit.kapila16@gmail.com> writes:
> I think MaxBackends will generally limit the number of different
> relations that can simultaneously extend, but maybe tables with many
> partitions might change the situation.  You are right that some tests
> might suggest a good number, let Mahendra write a patch and then we
> can test it.  Do you have any better idea?

In the first place, there certainly isn't more than one extension
happening at a time per backend, else the entire premise of this
thread is wrong.  Handwaving about partitions won't change that.

In the second place, it's ludicrous to expect that the underlying
platform/filesystem can support an infinite number of concurrent
file-extension operations.  At some level (e.g. where disk blocks
are handed out, or where a record of the operation is written to
a filesystem journal) it's quite likely that things are bottlenecked
down to *one* such operation at a time per filesystem.  So I'm not
that concerned about occasional false-sharing limiting our ability
to issue concurrent requests.  There are probably worse restrictions
at lower levels.

            regards, tom lane



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Wed, Feb 12, 2020 at 11:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah.  I would say a couple more things:
>
> * I see no reason to think that a relation extension lock would ever
> be held long enough for noninterruptibility to be a real issue.  Our
> expectations for query cancel response time are in the tens to
> hundreds of msec anyway.

I don't agree, because (1) the time to perform a relation extension on
a busy system can be far longer than that and (2) if the disk is
failing, then it can be *really* long, or indefinite.

> * There are other places where an LWLock can be held for a *long* time,
> notably the CheckpointLock.  If we do think this is an issue, we could
> devise a way to not insist on noninterruptibility.  The easiest fix
> is just to do a matching RESUME_INTERRUPTS after getting the lock and
> HOLD_INTERRUPTS again before releasing it; though maybe it'd be worth
> offering some slightly cleaner way.  Point here is that LWLockAcquire
> only does that because it's useful to the majority of callers, not
> because it's graven in stone that it must be like that.

That's an interesting idea, but it doesn't make the lock acquisition
itself interruptible, which seems pretty important to me in this case.
I wonder if we could have an LWLockAcquireInterruptibly() or some such
that allows the lock acquisition itself to be interruptible. I think
that would require some rejiggering but it might be doable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Feb 12, 2020 at 11:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * I see no reason to think that a relation extension lock would ever
>> be held long enough for noninterruptibility to be a real issue.  Our
>> expectations for query cancel response time are in the tens to
>> hundreds of msec anyway.

> I don't agree, because (1) the time to perform a relation extension on
> a busy system can be far longer than that and (2) if the disk is
> failing, then it can be *really* long, or indefinite.

I remain unconvinced ... wouldn't both of those claims apply to any disk
I/O request?  Are we going to try to ensure that no I/O ever happens
while holding an LWLock, and if so how?  (Again, CheckpointLock is a
counterexample, which has been that way for decades without reported
problems.  But actually I think buffer I/O locks are an even more
direct counterexample.)

>> * There are other places where an LWLock can be held for a *long* time,
>> notably the CheckpointLock.  If we do think this is an issue, we could
>> devise a way to not insist on noninterruptibility.  The easiest fix
>> is just to do a matching RESUME_INTERRUPTS after getting the lock and
>> HOLD_INTERRUPTS again before releasing it; though maybe it'd be worth
>> offering some slightly cleaner way.  Point here is that LWLockAcquire
>> only does that because it's useful to the majority of callers, not
>> because it's graven in stone that it must be like that.

> That's an interesting idea, but it doesn't make the lock acquisition
> itself interruptible, which seems pretty important to me in this case.

Good point: if you think the contained operation might run too long to
suit you, then you don't want other backends to be stuck behind it for
the same amount of time.

> I wonder if we could have an LWLockAcquireInterruptibly() or some such
> that allows the lock acquisition itself to be interruptible. I think
> that would require some rejiggering but it might be doable.

Yeah, I had the impression from a brief look at LWLockAcquire that
it was itself depending on not throwing errors partway through.
But with careful and perhaps-a-shade-slower coding, we could probably
make a version that didn't require that.

            regards, tom lane



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Fri, Feb 14, 2020 at 10:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I remain unconvinced ... wouldn't both of those claims apply to any disk
> I/O request?  Are we going to try to ensure that no I/O ever happens
> while holding an LWLock, and if so how?  (Again, CheckpointLock is a
> counterexample, which has been that way for decades without reported
> problems.  But actually I think buffer I/O locks are an even more
> direct counterexample.)

Yes, that's a problem. I proposed a patch a few years ago that
replaced the buffer I/O locks with condition variables, and I think
that's a good idea for lots of reasons, including this one. I never
quite got around to pushing that through to commit, but I think we
should do that. Aside from fixing this problem, it also prevents
certain scenarios where we can currently busy-loop.
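In miniature, the condition-variable scheme looks like this. Pthreads are used as a stand-in for PostgreSQL's ConditionVariable API, and the single global flag is a simplification: the real buffer code tracks I/O state per buffer.

```c
/*
 * Waiters sleep on a condition variable until the backend doing the I/O
 * broadcasts completion, instead of queueing on an LWLock that the I/O
 * issuer holds for the whole (possibly very long) I/O.
 */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t io_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t io_done = PTHREAD_COND_INITIALIZER;
static bool io_in_progress = false;

static void
start_io(void)
{
	pthread_mutex_lock(&io_mutex);
	io_in_progress = true;
	pthread_mutex_unlock(&io_mutex);
}

static void
finish_io(void)
{
	pthread_mutex_lock(&io_mutex);
	io_in_progress = false;
	pthread_cond_broadcast(&io_done);	/* wake all waiters */
	pthread_mutex_unlock(&io_mutex);
}

static void
wait_for_io(void)
{
	pthread_mutex_lock(&io_mutex);
	while (io_in_progress)				/* loop: spurious wakeups */
		pthread_cond_wait(&io_done, &io_mutex);
	pthread_mutex_unlock(&io_mutex);
}
```

Because waiters re-check the flag in a loop rather than spinning on a lock that may be retaken, this shape also avoids the busy-loop scenarios mentioned above.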

I do realize that we're unlikely to ever solve this problem
completely, but I don't think that should discourage us from making
incremental progress. Just as debuggability is a sticking point for
you, what I'm going to call operate-ability is a sticking point for
me. My work here at EnterpriseDB exposes me on a fairly regular basis
to real broken systems, and I'm therefore really sensitive to the
concerns that people have when trying to recover after a system has
become, for one reason or another, really broken.

Interruptibility may not be the #1 concern in that area, but it's very
high on the list. EnterpriseDB customers, as a rule, *really* hate
being told to restart the database because one session is stuck. It
causes a lot of disruption for them and the person who does the
restart gets yelled at by their boss, and maybe their boss's boss and
the boss above that. It means that their whole application, which may
be mission-critical, is down until the database finishes restarting,
and that is not always a quick process, especially after an immediate
shutdown. I don't think we can ever make everything that can get stuck
interruptible, but the more we can do the better.

The work you and others have done over the years to add
CHECK_FOR_INTERRUPTS() to more places pays real dividends. Making
sessions that are blocked on disk I/O interruptible in at least some
of the more common cases would be a huge win. Other people may well
have different experiences, but my experience is that the disk
deciding to conk out for a while or just respond very very slowly is a
very common problem even (and sometimes especially) on very expensive
hardware. Obviously that's not great and you're in lots of trouble,
but being able to hit ^C and get control back significantly improves
your chances of being able to understand what has happened and recover
from it.

> > That's an interesting idea, but it doesn't make the lock acquisition
> > itself interruptible, which seems pretty important to me in this case.
>
> Good point: if you think the contained operation might run too long to
> suit you, then you don't want other backends to be stuck behind it for
> the same amount of time.

Right.

> > I wonder if we could have an LWLockAcquireInterruptibly() or some such
> > that allows the lock acquisition itself to be interruptible. I think
> > that would require some rejiggering but it might be doable.
>
> Yeah, I had the impression from a brief look at LWLockAcquire that
> it was itself depending on not throwing errors partway through.
> But with careful and perhaps-a-shade-slower coding, we could probably
> make a version that didn't require that.

Yeah, that was my thought, too, but I didn't study it that carefully,
so somebody would need to do that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Andres Freund
Date:
Hi,

On 2020-02-12 11:53:49 -0500, Tom Lane wrote:
> In general, if we think there are issues with LWLock, it seems to me
> we'd be better off to try to fix them, not to invent a whole new
> single-purpose lock manager that we'll have to debug and maintain.

My impression is that what's being discussed here is doing exactly that,
except with s/lwlock/heavyweight locks/. We're basically replacing the
lock.c lock mapping table with an ad-hoc implementation, and now we're
also reinventing interruptability etc.

I still find the performance arguments pretty ludicrous, to be honest -
I think the numbers I posted about how much time we spend with the locks
held, back that up.  I have a bit more understanding for the parallel
worker arguments, but only a bit:

I think if we develop a custom solution for the extension lock, we're
just going to end up having to develop another custom solution for a
bunch of other types of locks.  It seems quite likely that we'll also
end up wanting TUPLE, SPECULATIVE, and PAGE type locks that we
don't want to share between leader & workers.

IMO the right thing here is to extend lock.c so we can better represent
whether certain types of lockmethods (& levels ?) are [not] to be
shared.

Greetings,

Andres Freund



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Andres Freund
Date:
Hi,

On 2020-02-14 09:42:40 -0500, Tom Lane wrote:
> In the second place, it's ludicrous to expect that the underlying
> platform/filesystem can support an infinite number of concurrent
> file-extension operations.  At some level (e.g. where disk blocks
> are handed out, or where a record of the operation is written to
> a filesystem journal) it's quite likely that things are bottlenecked
> down to *one* such operation at a time per filesystem.

That's probably true to some degree from a theoretical POV, but I think
it's so far from where we are at, that it's effectively wrong. I can
concurrently extend a few files at close to 10GB/s on a set of fast
devices below a *single* filesystem. Whereas postgres bottlenecks far
far before this.  Given that a lot of today's storage has latencies in
the 10-100s of microseconds, a journal flush doesn't necessarily cause
that much serialization - and OS journals do group commit like things
too.

Greetings,

Andres Freund



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Fri, Feb 14, 2020 at 11:40 AM Andres Freund <andres@anarazel.de> wrote:
> IMO the right thing here is to extend lock.c so we can better represent
> whether certain types of lockmethods (& levels ?) are [not] to be
> shared.

The part that I find awkward about that is the whole thing with the
deadlock detector. The deadlock detection code is old, crufty,
complex, and very difficult to test (or at least I have found it so).
A bug that I introduced when inventing group locking took like 5 years
for somebody to find.

One way of looking at the requirement that we have here is that
certain kinds of locks need to be exempted from group locking.
Basically, that is because they are a lower-level concept: a lock on
a relation is more of a "logical" concept, and you hold the lock until
eoxact, whereas a lock used to extend the relation is more of a
"physical" concept, and you give it up as soon as you are done. Page
locks are like relation extension locks in this regard. Unlike locks
on SQL-level objects, these should not be shared between members of a
lock group.

Now, if it weren't for the deadlock detector, that would be easy
enough. But figuring out what to do with the deadlock detector seems
really painful to me. I wonder if there's some way we can make an end
run around that problem. For instance, if we could make (and enforce)
a coding rule that you cannot acquire a heavyweight lock while holding
a relation extension or page lock, then maybe we could somehow teach
the deadlock detector to just ignore those kinds of locks, and teach
the lock acquisition machinery that they conflict between lock group
members.

On the other hand, I think you might also be understating the
differences between these kinds of locks and other heavyweight locks.
I suspect that the reason why we use lwlocks for buffers and
heavyweight locks here is because there are a conceptually infinite
number of relations, and lwlocks can't handle that. The only mechanism
we currently have that does handle that is the heavyweight lock
mechanism, and from my point of view, somebody just beat it with a
stick to make it fit this application. But the fact that it has been
made to fit does not mean that it is really fit for purpose. We use 2
of 9 lock levels, we don't need deadlock detection, we need different
behavior when group locking is in use, we release locks right away
rather than at eoxact. I don't think it's crazy to think that those
differences are significant enough to justify having a separate
mechanism, even if the one that is currently on the table is not
exactly what we want.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Andres Freund
Date:
Hi,

On 2020-02-14 12:08:45 -0500, Robert Haas wrote:
> On Fri, Feb 14, 2020 at 11:40 AM Andres Freund <andres@anarazel.de> wrote:
> > IMO the right thing here is to extend lock.c so we can better represent
> > whether certain types of lockmethods (& levels ?) are [not] to be
> > shared.
> 
> The part that I find awkward about that is the whole thing with the
> deadlock detector. The deadlock detection code is old, crufty,
> complex, and very difficult to test (or at least I have found it so).
> A bug that I introduced when inventing group locking took like 5 years
> for somebody to find.

Oh, I agree, lock.c and surrounding code is pretty crufty. Doubtful that
just building up a largely parallel piece of infrastructure next to it
is a good answer though.


> One way of looking at the requirement that we have here is that
> certain kinds of locks need to be exempted from group locking.
> Basically, these are because they are a lower-level concept: a lock on
> a relation is more of a "logical" concept, and you hold the lock until
> eoxact, whereas a lock used to extend the relation is more of a
> "physical" concept, and you give it up as soon as you are done. Page
> locks are like relation extension locks in this regard. Unlike locks
> on SQL-level objects, these should not be shared between members of a
> lock group.
> 
> Now, if it weren't for the deadlock detector, that would be easy
> enough. But figuring out what to do with the deadlock detector seems
> really painful to me. I wonder if there's some way we can make an end
> run around that problem. For instance, if we could make (and enforce)
> a coding rule that you cannot acquire a heavyweight lock while holding
> a relation extension or page lock, then maybe we could somehow teach
> the deadlock detector to just ignore those kinds of locks, and teach
> the lock acquisition machinery that they conflict between lock group
> members.

Yea, that seems possible.  I'm not really sure it's needed however? As
long as you're not teaching the locking mechanism new tricks that
influence the wait graph, why would the deadlock detector care? That's
quite different from the group locking case, where you explicitly needed
to teach it something fairly fundamental.

It might still be a good idea, independently, to add and enforce the
rule that acquiring heavyweight locks while holding certain classes of
locks is not allowed.


> On the other hand, I think you might also be understating the
> differences between these kinds of locks and other heavyweight locks.
> I suspect that the reason why we use lwlocks for buffers and
> heavyweight locks here is because there are a conceptually infinite
> number of relations, and lwlocks can't handle that.

Right. For me that's *the* fundamental service that lock.c delivers. And
it's the fundamental bit this thread so far largely has been focusing
on.


> The only mechanism we currently have that does handle that is the
> heavyweight lock mechanism, and from my point of view, somebody just
> beat it with a stick to make it fit this application. But the fact
> that it has been made to fit does not mean that it is really fit for
> purpose. We use 2 of 9 lock levels, we don't need deadlock detection,
> we need different behavior when group locking is in use, we release
> locks right away rather than at eoxact. I don't think it's crazy to
> think that those differences are significant enough to justify having
> a separate mechanism, even if the one that is currently on the table
> is not exactly what we want.

Isn't that mostly true to varying degrees for the majority of lock types
in lock.c? Sure, perhaps historically that's a misuse of lock.c, but
it's been pretty ingrained by now.  I just don't see where leaving out
any of these features is going to give us fundamental advantages
justifying a different locking infrastructure.

E.g. not needing to support a "conceptually infinite" number of relations
IMO does provide a fundamental advantage - no need for a mapping.  I'm
not yet seeing anything equivalent for the extension vs. lock.c style
lock case.

Greetings,

Andres Freund



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Robert Haas
Date:
On Fri, Feb 14, 2020 at 1:07 PM Andres Freund <andres@anarazel.de> wrote:
> Yea, that seems possible.  I'm not really sure it's needed however? As
> long as you're not teaching the locking mechanism new tricks that
> influence the wait graph, why would the deadlock detector care? That's
> quite different from the group locking case, where you explicitly needed
> to teach it something fairly fundamental.

Well, you have to teach it that locks of certain types conflict even
if they are in the same group, and that bleeds over pretty quickly
into the whole area of deadlock detection, because lock waits are the
edges in the graph that the deadlock detector processes.

> It might still be a good idea, independently, to add and enforce the
> rule that acquiring heavyweight locks while holding certain classes of
> locks is not allowed.

I think that's absolutely essential, if we're going to continue using
the main lock manager for this. I remain somewhat unconvinced that
doing so is the best way forward, but it is *a* way forward.

> Right. For me that's *the* fundamental service that lock.c delivers. And
> it's the fundamental bit this thread so far largely has been focusing
> on.

For me, the deadlock detection is the far more complicated and problematic bit.

> Isn't that mostly true to varying degrees for the majority of lock types
> in lock.c? Sure, perhaps historically that's a misuse of lock.c, but
> it's been pretty ingrained by now.  I just don't see where leaving out
> any of these features is going to give us fundamental advantages
> justifying a different locking infrastructure.

I think the group locking + deadlock detection things are more
fundamental than you might be crediting, but I agree that having
parallel mechanisms has its own set of pitfalls.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Amit Kapila
Date:
On Fri, Feb 14, 2020 at 8:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amit Kapila <amit.kapila16@gmail.com> writes:
> > I think MaxBackends will generally limit the number of different
> > relations that can simultaneously extend, but maybe tables with many
> > partitions might change the situation.  You are right that some tests
> > might suggest a good number, let Mahendra write a patch and then we
> > can test it.  Do you have any better idea?
>
> In the first place, there certainly isn't more than one extension
> happening at a time per backend, else the entire premise of this
> thread is wrong.  Handwaving about partitions won't change that.
>

Having a larger number of partitions theoretically increases the chances
of false sharing for the same number of concurrent sessions.  For
example, compare two sessions operating on two plain relations with two
sessions working on two relations that have 100 partitions each: the
latter increases the chances of false sharing.  Sawada-San and Mahendra
have done many tests on different systems, and monitoring with the
previous patch showed that with a decent number of fixed slots (1024)
there was very little false sharing, and even where it occurred the
effect was close to nothing.  So, in short, this is not the point to
worry about; the point is to ensure that we don't create any
significant regressions in this area.

> In the second place, it's ludicrous to expect that the underlying
> platform/filesystem can support an infinite number of concurrent
> file-extension operations.  At some level (e.g. where disk blocks
> are handed out, or where a record of the operation is written to
> a filesystem journal) it's quite likely that things are bottlenecked
> down to *one* such operation at a time per filesystem.  So I'm not
> that concerned about occasional false-sharing limiting our ability
> to issue concurrent requests.  There are probably worse restrictions
> at lower levels.
>

Agreed and what we have observed during the tests is what you have
said in this paragraph.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Amit Kapila
Date:
On Fri, Feb 14, 2020 at 9:13 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Wed, Feb 12, 2020 at 11:53 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > That's an interesting idea, but it doesn't make the lock acquisition
> > itself interruptible, which seems pretty important to me in this case.
>
> Good point: if you think the contained operation might run too long to
> suit you, then you don't want other backends to be stuck behind it for
> the same amount of time.
>

It is not clear to me why we should add that as a requirement for this
patch when other places like WALWriteLock, etc. have similar coding
patterns and we haven't heard a ton of complaints about making them
interruptible; or if there are such complaints, I am not aware of them.

> > I wonder if we could have an LWLockAcquireInterruptibly() or some such
> > that allows the lock acquisition itself to be interruptible. I think
> > that would require some rejiggering but it might be doable.
>
> Yeah, I had the impression from a brief look at LWLockAcquire that
> it was itself depending on not throwing errors partway through.
> But with careful and perhaps-a-shade-slower coding, we could probably
> make a version that didn't require that.
>

If this becomes a requirement for moving this patch forward, then
surely we can do that.  BTW, what exactly do we need to ensure for it?
Is it something along the lines of ensuring that the error path clears
the state of the lock?  Are we worried that an interrupt handler might
do something that changes the state of the lock we are acquiring?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Andres Freund
Date:
Hi,

On 2020-02-14 13:34:03 -0500, Robert Haas wrote:
> On Fri, Feb 14, 2020 at 1:07 PM Andres Freund <andres@anarazel.de> wrote:
> > Yea, that seems possible.  I'm not really sure it's needed however? As
> > long as you're not teaching the locking mechanism new tricks that
> > influence the wait graph, why would the deadlock detector care? That's
> > quite different from the group locking case, where you explicitly needed
> > to teach it something fairly fundamental.
> 
> Well, you have to teach it that locks of certain types conflict even
> if they are in the same group, and that bleeds over pretty quickly
> into the whole area of deadlock detection, because lock waits are the
> edges in the graph that the deadlock detector processes.

Shouldn't this *theoretically* be doable with changes mostly localized to
lock.c, by not using proc->lockGroupLeader but proc for lock types that
don't support group locking? I do see that deadlock.c largely looks at
->lockGroupLeader, but that kind of doesn't seem right to me.


> > It might still be a good idea, independently, to add and enforce the
> > rule that acquiring heavyweight locks while holding certain classes of
> > locks is not allowed.
> 
> I think that's absolutely essential, if we're going to continue using
> the main lock manager for this. I remain somewhat unconvinced that
> doing so is the best way forward, but it is *a* way forward.

Seems like we should build this part independently of the lock.c/new
infra piece.


> > Right. For me that's *the* fundamental service that lock.c delivers. And
> > it's the fundamental bit this thread so far largely has been focusing
> > on.
> 
> For me, the deadlock detection is the far more complicated and problematic bit.
> 
> > Isn't that mostly true to varying degrees for the majority of lock types
> > in lock.c? Sure, perhaps historically that's a misuse of lock.c, but
> > it's been pretty ingrained by now.  I just don't see where leaving out
> > any of these features is going to give us fundamental advantages
> > justifying a different locking infrastructure.
> 
> I think the group locking + deadlock detection things are more
> fundamental than you might be crediting, but I agree that having
> parallel mechanisms has its own set of pitfalls.

It's possible. But I'm also hesitant to believe that we'll not need
other lock types that conflict between leader/worker, but that still
need deadlock detection.  The more work we want to parallelize, the more
likely that imo will become.

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> On 2020-02-14 13:34:03 -0500, Robert Haas wrote:
>> I think the group locking + deadlock detection things are more
>> fundamental than you might be crediting, but I agree that having
>> parallel mechanisms has its own set of pitfalls.

> It's possible. But I'm also hesitant to believe that we'll not need
> other lock types that conflict between leader/worker, but that still
> need deadlock detection.  The more work we want to parallelize, the more
> likely that imo will become.

Yeah.  The concept that leader and workers can't conflict seems to me
to be dependent, in a very fundamental way, on the assumption that
we only need to parallelize read-only workloads.  I don't think that's
going to have a long half-life.

            regards, tom lane



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Amit Kapila
Date:
On Mon, Feb 17, 2020 at 2:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andres Freund <andres@anarazel.de> writes:
> > On 2020-02-14 13:34:03 -0500, Robert Haas wrote:
> >> I think the group locking + deadlock detection things are more
> >> fundamental than you might be crediting, but I agree that having
> >> parallel mechanisms has its own set of pitfalls.
>
> > It's possible. But I'm also hesitant to believe that we'll not need
> > other lock types that conflict between leader/worker, but that still
> > need deadlock detection.  The more work we want to parallelize, the more
> > likely that imo will become.
>
> Yeah.  The concept that leader and workers can't conflict seems to me
> to be dependent, in a very fundamental way, on the assumption that
> we only need to parallelize read-only workloads.  I don't think that's
> going to have a long half-life.
>

Surely, someday, we will need to solve that problem.  But it is not
clear when, because the operations for which we want to solve the
relation extension lock problem don't seem to require it.  For example,
for parallel copy, or for further improving parallel vacuum to allow
multiple workers to scan and process the heap and individual indexes, we
don't need to change anything in group locking as far as I understand.

Now, for parallel deletes/updates, I think it will depend on how we
choose to parallelize those operations.  I mean, if we decide that each
worker will work on an independent set of pages, as we do for a
sequential scan, we again might not need to change group locking,
unless I am missing something, which is possible.

I think until we know the real need for changing group locking, going
in the direction of what Tom suggested, using an array of LWLocks [1]
to address the problems at hand, is a good idea.  It is not very clear
to me: are we thinking of giving up on Tom's idea [1] and changing
group locking even though nobody has proposed an idea/patch which
requires that?  Or are we thinking that we can do what Tom suggested
for the relation extension lock and also plan to change group locking
for future parallel operations that might require it?

[1] - https://www.postgresql.org/message-id/19443.1581435793%40sss.pgh.pa.us

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Andres Freund
Date:
Hi,

On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
> I think till we know the real need for changing group locking, going
> in the direction of what Tom suggested to use an array of LWLocks [1]
> to address the problems in hand is a good idea.

-many

I think that building yet another locking subsystem is the entirely
wrong idea - especially when there's imo no convincing architectural
reasons to do so.


> It is not very clear to me that are we thinking to give up on Tom's
> idea [1] and change group locking even though it is not clear or at
> least nobody has proposed an idea/patch which requires that?  Or are
> we thinking that we can do what Tom suggested for relation extension
> lock and also plan to change group locking for future parallel
> operations that might require it?

What I'm advocating is that extension locks should continue to go
through lock.c. And yes, that requires some changes to group locking,
but I still don't see why they'd be complicated. And if there are concerns
about the cost of lock.c, I outlined a pretty long list of improvements
that'll help everyone, and I showed that the locking itself isn't
actually a large fraction of the scalability issues that extension has.

Regards,

Andres



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Amit Kapila
Date:
On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
> > I think till we know the real need for changing group locking, going
> > in the direction of what Tom suggested to use an array of LWLocks [1]
> > to address the problems in hand is a good idea.
>
> -many
>
> I think that building yet another locking subsystem is the entirely
> wrong idea - especially when there's imo no convincing architectural
> reasons to do so.
>

Hmm, AFAIU, it will be done by having an array of LWLocks, which we do
in other places as well (like the BufferIO locks).  I am not sure we
can call that a new locking subsystem, but if we decide to continue
using lock.c and change group locking, then I think we can do that as
well; see my comments below regarding that.

>
> > It is not very clear to me that are we thinking to give up on Tom's
> > idea [1] and change group locking even though it is not clear or at
> > least nobody has proposed an idea/patch which requires that?  Or are
> > we thinking that we can do what Tom suggested for relation extension
> > lock and also plan to change group locking for future parallel
> > operations that might require it?
>
> What I'm advocating is that extension locks should continue to go
> through lock.c. And yes, that requires some changes to group locking,
> but I still don't see why they'd be complicated.
>

Fair position.  As per my initial analysis, I think if we do the three
things below, it should work out without switching to a new way of
locking for relation extension or page type locks.
a. As per the discussion above, ensure in code we will never try to
acquire another heavy-weight lock after acquiring relation extension
or page type locks (probably by having Asserts in code or maybe some
other way).
b. Change lock.c so that group locking is not considered for these two
lock types. For ex. in LockCheckConflicts, along with the check (if
(proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
we also check lock->tag and call it a conflict for these two locks.
c. The deadlock detector can ignore checking these two types of locks
because point (a) ensures that those won't lead to deadlock.  One idea
could be that FindLockCycleRecurseMember just ignores these two types
of locks by checking the lock tag.

It is possible that I might be missing something or we could achieve
this some other way as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Mahendra Singh Thalor
Date:
On Mon, 24 Feb 2020 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
> > > I think till we know the real need for changing group locking, going
> > > in the direction of what Tom suggested to use an array of LWLocks [1]
> > > to address the problems in hand is a good idea.
> >
> > -many
> >
> > I think that building yet another locking subsystem is the entirely
> > wrong idea - especially when there's imo no convincing architectural
> > reasons to do so.
> >
>
> Hmm, AFAIU, it will be done by having an array of LWLocks which we do
> at other places as well (like BufferIO locks).  I am not sure if we
> can call it as new locking subsystem, but if we decide to continue
> using lock.c and change group locking then I think we can do that as
> well, see my comments below regarding that.
>
> >
> > > It is not very clear to me that are we thinking to give up on Tom's
> > > idea [1] and change group locking even though it is not clear or at
> > > least nobody has proposed an idea/patch which requires that?  Or are
> > > we thinking that we can do what Tom suggested for relation extension
> > > lock and also plan to change group locking for future parallel
> > > operations that might require it?
> >
> > What I'm advocating is that extension locks should continue to go
> > through lock.c. And yes, that requires some changes to group locking,
> > but I still don't see why they'd be complicated.
> >
>
> Fair position, as per initial analysis, I think if we do below three
> things, it should work out without changing to a new way of locking
> for relation extension or page type locks.
> a. As per the discussion above, ensure in code we will never try to
> acquire another heavy-weight lock after acquiring relation extension
> or page type locks (probably by having Asserts in code or maybe some
> other way).
> b. Change lock.c so that group locking is not considered for these two
> lock types. For ex. in LockCheckConflicts, along with the check (if
> (proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
> we also check lock->tag and call it a conflict for these two locks.
> c. The deadlock detector can ignore checking these two types of locks
> because point (a) ensures that those won't lead to deadlock.  One idea
> could be that FindLockCycleRecurseMember just ignores these two types
> of locks by checking the lock tag.

Thanks, Amit, for the summary.

Based on the above 3 points, I am attaching 2 patches here for review.

1. v01_0001-Conflict-EXTENTION-lock-in-group-member.patch (Patch by Dilip Kumar)
Basically this patch is for point b and c.

2. v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patch
(Patch by me)
This patch is for point a.

After applying both the patches, make check-world is passing.

We are testing both the patches and will post results.

Thoughts?

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Wed, Mar 4, 2020 at 11:45 AM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:
>
> On Mon, 24 Feb 2020 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > Hi,
> > >
> > > On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
> > > > I think till we know the real need for changing group locking, going
> > > > in the direction of what Tom suggested to use an array of LWLocks [1]
> > > > to address the problems in hand is a good idea.
> > >
> > > -many
> > >
> > > I think that building yet another locking subsystem is the entirely
> > > wrong idea - especially when there's imo no convincing architectural
> > > reasons to do so.
> > >
> >
> > Hmm, AFAIU, it will be done by having an array of LWLocks which we do
> > at other places as well (like BufferIO locks).  I am not sure if we
> > can call it as new locking subsystem, but if we decide to continue
> > using lock.c and change group locking then I think we can do that as
> > well, see my comments below regarding that.
> >
> > >
> > > > It is not very clear to me that are we thinking to give up on Tom's
> > > > idea [1] and change group locking even though it is not clear or at
> > > > least nobody has proposed an idea/patch which requires that?  Or are
> > > > we thinking that we can do what Tom suggested for relation extension
> > > > lock and also plan to change group locking for future parallel
> > > > operations that might require it?
> > >
> > > What I'm advocating is that extension locks should continue to go
> > > through lock.c. And yes, that requires some changes to group locking,
> > > but I still don't see why they'd be complicated.
> > >
> >
> > Fair position, as per initial analysis, I think if we do below three
> > things, it should work out without changing to a new way of locking
> > for relation extension or page type locks.
> > a. As per the discussion above, ensure in code we will never try to
> > acquire another heavy-weight lock after acquiring relation extension
> > or page type locks (probably by having Asserts in code or maybe some
> > other way).
> > b. Change lock.c so that group locking is not considered for these two
> > lock types. For ex. in LockCheckConflicts, along with the check (if
> > (proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
> > we also check lock->tag and call it a conflict for these two locks.
> > c. The deadlock detector can ignore checking these two types of locks
> > because point (a) ensures that those won't lead to deadlock.  One idea
> > could be that FindLockCycleRecurseMember just ignores these two types
> > of locks by checking the lock tag.
>
> Thanks Amit for summary.
>
> Based on above 3 points, here attaching 2 patches for review.
>
> 1. v01_0001-Conflict-EXTENTION-lock-in-group-member.patch (Patch by Dilip Kumar)
> Basically this patch is for point b and c.
>
> 2. v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patch
> (Patch by me)
> This patch is for point a.
>
> After applying both the patches, make check-world is passing.
>
> We are testing both the patches and will post results.
>
> Thoughts?

+static void AssertAnyExtentionLockHeadByMe(void);

+/*
+ * AssertAnyExtentionLockHeadByMe -- test whether any EXTENSION lock held by
+ * this backend.  If any EXTENSION lock is hold by this backend, then assert
+ * will fail.  To use this function, assert should be enabled.
+ */
+void AssertAnyExtentionLockHeadByMe()
+{

Some minor observations on 0002.
1. static is missing in a function definition.
2. Function name should start in new line after function return type
in function definition, as per pg guideline.
+void AssertAnyExtentionLockHeadByMe()
->
void
AssertAnyExtentionLockHeadByMe()

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Mahendra Singh Thalor
Date:
On Wed, 4 Mar 2020 at 12:03, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Mar 4, 2020 at 11:45 AM Mahendra Singh Thalor
> <mahi6run@gmail.com> wrote:
> >
> > On Mon, 24 Feb 2020 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
> > > > > I think till we know the real need for changing group locking, going
> > > > > in the direction of what Tom suggested to use an array of LWLocks [1]
> > > > > to address the problems in hand is a good idea.
> > > >
> > > > -many
> > > >
> > > > I think that building yet another locking subsystem is the entirely
> > > > wrong idea - especially when there's imo no convincing architectural
> > > > reasons to do so.
> > > >
> > >
> > > Hmm, AFAIU, it will be done by having an array of LWLocks which we do
> > > at other places as well (like BufferIO locks).  I am not sure if we
> > > can call it as new locking subsystem, but if we decide to continue
> > > using lock.c and change group locking then I think we can do that as
> > > well, see my comments below regarding that.
> > >
> > > >
> > > > > It is not very clear to me that are we thinking to give up on Tom's
> > > > > idea [1] and change group locking even though it is not clear or at
> > > > > least nobody has proposed an idea/patch which requires that?  Or are
> > > > > we thinking that we can do what Tom suggested for relation extension
> > > > > lock and also plan to change group locking for future parallel
> > > > > operations that might require it?
> > > >
> > > > What I'm advocating is that extension locks should continue to go
> > > > through lock.c. And yes, that requires some changes to group locking,
> > > > but I still don't see why they'd be complicated.
> > > >
> > >
> > > Fair position, as per initial analysis, I think if we do below three
> > > things, it should work out without changing to a new way of locking
> > > for relation extension or page type locks.
> > > a. As per the discussion above, ensure in code we will never try to
> > > acquire another heavy-weight lock after acquiring relation extension
> > > or page type locks (probably by having Asserts in code or maybe some
> > > other way).
> > > b. Change lock.c so that group locking is not considered for these two
> > > lock types. For ex. in LockCheckConflicts, along with the check (if
> > > (proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
> > > we also check lock->tag and call it a conflict for these two locks.
> > > c. The deadlock detector can ignore checking these two types of locks
> > > because point (a) ensures that those won't lead to deadlock.  One idea
> > > could be that FindLockCycleRecurseMember just ignores these two types
> > > of locks by checking the lock tag.
> >
> > Thanks Amit for summary.
> >
> > Based on above 3 points, here attaching 2 patches for review.
> >
> > 1. v01_0001-Conflict-EXTENTION-lock-in-group-member.patch (Patch by Dilip Kumar)
> > Basically this patch is for point b and c.
> >
> > 2. v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patch
> > (Patch by me)
> > This patch is for point a.
> >
> > After applying both the patches, make check-world is passing.
> >
> > We are testing both the patches and will post results.

Hi all,

I am planning to test the below 3 points on the v1 patch set:

1. We will check that the newly added assert can be hit by hacking the
code (while holding an extension lock, try to take any heavyweight lock).
2. In FindLockCycleRecurseMember, for testing purposes, we can add an
additional loop to check that no relation extension lock holder has any
outgoing edge.
3. Test that group members are not granted the relation extension lock
(group members should conflict).

Please let me know your thoughts on testing this patch.

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com



On Wed, Mar 4, 2020 at 11:45 AM Mahendra Singh Thalor
<mahi6run@gmail.com> wrote:
>
> On Mon, 24 Feb 2020 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > What I'm advocating is that extension locks should continue to go
> > > through lock.c. And yes, that requires some changes to group locking,
> > > but I still don't see why they'd be complicated.
> > >
> >
> > Fair position, as per initial analysis, I think if we do below three
> > things, it should work out without changing to a new way of locking
> > for relation extension or page type locks.
> > a. As per the discussion above, ensure in code we will never try to
> > acquire another heavy-weight lock after acquiring relation extension
> > or page type locks (probably by having Asserts in code or maybe some
> > other way).
> > b. Change lock.c so that group locking is not considered for these two
> > lock types. For ex. in LockCheckConflicts, along with the check (if
> > (proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
> > we also check lock->tag and call it a conflict for these two locks.
> > c. The deadlock detector can ignore checking these two types of locks
> > because point (a) ensures that those won't lead to deadlock.  One idea
> > could be that FindLockCycleRecurseMember just ignores these two types
> > of locks by checking the lock tag.
>
> Thanks Amit for summary.
>
> Based on above 3 points, here attaching 2 patches for review.
>
> 1. v01_0001-Conflict-EXTENTION-lock-in-group-member.patch (Patch by Dilip Kumar)
> Basically this patch is for point b and c.
>
> 2. v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patch
> (Patch by me)
> This patch is for point a.
>
> After applying both the patches, make check-world is passing.
>
> We are testing both the patches and will post results.
>

I think we need to do detailed code review in the places where we are
taking Relation Extension Lock and see whether we are acquiring
another heavy-weight lock after that. It seems to me that in
brin_getinsertbuffer, after acquiring Relation Extension Lock, we
might again try to acquire the same lock.  See
brin_initialize_empty_new_buffer which is called after acquiring
Relation Extension Lock, in that function, we call
RecordPageWithFreeSpace and that can again try to acquire the same
lock if it needs to perform fsm_extend.  I think there will be similar
instances in the code.  I think it is fine if we again try to acquire
it, but the current assertion in your patch needs to be adjusted for
that.
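[Editor's sketch] The adjustment being asked for here can be modeled with a toy check (hypothetical names and bookkeeping, not the patch itself): re-acquiring the very same relation extension lock, as in the brin_getinsertbuffer -> RecordPageWithFreeSpace -> fsm_extend path, is tolerated, while any other heavy-weight lock acquisition is rejected.

```c
#include <stdbool.h>

typedef unsigned int Oid;

/*
 * Hypothetical bookkeeping: Oid of the relation whose extension lock we
 * hold, or 0 (invalid) if none.  In real code this would live in lock.c.
 */
static Oid	rel_ext_lock_held_for = 0;

/*
 * Returns true when the requested acquisition is acceptable: either no
 * relation extension lock is held, or we are re-acquiring the very same
 * relation's extension lock.
 */
static bool
AcquisitionAllowed(Oid relid, bool is_extension_lock)
{
	if (rel_ext_lock_held_for == 0)
		return true;			/* no extension lock held: anything goes */
	return is_extension_lock && relid == rel_ext_lock_held_for;
}
```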

Few other minor comments on
v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any:
1. Ideally, this should be the first patch as we first need to ensure
that we don't take any heavy-weight locks after acquiring a relation
extension lock.

2. I think it is better to add an Assert after initial error checks
(after RecoveryInProgress().. check)

3.
+ Assert (locallock->tag.lock.locktag_type != LOCKTAG_RELATION_EXTEND ||
+ locallock->nLocks == 0);

I think it is not possible that we have an entry in
LockMethodLocalHash and its value is zero.  Do you see any such
possibility, if not, then we might want to remove it?

4. We already have a macro LOCALLOCK_LOCKMETHOD; can we write another
one for the tag type?  This will make the check look a bit cleaner, and
it will also be good if we need to extend it in the future for Page-type
locks.

5. I have also tried to think of another way to check if we already
hold lock type LOCKTAG_RELATION_EXTEND, but couldn't come up with a
cheaper way than this.  Basically, I think if we traverse the
MyProc->myProcLocks queue, we will get this information, but that
doesn't seem much cheaper than this.

6. Another thing that could be possible is to make this a test-and-elog
so that it can be hit in production scenarios, but I think the cost of
that will be high unless we have a very simple way to write this test
condition.
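[Editor's sketch] Point 4 could be satisfied by a macro in the style of the existing LOCALLOCK_LOCKMETHOD. The structs below are simplified illustrations of the real lock.h shapes, and LOCKTAG_RELATION_EXTEND is a placeholder value, not the actual constant.

```c
/* Simplified shapes of the real structs in lock.h (illustrative only). */
typedef struct LOCKTAG
{
	unsigned char locktag_type;	/* LockTagType, stored as a byte */
} LOCKTAG;

typedef struct LOCALLOCKTAG
{
	LOCKTAG		lock;
} LOCALLOCKTAG;

typedef struct LOCALLOCK
{
	LOCALLOCKTAG tag;
} LOCALLOCK;

#define LOCKTAG_RELATION_EXTEND 1	/* placeholder value */

/* Mirrors the style of the existing LOCALLOCK_LOCKMETHOD(llock) macro. */
#define LOCALLOCK_LOCKTAG(llock) ((llock).tag.lock.locktag_type)

/* The assert condition could then read:
 *   Assert(LOCALLOCK_LOCKTAG(*locallock) != LOCKTAG_RELATION_EXTEND);
 */
```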

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Thu, Mar 5, 2020 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 4, 2020 at 11:45 AM Mahendra Singh Thalor
> <mahi6run@gmail.com> wrote:
> >
> > On Mon, 24 Feb 2020 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
> > > >
> > > > What I'm advocating is that extension locks should continue to go
> > > > through lock.c. And yes, that requires some changes to group locking,
> > > > but I still don't see why they'd be complicated.
> > > >
> > >
> > > Fair position, as per initial analysis, I think if we do below three
> > > things, it should work out without changing to a new way of locking
> > > for relation extension or page type locks.
> > > a. As per the discussion above, ensure in code we will never try to
> > > acquire another heavy-weight lock after acquiring relation extension
> > > or page type locks (probably by having Asserts in code or maybe some
> > > other way).
> > > b. Change lock.c so that group locking is not considered for these two
> > > lock types. For ex. in LockCheckConflicts, along with the check (if
> > > (proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
> > > we also check lock->tag and call it a conflict for these two locks.
> > > c. The deadlock detector can ignore checking these two types of locks
> > > because point (a) ensures that those won't lead to deadlock.  One idea
> > > could be that FindLockCycleRecurseMember just ignores these two types
> > > of locks by checking the lock tag.
> >
> > Thanks Amit for summary.
> >
> > Based on above 3 points, here attaching 2 patches for review.
> >
> > 1. v01_0001-Conflict-EXTENTION-lock-in-group-member.patch (Patch by Dilip Kumar)
> > Basically this patch is for point b and c.
> >
> > 2. v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patch
> > (Patch by me)
> > This patch is for point a.
> >
> > After applying both the patches, make check-world is passing.
> >
> > We are testing both the patches and will post results.
> >
>
> I think we need to do detailed code review in the places where we are
> taking Relation Extension Lock and see whether we are acquiring
> another heavy-weight lock after that. It seems to me that in
> brin_getinsertbuffer, after acquiring Relation Extension Lock, we
> might again try to acquire the same lock.  See
> brin_initialize_empty_new_buffer which is called after acquiring
> Relation Extension Lock, in that function, we call
> RecordPageWithFreeSpace and that can again try to acquire the same
> lock if it needs to perform fsm_extend.  I think there will be similar
> instances in the code.  I think it is fine if we again try to acquire
> it, but the current assertion in your patch needs to be adjusted for
> that.
>
> Few other minor comments on
> v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any:
> 1. Ideally, this should be the first patch as we first need to ensure
> that we don't take any heavy-weight locks after acquiring a relation
> extension lock.
>
> 2. I think it is better to add an Assert after initial error checks
> (after RecoveryInProgress().. check)
>
> 3.
> + Assert (locallock->tag.lock.locktag_type != LOCKTAG_RELATION_EXTEND ||
> + locallock->nLocks == 0);
>
> I think it is not possible that we have an entry in
> LockMethodLocalHash and its value is zero.  Do you see any such
> possibility, if not, then we might want to remove it?
>
> 4. We already have a macro for LOCALLOCK_LOCKMETHOD, can we write
> another one tag type?  This will make the check look a bit cleaner and
> probably if we need to extend it in future for Page type locks, then
> also it will be good.
>
> 5. I have also tried to think of another way to check if we already
> hold lock type LOCKTAG_RELATION_EXTEND, but couldn't come up with a
> cheaper way than this.  Basically, I think if we traverse the
> MyProc->myProcLocks queue, we will get this information, but that
> doesn't seem much cheaper than this.

I think we can maintain a flag (rel_extlock_held).  And, we can set
that true in LockRelationForExtension,
ConditionalLockRelationForExtension functions and we can reset it in
UnlockRelationForExtension or in the error path e.g. LockReleaseAll.
I think this way we will be able to elog, and it will be much
cheaper.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Mahendra Singh Thalor
Date:
On Thu, 5 Mar 2020 at 13:54, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Mar 5, 2020 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Mar 4, 2020 at 11:45 AM Mahendra Singh Thalor
> > <mahi6run@gmail.com> wrote:
> > >
> > > On Mon, 24 Feb 2020 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
> > > > >
> > > > > What I'm advocating is that extension locks should continue to go
> > > > > through lock.c. And yes, that requires some changes to group locking,
> > > > > but I still don't see why they'd be complicated.
> > > > >
> > > >
> > > > Fair position, as per initial analysis, I think if we do below three
> > > > things, it should work out without changing to a new way of locking
> > > > for relation extension or page type locks.
> > > > a. As per the discussion above, ensure in code we will never try to
> > > > acquire another heavy-weight lock after acquiring relation extension
> > > > or page type locks (probably by having Asserts in code or maybe some
> > > > other way).
> > > > b. Change lock.c so that group locking is not considered for these two
> > > > lock types. For ex. in LockCheckConflicts, along with the check (if
> > > > (proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
> > > > we also check lock->tag and call it a conflict for these two locks.
> > > > c. The deadlock detector can ignore checking these two types of locks
> > > > because point (a) ensures that those won't lead to deadlock.  One idea
> > > > could be that FindLockCycleRecurseMember just ignores these two types
> > > > of locks by checking the lock tag.
> > >
> > > Thanks Amit for summary.
> > >
> > > Based on above 3 points, here attaching 2 patches for review.
> > >
> > > 1. v01_0001-Conflict-EXTENTION-lock-in-group-member.patch (Patch by Dilip Kumar)
> > > Basically this patch is for point b and c.
> > >
> > > 2. v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patch
> > > (Patch by me)
> > > This patch is for point a.
> > >
> > > After applying both the patches, make check-world is passing.
> > >
> > > We are testing both the patches and will post results.
> > >
> >

Thanks Amit and Dilip for reviewing the patches.

> > I think we need to do detailed code review in the places where we are
> > taking Relation Extension Lock and see whether we are acquiring
> > another heavy-weight lock after that. It seems to me that in
> > brin_getinsertbuffer, after acquiring Relation Extension Lock, we
> > might again try to acquire the same lock.  See
> > brin_initialize_empty_new_buffer which is called after acquiring
> > Relation Extension Lock, in that function, we call
> > RecordPageWithFreeSpace and that can again try to acquire the same
> > lock if it needs to perform fsm_extend.  I think there will be similar
> > instances in the code.  I think it is fine if we again try to acquire
> > it, but the current assertion in your patch needs to be adjusted for
> > that.

I agree with you.  Dilip is doing a code review and will post the
results.  As you mentioned, while holding a Relation Extension Lock we
might again try to acquire the same Relation Extension Lock, so to
handle this in the assert I made some changes and am attaching the
updated patch for review. (I will test this scenario.)

> >
> > Few other minor comments on
> > v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any:
> > 1. Ideally, this should be the first patch as we first need to ensure
> > that we don't take any heavy-weight locks after acquiring a relation
> > extension lock.

Fixed.

> > 2. I think it is better to add an Assert after initial error checks
> > (after RecoveryInProgress().. check)

I am not getting your point. Can you explain which type of
assert you are suggesting?

> > 3.
> > + Assert (locallock->tag.lock.locktag_type != LOCKTAG_RELATION_EXTEND ||
> > + locallock->nLocks == 0);
> >
> > I think it is not possible that we have an entry in
> > LockMethodLocalHash and its value is zero.  Do you see any such
> > possibility, if not, then we might want to remove it?

Yes, this condition is not needed. Fixed.

> >
> > 4. We already have a macro for LOCALLOCK_LOCKMETHOD, can we write
> > another one tag type?  This will make the check look a bit cleaner and
> > probably if we need to extend it in future for Page type locks, then
> > also it will be good.

Good point. I added macros in this version.

Here, I am attaching the new patch set for review.

Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Mahendra Singh Thalor
Date:
On Wed, 4 Mar 2020 at 12:03, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Mar 4, 2020 at 11:45 AM Mahendra Singh Thalor
> <mahi6run@gmail.com> wrote:
> >
> > On Mon, 24 Feb 2020 at 15:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
> > > > > I think till we know the real need for changing group locking, going
> > > > > in the direction of what Tom suggested to use an array of LWLocks [1]
> > > > > to address the problems in hand is a good idea.
> > > >
> > > > -many
> > > >
> > > > I think that building yet another locking subsystem is the entirely
> > > > wrong idea - especially when there's imo no convincing architectural
> > > > reasons to do so.
> > > >
> > >
> > > Hmm, AFAIU, it will be done by having an array of LWLocks which we do
> > > at other places as well (like BufferIO locks).  I am not sure if we
> > > can call it as new locking subsystem, but if we decide to continue
> > > using lock.c and change group locking then I think we can do that as
> > > well, see my comments below regarding that.
> > >
> > > >
> > > > > It is not very clear to me that are we thinking to give up on Tom's
> > > > > idea [1] and change group locking even though it is not clear or at
> > > > > least nobody has proposed an idea/patch which requires that?  Or are
> > > > > we thinking that we can do what Tom suggested for relation extension
> > > > > lock and also plan to change group locking for future parallel
> > > > > operations that might require it?
> > > >
> > > > What I'm advocating is that extension locks should continue to go
> > > > through lock.c. And yes, that requires some changes to group locking,
> > > > but I still don't see why they'd be complicated.
> > > >
> > >
> > > Fair position, as per initial analysis, I think if we do below three
> > > things, it should work out without changing to a new way of locking
> > > for relation extension or page type locks.
> > > a. As per the discussion above, ensure in code we will never try to
> > > acquire another heavy-weight lock after acquiring relation extension
> > > or page type locks (probably by having Asserts in code or maybe some
> > > other way).
> > > b. Change lock.c so that group locking is not considered for these two
> > > lock types. For ex. in LockCheckConflicts, along with the check (if
> > > (proclock->groupLeader == MyProc && MyProc->lockGroupLeader == NULL)),
> > > we also check lock->tag and call it a conflict for these two locks.
> > > c. The deadlock detector can ignore checking these two types of locks
> > > because point (a) ensures that those won't lead to deadlock.  One idea
> > > could be that FindLockCycleRecurseMember just ignores these two types
> > > of locks by checking the lock tag.
> >
> > Thanks Amit for summary.
> >
> > Based on above 3 points, here attaching 2 patches for review.
> >
> > 1. v01_0001-Conflict-EXTENTION-lock-in-group-member.patch (Patch by Dilip Kumar)
> > Basically this patch is for point b and c.
> >
> > 2. v01_0002-Added-assert-to-verify-that-we-never-try-to-take-any.patch
> > (Patch by me)
> > This patch is for point a.
> >
> > After applying both the patches, make check-world is passing.
> >
> > We are testing both the patches and will post results.
> >
> > Thoughts?
>
> +static void AssertAnyExtentionLockHeadByMe(void);
>
> +/*
> + * AssertAnyExtentionLockHeadByMe -- test whether any EXTENSION lock held by
> + * this backend.  If any EXTENSION lock is hold by this backend, then assert
> + * will fail.  To use this function, assert should be enabled.
> + */
> +void AssertAnyExtentionLockHeadByMe()
> +{
>
> Some minor observations on 0002.
> 1. static is missing in a function definition.
> 2. Function name should start in new line after function return type
> in function definition, as per pg guideline.
> +void AssertAnyExtentionLockHeadByMe()
> ->
> void
> AssertAnyExtentionLockHeadByMe()

Thanks Dilip for review.

I have fixed the above 2 points in the v2 patch set.

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com



On Thu, Mar 5, 2020 at 2:18 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
> Here, attaching new patch set for review.

I was kind of assuming that the way this would work is that it would
set a flag or increment a counter or something when we acquire a
relation extension lock, and then reverse the process when we release
it. Then the Assert could just check the flag. Walking the whole
LOCALLOCK table is expensive.

Also, spelling counts. This patch features "extention" multiple times,
plus also "hask," "beloging," "belog," and "whle", which is an awful
lot of typos for a 70-line patch. If you are using macOS, try opening
the patch in TextEdit. If you are inventing a new function name, spell
the words you include the same way they are spelled elsewhere.

Even aside from the typo, AssertAnyExtentionLockHeadByMe() is not a
very good function name. It sounds like it's asserting that we hold an
extension lock, rather than that we don't, and also, that's not
exactly what it checks anyway, because there's this special case for
when we're acquiring a relation extension lock we already hold.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Thu, Mar 5, 2020 at 1:54 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Mar 5, 2020 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > 5. I have also tried to think of another way to check if we already
> > hold lock type LOCKTAG_RELATION_EXTEND, but couldn't come up with a
> > cheaper way than this.  Basically, I think if we traverse the
> > MyProc->myProcLocks queue, we will get this information, but that
> > doesn't seem much cheaper than this.
>
> I think we can maintain a flag (rel_extlock_held).  And, we can set
> that true in LockRelationForExtension,
> ConditionalLockRelationForExtension functions and we can reset it in
> UnlockRelationForExtension or in the error path e.g. LockReleaseAll.
>

I think if we reset it in LockReleaseAll during the error path, then
we need to find a way to reset it during LockReleaseCurrentOwner as
that is called during Subtransaction Abort which can be tricky as we
don't know if it belongs to the current owner.  How about resetting in
Abort(Sub)Transaction and CommitTransaction after we release locks via
ResourceOwnerRelease.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Fri, Mar 6, 2020 at 2:19 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Mar 5, 2020 at 2:18 PM Mahendra Singh Thalor <mahi6run@gmail.com> wrote:
> > Here, attaching new patch set for review.
>
> I was kind of assuming that the way this would work is that it would
> set a flag or increment a counter or something when we acquire a
> relation extension lock, and then reverse the process when we release
> it. Then the Assert could just check the flag. Walking the whole
> LOCALLOCK table is expensive.
>

I think we can keep such a flag in TopTransactionState.   We free such
locks after the work is done (except during error where we free them
at transaction abort) rather than at transaction commit, so one might
say it is better not to associate with transaction state, but not sure
if there is other better place.  Do you have any suggestions?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Thu, Mar 5, 2020 at 11:45 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think we can keep such a flag in TopTransactionState.   We free such
> locks after the work is done (except during error where we free them
> at transaction abort) rather than at transaction commit, so one might
> say it is better not to associate with transaction state, but not sure
> if there is other better place.  Do you have any suggestions?

I assumed it would be a global variable in lock.c.  lock.c has got to
know when any lock is acquired or released, so I don't know why we
need to involve xact.c in the bookkeeping.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Fri, Mar 6, 2020 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Mar 5, 2020 at 1:54 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Mar 5, 2020 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > 5. I have also tried to think of another way to check if we already
> > > hold lock type LOCKTAG_RELATION_EXTEND, but couldn't come up with a
> > > cheaper way than this.  Basically, I think if we traverse the
> > > MyProc->myProcLocks queue, we will get this information, but that
> > > doesn't seem much cheaper than this.
> >
> > I think we can maintain a flag (rel_extlock_held).  And, we can set
> > that true in LockRelationForExtension,
> > ConditionalLockRelationForExtension functions and we can reset it in
> > UnlockRelationForExtension or in the error path e.g. LockReleaseAll.
> >
>
> I think if we reset it in LockReleaseAll during the error path, then
> we need to find a way to reset it during LockReleaseCurrentOwner as
> that is called during Subtransaction Abort which can be tricky as we
> don't know if it belongs to the current owner.  How about resetting in
> Abort(Sub)Transaction and CommitTransaction after we release locks via
> ResourceOwnerRelease.

I think instead of a flag we need to keep a counter, because we can
acquire the same relation extension lock multiple times.  So
basically, every time we acquire the lock we can increment the counter,
and while releasing we can decrement it.  During an error path, I
think it is fine to set it to 0 in CommitTransaction/AbortTransaction.
But I am not sure that we can set it to 0 or decrement it in
AbortSubTransaction, because we are not sure whether we have acquired
the lock under this subtransaction or not.

Having said that, I think there should not be any case where we start
a subtransaction while holding the relation extension
lock.
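[Editor's sketch] The counter idea can be sketched as follows. The function names are hypothetical; in PostgreSQL this state would be a static in lock.c (per Robert's suggestion), bumped from LockRelationForExtension/ConditionalLockRelationForExtension and decremented from UnlockRelationForExtension, with the error path zeroing it outright.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock.c-style bookkeeping for held relation extension locks. */
static int	RelationExtensionLockHeldCount = 0;

static void
TrackExtensionLockAcquired(void)	/* from LockRelationForExtension */
{
	RelationExtensionLockHeldCount++;
}

static void
TrackExtensionLockReleased(void)	/* from UnlockRelationForExtension */
{
	assert(RelationExtensionLockHeldCount > 0);
	RelationExtensionLockHeldCount--;
}

static void
ResetExtensionLockTracking(void)	/* error path: (sub)transaction abort */
{
	/* Safe to zero: no subtransaction starts while the lock is held. */
	RelationExtensionLockHeldCount = 0;
}

static bool
IsRelationExtensionLockHeld(void)
{
	return RelationExtensionLockHeldCount > 0;
}
```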

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Sat, Mar 7, 2020 at 9:57 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Mar 6, 2020 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Mar 5, 2020 at 1:54 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Mar 5, 2020 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > 5. I have also tried to think of another way to check if we already
> > > hold lock type LOCKTAG_RELATION_EXTEND, but couldn't come up with a
> > > cheaper way than this.  Basically, I think if we traverse the
> > > MyProc->myProcLocks queue, we will get this information, but that
> > > doesn't seem much cheaper than this.
> >
> > I think we can maintain a flag (rel_extlock_held).  And, we can set
> > that true in LockRelationForExtension,
> > ConditionalLockRelationForExtension functions and we can reset it in
> > UnlockRelationForExtension or in the error path e.g. LockReleaseAll.
> >
>
> I think if we reset it in LockReleaseAll during the error path, then
> we need to find a way to reset it during LockReleaseCurrentOwner as
> that is called during Subtransaction Abort which can be tricky as we
> don't know if it belongs to the current owner.  How about resetting in
> Abort(Sub)Transaction and CommitTransaction after we release locks via
> ResourceOwnerRelease.

I think instead of the flag we need to keep a counter, because we can
acquire the same relation extension lock multiple times.  So,
basically, every time we acquire the lock we increment the counter,
and while releasing we decrement it.  On an error path, I
think it is fine to set it to 0 in CommitTransaction/AbortTransaction.
But I am not sure that we can set it to 0 or decrement it in
AbortSubTransaction, because we are not sure whether we acquired
the lock under this subtransaction or not.

Having said that, I think there should not be any case where we
start a sub-transaction while holding the relation extension lock.

Right, this is exactly the point.  I think we can mention this in comments to make it clear why setting it to zero is fine during subtransaction abort. 

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Mar 7, 2020 at 11:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Mar 7, 2020 at 9:57 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Fri, Mar 6, 2020 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Mar 5, 2020 at 1:54 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Mar 5, 2020 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > 5. I have also tried to think of another way to check if we already
> > > hold lock type LOCKTAG_RELATION_EXTEND, but couldn't come up with a
> > > cheaper way than this.  Basically, I think if we traverse the
> > > MyProc->myProcLocks queue, we will get this information, but that
> > > doesn't seem much cheaper than this.
> >
> > I think we can maintain a flag (rel_extlock_held).  And, we can set
> > that true in LockRelationForExtension,
> > ConditionalLockRelationForExtension functions and we can reset it in
> > UnlockRelationForExtension or in the error path e.g. LockReleaseAll.
> >
>
> I think if we reset it in LockReleaseAll during the error path, then
> we need to find a way to reset it during LockReleaseCurrentOwner as
> that is called during Subtransaction Abort which can be tricky as we
> don't know if it belongs to the current owner.  How about resetting in
> Abort(Sub)Transaction and CommitTransaction after we release locks via
> ResourceOwnerRelease.

I think instead of the flag we need to keep a counter, because we can
acquire the same relation extension lock multiple times.  So,
basically, every time we acquire the lock we increment the counter,
and while releasing we decrement it.  On an error path, I
think it is fine to set it to 0 in CommitTransaction/AbortTransaction.
But I am not sure that we can set it to 0 or decrement it in
AbortSubTransaction, because we are not sure whether we acquired
the lock under this subtransaction or not.

Having said that, I think there should not be any case where we
start a sub-transaction while holding the relation extension lock.

Right, this is exactly the point.  I think we can mention this in comments to make it clear why setting it to zero is fine during subtransaction abort. 

Is there anything wrong with having an Assert during subtransaction start to indicate that we don't have a relation extension lock? 
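Such an Assert might look roughly like the following; this is only a sketch of the invariant (function names are illustrative, not the real xact.c code, which would consult whatever counter or flag the patch maintains):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical backend-local counter of relation extension lock holds. */
int RelExtLockHoldCount = 0;

bool
SketchIsRelExtLockHeld(void)
{
    return RelExtLockHoldCount > 0;
}

void
SketchStartSubTransaction(void)
{
    /*
     * We never begin a subtransaction while holding a relation extension
     * lock, so AbortSubTransaction may safely reset the counter to zero.
     */
    assert(!SketchIsRelExtLockHeld());
    /* ... remaining subtransaction setup ... */
}
```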

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Mar 7, 2020 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Mar 7, 2020 at 11:14 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Sat, Mar 7, 2020 at 9:57 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>>
>>> On Fri, Mar 6, 2020 at 9:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> >
>>> > On Thu, Mar 5, 2020 at 1:54 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>> > >
>>> > > On Thu, Mar 5, 2020 at 12:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> > > >
>>> > > >
>>> > > > 5. I have also tried to think of another way to check if we already
>>> > > > hold lock type LOCKTAG_RELATION_EXTEND, but couldn't come up with a
>>> > > > cheaper way than this.  Basically, I think if we traverse the
>>> > > > MyProc->myProcLocks queue, we will get this information, but that
>>> > > > doesn't seem much cheaper than this.
>>> > >
>>> > > I think we can maintain a flag (rel_extlock_held).  And, we can set
>>> > > that true in LockRelationForExtension,
>>> > > ConditionalLockRelationForExtension functions and we can reset it in
>>> > > UnlockRelationForExtension or in the error path e.g. LockReleaseAll.
>>> > >
>>> >
>>> > I think if we reset it in LockReleaseAll during the error path, then
>>> > we need to find a way to reset it during LockReleaseCurrentOwner as
>>> > that is called during Subtransaction Abort which can be tricky as we
>>> > don't know if it belongs to the current owner.  How about resetting in
>>> > Abort(Sub)Transaction and CommitTransaction after we release locks via
>>> > ResourceOwnerRelease.
>>>
>>> I think instead of the flag we need to keep the counter because we can
>>> acquire the same relation extension lock multiple times.  So
>>> basically, every time we acquire the lock we can increment the counter
>>> and while releasing we can decrement it.   During an error path, I
>>> think it is fine to set it to 0 in CommitTransaction/AbortTransaction.
>>> But, I am not sure that we can set to 0 or decrement it in
>>> AbortSubTransaction because we are not sure whether we have acquired
>>> the lock under this subtransaction or not.
>>>
>>> Having said that,  I think there should not be any case that we are
>>> starting the sub-transaction while holding the relation extension
>>> lock.
>>
>>
>> Right, this is exactly the point.  I think we can mention this in comments to make it clear why setting it to zero is fine during subtransaction abort.
>
>
> Is there anything wrong with having an Assert during subtransaction start to indicate that we don't have a relation extension lock?

Yes, I was planning to do that.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Dilip Kumar <dilipbalaut@gmail.com> writes:
> I think instead of the flag we need to keep the counter because we can
> acquire the same relation extension lock multiple times.

Uh ... what?  How would that not be broken usage on its face?

I continue to think that we'd be better off getting all of this
out of the heavyweight lock manager.  There is no reason why we
should need deadlock detection, or multiple holds of the same
lock, or pretty much anything that LWLocks don't give you.

            regards, tom lane



On Sat, Mar 7, 2020 at 8:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Dilip Kumar <dilipbalaut@gmail.com> writes:
> > I think instead of the flag we need to keep the counter because we can
> > acquire the same relation extension lock multiple times.
>
> Uh ... what?  How would that not be broken usage on its face?

Basically, if we can ensure that while holding the relation extension
lock we will not wait for any other lock, then we can ignore it in the
deadlock detection path, so that we don't detect a false deadlock due
to the group locking mechanism.  So if we are already holding the
relation extension lock and trying to acquire the same lock in the
same mode, it can never wait, so this is safe.
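The exemption being described could be sketched like this (simplified type and function names for illustration only; the real deadlock checker would key on LOCKTAG_RELATION_EXTEND):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for a subset of PostgreSQL's LockTagType. */
typedef enum
{
    SKETCH_LOCKTAG_RELATION,
    SKETCH_LOCKTAG_RELATION_EXTEND
} SketchLockTagType;

/*
 * Can this wait edge be ignored by deadlock detection?  A backend holding
 * the relation extension lock never waits for any other heavyweight lock,
 * so such edges cannot participate in a real deadlock cycle.
 */
bool
SketchIgnoreWaitEdge(SketchLockTagType tag)
{
    return tag == SKETCH_LOCKTAG_RELATION_EXTEND;
}
```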

> I continue to think that we'd be better off getting all of this
> out of the heavyweight lock manager.  There is no reason why we
> should need deadlock detection, or multiple holds of the same
> lock, or pretty much anything that LWLocks don't give you.

Right, we never need deadlock detection for this lock.  But I think
there are quite a few cases where we have multiple holds at the same
time.  E.g., during RelationAddExtraBlocks, while holding the relation
extension lock we try to update the block in the FSM, and the FSM
might need to add an extra FSM block, which will again try to acquire
the same lock.

But I think the main reason for not converting it to an LWLock is
that Andres has a concern about inventing a new lock mechanism, as
discussed upthread[1]

[1] https://www.postgresql.org/message-id/20200220023612.c44ggploywxtlvmx%40alap3.anarazel.de

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Mon, 24 Feb 2020 at 19:08, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
> > > I think till we know the real need for changing group locking, going
> > > in the direction of what Tom suggested to use an array of LWLocks [1]
> > > to address the problems in hand is a good idea.
> >
> > -many
> >
> > I think that building yet another locking subsystem is the entirely
> > wrong idea - especially when there's imo no convincing architectural
> > reasons to do so.
> >
>
> Hmm, AFAIU, it will be done by having an array of LWLocks which we do
> at other places as well (like BufferIO locks).  I am not sure if we
> can call it as new locking subsystem, but if we decide to continue
> using lock.c and change group locking then I think we can do that as
> well, see my comments below regarding that.
>
> >
> > > It is not very clear to me that are we thinking to give up on Tom's
> > > idea [1] and change group locking even though it is not clear or at
> > > least nobody has proposed an idea/patch which requires that?  Or are
> > > we thinking that we can do what Tom suggested for relation extension
> > > lock and also plan to change group locking for future parallel
> > > operations that might require it?
> >
> > What I'm advocating is that extension locks should continue to go
> > through lock.c. And yes, that requires some changes to group locking,
> > but I still don't see why they'd be complicated.
> >
>
> Fair position, as per initial analysis, I think if we do below three
> things, it should work out without changing to a new way of locking
> for relation extension or page type locks.
> a. As per the discussion above, ensure in code we will never try to
> acquire another heavy-weight lock after acquiring relation extension
> or page type locks (probably by having Asserts in code or maybe some
> other way).

The current patch
(v02_0001-Added-assert-to-verify-that-we-never-try-to-take-any.patch)
doesn't check that we acquire a heavy-weight lock after a page type
lock, is that right? There is a path doing that: ginInsertCleanup()
holds a page lock and inserts the pending list items, which might take
a relation extension lock.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Sat, Mar 7, 2020 at 9:19 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Sat, Mar 7, 2020 at 8:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Dilip Kumar <dilipbalaut@gmail.com> writes:
> > I think instead of the flag we need to keep the counter because we can
> > acquire the same relation extension lock multiple times.
>
> Uh ... what?  How would that not be broken usage on its face?

Basically, if we can ensure that while holding the relation extension
lock we will not wait for any other lock, then we can ignore it in the
deadlock detection path, so that we don't detect a false deadlock due
to the group locking mechanism.  So if we are already holding the
relation extension lock and trying to acquire the same lock in the
same mode, it can never wait, so this is safe.

> I continue to think that we'd be better off getting all of this
> out of the heavyweight lock manager.  There is no reason why we
> should need deadlock detection, or multiple holds of the same
> lock, or pretty much anything that LWLocks don't give you.

Right, we never need deadlock detection for this lock.  But I think
there are quite a few cases where we have multiple holds at the same
time.  E.g., during RelationAddExtraBlocks, while holding the relation
extension lock we try to update the block in the FSM, and the FSM
might need to add an extra FSM block, which will again try to acquire
the same lock.

But I think the main reason for not converting it to an LWLock is
that Andres has a concern about inventing a new lock mechanism, as
discussed upthread[1]


Right, that is one point and another is that if we go via the route of converting it to LWLocks, then we also need to think of some solution for page locks that are used in ginInsertCleanup.  However, if we go with the approach being pursued [1] then the page locks will be handled in a similar way as relation extension locks.


--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Mar 8, 2020 at 7:58 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Mon, 24 Feb 2020 at 19:08, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
> > > I think till we know the real need for changing group locking, going
> > > in the direction of what Tom suggested to use an array of LWLocks [1]
> > > to address the problems in hand is a good idea.
> >
> > -many
> >
> > I think that building yet another locking subsystem is the entirely
> > wrong idea - especially when there's imo no convincing architectural
> > reasons to do so.
> >
>
> Hmm, AFAIU, it will be done by having an array of LWLocks which we do
> at other places as well (like BufferIO locks).  I am not sure if we
> can call it as new locking subsystem, but if we decide to continue
> using lock.c and change group locking then I think we can do that as
> well, see my comments below regarding that.
>
> >
> > > It is not very clear to me that are we thinking to give up on Tom's
> > > idea [1] and change group locking even though it is not clear or at
> > > least nobody has proposed an idea/patch which requires that?  Or are
> > > we thinking that we can do what Tom suggested for relation extension
> > > lock and also plan to change group locking for future parallel
> > > operations that might require it?
> >
> > What I'm advocating is that extension locks should continue to go
> > through lock.c. And yes, that requires some changes to group locking,
> > but I still don't see why they'd be complicated.
> >
>
> Fair position, as per initial analysis, I think if we do below three
> things, it should work out without changing to a new way of locking
> for relation extension or page type locks.
> a. As per the discussion above, ensure in code we will never try to
> acquire another heavy-weight lock after acquiring relation extension
> or page type locks (probably by having Asserts in code or maybe some
> other way).

The current patch
(v02_0001-Added-assert-to-verify-that-we-never-try-to-take-any.patch)
doesn't check that we acquire a heavy-weight lock after a page type
lock, is that right?

No, it should do that.
 
There is a path doing that: ginInsertCleanup() holds
a page lock and inserts the pending list items, which might take a
relation extension lock.

Right, I could also see that, but do you see any problem with that?  I agree that Assert should cover this case, but I don't see any fundamental problem with that.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Mon, 9 Mar 2020 at 14:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Mar 8, 2020 at 7:58 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Mon, 24 Feb 2020 at 19:08, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> > On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
>> > >
>> > > Hi,
>> > >
>> > > On 2020-02-19 11:12:18 +0530, Amit Kapila wrote:
>> > > > I think till we know the real need for changing group locking, going
>> > > > in the direction of what Tom suggested to use an array of LWLocks [1]
>> > > > to address the problems in hand is a good idea.
>> > >
>> > > -many
>> > >
>> > > I think that building yet another locking subsystem is the entirely
>> > > wrong idea - especially when there's imo no convincing architectural
>> > > reasons to do so.
>> > >
>> >
>> > Hmm, AFAIU, it will be done by having an array of LWLocks which we do
>> > at other places as well (like BufferIO locks).  I am not sure if we
>> > can call it as new locking subsystem, but if we decide to continue
>> > using lock.c and change group locking then I think we can do that as
>> > well, see my comments below regarding that.
>> >
>> > >
>> > > > It is not very clear to me that are we thinking to give up on Tom's
>> > > > idea [1] and change group locking even though it is not clear or at
>> > > > least nobody has proposed an idea/patch which requires that?  Or are
>> > > > we thinking that we can do what Tom suggested for relation extension
>> > > > lock and also plan to change group locking for future parallel
>> > > > operations that might require it?
>> > >
>> > > What I'm advocating is that extension locks should continue to go
>> > > through lock.c. And yes, that requires some changes to group locking,
>> > > but I still don't see why they'd be complicated.
>> > >
>> >
>> > Fair position, as per initial analysis, I think if we do below three
>> > things, it should work out without changing to a new way of locking
>> > for relation extension or page type locks.
>> > a. As per the discussion above, ensure in code we will never try to
>> > acquire another heavy-weight lock after acquiring relation extension
>> > or page type locks (probably by having Asserts in code or maybe some
>> > other way).
>>
>> The current patch
>> (v02_0001-Added-assert-to-verify-that-we-never-try-to-take-any.patch)
>> doesn't check that acquiring a heavy-weight lock after page type lock,
>> is that right?
>
>
> No, it should do that.
>
>>
>> There is the path doing that: ginInsertCleanup() holds
>> a page lock and insert the pending list items, which might hold a
>> relation extension lock.
>
>
> Right, I could also see that, but do you see any problem with that?  I agree that Assert should cover this case, but I don't see any fundamental problem with that.

I think that could be a problem if we change the group locking so that
it doesn't consider page lock type.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Mon, Mar 9, 2020 at 11:38 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Mon, 9 Mar 2020 at 14:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Mar 8, 2020 at 7:58 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>> >
>> > Fair position, as per initial analysis, I think if we do below three
>> > things, it should work out without changing to a new way of locking
>> > for relation extension or page type locks.
>> > a. As per the discussion above, ensure in code we will never try to
>> > acquire another heavy-weight lock after acquiring relation extension
>> > or page type locks (probably by having Asserts in code or maybe some
>> > other way).
>>
>> The current patch
>> (v02_0001-Added-assert-to-verify-that-we-never-try-to-take-any.patch)
>> doesn't check that acquiring a heavy-weight lock after page type lock,
>> is that right?
>
>
> No, it should do that.
>
>>
>> There is the path doing that: ginInsertCleanup() holds
>> a page lock and insert the pending list items, which might hold a
>> relation extension lock.
>
>
> Right, I could also see that, but do you see any problem with that?  I agree that Assert should cover this case, but I don't see any fundamental problem with that.

I think that could be a problem if we change the group locking so that
it doesn't consider page lock type.

I might be missing something, but won't that be a problem only if there is a case where we acquire a page lock after acquiring a relation extension lock?  Can you please explain the scenario you have in mind which can create a problem?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Mon, 9 Mar 2020 at 15:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Mar 9, 2020 at 11:38 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Mon, 9 Mar 2020 at 14:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> > On Sun, Mar 8, 2020 at 7:58 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>> >> >
>> >> > Fair position, as per initial analysis, I think if we do below three
>> >> > things, it should work out without changing to a new way of locking
>> >> > for relation extension or page type locks.
>> >> > a. As per the discussion above, ensure in code we will never try to
>> >> > acquire another heavy-weight lock after acquiring relation extension
>> >> > or page type locks (probably by having Asserts in code or maybe some
>> >> > other way).
>> >>
>> >> The current patch
>> >> (v02_0001-Added-assert-to-verify-that-we-never-try-to-take-any.patch)
>> >> doesn't check that acquiring a heavy-weight lock after page type lock,
>> >> is that right?
>> >
>> >
>> > No, it should do that.
>> >
>> >>
>> >> There is the path doing that: ginInsertCleanup() holds
>> >> a page lock and insert the pending list items, which might hold a
>> >> relation extension lock.
>> >
>> >
>> > Right, I could also see that, but do you see any problem with that?  I agree that Assert should cover this case, but I don't see any fundamental problem with that.
>>
>> I think that could be a problem if we change the group locking so that
>> it doesn't consider page lock type.
>
>
> I might be missing something, but won't that be a problem only if there is a case where we acquire a page lock after acquiring a relation extension lock?

Yes, you're right.

Well, I meant that the reason why the Assert needs to cover the page
lock case is the same as the reason for the extension lock case. If
we change the group locking so that it doesn't consider the extension
lock, and change deadlock detection so that it doesn't make a wait
edge for it, we need to ensure that the same backend doesn't acquire a
heavy-weight lock after holding a relation extension lock. These are
already done in the current patch. Similarly, if we made a similar
change for the page lock in group locking and deadlock detection, we
would need to ensure the same things for the page lock. But ISTM it
doesn't necessarily need to support the page lock for now, because
currently we use it only for cleaning up the pending list of a gin
index.


Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Mon, Mar 9, 2020 at 2:09 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
On Mon, 9 Mar 2020 at 15:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Mar 9, 2020 at 11:38 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Mon, 9 Mar 2020 at 14:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> > On Sun, Mar 8, 2020 at 7:58 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>> >> >
>> >> > Fair position, as per initial analysis, I think if we do below three
>> >> > things, it should work out without changing to a new way of locking
>> >> > for relation extension or page type locks.
>> >> > a. As per the discussion above, ensure in code we will never try to
>> >> > acquire another heavy-weight lock after acquiring relation extension
>> >> > or page type locks (probably by having Asserts in code or maybe some
>> >> > other way).
>> >>
>> >> The current patch
>> >> (v02_0001-Added-assert-to-verify-that-we-never-try-to-take-any.patch)
>> >> doesn't check that acquiring a heavy-weight lock after page type lock,
>> >> is that right?
>> >
>> >
>> > No, it should do that.
>> >
>> >>
>> >> There is the path doing that: ginInsertCleanup() holds
>> >> a page lock and insert the pending list items, which might hold a
>> >> relation extension lock.
>> >
>> >
>> > Right, I could also see that, but do you see any problem with that?  I agree that Assert should cover this case, but I don't see any fundamental problem with that.
>>
>> I think that could be a problem if we change the group locking so that
>> it doesn't consider page lock type.
>
>
> I might be missing something, but won't that be a problem only if there is a case where we acquire a page lock after acquiring a relation extension lock?

Yes, you're right.

Well, I meant that the reason why the Assert needs to cover the page
lock case is the same as the reason for the extension lock case. If
we change the group locking so that it doesn't consider the extension
lock, and change deadlock detection so that it doesn't make a wait
edge for it, we need to ensure that the same backend doesn't acquire a
heavy-weight lock after holding a relation extension lock. These are
already done in the current patch. Similarly, if we made a similar
change for the page lock in group locking and deadlock detection, we
would need to ensure the same things for the page lock.

Agreed.
 
But ISTM it doesn't necessarily
need to support the page lock for now, because currently we use it
only for cleaning up the pending list of a gin index.


I agree, but I think it is better to have a patch for the same even if we want to review/commit that separately.  That will help us to look at how the complete solution looks.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Mar 9, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Mar 9, 2020 at 2:09 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>>
>> On Mon, 9 Mar 2020 at 15:50, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> > On Mon, Mar 9, 2020 at 11:38 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>> >>
>> >> On Mon, 9 Mar 2020 at 14:16, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >> >
>> >> > On Sun, Mar 8, 2020 at 7:58 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote:
>> >> >> >
>> >> >> > Fair position, as per initial analysis, I think if we do below three
>> >> >> > things, it should work out without changing to a new way of locking
>> >> >> > for relation extension or page type locks.
>> >> >> > a. As per the discussion above, ensure in code we will never try to
>> >> >> > acquire another heavy-weight lock after acquiring relation extension
>> >> >> > or page type locks (probably by having Asserts in code or maybe some
>> >> >> > other way).
>> >> >>
>> >> >> The current patch
>> >> >> (v02_0001-Added-assert-to-verify-that-we-never-try-to-take-any.patch)
>> >> >> doesn't check that acquiring a heavy-weight lock after page type lock,
>> >> >> is that right?
>> >> >
>> >> >
>> >> > No, it should do that.
>> >> >
>> >> >>
>> >> >> There is the path doing that: ginInsertCleanup() holds
>> >> >> a page lock and insert the pending list items, which might hold a
>> >> >> relation extension lock.
>> >> >
>> >> >
>> >> > Right, I could also see that, but do you see any problem with that?  I agree that Assert should cover this case, but I don't see any fundamental problem with that.
>> >>
>> >> I think that could be a problem if we change the group locking so that
>> >> it doesn't consider page lock type.
>> >
>> >
>> > I might be missing something, but won't that be a problem only when if there is a case where we acquire page lock after acquiring a relation extension lock?
>>
>> Yes, you're right.
>>
>> Well I meant that the reason why we need to make Assert should cover
>> page locks case is the same as the reason for extension lock type
>> case. If we change the group locking so that it doesn't consider
>> extension lock and change deadlock so that it doesn't make a wait edge
>> for it, we need to ensure that the same backend doesn't acquire
>> heavy-weight lock after holding relation extension lock. These are
>> already done in the current patch. Similarly, if we did the similar
>> change for page lock in the group locking and deadlock , we need to
>> ensure the same things for page lock.
>
>
> Agreed.
>
>>
>> But ISTM it doesn't necessarily
>> need to support page lock for now because currently we use it only for
>> cleanup pending list of gin index.
>>
>
> I agree, but I think it is better to have a patch for the same even if we want to review/commit that separately.  That will help us to look at how the complete solution looks.

Please find the updated patch (summary of the changes):
- Instead of searching the lock hash table for the assert, it maintains a counter.
- Also handled the case where we can acquire the relation extension
lock while already holding the relation extension lock on the same relation.
- Handled the error case.

In addition to that, I prepared a WIP patch for handling the PageLock.
First, I thought that we could use the same counter for the PageLock
and the RelationExtensionLock, because in the assert we just need to
check whether we are trying to acquire any other heavyweight lock
while holding any of these locks.  But the exceptional case where we
are allowed to acquire a relation extension lock while holding any of
these locks is a bit different.  If we are holding a relation
extension lock, then we are allowed to acquire the relation extension
lock on the same relation, but not on any other relation, otherwise it
can create a cycle.  But the same is not true with the PageLock:
while holding the PageLock, you can acquire the relation extension
lock on any relation, and that will be safe because the relation
extension lock guarantees that it will never create a cycle.
However, I agree that we don't have any such case where we want to
acquire a relation extension lock on a different relation while
holding the PageLock.
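The distinction above could be sketched as a predicate like the following; the names and state layout are purely illustrative, not the patch's actual code:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;          /* simplified stand-in for PostgreSQL's Oid */

/* Hypothetical backend-local state: relation whose extension lock we
 * currently hold (0 = none). */
Oid held_relext_relid = 0;

/*
 * Would acquiring the relation extension lock on 'relid' be safe now?
 * While already holding a relation extension lock, only the same relation
 * is allowed; any other relation could create a cycle.  A PageLock holder,
 * by contrast, may take the extension lock on any relation, since the
 * extension lock itself can never create a cycle.
 */
bool
SketchRelExtAcquireAllowed(Oid relid)
{
    if (held_relext_relid != 0)
        return relid == held_relext_relid;
    return true;
}
```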

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Mon, Feb 24, 2020 at 3:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
> > What I'm advocating is that extension locks should continue to go
> > through lock.c. And yes, that requires some changes to group locking,
> > but I still don't see why they'd be complicated.
> >
>
> Fair position, as per initial analysis, I think if we do below three
> things, it should work out without changing to a new way of locking
> for relation extension or page type locks.
> a. As per the discussion above, ensure in code we will never try to
> acquire another heavy-weight lock after acquiring relation extension
> or page type locks (probably by having Asserts in code or maybe some
> other way).

I have done an analysis of the relation extension lock (which can be
acquired via LockRelationForExtension or
ConditionalLockRelationForExtension) and found that we don't acquire
any other heavyweight lock after acquiring it. However, we do
sometimes try to acquire it again in the places where we update the
FSM after extension; see points (e) and (f) described below.  The
usage of this lock can be broadly divided into six categories, each of
which is explained as follows:

a. Where after taking the relation extension lock we call ReadBuffer
(or one of its variants) and then LockBuffer.  LockBuffer internally
calls LWLock acquire or release, neither of which acquires another
heavy-weight lock. It is quite obvious as well that while taking some
lightweight lock, there is no reason to acquire another heavy-weight
lock on any object.  The comments of ReadBufferExtended (which gets
called from the variants of ReadBuffer) say that if the blknum
requested is P_NEW, only one backend can call it at a time, which
indicates that we don't need to acquire any heavy-weight lock inside
this API.  Even otherwise, this API won't need a heavy-weight lock to
read an existing block into a shared buffer, as two different backends
are allowed to read the same block.  I have also gone through all the
functions called/used in this path to ensure that we don't use
heavy-weight locks inside it.

The usage by the APIs BloomNewBuffer, GinNewBuffer, gistNewBuffer,
_bt_getbuf, and SpGistNewBuffer falls in this category.  Another API
that falls under this category is revmap_physical_extend, which uses
ReadBuffer, LockBuffer, and ReleaseBuffer. The ReleaseBuffer API
unpins the buffer (i.e. decrements its reference count) and
disassociates it from the resource owner.  None of that requires a
heavy-weight lock.

b. After taking relation extension lock, we call
RelationGetNumberOfBlocks which primarily calls file-level functions
to determine the size of the file. This doesn't acquire any other
heavy-weight lock after relation extension lock.

The usage by APIs ginvacuumcleanup, gistvacuumscan, btvacuumscan, and
spgvacuumscan falls in this category.

c. There is a usage in the API brin_page_cleanup() where we just
acquire and release the relation extension lock to avoid
reinitializing the page. As there is no call in between the acquire
and release, there is no chance of acquiring another heavy-weight
lock while holding the relation extension lock.

d. In fsm_extend() and vm_extend(), after acquiring the relation
extension lock, we perform various file-level operations like
RelationOpenSmgr, smgrexists, smgrcreate, smgrnblocks, and smgrextend.
First, in theory, we don't have any heavy-weight lock other than the
relation extension lock which can cover such operations, and I have
then verified by going through these APIs that they don't acquire any
other heavy-weight lock.  These APIs also call PageSetChecksumInplace,
which computes a checksum of the page and sets it in the page header;
that is quite straightforward and doesn't acquire any heavy-weight
lock.

In vm_extend, we additionally call CacheInvalidateSmgr to send a
shared-inval message to force other backends to close any smgr
references they may have for the relation whose visibility map we are
extending; that has no reason to acquire any heavy-weight lock.  I
have checked that code path as well and didn't find any heavy-weight
lock call in it.

e. In brin_getinsertbuffer, we call ReadBuffer() and LockBuffer(),
the usage of which is the same as mentioned in (a).  In addition to
that, it calls brin_initialize_empty_new_buffer(), which further calls
RecordPageWithFreeSpace, which can again acquire the relation
extension lock for the same relation.  This usage is safe because the
heavy-weight lock manager grants a request immediately if we already
hold the same lock in the same mode.

f. In RelationGetBufferForTuple(), there are multiple APIs that get
called and like (e), it can try to reacquire the relation extension
lock in one of those APIs.  The main APIs it calls after acquiring
relation extension lock are described as follows:
   - GetPageWithFreeSpace: This tries to find a page in the given
relation with at least the specified amount of free space.  This
mainly checks the FSM pages and in one of the paths might call
fsm_extend which can again try to acquire the relation extension lock
on the same relation.
   - RelationAddExtraBlocks: This adds multiple pages to a relation if
there is contention around the relation extension lock.  It calls
RelationExtensionLockWaiterCount, which merely checks how many lockers
are waiting for the same lock, then ReadBufferBI, which as explained
above won't require heavy-weight locks, and FSM APIs, which can
acquire the relation extension lock on the same relation, but that is
safe as discussed previously.

Page locks can be acquired via LockPage and ConditionalLockPage.
They are acquired from one place in the code, during Gin index cleanup
(ginInsertCleanup). The basic idea is that it will scan the pending
list and move entries into the main index.  While moving entries to
the main page, it might need to add a new page, which will require us
to take a relation extension lock.  Now, unlike the relation extension
lock, after acquiring a page lock we do acquire another heavy-weight
lock (the relation extension lock), but as we never acquire them in
the reverse order, this is safe.

So, as per this analysis, we can add Asserts for relation extension
and page locks which will indicate that they won't participate in
deadlocks.  It would be good if someone else can also do independent
analysis and verify my findings.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Fri, Mar 6, 2020 at 11:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> I think instead of the flag we need to keep the counter because we can
> acquire the same relation extension lock multiple times.  So
> basically, every time we acquire the lock we can increment the counter
> and while releasing we can decrement it.   During an error path, I
> think it is fine to set it to 0 in CommitTransaction/AbortTransaction.
> But, I am not sure that we can set to 0 or decrement it in
> AbortSubTransaction because we are not sure whether we have acquired
> the lock under this subtransaction or not.

I think that CommitTransaction, AbortTransaction, and friends have
*zero* business touching this. I think the counter - or flag - should
track whether we've got a PROCLOCK entry for a relation extension
lock. We either do, or we do not, and that does not change because of
anything having to do with the transaction state. It changes because
somebody calls LockRelease() or LockReleaseAll().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Sat, Mar 7, 2020 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I continue to think that we'd be better off getting all of this
> out of the heavyweight lock manager.  There is no reason why we
> should need deadlock detection, or multiple holds of the same
> lock, or pretty much anything that LWLocks don't give you.

Well, that was my initial inclination too, but Andres didn't like it.
I don't know whether it's better to take his advice or yours.

The one facility that we need here which the heavyweight lock facility
does provide and the lightweight lock facility does not is the ability
to take locks on an effectively unlimited number of distinct objects.
That is, we can't have a separate LWLock for every relation, because
there are ~2^32 relation OIDs per database, and ~2^32 database OIDs, and a
patch that tried to allocate a tranche of 2^64 LWLocks would probably
get shot down.

The patch I wrote for this tried to work around this by having an
array of LWLocks and hashing <DBOID, RELOID> pairs onto array slots.
This produces some false sharing, though, which Andres didn't like
(and I can understand his concern). We could work around that problem
with a more complex design, where the LWLocks in the array do not
themselves represent the right to extend the relation, but only
protect the list of lockers. But at that point it starts to look like
you are reinventing the whole LOCK/PROCLOCK division.

So from my point of view we've got three possible approaches here, all
imperfect:

- Hash <DB, REL> pairs onto an array of LWLocks that represent the
right to extend the relation. Problem: false sharing for the whole
time the lock is held.

- Hash <DB, REL> pairs onto an array of LWLocks that protect a list of
lockers. Problem: looks like reinventing LOCK/PROCLOCK mechanism,
which is a fair amount of complexity to be duplicating.

- Adapt the heavyweight lock manager. Problem: Code is old, complex,
grotty, and doesn't need more weird special cases.

Whatever we choose, I think we ought to try to get Page locks and
Relation Extension locks into the same system. They're conceptually
the same kind of thing: you're not locking an SQL object, you
basically want an LWLock, but you can't use an LWLock because you want
to lock an OID not a piece of shared memory, so you can't have enough
LWLocks to use them in the regular way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Tue, Mar 10, 2020 at 6:48 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Mar 7, 2020 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I continue to think that we'd be better off getting all of this
> > out of the heavyweight lock manager.  There is no reason why we
> > should need deadlock detection, or multiple holds of the same
> > lock, or pretty much anything that LWLocks don't give you.
>
> Well, that was my initial inclination too, but Andres didn't like it.
> I don't know whether it's better to take his advice or yours.
>
> The one facility that we need here which the heavyweight lock facility
> does provide and the lightweight lock facility does not is the ability
> to take locks on an effectively unlimited number of distinct objects.
> That is, we can't have a separate LWLock for every relation, because
> there ~2^32 relation OIDs per database, and ~2^32 database OIDs, and a
> patch that tried to allocate a tranche of 2^64 LWLocks would probably
> get shot down.
>

I think if we have to follow any LWLock-based design, then we also
need to think about the case where the lock is already acquired by
the backend (say in X mode): it should then be granted if the same
backend tries to acquire it in the same mode (or in a mode that is
compatible with the mode in which it is already held).  As per my
analysis above [1], we do this at multiple places for the relation
extension lock.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BE8Vu%3D9PYZBZvMrga0Ynz_m6jmT3G_vJv-3L1PWv9Krg%40mail.gmail.com

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Tue, Mar 10, 2020 at 8:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Please find the updated patch (summary of the changes)
> - Instead of searching the lock hash table for assert, it maintains a counter.
> - Also, handled the case where we can acquire the relation extension
> lock while holding the relation extension lock on the same relation.
> - Handled the error case.
>
> In addition to that prepared a WIP patch for handling the PageLock.
> First, I thought that we can use the same counter for the PageLock and
> the RelationExtensionLock because in assert we just need to check
> whether we are trying to acquire any other heavyweight lock while
> holding any of these locks.  But, the exceptional case where we
> allowed to acquire a relation extension lock while holding any of
> these locks is a bit different.  Because, if we are holding a relation
> extension lock then we allowed to acquire the relation extension lock
> on the same relation but it can not be any other relation otherwise it
> can create a cycle.  But, the same is not true with the PageLock,
> i.e. while holding the PageLock you can acquire the relation extension
> lock on any relation and that will be safe because the relation
> extension lock guarantee that, it will never create the cycle.
> However, I agree that we don't have any such cases where we want to
> acquire a relation extension lock on the different relations while
> holding the PageLock.
>

Right, today we don't have such cases where, after acquiring a
relation extension or page lock for a particular relation, we need to
acquire either of those for another relation, and I can't offhand
think of many cases where we might have such a need in the future.
One theoretical possibility is to include fork_num in the lock tag
while acquiring the extension lock for the fsm/vm, but that will
still involve the same relation.  Similarly, one might say it is
valid to acquire the extension lock in share mode after we have
acquired it in exclusive mode.  I am not sure how futuristic we want
to make these Asserts.

I feel we should cover the currently possible cases (which I think
will make the asserts stricter than required) and if there is a need
to relax them in the future for any particular use case, then we will
consider it.  In general, if we follow the way Mahendra has written
his patch, which is to find the entry via the local hash table to
check the Assert condition, it will be a bit easier to extend the
checks if required in the future, as that way we have more
information about the particular lock. However, it will make the
check more expensive, which might be okay considering that it is only
for Assert-enabled builds.

One minor comment:
/*
+ * We should not acquire any other lock if we are already holding the
+ * relation extension lock.  Only exception is that if we are trying to
+ * acquire the relation extension lock then we can hold the relation
+ * extension on the same relation.
+ */
+ Assert(!IsRelExtLockHeld() ||
+    ((locktag->locktag_type == LOCKTAG_RELATION_EXTEND) && found));

I think you don't need the second part of the check because if we
have found the lock in the local lock table, we would return before
this check.  I think it will catch the case where, if we have an
extension lock on one relation, it won't allow us to acquire it on
another relation. OTOH, it will also disallow cases where the backend
has the relation extension lock in Exclusive mode and tries to
acquire it in Shared mode. So, not sure if it is a good idea.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Tue, Mar 10, 2020 at 6:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Mar 6, 2020 at 11:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > I think instead of the flag we need to keep the counter because we can
> > acquire the same relation extension lock multiple times.  So
> > basically, every time we acquire the lock we can increment the counter
> > and while releasing we can decrement it.   During an error path, I
> > think it is fine to set it to 0 in CommitTransaction/AbortTransaction.
> > But, I am not sure that we can set to 0 or decrement it in
> > AbortSubTransaction because we are not sure whether we have acquired
> > the lock under this subtransaction or not.
>
> I think that CommitTransaction, AbortTransaction, and friends have
> *zero* business touching this. I think the counter - or flag - should
> track whether we've got a PROCLOCK entry for a relation extension
> lock. We either do, or we do not, and that does not change because of
> anything have to do with the transaction state. It changes because
> somebody calls LockRelease() or LockReleaseAll().
>

Do we want to have a special check in LockRelease() to identify
whether we are releasing a relation extension lock?  If not, then how
will we identify that the relation extension lock has been released
so that we can reset the flag during a subtransaction abort due to an
error?  During success paths, we know when we have released the
RelationExtension or Page lock (via UnlockRelationForExtension or
UnlockPage).  At the top-level transaction end, we know when we have
released all the locks, so that implies that RelationExtension and/or
Page locks must have been released by that time.

If we have no other choice, then I see a few downsides of adding a
special check in the LockRelease() call:

1. Instead of resetting/decrementing the variable from specific APIs
like UnlockRelationForExtension or UnlockPage, we need to do it in
LockRelease. It will also look odd if we set the variable in
LockRelationForExtension but don't reset it in the
UnlockRelationForExtension variant.  Now, maybe we can allow
resetting it at both places if it is a flag, but not if it is a
counter variable.

2. One can argue that adding extra instructions in a generic path
(like LockRelease) is not a good idea, especially if those are for an
Assert. I understand this won't add anything that we can measure by
standard benchmarks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Wed, Mar 11, 2020 at 2:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Mar 10, 2020 at 8:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Please find the updated patch (summary of the changes)
> > - Instead of searching the lock hash table for assert, it maintains a counter.
> > - Also, handled the case where we can acquire the relation extension
> > lock while holding the relation extension lock on the same relation.
> > - Handled the error case.
> >
> > In addition to that prepared a WIP patch for handling the PageLock.
> > First, I thought that we can use the same counter for the PageLock and
> > the RelationExtensionLock because in assert we just need to check
> > whether we are trying to acquire any other heavyweight lock while
> > holding any of these locks.  But, the exceptional case where we
> > allowed to acquire a relation extension lock while holding any of
> > these locks is a bit different.  Because, if we are holding a relation
> > extension lock then we allowed to acquire the relation extension lock
> > on the same relation but it can not be any other relation otherwise it
> > can create a cycle.  But, the same is not true with the PageLock,
> > i.e. while holding the PageLock you can acquire the relation extension
> > lock on any relation and that will be safe because the relation
> > extension lock guarantee that, it will never create the cycle.
> > However, I agree that we don't have any such cases where we want to
> > acquire a relation extension lock on the different relations while
> > holding the PageLock.
> >
>
> Right, today, we don't have such cases where after acquiring relation
> extension or page lock for a particular relation, we need to acquire
> any of those for other relation and I am not able to offhand think of
> many cases where we might have such a need in the future.  The one
> theoretical possibility is to include fork_num in the lock tag while
> acquiring extension lock for fsm/vm, but that will also have the same
> relation.  Similarly one might say it is valid to acquire extension
> lock in share mode after we have acquired it exclusive mode.  I am not
> sure how much futuristic we want to make these Asserts.
>
> I feel we should cover the current possible cases (which I think will
> make the asserts more strict then required) and if there is a need to
> relax them in the future for any particular use case, then we will
> consider those.  In general, if we consider the way Mahendra has
> written a patch which is to find the entry via the local hash table to
> check for an Assert condition, then it will be a bit easier to extend
> the checks if required in future as that way we have more information
> about the particular lock. However, it will make the check more
> expensive which might be okay considering that it is only for Assert
> enabled builds.
>
> One minor comment:
> /*
> + * We should not acquire any other lock if we are already holding the
> + * relation extension lock.  Only exception is that if we are trying to
> + * acquire the relation extension lock then we can hold the relation
> + * extension on the same relation.
> + */
> + Assert(!IsRelExtLockHeld() ||
> +    ((locktag->locktag_type == LOCKTAG_RELATION_EXTEND) && found));
>
> I think you don't need the second part of the check because if we have
> found the lock in the local lock table, we would return before this
> check.

Right.

>   I think it will catch the case where if we have an extension
> lock on one relation, then it won't allow us to acquire it on another
> relation.

But those will be caught even if we remove the second part, right?
Basically, if we have Assert(!IsRelExtLockHeld()), that means by this
time you should not hold any relation extension lock.  The exceptional
case where we allow relation extension on the same relation will
anyway not reach here.  I think the second part of the Assert is just
useless.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Tue, Mar 10, 2020 at 4:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 24, 2020 at 3:38 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Feb 20, 2020 at 8:06 AM Andres Freund <andres@anarazel.de> wrote:
> > > What I'm advocating is that extension locks should continue to go
> > > through lock.c. And yes, that requires some changes to group locking,
> > > but I still don't see why they'd be complicated.
> > >
> >
> > Fair position, as per initial analysis, I think if we do below three
> > things, it should work out without changing to a new way of locking
> > for relation extension or page type locks.
> > a. As per the discussion above, ensure in code we will never try to
> > acquire another heavy-weight lock after acquiring relation extension
> > or page type locks (probably by having Asserts in code or maybe some
> > other way).
>
> I have done an analysis of the relation extension lock (which can be
> acquired via LockRelationForExtension or
> ConditionalLockRelationForExtension) and found that we don't acquire
> any other heavyweight lock after acquiring it. However, we do
> sometimes try to acquire it again in the places where we update FSM
> after extension, see points (e) and (f) described below.  The usage of
> this lock can be broadly divided into six categories and each one is
> explained as follows:
>
> [...]
>
> So, as per this analysis, we can add Asserts for relation extension
> and page locks which will indicate that they won't participate in
> deadlocks.  It would be good if someone else can also do independent
> analysis and verify my findings.

I have also analyzed the usage of the RelationExtensionLock and the
PageLock, and my findings are along the same lines.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Wed, Mar 11, 2020 at 2:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Mar 10, 2020 at 8:53 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > [...]
>
> One minor comment:
> /*
> + * We should not acquire any other lock if we are already holding the
> + * relation extension lock.  Only exception is that if we are trying to
> + * acquire the relation extension lock then we can hold the relation
> + * extension on the same relation.
> + */
> + Assert(!IsRelExtLockHeld() ||
> +    ((locktag->locktag_type == LOCKTAG_RELATION_EXTEND) && found));

I have fixed this in the attached patch set.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Wed, Mar 11, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Mar 10, 2020 at 6:39 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Fri, Mar 6, 2020 at 11:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > I think instead of the flag we need to keep the counter because we can
> > > acquire the same relation extension lock multiple times.  So
> > > basically, every time we acquire the lock we can increment the counter
> > > and while releasing we can decrement it.   During an error path, I
> > > think it is fine to set it to 0 in CommitTransaction/AbortTransaction.
> > > But, I am not sure that we can set to 0 or decrement it in
> > > AbortSubTransaction because we are not sure whether we have acquired
> > > the lock under this subtransaction or not.
> >
> > I think that CommitTransaction, AbortTransaction, and friends have
> > *zero* business touching this. I think the counter - or flag - should
> > track whether we've got a PROCLOCK entry for a relation extension
> > lock. We either do, or we do not, and that does not change because of
> > anything have to do with the transaction state. It changes because
> > somebody calls LockRelease() or LockReleaseAll().
> >
>
> Do we want to have a special check in the LockRelease() to identify
> whether we are releasing relation extension lock?  If not, then how we
> will identify that relation extension is released and we can reset it
> during subtransaction abort due to error?  During success paths, we
> know when we have released RelationExtension or Page Lock (via
> UnlockRelationForExtension or UnlockPage).  During the top-level
> transaction end, we know when we have released all the locks, so that
> will imply that RelationExtension and or Page locks must have been
> released by that time.
>
> If we have no other choice, then I see a few downsides of adding a
> special check in the LockRelease() call:
>
> 1. Instead of resetting/decrement the variable from specific APIs like
> UnlockRelationForExtension or UnlockPage, we need to have it in
> LockRelease. It will also look odd, if set variable in
> LockRelationForExtension, but don't reset in the
> UnlockRelationForExtension variant.  Now, maybe we can allow to reset
> it at both places if it is a flag, but not if it is a counter
> variable.
>
> 2. One can argue that adding extra instructions in a generic path
> (like LockRelease) is not a good idea, especially if those are for an
> Assert. I understand this won't add anything which we can measure by
> standard benchmarks.

I have just written a WIP patch for the relation extension lock in
which, instead of incrementing and decrementing a counter in
LockRelationForExtension and UnlockRelationForExtension respectively,
we simply set and reset a flag in LockAcquireExtended and
LockRelease.  So this patch appears simpler to me, as we are not
involving the transaction APIs to set and reset the flag.  However, we
need to add an extra check, as you have already mentioned.  I think we
could measure the performance and see whether it has any impact or
not.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have fixed this in the attached patch set.
>

I have modified your
v4-0003-Conflict-Extension-Page-lock-in-group-member patch.  The
modifications are (a) Change src/backend/storage/lmgr/README to
reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
slightly simplifies the code, (c) moved the deadlock.c check a few
lines up and (d) changed a few comments.

It might be better if we can move the checks related to extension and
page locks into a separate API or macro.  What do you think?

I have also used an extension to test this patch.  This is the same
extension that I have used to test the group locking patch.  It will
allow backends to form a group as we do for parallel workers.  The
extension is attached to this email.

Test without patch:
Session-1
Create table t1(c1 int, c2 char(500));
Select become_lock_group_leader();

Insert into t1 values(generate_series(1,100),'aaa'); -- stop this
after acquiring relation extension lock via GDB.

Session-2
Select  become_lock_group_member();
Insert into t1 values(generate_series(101,200),'aaa');
- Debug LockAcquire and found that it doesn't generate conflict for
Relation Extension lock.

The above experiment has shown that, without the patch, group members
can acquire the relation extension lock if the group leader holds that
lock.  After the patch, the second session waits for the first session
to release the relation extension lock.  I know this is not a perfect
way to test, but it is better than nothing.  I think we need to do
some more testing, either using this extension or some other way, for
extension and page locks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Thu, Mar 12, 2020 at 5:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have fixed this in the attached patch set.
> >
>
> I have modified your
> v4-0003-Conflict-Extension-Page-lock-in-group-member patch.  The
> modifications are (a) Change src/backend/storage/lmgr/README to
> reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
> slightly simplifies the code, (c) moved the deadlock.c check a few
> lines up and (d) changed a few comments.
>
> It might be better if we can move the checks related to extension and
> page lock in a separate API or macro.  What do you think?
>
I think moving them inside a macro is a good idea. Also, I think we
should move all the Assert related code inside some debugging macro
similar to this:
#ifdef LOCK_DEBUG
....
#endif

+ /*
+ * The relation extension or page lock can never participate in actual
+ * deadlock cycle.  See Asserts in LockAcquireExtended.  So, there is
+ * no advantage in checking wait edges from it.
+ */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
+ return false;
+
Since this is true, we can also avoid these kinds of locks in
ExpandConstraints, right? It'll certainly reduce some complexity in
the topological sort.

  /*
+ * The relation extension or page lock conflict even between the group
+ * members.
+ */
+ if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
+ (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
+ {
+ PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
+ proclock);
+ return true;
+ }
This check covers the heavyweight locks that conflict even within the
same parallel group. These locks also have another property: they can
never participate in deadlock cycles. And the number of locks in this
category is likely to increase in future with new parallel features.
Hence, the check could be used in multiple places. Should we move the
condition inside a macro and just call it from here?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



On Thu, Mar 12, 2020 at 7:50 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> On Thu, Mar 12, 2020 at 5:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have fixed this in the attached patch set.
> > >
> >
> > I have modified your
> > v4-0003-Conflict-Extension-Page-lock-in-group-member patch.  The
> > modifications are (a) Change src/backend/storage/lmgr/README to
> > reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
> > slightly simplifies the code, (c) moved the deadlock.c check a few
> > lines up and (d) changed a few comments.
> >
> > It might be better if we can move the checks related to extension and
> > page lock in a separate API or macro.  What do you think?
> >
> I think moving them inside a macro is a good idea. Also, I think we
> should move all the Assert related code inside some debugging macro
> similar to this:
> #ifdef LOCK_DEBUG
> ....
> #endif
>

If we move it under some macro, then those Asserts will only be
enabled when that macro is defined.  I think we want these Asserts to
always be enabled in an assert-enabled build; they will be like any
other Asserts in the code.  What is the advantage of putting them
under a macro?

> + /*
> + * The relation extension or page lock can never participate in actual
> + * deadlock cycle.  See Asserts in LockAcquireExtended.  So, there is
> + * no advantage in checking wait edges from it.
> + */
> + if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
> + (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
> + return false;
> +
> Since this is true, we can also avoid these kind of locks in
> ExpandConstraints, right?
>

Yes, I had also thought about it but left it to avoid sprinkling such
checks at more places than absolutely required.

> It'll certainly reduce some complexity in
> topological sort.
>

I think you mean to say that TopoSort will have to look at fewer
members in the wait queue; otherwise, there is nothing in the code
which we can remove/change there. I think there will hardly be any
chance that such locks participate here, because we take them for some
work and release them right away (basically, they are unlike other
heavyweight locks, which are released at the end).   Having said
that, I am not against putting those checks at the place you are
suggesting; it is just that I thought they won't be of much use.

>   /*
> + * The relation extension or page lock conflict even between the group
> + * members.
> + */
> + if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
> + (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
> + {
> + PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
> + proclock);
> + return true;
> + }
> This check includes the heavyweight locks that conflict even under
> same parallel group. It also has another property that they can never
> participate in deadlock cycles. And, the number of locks under this
> category is likely to increase in future with new parallel features.
> Hence, it could be used in multiple places. Should we move the
> condition inside a macro and just call it from here?
>

Right, this is what I have suggested upthread. Do you have any
suggestions for naming such a macro or function?  I could think of
something like LocksConflictAmongGroupMembers or
LocksNotParticipateInDeadlock. The first one suits its usage in
LockCheckConflicts better and the second the deadlock.c code, so
neither of them sounds perfect to me.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Thu, Mar 12, 2020 at 5:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have fixed this in the attached patch set.
> >
>
> I have modified your
> v4-0003-Conflict-Extension-Page-lock-in-group-member patch.  The
> modifications are (a) Change src/backend/storage/lmgr/README to
> reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
> slightly simplifies the code, (c) moved the deadlock.c check a few
> lines up and (d) changed a few comments.

Changes look fine to me.

> It might be better if we can move the checks related to extension and
> page lock in a separate API or macro.  What do you think?

I feel it looks cleaner this way as well.  But, if we plan to move it
to a common function/macro, then we should use some common name such
that it can be used in FindLockCycleRecurseMember as well as in
LockCheckConflicts.

> I have also used an extension to test this patch.  This is the same
> extension that I have used to test the group locking patch.  It will
> allow backends to form a group as we do for parallel workers.  The
> extension is attached to this email.
>
> Test without patch:
> Session-1
> Create table t1(c1 int, c2 char(500));
> Select become_lock_group_leader();
>
> Insert into t1 values(generate_series(1,100),'aaa'); -- stop this
> after acquiring relation extension lock via GDB.
>
> Session-2
> Select  become_lock_group_member();
> Insert into t1 values(generate_series(101,200),'aaa');
> - Debug LockAcquire and found that it doesn't generate conflict for
> Relation Extension lock.
>
> The above experiment has shown that without patch group members can
> acquire relation extension lock if the group leader has that lock.
> After patch the second session waits for the first session to release
> the relation extension lock. I know this is not a perfect way to test,
> but it is better than nothing.  I think we need to do some more
> testing either using this extension or some other way for extension
> and page locks.

I have also tested the same and verified it.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Fri, Mar 13, 2020 at 8:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Mar 12, 2020 at 7:50 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> >
> > On Thu, Mar 12, 2020 at 5:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > I have fixed this in the attached patch set.
> > > >
> > >
> > > I have modified your
> > > v4-0003-Conflict-Extension-Page-lock-in-group-member patch.  The
> > > modifications are (a) Change src/backend/storage/lmgr/README to
> > > reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
> > > slightly simplifies the code, (c) moved the deadlock.c check a few
> > > lines up and (d) changed a few comments.
> > >
> > > It might be better if we can move the checks related to extension and
> > > page lock in a separate API or macro.  What do you think?
> > >
> > I think moving them inside a macro is a good idea. Also, I think we
> > should move all the Assert related code inside some debugging macro
> > similar to this:
> > #ifdef LOCK_DEBUG
> > ....
> > #endif
> >
>
> If we move it under some macro, then those Asserts will be only
> enabled when that macro is defined.  I think we want there Asserts to
> be enabled always in assert enabled build, these will be like any
> other Asserts in the code.  What is the advantage of doing those under
> macro?
>
> > + /*
> > + * The relation extension or page lock can never participate in actual
> > + * deadlock cycle.  See Asserts in LockAcquireExtended.  So, there is
> > + * no advantage in checking wait edges from it.
> > + */
> > + if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
> > + (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
> > + return false;
> > +
> > Since this is true, we can also avoid these kind of locks in
> > ExpandConstraints, right?
> >
>
> Yes, I had also thought about it but left it to avoid sprinkling such
> checks at more places than absolutely required.
>
> > It'll certainly reduce some complexity in
> > topological sort.
> >
>
> I think you mean to say TopoSort will have to look at fewer members in
> the wait queue, otherwise, there is nothing from the perspective of
> code which we can remove/change there. I think there will be hardly
> any chance that such locks will participate here because we take those
> for some work and release them (basically, they are unlike other
> heavyweight locks which can be released at the end).   Having said
> that, I am not against putting those checks at the place you are
> suggesting, it is just that I thought that it won't be of much use.

I am not sure I understand this part.  The topological sort works on
the soft edges we have created when we found the cycle, but for
relation extension/page locks we are completely ignoring both hard and
soft edges, so they can never participate in the topological sort
either.  Am I missing something?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Thu, Mar 12, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Mar 11, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > If we have no other choice, then I see a few downsides of adding a
> > special check in the LockRelease() call:
> >
> > 1. Instead of resetting/decrement the variable from specific APIs like
> > UnlockRelationForExtension or UnlockPage, we need to have it in
> > LockRelease. It will also look odd, if set variable in
> > LockRelationForExtension, but don't reset in the
> > UnlockRelationForExtension variant.  Now, maybe we can allow to reset
> > it at both places if it is a flag, but not if it is a counter
> > variable.
> >
> > 2. One can argue that adding extra instructions in a generic path
> > (like LockRelease) is not a good idea, especially if those are for an
> > Assert. I understand this won't add anything which we can measure by
> > standard benchmarks.
>
> I have just written a WIP patch for relation extension lock where
> instead of incrementing and decrementing the counter in
> LockRelationForExtension and UnlockRelationForExtension respectively.
> We can just set and reset the flag in LockAcquireExtended and
> LockRelease.  So this patch appears simple to me as we are not
> involving the transaction APIs to set and reset the flag.  However, we
> need to add an extra check as you have already mentioned.  I think we
> could measure the performance and see whether it has any impact or
> not?
>

LockAcquireExtended()
{
..
+ if (locktag->locktag_type == LOCKTAG_RELATION_EXTEND)
+ IsRelationExtensionLockHeld = true;
..
}

Can we move this check inside a function (CheckAndSetLockHeld or
something like that), as we need to add a similar thing for the page
lock?  Also, how about moving the set and reset of these flags to
GrantLockLocal and RemoveLocalLock, as that will further reduce the
number of places where we need to add such a check.  Another thing is
to see if it makes sense to have a macro like LOCALLOCK_LOCKMETHOD to
get the lock tag.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Fri, Mar 13, 2020 at 11:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Mar 12, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Wed, Mar 11, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > If we have no other choice, then I see a few downsides of adding a
> > > special check in the LockRelease() call:
> > >
> > > 1. Instead of resetting/decrement the variable from specific APIs like
> > > UnlockRelationForExtension or UnlockPage, we need to have it in
> > > LockRelease. It will also look odd, if set variable in
> > > LockRelationForExtension, but don't reset in the
> > > UnlockRelationForExtension variant.  Now, maybe we can allow to reset
> > > it at both places if it is a flag, but not if it is a counter
> > > variable.
> > >
> > > 2. One can argue that adding extra instructions in a generic path
> > > (like LockRelease) is not a good idea, especially if those are for an
> > > Assert. I understand this won't add anything which we can measure by
> > > standard benchmarks.
> >
> > I have just written a WIP patch for relation extension lock where
> > instead of incrementing and decrementing the counter in
> > LockRelationForExtension and UnlockRelationForExtension respectively.
> > We can just set and reset the flag in LockAcquireExtended and
> > LockRelease.  So this patch appears simple to me as we are not
> > involving the transaction APIs to set and reset the flag.  However, we
> > need to add an extra check as you have already mentioned.  I think we
> > could measure the performance and see whether it has any impact or
> > not?
> >
>
> LockAcquireExtended()
> {
> ..
> + if (locktag->locktag_type == LOCKTAG_RELATION_EXTEND)
> + IsRelationExtensionLockHeld = true;
> ..
> }
>
> Can we move this check inside a function (CheckAndSetLockHeld or
> something like that) as we need to add a similar thing for page lock?

ok

> Also, how about moving the set and reset of these flags to
> GrantLockLocal and RemoveLocalLock as that will further reduce the
> number of places where we need to add such a check.

Makes sense to me.

 Another thing is
> to see if it makes sense to have a macro like LOCALLOCK_LOCKMETHOD to
> get the lock tag.

ok

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Fri, Mar 13, 2020 at 8:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Mar 12, 2020 at 7:50 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > I think moving them inside a macro is a good idea. Also, I think we
> > should move all the Assert related code inside some debugging macro
> > similar to this:
> > #ifdef LOCK_DEBUG
> > ....
> > #endif
> >
> If we move it under some macro, then those Asserts will be only
> enabled when that macro is defined.  I think we want there Asserts to
> be enabled always in assert enabled build, these will be like any
> other Asserts in the code.  What is the advantage of doing those under
> macro?
>
My concern is related to performance regression. We're using two
static variables in hot paths only for checking a few asserts. So, I'm
not sure whether we should enable the same by default, especially when
asserts themselves are disabled.
-ResetRelExtLockHeldCount()
+ResetRelExtPageLockHeldCount()
 {
  RelationExtensionLockHeldCount = 0;
+ PageLockHeldCount = 0;
+}
Also, we're calling this function from frequently used paths like
Commit/AbortTransaction. So, it's better if these two static variables
share the same cache line, so that we can reinitialize them with a
single instruction.

>
> >   /*
> > + * The relation extension or page lock conflict even between the group
> > + * members.
> > + */
> > + if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
> > + (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
> > + {
> > + PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
> > + proclock);
> > + return true;
> > + }
> > This check includes the heavyweight locks that conflict even under
> > same parallel group. It also has another property that they can never
> > participate in deadlock cycles. And, the number of locks under this
> > category is likely to increase in future with new parallel features.
> > Hence, it could be used in multiple places. Should we move the
> > condition inside a macro and just call it from here?
> >
>
> Right, this is what I have suggested upthread. Do you have any
> suggestions for naming such a macro or function?  I could think of
> something like LocksConflictAmongGroupMembers or
> LocksNotParticipateInDeadlock. The first one suits more for its usage
> in LockCheckConflicts and the second in the deadlock.c code. So none
> of those sound perfect to me.
>
Actually, I'm not able to come up with a good suggestion. I'm trying
to think of a generic name similar to strong or weak locks but with
the following properties:
a. Locks that don't participate in deadlock detection
b. Locks that conflicts in the same parallel group

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



On Fri, Mar 13, 2020 at 8:42 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Thu, Mar 12, 2020 at 7:50 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> >
> > > + /*
> > > + * The relation extension or page lock can never participate in actual
> > > + * deadlock cycle.  See Asserts in LockAcquireExtended.  So, there is
> > > + * no advantage in checking wait edges from it.
> > > + */
> > > + if ((LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND) ||
> > > + (LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
> > > + return false;
> > > +
> > > Since this is true, we can also avoid these kind of locks in
> > > ExpandConstraints, right?
>
> I am not sure I understand this part.  Because topological sort will
> work on the soft edges we have created when we found the cycle,  but
> for relation extension/page lock we are completely ignoring hard/soft
> edge then it will never participate in topo sort as well.  Am I
> missing something?
>
No, I think you're right. We only add constraints if we've detected a
cycle in the graph. Hence, you don't need the check here.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



On Fri, Mar 13, 2020 at 8:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Mar 12, 2020 at 5:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have fixed this in the attached patch set.
> > >
> >
> > I have modified your
> > v4-0003-Conflict-Extension-Page-lock-in-group-member patch.  The
> > modifications are (a) Change src/backend/storage/lmgr/README to
> > reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
> > slightly simplifies the code, (c) moved the deadlock.c check a few
> > lines up and (d) changed a few comments.
>
> Changes look fine to me.
>

Today, while looking at this patch again, I realized that there is a
case where we sometimes allow group members to jump the wait queue.
This is primarily to avoid creating deadlocks (see ProcSleep).  Now,
ideally, we don't need this for relation extension or page locks, as
those can never lead to deadlocks.  However, the current code will
give group members more priority to acquire relation extension or page
locks if any one of the members holds those locks.  Now, if we want,
we can prevent giving group members priority for these locks, but I am
not sure how important that case is.  So, I have left it as it is and
added a few comments.  What do you think?

Additionally, I have changed/added a few more sentences in README.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Fri, Mar 13, 2020 at 2:32 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> On Fri, Mar 13, 2020 at 8:29 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Mar 12, 2020 at 7:50 PM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > > I think moving them inside a macro is a good idea. Also, I think we
> > > should move all the Assert related code inside some debugging macro
> > > similar to this:
> > > #ifdef LOCK_DEBUG
> > > ....
> > > #endif
> > >
> > If we move it under some macro, then those Asserts will be only
> > enabled when that macro is defined.  I think we want there Asserts to
> > be enabled always in assert enabled build, these will be like any
> > other Asserts in the code.  What is the advantage of doing those under
> > macro?
> >
> My concern is related to performance regression. We're using two
> static variables in hot-paths only for checking a few asserts. So, I'm
> not sure whether we should enable the same by default, specially when
> asserts are itself disabled.
> -ResetRelExtLockHeldCount()
> +ResetRelExtPageLockHeldCount()
>  {
>   RelationExtensionLockHeldCount = 0;
> + PageLockHeldCount = 0;
> +}
> Also, we're calling this method from frequently used functions like
> Commit/AbortTransaction. So, it's better these two static variables
> share the same cache line and reinitalize them with a single
> instruction.

In the recent version of the patch, instead of a counter, we have gone
with a flag.  So I think now we can keep a single variable and just
reset the bit in a single instruction.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Fri, Mar 13, 2020 at 11:16 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Mar 13, 2020 at 11:08 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Mar 12, 2020 at 3:04 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > On Wed, Mar 11, 2020 at 2:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > >
> > > > If we have no other choice, then I see a few downsides of adding a
> > > > special check in the LockRelease() call:
> > > >
> > > > 1. Instead of resetting/decrement the variable from specific APIs like
> > > > UnlockRelationForExtension or UnlockPage, we need to have it in
> > > > LockRelease. It will also look odd, if set variable in
> > > > LockRelationForExtension, but don't reset in the
> > > > UnlockRelationForExtension variant.  Now, maybe we can allow to reset
> > > > it at both places if it is a flag, but not if it is a counter
> > > > variable.
> > > >
> > > > 2. One can argue that adding extra instructions in a generic path
> > > > (like LockRelease) is not a good idea, especially if those are for an
> > > > Assert. I understand this won't add anything which we can measure by
> > > > standard benchmarks.
> > >
> > > I have just written a WIP patch for relation extension lock where
> > > instead of incrementing and decrementing the counter in
> > > LockRelationForExtension and UnlockRelationForExtension respectively.
> > > We can just set and reset the flag in LockAcquireExtended and
> > > LockRelease.  So this patch appears simple to me as we are not
> > > involving the transaction APIs to set and reset the flag.  However, we
> > > need to add an extra check as you have already mentioned.  I think we
> > > could measure the performance and see whether it has any impact or
> > > not?
> > >
> >
> > LockAcquireExtended()
> > {
> > ..
> > + if (locktag->locktag_type == LOCKTAG_RELATION_EXTEND)
> > + IsRelationExtensionLockHeld = true;
> > ..
> > }
> >
> > Can we move this check inside a function (CheckAndSetLockHeld or
> > something like that) as we need to add a similar thing for page lock?
>
> ok

Done

>
> > Also, how about moving the set and reset of these flags to
> > GrantLockLocal and RemoveLocalLock as that will further reduce the
> > number of places where we need to add such a check.
>
> Make sense to me.

Done
>

>  Another thing is
> > to see if it makes sense to have a macro like LOCALLOCK_LOCKMETHOD to
> > get the lock tag.
>
> ok

Done

Apart from that, I have also extended the solution to the page lock.
And I have broken down the 3rd patch into two parts: one for relation
extension and one for the page lock.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Fri, Mar 13, 2020 at 3:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Mar 13, 2020 at 8:37 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Thu, Mar 12, 2020 at 5:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Thu, Mar 12, 2020 at 11:15 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > I have fixed this in the attached patch set.
> > > >
> > >
> > > I have modified your
> > > v4-0003-Conflict-Extension-Page-lock-in-group-member patch.  The
> > > modifications are (a) Change src/backend/storage/lmgr/README to
> > > reflect new behaviour, (b) Introduce a new macro LOCK_LOCKTAG which
> > > slightly simplifies the code, (c) moved the deadlock.c check a few
> > > lines up and (d) changed a few comments.
> >
> > Changes look fine to me.
> >
>
> Today, while looking at this patch again, I realized that there is a
> case where we sometimes allow group members to jump the wait queue.  This
> is primarily to avoid creating deadlocks (see ProcSleep).  Now,
> ideally, we don't need this for relation extension or page locks as
> those can never lead to deadlocks.  However, the current code will
> give group members more priority to acquire relation extension or page
> locks if any one of the members has held those locks.  Now, if we want
> we can prevent giving group members priority for these locks, but I am
> not sure how important is that case.  So, I have left that as it is by
> adding a few comments.  What do you think?
>
> Additionally, I have changed/added a few more sentences in README.

I have included all your changes in the latest patch set.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Fri, Mar 13, 2020 at 7:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Apart from that, I have also extended the solution for the page lock.
> And, I have also broken down the 3rd patch in two parts for relation
> extension and for the page lock.
>

Thanks, I have made a number of cosmetic changes and written
appropriate commit messages for all patches.  See the attached patch
series and let me know your opinion. BTW, did you get a chance to test
page locks using the extension which I have posted above, or in some
other way?  I think it is important to test the page-lock related
patches now.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Sat, Mar 14, 2020 at 7:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Mar 13, 2020 at 7:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Apart from that, I have also extended the solution for the page lock.
> > And, I have also broken down the 3rd patch in two parts for relation
> > extension and for the page lock.
> >
>
> Thanks, I have made a number of cosmetic changes and written
> appropriate commit messages for all patches.  See the attached patch
> series and let me know your opinion? BTW, did you get a chance to test
> page locks by using the extension which I have posted above or by some
> other way?  I think it is important to test page-lock related patches
> now.

I have reviewed the updated patches and they look fine to me.  Apart
from this, I have done testing of the page lock using the group
locking extension.

--Setup
create table gin_test_tbl(i int4[]) with (autovacuum_enabled = off);
create index gin_test_idx on gin_test_tbl using gin (i);
create table gin_test_tbl1(i int4[]) with (autovacuum_enabled = off);
create index gin_test_idx1 on gin_test_tbl1 using gin (i);

--session1:
select become_lock_group_leader();
select gin_clean_pending_list('gin_test_idx');

--session2:
select become_lock_group_member(session1_pid);
select gin_clean_pending_list('gin_test_idx1');

--session3:
select become_lock_group_leader();
select gin_clean_pending_list('gin_test_idx1');

--session4:
select become_lock_group_member(session3_pid);
select gin_clean_pending_list('gin_test_idx');

ERROR:  deadlock detected
DETAIL:  Process 61953 waits for ExclusiveLock on page 0 of relation
16399 of database 13577; blocked by process 62197.
Process 62197 waits for ExclusiveLock on page 0 of relation 16400 of
database 13577; blocked by process 61953.
HINT:  See server log for query details.


Session1 and Session3 acquire the PageLock on two different indexes'
meta-pages and are blocked in gdb; meanwhile, their group members try
to acquire the page locks as shown in the above example, and a
deadlock is detected.  This is solved after applying the patch.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Sun, Mar 15, 2020 at 1:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Mar 14, 2020 at 7:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Mar 13, 2020 at 7:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Apart from that, I have also extended the solution for the page lock.
> > > And, I have also broken down the 3rd patch in two parts for relation
> > > extension and for the page lock.
> > >
> >
> > Thanks, I have made a number of cosmetic changes and written
> > appropriate commit messages for all patches.  See the attached patch
> > series and let me know your opinion? BTW, did you get a chance to test
> > page locks by using the extension which I have posted above or by some
> > other way?  I think it is important to test page-lock related patches
> > now.
>
> I have reviewed the updated patches and looks fine to me.  Apart from
> this I have done testing for the Page Lock using group locking
> extension.
>
> --Setup
> create table gin_test_tbl(i int4[]) with (autovacuum_enabled = off);
> create index gin_test_idx on gin_test_tbl using gin (i);
> create table gin_test_tbl1(i int4[]) with (autovacuum_enabled = off);
> create index gin_test_idx1 on gin_test_tbl1 using gin (i);
>
> --session1:
> select become_lock_group_leader();
> select gin_clean_pending_list('gin_test_idx');
>
> --session2:
> select become_lock_group_member(session1_pid);
> select gin_clean_pending_list('gin_test_idx1');
>
> --session3:
> select become_lock_group_leader();
> select gin_clean_pending_list('gin_test_idx1');
>
> --session4:
> select become_lock_group_member(session3_pid);
> select gin_clean_pending_list('gin_test_idx');
>
> ERROR:  deadlock detected
> DETAIL:  Process 61953 waits for ExclusiveLock on page 0 of relation
> 16399 of database 13577; blocked by process 62197.
> Process 62197 waits for ExclusiveLock on page 0 of relation 16400 of
> database 13577; blocked by process 61953.
> HINT:  See server log for query details.
>
>
> Session1 and Session3 acquire the PageLock on two different index's
> meta-pages and blocked in gdb,  meanwhile, their member tries to
> acquire the page lock as shown in the above example and it detects the
> deadlock which is solved after applying the patch.

I have modified 0001 and 0002 slightly.  Basically, instead of the two
functions CheckAndSetLockHeld and CheckAndReSetLockHeld, I have
created a single function.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Sun, Mar 15, 2020 at 1:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Mar 14, 2020 at 7:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Mar 13, 2020 at 7:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > Apart from that, I have also extended the solution for the page lock.
> > > And, I have also broken down the 3rd patch in two parts for relation
> > > extension and for the page lock.
> > >
> >
> > Thanks, I have made a number of cosmetic changes and written
> > appropriate commit messages for all patches.  See the attached patch
> > series and let me know your opinion? BTW, did you get a chance to test
> > page locks by using the extension which I have posted above or by some
> > other way?  I think it is important to test page-lock related patches
> > now.
>
> I have reviewed the updated patches and looks fine to me.  Apart from
> this I have done testing for the Page Lock using group locking
> extension.
>
> --Setup
> create table gin_test_tbl(i int4[]) with (autovacuum_enabled = off);
> create index gin_test_idx on gin_test_tbl using gin (i);
> create table gin_test_tbl1(i int4[]) with (autovacuum_enabled = off);
> create index gin_test_idx1 on gin_test_tbl1 using gin (i);
>
> --session1:
> select become_lock_group_leader();
> select gin_clean_pending_list('gin_test_idx');
>
> --session2:
> select become_lock_group_member(session1_pid);
> select gin_clean_pending_list('gin_test_idx1');
>
> --session3:
> select become_lock_group_leader();
> select gin_clean_pending_list('gin_test_idx1');
>
> --session4:
> select become_lock_group_member(session3_pid);
> select gin_clean_pending_list('gin_test_idx');
>
> ERROR:  deadlock detected
> DETAIL:  Process 61953 waits for ExclusiveLock on page 0 of relation
> 16399 of database 13577; blocked by process 62197.
> Process 62197 waits for ExclusiveLock on page 0 of relation 16400 of
> database 13577; blocked by process 61953.
> HINT:  See server log for query details.
>
>
> Session1 and Session3 acquire the PageLock on two different index's
> meta-pages and blocked in gdb,  meanwhile, their member tries to
> acquire the page lock as shown in the above example and it detects the
> deadlock which is solved after applying the patch.
>

So, in this test, you have first performed the actions from Session-1
and Session-3 (blocked them via GDB after acquiring page lock) and
then performed the actions from Session-2 and Session-4, right?
Though this is not a very realistic case, it proves the point that
page locks don't participate in the deadlock cycle after the patch.  I
think we can do a few more tests that test other aspects of the patch.

1. Group members wait for page locks.  If you test that the leader
acquires the page lock and then member also tries to acquire the same
lock on the same index, it wouldn't block before the patch, but after
the patch, the member should wait for the leader to release the lock.
2. Try to hit the Assert in LockAcquireExtended (a) by trying to
re-acquire the page lock via the debugger, and (b) by trying to
acquire the relation extension lock after a page lock, which should be
allowed (after acquiring a page lock, we take the relation extension
lock in the following code path:
ginInsertCleanup->ginEntryInsert->ginFindLeafPage->ginPlaceToPage->GinNewBuffer).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Sun, Mar 15, 2020 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have modified 0001 and 0002 slightly,  Basically, instead of two
> function CheckAndSetLockHeld and CheckAndReSetLockHeld, I have created
> a one function.
>

+CheckAndSetLockHeld(LOCALLOCK *locallock, bool value)

Can we rename the parameter as lock_held, acquired or something like
that so that it indicates what it intends to do and probably add a
comment for that variable atop of function?

There is some work left related to testing some parts of the patch and
I can do some more review, but it started to look good to me, so I am
planning to push this in the coming week (say by Wednesday or so)
unless there are some major comments.  There are primarily two parts
of the patch-series (a) Assert that we don't acquire a heavyweight
lock on another object after relation extension lock. (b) Allow
relation extension lock to conflict among the parallel group members.
On similar lines there are two patches for page locks.
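
For illustration, part (a) boils down to an assertion of roughly this
shape.  This is a simplified standalone sketch, not the actual lock.c
code; the enum and flag names follow the patch discussion, but the
helper name assert_lock_ordering and all surrounding machinery are
invented here:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    LOCKTAG_RELATION,
    LOCKTAG_RELATION_EXTEND,    /* the relation extension lock */
    LOCKTAG_PAGE
} LockTagType;

/* Per-backend flag; the patch sets it when an extension lock is granted. */
static bool IsRelationExtensionLockHeld = false;

/*
 * While holding a relation extension lock, the only heavyweight lock we
 * may acquire is another relation extension lock.
 */
static void
assert_lock_ordering(LockTagType locktag)
{
    assert(!IsRelationExtensionLockHeld ||
           locktag == LOCKTAG_RELATION_EXTEND);
}
```

In an assert-enabled build, acquiring, say, a LOCKTAG_RELATION lock
while the flag is set would trip the assertion.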

I think we have discussed the LWLock approach in detail, and it seems
it might be trickier than we initially thought, especially with some
of the latest findings where we noticed that there are multiple cases
in which we can try to re-acquire the relation extension lock, and
other things which we have discussed.  Also, not all of us agree with
that idea.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



On Sun, Mar 15, 2020 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Mar 15, 2020 at 1:15 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sat, Mar 14, 2020 at 7:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Mar 13, 2020 at 7:02 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > Apart from that, I have also extended the solution for the page lock.
> > > > And, I have also broken down the 3rd patch in two parts for relation
> > > > extension and for the page lock.
> > > >
> > >
> > > Thanks, I have made a number of cosmetic changes and written
> > > appropriate commit messages for all patches.  See the attached patch
> > > series and let me know your opinion? BTW, did you get a chance to test
> > > page locks by using the extension which I have posted above or by some
> > > other way?  I think it is important to test page-lock related patches
> > > now.
> >
> > I have reviewed the updated patches and looks fine to me.  Apart from
> > this I have done testing for the Page Lock using group locking
> > extension.
> >
> > --Setup
> > create table gin_test_tbl(i int4[]) with (autovacuum_enabled = off);
> > create index gin_test_idx on gin_test_tbl using gin (i);
> > create table gin_test_tbl1(i int4[]) with (autovacuum_enabled = off);
> > create index gin_test_idx1 on gin_test_tbl1 using gin (i);
> >
> > --session1:
> > select become_lock_group_leader();
> > select gin_clean_pending_list('gin_test_idx');
> >
> > --session2:
> > select become_lock_group_member(session1_pid);
> > select gin_clean_pending_list('gin_test_idx1');
> >
> > --session3:
> > select become_lock_group_leader();
> > select gin_clean_pending_list('gin_test_idx1');
> >
> > --session4:
> > select become_lock_group_member(session3_pid);
> > select gin_clean_pending_list('gin_test_idx');
> >
> > ERROR:  deadlock detected
> > DETAIL:  Process 61953 waits for ExclusiveLock on page 0 of relation
> > 16399 of database 13577; blocked by process 62197.
> > Process 62197 waits for ExclusiveLock on page 0 of relation 16400 of
> > database 13577; blocked by process 61953.
> > HINT:  See server log for query details.
> >
> >
> > Session1 and Session3 acquire the PageLock on two different index's
> > meta-pages and blocked in gdb,  meanwhile, their member tries to
> > acquire the page lock as shown in the above example and it detects the
> > deadlock which is solved after applying the patch.
> >
>
> So, in this test, you have first performed the actions from Session-1
> and Session-3 (blocked them via GDB after acquiring page lock) and
> then performed the actions from Session-2 and Session-4, right?

Yes

> Though this is not a very realistic case, it proves the point that
> page locks don't participate in the deadlock cycle after the patch.  I
> think we can do a few more tests that test other aspects of the patch.
>
> 1. Group members wait for page locks.  If you test that the leader
> acquires the page lock and then member also tries to acquire the same
> lock on the same index, it wouldn't block before the patch, but after
> the patch, the member should wait for the leader to release the lock.

Okay, I will test this part.

> 2. Try to hit Assert in LockAcquireExtended (a) by trying to
> re-acquire the page lock via the debugger,

I am not sure whether that will hit the Assert, because if we are
holding the page lock and we try to take the same page lock again, the
lock will be granted without reaching that code path.  However, I
agree that this is not intended; rather, it is a side effect of
allowing a relation extension lock while holding the same relation
extension lock.  So basically, the situation now is that if the lock
is directly granted because we already hold the same lock, it will
never reach the Assert code.  IMHO, we don't need to add extra code to
make it behave differently.  Please let me know your opinion on this.
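
To illustrate the fast path being described, here is a hypothetical,
much-simplified standalone model (the LOCALLOCK field and function
names are invented for this sketch; the real logic lives in
LockAcquireExtended): a re-acquisition is granted from the local lock
table before the ordering check is ever reached.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, much-simplified model of the per-backend local lock. */
typedef struct
{
    int  nLocks;    /* how many times this backend holds the lock */
} LOCALLOCK;

static int ordering_checks_reached = 0;

static void
lock_acquire(LOCALLOCK *locallock)
{
    if (locallock->nLocks > 0)
    {
        /*
         * Already held by this backend: granted immediately, so the
         * lock-ordering check below is never reached.
         */
        locallock->nLocks++;
        return;
    }

    /* Only a genuinely new lock request gets here. */
    ordering_checks_reached++;
    locallock->nLocks++;
}
```

Acquiring the same lock twice increments the local count twice but
passes through the ordering check only once, which is the behavior
described above.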

(b) try to acquire the
> relation extension lock after page lock and it should be allowed
> (after acquiring page lock, we take relation extension lock in
> following code path:
> ginInsertCleanup->ginEntryInsert->ginFindLeafPage->ginPlaceToPage->GinNewBuffer).

ok

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Sun, Mar 15, 2020 at 6:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Mar 15, 2020 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have modified 0001 and 0002 slightly,  Basically, instead of two
> > function CheckAndSetLockHeld and CheckAndReSetLockHeld, I have created
> > a one function.
> >
>
> +CheckAndSetLockHeld(LOCALLOCK *locallock, bool value)
>
> Can we rename the parameter as lock_held, acquired or something like
> that so that it indicates what it intends to do and probably add a
> comment for that variable atop of function?

Done

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Sun, Mar 15, 2020 at 9:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Mar 15, 2020 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > 1. Group members wait for page locks.  If you test that the leader
> > acquires the page lock and then member also tries to acquire the same
> > lock on the same index, it wouldn't block before the patch, but after
> > the patch, the member should wait for the leader to release the lock.
>
> Okay, I will test this part.
>
> > 2. Try to hit Assert in LockAcquireExtended (a) by trying to
> > re-acquire the page lock via the debugger,
>
> I am not sure whether it is true or not,  Because, if we are holding
> the page lock and we try the same page lock then the lock will be
> granted without reaching the code path.  However, I agree that this is
> not intended instead this is a side effect of allowing relation
> extension lock while holding the same relation extension lock.  So
> basically, now the situation is that if the lock is directly granted
> because we are holding the same lock then it will not go to the assert
> code.  IMHO, we don't need to add extra code to make it behave
> differently.  Please let me know what is your opinion on this.
>

I also don't think there is any reason to add code to prevent that.
Actually, what I wanted to test was to somehow hit the Assert for the
cases where it would actually fire if someone tomorrow tried to
acquire any other type of lock.  Can we mimic such a situation by
hacking the code (say, by trying to acquire some other type of
heavyweight lock), or in some other way, so as to hit the newly added
Assert?

> (b) try to acquire the
> > relation extension lock after page lock and it should be allowed
> > (after acquiring page lock, we take relation extension lock in
> > following code path:
> > ginInsertCleanup->ginEntryInsert->ginFindLeafPage->ginPlaceToPage->GinNewBuffer).
>
> ok
>

Thanks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

From
Masahiko Sawada
Date:
On Mon, 16 Mar 2020 at 00:54, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sun, Mar 15, 2020 at 6:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, Mar 15, 2020 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have modified 0001 and 0002 slightly,  Basically, instead of two
> > > function CheckAndSetLockHeld and CheckAndReSetLockHeld, I have created
> > > a one function.
> > >
> >
> > +CheckAndSetLockHeld(LOCALLOCK *locallock, bool value)
> >
> > Can we rename the parameter as lock_held, acquired or something like
> > that so that it indicates what it intends to do and probably add a
> > comment for that variable atop of function?
>
> Done
>

I've looked at the patches and ISTM they work as expected.
IsRelationExtensionLockHeld and IsPageLockHeld are used only when
assertions are enabled, so how about making CheckAndSetLockHeld do its
work only if USE_ASSERT_CHECKING is defined, to avoid the overhead?

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On Mon, Mar 16, 2020 at 8:57 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Mon, 16 Mar 2020 at 00:54, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sun, Mar 15, 2020 at 6:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Sun, Mar 15, 2020 at 4:34 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
> > > > I have modified 0001 and 0002 slightly,  Basically, instead of two
> > > > function CheckAndSetLockHeld and CheckAndReSetLockHeld, I have created
> > > > a one function.
> > > >
> > >
> > > +CheckAndSetLockHeld(LOCALLOCK *locallock, bool value)
> > >
> > > Can we rename the parameter as lock_held, acquired or something like
> > > that so that it indicates what it intends to do and probably add a
> > > comment for that variable atop of function?
> >
> > Done
> >
>
> I've looked at the patches and ISTM these work as expected.

Thanks for verifying.

> IsRelationExtensionLockHeld and IsPageLockHeld are used only when
> assertion is enabled. So how about making CheckAndSetLockHeld work
> only if USE_ASSERT_CHECKING to avoid overheads?

That makes sense to me, so I have updated the patch.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Mon, Mar 16, 2020 at 8:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Mar 15, 2020 at 9:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Sun, Mar 15, 2020 at 5:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > 1. Group members wait for page locks.  If you test that the leader
> > > acquires the page lock and then member also tries to acquire the same
> > > lock on the same index, it wouldn't block before the patch, but after
> > > the patch, the member should wait for the leader to release the lock.
> >
> > Okay, I will test this part.
> >
> > > 2. Try to hit Assert in LockAcquireExtended (a) by trying to
> > > re-acquire the page lock via the debugger,
> >
> > I am not sure whether it is true or not,  Because, if we are holding
> > the page lock and we try the same page lock then the lock will be
> > granted without reaching the code path.  However, I agree that this is
> > not intended instead this is a side effect of allowing relation
> > extension lock while holding the same relation extension lock.  So
> > basically, now the situation is that if the lock is directly granted
> > because we are holding the same lock then it will not go to the assert
> > code.  IMHO, we don't need to add extra code to make it behave
> > differently.  Please let me know what is your opinion on this.
> >
>
> I also don't think there is any reason to add code to prevent that.
> Actually, what I wanted to test was to somehow hit the Assert for the
> cases where it will actually hit if someone tomorrow tries to acquire
> any other type of lock.  Can we mimic such a situation by hacking code
> (say try to acquire some other type of heavyweight lock) or in some
> way to hit the newly added Assert?

I have hacked the code to acquire another heavyweight lock, and the
Assert is hit.

>
> > (b) try to acquire the
> > > relation extension lock after page lock and it should be allowed
> > > (after acquiring page lock, we take relation extension lock in
> > > following code path:
> > > ginInsertCleanup->ginEntryInsert->ginFindLeafPage->ginPlaceToPage->GinNewBuffer).

I have tested this part and it works as expected i.e. assert is not hit.

--test case
create table gin_test_tbl(i int4[]) with (autovacuum_enabled = off);
create index gin_test_idx on gin_test_tbl using gin (i);
insert into gin_test_tbl select array[1, 2, g] from generate_series(1, 20000) g;
select gin_clean_pending_list('gin_test_idx');

BTW, this test is already covered by the existing gin.sql file so we
don't need to add any new test.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Mon, Mar 16, 2020 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Mon, Mar 16, 2020 at 8:57 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > IsRelationExtensionLockHeld and IsPageLockHeld are used only when
> > assertion is enabled. So how about making CheckAndSetLockHeld work
> > only if USE_ASSERT_CHECKING to avoid overheads?
>
> That makes sense to me so updated the patch.
+1

In v10-0001-Assert-that-we-don-t-acquire-a-heavyweight-lock-.patch,

+ * Indicate that the lock is released for a particular type of locks.
s/lock is/locks are

+ /* Indicate that the lock is acquired for a certain type of locks. */
s/lock is/locks are

In v10-0002-*.patch,

+ * Flag to indicate if the page lock is held by this backend.  We don't
+ * acquire any other heavyweight lock while holding the page lock except for
+ * relation extension.  However, these locks are never taken in reverse order
+ * which implies that page locks will also never participate in the deadlock
+ * cycle.
s/while holding the page lock except for relation extension/while
holding the page lock except for relation extension and page lock

+ * We don't acquire any other heavyweight lock while holding the page lock
+ * except for relation extension lock.
Same as above

Other than that, the patches look good to me. I've also done some
testing after applying the Test-group-deadlock patch provided by Amit
earlier in the thread. It works as expected.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



On Mon, Mar 16, 2020 at 11:56 AM Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
>
> On Mon, Mar 16, 2020 at 9:43 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Mon, Mar 16, 2020 at 8:57 AM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> > > IsRelationExtensionLockHeld and IsPageLockHeld are used only when
> > > assertion is enabled. So how about making CheckAndSetLockHeld work
> > > only if USE_ASSERT_CHECKING to avoid overheads?
> >
> > That makes sense to me so updated the patch.
> +1
>
> In v10-0001-Assert-that-we-don-t-acquire-a-heavyweight-lock-.patch,
>
> + * Indicate that the lock is released for a particular type of locks.
> s/lock is/locks are

Done

> + /* Indicate that the lock is acquired for a certain type of locks. */
> s/lock is/locks are

Done

>
> In v10-0002-*.patch,
>
> + * Flag to indicate if the page lock is held by this backend.  We don't
> + * acquire any other heavyweight lock while holding the page lock except for
> + * relation extension.  However, these locks are never taken in reverse order
> + * which implies that page locks will also never participate in the deadlock
> + * cycle.
> s/while holding the page lock except for relation extension/while
> holding the page lock except for relation extension and page lock

Done

> + * We don't acquire any other heavyweight lock while holding the page lock
> + * except for relation extension lock.
> Same as above

Done

>
> Other than that, the patches look good to me. I've also done some
> testing after applying the Test-group-deadlock patch provided by Amit
> earlier in the thread. It works as expected.

Thanks for testing.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Mon, Mar 16, 2020 at 3:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>

+
+ /*
+ * Indicate that the lock is released for certain types of locks
+ */
+#ifdef USE_ASSERT_CHECKING
+ CheckAndSetLockHeld(locallock, false);
+#endif
 }

 /*
@@ -1618,6 +1666,11 @@ GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
  locallock->numLockOwners++;
  if (owner != NULL)
  ResourceOwnerRememberLock(owner, locallock);
+
+ /* Indicate that the lock is acquired for certain types of locks. */
+#ifdef USE_ASSERT_CHECKING
+ CheckAndSetLockHeld(locallock, true);
+#endif
 }

There is no need to sprinkle USE_ASSERT_CHECKING in so many places;
having it inside the new function is sufficient.  I have changed that,
added a few more comments, and made minor changes.  See what you think
about the attached.
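
With the #ifdef moved inside, the helper and its call sites might look
roughly like this.  This is a sketch assembled from the snippets
quoted in this thread, not the committed code; the LOCALLOCK layout is
reduced to the single field the sketch needs, and USE_ASSERT_CHECKING
is force-defined here to stand in for an assert-enabled build:

```c
#include <assert.h>
#include <stdbool.h>

#define USE_ASSERT_CHECKING 1   /* assumption: an assert-enabled build */

typedef enum
{
    LOCKTAG_RELATION,
    LOCKTAG_RELATION_EXTEND,
    LOCKTAG_PAGE
} LockTagType;

typedef struct
{
    LockTagType locktag_type;   /* reduced LOCALLOCK for this sketch */
} LOCALLOCK;

static bool IsRelationExtensionLockHeld = false;
static bool IsPageLockHeld = false;

/*
 * Indicate that the locks of the given type are acquired (lock_held =
 * true) or released (lock_held = false).  The #ifdef lives inside the
 * function, so call sites need no guard of their own; in a non-assert
 * build this compiles down to a no-op.
 */
static void
CheckAndSetLockHeld(LOCALLOCK *locallock, bool lock_held)
{
#ifdef USE_ASSERT_CHECKING
    if (locallock->locktag_type == LOCKTAG_RELATION_EXTEND)
        IsRelationExtensionLockHeld = lock_held;
    else if (locallock->locktag_type == LOCKTAG_PAGE)
        IsPageLockHeld = lock_held;
#endif
}
```

Callers such as GrantLockLocal and the release path then invoke
CheckAndSetLockHeld(locallock, true/false) unconditionally.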

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment
On Tue, Mar 17, 2020 at 5:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Mar 16, 2020 at 3:24 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
>
> +
> + /*
> + * Indicate that the lock is released for certain types of locks
> + */
> +#ifdef USE_ASSERT_CHECKING
> + CheckAndSetLockHeld(locallock, false);
> +#endif
>  }
>
>  /*
> @@ -1618,6 +1666,11 @@ GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner)
>   locallock->numLockOwners++;
>   if (owner != NULL)
>   ResourceOwnerRememberLock(owner, locallock);
> +
> + /* Indicate that the lock is acquired for certain types of locks. */
> +#ifdef USE_ASSERT_CHECKING
> + CheckAndSetLockHeld(locallock, true);
> +#endif
>  }
>
> There is no need to sprinkle USE_ASSERT_CHECKING at so many places,
> having inside the new function is sufficient.  I have changed that,
> added few more comments and
> made minor changes.  See, what you think about attached?

Your changes look fine to me.  I have also verified all the tests, and
everything works fine.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



On Tue, Mar 17, 2020 at 6:32 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Your changes look fine to me.  I have also verified all the test and
> everything works fine.
>

I have pushed the first patch.  I will push the others in the coming days.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com