Thread: Patch: fix lock contention for HASHHDR.mutex
Hello all,

Consider the following stack trace:

(gdb) bt
#0  0x00007f77c81fae87 in semop () syscall-template.S:81
#1  0x000000000063b721 in PGSemaphoreLock pg_sema.c:387
#2  0x0000000000692e1b in LWLockAcquire lwlock.c:1026
#3  0x000000000068d4c5 in LockAcquireExtended lock.c:881
#4  0x000000000068dcc1 in LockAcquire lock.c:672
#5  0x000000000068b7a8 in LockRelationOid lmgr.c:112
#6  0x0000000000501d18 in find_inheritance_children pg_inherits.c:120
#7  0x0000000000501d80 in find_all_inheritors pg_inherits.c:182
#8  0x000000000062db8d in expand_inherited_rtentry prepunion.c:1285
#9  expand_inherited_tables prepunion.c:1212
#10 0x0000000000622705 in subquery_planner planner.c:501
#11 0x0000000000622d31 in standard_planner planner.c:285
#12 0x000000000069ef0c in pg_plan_query postgres.c:809
#13 0x000000000069f004 in pg_plan_queries postgres.c:868
#14 0x00000000006a0fc2 in exec_simple_query postgres.c:1033
#15 PostgresMain postgres.c:4032
#16 0x0000000000467479 in BackendRun postmaster.c:4237
#17 BackendStartup postmaster.c:3913
#18 ServerLoop () postmaster.c:1684
#19 0x000000000064c828 in PostmasterMain postmaster.c:1292
#20 0x0000000000467f3e in main main.c:223

It turns out PostgreSQL can spend a lot of time waiting for a lock in this particular place, especially if you are running PostgreSQL on a 60-core server. That is obviously a pretty bad sign.

You can easily reproduce this issue on a regular Core i7 server. Use the attached schema.sql file to create a database and run:

pgbench -j 8 -c 8 -f pgbench.sql -T 300 my_database 2>/dev/null &

While this example is running, connect to some PostgreSQL process with gdb and run bt/c from time to time. You will see that PostgreSQL waits for this particular lock quite often.

The problem is that the code between LWLockAcquire (lock.c:881) and LWLockRelease (lock.c:1020) can _sometimes_ run up to 3-5 ms. Using the good old gettimeofday-and-logging method I managed to find the bottleneck:

-- proclock = SetupLockInTable [lock.c:892]
 `-- proclock = (PROCLOCK *) hash_search_with_hash_value [lock.c:1105]
  `-- currBucket = get_hash_entry(hashp) [dynahash.c:985]
   `-- SpinLockAcquire(&hctl->mutex) [dynahash.c:1187]

If my measurements are not wrong (I didn't place gettimeofday between SpinLockAcquire/SpinLockRelease, etc.) we sometimes spend about 3 ms here waiting for a spinlock, which doesn't seem right.

I managed to fix this behaviour by modifying the choose_nelem_alloc procedure in dynahash.c (see attached patch). The idea is to double the number of items we allocate when there are no more free items in the hash table, so we need half as many allocations, which reduces contention. This patch doesn't cause any performance degradation according to pgbench, `make check` passes, etc.

Best regards,
Aleksander
Attachment
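For readers who want to repeat the measurement, here is a minimal sketch of the gettimeofday-and-logging approach described above. The helper names and the 1 ms threshold are illustrative only (they are not part of the attached patch), and the code assumes it is compiled inside the backend so that elog() is available.

```
/* Crude timing probe of the kind described above -- illustrative only. */
#include <sys/time.h>

static struct timeval probe_start;

static void
probe_begin(void)
{
	gettimeofday(&probe_start, NULL);
}

static void
probe_end(const char *where)
{
	struct timeval now;
	long		usec;

	gettimeofday(&now, NULL);
	usec = (now.tv_sec - probe_start.tv_sec) * 1000000L +
		   (now.tv_usec - probe_start.tv_usec);

	/* Log only the slow cases, e.g. anything above 1 ms. */
	if (usec > 1000)
		elog(LOG, "%s took %ld us", where, usec);
}
```

Calling probe_begin() before the LWLockAcquire at lock.c:881 and probe_end() after the matching LWLockRelease, then moving the probes inward, is essentially the bisection that leads to the SpinLockAcquire call shown above.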
Aleksander Alekseev <a.alekseev@postgrespro.ru> writes: > Turns out PostgreSQL can spend a lot of time waiting for a lock in this > particular place, especially if you are running PostgreSQL on 60-core > server. Which obviously is a pretty bad sign. > ... > I managed to fix this behaviour by modifying choose_nelem_alloc > procedure in dynahash.c (see attached patch). TBH, this is just voodoo. I don't know why this change would have made any impact on lock acquisition performance, and neither do you, and the odds are good that it's pure chance that it changed anything. One likely theory is that you managed to shift around memory allocations so that something aligned on a cacheline boundary when it hadn't before. But, of course, the next patch that changes allocations anywhere in shared memory could change that back. There are lots of effects like this that appear or disappear based on seemingly unrelated code changes when you're measuring edge-case performance. The patch is not necessarily bad in itself. As time goes by and machines get bigger, it can make sense to allocate more memory at a time to reduce memory management overhead. But arguing for it on the basis that it fixes lock allocation behavior with 60 cores is just horsepucky. What you were measuring there was steady-state hash table behavior, not the speed of the allocate-some-more-memory code path. regards, tom lane
Hello, Tom I see your point, but I would like to clarify a few things. 1. Do we consider described measurement method good enough to conclude that sometimes PostgreSQL really spends 3 ms in a spinlock (like a RTT between two Internet hosts in the same city)? If not, what method should be used to approve or disapprove this? 2. If we agree that PostgreSQL does sometimes spend 3 ms in a spinlock do we consider this a problem? 3. If we consider this a problem, what method is considered appropriate to find a real reason of such behaviour so we could fix it? Best regards, Aleksander
Oops. s/approve or disapprove/confirm or deny/ On Fri, 11 Dec 2015 19:14:41 +0300 Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote: > Hello, Tom > > I see your point, but I would like to clarify a few things. > > 1. Do we consider described measurement method good enough to conclude > that sometimes PostgreSQL really spends 3 ms in a spinlock (like a RTT > between two Internet hosts in the same city)? If not, what method > should be used to approve or disapprove this? > > 2. If we agree that PostgreSQL does sometimes spend 3 ms in a spinlock > do we consider this a problem? > > 3. If we consider this a problem, what method is considered > appropriate to find a real reason of such behaviour so we could fix > it? > > Best regards, > Aleksander > >
On 11 December 2015 at 16:14, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
--
I see your point, but I would like to clarify a few things.
1. Do we consider described measurement method good enough to conclude
that sometimes PostgreSQL really spends 3 ms in a spinlock (like a RTT
between two Internet hosts in the same city)? If not, what method
should be used to approve or disapprove this?
Timing things seems fine. You may have located an important issue.
2. If we agree that PostgreSQL does sometimes spend 3 ms in a spinlock
do we consider this a problem?
Probably, but then Tom didn't question 1 or 2. What he questioned was your fix.
3. If we consider this a problem, what method is considered appropriate
to find a real reason of such behaviour so we could fix it?
The problem you identify is in only one place, yet your fix changes many parts of the code. Why is that the best fix out of the many possible ones? Why would such a change be acceptable?
I personally don't know the answers to those questions. It would be wonderful if somebody else would structure our lives such that all we had to do was find simple answers, but that isn't the way life is. You ask for a chain of logical thought, but it is for you to create one, somehow: patches are default-reject, not committer-explain-why-reject.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello, Tom. I was exploring this issue further and discovered something strange. "PROCLOCK hash" and "LOCK hash" are hash tables in shared memory. All memory for these tables is in fact pre-allocated. But for some reason these two tables are created (lock.c:394) with init_size =/= max_size. It causes small overhead on calling memory allocator after hash table creation and additional locking/unlocking. I checked all other hash tables created via ShmemInitHash. All of these tables have init_size == max_size. I suggest to create "PROCLOCK hash" and "LOCK hash" with init_size == max_size too (see attached patch). Granted this change doesn't cause any noticeable performance improvements according to pgbench. Nevertheless it looks like a very right thing to do which at least doesn't make anything worse. If this patch seems to be OK next logical step I believe would be to remove init_size parameter in ShmemInitHash procedure since in practice it always equals max_size. On Fri, 11 Dec 2015 10:30:30 -0500 Tom Lane <tgl@sss.pgh.pa.us> wrote: > Aleksander Alekseev <a.alekseev@postgrespro.ru> writes: > > Turns out PostgreSQL can spend a lot of time waiting for a lock in > > this particular place, especially if you are running PostgreSQL on > > 60-core server. Which obviously is a pretty bad sign. > > ... > > I managed to fix this behaviour by modifying choose_nelem_alloc > > procedure in dynahash.c (see attached patch). > > TBH, this is just voodoo. I don't know why this change would have > made any impact on lock acquisition performance, and neither do you, > and the odds are good that it's pure chance that it changed > anything. One likely theory is that you managed to shift around > memory allocations so that something aligned on a cacheline boundary > when it hadn't before. But, of course, the next patch that changes > allocations anywhere in shared memory could change that back. There > are lots of effects like this that appear or disappear based on > seemingly unrelated code changes when you're measuring edge-case > performance. > > The patch is not necessarily bad in itself. As time goes by and > machines get bigger, it can make sense to allocate more memory at a > time to reduce memory management overhead. But arguing for it on the > basis that it fixes lock allocation behavior with 60 cores is just > horsepucky. What you were measuring there was steady-state hash > table behavior, not the speed of the allocate-some-more-memory code > path. > > regards, tom lane > >
Attachment
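To make the suggestion concrete, here is roughly what the change amounts to in InitLocks() (lock.c). This is a simplified sketch rather than the attached patch, and the flags and sizing details are paraphrased from memory, so treat it as an illustration only:

```
	HASHCTL		info;
	long		max_table_size = NLOCKENTS();

	MemSet(&info, 0, sizeof(info));
	info.keysize = sizeof(LOCKTAG);
	info.entrysize = sizeof(LOCK);
	info.num_partitions = NUM_LOCK_PARTITIONS;

	/*
	 * Previously init_size was max_table_size / 2; passing max_table_size
	 * for both init_size and max_size means every element is allocated up
	 * front, so no further shared memory allocations (and no extra
	 * locking) happen for this table after startup.
	 */
	LockMethodLockHash = ShmemInitHash("LOCK hash",
									   max_table_size,
									   max_table_size,
									   &info,
									   HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
```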
On 2015-12-11 17:00:01 +0300, Aleksander Alekseev wrote:
> The problem is that code between LWLockAcquire (lock.c:881) and
> LWLockRelease (lock.c:1020) can _sometimes_ run up to 3-5 ms. Using
> old-good gettimeofday and logging method I managed to find a bottleneck:
>
> -- proclock = SetupLockInTable [lock.c:892]
> `-- proclock = (PROCLOCK *) hash_search_with_hash_value [lock.c:1105]
> `-- currBucket = get_hash_entry(hashp) [dynahash.c:985]
> `-- SpinLockAcquire(&hctl->mutex) [dynahash.c:1187]
>
> If my measurements are not wrong (I didn't place gettimeofday between
> SpinLockAquire/SpinLockRelease, etc) we sometimes spend about 3 ms here
> waiting for a spinlock, which doesn't seems right.

Well, it's quite possible that your process was scheduled out, while waiting for that spinlock. Which'd make 3ms pretty normal.

I'd consider using a LWLock instead of a spinlock here. I've seen this contended in a bunch of situations, and the queued behaviour, combined with directed wakeups on the OS level, ought to improve the worst case behaviour measurably.

Greetings,

Andres Freund
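As a rough illustration of this suggestion, the critical section in dynahash's get_hash_entry() could be guarded by an LWLock instead of the spinlock. Note that HASHHDR has no such freelist_lock field today; it is a hypothetical addition that would have to be initialized (e.g. with LWLockInitialize()) when the shared hash table is created:

```
	/* Hypothetical sketch only -- not code from this thread's patches. */
	LWLockAcquire(&hctl->freelist_lock, LW_EXCLUSIVE);

	newElement = hctl->freeList;
	if (newElement != NULL)
	{
		hctl->freeList = newElement->link;
		hctl->nentries++;
	}

	LWLockRelease(&hctl->freelist_lock);
```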
On Tue, Dec 15, 2015 at 7:25 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-11 17:00:01 +0300, Aleksander Alekseev wrote: >> The problem is that code between LWLockAcquire (lock.c:881) and >> LWLockRelease (lock.c:1020) can _sometimes_ run up to 3-5 ms. Using >> old-good gettimeofday and logging method I managed to find a bottleneck: >> >> -- proclock = SetupLockInTable [lock.c:892] >> `-- proclock = (PROCLOCK *) hash_search_with_hash_value [lock.c:1105] >> `-- currBucket = get_hash_entry(hashp) [dynahash.c:985] >> `-- SpinLockAcquire(&hctl->mutex) [dynahash.c:1187] >> >> If my measurements are not wrong (I didn't place gettimeofday between >> SpinLockAquire/SpinLockRelease, etc) we sometimes spend about 3 ms here >> waiting for a spinlock, which doesn't seems right. > > Well, it's quite possible that your process was scheduled out, while > waiting for that spinlock. Which'd make 3ms pretty normal. > > I'd consider using a LWLock instead of a spinlock here. I've seen this > contended in a bunch of situations, and the queued behaviour, combined > with directed wakeups on the OS level, ought to improve the worst case > behaviour measurably. Amit had the idea a while back of trying to replace the HASHHDR mutex with something based on atomic ops. It seems hard to avoid the attendant A-B-A problems but maybe there's a way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-12-17 09:47:57 -0500, Robert Haas wrote:
> On Tue, Dec 15, 2015 at 7:25 AM, Andres Freund <andres@anarazel.de> wrote:
> > I'd consider using a LWLock instead of a spinlock here. I've seen this
> > contended in a bunch of situations, and the queued behaviour, combined
> > with directed wakeups on the OS level, ought to improve the worst case
> > behaviour measurably.
>
> Amit had the idea a while back of trying to replace the HASHHDR mutex
> with something based on atomic ops. It seems hard to avoid the
> attendant A-B-A problems but maybe there's a way.

I'd really like to see it being replaced by a queuing lock (i.e. lwlock) before we go there. And then maybe partition the freelist, and make nentries an atomic. Just doing those might already be good enough and should be a lot easier.

Andres
> It'd really like to see it being replaced by a queuing lock
> (i.e. lwlock) before we go there. And then maybe partition the
> freelist, and make nentries an atomic.

I believe I just implemented something like this (see attachment). The idea is to partition the PROCLOCK hash table manually into NUM_LOCK_PARTITIONS smaller, non-partitioned hash tables. Since these tables are non-partitioned, the spinlock is not used and there is no lock contention.

On a 60-core server we gain 3.5-4 more TPS according to the benchmark described above. As I understand it, there is no performance degradation in other cases (different CPU, traditional pgbench, etc).

If this patch seems OK, I believe we could consider applying the same change not only to the PROCLOCK hash table.
Attachment
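For illustration, the manual partitioning described above boils down to keeping an array of ordinary, non-partitioned dynahash tables and routing each lookup to one of them by hash code. The names below are made up for this sketch and do not come from the attached patch:

```
/* Illustrative sketch only. */
static HTAB *ProcLockHashParts[NUM_LOCK_PARTITIONS];

/*
 * Pick the sub-table that owns this hash code.  Each sub-table is an
 * ordinary non-partitioned hash table, so get_hash_entry() never touches
 * a shared HASHHDR->mutex; the caller already holds the matching
 * lock-manager partition LWLock.
 */
static inline HTAB *
ProcLockPartitionTable(uint32 hashcode)
{
	return ProcLockHashParts[hashcode % NUM_LOCK_PARTITIONS];
}
```

Lookups then go through hash_search_with_hash_value() on ProcLockPartitionTable(proclock_hashcode) instead of on a single partitioned PROCLOCK hash.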
On Thu, Dec 17, 2015 at 11:03 AM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote: >> It'd really like to see it being replaced by a queuing lock >> (i.e. lwlock) before we go there. And then maybe partition the >> freelist, and make nentries an atomic. > > I believe I just implemented something like this (see attachment). The > idea is to partition PROCLOCK hash table manually into NUM_LOCK_ > PARTITIONS smaller and non-partitioned hash tables. Since these tables > are non-partitioned spinlock is not used and there is no lock > contention. Oh, that's an interesting idea. I guess the problem is that if the freelist is unshared, then users might get an error that the lock table is full when some other partition still has elements remaining. > On 60-core server we gain 3.5-4 more TPS according to benchmark > described above. As I understand there is no performance degradation in > other cases (different CPU, traditional pgbench, etc). 3.5-4 more TPS, or 3.5 times more TPS? Can you share the actual numbers? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> Oh, that's an interesting idea. I guess the problem is that if the
> freelist is unshared, then users might get an error that the lock
> table is full when some other partition still has elements remaining.

True, but I don't believe it should be a real problem assuming we have a strong hash function. If a user gets such an error it means that the database is under heavy load and there are not many free elements left in the other tables either. This particular user didn't get lucky; some other will. In any case the administrator should do something about it - fight a DDoS attack, tune database parameters, etc.

> 3.5-4 more TPS, or 3.5 times more TPS? Can you share the actual
> numbers?

Oops, 3.5-4 _times_ more TPS, i.e. 2206 TPS vs 546 TPS.
On Thu, Dec 17, 2015 at 8:44 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2015-12-17 09:47:57 -0500, Robert Haas wrote:
> > On Tue, Dec 15, 2015 at 7:25 AM, Andres Freund <andres@anarazel.de> wrote:
> > > I'd consider using a LWLock instead of a spinlock here. I've seen this
> > > contended in a bunch of situations, and the queued behaviour, combined
> > > with directed wakeups on the OS level, ought to improve the worst case
> > > behaviour measurably.
> >
> > Amit had the idea a while back of trying to replace the HASHHDR mutex
> > with something based on atomic ops. It seems hard to avoid the
> > attendant A-B-A problems but maybe there's a way.
>
> It'd really like to see it being replaced by a queuing lock
> (i.e. lwlock) before we go there. And then maybe partition the freelist,
> and make nentries an atomic. Just doing those might already be good
> enough and should be a lot easier.
>
Makes sense to me, but I think we should also try the Group leader idea used for the ProcArrayLock optimisation, as during those tests I found that it gives better results compared to partitioning.
> Oops, 3.5-4 _times_ more TPS, i.e. 2206 TPS vs 546 TPS.

In fact those numbers are for a similar but slightly different benchmark (same schema, without checks on child tables). Here are the exact numbers for the benchmark described above.

Before:

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 20433
latency average: 93.966 ms
tps = 679.698439 (including connections establishing)
tps = 680.353897 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 19111
latency average: 100.466 ms
tps = 635.763523 (including connections establishing)
tps = 636.112682 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 19218
latency average: 99.906 ms
tps = 639.506848 (including connections establishing)
tps = 639.838757 (excluding connections establishing)

After:

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 95900
latency average: 20.021 ms
tps = 3194.142762 (including connections establishing)
tps = 3196.091843 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 96837
latency average: 19.827 ms
tps = 3225.822355 (including connections establishing)
tps = 3227.762847 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 96143
latency average: 19.970 ms
tps = 3202.637126 (including connections establishing)
tps = 3204.070466 (excluding connections establishing)

Ratio:

$ python
min(3194.0, 3225.0, 3202.0) / max(679.0, 635.0, 639.0)
4.703976435935199
On Thu, Dec 17, 2015 at 9:33 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
>
> > It'd really like to see it being replaced by a queuing lock
> > (i.e. lwlock) before we go there. And then maybe partition the
> > freelist, and make nentries an atomic.
>
> I believe I just implemented something like this (see attachment). The
> idea is to partition PROCLOCK hash table manually into NUM_LOCK_
> PARTITIONS smaller and non-partitioned hash tables. Since these tables
> are non-partitioned spinlock is not used and there is no lock
> contention.
>
This idea can improve the situation with ProcLock hash table, but I
think IIUC what Andres is suggesting would reduce the contention
around dynahash freelist and can be helpful in many more situations
including BufMapping locks.
> This idea can improve the situation with ProcLock hash table, but I
> think IIUC what Andres is suggesting would reduce the contention
> around dynahash freelist and can be helpful in many more situations
> including BufMapping locks.

I agree. But as I understand PostgreSQL community doesn't generally approve big changes that affects whole system. Especially if original problem was only in one particular place. Therefore for now I suggest only a small change. Naturally if it will be accepted there is no reason not to apply same changes for BufMapping or even dynahash itself with corresponding PROCLOCK hash refactoring.

BTW could you (or anyone?) please help me find this thread regarding BufMapping or perhaps provide a benchmark? I would like to reproduce this issue but I can't find anything relevant in a mailing list. Also it seems to be a good idea to compare alternative approaches that were mentioned (atomics ops, group leader). Are there any discussions, benchmarks or patches regarding this topic?

Frankly I have serious doubts regarding atomics ops since they will more likely create the same contention that a spinlock does. But perhaps there is a patch that works not the way I think it could work.
On 2015-12-18 11:40:58 +0300, Aleksander Alekseev wrote: > $ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database What's in pgbench.sql? Greetings, Andres Freund
> What's in pgbench.sql? It's from first message of this thread: http://www.postgresql.org/message-id/20151211170001.78ded9d7@fujitsu
On Fri, Dec 18, 2015 at 2:50 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
>
> > This idea can improve the situation with ProcLock hash table, but I
> > think IIUC what Andres is suggesting would reduce the contention
> > around dynahash freelist and can be helpful in many more situations
> > including BufMapping locks.
>
> I agree. But as I understand PostgreSQL community doesn't generally
> approve big changes that affects whole system. Especially if original
> problem was only in one particular place. Therefore for now I suggest
> only a small change. Naturally if it will be accepted there is no
> reason not to apply same changes for BufMapping or even dynahash itself
> with corresponding PROCLOCK hash refactoring.
>
> BTW could you (or anyone?) please help me find this thread regarding
> BufMapping or perhaps provide a benchmark?
>
You can find that in the thread below:
This even contains the original idea and an initial patch for replacing
spinlocks with atomic ops. I mentioned the A-B-A problem a few mails
down in that thread and gave a link to a paper suggesting how it can be
solved. It needs more work, but it is doable.
> I would like to reproduce
> this issue but I can't find anything relevant in a mailing list. Also
> it seems to be a good idea to compare alternative approaches that were
> mentioned (atomics ops, group leader). Are there any discussions,
> benchmarks or patches regarding this topic?
>
You can find the discussion and patch related to Group leader approach
in the thread:
This patch is already committed.
> Frankly I have serious doubts regarding atomics ops since they will more
> likely create the same contention that a spinlock does. But perhaps
> there is a patch that works not the way I think it could work.
>
I think it is difficult to say without implementing it. If we want to evaluate
multiple approaches, then what we can do here is this: I can help with writing
a patch using LWLocks, you can evaluate the atomic ops approach, and we
already have your current patch; then we can see what works out best.
> Oh, that's an interesting idea. I guess the problem is that if the > freelist is unshared, then users might get an error that the lock > table is full when some other partition still has elements remaining. Could we split one freelist in hash to NUM_LOCK_PARTITIONS freelists? Each partition will have its own freelist and if freelist is empty then partition should search an entry in freelists of other partitions. To prevent concurrent access it's needed to add one LWLock to hash, each partition should lock LWlock in share mode to work with its own freelist and exclusive to work with other freelists. Actually, I'd like to improve all partitioned hashes instead of improve only one case. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> Could we split one freelist in hash to NUM_LOCK_PARTITIONS freelists? > Each partition will have its own freelist and if freelist is empty > then partition should search an entry in freelists of other > partitions. To prevent concurrent access it's needed to add one > LWLock to hash, each partition should lock LWlock in share mode to > work with its own freelist and exclusive to work with other freelists. > > Actually, I'd like to improve all partitioned hashes instead of > improve only one case. It seems to be a most promising direction of research for now. I will send a patch and benchmark results soon.
On Fri, Dec 18, 2015 at 8:46 AM, Teodor Sigaev <teodor@sigaev.ru> wrote: >> Oh, that's an interesting idea. I guess the problem is that if the >> freelist is unshared, then users might get an error that the lock >> table is full when some other partition still has elements remaining. > > Could we split one freelist in hash to NUM_LOCK_PARTITIONS freelists? > Each partition will have its own freelist and if freelist is empty then > partition should search an entry in freelists of other partitions. To > prevent concurrent access it's needed to add one LWLock to hash, each > partition should lock LWlock in share mode to work with its own freelist and > exclusive to work with other freelists. > > Actually, I'd like to improve all partitioned hashes instead of improve only > one case. Yeah. I'm not sure that should be an LWLock rather than a spinlock, but we can benchmark it both ways. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
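For readers following along, here is a hedged sketch of what the "array of spinlocks" variant of this idea might look like in dynahash's shared header: one freelist, protected by its own spinlock and with its own entry counter, per partition. The struct and field names are illustrative and are not taken from any of the patches attached to this thread:

```
/* Illustrative sketch only. */
typedef struct FreeListData
{
	slock_t		mutex;			/* protects this freelist */
	long		nentries;		/* number of entries handed out from it */
	HASHELEMENT *freeList;		/* chain of free elements */
} FreeListData;

typedef struct HASHHDR_SKETCH
{
	FreeListData freeLists[NUM_LOCK_PARTITIONS];
	/* ... the rest of the header stays as it is today ... */
} HASHHDR_SKETCH;
```

The LWLock variant would simply replace the slock_t with an LWLock; as noted above, both are worth benchmarking.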
> > Actually, I'd like to improve all partitioned hashes instead of
> > improve only one case.
>
> Yeah. I'm not sure that should be an LWLock rather than a spinlock,
> but we can benchmark it both ways.

I would like to share some preliminary results. I tested four implementations:

- no locks and no element stealing from other partitions;
- single LWLock per partitioned table;
- single spinlock per partitioned table;
- NUM_LOCK_PARTITIONS spinlocks per partitioned table.

Interestingly, "Shared Buffer Lookup Table" (see buf_table.c) has 128 partitions. The constant NUM_BUFFER_PARTITIONS was increased from 16 to 128 in commit 3acc10c9:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3acc10c997f916f6a741d0b4876126b7b08e3892;hp=952872698d9443fdf9b808a1376017f00c91065a

Obviously, after splitting a freelist into NUM_LOCK_PARTITIONS partitions (and assuming that all necessary locking/unlocking is done on the calling side) tables can't have more than NUM_LOCK_PARTITIONS partitions, because that would cause race conditions. For this reason I had to define NUM_BUFFER_PARTITIONS as NUM_LOCK_PARTITIONS and compare the behaviour of PostgreSQL depending on different values of NUM_LOCK_PARTITIONS.

So here are the results.

Core i7, pgbench -j 8 -c 8 -T 30 pgbench
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  | master   | no locks | lwlock   | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          | array
-----------|----------|----------|----------|----------|----------
           |   295.4  |   297.4  |   299.4  |   285.6  |   302.7
 (1 << 4)  |   286.1  |   300.5  |   283.4  |   300.9  |   300.4
           |   300.0  |   300.0  |   302.1  |   300.7  |   300.3
-----------|----------|----------|----------|----------|----------
           |          |   296.7  |   299.9  |   298.8  |   298.3
 (1 << 5)  |   ----   |   301.9  |   302.2  |   305.7  |   306.3
           |          |   287.7  |   301.0  |   303.0  |   304.5
-----------|----------|----------|----------|----------|----------
           |          |   296.4  |   300.5  |   302.9  |   304.6
 (1 << 6)  |   ----   |   301.7  |   305.6  |   306.4  |   302.3
           |          |   299.6  |   304.5  |   306.6  |   300.4
-----------|----------|----------|----------|----------|----------
           |          |   295.9  |   298.7  |   295.3  |   305.0
 (1 << 7)  |   ----   |   299.5  |   300.5  |   299.0  |   310.2
           |          |   287.8  |   285.9  |   300.2  |   302.2

Core i7, pgbench -j 8 -c 8 -f big_table.sql -T 30 my_database
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  | master   | no locks | lwlock   | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          | array
-----------|----------|----------|----------|----------|----------
           |   505.1  |   521.3  |   511.1  |   524.4  |   501.6
 (1 << 4)  |   452.4  |   467.4  |   509.2  |   472.3  |   453.7
           |   435.2  |   462.4  |   445.8  |   467.9  |   467.0
-----------|----------|----------|----------|----------|----------
           |          |   514.8  |   476.3  |   507.9  |   510.6
 (1 << 5)  |   ----   |   457.5  |   491.2  |   464.6  |   431.7
           |          |   442.2  |   457.0  |   495.5  |   448.2
-----------|----------|----------|----------|----------|----------
           |          |   516.4  |   502.5  |   468.0  |   521.3
 (1 << 6)  |   ----   |   463.6  |   438.7  |   488.8  |   455.4
           |          |   434.2  |   468.1  |   484.7  |   433.5
-----------|----------|----------|----------|----------|----------
           |          |   513.6  |   459.4  |   519.6  |   510.3
 (1 << 7)  |   ----   |   470.1  |   454.6  |   445.5  |   415.9
           |          |   459.4  |   489.7  |   457.1  |   452.8

60-core server, pgbench -j 64 -c 64 -T 30 pgbench
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  | master   | no locks | lwlock   | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          | array
-----------|----------|----------|----------|----------|----------
           |  3156.2  |  3157.9  |  3542.0  |  3444.3  |  3472.4
 (1 << 4)  |  3268.5  |  3444.7  |  3485.7  |  3486.0  |  3500.5
           |  3251.2  |  3482.3  |  3398.7  |  3587.1  |  3557.7
-----------|----------|----------|----------|----------|----------
           |          |  3352.7  |  3556.0  |  3543.3  |  3526.8
 (1 << 5)  |   ----   |  3465.0  |  3475.2  |  3486.9  |  3528.4
           |          |  3410.0  |  3482.0  |  3493.7  |  3444.9
-----------|----------|----------|----------|----------|----------
           |          |  3437.8  |  3413.1  |  3445.8  |  3481.6
 (1 << 6)  |   ----   |  3470.1  |  3478.4  |  3538.5  |  3579.9
           |          |  3450.8  |  3431.1  |  3509.0  |  3512.5
-----------|----------|----------|----------|----------|----------
           |          |  3425.4  |  3534.6  |  3414.7  |  3517.1
 (1 << 7)  |   ----   |  3436.5  |  3430.0  |  3428.0  |  3536.4
           |          |  3455.6  |  3479.7  |  3573.4  |  3543.0

60-core server, pgbench -j 64 -c 64 -f big_table.sql -T 30 my_database
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  | master   | no locks | lwlock   | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          | array
-----------|----------|----------|----------|----------|----------
           |   661.1  |  4639.6  |  1435.2  |   445.9  |  1589.6
 (1 << 4)  |   642.9  |  4566.7  |  1410.3  |   457.1  |  1601.7
           |   643.9  |  4621.8  |  1404.8  |   489.0  |  1592.6
-----------|----------|----------|----------|----------|----------
           |          |  4721.9  |  1543.1  |   499.1  |  1596.9
 (1 << 5)  |   ----   |  4506.8  |  1513.0  |   528.3  |  1594.7
           |          |  4744.7  |  1540.3  |   524.0  |  1593.0
-----------|----------|----------|----------|----------|----------
           |          |  4649.1  |  1564.5  |   475.9  |  1580.1
 (1 << 6)  |   ----   |  4671.0  |  1560.5  |   485.6  |  1589.1
           |          |  4751.0  |  1557.4  |   505.1  |  1580.3
-----------|----------|----------|----------|----------|----------
           |          |  4657.7  |  1551.8  |   534.7  |  1585.1
 (1 << 7)  |   ----   |  4616.8  |  1546.8  |   495.8  |  1623.4
           |          |  4779.2  |  1538.5  |   537.4  |  1588.5

All four implementations (W.I.P. quality --- dirty code, no comments, etc.) are attached to this message. The schema of my_database and the big_table.sql file are attached to the first message of this thread.

The large spread of TPS on the Core i7 is due to the fact that it's actually my laptop with other applications running beside PostgreSQL. Still we see that all solutions are equally good on this CPU and there is no performance degradation.

Now regarding the 60-core server:

- one spinlock per hash table doesn't scale. I personally was expecting this;
- LWLocks and an array of spinlocks do scale on NUMA up to a certain point;
- the best results are shown by "no locks".

I believe that the "no locks" implementation is the best one, since it is at least 3 times faster on NUMA than any other implementation. Also it is simpler and doesn't have stealing-from-other-freelists logic that executes rarely and therefore is a likely source of bugs. Regarding the ~16 elements per freelist which in some corner cases could but wouldn't be used --- as I mentioned before, I believe it's not such a big problem. Also it's a small price to pay for 3 times more TPS.

Regarding NUM_LOCK_PARTITIONS (and NUM_BUFFER_PARTITIONS) I have some doubts. For sure Robert had a good reason for committing 3acc10c9. Unfortunately I'm not familiar with the story behind this commit. What do you think?
Attachment
On Tue, Dec 22, 2015 at 10:39 AM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote: > Obviously after splitting a freelist into NUM_LOCK_PARTITIONS > partitions (and assuming that all necessary locking/unlocking is done > on calling side) tables can't have more than NUM_LOCK_PARTITIONS > partitions because it would cause race conditions. For this reason I > had to define NUM_BUFFER_PARTITIONS as NUM_LOCK_PARTITIONS and compare > behaviour of PostgreSQL depending on different values of > NUM_LOCK_PARTITIONS. This is not at all obvious. There is no obvious reason why the number of ways that the freelist is partitioned needs to have any relation at all to the number of ways that the hash table itself is partitioned. Probably, each process should pick a freelist affinity based on MyBackendId % number_of_partitions or something like that. If it doesn't find one there, it can search the others, but always starting with the one for which it has an affinity. That way, the number of processes contending on a single spinlock will be limited to max_connections/number_of_partitions except when the table is very nearly full. That may happen a lot, though, for the buffer mapping table, so we might need to over-allocate to compensate. We want it to be the case that a backend will almost always find a free entry on its own freelist when it needs one, and only occasionally need to visit any other freelist. > Now regarding 60-core server: > > - One spinlock per hash table doesn't scale. I personally was expecting > this; > - LWLock's and array of spinlocks do scale on NUMA up to a certain > point; > - Best results are shown by "no locks"; > > I believe that "no locks" implementation is the best one since it is at > least 3 times faster on NUMA then any other implementation. Also it is > simpler and doesn't have stealing-from-other-freelists logic that > executes rarely and therefore is a likely source of bugs. Regarding ~16 > elements of freelists which in some corner cases could but wouldn't be > used --- as I mentioned before I believe its not such a big problem. > Also its a small price to pay for 3 times more TPS. I doubt it. Having the system become fundamentally less reliable is a big cost. If the system tries to find a lock table entry and fails, the user's query is going to error out. They are not at that point going to say, well, it's good that my system runs fast even though I can't run pg_dump. They are going to say, running pg_dump used to work and now it fails. The consequences of failing to make an entry in the buffer mapping table are at least as bad. And there's a second point, too, which is that you haven't proven that accepting the costs of your proposed model is even necessary. You've tried a few experiments with other approaches, not even all of the ones that were proposed in the earlier thread, and you don't seem to have done any investigation of why they didn't perform as well, or if you did, it's not discussed in this email. So maybe those problems can be fixed, after all. > Regarding NUM_LOCK_PARTITIONS (and NUM_BUFFER_PARTITIONS) I have some > doubts. For sure Robert had a good reason for committing 3acc10c9. > Unfortunately I'm not familiar with a story behind this commit. What do > you think? See the thread "Scaling shared buffer eviction". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
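A hedged sketch of the freelist-affinity idea described above, reusing the FreeListData shape from the earlier sketch; none of this code appears in the patches attached to this thread:

```
/* Illustrative sketch only: prefer "our" freelist, fall back to others. */
static HASHELEMENT *
get_free_entry_with_affinity(FreeListData *freeLists, int nFreeLists)
{
	int			home = MyBackendId % nFreeLists;
	int			i;

	for (i = 0; i < nFreeLists; i++)
	{
		int			idx = (home + i) % nFreeLists;
		HASHELEMENT *elem;

		SpinLockAcquire(&freeLists[idx].mutex);
		elem = freeLists[idx].freeList;
		if (elem != NULL)
		{
			freeLists[idx].freeList = elem->link;
			freeLists[idx].nentries++;
			SpinLockRelease(&freeLists[idx].mutex);
			return elem;
		}
		SpinLockRelease(&freeLists[idx].mutex);
	}

	return NULL;				/* caller reports the table as full */
}
```

With enough over-allocation the loop almost always succeeds on its first iteration, which is exactly the property the mail above is after.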
Here is another preliminary result I would like to share. As always, you will find the corresponding patch in the attachment. It is of work-in-progress quality.

The idea was suggested by a colleague of mine, Aleksander Lebedev.

The freeList is partitioned like in the "no lock" patch. When there are not enough free items in a freeList, we borrow AN items from a global list. When a freeList becomes too large, we return AN items back to the global list. This global list is also partitioned into PN partitions, each protected by a spinlock. This way we have less lock contention than in the "lwlock" or "spinlock array" versions, since we borrow multiple free elements, not one at a time. Also, in the worst case only AN*(NUM_LOCK_PARTITIONS-1) free items are left unused instead of (Total/NUM_LOCK_PARTITIONS)*(NUM_LOCK_PARTITIONS-1).

On the 60-core server TPS depends on AN and PN like this:

       | AN = 1 | AN = 2 | AN = 4 | AN = 8 | AN =16 | AN =32
-------|--------|--------|--------|--------|--------|--------
       |  733.0 | 1120.6 | 1605.5 | 1842.5 | 1545.5 | 1237.0
PN = 1 |  740.3 | 1127.0 | 1634.2 | 1800.8 | 1573.5 | 1245.1
       |  742.9 | 1102.1 | 1647.2 | 1853.6 | 1533.4 | 1251.9
-------|--------|--------|--------|--------|--------|--------
       | 1052.0 | 1438.1 | 1755.6 | 1981.0 | 2022.0 | 1816.8
PN = 2 | 1044.8 | 1453.1 | 1784.0 | 1958.3 | 2033.2 | 1819.2
       | 1028.7 | 1419.8 | 1809.2 | 1981.2 | 2028.2 | 1790.2
-------|--------|--------|--------|--------|--------|--------
       | 1182.0 | 1521.5 | 1813.2 | 1932.6 | 2035.2 | 1948.4
PN = 4 | 1212.4 | 1535.4 | 1816.8 | 1927.0 | 2018.7 | 2014.6
       | 1189.4 | 1528.9 | 1816.9 | 1942.6 | 2011.9 | 2018.3
-------|--------|--------|--------|--------|--------|--------
       | 1148.1 | 1522.2 | 1795.4 | 1926.6 | 2031.7 | 2015.6
PN = 8 | 1175.6 | 1529.4 | 1807.6 | 1913.5 | 2007.3 | 2062.0
       | 1169.9 | 1528.0 | 1796.3 | 1926.0 | 2011.1 | 2042.8
-------|--------|--------|--------|--------|--------|--------
       | 1117.7 | 1491.0 | 1803.9 | 1925.3 | 2029.4 | 2056.2
PN =16 | 1132.8 | 1481.0 | 1809.6 | 1968.1 | 2033.8 | 2068.5
       | 1131.4 | 1481.8 | 1819.4 | 1946.2 | 2071.1 | 2073.8

AN = GLOBAL_FREE_LIST_ALLOC_NUMBER
PN = GLOBAL_FREE_LIST_PARTITIONS_NUM

There is no performance degradation on the Core i7. By increasing PN or AN any further we don't gain any more TPS.

As you can see, this version is about 30% faster than "lwlock" or "spinlock array" and 3.1 times faster than master. Still, it's about 2.5 times slower than the "no locks" version, which I find frustrating.

Next I will try to speed up this version by modifying the global_free_list_* procedures; the current implementations are not the most efficient ones. Also I'm planning to explore approaches that involve lock-free algorithms.

I would like to know your opinion about such an approach. For instance, could we spare AN*(NUM_LOCK_PARTITIONS-1) items in the worst case, or can't we, for the same reason we can't do it for (Total / NUM_LOCK_PARTITIONS) * (NUM_LOCK_PARTITIONS-1)?
Attachment
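A hedged sketch of the borrow/return scheme described above, with hypothetical names (GLOBAL_FREE_LIST_ALLOC_NUMBER plays the role of AN, GLOBAL_FREE_LIST_PARTITIONS_NUM the role of PN). This is not the attached patch; the symmetric "return AN elements" path and the choice of global partition are omitted:

```
/* Illustrative sketch only. */
#define GLOBAL_FREE_LIST_ALLOC_NUMBER	8	/* "AN" */
#define GLOBAL_FREE_LIST_PARTITIONS_NUM	16	/* "PN" */

typedef struct GlobalFreeList
{
	slock_t		mutex;			/* protects one global partition */
	HASHELEMENT *head;			/* chain of spare elements */
} GlobalFreeList;

/* Move up to AN elements from one global partition to a local freelist. */
static int
borrow_from_global(GlobalFreeList *global, HASHELEMENT **localFreeList)
{
	int			moved = 0;

	SpinLockAcquire(&global->mutex);
	while (moved < GLOBAL_FREE_LIST_ALLOC_NUMBER && global->head != NULL)
	{
		HASHELEMENT *elem = global->head;

		global->head = elem->link;
		elem->link = *localFreeList;
		*localFreeList = elem;
		moved++;
	}
	SpinLockRelease(&global->mutex);

	return moved;
}
```

Borrowing AN elements under a single spinlock acquisition is what keeps contention on the global list well below that of the original single HASHHDR mutex.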
Good news, everyone!

I discovered that NUM_LOCK_PARTITIONS is a bottleneck for the last patch. Long story short - NUM_LOCK_PARTITIONS = (1 << 7) gives the best performance:

PN = 16, AN = 8, NUM_LOCK_PARTITIONS = (1 << 7): 4782.9
PN = 16, AN = 4, NUM_LOCK_PARTITIONS = (1 << 7): 4089.9
PN = 16, AN = 2, NUM_LOCK_PARTITIONS = (1 << 7): 2822.5

PN = 8, AN = 8, NUM_LOCK_PARTITIONS = (1 << 7): 4391.7
PN = 8, AN = 4, NUM_LOCK_PARTITIONS = (1 << 7): 3665.0
PN = 8, AN = 2, NUM_LOCK_PARTITIONS = (1 << 7): 2205.7

PN = 4, AN = 8, NUM_LOCK_PARTITIONS = (1 << 7): 4031.9
PN = 4, AN = 4, NUM_LOCK_PARTITIONS = (1 << 7): 3378.2
PN = 4, AN = 2, NUM_LOCK_PARTITIONS = (1 << 7): 2142.4

Also I tried to optimize the global_free_list_* procedures as I planned. The optimized version is asymptotically faster but works about 5% slower in practice because of some additional operations we need to perform. The patch is attached to this message in case you would like to examine the code.

Here is a funny thing - the benchmark results I shared on 22.12.2015 are wrong, because I forgot to run `make clean` after changing lwlock.h (autotools doesn't rebuild the project properly after changing .h files). Here are the correct results:

60-core server, pgbench -j 64 -c 64 -f pgbench.sql -T 50 -P 1 my_database

NUM_LOCK_  | master   | no locks | lwlock   | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          | array
-----------|----------|----------|----------|----------|----------
  1 << 4   |   660.4  |  2296.3  |  1441.8  |   454.4  |  1597.8
-----------|----------|----------|----------|----------|----------
  1 << 5   |   536.6  |  3006.5  |  1632.3  |   381.8  |  1896.8
-----------|----------|----------|----------|----------|----------
  1 << 6   |   469.9  |  4083.5  |  1693.4  |   427.7  |  2271.5
-----------|----------|----------|----------|----------|----------
  1 << 7   |   430.8  |  4653.0  |  2016.3  |   361.1  |  3277.1

As you can see, the "spinlock array" version works quite well. It is almost 5 times faster than the current dynahash implementation and only 30% slower than the "no locks" version. It also guarantees that we will use all available memory.

I experimented with various improvements of "spinlock array" as well. E.g. I tried to borrow elements not one at a time, but it didn't cause any performance improvements.

In case you believe that 5 times faster is not good enough, we can also use sharded-global-free-list.patch with appropriate AN and PN values. Still, this patch doesn't guarantee that all available memory will be used (the "no lock" patch doesn't guarantee it either).

Considering that the benchmark above is not for all possible cases but for a very specific one, that the "spinlock array" patch has much better guarantees than the "no lock" patch, and that lock-free algorithms are pretty complicated and error-prone, I suggest we call the "spinlock array" solution good enough, at least until someone complains again that dynahash is a bottleneck. Naturally, first I will clean up the code, write more comments, and then re-check once again that there is no performance degradation according to the usual pgbench, etc.
Attachment
Here is a clean version of the patch. Step 1 is an optimization. Step 2 refactors this:

 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
- long init_size, /* initial table size */
+ long init_size, /* initial table size XXX ignored, refactor! */

As I understand there is no performance degradation:

Before
======

Core i7, pgbench -j 8 -c 8 -T 50 pgbench:

```
number of transactions actually processed: 14315
latency average: 27.945 ms
latency stddev: 44.920 ms
tps = 286.097043 (including connections establishing)
tps = 286.127642 (excluding connections establishing)
```

60-core server, pgbench -j 64 -c 64 -T 50 pgbench:

```
number of transactions actually processed: 176127
latency average: 18.162 ms
latency stddev: 23.861 ms
tps = 3521.069672 (including connections establishing)
tps = 3522.166706 (excluding connections establishing)
```

After
=====

Core i7, pgbench -j 8 -c 8 -T 50 pgbench:

```
number of transactions actually processed: 14212
latency average: 27.984 ms
latency stddev: 43.250 ms
tps = 284.038588 (including connections establishing)
tps = 285.112772 (excluding connections establishing)
```

60-core server, pgbench -j 64 -c 64 -T 50 pgbench:

```
number of transactions actually processed: 172135
latency average: 17.767 ms
latency stddev: 22.692 ms
tps = 3590.410302 (including connections establishing)
tps = 3594.607165 (excluding connections establishing)
```
Attachment
Aleksander Alekseev wrote: > Here is a funny thing - benchmark results I shared 22.12.2015 are wrong > because I forgot to run `make clean` after changing lwlock.h (autotools > doesn't rebuild project properly after changing .h files). Running configure with --enable-depend should avoid this problem. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote:
> Aleksander Alekseev wrote:
>
> > Here is a funny thing - benchmark results I shared 22.12.2015 are wrong
> > because I forgot to run `make clean` after changing lwlock.h (autotools
> > doesn't rebuild project properly after changing .h files).
>
> Running configure with --enable-depend should avoid this problem.

I still maintain that --enable-depend should be on by default. We're absurdly optimizing towards saving a handful of cycles in scenarios which are usually bottlenecked by other things (build boxes spend more time on tests and such), rather than optimizing for developer time. I don't know how many people have failed to set --enable-depend by now, but it definitely goes into several hundreds of wasted hours territory.
Andres Freund wrote: > On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote: > > Aleksander Alekseev wrote: > > > > > Here is a funny thing - benchmark results I shared 22.12.2015 are wrong > > > because I forgot to run `make clean` after changing lwlock.h (autotools > > > doesn't rebuild project properly after changing .h files). > > > > Running configure with --enable-depend should avoid this problem. > > I still maintain that --enable-depend should be on by default. +1 Let's do it. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Agree. --enable-depend turned on by default could save me a lot of time. Now I'm aware regarding --enable-depend and `make clean`. Still other people will definitely do the same mistake over and over again. On Wed, 30 Dec 2015 11:49:13 -0300 Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Andres Freund wrote: > > On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote: > > > Aleksander Alekseev wrote: > > > > > > > Here is a funny thing - benchmark results I shared 22.12.2015 > > > > are wrong because I forgot to run `make clean` after changing > > > > lwlock.h (autotools doesn't rebuild project properly after > > > > changing .h files). > > > > > > Running configure with --enable-depend should avoid this problem. > > > > I still maintain that --enable-depend should be on by default. > > +1 Let's do it. >
On Wed, Dec 30, 2015 at 5:44 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote:
> > Aleksander Alekseev wrote:
> >
> > > Here is a funny thing - benchmark results I shared 22.12.2015 are wrong
> > > because I forgot to run `make clean` after changing lwlock.h (autotools
> > > doesn't rebuild project properly after changing .h files).
> >
> > Running configure with --enable-depend should avoid this problem.
>
> I still maintain that --enable-depend should be on by default. We're
> absurdly optimizing towards saving a handful of cycles in scenarios
> which are usually bottlenecked by other things (build boxes spend more
> times on tests and such), rather than optimizing for developer time. I
> don't know how many people failed setting --enable-depend by now, but it
> definitely goes into several hundres of wasted ours territory.

+1
--enable-depend by default (was Re: Patch: fix lock contention for HASHHDR.mutex)
From: Tom Lane
[ A change as significant as this should not be debated in a thread with a title that suggests it's of interest to only one or two people ] > On Wed, Dec 30, 2015 at 5:44 PM, Andres Freund <andres@anarazel.de> wrote: >> I still maintain that --enable-depend should be on by default. We're >> absurdly optimizing towards saving a handful of cycles in scenarios >> which are usually bottlenecked by other things (build boxes spend more >> times on tests and such) Nonsense. The buildfarm will simply get slower, and at least for my animals it's too damn slow already. We could fix that by having the buildfarm script automatically add --disable-depend; but if we're going to change this, we should roll out that script update BEFORE we change it, not after. regards, tom lane
Re: --enable-depend by default (was Re: Patch: fix lock contention for HASHHDR.mutex)
From: Greg Stark
<p dir="ltr">Alternately the buildfarm could be changed to retain earlier builds and use dependencies to reduce build times.This would have the benefit of detecting dependency failures. At the expense of a lot more noise, at least until weOrrin out whatever dependency failures are there now.<p dir="ltr">Or we could add extra rules so that if you don't configurewith dependencies we check for unclean builds and error out. This would make at least accidental unclean buildsmuch led likely.<div class="gmail_quote">On 30 Dec 2015 16:49, "Tom Lane" <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>>wrote:<br type="attribution" /><blockquote class="gmail_quote" style="margin:00 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">[ A change as significant as this should not be debatedin a thread with<br /> a title that suggests it's of interest to only one or two people ]<br /><br /> > On Wed,Dec 30, 2015 at 5:44 PM, Andres Freund <<a href="mailto:andres@anarazel.de">andres@anarazel.de</a>> wrote:<br />>> I still maintain that --enable-depend should be on by default. We're<br /> >> absurdly optimizing towardssaving a handful of cycles in scenarios<br /> >> which are usually bottlenecked by other things (build boxesspend more<br /> >> times on tests and such)<br /><br /> Nonsense. The buildfarm will simply get slower, andat least for my<br /> animals it's too damn slow already. We could fix that by having the<br /> buildfarm script automaticallyadd --disable-depend; but if we're going<br /> to change this, we should roll out that script update BEFOREwe change<br /> it, not after.<br /><br /> regards, tom lane<br /><br /><br /> --<br /> Sentvia pgsql-hackers mailing list (<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br />To make changes to your subscription:<br /><a href="http://www.postgresql.org/mailpref/pgsql-hackers" rel="noreferrer"target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br /></blockquote></div>
Re: --enable-depend by default (was Re: Patch: fix lock contention for HASHHDR.mutex)
From: Tom Lane
Greg Stark <stark@mit.edu> writes: > Alternately the buildfarm could be changed to retain earlier builds and use > dependencies to reduce build times. This would have the benefit of > detecting dependency failures. At the expense of a lot more noise, at least > until we Orrin out whatever dependency failures are there now. The buildfarm already made its choice in this area, and that's ccache. Layering --enable-depend on top of builds that are using ccache is simply a waste. (Admittedly, ccache is gcc-specific, but TTBOTMK so is --enable-depend.) regards, tom lane
Re: --enable-depend by default (was Re: Patch: fix lock contention for HASHHDR.mutex)
From: Andres Freund
On 2015-12-30 10:49:27 -0500, Tom Lane wrote:
> > On Wed, Dec 30, 2015 at 5:44 PM, Andres Freund <andres@anarazel.de> wrote:
> >> I still maintain that --enable-depend should be on by default. We're
> >> absurdly optimizing towards saving a handful of cycles in scenarios
> >> which are usually bottlenecked by other things (build boxes spend more
> >> times on tests and such)
>
> Nonsense. The buildfarm will simply get slower, and at least for my
> animals it's too damn slow already.

I think you're overstating the performance impact of --enable-depend. It's just an additional output file for the compiler, given the way we're doing it (-MMD -MP -MF $outfile).

Let's make a quick test. I've built both trees once before, to prime ccache. The test is:

for i in $(seq 1 5); do git clean -dfxq && ./configure --(dis|en)able-depend --quiet && time make -j4 -s;done

(coffee break)

no ccache, --disable-depend: 0m55.810s, 0m58.361s, 0m58.517s, 0m58.674s, 0m58.466s
ccache, --disable-depend:    0m5.248s, 0m5.279s, 0m5.244s, 0m5.771s, 0m5.296s
no ccache, --enable-depend:  0m56.443s, 0m58.507s, 0m58.587s, 0m58.866s, 0m58.429s
ccache, --enable-depend:     0m5.538s, 0m5.518s, 0m5.572s, 0m5.555s, 0m5.528s

Yes, this is on a much faster machine (my laptop) than what some buildfarm animals are running. But given it's really just some additional, *small*, files that are written (compiler) and read (make), I don't see why the impact would be significantly different on slower machines.

> We could fix that by having the
> buildfarm script automatically add --disable-depend; but if we're going
> to change this, we should roll out that script update BEFORE we change
> it, not after.

Sounds reasonable, despite the above. This isn't an urgent change, just something to make newer developers waste less time.
Re: --enable-depend by default (was Re: Patch: fix lock contention for HASHHDR.mutex)
From: Andres Freund
On 2015-12-30 11:13:31 -0500, Tom Lane wrote:
> The buildfarm already made its choice in this area, and that's ccache.
> Layering --enable-depend on top of builds that are using ccache is
> simply a waste.
>
> (Admittedly, ccache is gcc-specific, but TTBOTMK so is --enable-depend.)

For posterity's sake: clang supports the dependency stuff by default, and it's harder but doable to get clang running with ccache.
On Wed, Dec 30, 2015 at 1:10 PM, Oleg Bartunov <obartunov@gmail.com> wrote:
>
>
>
> On Wed, Dec 30, 2015 at 5:44 PM, Andres Freund <andres@anarazel.de> wrote:
>>
>> On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote:
>> > Aleksander Alekseev wrote:
>> >
>> > > Here is a funny thing - benchmark results I shared 22.12.2015 are wrong
>> > > because I forgot to run `make clean` after changing lwlock.h (autotools
>> > > doesn't rebuild project properly after changing .h files).
>> >
>> > Running configure with --enable-depend should avoid this problem.
>>
>> I still maintain that --enable-depend should be on by default. We're
>
>
> +1
>
--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello
Andres, Robert, are you still reviewing this patch?

Aleksander Alekseev wrote:
> Here is a clean version of the patch. Step 1 is an optimization. Step 2
> refactors this:
>
> HTAB *
> ShmemInitHash(const char *name, /* table string name for shmem index */
> - long init_size, /* initial table size */
> + long init_size, /* initial table size XXX ignored, refactor! */

So you were saying some posts upthread that the new hash tables would "borrow" AN elements from the freelist then put AN elements back when too many got unused. Is that idea embodied in this patch? I'm nervous about continuing a line of development that does away with the "init_size" concept completely, but if this patch keeps the "borrow" concept then maybe it doesn't matter.

> -/* Number of partitions of the shared buffer mapping hashtable */
> -#define NUM_BUFFER_PARTITIONS 128
> -
> /* Number of partitions the shared lock tables are divided into */
> -#define LOG2_NUM_LOCK_PARTITIONS 4
> +#define LOG2_NUM_LOCK_PARTITIONS 7
> #define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
>
> +/* Number of partitions of the shared buffer mapping hashtable */
> +#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS
> +
> /* Number of partitions the shared predicate lock tables are divided into */
> #define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
> #define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)

I find this odd. Why is it a good idea to define the bufMapping partitions like this? What's the explanation for the performance numbers you saw?

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello, Álvaro > So you were saying some posts upthread that the new hash tables would > "borrow" AN elements from the freelist then put AN elements back when > too many got unused. Is that idea embodied in this patch? Right. It didn't satisfy "use all memory" requirement anyway. It's better to sacrifice a few TPS (according only to my benchmark which doesn't represent all possible use cases) than accidentally break someones system after upgrading to the next version of PostgreSQL. See example with pg_dump given by Robert Haas above. >> +/* Number of partitions of the shared buffer mapping hashtable */ >> +#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS >> + >> /* Number of partitions the shared predicate lock tables are divided >> into */ #define LOG2_NUM_PREDICATELOCK_PARTITIONS 4 >> #define NUM_PREDICATELOCK_PARTITIONS (1 << >> LOG2_NUM_PREDICATELOCK_PARTITIONS) > > I find this odd. Why is it a good idea to define the bufMapping > partitions like this? What's the explanation for the performance > numbers you saw? My mistake. I thought that last table was self-explanatory. We see that a single spinlock for accessing a freeList (current implementation) is obviously a bottleneck. Replacing it with say an array of spinlocks reduces lock contention and therefore gives more TPS (see first row). Also we can increase concurrency level even further by increasing number of lock partitions (see columns "no locks", "lwlock" and "spinlock array"). Previously it couldn't help us (see "master" column) because of a bottleneck. If you have any other questions regarding this patch please don't hesitate to ask. Best regards, Aleksander
On 2016-01-11 21:16:42 -0300, Alvaro Herrera wrote: > Andres, Robert, are you still reviewing this patch? I don't think I've signed up to - or actually did - review this patch. Greetings, Andres Freund
Andres Freund wrote: > On 2016-01-11 21:16:42 -0300, Alvaro Herrera wrote: > > Andres, Robert, are you still reviewing this patch? > > I don't think I've signed up to - or actually did - review this patch. Well, maybe not as such, but you replied to the thread several times, which is why I asked. No problem, you don't *have* to review all patches. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 12, 2016 at 1:50 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> increasing number of lock partitions (see columns "no locks", "lwlock"
> and "spinlock array"). Previously it couldn't help us (see "master"
> column) because of a bottleneck.
>
> If you have any other questions regarding this patch please don't
> hesitate to ask.
I have done some performance benchmarking for this patch (dynahash-optimization-v10-step1.patch).
Machine Detail:
cpu : POWER8
cores: 24 (192 with HT)
Non Default Parameter:
------------------------
Shared Buffer= 30GB
max_wal_size= 10GB
max_connections=500
Test1:
schema.sql and pgbench.sql are the same files that Aleksander attached in the first mail of the thread.
psql -d postgres -f schema.sql
pgbench -c$ -j$ -f pgbench.sql postgres
client base patch
1 145 145
2 262 258
4 453 472
8 749 732
16 1114 1128
32 1727 1789
64 2729 2793
128 3534 3520
With this test I did not see any improvement, though I think it should improve the performance. Do you have any suggestions for how to reproduce results like yours?
I have also dumped stacks using a script, and I observed many stack dumps like the one you mentioned in the initial thread. And after the patch, I found that the number of lock-waiting stacks is reduced.
Stack Dump:
-------------------
#0 0x00007f25842899a7 in semop () from /lib64/libc.so.6
#1 0x00000000007116d0 in PGSemaphoreLock (sema=0x7f257cb170d8) at pg_sema.c:387
#2 0x000000000078955f in LWLockAcquire (lock=0x7f247698a980, mode=LW_EXCLUSIVE) at lwlock.c:1028
#3 0x00000000007804c4 in LockAcquireExtended
#4 0x000000000077fe91 in LockAcquire
#5 0x000000000077e862 in LockRelationOid
#6 0x000000000053bc7b in find_inheritance_children
#7 0x000000000053bd4f in find_all_inheritors
#8 0x00000000006fc0a2 in expand_inherited_rtentry
#9 0x00000000006fbf91 in expand_inherited_tables
I have also tried to analyze using perf; I can see that the amount of time taken in hash_search_with_hash_value is reduced from 3.86% (master) to 1.72% (patch).
I plan to do further investigation in different scenarios of dynahash.
--
30.12.2015 16:01, Aleksander Alekseev:
> Here is a clean version of the patch. Step 1 is an optimization. Step 2
> refactors this:
>
> HTAB *
> ShmemInitHash(const char *name, /* table string name for shmem index */
> - long init_size, /* initial table size */
> + long init_size, /* initial table size XXX ignored, refactor! */

Hi, I found that the patch is still not reviewed, so I've looked it over. I haven't dived into this subject before, so my comments will be more general than relating to your investigation. Sorry if some things seem like nitpicks; I just suppose that a clear patch would be more convenient for reviewers.

First of all, why not merge both patches into one? They aren't too big anyway.

Then, this thread became too tangled. I think it's worth writing a new message with the patch, the test script, some results and a brief overview of how it really works. It will make following review much easier.

+ /* number of entries in hash table */
+ long nentries[NUM_LOCK_PARTITIONS];
+ /* linked list of free elements */
+ HASHELEMENT *freeList[NUM_LOCK_PARTITIONS];

I think comments should be changed, to be more informative here.

+ if (IS_PARTITIONED(hashp->hctl))
+ partitions_number = NUM_LOCK_PARTITIONS;
+ else
+ partitions_number = 1;
+
+ nelem_alloc = ((int) nelem) / partitions_number;
+ if (nelem_alloc == 0)
+ nelem_alloc = 1;

Add a comment here too.

-/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS 128
-
 /* Number of partitions the shared lock tables are divided into */
-#define LOG2_NUM_LOCK_PARTITIONS 4
+#define LOG2_NUM_LOCK_PARTITIONS 7
 #define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)

+/* Number of partitions of the shared buffer mapping hashtable */
+#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS
+

I'm sure it was discussed above, but these definitions do not look obvious at all. Maybe you should explain this magic number 7 in the comment above?

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
This patch affects header files. By any chance didn't you forget to run `make clean` after applying it? As we discussed above, when you change .h files autotools doesn't rebuild dependent .c files: http://www.postgresql.org/message-id/CAF4Au4y4MTapP2u3uGtBQd_z0CHEsJWfNAvRgK4umfcWA4ELxQ@mail.gmail.com On Thu, 21 Jan 2016 16:49:12 +0530 Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Tue, Jan 12, 2016 at 1:50 PM, Aleksander Alekseev < > a.alekseev@postgrespro.ru> wrote: > > > > > increasing number of lock partitions (see columns "no locks", > > "lwlock" and "spinlock array"). Previously it couldn't help us (see > > "master" column) because of a bottleneck. > > > > If you have any other questions regarding this patch please don't > > hesitate to ask. > > > > I have done some performance bench marking for this > patch.(dynahash-optimization-v10-step1.patch) > > Machine Detail: > cpu : POWER8 > cores: 24 (192 with HT) > > Non Default Parameter: > ------------------------ > Shared Buffer= 30GB > max_wal_size= 10GB > max_connections=500 > > Test1: > *schema.sql* and *pgbench.sql* are same files which Aleksander has > attached in first mail of the thread. > > psql -d postgres -f schema.sql > pgbench -c$ -j$ -f pgbench.sql postgres > > client base patch > 1 145 145 > 2 262 258 > 4 453 472 > 8 749 732 > 16 1114 1128 > 32 1727 1789 > 64 2729 2793 > 128 3534 3520 > > With this test i did not see any improvement, though i think it should > improve the performance, do you have any suggestion to see the > results same as yours ? > > I have also dump stacks using some script and I have observed many > stacks dumps as you mentioned in initial thread. And after patch, I > found that number of lock waiting stacks are reduced. > > Stack Dump: > ------------------- > #0 0x00007f25842899a7 in semop () from /lib64/libc.so.6 > #1 0x00000000007116d0 in PGSemaphoreLock (sema=0x7f257cb170d8) at > pg_sema.c:387 > #2 0x000000000078955f in LWLockAcquire (lock=0x7f247698a980, > mode=LW_EXCLUSIVE) at lwlock.c:1028 > #3 0x00000000007804c4 in LockAcquireExtended > #4 0x000000000077fe91 in LockAcquire > #5 0x000000000077e862 in LockRelationOid > #6 0x000000000053bc7b in find_inheritance_children > #7 0x000000000053bd4f in find_all_inheritors > #8 0x00000000006fc0a2 in expand_inherited_rtentry > #9 0x00000000006fbf91 in expand_inherited_tables > > I have tried to analyze using perf also, I can see that amount of time > taken in hash_search_with_hash_value is reduced from 3.86%(master) to > 1.72%(patch). > > I have plan to do further investigation, in different scenarios of > dynahash. >
Hi,

> First of all, why not merge both patches into one? They aren't too
> big anyway.

Agree.

> I think comments should be changed, to be more informative here.
> Add a comment here too.
> Maybe you should explain this magic number 7 in the comment above?

Done.

> Then, this thread became too tangled. I think it's worth writing a
> new message with the patch, the test script, some results and a brief
> overview of how it really works. It will make following review much
> easier.

Sure.

HASHHDR represents a hash table. It could be usual or partitioned. A partitioned table is stored in shared memory and accessed by multiple processes simultaneously. To prevent data corruption the hash table is partitioned and each process has to acquire a lock for the corresponding partition before accessing data in it. The number of the partition is determined by the lower bits of the key's hash value. The trickiest part is that dynahash knows nothing about these locks --- all locking is done on the calling side.

Since shared memory is pre-allocated and can't grow, to allocate memory in a hash table we use a freeList. Also we use the nentries field to count the current number of elements in the hash table. Since the hash table is used by multiple processes these fields are protected by a spinlock, which, as it turned out, can cause lock contention and create a bottleneck.

After trying a few approaches I discovered that using one spinlock per partition works quite well. Here are the latest benchmark results:

http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu

Note that the "no locks" solution can't be used because it doesn't guarantee that all available memory will be used in some corner cases.

You can find a few more details and a test script in the first message of this thread. If you have any other questions regarding this patch please don't hesitate to ask.

Best regards, Aleksander
Attachment
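To make the scheme described above easier to picture, here is a minimal standalone sketch (not the dynahash code from the patch) of a shared hash header that keeps one spinlock-protected free list and entry counter per partition. NUM_FREELISTS, the struct names and the pthread spinlock standing in for slock_t are illustrative assumptions only.

```
/*
 * Minimal sketch of the idea described above: one free list, one entry
 * counter and one spinlock per partition, so that backends whose hash
 * values fall into different partitions never touch the same mutex.
 * This is NOT the actual dynahash code; all names are illustrative.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_FREELISTS 32            /* assumed power of two */

typedef struct HashElement
{
    struct HashElement *link;       /* next free element */
} HashElement;

typedef struct FreeListData
{
    pthread_spinlock_t mutex;       /* protects nentries and freeList */
    long        nentries;           /* live entries "owned" by this list */
    HashElement *freeList;          /* chain of unused elements */
} FreeListData;

typedef struct HashHeader
{
    FreeListData freeList[NUM_FREELISTS];
    /* ... directory, bucket count, etc. would follow ... */
} HashHeader;

/* Pick a free list from the hash value; a power of two lets us mask. */
static int
freelist_idx(uint32_t hashvalue)
{
    return (int) (hashvalue & (NUM_FREELISTS - 1));
}

int
main(void)
{
    HashHeader  hdr;
    uint32_t    h = 0xdeadbeefU;
    int         idx;

    for (int i = 0; i < NUM_FREELISTS; i++)
    {
        pthread_spin_init(&hdr.freeList[i].mutex, PTHREAD_PROCESS_PRIVATE);
        hdr.freeList[i].nentries = 0;
        hdr.freeList[i].freeList = NULL;
    }

    /* An insert with hash value h only contends on its own free list. */
    idx = freelist_idx(h);
    pthread_spin_lock(&hdr.freeList[idx].mutex);
    hdr.freeList[idx].nentries++;
    pthread_spin_unlock(&hdr.freeList[idx].mutex);

    printf("hash %08x -> free list %d, nentries %ld\n",
           (unsigned) h, idx, hdr.freeList[idx].nentries);
    return 0;
}
```

The point is simply that two backends whose hash values land in different partitions no longer fight over a single mutex or a single nentries counter.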
<div dir="ltr"><div class="gmail_extra"><br /><div class="gmail_quote">On Fri, Jan 22, 2016 at 3:44 PM, Aleksander Alekseev<span dir="ltr"><<a href="mailto:a.alekseev@postgrespro.ru" target="_blank">a.alekseev@postgrespro.ru</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px#ccc solid;padding-left:1ex"><div class="a3s" id=":4ve" style="overflow:hidden">This patch affects headerfiles. By any chance didn't you forget to run<br /> `make clean` after applying it? As we discussed above, when you<br/> change .h files autotools doesn't rebuild dependent .c files:</div></blockquote></div><br /></div><div class="gmail_extra">Yes,actually i always compile using "make clean;make -j20; make install"<br /></div><div class="gmail_extra">Ifyou want i will run it again may be today or tomorrow and post the result.<br /><br /><br />-- <br/><div class="gmail_signature"><div dir="ltr"><span style="color:rgb(80,0,80);font-size:12.8px">Regards,</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">Dilip Kumar</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">EnterpriseDB: </span><a href="http://www.enterprisedb.com/"style="color:rgb(17,85,204);font-size:12.8px" target="_blank">http://www.enterprisedb.com</a><br/></div></div></div></div>
> > This patch affects header files. By any chance didn't you forget to
> > run `make clean` after applying it? As we discussed above, when you
> > change .h files autotools doesn't rebuild dependent .c files:
>
> Yes, actually I always compile using "make clean; make -j20; make
> install". If you want I will run it again, maybe today or tomorrow,
> and post the result.

Most likely HASHHDR.mutex is not the only bottleneck in your case, so my patch doesn't help much. Unfortunately I don't have access to any POWER8 server so I can't investigate this issue. I suggest using the gettimeofday trick I described in the first message of this thread. It's time-consuming but it gives a clear understanding of which code is keeping a lock.
<br /><br /><div class="moz-cite-prefix">22.01.2016 13:48, Aleksander Alekseev:<br /></div><blockquote cite="mid:20160122134837.788ecf83@fujitsu"type="cite"><blockquote type="cite"><pre wrap="">Then, this thread became too tangled.I think it's worth to write a new message with the patch, the test script, some results and brief overview of how does it really works. It will make following review much easier. </pre></blockquote><pre wrap=""> Sure. HASHHDR represents a hash table. It could be usual or partitioned. Partitioned table is stored in a shared memory and accessed by multiple processes simultaneously. To prevent data corruption hash table is partitioned and each process has to acquire a lock for a corresponding partition before accessing data in it. Number of partition is determine by lower bits of key's hash value. Most tricky part is --- dynahash knows nothing about these locks, all locking is done on calling side. Since shared memory is pre-allocated and can't grow to allocate memory in a hash table we use freeList. Also we use nentries field to count current number of elements in a hash table. Since hash table is used by multiple processes these fields are protected by a spinlock. Which as it turned out could cause lock contention and create a bottleneck. After trying a few approaches I discovered that using one spinlock per partition works quite well. Here are last benchmark results: <a class="moz-txt-link-freetext" href="http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu">http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu</a> Note that "no locks" solution cant be used because it doesn't guarantee that all available memory will be used in some corner cases. You can find a few more details and a test script in the first message of this thread. If you have any other questions regarding this patch please don't hesitate to ask. </pre></blockquote><div class="moz-text-flowed" lang="x-unicode" style="font-family: -moz-fixed; font-size: 12px;">InitLocks<br /> /* <br /> * Compute init/max size to request for lock hashtables. Note these <br /> * calculations must agree with LockShmemSize! <br /> */ <br /><br /> This comment certainly requires some changes.<br /> BTW, could you explain why init_table_size was two times less than max_table_size? <br /> Although, I don'tsee any problems with your changes. <br /><br /><br /> - hctl->nentries = 0; <br /> - hctl->freeList = NULL;<br /><br /> Why did you delete these two lines? I wonder if you should rewrite them instead? <br /><br /> + * Thisparticular number of partitions significantly reduces lock contention <br /> + * in partitioned hash tables, almost ifpartitioned tables didn't use any <br /> + * locking at all <br /><br /> As far as I understood, this number was obtainedexperimentally? Maybe you should note that in the comment. <br /><br /><br /> And the last, but not least.<br /><br/> + nelem_alloc = ((int) nelem) / partitions_number; <br /> + if (nelem_alloc == 0) <br /> + nelem_alloc = 1; <br /> + <br /> + for (i = 0; i < partitions_number; i++) <br /> + if (!element_alloc(hashp,nelem_alloc, i)) <br /> + ereport(ERROR, <br /> + (errcode(ERRCODE_OUT_OF_MEMORY),<br /> + errmsg("out of memory"))); <br /><br /><br /> It looks likeyou forgot to round the result of the division.<br /> For example, if you have nelem=25 and partitions_number=6.<br />25 / 6 = 4. And then you allocate 24 nelems, while 1 is lost. <br /><br /> What about productivity, I did a test on mynotebook. 
Patched version shows small performance improvement.<br /><br /> pgbench -j 1 -c 1 -f pgbench.sql -T 300 postgres<br/> base patched<br /> 127tps 145tps<br /> pgbench -j 8 -c 8 -f pgbench.sql -T 300 postgres<br /> base patched<br/> 247tps 270tps<br /><br /> But I haven't any big machine to perform tests, where the patch is supposed to makesignificant changes.<br /> So I would rely on your and the other reviewers results.<br /><br /> Except mentioned notes,I suppose the patch is good enough to pass it to a reviewer.<br /></div><pre class="moz-signature" cols="72">-- Anastasia Lubennikova Postgres Professional: <a class="moz-txt-link-freetext" href="http://www.postgrespro.com">http://www.postgrespro.com</a> The Russian Postgres Company</pre>
> This comment certainly requires some changes.

Fixed.

> BTW, could you explain why init_table_size was two times less than
> max_table_size?

I have no clue. My best guess is that it was a reasonable thing to do in the past. Then somebody changed the code and now there is little reason to use init_table_size for partitioned tables.

> Why did you delete these two lines? I wonder if you should rewrite
> them instead?

```
 MemSet(hctl, 0, sizeof(HASHHDR));

- hctl->nentries = 0;
- hctl->freeList = NULL;
```

These fields were initialized with zero values twice. It makes little sense to me.

> As far as I understood, this number was obtained experimentally?
> Maybe you should note that in the comment.

These numbers are very platform-specific and will be outdated very soon. I recall that my code was criticized for including exact numbers not long ago. So I suggest keeping this part as is.

> For example, if you have nelem=25 and partitions_number=6: 25 / 6 = 4,
> and then you allocate 24 nelems, while 1 is lost.

Agree. Fixed.

> Except for the mentioned notes, I suppose the patch is good enough

I guess I will mark this patch as "Ready for Committer" then.
Attachment
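For reference, the rounding issue Anastasia spotted above comes down to using a ceiling rather than a truncating division. A tiny standalone illustration with her numbers (the variable names are borrowed from the quoted patch, otherwise hypothetical):

```
#include <stdio.h>

int
main(void)
{
    long        nelem = 25;             /* example from the review */
    int         partitions_number = 6;

    /* Truncating division loses elements: 25 / 6 = 4, and 4 * 6 = 24. */
    long        truncated = nelem / partitions_number;

    /* Rounding up guarantees nelem_alloc * partitions_number >= nelem. */
    long        nelem_alloc = (nelem + partitions_number - 1) / partitions_number;

    printf("truncated:  %ld per partition, %ld total\n",
           truncated, truncated * partitions_number);
    printf("rounded up: %ld per partition, %ld total\n",
           nelem_alloc, nelem_alloc * partitions_number);
    return 0;
}
```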
<div dir="ltr"><div class="gmail_extra"><br /><div class="gmail_quote">On Wed, Jan 27, 2016 at 5:15 PM, Aleksander Alekseev<span dir="ltr"><<a href="mailto:a.alekseev@postgrespro.ru" target="_blank">a.alekseev@postgrespro.ru</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0px 0px0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="" id=":2o6" style="overflow:hidden">Mostlikely HASHHDR.mutex is not only bottleneck in your case so my<br /> patch doesn't help much.Unfortunately I don't have access to any<br /> POWER8 server so I can't investigate this issue. I suggest to use a<br/> gettimeofday trick I described in a first message of this thread. Its<br /> time consuming but it gives a clear understandingwhich code is keeping<br /> a lock.</div></blockquote></div><br />I have also tested the pgbench Readonly testwhen data don't fit into shared buffer,<br />Because in this case HASHHDR.mutex access will be quite frequent.<br />Andin this case i do see very good improvement in POWER8 server.<br /><br />Test Result:<br />------------<br />Scale Factor: 300 <br />Shared Buffer: 512MB <br />pgbench -c$ -j$ -S -M Prepared postgres<br /> <br />Client base patch<br />64 222173 318318<br />128 195805 262290<br clear="all" /><br />-- <br /><divclass="gmail_signature"><div dir="ltr"><span style="color:rgb(80,0,80);font-size:12.8px">Regards,</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">Dilip Kumar</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">EnterpriseDB: </span><a href="http://www.enterprisedb.com/"style="color:rgb(17,85,204);font-size:12.8px" target="_blank">http://www.enterprisedb.com</a><br/></div></div></div></div>
On Thu, Jan 21, 2016 at 9:03 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > First of all, why not merge both patches into one? They aren't too big > anyway. So looking over this patch, it institutes a new coding rule that all shared hash tables must have the same number of partitions, and that number is hard-coded at 128. The number of freelist partitions is also hardcoded at 128. I find this design surprising. First, I would not have expected that we'd need 128 freelist partitions. We get by just fine with only 16 lock manager partitions, at least AFAIK, and there's no reason some future hashtable might not have little enough contention that 16 partitions works just fine. Even for the buffer manager, which does have 128 partitions, I wouldn't assume that we need 128 partitions for the free list just because we need 128 partitions for the locks themselves. It's possible we do, but I'd expect that existing buffer mapping locking might dissipate some of that contention - e.g. if somebody's doing a buffer lookup on a particular partition, they have a shared lock on that partition, so nobody has an exclusive lock to be able to hit the freelist. So: do we have clear evidence that we need 128 partitions here, or might, say, 16 work just fine? Second, I don't see any reason why the number of free list partitions needs to be a constant, or why it needs to be equal to the number of hash bucket partitions. That just seems like a completely unnecessary constraint. You can take the hash value, or whatever else you use to pick the partition, modulo any number you want. I don't really want to increase the number of lock manager partition locks to 128 just for the convenience of this patch, and even less do I want to impose the requirement that every future shared hash table must use an equal number of partitions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello, Robert. > So: do we have clear evidence that we need 128 partitions here, or > might, say, 16 work just fine? Yes, this number of partitions was chosen based on this benchmark (see "spinlock array" column): http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu In fact we can choose 16, 32, 64 or 128 partitions - any number works better than current implementation with a single spinlock. But according to this benchmark 128 partitions give twice as much TPS then 16 partitions, almost as if we didn't have any locks at all (see "no locks" column). Still I'm a bit concerned that 128 partitions require (128-16)* ( sizeof(slock_t) + sizeof(void*) + sizeof(long) ) bytes of additional memory per hash table which is: (gdb) p (128-16)*(sizeof(slock_t) + sizeof(long) + sizeof(void*)) $1 = 1904 ... bytes of shared memory on amd64. I'm not sure if this is considered OK. If it is I suggest to leave number 128. If it's not, perhaps we could agree on say 64 partitions as a compromise. Described benchmark doesn't represent any possible workload anyway. I personally believe that we don't have so many small hash tables that 1.8 Kb of additional shared memory per hash table would be an issue. > I don't really want to increase the number of lock manager partition > locks to 128 just for the convenience of this patch, and even less do > I want to impose the requirement that every future shared hash table > must use an equal number of partitions. May I ask why? Does it create a real problem? > I don't see any reason why the number of free list partitions needs > to be a constant, or why it needs to be equal to the number of hash > bucket partitions. That just seems like a completely unnecessary > constraint. You can take the hash value, or whatever else you use to > pick the partition, modulo any number you want. True, number of spinlocks / freeLists could be a variable. I believe it's an old-good simplicity vs flexibility question. Lets say we allow user (i.e. calling side) to choose spinlocks and free lists number. So now we should use freeList[hash % items_number] formula which means division instead of binary shift. Do we really need such an expensive operation just to calculate index of a free list? Probably not. Let limit possible items_number values so it could be only power of two. Now there is only four possible values user would more likely to choose (16, 32, 64, 128), which is not as flexible anymore. What about memory? If we allocate mutex, nentries and freeList dynamically depending on items_number value, HASHHDR should store pointers to allocated memory which means additional indirection for almost every access to mutex, nentries and freeList fields. Probably a bad idea. Otherwise we still waste shared memory, probably even more memory since we don't want maximum items_number value to be too low. Etc. I believe such a change will cause only additional code complexity (code is already far from trivial in dynahash.c) so next time it will be much harder to change anything, probably some bugs we will miss and will not solve any real problem. Why just not to choose a constant which seems to be good enough instead?
On Mon, Feb 8, 2016 at 5:39 AM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote: >> So: do we have clear evidence that we need 128 partitions here, or >> might, say, 16 work just fine? > > Yes, this number of partitions was chosen based on this benchmark (see > "spinlock array" column): > > http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu It's very hard to understand what that email is trying to say. It uses terms like PN and AN without defining them. There's certainly not enough detail for somebody else to replicate what you did. > In fact we can choose 16, 32, 64 or 128 partitions - any number works > better than current implementation with a single spinlock. But > according to this benchmark 128 partitions give twice as much TPS then > 16 partitions, almost as if we didn't have any locks at all (see "no > locks" column). OK. > Still I'm a bit concerned that 128 partitions require (128-16)* > ( sizeof(slock_t) + sizeof(void*) + sizeof(long) ) bytes of additional > memory per hash table which is: > > (gdb) p (128-16)*(sizeof(slock_t) + sizeof(long) + sizeof(void*)) > $1 = 1904 > > ... bytes of shared memory on amd64. > > I'm not sure if this is considered OK. If it is I suggest to leave > number 128. If it's not, perhaps we could agree on say 64 partitions as > a compromise. Described benchmark doesn't represent any possible > workload anyway. I personally believe that we don't have so many small > hash tables that 1.8 Kb of additional shared memory per hash table would > be an issue. I don't particularly care about using slightly more shared memory per hash table. >> I don't really want to increase the number of lock manager partition >> locks to 128 just for the convenience of this patch, and even less do >> I want to impose the requirement that every future shared hash table >> must use an equal number of partitions. > > May I ask why? Does it create a real problem? It's hard to be 100% sure about the future, but suppose that in another 10 years we discover that the buffer mapping hash table now needs to have 1024 partitions, but the lock manager partition hash table still only needs 16. Do we really want to drag the lock manager and any future shared hash tables up to 1024 just because the buffer manager needs it? That sounds like a bad idea from here. For example, one advantage of having fewer partitions is that it takes less time to lock them all. That matters to - for example - the cost of running the deadlock detector. Is the cost enough to matter in that particular case with the code as it stands? I don't know. Could having more partitions ever be worse than having fewer? I'm pretty sure that the answer is yes. >> I don't see any reason why the number of free list partitions needs >> to be a constant, or why it needs to be equal to the number of hash >> bucket partitions. That just seems like a completely unnecessary >> constraint. You can take the hash value, or whatever else you use to >> pick the partition, modulo any number you want. > > True, number of spinlocks / freeLists could be a variable. I believe > it's an old-good simplicity vs flexibility question. > > Lets say we allow user (i.e. calling side) to choose spinlocks and free > lists number. So now we should use freeList[hash % items_number] formula > which means division instead of binary shift. Do we really need such an > expensive operation just to calculate index of a free list? Probably > not. Let limit possible items_number values so it could be only power > of two. 
> Now there is only four possible values user would more likely to choose
> (16, 32, 64, 128), which is not as flexible anymore.

Sure, I don't see anything wrong with requiring the number of partitions to be a power of two. On the other hand, I also suspect that on modern hardware the cost of a shift versus a modulo operator is almost irrelevant. What's really going to make this fast or slow is the degree to which it has, or does not have, cache line contention.

> What about memory? If we allocate mutex, nentries and freeList
> dynamically depending on items_number value, HASHHDR should store
> pointers to allocated memory which means additional indirection for
> almost every access to mutex, nentries and freeList fields. Probably a
> bad idea. Otherwise we still waste shared memory, probably even more
> memory since we don't want maximum items_number value to be too low.

These seem like problems that admit of more than one good solution, and which you are certainly capable of solving. For example, you could always include space for 128 freelists but not tie the number of freelists to the number of partition locks, or you could always include space for 128 freelists and have the system only use them all if there are 128 or more partitions. If there are fewer than 128 partitions, it uses only as many freelists as there are partitions. Neither of these things sounds very hard to implement.

Basically, the burden for you to impose a new coding rule on everybody who uses shared hash tables in the future is very high. You would need to show that that really isn't going to be an issue for anyone in any reasonable situation. I don't think you can show that, because I don't think it's true. I don't even like the idea of raising the number of partitions for the lock manager for the convenience of your code. If that turns out to have negative performance consequences in some situation, we'd either have to live with or revert your entire patch. Those kinds of trade-offs aren't good for you as a patch author, or for the product.

I'm not going to commit this patch if it imposes new constraints on choosing the number of partitions in a dynamic hash table, and I'm going to oppose anyone else committing it either. If this problem couldn't be solved without imposing such constraints, we might decide (after long soul-searching) that it was worth it. But it seems pretty clear that they aren't necessary in this case, so I'm against them.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
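The shift-versus-modulo point is easy to check: for any power-of-two free list count the modulo reduces to a bit mask, while a non-power-of-two count simply needs a real modulo. A small standalone demonstration (the hash value is arbitrary):

```
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint32_t    hash = 0x9e3779b9U;     /* arbitrary hash value */

    /* For a power-of-two n, hash % n and hash & (n - 1) are the same. */
    for (uint32_t n = 16; n <= 128; n <<= 1)
        assert((hash % n) == (hash & (n - 1)));

    /* A non-power-of-two count works too, it just needs a real modulo. */
    printf("17 free lists -> index %u\n", (unsigned) (hash % 17));
    return 0;
}
```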
I've closed this as returned-with-feedback. Please resubmit once you have found answers to the questions posed; from the submitted benchmark numbers this looks very exciting, but it needs a bit more work. You don't necessarily have to agree with everything Robert says, but you need to have well reasoned explanations for the points where you disagree. Thanks, -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello, Robert

> Basically, the burden for you to impose a new coding rule on everybody
> who uses shared hash tables in the future is very high.

I fixed an issue you described. Number of spinlocks doesn't depend of NUM_LOCK_PARTITIONS anymore and could be specified for each hash table on a calling side.

I did the benchmark described in the first message of this thread again. Currently I don't have access to the same 60-core server so I used a more common 12-core server (24 with HT). According to this benchmark the TPS increment depends on NUM_LOCK_PARTITIONS and the default number of spinlocks this way:

pgbench -f pgbench.sql -T 150 -P 1 -c 40 -j 12

DMN  | NLP = 16 | NLP = 32 | NLP = 64 | NLP = 128
-----|----------|----------|----------|----------
  8  |  +15.1%  |  +28.2%  |  +34.1%  |  +33.7%
 16  |  +16.6%  |  +30.9%  |  +37.0%  |  +40.8%
 32  |  +15.1%  |  +33.9%  |  +39.5%  |  +41.9%
 64  |  +15.0%  |  +31.9%  |  +40.1%  |  +42.9%
128  |   +7.7%  |  +24.7%  |  +29.6%  |  +31.6%

* NLP = NUM_LOCK_PARTITIONS
* DMN = DEFAULT_MUTEXES_NUM

I realize this benchmark doesn't represent any possible workload, so for the attached patch I chose NUM_LOCK_PARTITIONS = DEFAULT_MUTEXES_NUM = 32. It seems to be a reasonable compromise between the speedup according to a "synthetic and meaningless in practice" benchmark and the number of used locks, which could mean quite a lot in practice. Still, these values could easily be changed at any moment.

Here are before/after benchmark results for this concrete patch.

BEFORE

pgbench (default):
tps = 1295.798531 (including connections establishing)
tps = 1295.858295 (excluding connections establishing)

pgbench -f pgbench.sql:
tps = 1020.072172 (including connections establishing)
tps = 1020.116888 (excluding connections establishing)

AFTER

pgbench (default):
tps = 1299.369807 (including connections establishing)
tps = 1299.429764 (excluding connections establishing)

pgbench -f pgbench.sql:
tps = 1365.749333 (including connections establishing)
tps = 1365.814384 (excluding connections establishing)

So as I understand it, this patch solves the lock contention problem and doesn't make anything else worse.
Attachment
On Wed, Feb 10, 2016 at 3:24 AM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote: >> Basically, the burden for you to impose a new coding rule on everybody >> who uses shared hash tables in the future is very high. > > I fixed an issue you described. Number of spinlocks doesn't depend of > NUM_LOCK_PARTITIONS anymore and could be specified for each hash table > on a calling side. Thanks, this looks way better. Some more comments: - I don't think there's any good reason for this patch to change the calling convention for ShmemInitHash(). Maybe that's a good idea and maybe it isn't, but it's a separate issue from what this patch is doing, and if we're going to do it at all, it should be discussed separately. Let's leave it out of this patch. - I would not provide an option to change the number of freelist mutexes. Let's dump DEFAULT_MUTEXES_NUM and MAX_MUTEXES_NUM and have FREELIST_MUTEXES_NUM. The value of 32 which you selected is fine with me. - The change to LOG2_NUM_LOCK_PARTITIONS in lwlock.h looks like an independent change. Again, leave it out of this patch and submit it separately with its own benchmarking data if you want to argue for it. I think if you make these changes this patch will be quite a bit smaller and in pretty good shape. Thanks for working on this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello, Robert > Thanks, this looks way better. Some more comments: > > - I don't think there's any good reason for this patch to change the > calling convention for ShmemInitHash(). Maybe that's a good idea and > maybe it isn't, but it's a separate issue from what this patch is > doing, and if we're going to do it at all, it should be discussed > separately. Let's leave it out of this patch. > > - I would not provide an option to change the number of freelist > mutexes. Let's dump DEFAULT_MUTEXES_NUM and MAX_MUTEXES_NUM and have > FREELIST_MUTEXES_NUM. The value of 32 which you selected is fine with > me. > > - The change to LOG2_NUM_LOCK_PARTITIONS in lwlock.h looks like an > independent change. Again, leave it out of this patch and submit it > separately with its own benchmarking data if you want to argue for it. > > I think if you make these changes this patch will be quite a bit > smaller and in pretty good shape. > > Thanks for working on this. > Here is a corrected patch. I decided to keep a small change in InitLocks procedure (lock.c) since ShmemInitHash description clearly states "For a table whose maximum size is certain, [init_size] should be equal to max_size". Also I added two items to my TODO list to send more patches as soon (and if) this patch will be applied. Thanks for your contribution to this patch.
Attachment
On Thu, Feb 11, 2016 at 9:11 AM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote: >> Thanks, this looks way better. Some more comments: >> >> - I don't think there's any good reason for this patch to change the >> calling convention for ShmemInitHash(). Maybe that's a good idea and >> maybe it isn't, but it's a separate issue from what this patch is >> doing, and if we're going to do it at all, it should be discussed >> separately. Let's leave it out of this patch. >> >> - I would not provide an option to change the number of freelist >> mutexes. Let's dump DEFAULT_MUTEXES_NUM and MAX_MUTEXES_NUM and have >> FREELIST_MUTEXES_NUM. The value of 32 which you selected is fine with >> me. >> >> - The change to LOG2_NUM_LOCK_PARTITIONS in lwlock.h looks like an >> independent change. Again, leave it out of this patch and submit it >> separately with its own benchmarking data if you want to argue for it. >> >> I think if you make these changes this patch will be quite a bit >> smaller and in pretty good shape. >> >> Thanks for working on this. > > Here is a corrected patch. I decided to keep a small change in > InitLocks procedure (lock.c) since ShmemInitHash description clearly > states "For a table whose maximum size is certain, [init_size] should > be equal to max_size". Also I added two items to my TODO list to send > more patches as soon (and if) this patch will be applied. > > Thanks for your contribution to this patch. The fact that InitLocks() doesn't do this has been discussed before and there's no consensus on changing it. It is, at any rate, a separate issue. I'll go through the rest of this patch again now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Feb 11, 2016 at 9:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The fact that InitLocks() doesn't do this has been discussed before
> and there's no consensus on changing it. It is, at any rate, a
> separate issue. I'll go through the rest of this patch again now.

I did a little bit of cleanup on this patch and the results are attached. I also did some benchmarking of this version. I tested this on two systems, in each case using five-minute, read-only pgbench runs at scale factor 3000, varying the client count and taking the median of three runs. First, I tested it on hydra, a 2-socket, 16-processor, 64-thread POWER box. This is a community resource hosted at OSUOSL. Second, I tested it on cthulhu, an 8-socket, 64-processor, 128-thread x86_64 box. This is an EnterpriseDB resource. Here are the results:

hydra, master vs. patched
1 client: 8042.872109 vs. 7839.587491 (-2.5%)
64 clients: 213311.852043 vs. 214002.314071 (+0.3%)
96 clients: 219551.356907 vs. 221908.397489 (+1.1%)
128 clients: 210279.022760 vs. 217974.079171 (+3.7%)

cthulhu, master vs. patched
1 client: 3407.705820 vs. 3645.129360 (+7.0%)
64 clients: 88667.681890 vs. 82636.914343 (-6.8%)
96 clients: 79303.750250 vs. 105442.993869 (+33.0%)
128 clients: 74684.510668 vs. 120984.938371 (+62.0%)

Obviously, the high-client count results on cthulhu are stellar. I'm *assuming* that the regressions are just random variation. I am however wondering if it to set the freelist affinity based on something other than the hash value, like say the PID, so that the same process doesn't keep switching to a different freelist for every buffer eviction.

It also strikes me that it's probably quite likely that slock_t mutex[NUM_FREELISTS] is a poor way to lay out this data in memory. For example, on a system where slock_t is just one byte, most likely all of those mutexes are going to be in the same cache line, which means you're going to get a LOT of false sharing. It seems like it would be sensible to define a new struct that contains an slock_t, a long, and a HASHELEMENT *, and then make an array of those structs. That wouldn't completely eliminate false sharing, but it would reduce it quite a bit. My guess is that if you did that, you could reduce the number of freelists to 8 or less and get pretty much the same performance benefit that you're getting right now with 32. And that, too, seems likely to be good for single-client performance.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
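A rough standalone sketch of the layout Robert suggests above: one struct per free list, padded so that neighbouring free lists do not end up in the same cache line. The 64-byte line size, the struct names and the pthread spinlock standing in for slock_t are assumptions for illustration only; a real implementation would also want the array itself cache-line aligned.

```
/*
 * Sketch only: group each free list's spinlock, counter and list head in
 * one struct and pad it to a full cache line, so concurrent updates of
 * neighbouring free lists do not falsely share a line.
 */
#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE_SIZE 64              /* assumed */
#define NUM_FREELISTS   32

typedef struct HashElement
{
    struct HashElement *link;
} HashElement;

typedef union FreeListPadded
{
    struct
    {
        pthread_spinlock_t mutex;
        long        nentries;
        HashElement *freeList;
    }           data;
    char        pad[CACHE_LINE_SIZE];   /* forces one line per free list */
} FreeListPadded;

int
main(void)
{
    FreeListPadded lists[NUM_FREELISTS];

    printf("one free list occupies %zu bytes\n", sizeof(FreeListPadded));
    printf("the whole array occupies %zu bytes\n", sizeof(lists));
    return 0;
}
```

With a layout like this the number of free lists becomes an independent knob: the padding keeps the lists apart from each other no matter how many of them there are.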
On Thu, Feb 11, 2016 at 3:26 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Feb 11, 2016 at 9:53 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> The fact that InitLocks() doesn't do this has been discussed before >> and there's no consensus on changing it. It is, at any rate, a >> separate issue. I'll go through the rest of this patch again now. > > I did a little bit of cleanup on this patch and the results are > attached. I also did some benchmarking of this version. I tested > this on two systems, in each case using five-minute, read-only pgbench > runs at scale factor 3000, varying the client count and taking the > median of three runs. First, I tested it on hydra, a 2-socket, > 16-processor, 64-thread POWER box. This is a community resource > hosted at OSUOSL. Second, I tested it on cthulhu, an 8-socket, > 64-processor, 128-thread x86_64 box. This is an EnterpriseDB > resource. Here are the results: > > hydra, master vs. patched > 1 client: 8042.872109 vs. 7839.587491 (-2.5%) > 64 clients: 213311.852043 vs. 214002.314071 (+0.3%) > 96 clients: 219551.356907 vs. 221908.397489 (+1.1%) > 128 clients: 210279.022760 vs. 217974.079171 (+3.7%) > > cthulhu, master vs. patched > 1 client: 3407.705820 vs. 3645.129360 (+7.0%) > 64 clients: 88667.681890 vs. 82636.914343 (-6.8%) > 96 clients: 79303.750250 vs. 105442.993869 (+33.0%) > 128 clients: 74684.510668 vs. 120984.938371 (+62.0%) > > Obviously, the high-client count results on cthulhu are stellar. I'm > *assuming* that the regressions are just random variation. I am > however wondering if it to set the freelist affinity based on > something other than the hash value, like say the PID, so that the > same process doesn't keep switching to a different freelist for every > buffer eviction. It also strikes me that it's probably quite likely > that slock_t mutex[NUM_FREELISTS] is a poor way to lay out this data > in memory. For example, on a system where slock_t is just one byte, > most likely all of those mutexes are going to be in the same cache > line, which means you're going to get a LOT of false sharing. It > seems like it would be sensible to define a new struct that contains > an slock_t, a long, and a HASHELEMENT *, and then make an array of > those structs. That wouldn't completely eliminate false sharing, but > it would reduce it quite a bit. My guess is that if you did that, you > could reduce the number of freelists to 8 or less and get pretty much > the same performance benefit that you're getting right now with 32. > And that, too, seems likely to be good for single-client performance. One other thing: I don't see what prevents the nentries count for a particular freelist from going negative, which would subsequently cause an Assertion failure. I wonder if we should restructure the code so that we keep track of the total number of elements allocated and the number on each freelist, so that the count of elements = elements allocation - sum(elements on each freelist). This would require a little more code churn but it seems that the result would be more clearly correct. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello, Robert

> It also strikes me that it's probably quite likely that slock_t
> mutex[NUM_FREELISTS] is a poor way to lay out this data in memory.
> For example, on a system where slock_t is just one byte, most likely
> all of those mutexes are going to be in the same cache line, which
> means you're going to get a LOT of false sharing. It seems like it
> would be sensible to define a new struct that contains an slock_t, a
> long, and a HASHELEMENT *, and then make an array of those structs.
> That wouldn't completely eliminate false sharing, but it would reduce
> it quite a bit. My guess is that if you did that, you could reduce
> the number of freelists to 8 or less and get pretty much the same
> performance benefit that you're getting right now with 32. And that,
> too, seems likely to be good for single-client performance.

I experimented for a while trying to fit every spinlock in a separate cache line. Indeed we can gain some speedup this way. Here are benchmark results on a 12-core server for NUM_LOCK_PARTITIONS = 32 (in this case the difference is more notable):

| FREELISTS | SIZE = 32 | SIZE = 64 | SIZE = 128 | SIZE = 256 |
|-----------|-----------|-----------|------------|------------|
| 4         | +25.4%    | +28.7%    | +28.4%     | +28.3%     |
| 8         | +28.2%    | +29.4%    | +31.7%     | +31.4%     |
| 16        | +29.9%    | +32.6%    | +31.6%     | +30.8%     |
| 32        | +33.7%    | +34.2%    | +33.6%     | +32.6%     |

Here SIZE is short for FREELIST_BUFFER_SIZE (part of a hack I used to align the FREELIST structure, see attached patch). The cache line on this CPU is 64 bytes according to /sys/devices/system/cpu/cpu0/cache/index0/*

I would like to remind that without these changes we had the following picture (please note the "NLP = 32" column):

pgbench -f pgbench.sql -T 150 -P 1 -c 40 -j 12

DMN  | NLP = 16 | NLP = 32 | NLP = 64 | NLP = 128
-----|----------|----------|----------|----------
  8  |  +15.1%  |  +28.2%  |  +34.1%  |  +33.7%
 16  |  +16.6%  |  +30.9%  |  +37.0%  |  +40.8%
 32  |  +15.1%  |  +33.9%  |  +39.5%  |  +41.9%
 64  |  +15.0%  |  +31.9%  |  +40.1%  |  +42.9%
128  |   +7.7%  |  +24.7%  |  +29.6%  |  +31.6%

* NLP = NUM_LOCK_PARTITIONS
* DMN = DEFAULT_MUTEXES_NUM

So it's +34.2% vs +33.9% and the number of freeLists remains the same.

> I am however wondering if it to set the freelist affinity based on
> something other than the hash value, like say the PID, so that the
> same process doesn't keep switching to a different freelist for every
> buffer eviction.

Also I tried to use the PID to determine the freeList number:

```
#include <sys/types.h>
#include <unistd.h>

...

#define FREELIST_IDX(hctl, hashcode) \
    (IS_PARTITIONED(hctl) ? ((uint32) getpid()) % NUM_FREELISTS : 0)

...

// now nentries could be negative in this case
// Assert(FREELIST(hctl, freelist_idx).nentries > 0);
```

Unfortunately this approach gives +33.9% TPS in the best case. Also there is a possibility that nentries will overflow. So far I don't have a clear idea how to solve this issue effectively. I'm not sure if further improvements are worth the effort.
Attachment
On Fri, Feb 12, 2016 at 9:04 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
In which case, do you think entries can go negative? I think the nentries is incremented and decremented in the way as without patch, so I am not getting what can make it go negative.
Hello, Robert
> It also strikes me that it's probably quite likely that slock_t
> mutex[NUM_FREELISTS] is a poor way to lay out this data in memory.
> For example, on a system where slock_t is just one byte, most likely
> all of those mutexes are going to be in the same cache line, which
> means you're going to get a LOT of false sharing. It seems like it
> would be sensible to define a new struct that contains an slock_t, a
> long, and a HASHELEMENT *, and then make an array of those structs.
> That wouldn't completely eliminate false sharing, but it would reduce
> it quite a bit. My guess is that if you did that, you could reduce
> the number of freelists to 8 or less and get pretty much the same
> performance benefit that you're getting right now with 32. And that,
> too, seems likely to be good for single-client performance.
I experimented for a while trying to fit every spinlock in a separate
cache line. Indeed we can gain some speedup this way. Here are
benchmark results on 12-core server for NUM_LOCK_PARTITIONS = 32 (in
this case difference is more notable):
| FREELISTS | SIZE = 32 | SIZE = 64 | SIZE = 128 | SIZE = 256 |
|-----------|------------|------------|------------|------------|
| 4 | +25.4% | +28.7% | +28.4% | +28.3% |
| 8 | +28.2% | +29.4% | +31.7% | +31.4% |
| 16 | +29.9% | +32.6% | +31.6% | +30.8% |
| 32 | +33.7% | +34.2% | +33.6% | +32.6% |
Here SIZE is short for FREELIST_BUFFER_SIZE (part of a hack I used to
align FREELIST structure, see attached patch).
I am not sure, if this is exactly what has been suggested by Robert, so it is not straightforward to see if his suggestion can allow us to use NUM_FREELISTS as 8 rather than 32. I think instead of trying to use FREELISTBUFF, why not do it as Robert has suggested and try with different values of NUM_FREELISTS?
> I am however wondering if it to set the freelist affinity based on
> something other than the hash value, like say the PID, so that the
> same process doesn't keep switching to a different freelist for every
> buffer eviction.
Also I tried to use PID to determine freeList number:
```
#include <sys/types.h>
#include <unistd.h>
...
#define FREELIST_IDX(hctl, hashcode) \
(IS_PARTITIONED(hctl) ? ((uint32)getpid()) % NUM_FREELISTS : 0)
...
// now nentries could be negative in this case
// Assert(FREELIST(hctl, freelist_idx).nentries > 0);
Hello, Amit

> I am not sure, if this is exactly what has been suggested by Robert,
> so it is not straightforward to see if his suggestion can allow us to
> use NUM_FREELISTS as 8 rather than 32. I think instead of trying to
> use FREELISTBUFF, why not do it as Robert has suggested and try with
> different values of NUM_FREELISTS?

Well, what I did is in fact a bit more general solution then Robert
suggested. At first I just joined freeList and mutex arrays into one
array and tried different NUM_FREELISTS values (see FREELISTS column).
That was Robert's idea. Unfortunately it didn't give any considerable
speedup comparing with previous approach.

Then I tried to manually control sizeof every array item (see
different SIZE values). By making it larger we can put every array item
into a separate cache line. This approach helped a bit in terms of TPS
(before: +33.9%, after: +34.2%) but only if NUM_FREELISTS remains the
same (32).

So answering your question - it turned out that we _can't_ reduce
NUM_FREELISTS this way.

Also I don't believe that +0.3% TPS according to synthetic benchmark
that doesn't represent any possible workload worth it considering
additional code complexity.

> In which case, do you think entries can go negative? I think the
> nentries is incremented and decremented in the way as without patch,
> so I am not getting what can make it go negative.

In this context we are discussing a quick and dirty fix "what would
happen if FREELIST_IDX would be implemented as getpid() % NUM_FREE_LIST
instead of hash % NUM_FREELIST". The code is:

int freelist_idx = FREELIST_IDX(hctl, hashvalue);

// ...

switch (action)
{

// ...

case HASH_REMOVE:
if (currBucket != NULL)
{
if (IS_PARTITIONED(hctl))
SpinLockAcquire(&(FREELIST(hctl, freelist_idx).mutex));

// Assert(FREELIST(hctl, freelist_idx).nentries > 0);
FREELIST(hctl, freelist_idx).nentries--;

/* remove record from hash bucket's chain. */
*prevBucketPtr = currBucket->link;

// ...

No matter what freelist was used when element was added to a hash table.
We always try to return free memory to the same freelist number getpid()
% FREELIST_ITEMS and decrease number of elements in a hash table using
corresponding nentries field. For this reason nentries could become
negative. Still, the sum of all nentries remains valid.

I realize that this thread is quite long already and that it's been a
while since the previous commitfest ended. So I would like to provide a
short summary of this thread from my perspective:

1. The last version of the patch is here:

http://www.postgresql.org/message-id/CA+Tgmobtf9nH566_jjs=jrTyMq5HdQdaRF5j7o+AbdOWQHE_Ow@mail.gmail.com

It's reviewed, tested, well-commented and apparently nobody has strong
objections against it.

2. Robert suggested a few possible improvements. Experiments showed that
in the best case the resulting patches are as good as the patch from
item 1, but the code becomes considerably more complex.

3. A number of further changes (changing the calling convention for
ShmemInitHash(), changing the LOG2_NUM_LOCK_PARTITIONS constant) will be
discussed separately when and if the patch from this thread is applied.

>
> > > I am however wondering if it to set the freelist affinity based on
> > > something other than the hash value, like say the PID, so that the
> > > same process doesn't keep switching to a different freelist for
> > > every buffer eviction.
> >
> > Also I tried to use PID to determine freeList number:
> >
> > ```
> > #include <sys/types.h>
> > #include <unistd.h>
> >
> > ...
> >
> > #define FREELIST_IDX(hctl, hashcode) \
> > 	(IS_PARTITIONED(hctl) ? ((uint32)getpid()) % NUM_FREELISTS : 0)
> >
> > ...
> >
> > // now nentries could be negative in this case
> > // Assert(FREELIST(hctl, freelist_idx).nentries > 0);
>
> In which case, do you think entries can go negative? I think the
> nentries is incremented and decremented in the way as without patch,
> so I am not getting what can make it go negative.
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
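A short hypothetical sketch to make the cache-line experiment above concrete; the names (PaddedFreeList, sketch_spinlock, CACHE_LINE_SIZE) are invented for this illustration and are not taken from the patch:

```c
/*
 * Sketch only: pad each freelist slot to a full cache line so that two
 * backends hammering the mutexes of neighbouring slots never share a
 * cache line.  PostgreSQL's real slock_t is replaced by a plain int here.
 */
#define CACHE_LINE_SIZE 64
#define NUM_FREELISTS   32

typedef volatile int sketch_spinlock;   /* stand-in for slock_t */

typedef union
{
	struct
	{
		sketch_spinlock mutex;      /* protects freeList and nentries */
		long            nentries;   /* entries charged to this freelist */
		void           *freeList;   /* head of the chain of free elements */
	} data;
	char pad[CACHE_LINE_SIZE];      /* rounds sizeof() up to one cache line */
} PaddedFreeList;

static PaddedFreeList freelists[NUM_FREELISTS];
```

Whether such padding pays off is exactly what the SIZE column in the benchmarks mentioned above was probing; as reported, it helped by only about 0.3% TPS.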
On Tue, Mar 1, 2016 at 8:13 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
>
> Hello, Amit
>
> > I am not sure, if this is exactly what has been suggested by Robert,
> > so it is not straightforward to see if his suggestion can allow us to
> > use NUM_FREELISTS as 8 rather than 32. I think instead of trying to
> > use FREELISTBUFF, why not do it as Robert has suggested and try with
> > different values of NUM_FREELISTS?
>
> Well, what I did is in fact a bit more general solution then Robert
> suggested. At first I just joined freeList and mutex arrays into one
> array and tried different NUM_FREELISTS values (see FREELISTS column).
> That was Robert's idea. Unfortunately it didn't give any considerable
> speedup comparing with previous approach.
>
Okay, now I can understand the data better and I think your tests indicate that it is better to keep NUM_FREELISTS as 32.
>
> Then I tried to manually control sizeof every array item (see
> different SIZE values). By making it larger we can put every array item
> into a separate cache line. This approach helped a bit in terms of TPS
> (before: +33.9%, after: +34.2%) but only if NUM_FREELISTS remains the
> same (32).
>
> So answering your question - it turned out that we _can't_ reduce
> NUM_FREELISTS this way.
>
> Also I don't believe that +0.3% TPS according to synthetic benchmark
> that doesn't represent any possible workload worth it considering
> additional code complexity.
>
I think the data structure HASHHDR is looking simpler, however the code is looking more complex, especially with the addition of FREELISTBUFF for cacheline purposes. I think you might want to try exactly what has been suggested and see if the code looks better that way, but if you are not convinced, then let's leave it to the committer unless somebody else wants to try that suggestion.
> > In which case, do you think entries can go negative? I think the
> > nentries is incremented and decremented in the way as without patch,
> > so I am not getting what can make it go negative.
>
> In this context we are discussing a quick and dirty fix "what would
> happen if FREELIST_IDX would be implemented as getpid() % NUM_FREE_LIST
> instead of hash % NUM_FREELIST". The code is:
>
> int freelist_idx = FREELIST_IDX(hctl, hashvalue);
>
> // ...
>
> switch (action)
> {
>
> // ...
>
> case HASH_REMOVE:
> if (currBucket != NULL)
> {
> if (IS_PARTITIONED(hctl))
> SpinLockAcquire(&(FREELIST(hctl, freelist_idx).mutex));
>
> // Assert(FREELIST(hctl, freelist_idx).nentries > 0);
> FREELIST(hctl, freelist_idx).nentries--;
>
> /* remove record from hash bucket's chain. */
> *prevBucketPtr = currBucket->link;
>
> // ...
>
> No matter what freelist was used when element was added to a hash table.
> We always try to return free memory to the same freelist number getpid()
> % FREELIST_ITEMS and decrease number of elements in a hash table using
> corresponding nentries field.
>
Won't it always use the same freelist to remove and add the entry from freelist as for both cases it will calculate the freelist_idx in same way?
> Won't it always use the same freelist to remove and add the entry from
> freelist as for both cases it will calculate the freelist_idx in same
> way?

No. If "our" freelist is empty when we try to remove an item from it we
borrow item from another freelist. Then this borrowed item will be
returned to "our" freelist instead of original. Without some sort of
additional logic there is no way to figure out what freelist was
original.

Generally speaking negative nentries value is not something that
couldn't be solved. But I would like to remind that in this context we
are discussing a quick and dirty solution created for benchmark
purposes in a first place.

> You are not convinced, then lets leave it to committer unless
> somebody else wants to try that suggestion.

Agree. Frankly I'm tired of rewriting this patch over and over and over
again. So I would like to avoid rewriting it once again unless there is
a clear indication that this way we would gain something. Benchmarks
shows that this is not a case thus it's only a matter of taste and
intuition. We know from experience that both are bad advisers in
optimization matters.

I suggest we merge this patch already:

http://www.postgresql.org/message-id/CA+Tgmobtf9nH566_jjs=jrTyMq5HdQdaRF5j7o+AbdOWQHE_Ow@mail.gmail.com

... and get done with it. Clearly this patch is good enough --- it is
reviewed, tested, discussed, and has a proven performance benefit. When
and if dynahash becomes a bottleneck again, nothing will stop us from
optimizing it one more time for the hardware we will have by then. Who
knows what this hardware will be like in 5-10 years? Don't we have
better things to do now than arguing about imaginary benchmarks and
taste?

Let's see what the committer's opinion is on this.
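The borrowing behaviour described above can be pictured with a small, self-contained toy model (invented names, no locking, no allocation of brand-new elements; this is not the real dynahash code). The point is that the element may be taken from any partition's freelist, but the nentries counter that changes is always the one for the indicated freelist_idx:

```c
#define NUM_FREELISTS 32

typedef struct SketchElement
{
	struct SketchElement *link;             /* next free element */
} SketchElement;

typedef struct
{
	SketchElement *freeList[NUM_FREELISTS]; /* per-partition free chains */
	long           nentries[NUM_FREELISTS]; /* per-partition usage counters */
} SketchHashHdr;

static SketchElement *
sketch_take(SketchHashHdr *h, int freelist_idx)
{
	int			idx = freelist_idx;

	/* borrow from some other partition if the indicated one is empty */
	if (h->freeList[idx] == NULL)
	{
		for (idx = 0; idx < NUM_FREELISTS; idx++)
			if (h->freeList[idx] != NULL)
				break;
		if (idx == NUM_FREELISTS)
			return NULL;			/* nothing free anywhere */
	}

	SketchElement *elem = h->freeList[idx];

	h->freeList[idx] = elem->link;		/* unlink from wherever it was found */
	h->nentries[freelist_idx]++;		/* but charge it to the indicated list */
	return elem;
}

static void
sketch_put(SketchHashHdr *h, SketchElement *elem, int freelist_idx)
{
	/* freed elements always land on the indicated list */
	elem->link = h->freeList[freelist_idx];
	h->freeList[freelist_idx] = elem;
	h->nentries[freelist_idx]--;
}
```

Which index counts as "the indicated one" is the whole question in this sub-thread: derived from the hash value it is the same on insert and remove, derived from getpid() it need not be.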
On Thu, Mar 3, 2016 at 1:50 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> Won't it always use the same freelist to remove and add the entry from
> freelist as for both cases it will calculate the freelist_idx in same
> way?
> No. If "our" freelist is empty when we try to remove an item from it we
> borrow item from another freelist.
I think the way the patch works is: if the indicated freelist is empty, then it tries to allocate new elements in that list, and if it can't allocate, then it tries to borrow from another freelist; in both cases the element removed from the freelist is considered to be an element of the indicated freelist (basically it always increments nentries[freelist_idx]).
> Then this borrowed item will be
> returned to "our" freelist instead of original. Without some sort of
> additional logic there is no way to figure out what freelist was
> original.
Since the patch always operates on nentries[freelist_idx], it seems to me that in both cases (remove, add) the element is considered to belong to the same freelist. Have you faced this problem of negative entries in any of your tests? If so, please share the test so that I can also understand the scenario.
> Generally speaking negative nentries value is not something that
> couldn't be solved. But I would like to remind that in this context we
> are discussing a quick and dirty solution created for benchmark
> purposes in a first place.
I thought negative entries could be a problem for the patch as indicated by Robert[1], but maybe I am missing something.
> You are not convinced, then lets leave it to committer unless
> somebody else wants to try that suggestion.
> Agree. Frankly I'm tired of rewriting this patch over and over and over
> again. So I would like to avoid rewriting it once again unless there is
> a clear indication that this way we would gain something. Benchmarks
> shows that this is not a case thus it's only a matter of taste and
> intuition.
I can understand your feeling here and agree with you that it is a matter of taste. However, many times, if we change the patch according to the committer's taste, the chances of the patch getting accepted early are better. I am not sure if that applies here, though, so feel free to go the way you think is better.
On Tue, Mar 1, 2016 at 9:43 AM, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:
>> I am not sure, if this is exactly what has been suggested by Robert,
>> so it is not straightforward to see if his suggestion can allow us to
>> use NUM_FREELISTS as 8 rather than 32. I think instead of trying to
>> use FREELISTBUFF, why not do it as Robert has suggested and try with
>> different values of NUM_FREELISTS?
>
> Well, what I did is in fact a bit more general solution then Robert
> suggested. At first I just joined freeList and mutex arrays into one
> array and tried different NUM_FREELISTS values (see FREELISTS column).
> That was Robert's idea. Unfortunately it didn't give any considerable
> speedup comparing with previous approach.
>
> Then I tried to manually control sizeof every array item (see
> different SIZE values). By making it larger we can put every array item
> into a separate cache line. This approach helped a bit in terms of TPS
> (before: +33.9%, after: +34.2%) but only if NUM_FREELISTS remains the
> same (32).
>
> So answering your question - it turned out that we _can't_ reduce
> NUM_FREELISTS this way.

That's perplexing. I would have expected that with all of the mutexes
packed in back-to-back like this, we would end up with a considerable
amount of false sharing. I don't know why it ever helps to have an
array of bytes all in the same cache line of which each one is a
heavily-trafficked mutex. Anybody else have a theory?

One other thing that concerns me mildly is the way we're handling
nentries. It should be true, with this patch, that the individual
nentries sum to the right value modulo 2^32. But I don't think
there's any guarantee that the values are positive any more, and in
theory after running long enough one of them could manage to overflow
or underflow. So at a very minimum I think we need to remove the
Assert() the value is not negative. But really, I wonder if we
shouldn't rework things a little more than that.

One idea is to jigger things so that we maintain a count of the total
number of entries that doesn't change except when we allocate, and
then for each freelist partition we maintain the number of entries in
that freelist partition. So then the size of the hash table, instead
of being sum(nentries) is totalsize - sum(nfree).

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[ P.S. Sorry it's taken me a while to circle back around to this. ]
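Robert's alternative bookkeeping could look roughly like this (a sketch with invented names, not actual PostgreSQL code): one allocation counter that only grows, plus a per-partition count of elements currently sitting on that partition's freelist. Each nfree only ever changes together with the list it describes, so it cannot go negative, and the table size falls out as totalsize - sum(nfree):

```c
#define NUM_FREELISTS 32

typedef struct
{
	long	nfree;			/* free elements on this partition's list */
	/* the spinlock and the free-element chain itself are omitted here */
} CountingFreeList;

typedef struct
{
	long			total_allocated;	/* grows only when new elements are carved out */
	CountingFreeList freelist[NUM_FREELISTS];
} CountingHashHdr;

static long
sketch_num_entries(const CountingHashHdr *hctl)
{
	long	nfree = 0;

	for (int i = 0; i < NUM_FREELISTS; i++)
		nfree += hctl->freelist[i].nfree;

	/* size of the table: totalsize - sum(nfree) */
	return hctl->total_allocated - nfree;
}
```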
On Sat, Mar 19, 2016 at 1:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Mar 1, 2016 at 9:43 AM, Aleksander Alekseev
> <a.alekseev@postgrespro.ru> wrote:
> >
> > So answering your question - it turned out that we _can't_ reduce
> > NUM_FREELISTS this way.
>
> That's perplexing. I would have expected that with all of the mutexes
> packed in back-to-back like this, we would end up with a considerable
> amount of false sharing. I don't know why it ever helps to have an
> array of bytes all in the same cache line of which each one is a
> heavily-trafficked mutex. Anybody else have a theory?
>
> One other thing that concerns me mildly is the way we're handling
> nentries. It should be true, with this patch, that the individual
> nentries sum to the right value modulo 2^32. But I don't think
> there's any guarantee that the values are positive any more, and in
> theory after running long enough one of them could manage to overflow
> or underflow.
>
Won't in theory, without patch as well nentries can overflow after running for very long time? I think with patch it is more prone to overflow because we start borrowing from other free lists as well.
> So at a very minimum I think we need to remove the
> Assert() the value is not negative. But really, I wonder if we
> shouldn't rework things a little more than that.
>
> One idea is to jigger things so that we maintain a count of the total
> number of entries that doesn't change except when we allocate, and
> then for each freelist partition we maintain the number of entries in
> that freelist partition. So then the size of the hash table, instead
> of being sum(nentries) is totalsize - sum(nfree).
>
To me, your idea sounds much better than current code in terms of understanding the free list concept as well. So, +1 for changing the code in this way.
On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Won't in theory, without patch as well nentries can overflow after running
> for very long time? I think with patch it is more prone to overflow because
> we start borrowing from other free lists as well.

Uh, I don't think so. Without the patch, there is just one entries
counter and it goes up and down. How would it ever overflow?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Mar 19, 2016 at 7:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Won't in theory, without patch as well nentries can overflow after running
> > for very long time? I think with patch it is more prone to overflow because
> > we start borrowing from other free lists as well.
>
> Uh, I don't think so. Without the patch, there is just one entries
> counter and it goes up and down. How would it ever overflow?
>
I thought it can overflow because we haven't kept any upper limit on incrementing it unless the memory finishes (ofcourse that is just a theoretical assumption, as the decrements will keep the number in control), so are you thinking about the risk of overflow with patch because we have to use sum of all the nentries from all the arrays for total or is there any thing else which makes you think that changing nentries into arrays of nentries can make it prone to overflow?
On Sun, Mar 20, 2016 at 3:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Mar 19, 2016 at 7:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > Won't in theory, without patch as well nentries can overflow after
>> > running
>> > for very long time? I think with patch it is more prone to overflow
>> > because
>> > we start borrowing from other free lists as well.
>>
>> Uh, I don't think so. Without the patch, there is just one entries
>> counter and it goes up and down. How would it ever overflow?
>
> I thought it can overflow because we haven't kept any upper limit on
> incrementing it unless the memory finishes (ofcourse that is just a
> theoretical assumption, as the decrements will keep the number in control),
> so are you thinking about the risk of overflow with patch because we have to
> use sum of all the nentries from all the arrays for total or is there any
> thing else which makes you think that changing nentries into arrays of
> nentries can make it prone to overflow?

Well, I mean, perhaps nentries could overflow if you had more than
2^32 elements, but I'm not even positive we support that. If you
assume a fixed table with a million entries, the nentries value can
vary only between 0 and a million. But now split that into a bunch of
separate counters. The increment when you allocate an entry and the
decrement when you put one back don't have to hit the same bucket, so
I'm not sure there's anything that prevents the counter for one bucket
from getting arbitrarily large and the counter for another bucket
getting arbitrarily small while still summing to a value between 0 and
a million.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
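Robert's point about unbounded individual counters is easy to see with a deliberately extreme toy program (plain C, not PostgreSQL code): if the increment on allocation and the decrement on release may hit different counters, each counter alone is unbounded even though their sum stays put:

```c
#include <stdio.h>

int
main(void)
{
	long	nentries[2] = {0, 0};

	for (long i = 0; i < 1000000; i++)
	{
		nentries[0]++;			/* allocation always charged to counter 0 */
		nentries[1]--;			/* release always credited to counter 1 */
	}

	/* prints: sum = 0, nentries[0] = 1000000, nentries[1] = -1000000 */
	printf("sum = %ld, nentries[0] = %ld, nentries[1] = %ld\n",
		   nentries[0] + nentries[1], nentries[0], nentries[1]);
	return 0;
}
```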
On Mon, Mar 21, 2016 at 5:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Mar 20, 2016 at 3:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sat, Mar 19, 2016 at 7:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> >> > Won't in theory, without patch as well nentries can overflow after
> >> > running
> >> > for very long time? I think with patch it is more prone to overflow
> >> > because
> >> > we start borrowing from other free lists as well.
> >>
> >> Uh, I don't think so. Without the patch, there is just one entries
> >> counter and it goes up and down. How would it ever overflow?
> >
> > I thought it can overflow because we haven't kept any upper limit on
> > incrementing it unless the memory finishes (ofcourse that is just a
> > theoretical assumption, as the decrements will keep the number in control),
> > so are you thinking about the risk of overflow with patch because we have to
> > use sum of all the nentries from all the arrays for total or is there any
> > thing else which makes you think that changing nentries into arrays of
> > nentries can make it prone to overflow?
>
> Well, I mean, perhaps nentries could overflow if you had more than
> 2^32 elements, but I'm not even positive we support that. If you
> assume a fixed table with a million entries, the nentries value can
> vary only between 0 and a million. But now split that into a bunch of
> separate counters. The increment when you allocate an entry and the
> decrement when you put one back don't have to hit the same bucket,
>
This is the point where I think I am missing something about patch. As far as I can understand, it uses the same freelist index (freelist_idx) for allocating and putting back the entry, so I think the chance of increment in one list and decrement in another is there when the value of freelist_idx is calculated differently for the same input, is it so, or there is something else in patch which I am missing?
> This is the point where I think I am missing something about patch.
> As far as I can understand, it uses the same freelist index
> (freelist_idx) for allocating and putting back the entry, so I think
> the chance of increment in one list and decrement in another is there
> when the value of freelist_idx is calculated differently for the same
> input, is it so, or there is something else in patch which I am
> missing?

You are right, nentries _can't_ be negative unless we are using getpid()
for calculating freelist_idx, since same index of nentries[] is used
when we add (increment) and remove (decrement) element from/to hash
table. The fact that we also borrow elements from other freelists if
there is no more elements in our freelist doesn't change anything.

> One idea is to jigger things so that we maintain a count of the total
> number of entries that doesn't change except when we allocate, and
> then for each freelist partition we maintain the number of entries in
> that freelist partition. So then the size of the hash table, instead
> of being sum(nentries) is totalsize - sum(nfree).

This is an interesting idea. Still, I strongly disagree that it should
be implemented in this concrete patch and discussed in this concrete
thread. Obviously this type of change deserves separate research and
discussion, since it has nothing to do with performance and since we
agreed that the "change LOG2_NUM_LOCK_PARTITIONS value" and "change the
calling convention for ShmemInitHash()" patches should be implemented
separately. I added it to my TODO list.

Once again I suggest we merge this patch already:

http://www.postgresql.org/message-id/CA+Tgmobtf9nH566_jjs=jrTyMq5HdQdaRF5j7o+AbdOWQHE_Ow@mail.gmail.com

I have a strong feeling that we are just wasting our time here.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
On Mon, Mar 21, 2016 at 4:48 AM, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:
>> This is the point where I think I am missing something about patch.
>> As far as I can understand, it uses the same freelist index
>> (freelist_idx) for allocating and putting back the entry, so I think
>> the chance of increment in one list and decrement in another is there
>> when the value of freelist_idx is calculated differently for the same
>> input, is it so, or there is something else in patch which I am
>> missing?
>
> You are right, nentries _can't_ be negative unless we are using getpid()
> for calculating freelist_idx, since same index of nentries[] is used
> when we add (increment) and remove (decrement) element from/to hash
> table. The fact that we also borrow elements from other freelists if
> there is no more elements in our freelist doesn't change anything.

Ah. OK, I missed that.

> Once again I suggest we merge this patch already:
>
> http://www.postgresql.org/message-id/CA+Tgmobtf9nH566_jjs=jrTyMq5HdQdaRF5j7o+AbdOWQHE_Ow@mail.gmail.com
>
> I have a strong feeling that we are just wasting our time here.

That is possible. However, I would like it if you would give me the
benefit of the doubt and assume that, if I seem to be more cautious
than you would be were you a committer, there might possibly be some
good reasons for that. The fact is that, despite being more cautious
than some people think I should be, I still manage to introduce quite
a number of bugs via the patches I commit - see the thread 'Missing
rows with index scan when collation is not "C"' on pgsql-bugs for just
the very latest example of that. Nobody thinks that will happen with
*their* patch, of course, but it does all the same.

I'd still like an answer to the question of why this helps so much
when there must be huge amounts of false sharing between the different
mutexes. Maybe it doesn't matter, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> > I have a strong feeling that we are just wasting our time here.
>
> That is possible. However, I would like it if you would give me the
> benefit of the doubt and assume that, if I seem to be more cautious
> than you would be were you a committer, there might possibly be some
> good reasons for that. The fact is that, despite being more cautious
> than some people think I should be, I still manage to introduce quite
> a number of bugs via the patches I commit - see the thread 'Missing
> rows with index scan when collation is not "C"' on pgsql-bugs for just
> the very latest example of that. Nobody thinks that will happen with
> *their* patch, of course, but it does all the same.

Oh, that explains a lot! You see, I couldn't figure out what was
happening. The patch was discussed and reviewed a long time ago,
everyone seems to be reasonably happy with it, etc. But for some reason
it has been ignored for weeks and no one says why. Now it makes sense.

I should probably mention that this patch was merged into the PostgresPro
build a few months ago. Several of our clients are already using this
code in a production environment (guess who discovered this issue and
wanted it to be fixed). There have been no complaints so far.

> I'd still like an answer to the question of why this helps so much
> when there must be huge amounts of false sharing between the different
> mutexes. Maybe it doesn't matter, though.

Well, the reason is that there is no bottleneck here any more. The code
spends something like 1% of its time here and 99% somewhere else. There
is false sharing, but it's not as expensive as the other 99% of the
code. Thus optimizing this 1% any further doesn't give a noticeable
performance improvement.

I still believe that optimizing 1% blindly without considering possible
side effects this optimization can bring (other data alignment, some
additional machine instructions - just to name a few) and having no
way to measure these side effects is a bad idea. But I also admit a
possibility that I can somehow be wrong on this. So I rewrote this
patch once again :'( the way you suggested (without that alignment
related hack I tried, it's just too ugly). I also attached original
hashhdr-rmh.patch just to have both patches in one message so it would
be easier to find both patches in this thread.

If there are any other questions or doubts regarding these patches,
please don't hesitate to ask.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
Attachment
On Wed, Mar 23, 2016 at 5:49 AM, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:
> I still believe that optimizing 1% blindly without considering possible
> side effects this optimization can bring (other data alignment, some
> additional machine instructions - just to name a few) and having no
> way to measure these side effects is a bad idea. But I also admit a
> possibility that I can somehow be wrong on this. So I rewrote this
> patch once again :'( the way you suggested (without that alignment
> related hack I tried, it's just too ugly). I also attached original
> hashhdr-rmh.patch just to have both patches in one message so it would
> be easier to find both patches in this thread.

Thanks! I can't think of anything else to worry about with regards to
that version, so I have committed it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> Thanks! I can't think of anything else to worry about with regards to
> that version, so I have committed it.
>

Thanks, Robert. And thanks everyone for contributing to this patch.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/