Thread: Patch: fix lock contention for HASHHDR.mutex
Hello all,

Consider the following stack trace:

(gdb) bt
#0  0x00007f77c81fae87 in semop () syscall-template.S:81
#1  0x000000000063b721 in PGSemaphoreLock pg_sema.c:387
#2  0x0000000000692e1b in LWLockAcquire lwlock.c:1026
#3  0x000000000068d4c5 in LockAcquireExtended lock.c:881
#4  0x000000000068dcc1 in LockAcquire lock.c:672
#5  0x000000000068b7a8 in LockRelationOid lmgr.c:112
#6  0x0000000000501d18 in find_inheritance_children pg_inherits.c:120
#7  0x0000000000501d80 in find_all_inheritors pg_inherits.c:182
#8  0x000000000062db8d in expand_inherited_rtentry prepunion.c:1285
#9  expand_inherited_tables prepunion.c:1212
#10 0x0000000000622705 in subquery_planner planner.c:501
#11 0x0000000000622d31 in standard_planner planner.c:285
#12 0x000000000069ef0c in pg_plan_query postgres.c:809
#13 0x000000000069f004 in pg_plan_queries postgres.c:868
#14 0x00000000006a0fc2 in exec_simple_query postgres.c:1033
#15 PostgresMain postgres.c:4032
#16 0x0000000000467479 in BackendRun postmaster.c:4237
#17 BackendStartup postmaster.c:3913
#18 ServerLoop () postmaster.c:1684
#19 0x000000000064c828 in PostmasterMain postmaster.c:1292
#20 0x0000000000467f3e in main main.c:223

It turns out PostgreSQL can spend a lot of time waiting for a lock in this particular place, especially if you are running PostgreSQL on a 60-core server. That is obviously a pretty bad sign.

You can easily reproduce this issue on a regular Core i7 server. Use the attached schema.sql file to create a database and run:

pgbench -j 8 -c 8 -f pgbench.sql -T 300 my_database 2>/dev/null &

While this example is running, connect to some PostgreSQL process with gdb and run bt/c from time to time. You will see that PostgreSQL waits for this particular lock quite often.

The problem is that the code between LWLockAcquire (lock.c:881) and LWLockRelease (lock.c:1020) can _sometimes_ run up to 3-5 ms. Using the good old gettimeofday-and-logging method I managed to find the bottleneck:

-- proclock = SetupLockInTable [lock.c:892]
 `-- proclock = (PROCLOCK *) hash_search_with_hash_value [lock.c:1105]
  `-- currBucket = get_hash_entry(hashp) [dynahash.c:985]
   `-- SpinLockAcquire(&hctl->mutex) [dynahash.c:1187]

If my measurements are not wrong (I didn't place gettimeofday between SpinLockAcquire/SpinLockRelease, etc.) we sometimes spend about 3 ms here waiting for a spinlock, which doesn't seem right.

I managed to fix this behaviour by modifying the choose_nelem_alloc procedure in dynahash.c (see attached patch). The idea is to double the number of items we allocate when there are no more free items in the hash table, so we need half as many allocations, which reduces contention. This patch doesn't cause any performance degradation according to pgbench, `make check` passes, etc.

Best regards,
Aleksander
Attachment
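For readers who want to repeat the measurement, here is a minimal sketch of the gettimeofday-and-logging approach described above. The helper names and the 1 ms threshold are illustrative only (they are not part of the attached patch), and the code assumes it is compiled inside the backend so that elog() is available.

```
/* Crude timing probe of the kind described above -- illustrative only. */
#include <sys/time.h>

static struct timeval probe_start;

static void
probe_begin(void)
{
	gettimeofday(&probe_start, NULL);
}

static void
probe_end(const char *where)
{
	struct timeval now;
	long		usec;

	gettimeofday(&now, NULL);
	usec = (now.tv_sec - probe_start.tv_sec) * 1000000L +
		   (now.tv_usec - probe_start.tv_usec);

	/* Log only the slow cases, e.g. anything above 1 ms. */
	if (usec > 1000)
		elog(LOG, "%s took %ld us", where, usec);
}
```

Calling probe_begin() before the LWLockAcquire at lock.c:881 and probe_end() after the matching LWLockRelease, then moving the probes inward, is essentially the bisection that leads to the SpinLockAcquire call shown above.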
Aleksander Alekseev <a.alekseev@postgrespro.ru> writes: > Turns out PostgreSQL can spend a lot of time waiting for a lock in this > particular place, especially if you are running PostgreSQL on 60-core > server. Which obviously is a pretty bad sign. > ... > I managed to fix this behaviour by modifying choose_nelem_alloc > procedure in dynahash.c (see attached patch). TBH, this is just voodoo. I don't know why this change would have made any impact on lock acquisition performance, and neither do you, and the odds are good that it's pure chance that it changed anything. One likely theory is that you managed to shift around memory allocations so that something aligned on a cacheline boundary when it hadn't before. But, of course, the next patch that changes allocations anywhere in shared memory could change that back. There are lots of effects like this that appear or disappear based on seemingly unrelated code changes when you're measuring edge-case performance. The patch is not necessarily bad in itself. As time goes by and machines get bigger, it can make sense to allocate more memory at a time to reduce memory management overhead. But arguing for it on the basis that it fixes lock allocation behavior with 60 cores is just horsepucky. What you were measuring there was steady-state hash table behavior, not the speed of the allocate-some-more-memory code path. regards, tom lane
Hello, Tom I see your point, but I would like to clarify a few things. 1. Do we consider described measurement method good enough to conclude that sometimes PostgreSQL really spends 3 ms in a spinlock (like a RTT between two Internet hosts in the same city)? If not, what method should be used to approve or disapprove this? 2. If we agree that PostgreSQL does sometimes spend 3 ms in a spinlock do we consider this a problem? 3. If we consider this a problem, what method is considered appropriate to find a real reason of such behaviour so we could fix it? Best regards, Aleksander
Oops. s/approve or disapprove/confirm or deny/ On Fri, 11 Dec 2015 19:14:41 +0300 Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote: > Hello, Tom > > I see your point, but I would like to clarify a few things. > > 1. Do we consider described measurement method good enough to conclude > that sometimes PostgreSQL really spends 3 ms in a spinlock (like a RTT > between two Internet hosts in the same city)? If not, what method > should be used to approve or disapprove this? > > 2. If we agree that PostgreSQL does sometimes spend 3 ms in a spinlock > do we consider this a problem? > > 3. If we consider this a problem, what method is considered > appropriate to find a real reason of such behaviour so we could fix > it? > > Best regards, > Aleksander > >
On 11 December 2015 at 16:14, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
--
I see your point, but I would like to clarify a few things.
1. Do we consider described measurement method good enough to conclude
that sometimes PostgreSQL really spends 3 ms in a spinlock (like a RTT
between two Internet hosts in the same city)? If not, what method
should be used to approve or disapprove this?
Timing things seems fine. You may have located an important issue.
2. If we agree that PostgreSQL does sometimes spend 3 ms in a spinlock
do we consider this a problem?
Probably, but then Tom didn't question 1 or 2. What he questioned was your fix.
3. If we consider this a problem, what method is considered appropriate
to find a real reason of such behaviour so we could fix it?
The problem you identify is in only one place, yet your fix changes many parts of the code. Why is that the best fix out of the many possible ones? Why would such a change be acceptable?
I personally don't know the answers to those questions. It would be wonderful if somebody else would structure our lives such that all we had to do was find simple answers, but that isn't the way life is. You ask for a chain of logical thought, but it is for you to create one, somehow: patches are default-reject, not committer-explain-why-reject.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello, Tom. I was exploring this issue further and discovered something strange. "PROCLOCK hash" and "LOCK hash" are hash tables in shared memory. All memory for these tables is in fact pre-allocated. But for some reason these two tables are created (lock.c:394) with init_size =/= max_size. It causes small overhead on calling memory allocator after hash table creation and additional locking/unlocking. I checked all other hash tables created via ShmemInitHash. All of these tables have init_size == max_size. I suggest to create "PROCLOCK hash" and "LOCK hash" with init_size == max_size too (see attached patch). Granted this change doesn't cause any noticeable performance improvements according to pgbench. Nevertheless it looks like a very right thing to do which at least doesn't make anything worse. If this patch seems to be OK next logical step I believe would be to remove init_size parameter in ShmemInitHash procedure since in practice it always equals max_size. On Fri, 11 Dec 2015 10:30:30 -0500 Tom Lane <tgl@sss.pgh.pa.us> wrote: > Aleksander Alekseev <a.alekseev@postgrespro.ru> writes: > > Turns out PostgreSQL can spend a lot of time waiting for a lock in > > this particular place, especially if you are running PostgreSQL on > > 60-core server. Which obviously is a pretty bad sign. > > ... > > I managed to fix this behaviour by modifying choose_nelem_alloc > > procedure in dynahash.c (see attached patch). > > TBH, this is just voodoo. I don't know why this change would have > made any impact on lock acquisition performance, and neither do you, > and the odds are good that it's pure chance that it changed > anything. One likely theory is that you managed to shift around > memory allocations so that something aligned on a cacheline boundary > when it hadn't before. But, of course, the next patch that changes > allocations anywhere in shared memory could change that back. There > are lots of effects like this that appear or disappear based on > seemingly unrelated code changes when you're measuring edge-case > performance. > > The patch is not necessarily bad in itself. As time goes by and > machines get bigger, it can make sense to allocate more memory at a > time to reduce memory management overhead. But arguing for it on the > basis that it fixes lock allocation behavior with 60 cores is just > horsepucky. What you were measuring there was steady-state hash > table behavior, not the speed of the allocate-some-more-memory code > path. > > regards, tom lane > >
Attachment
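To make the suggestion concrete, here is roughly what the change amounts to in InitLocks() (lock.c). This is a simplified sketch rather than the attached patch, and the flags and sizing details are paraphrased from memory, so treat it as an illustration only:

```
	HASHCTL		info;
	long		max_table_size = NLOCKENTS();

	MemSet(&info, 0, sizeof(info));
	info.keysize = sizeof(LOCKTAG);
	info.entrysize = sizeof(LOCK);
	info.num_partitions = NUM_LOCK_PARTITIONS;

	/*
	 * Previously init_size was max_table_size / 2; passing max_table_size
	 * for both init_size and max_size means every element is allocated up
	 * front, so no further shared memory allocations (and no extra
	 * locking) happen for this table after startup.
	 */
	LockMethodLockHash = ShmemInitHash("LOCK hash",
									   max_table_size,
									   max_table_size,
									   &info,
									   HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
```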
On 2015-12-11 17:00:01 +0300, Aleksander Alekseev wrote:
> The problem is that code between LWLockAcquire (lock.c:881) and
> LWLockRelease (lock.c:1020) can _sometimes_ run up to 3-5 ms. Using
> old-good gettimeofday and logging method I managed to find a bottleneck:
>
> -- proclock = SetupLockInTable [lock.c:892]
> `-- proclock = (PROCLOCK *) hash_search_with_hash_value [lock.c:1105]
> `-- currBucket = get_hash_entry(hashp) [dynahash.c:985]
> `-- SpinLockAcquire(&hctl->mutex) [dynahash.c:1187]
>
> If my measurements are not wrong (I didn't place gettimeofday between
> SpinLockAquire/SpinLockRelease, etc) we sometimes spend about 3 ms here
> waiting for a spinlock, which doesn't seems right.

Well, it's quite possible that your process was scheduled out, while waiting for that spinlock. Which'd make 3ms pretty normal.

I'd consider using a LWLock instead of a spinlock here. I've seen this contended in a bunch of situations, and the queued behaviour, combined with directed wakeups on the OS level, ought to improve the worst case behaviour measurably.

Greetings,

Andres Freund
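As a rough illustration of this suggestion, the critical section in dynahash's get_hash_entry() could be guarded by an LWLock instead of the spinlock. Note that HASHHDR has no such freelist_lock field today; it is a hypothetical addition that would have to be initialized (e.g. with LWLockInitialize()) when the shared hash table is created:

```
	/* Hypothetical sketch only -- not code from this thread's patches. */
	LWLockAcquire(&hctl->freelist_lock, LW_EXCLUSIVE);

	newElement = hctl->freeList;
	if (newElement != NULL)
	{
		hctl->freeList = newElement->link;
		hctl->nentries++;
	}

	LWLockRelease(&hctl->freelist_lock);
```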
On Tue, Dec 15, 2015 at 7:25 AM, Andres Freund <andres@anarazel.de> wrote: > On 2015-12-11 17:00:01 +0300, Aleksander Alekseev wrote: >> The problem is that code between LWLockAcquire (lock.c:881) and >> LWLockRelease (lock.c:1020) can _sometimes_ run up to 3-5 ms. Using >> old-good gettimeofday and logging method I managed to find a bottleneck: >> >> -- proclock = SetupLockInTable [lock.c:892] >> `-- proclock = (PROCLOCK *) hash_search_with_hash_value [lock.c:1105] >> `-- currBucket = get_hash_entry(hashp) [dynahash.c:985] >> `-- SpinLockAcquire(&hctl->mutex) [dynahash.c:1187] >> >> If my measurements are not wrong (I didn't place gettimeofday between >> SpinLockAquire/SpinLockRelease, etc) we sometimes spend about 3 ms here >> waiting for a spinlock, which doesn't seems right. > > Well, it's quite possible that your process was scheduled out, while > waiting for that spinlock. Which'd make 3ms pretty normal. > > I'd consider using a LWLock instead of a spinlock here. I've seen this > contended in a bunch of situations, and the queued behaviour, combined > with directed wakeups on the OS level, ought to improve the worst case > behaviour measurably. Amit had the idea a while back of trying to replace the HASHHDR mutex with something based on atomic ops. It seems hard to avoid the attendant A-B-A problems but maybe there's a way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-12-17 09:47:57 -0500, Robert Haas wrote:
> On Tue, Dec 15, 2015 at 7:25 AM, Andres Freund <andres@anarazel.de> wrote:
> > I'd consider using a LWLock instead of a spinlock here. I've seen this
> > contended in a bunch of situations, and the queued behaviour, combined
> > with directed wakeups on the OS level, ought to improve the worst case
> > behaviour measurably.
>
> Amit had the idea a while back of trying to replace the HASHHDR mutex
> with something based on atomic ops. It seems hard to avoid the
> attendant A-B-A problems but maybe there's a way.

I'd really like to see it being replaced by a queuing lock (i.e. lwlock) before we go there. And then maybe partition the freelist, and make nentries an atomic. Just doing those might already be good enough and should be a lot easier.

Andres
> It'd really like to see it being replaced by a queuing lock
> (i.e. lwlock) before we go there. And then maybe partition the
> freelist, and make nentries an atomic.

I believe I just implemented something like this (see attachment). The idea is to partition the PROCLOCK hash table manually into NUM_LOCK_PARTITIONS smaller, non-partitioned hash tables. Since these tables are non-partitioned, the spinlock is not used and there is no lock contention.

On a 60-core server we gain 3.5-4 more TPS according to the benchmark described above. As I understand it, there is no performance degradation in other cases (different CPU, traditional pgbench, etc).

If this patch seems OK, I believe we could consider applying the same change not only to the PROCLOCK hash table.
Attachment
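For illustration, the manual partitioning described above boils down to keeping an array of ordinary, non-partitioned dynahash tables and routing each lookup to one of them by hash code. The names below are made up for this sketch and do not come from the attached patch:

```
/* Illustrative sketch only. */
static HTAB *ProcLockHashParts[NUM_LOCK_PARTITIONS];

/*
 * Pick the sub-table that owns this hash code.  Each sub-table is an
 * ordinary non-partitioned hash table, so get_hash_entry() never touches
 * a shared HASHHDR->mutex; the caller already holds the matching
 * lock-manager partition LWLock.
 */
static inline HTAB *
ProcLockPartitionTable(uint32 hashcode)
{
	return ProcLockHashParts[hashcode % NUM_LOCK_PARTITIONS];
}
```

Lookups then go through hash_search_with_hash_value() on ProcLockPartitionTable(proclock_hashcode) instead of on a single partitioned PROCLOCK hash.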
On Thu, Dec 17, 2015 at 11:03 AM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote: >> It'd really like to see it being replaced by a queuing lock >> (i.e. lwlock) before we go there. And then maybe partition the >> freelist, and make nentries an atomic. > > I believe I just implemented something like this (see attachment). The > idea is to partition PROCLOCK hash table manually into NUM_LOCK_ > PARTITIONS smaller and non-partitioned hash tables. Since these tables > are non-partitioned spinlock is not used and there is no lock > contention. Oh, that's an interesting idea. I guess the problem is that if the freelist is unshared, then users might get an error that the lock table is full when some other partition still has elements remaining. > On 60-core server we gain 3.5-4 more TPS according to benchmark > described above. As I understand there is no performance degradation in > other cases (different CPU, traditional pgbench, etc). 3.5-4 more TPS, or 3.5 times more TPS? Can you share the actual numbers? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> Oh, that's an interesting idea. I guess the problem is that if the
> freelist is unshared, then users might get an error that the lock
> table is full when some other partition still has elements remaining.

True, but I don't believe it should be a real problem assuming we have a strong hash function. If a user gets such an error it means that the database is under heavy load and there are not many free elements left in the other tables either. This particular user didn't get lucky; some other will. In any case the administrator should do something about it - fight a DDoS attack, tune database parameters, etc.

> 3.5-4 more TPS, or 3.5 times more TPS? Can you share the actual
> numbers?

Oops, 3.5-4 _times_ more TPS, i.e. 2206 TPS vs 546 TPS.
On Thu, Dec 17, 2015 at 8:44 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2015-12-17 09:47:57 -0500, Robert Haas wrote:
> > On Tue, Dec 15, 2015 at 7:25 AM, Andres Freund <andres@anarazel.de> wrote:
> > > I'd consider using a LWLock instead of a spinlock here. I've seen this
> > > contended in a bunch of situations, and the queued behaviour, combined
> > > with directed wakeups on the OS level, ought to improve the worst case
> > > behaviour measurably.
> >
> > Amit had the idea a while back of trying to replace the HASHHDR mutex
> > with something based on atomic ops. It seems hard to avoid the
> > attendant A-B-A problems but maybe there's a way.
>
> It'd really like to see it being replaced by a queuing lock
> (i.e. lwlock) before we go there. And then maybe partition the freelist,
> and make nentries an atomic. Just doing those might already be good
> enough and should be a lot easier.
>
Makes sense to me, but I think we should also try the Group leader idea used for the ProcArrayLock optimisation, as during those tests I found that it gives better results compared to partitioning.
> Oops, 3.5-4 _times_ more TPS, i.e. 2206 TPS vs 546 TPS.

In fact those numbers are for a similar but slightly different benchmark (same schema, without checks on child tables). Here are the exact numbers for the benchmark described above.

Before:

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 20433
latency average: 93.966 ms
tps = 679.698439 (including connections establishing)
tps = 680.353897 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 19111
latency average: 100.466 ms
tps = 635.763523 (including connections establishing)
tps = 636.112682 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 19218
latency average: 99.906 ms
tps = 639.506848 (including connections establishing)
tps = 639.838757 (excluding connections establishing)

After:

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 95900
latency average: 20.021 ms
tps = 3194.142762 (including connections establishing)
tps = 3196.091843 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 96837
latency average: 19.827 ms
tps = 3225.822355 (including connections establishing)
tps = 3227.762847 (excluding connections establishing)

$ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database
transaction type: Custom query
scaling factor: 1
query mode: simple
number of clients: 64
number of threads: 64
duration: 30 s
number of transactions actually processed: 96143
latency average: 19.970 ms
tps = 3202.637126 (including connections establishing)
tps = 3204.070466 (excluding connections establishing)

Ratio:

$ python
min(3194.0, 3225.0, 3202.0) / max(679.0, 635.0, 639.0)
4.703976435935199
On Thu, Dec 17, 2015 at 9:33 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
>
> > It'd really like to see it being replaced by a queuing lock
> > (i.e. lwlock) before we go there. And then maybe partition the
> > freelist, and make nentries an atomic.
>
> I believe I just implemented something like this (see attachment). The
> idea is to partition PROCLOCK hash table manually into NUM_LOCK_
> PARTITIONS smaller and non-partitioned hash tables. Since these tables
> are non-partitioned spinlock is not used and there is no lock
> contention.
>
This idea can improve the situation with ProcLock hash table, but I
think IIUC what Andres is suggesting would reduce the contention
around dynahash freelist and can be helpful in many more situations
including BufMapping locks.
> This idea can improve the situation with ProcLock hash table, but I
> think IIUC what Andres is suggesting would reduce the contention
> around dynahash freelist and can be helpful in many more situations
> including BufMapping locks.

I agree. But as I understand PostgreSQL community doesn't generally approve big changes that affects whole system. Especially if original problem was only in one particular place. Therefore for now I suggest only a small change. Naturally if it will be accepted there is no reason not to apply same changes for BufMapping or even dynahash itself with corresponding PROCLOCK hash refactoring.

BTW could you (or anyone?) please help me find this thread regarding BufMapping or perhaps provide a benchmark? I would like to reproduce this issue but I can't find anything relevant in a mailing list. Also it seems to be a good idea to compare alternative approaches that were mentioned (atomics ops, group leader). Are there any discussions, benchmarks or patches regarding this topic?

Frankly I have serious doubts regarding atomics ops since they will more likely create the same contention that a spinlock does. But perhaps there is a patch that works not the way I think it could work.
On 2015-12-18 11:40:58 +0300, Aleksander Alekseev wrote: > $ pgbench -j 64 -c 64 -f pgbench.sql -T 30 my_database What's in pgbench.sql? Greetings, Andres Freund
> What's in pgbench.sql? It's from first message of this thread: http://www.postgresql.org/message-id/20151211170001.78ded9d7@fujitsu
On Fri, Dec 18, 2015 at 2:50 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
>
> > This idea can improve the situation with ProcLock hash table, but I
> > think IIUC what Andres is suggesting would reduce the contention
> > around dynahash freelist and can be helpful in many more situations
> > including BufMapping locks.
>
> I agree. But as I understand PostgreSQL community doesn't generally
> approve big changes that affects whole system. Especially if original
> problem was only in one particular place. Therefore for now I suggest
> only a small change. Naturally if it will be accepted there is no
> reason not to apply same changes for BufMapping or even dynahash itself
> with corresponding PROCLOCK hash refactoring.
>
> BTW could you (or anyone?) please help me find this thread regarding
> BufMapping or perhaps provide a benchmark?
>
You can find that in the thread below:
This even contains the original idea and an initial patch for replacing
spinlocks with atomic ops. I mentioned the A-B-A problem a few mails
down in that thread and gave a link to a paper suggesting how it can be
solved. It needs more work, but it is doable.
> I would like to reproduce
> this issue but I can't find anything relevant in a mailing list. Also
> it seems to be a good idea to compare alternative approaches that were
> mentioned (atomics ops, group leader). Are there any discussions,
> benchmarks or patches regarding this topic?
>
You can find the discussion and patch related to Group leader approach
in the thread:
This patch is already committed.
> Frankly I have serious doubts regarding atomics ops since they will more
> likely create the same contention that a spinlock does. But perhaps
> there is a patch that works not the way I think it could work.
>
I think it is difficult to say without implementing it. If we want to evaluate
multiple approaches, then what we can do here is this: I can help with writing
a patch using LWLocks, you can evaluate the atomic ops approach, and we
already have your current patch; then we can see what works out best.
> Oh, that's an interesting idea. I guess the problem is that if the > freelist is unshared, then users might get an error that the lock > table is full when some other partition still has elements remaining. Could we split one freelist in hash to NUM_LOCK_PARTITIONS freelists? Each partition will have its own freelist and if freelist is empty then partition should search an entry in freelists of other partitions. To prevent concurrent access it's needed to add one LWLock to hash, each partition should lock LWlock in share mode to work with its own freelist and exclusive to work with other freelists. Actually, I'd like to improve all partitioned hashes instead of improve only one case. -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
> Could we split one freelist in hash to NUM_LOCK_PARTITIONS freelists? > Each partition will have its own freelist and if freelist is empty > then partition should search an entry in freelists of other > partitions. To prevent concurrent access it's needed to add one > LWLock to hash, each partition should lock LWlock in share mode to > work with its own freelist and exclusive to work with other freelists. > > Actually, I'd like to improve all partitioned hashes instead of > improve only one case. It seems to be a most promising direction of research for now. I will send a patch and benchmark results soon.
On Fri, Dec 18, 2015 at 8:46 AM, Teodor Sigaev <teodor@sigaev.ru> wrote: >> Oh, that's an interesting idea. I guess the problem is that if the >> freelist is unshared, then users might get an error that the lock >> table is full when some other partition still has elements remaining. > > Could we split one freelist in hash to NUM_LOCK_PARTITIONS freelists? > Each partition will have its own freelist and if freelist is empty then > partition should search an entry in freelists of other partitions. To > prevent concurrent access it's needed to add one LWLock to hash, each > partition should lock LWlock in share mode to work with its own freelist and > exclusive to work with other freelists. > > Actually, I'd like to improve all partitioned hashes instead of improve only > one case. Yeah. I'm not sure that should be an LWLock rather than a spinlock, but we can benchmark it both ways. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
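For readers following along, here is a hedged sketch of what the "array of spinlocks" variant of this idea might look like in dynahash's shared header: one freelist, protected by its own spinlock and with its own entry counter, per partition. The struct and field names are illustrative and are not taken from any of the patches attached to this thread:

```
/* Illustrative sketch only. */
typedef struct FreeListData
{
	slock_t		mutex;			/* protects this freelist */
	long		nentries;		/* number of entries handed out from it */
	HASHELEMENT *freeList;		/* chain of free elements */
} FreeListData;

typedef struct HASHHDR_SKETCH
{
	FreeListData freeLists[NUM_LOCK_PARTITIONS];
	/* ... the rest of the header stays as it is today ... */
} HASHHDR_SKETCH;
```

The LWLock variant would simply replace the slock_t with an LWLock; as noted above, both are worth benchmarking.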
> > Actually, I'd like to improve all partitioned hashes instead of
> > improve only one case.
>
> Yeah. I'm not sure that should be an LWLock rather than a spinlock,
> but we can benchmark it both ways.

I would like to share some preliminary results. I tested four implementations:

- no locks and no element stealing from other partitions;
- single LWLock per partitioned table;
- single spinlock per partitioned table;
- NUM_LOCK_PARTITIONS spinlocks per partitioned table.

Interestingly, "Shared Buffer Lookup Table" (see buf_table.c) has 128 partitions. The constant NUM_BUFFER_PARTITIONS was increased from 16 to 128 in commit 3acc10c9:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3acc10c997f916f6a741d0b4876126b7b08e3892;hp=952872698d9443fdf9b808a1376017f00c91065a

Obviously, after splitting a freelist into NUM_LOCK_PARTITIONS partitions (and assuming that all necessary locking/unlocking is done on the calling side) tables can't have more than NUM_LOCK_PARTITIONS partitions, because that would cause race conditions. For this reason I had to define NUM_BUFFER_PARTITIONS as NUM_LOCK_PARTITIONS and compare the behaviour of PostgreSQL depending on different values of NUM_LOCK_PARTITIONS.

So here are the results.

Core i7, pgbench -j 8 -c 8 -T 30 pgbench
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  | master   | no locks | lwlock   | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          | array
-----------|----------|----------|----------|----------|----------
           |   295.4  |   297.4  |   299.4  |   285.6  |   302.7
 (1 << 4)  |   286.1  |   300.5  |   283.4  |   300.9  |   300.4
           |   300.0  |   300.0  |   302.1  |   300.7  |   300.3
-----------|----------|----------|----------|----------|----------
           |          |   296.7  |   299.9  |   298.8  |   298.3
 (1 << 5)  |   ----   |   301.9  |   302.2  |   305.7  |   306.3
           |          |   287.7  |   301.0  |   303.0  |   304.5
-----------|----------|----------|----------|----------|----------
           |          |   296.4  |   300.5  |   302.9  |   304.6
 (1 << 6)  |   ----   |   301.7  |   305.6  |   306.4  |   302.3
           |          |   299.6  |   304.5  |   306.6  |   300.4
-----------|----------|----------|----------|----------|----------
           |          |   295.9  |   298.7  |   295.3  |   305.0
 (1 << 7)  |   ----   |   299.5  |   300.5  |   299.0  |   310.2
           |          |   287.8  |   285.9  |   300.2  |   302.2

Core i7, pgbench -j 8 -c 8 -f big_table.sql -T 30 my_database
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  | master   | no locks | lwlock   | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          | array
-----------|----------|----------|----------|----------|----------
           |   505.1  |   521.3  |   511.1  |   524.4  |   501.6
 (1 << 4)  |   452.4  |   467.4  |   509.2  |   472.3  |   453.7
           |   435.2  |   462.4  |   445.8  |   467.9  |   467.0
-----------|----------|----------|----------|----------|----------
           |          |   514.8  |   476.3  |   507.9  |   510.6
 (1 << 5)  |   ----   |   457.5  |   491.2  |   464.6  |   431.7
           |          |   442.2  |   457.0  |   495.5  |   448.2
-----------|----------|----------|----------|----------|----------
           |          |   516.4  |   502.5  |   468.0  |   521.3
 (1 << 6)  |   ----   |   463.6  |   438.7  |   488.8  |   455.4
           |          |   434.2  |   468.1  |   484.7  |   433.5
-----------|----------|----------|----------|----------|----------
           |          |   513.6  |   459.4  |   519.6  |   510.3
 (1 << 7)  |   ----   |   470.1  |   454.6  |   445.5  |   415.9
           |          |   459.4  |   489.7  |   457.1  |   452.8

60-core server, pgbench -j 64 -c 64 -T 30 pgbench
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  | master   | no locks | lwlock   | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          | array
-----------|----------|----------|----------|----------|----------
           |  3156.2  |  3157.9  |  3542.0  |  3444.3  |  3472.4
 (1 << 4)  |  3268.5  |  3444.7  |  3485.7  |  3486.0  |  3500.5
           |  3251.2  |  3482.3  |  3398.7  |  3587.1  |  3557.7
-----------|----------|----------|----------|----------|----------
           |          |  3352.7  |  3556.0  |  3543.3  |  3526.8
 (1 << 5)  |   ----   |  3465.0  |  3475.2  |  3486.9  |  3528.4
           |          |  3410.0  |  3482.0  |  3493.7  |  3444.9
-----------|----------|----------|----------|----------|----------
           |          |  3437.8  |  3413.1  |  3445.8  |  3481.6
 (1 << 6)  |   ----   |  3470.1  |  3478.4  |  3538.5  |  3579.9
           |          |  3450.8  |  3431.1  |  3509.0  |  3512.5
-----------|----------|----------|----------|----------|----------
           |          |  3425.4  |  3534.6  |  3414.7  |  3517.1
 (1 << 7)  |   ----   |  3436.5  |  3430.0  |  3428.0  |  3536.4
           |          |  3455.6  |  3479.7  |  3573.4  |  3543.0

60-core server, pgbench -j 64 -c 64 -f big_table.sql -T 30 my_database
(3 tests, TPS excluding connections establishing)

NUM_LOCK_  | master   | no locks | lwlock   | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          | array
-----------|----------|----------|----------|----------|----------
           |   661.1  |  4639.6  |  1435.2  |   445.9  |  1589.6
 (1 << 4)  |   642.9  |  4566.7  |  1410.3  |   457.1  |  1601.7
           |   643.9  |  4621.8  |  1404.8  |   489.0  |  1592.6
-----------|----------|----------|----------|----------|----------
           |          |  4721.9  |  1543.1  |   499.1  |  1596.9
 (1 << 5)  |   ----   |  4506.8  |  1513.0  |   528.3  |  1594.7
           |          |  4744.7  |  1540.3  |   524.0  |  1593.0
-----------|----------|----------|----------|----------|----------
           |          |  4649.1  |  1564.5  |   475.9  |  1580.1
 (1 << 6)  |   ----   |  4671.0  |  1560.5  |   485.6  |  1589.1
           |          |  4751.0  |  1557.4  |   505.1  |  1580.3
-----------|----------|----------|----------|----------|----------
           |          |  4657.7  |  1551.8  |   534.7  |  1585.1
 (1 << 7)  |   ----   |  4616.8  |  1546.8  |   495.8  |  1623.4
           |          |  4779.2  |  1538.5  |   537.4  |  1588.5

All four implementations (W.I.P. quality --- dirty code, no comments, etc.) are attached to this message. The schema of my_database and the big_table.sql file are attached to the first message of this thread.

The large spread of TPS on the Core i7 is due to the fact that it's actually my laptop with other applications running beside PostgreSQL. Still we see that all solutions are equally good on this CPU and there is no performance degradation.

Now regarding the 60-core server:

- one spinlock per hash table doesn't scale. I personally was expecting this;
- LWLocks and an array of spinlocks do scale on NUMA up to a certain point;
- the best results are shown by "no locks".

I believe that the "no locks" implementation is the best one, since it is at least 3 times faster on NUMA than any other implementation. Also it is simpler and doesn't have stealing-from-other-freelists logic that executes rarely and therefore is a likely source of bugs. Regarding the ~16 elements per freelist which in some corner cases could but wouldn't be used --- as I mentioned before, I believe it's not such a big problem. Also it's a small price to pay for 3 times more TPS.

Regarding NUM_LOCK_PARTITIONS (and NUM_BUFFER_PARTITIONS) I have some doubts. For sure Robert had a good reason for committing 3acc10c9. Unfortunately I'm not familiar with the story behind this commit. What do you think?
Attachment
On Tue, Dec 22, 2015 at 10:39 AM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote: > Obviously after splitting a freelist into NUM_LOCK_PARTITIONS > partitions (and assuming that all necessary locking/unlocking is done > on calling side) tables can't have more than NUM_LOCK_PARTITIONS > partitions because it would cause race conditions. For this reason I > had to define NUM_BUFFER_PARTITIONS as NUM_LOCK_PARTITIONS and compare > behaviour of PostgreSQL depending on different values of > NUM_LOCK_PARTITIONS. This is not at all obvious. There is no obvious reason why the number of ways that the freelist is partitioned needs to have any relation at all to the number of ways that the hash table itself is partitioned. Probably, each process should pick a freelist affinity based on MyBackendId % number_of_partitions or something like that. If it doesn't find one there, it can search the others, but always starting with the one for which it has an affinity. That way, the number of processes contending on a single spinlock will be limited to max_connections/number_of_partitions except when the table is very nearly full. That may happen a lot, though, for the buffer mapping table, so we might need to over-allocate to compensate. We want it to be the case that a backend will almost always find a free entry on its own freelist when it needs one, and only occasionally need to visit any other freelist. > Now regarding 60-core server: > > - One spinlock per hash table doesn't scale. I personally was expecting > this; > - LWLock's and array of spinlocks do scale on NUMA up to a certain > point; > - Best results are shown by "no locks"; > > I believe that "no locks" implementation is the best one since it is at > least 3 times faster on NUMA then any other implementation. Also it is > simpler and doesn't have stealing-from-other-freelists logic that > executes rarely and therefore is a likely source of bugs. Regarding ~16 > elements of freelists which in some corner cases could but wouldn't be > used --- as I mentioned before I believe its not such a big problem. > Also its a small price to pay for 3 times more TPS. I doubt it. Having the system become fundamentally less reliable is a big cost. If the system tries to find a lock table entry and fails, the user's query is going to error out. They are not at that point going to say, well, it's good that my system runs fast even though I can't run pg_dump. They are going to say, running pg_dump used to work and now it fails. The consequences of failing to make an entry in the buffer mapping table are at least as bad. And there's a second point, too, which is that you haven't proven that accepting the costs of your proposed model is even necessary. You've tried a few experiments with other approaches, not even all of the ones that were proposed in the earlier thread, and you don't seem to have done any investigation of why they didn't perform as well, or if you did, it's not discussed in this email. So maybe those problems can be fixed, after all. > Regarding NUM_LOCK_PARTITIONS (and NUM_BUFFER_PARTITIONS) I have some > doubts. For sure Robert had a good reason for committing 3acc10c9. > Unfortunately I'm not familiar with a story behind this commit. What do > you think? See the thread "Scaling shared buffer eviction". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
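A hedged sketch of the freelist-affinity idea described above, reusing the FreeListData shape from the earlier sketch; none of this code appears in the patches attached to this thread:

```
/* Illustrative sketch only: prefer "our" freelist, fall back to others. */
static HASHELEMENT *
get_free_entry_with_affinity(FreeListData *freeLists, int nFreeLists)
{
	int			home = MyBackendId % nFreeLists;
	int			i;

	for (i = 0; i < nFreeLists; i++)
	{
		int			idx = (home + i) % nFreeLists;
		HASHELEMENT *elem;

		SpinLockAcquire(&freeLists[idx].mutex);
		elem = freeLists[idx].freeList;
		if (elem != NULL)
		{
			freeLists[idx].freeList = elem->link;
			freeLists[idx].nentries++;
			SpinLockRelease(&freeLists[idx].mutex);
			return elem;
		}
		SpinLockRelease(&freeLists[idx].mutex);
	}

	return NULL;				/* caller reports the table as full */
}
```

With enough over-allocation the loop almost always succeeds on its first iteration, which is exactly the property the mail above is after.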
Here is another preliminary result I would like to share. As always, you will find the corresponding patch in the attachment. It is of work-in-progress quality.

The idea was suggested by a colleague of mine, Aleksander Lebedev.

The freeList is partitioned like in the "no lock" patch. When there are not enough free items in a freeList, we borrow AN items from a global list. When a freeList becomes too large, we return AN items back to the global list. This global list is also partitioned into PN partitions, each protected by a spinlock. This way we have less lock contention than in the "lwlock" or "spinlock array" versions, since we borrow multiple free elements, not one at a time. Also, in the worst case only AN*(NUM_LOCK_PARTITIONS-1) free items are left unused instead of (Total/NUM_LOCK_PARTITIONS)*(NUM_LOCK_PARTITIONS-1).

On the 60-core server TPS depends on AN and PN like this:

       | AN = 1 | AN = 2 | AN = 4 | AN = 8 | AN =16 | AN =32
-------|--------|--------|--------|--------|--------|--------
       |  733.0 | 1120.6 | 1605.5 | 1842.5 | 1545.5 | 1237.0
PN = 1 |  740.3 | 1127.0 | 1634.2 | 1800.8 | 1573.5 | 1245.1
       |  742.9 | 1102.1 | 1647.2 | 1853.6 | 1533.4 | 1251.9
-------|--------|--------|--------|--------|--------|--------
       | 1052.0 | 1438.1 | 1755.6 | 1981.0 | 2022.0 | 1816.8
PN = 2 | 1044.8 | 1453.1 | 1784.0 | 1958.3 | 2033.2 | 1819.2
       | 1028.7 | 1419.8 | 1809.2 | 1981.2 | 2028.2 | 1790.2
-------|--------|--------|--------|--------|--------|--------
       | 1182.0 | 1521.5 | 1813.2 | 1932.6 | 2035.2 | 1948.4
PN = 4 | 1212.4 | 1535.4 | 1816.8 | 1927.0 | 2018.7 | 2014.6
       | 1189.4 | 1528.9 | 1816.9 | 1942.6 | 2011.9 | 2018.3
-------|--------|--------|--------|--------|--------|--------
       | 1148.1 | 1522.2 | 1795.4 | 1926.6 | 2031.7 | 2015.6
PN = 8 | 1175.6 | 1529.4 | 1807.6 | 1913.5 | 2007.3 | 2062.0
       | 1169.9 | 1528.0 | 1796.3 | 1926.0 | 2011.1 | 2042.8
-------|--------|--------|--------|--------|--------|--------
       | 1117.7 | 1491.0 | 1803.9 | 1925.3 | 2029.4 | 2056.2
PN =16 | 1132.8 | 1481.0 | 1809.6 | 1968.1 | 2033.8 | 2068.5
       | 1131.4 | 1481.8 | 1819.4 | 1946.2 | 2071.1 | 2073.8

AN = GLOBAL_FREE_LIST_ALLOC_NUMBER
PN = GLOBAL_FREE_LIST_PARTITIONS_NUM

There is no performance degradation on the Core i7. By increasing PN or AN any further we don't gain any more TPS.

As you can see, this version is about 30% faster than "lwlock" or "spinlock array" and 3.1 times faster than master. Still, it's about 2.5 times slower than the "no locks" version, which I find frustrating.

Next I will try to speed up this version by modifying the global_free_list_* procedures; the current implementations are not the most efficient ones. Also I'm planning to explore approaches that involve lock-free algorithms.

I would like to know your opinion about such an approach. For instance, could we spare AN*(NUM_LOCK_PARTITIONS-1) items in the worst case, or can't we, for the same reason we can't do it for (Total / NUM_LOCK_PARTITIONS) * (NUM_LOCK_PARTITIONS-1)?
Attachment
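A hedged sketch of the borrow/return scheme described above, with hypothetical names (GLOBAL_FREE_LIST_ALLOC_NUMBER plays the role of AN, GLOBAL_FREE_LIST_PARTITIONS_NUM the role of PN). This is not the attached patch; the symmetric "return AN elements" path and the choice of global partition are omitted:

```
/* Illustrative sketch only. */
#define GLOBAL_FREE_LIST_ALLOC_NUMBER	8	/* "AN" */
#define GLOBAL_FREE_LIST_PARTITIONS_NUM	16	/* "PN" */

typedef struct GlobalFreeList
{
	slock_t		mutex;			/* protects one global partition */
	HASHELEMENT *head;			/* chain of spare elements */
} GlobalFreeList;

/* Move up to AN elements from one global partition to a local freelist. */
static int
borrow_from_global(GlobalFreeList *global, HASHELEMENT **localFreeList)
{
	int			moved = 0;

	SpinLockAcquire(&global->mutex);
	while (moved < GLOBAL_FREE_LIST_ALLOC_NUMBER && global->head != NULL)
	{
		HASHELEMENT *elem = global->head;

		global->head = elem->link;
		elem->link = *localFreeList;
		*localFreeList = elem;
		moved++;
	}
	SpinLockRelease(&global->mutex);

	return moved;
}
```

Borrowing AN elements under a single spinlock acquisition is what keeps contention on the global list well below that of the original single HASHHDR mutex.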
Good news, everyone!

I discovered that NUM_LOCK_PARTITIONS is a bottleneck for the last patch. Long story short - NUM_LOCK_PARTITIONS = (1 << 7) gives the best performance:

PN = 16, AN = 8, NUM_LOCK_PARTITIONS = (1 << 7): 4782.9
PN = 16, AN = 4, NUM_LOCK_PARTITIONS = (1 << 7): 4089.9
PN = 16, AN = 2, NUM_LOCK_PARTITIONS = (1 << 7): 2822.5

PN = 8, AN = 8, NUM_LOCK_PARTITIONS = (1 << 7): 4391.7
PN = 8, AN = 4, NUM_LOCK_PARTITIONS = (1 << 7): 3665.0
PN = 8, AN = 2, NUM_LOCK_PARTITIONS = (1 << 7): 2205.7

PN = 4, AN = 8, NUM_LOCK_PARTITIONS = (1 << 7): 4031.9
PN = 4, AN = 4, NUM_LOCK_PARTITIONS = (1 << 7): 3378.2
PN = 4, AN = 2, NUM_LOCK_PARTITIONS = (1 << 7): 2142.4

Also I tried to optimize the global_free_list_* procedures as I planned. The optimized version is asymptotically faster but works about 5% slower in practice because of some additional operations we need to perform. The patch is attached to this message in case you would like to examine the code.

Here is a funny thing - the benchmark results I shared on 22.12.2015 are wrong, because I forgot to run `make clean` after changing lwlock.h (autotools doesn't rebuild the project properly after changing .h files). Here are the correct results:

60-core server, pgbench -j 64 -c 64 -f pgbench.sql -T 50 -P 1 my_database

NUM_LOCK_  | master   | no locks | lwlock   | spinlock | spinlock
PARTITIONS | (99ccb2) |          |          |          | array
-----------|----------|----------|----------|----------|----------
  1 << 4   |   660.4  |  2296.3  |  1441.8  |   454.4  |  1597.8
-----------|----------|----------|----------|----------|----------
  1 << 5   |   536.6  |  3006.5  |  1632.3  |   381.8  |  1896.8
-----------|----------|----------|----------|----------|----------
  1 << 6   |   469.9  |  4083.5  |  1693.4  |   427.7  |  2271.5
-----------|----------|----------|----------|----------|----------
  1 << 7   |   430.8  |  4653.0  |  2016.3  |   361.1  |  3277.1

As you can see, the "spinlock array" version works quite well. It is almost 5 times faster than the current dynahash implementation and only 30% slower than the "no locks" version. It also guarantees that we will use all available memory.

I experimented with various improvements of "spinlock array" as well. E.g. I tried to borrow elements not one at a time, but it didn't cause any performance improvements.

In case you believe that 5 times faster is not good enough, we can also use sharded-global-free-list.patch with appropriate AN and PN values. Still, this patch doesn't guarantee that all available memory will be used (the "no lock" patch doesn't guarantee it either).

Considering that the benchmark above is not for all possible cases but for a very specific one, that the "spinlock array" patch has much better guarantees than the "no lock" patch, and that lock-free algorithms are pretty complicated and error-prone, I suggest we call the "spinlock array" solution good enough, at least until someone complains again that dynahash is a bottleneck. Naturally, first I will clean up the code, write more comments, and then re-check once again that there is no performance degradation according to the usual pgbench, etc.
Attachment
Here is a clean version of the patch. Step 1 is an optimization. Step 2 refactors this:

 HTAB *
 ShmemInitHash(const char *name, /* table string name for shmem index */
- long init_size, /* initial table size */
+ long init_size, /* initial table size XXX ignored, refactor! */

As I understand there is no performance degradation:

Before
======

Core i7, pgbench -j 8 -c 8 -T 50 pgbench:

```
number of transactions actually processed: 14315
latency average: 27.945 ms
latency stddev: 44.920 ms
tps = 286.097043 (including connections establishing)
tps = 286.127642 (excluding connections establishing)
```

60-core server, pgbench -j 64 -c 64 -T 50 pgbench:

```
number of transactions actually processed: 176127
latency average: 18.162 ms
latency stddev: 23.861 ms
tps = 3521.069672 (including connections establishing)
tps = 3522.166706 (excluding connections establishing)
```

After
=====

Core i7, pgbench -j 8 -c 8 -T 50 pgbench:

```
number of transactions actually processed: 14212
latency average: 27.984 ms
latency stddev: 43.250 ms
tps = 284.038588 (including connections establishing)
tps = 285.112772 (excluding connections establishing)
```

60-core server, pgbench -j 64 -c 64 -T 50 pgbench:

```
number of transactions actually processed: 172135
latency average: 17.767 ms
latency stddev: 22.692 ms
tps = 3590.410302 (including connections establishing)
tps = 3594.607165 (excluding connections establishing)
```
Attachment
Aleksander Alekseev wrote: > Here is a funny thing - benchmark results I shared 22.12.2015 are wrong > because I forgot to run `make clean` after changing lwlock.h (autotools > doesn't rebuild project properly after changing .h files). Running configure with --enable-depend should avoid this problem. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote:
> Aleksander Alekseev wrote:
>
> > Here is a funny thing - benchmark results I shared 22.12.2015 are wrong
> > because I forgot to run `make clean` after changing lwlock.h (autotools
> > doesn't rebuild project properly after changing .h files).
>
> Running configure with --enable-depend should avoid this problem.

I still maintain that --enable-depend should be on by default. We're absurdly optimizing towards saving a handful of cycles in scenarios which are usually bottlenecked by other things (build boxes spend more time on tests and such), rather than optimizing for developer time. I don't know how many people have failed to set --enable-depend by now, but it definitely goes into several hundreds of wasted hours territory.
Andres Freund wrote: > On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote: > > Aleksander Alekseev wrote: > > > > > Here is a funny thing - benchmark results I shared 22.12.2015 are wrong > > > because I forgot to run `make clean` after changing lwlock.h (autotools > > > doesn't rebuild project properly after changing .h files). > > > > Running configure with --enable-depend should avoid this problem. > > I still maintain that --enable-depend should be on by default. +1 Let's do it. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Agree. --enable-depend turned on by default could save me a lot of time. Now I'm aware regarding --enable-depend and `make clean`. Still other people will definitely do the same mistake over and over again. On Wed, 30 Dec 2015 11:49:13 -0300 Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Andres Freund wrote: > > On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote: > > > Aleksander Alekseev wrote: > > > > > > > Here is a funny thing - benchmark results I shared 22.12.2015 > > > > are wrong because I forgot to run `make clean` after changing > > > > lwlock.h (autotools doesn't rebuild project properly after > > > > changing .h files). > > > > > > Running configure with --enable-depend should avoid this problem. > > > > I still maintain that --enable-depend should be on by default. > > +1 Let's do it. >
On Wed, Dec 30, 2015 at 5:44 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote:
> > Aleksander Alekseev wrote:
> >
> > > Here is a funny thing - benchmark results I shared 22.12.2015 are wrong
> > > because I forgot to run `make clean` after changing lwlock.h (autotools
> > > doesn't rebuild project properly after changing .h files).
> >
> > Running configure with --enable-depend should avoid this problem.
>
> I still maintain that --enable-depend should be on by default. We're
> absurdly optimizing towards saving a handful of cycles in scenarios
> which are usually bottlenecked by other things (build boxes spend more
> times on tests and such), rather than optimizing for developer time. I
> don't know how many people failed setting --enable-depend by now, but it
> definitely goes into several hundres of wasted ours territory.

+1
--enable-depend by default (was Re: Patch: fix lock contention for HASHHDR.mutex)
From: Tom Lane
[ A change as significant as this should not be debated in a thread with a title that suggests it's of interest to only one or two people ] > On Wed, Dec 30, 2015 at 5:44 PM, Andres Freund <andres@anarazel.de> wrote: >> I still maintain that --enable-depend should be on by default. We're >> absurdly optimizing towards saving a handful of cycles in scenarios >> which are usually bottlenecked by other things (build boxes spend more >> times on tests and such) Nonsense. The buildfarm will simply get slower, and at least for my animals it's too damn slow already. We could fix that by having the buildfarm script automatically add --disable-depend; but if we're going to change this, we should roll out that script update BEFORE we change it, not after. regards, tom lane
Re: --enable-depend by default (was Re: Patch: fix lock contention for HASHHDR.mutex)
From: Greg Stark
<p dir="ltr">Alternately the buildfarm could be changed to retain earlier builds and use dependencies to reduce build times.This would have the benefit of detecting dependency failures. At the expense of a lot more noise, at least until weOrrin out whatever dependency failures are there now.<p dir="ltr">Or we could add extra rules so that if you don't configurewith dependencies we check for unclean builds and error out. This would make at least accidental unclean buildsmuch led likely.<div class="gmail_quote">On 30 Dec 2015 16:49, "Tom Lane" <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>>wrote:<br type="attribution" /><blockquote class="gmail_quote" style="margin:00 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">[ A change as significant as this should not be debatedin a thread with<br /> a title that suggests it's of interest to only one or two people ]<br /><br /> > On Wed,Dec 30, 2015 at 5:44 PM, Andres Freund <<a href="mailto:andres@anarazel.de">andres@anarazel.de</a>> wrote:<br />>> I still maintain that --enable-depend should be on by default. We're<br /> >> absurdly optimizing towardssaving a handful of cycles in scenarios<br /> >> which are usually bottlenecked by other things (build boxesspend more<br /> >> times on tests and such)<br /><br /> Nonsense. The buildfarm will simply get slower, andat least for my<br /> animals it's too damn slow already. We could fix that by having the<br /> buildfarm script automaticallyadd --disable-depend; but if we're going<br /> to change this, we should roll out that script update BEFOREwe change<br /> it, not after.<br /><br /> regards, tom lane<br /><br /><br /> --<br /> Sentvia pgsql-hackers mailing list (<a href="mailto:pgsql-hackers@postgresql.org">pgsql-hackers@postgresql.org</a>)<br />To make changes to your subscription:<br /><a href="http://www.postgresql.org/mailpref/pgsql-hackers" rel="noreferrer"target="_blank">http://www.postgresql.org/mailpref/pgsql-hackers</a><br /></blockquote></div>
Re: --enable-depend by default (was Re: Patch: fix lock contention for HASHHDR.mutex)
From: Tom Lane
Greg Stark <stark@mit.edu> writes: > Alternately the buildfarm could be changed to retain earlier builds and use > dependencies to reduce build times. This would have the benefit of > detecting dependency failures. At the expense of a lot more noise, at least > until we Orrin out whatever dependency failures are there now. The buildfarm already made its choice in this area, and that's ccache. Layering --enable-depend on top of builds that are using ccache is simply a waste. (Admittedly, ccache is gcc-specific, but TTBOTMK so is --enable-depend.) regards, tom lane
Re: --enable-depend by default (was Re: Patch: fix lock contention for HASHHDR.mutex)
From: Andres Freund
On 2015-12-30 10:49:27 -0500, Tom Lane wrote:
> > On Wed, Dec 30, 2015 at 5:44 PM, Andres Freund <andres@anarazel.de> wrote:
> >> I still maintain that --enable-depend should be on by default. We're
> >> absurdly optimizing towards saving a handful of cycles in scenarios
> >> which are usually bottlenecked by other things (build boxes spend more
> >> times on tests and such)
>
> Nonsense. The buildfarm will simply get slower, and at least for my
> animals it's too damn slow already.

I think you're overstating the performance impact of --enable-depend. It's just an additional output file for the compiler, given the way we're doing it (-MMD -MP -MF $outfile).

Let's make a quick test. I've built both trees once before, to prime ccache. The test is:

for i in $(seq 1 5); do git clean -dfxq && ./configure --(dis|en)able-depend --quiet && time make -j4 -s;done

(coffee break)

no ccache, --disable-depend: 0m55.810s, 0m58.361s, 0m58.517s, 0m58.674s, 0m58.466s
ccache, --disable-depend:    0m5.248s, 0m5.279s, 0m5.244s, 0m5.771s, 0m5.296s
no ccache, --enable-depend:  0m56.443s, 0m58.507s, 0m58.587s, 0m58.866s, 0m58.429s
ccache, --enable-depend:     0m5.538s, 0m5.518s, 0m5.572s, 0m5.555s, 0m5.528s

Yes, this is on a much faster machine (my laptop) than what some buildfarm animals are running. But given it's really just some additional, *small*, files that are written (compiler) and read (make), I don't see why the impact would be significantly different on slower machines.

> We could fix that by having the
> buildfarm script automatically add --disable-depend; but if we're going
> to change this, we should roll out that script update BEFORE we change
> it, not after.

Sounds reasonable, despite the above. This isn't an urgent change, just something to make newer developers waste less time.
Re: --enable-depend by default (was Re: Patch: fix lock contention for HASHHDR.mutex)
From: Andres Freund
On 2015-12-30 11:13:31 -0500, Tom Lane wrote:
> The buildfarm already made its choice in this area, and that's ccache.
> Layering --enable-depend on top of builds that are using ccache is
> simply a waste.
>
> (Admittedly, ccache is gcc-specific, but TTBOTMK so is --enable-depend.)

For posterity's sake: clang supports the dependency stuff by default, and it's harder but doable to get clang running with ccache.
On Wed, Dec 30, 2015 at 1:10 PM, Oleg Bartunov <obartunov@gmail.com> wrote:
>
>
>
> On Wed, Dec 30, 2015 at 5:44 PM, Andres Freund <andres@anarazel.de> wrote:
>>
>> On 2015-12-30 11:37:19 -0300, Alvaro Herrera wrote:
>> > Aleksander Alekseev wrote:
>> >
>> > > Here is a funny thing - benchmark results I shared 22.12.2015 are wrong
>> > > because I forgot to run `make clean` after changing lwlock.h (autotools
>> > > doesn't rebuild project properly after changing .h files).
>> >
>> > Running configure with --enable-depend should avoid this problem.
>>
>> I still maintain that --enable-depend should be on by default. We're
>
>
> +1
>
--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello
Andres, Robert, are you still reviewing this patch?

Aleksander Alekseev wrote:
> Here is a clean version of the patch. Step 1 is an optimization. Step 2
> refactors this:
>
> HTAB *
> ShmemInitHash(const char *name, /* table string name for shmem index */
> - long init_size, /* initial table size */
> + long init_size, /* initial table size XXX ignored, refactor! */

So you were saying some posts upthread that the new hash tables would "borrow" AN elements from the freelist then put AN elements back when too many got unused. Is that idea embodied in this patch? I'm nervous about continuing a line of development that does away with the "init_size" concept completely, but if this patch keeps the "borrow" concept then maybe it doesn't matter.

> -/* Number of partitions of the shared buffer mapping hashtable */
> -#define NUM_BUFFER_PARTITIONS 128
> -
> /* Number of partitions the shared lock tables are divided into */
> -#define LOG2_NUM_LOCK_PARTITIONS 4
> +#define LOG2_NUM_LOCK_PARTITIONS 7
> #define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)
>
> +/* Number of partitions of the shared buffer mapping hashtable */
> +#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS
> +
> /* Number of partitions the shared predicate lock tables are divided into */
> #define LOG2_NUM_PREDICATELOCK_PARTITIONS 4
> #define NUM_PREDICATELOCK_PARTITIONS (1 << LOG2_NUM_PREDICATELOCK_PARTITIONS)

I find this odd. Why is it a good idea to define the bufMapping partitions like this? What's the explanation for the performance numbers you saw?

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello, Álvaro > So you were saying some posts upthread that the new hash tables would > "borrow" AN elements from the freelist then put AN elements back when > too many got unused. Is that idea embodied in this patch? Right. It didn't satisfy "use all memory" requirement anyway. It's better to sacrifice a few TPS (according only to my benchmark which doesn't represent all possible use cases) than accidentally break someones system after upgrading to the next version of PostgreSQL. See example with pg_dump given by Robert Haas above. >> +/* Number of partitions of the shared buffer mapping hashtable */ >> +#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS >> + >> /* Number of partitions the shared predicate lock tables are divided >> into */ #define LOG2_NUM_PREDICATELOCK_PARTITIONS 4 >> #define NUM_PREDICATELOCK_PARTITIONS (1 << >> LOG2_NUM_PREDICATELOCK_PARTITIONS) > > I find this odd. Why is it a good idea to define the bufMapping > partitions like this? What's the explanation for the performance > numbers you saw? My mistake. I thought that last table was self-explanatory. We see that a single spinlock for accessing a freeList (current implementation) is obviously a bottleneck. Replacing it with say an array of spinlocks reduces lock contention and therefore gives more TPS (see first row). Also we can increase concurrency level even further by increasing number of lock partitions (see columns "no locks", "lwlock" and "spinlock array"). Previously it couldn't help us (see "master" column) because of a bottleneck. If you have any other questions regarding this patch please don't hesitate to ask. Best regards, Aleksander
On 2016-01-11 21:16:42 -0300, Alvaro Herrera wrote: > Andres, Robert, are you still reviewing this patch? I don't think I've signed up to - or actually did - review this patch. Greetings, Andres Freund
Andres Freund wrote: > On 2016-01-11 21:16:42 -0300, Alvaro Herrera wrote: > > Andres, Robert, are you still reviewing this patch? > > I don't think I've signed up to - or actually did - review this patch. Well, maybe not as such, but you replied to the thread several times, which is why I asked. No problem, you don't *have* to review all patches. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 12, 2016 at 1:50 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> increasing number of lock partitions (see columns "no locks", "lwlock"
> and "spinlock array"). Previously it couldn't help us (see "master"
> column) because of a bottleneck.
>
> If you have any other questions regarding this patch please don't
> hesitate to ask.
I have done some performance benchmarking for this patch (dynahash-optimization-v10-step1.patch).
Machine Detail:
cpu : POWER8
cores: 24 (192 with HT)
Non Default Parameter:
------------------------
Shared Buffer= 30GB
max_wal_size= 10GB
max_connections=500
Test1:
schema.sql and pgbench.sql are the same files that Aleksander attached in the first mail of the thread.
psql -d postgres -f schema.sql
pgbench -c$ -j$ -f pgbench.sql postgres
client base patch
1 145 145
2 262 258
4 453 472
8 749 732
16 1114 1128
32 1727 1789
64 2729 2793
128 3534 3520
With this test I did not see any improvement, though I think it should improve the performance. Do you have any suggestions for how to reproduce results like yours?
I have also dumped stacks using a script, and I observed many stack dumps like the one you mentioned in the initial thread. And after the patch, I found that the number of lock-waiting stacks is reduced.
Stack Dump:
-------------------
#0 0x00007f25842899a7 in semop () from /lib64/libc.so.6
#1 0x00000000007116d0 in PGSemaphoreLock (sema=0x7f257cb170d8) at pg_sema.c:387
#2 0x000000000078955f in LWLockAcquire (lock=0x7f247698a980, mode=LW_EXCLUSIVE) at lwlock.c:1028
#3 0x00000000007804c4 in LockAcquireExtended
#4 0x000000000077fe91 in LockAcquire
#5 0x000000000077e862 in LockRelationOid
#6 0x000000000053bc7b in find_inheritance_children
#7 0x000000000053bd4f in find_all_inheritors
#8 0x00000000006fc0a2 in expand_inherited_rtentry
#9 0x00000000006fbf91 in expand_inherited_tables
I have also tried to analyze using perf; I can see that the amount of time taken in hash_search_with_hash_value is reduced from 3.86% (master) to 1.72% (patch).
I plan to do further investigation in different scenarios of dynahash.
--
30.12.2015 16:01, Aleksander Alekseev:
> Here is a clean version of the patch. Step 1 is an optimization. Step 2
> refactors this:
>
> HTAB *
> ShmemInitHash(const char *name, /* table string name for shmem index */
> - long init_size, /* initial table size */
> + long init_size, /* initial table size XXX ignored, refactor! */

Hi, I found that the patch is still not reviewed, so I've looked it over. I haven't dived into this subject before, so my comments will be more general than relating to your investigation. Sorry if some things seem like nitpicks; I just suppose that a clear patch would be more convenient for reviewers.

First of all, why not merge both patches into one? They aren't too big anyway.

Then, this thread became too tangled. I think it's worth writing a new message with the patch, the test script, some results and a brief overview of how it really works. It will make following review much easier.

+ /* number of entries in hash table */
+ long nentries[NUM_LOCK_PARTITIONS];
+ /* linked list of free elements */
+ HASHELEMENT *freeList[NUM_LOCK_PARTITIONS];

I think comments should be changed, to be more informative here.

+ if (IS_PARTITIONED(hashp->hctl))
+ partitions_number = NUM_LOCK_PARTITIONS;
+ else
+ partitions_number = 1;
+
+ nelem_alloc = ((int) nelem) / partitions_number;
+ if (nelem_alloc == 0)
+ nelem_alloc = 1;

Add a comment here too.

-/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS 128
-
 /* Number of partitions the shared lock tables are divided into */
-#define LOG2_NUM_LOCK_PARTITIONS 4
+#define LOG2_NUM_LOCK_PARTITIONS 7
 #define NUM_LOCK_PARTITIONS (1 << LOG2_NUM_LOCK_PARTITIONS)

+/* Number of partitions of the shared buffer mapping hashtable */
+#define NUM_BUFFER_PARTITIONS NUM_LOCK_PARTITIONS
+

I'm sure it was discussed above, but these definitions do not look obvious at all. Maybe you should explain this magic number 7 in the comment above?

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
This patch affects header files. By any chance didn't you forget to run `make clean` after applying it? As we discussed above, when you change .h files autotools doesn't rebuild dependent .c files: http://www.postgresql.org/message-id/CAF4Au4y4MTapP2u3uGtBQd_z0CHEsJWfNAvRgK4umfcWA4ELxQ@mail.gmail.com On Thu, 21 Jan 2016 16:49:12 +0530 Dilip Kumar <dilipbalaut@gmail.com> wrote: > On Tue, Jan 12, 2016 at 1:50 PM, Aleksander Alekseev < > a.alekseev@postgrespro.ru> wrote: > > > > > increasing number of lock partitions (see columns "no locks", > > "lwlock" and "spinlock array"). Previously it couldn't help us (see > > "master" column) because of a bottleneck. > > > > If you have any other questions regarding this patch please don't > > hesitate to ask. > > > > I have done some performance bench marking for this > patch.(dynahash-optimization-v10-step1.patch) > > Machine Detail: > cpu : POWER8 > cores: 24 (192 with HT) > > Non Default Parameter: > ------------------------ > Shared Buffer= 30GB > max_wal_size= 10GB > max_connections=500 > > Test1: > *schema.sql* and *pgbench.sql* are same files which Aleksander has > attached in first mail of the thread. > > psql -d postgres -f schema.sql > pgbench -c$ -j$ -f pgbench.sql postgres > > client base patch > 1 145 145 > 2 262 258 > 4 453 472 > 8 749 732 > 16 1114 1128 > 32 1727 1789 > 64 2729 2793 > 128 3534 3520 > > With this test i did not see any improvement, though i think it should > improve the performance, do you have any suggestion to see the > results same as yours ? > > I have also dump stacks using some script and I have observed many > stacks dumps as you mentioned in initial thread. And after patch, I > found that number of lock waiting stacks are reduced. > > Stack Dump: > ------------------- > #0 0x00007f25842899a7 in semop () from /lib64/libc.so.6 > #1 0x00000000007116d0 in PGSemaphoreLock (sema=0x7f257cb170d8) at > pg_sema.c:387 > #2 0x000000000078955f in LWLockAcquire (lock=0x7f247698a980, > mode=LW_EXCLUSIVE) at lwlock.c:1028 > #3 0x00000000007804c4 in LockAcquireExtended > #4 0x000000000077fe91 in LockAcquire > #5 0x000000000077e862 in LockRelationOid > #6 0x000000000053bc7b in find_inheritance_children > #7 0x000000000053bd4f in find_all_inheritors > #8 0x00000000006fc0a2 in expand_inherited_rtentry > #9 0x00000000006fbf91 in expand_inherited_tables > > I have tried to analyze using perf also, I can see that amount of time > taken in hash_search_with_hash_value is reduced from 3.86%(master) to > 1.72%(patch). > > I have plan to do further investigation, in different scenarios of > dynahash. >
Hi,

> First of all, why not merge both patches into one? They aren't too
> big anyway.

Agree.

> I think comments should be changed, to be more informative here.
> Add a comment here too.
> Maybe you should explain this magic number 7 in the comment above?

Done.

> Then, this thread became too tangled. I think it's worth writing a
> new message with the patch, the test script, some results and a brief
> overview of how it really works. It will make following review much
> easier.

Sure.

HASHHDR represents a hash table. It could be usual or partitioned. A partitioned table is stored in shared memory and accessed by multiple processes simultaneously. To prevent data corruption the hash table is partitioned and each process has to acquire a lock for the corresponding partition before accessing data in it. The number of the partition is determined by the lower bits of the key's hash value. The trickiest part is that dynahash knows nothing about these locks --- all locking is done on the calling side.

Since shared memory is pre-allocated and can't grow, to allocate memory in a hash table we use a freeList. Also we use the nentries field to count the current number of elements in the hash table. Since the hash table is used by multiple processes these fields are protected by a spinlock, which, as it turned out, can cause lock contention and create a bottleneck.

After trying a few approaches I discovered that using one spinlock per partition works quite well. Here are the latest benchmark results:

http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu

Note that the "no locks" solution can't be used because it doesn't guarantee that all available memory will be used in some corner cases.

You can find a few more details and a test script in the first message of this thread. If you have any other questions regarding this patch please don't hesitate to ask.

Best regards, Aleksander
Attachment
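To make the scheme described above easier to picture, here is a minimal standalone sketch (not the dynahash code from the patch) of a shared hash header that keeps one spinlock-protected free list and entry counter per partition. NUM_FREELISTS, the struct names and the pthread spinlock standing in for slock_t are illustrative assumptions only.

```
/*
 * Minimal sketch of the idea described above: one free list, one entry
 * counter and one spinlock per partition, so that backends whose hash
 * values fall into different partitions never touch the same mutex.
 * This is NOT the actual dynahash code; all names are illustrative.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_FREELISTS 32            /* assumed power of two */

typedef struct HashElement
{
    struct HashElement *link;       /* next free element */
} HashElement;

typedef struct FreeListData
{
    pthread_spinlock_t mutex;       /* protects nentries and freeList */
    long        nentries;           /* live entries "owned" by this list */
    HashElement *freeList;          /* chain of unused elements */
} FreeListData;

typedef struct HashHeader
{
    FreeListData freeList[NUM_FREELISTS];
    /* ... directory, bucket count, etc. would follow ... */
} HashHeader;

/* Pick a free list from the hash value; a power of two lets us mask. */
static int
freelist_idx(uint32_t hashvalue)
{
    return (int) (hashvalue & (NUM_FREELISTS - 1));
}

int
main(void)
{
    HashHeader  hdr;
    uint32_t    h = 0xdeadbeefU;
    int         idx;

    for (int i = 0; i < NUM_FREELISTS; i++)
    {
        pthread_spin_init(&hdr.freeList[i].mutex, PTHREAD_PROCESS_PRIVATE);
        hdr.freeList[i].nentries = 0;
        hdr.freeList[i].freeList = NULL;
    }

    /* An insert with hash value h only contends on its own free list. */
    idx = freelist_idx(h);
    pthread_spin_lock(&hdr.freeList[idx].mutex);
    hdr.freeList[idx].nentries++;
    pthread_spin_unlock(&hdr.freeList[idx].mutex);

    printf("hash %08x -> free list %d, nentries %ld\n",
           (unsigned) h, idx, hdr.freeList[idx].nentries);
    return 0;
}
```

The point is simply that two backends whose hash values land in different partitions no longer fight over a single mutex or a single nentries counter.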
<div dir="ltr"><div class="gmail_extra"><br /><div class="gmail_quote">On Fri, Jan 22, 2016 at 3:44 PM, Aleksander Alekseev<span dir="ltr"><<a href="mailto:a.alekseev@postgrespro.ru" target="_blank">a.alekseev@postgrespro.ru</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px#ccc solid;padding-left:1ex"><div class="a3s" id=":4ve" style="overflow:hidden">This patch affects headerfiles. By any chance didn't you forget to run<br /> `make clean` after applying it? As we discussed above, when you<br/> change .h files autotools doesn't rebuild dependent .c files:</div></blockquote></div><br /></div><div class="gmail_extra">Yes,actually i always compile using "make clean;make -j20; make install"<br /></div><div class="gmail_extra">Ifyou want i will run it again may be today or tomorrow and post the result.<br /><br /><br />-- <br/><div class="gmail_signature"><div dir="ltr"><span style="color:rgb(80,0,80);font-size:12.8px">Regards,</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">Dilip Kumar</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">EnterpriseDB: </span><a href="http://www.enterprisedb.com/"style="color:rgb(17,85,204);font-size:12.8px" target="_blank">http://www.enterprisedb.com</a><br/></div></div></div></div>
> > This patch affects header files. By any chance didn't you forget to
> > run `make clean` after applying it? As we discussed above, when you
> > change .h files autotools doesn't rebuild dependent .c files:
>
> Yes, actually I always compile using "make clean; make -j20; make
> install". If you want I will run it again, maybe today or tomorrow,
> and post the result.

Most likely HASHHDR.mutex is not the only bottleneck in your case, so my patch doesn't help much. Unfortunately I don't have access to any POWER8 server so I can't investigate this issue. I suggest using the gettimeofday trick I described in the first message of this thread. It's time-consuming but it gives a clear understanding of which code is keeping a lock.
<br /><br /><div class="moz-cite-prefix">22.01.2016 13:48, Aleksander Alekseev:<br /></div><blockquote cite="mid:20160122134837.788ecf83@fujitsu"type="cite"><blockquote type="cite"><pre wrap="">Then, this thread became too tangled.I think it's worth to write a new message with the patch, the test script, some results and brief overview of how does it really works. It will make following review much easier. </pre></blockquote><pre wrap=""> Sure. HASHHDR represents a hash table. It could be usual or partitioned. Partitioned table is stored in a shared memory and accessed by multiple processes simultaneously. To prevent data corruption hash table is partitioned and each process has to acquire a lock for a corresponding partition before accessing data in it. Number of partition is determine by lower bits of key's hash value. Most tricky part is --- dynahash knows nothing about these locks, all locking is done on calling side. Since shared memory is pre-allocated and can't grow to allocate memory in a hash table we use freeList. Also we use nentries field to count current number of elements in a hash table. Since hash table is used by multiple processes these fields are protected by a spinlock. Which as it turned out could cause lock contention and create a bottleneck. After trying a few approaches I discovered that using one spinlock per partition works quite well. Here are last benchmark results: <a class="moz-txt-link-freetext" href="http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu">http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu</a> Note that "no locks" solution cant be used because it doesn't guarantee that all available memory will be used in some corner cases. You can find a few more details and a test script in the first message of this thread. If you have any other questions regarding this patch please don't hesitate to ask. </pre></blockquote><div class="moz-text-flowed" lang="x-unicode" style="font-family: -moz-fixed; font-size: 12px;">InitLocks<br /> /* <br /> * Compute init/max size to request for lock hashtables. Note these <br /> * calculations must agree with LockShmemSize! <br /> */ <br /><br /> This comment certainly requires some changes.<br /> BTW, could you explain why init_table_size was two times less than max_table_size? <br /> Although, I don'tsee any problems with your changes. <br /><br /><br /> - hctl->nentries = 0; <br /> - hctl->freeList = NULL;<br /><br /> Why did you delete these two lines? I wonder if you should rewrite them instead? <br /><br /> + * Thisparticular number of partitions significantly reduces lock contention <br /> + * in partitioned hash tables, almost ifpartitioned tables didn't use any <br /> + * locking at all <br /><br /> As far as I understood, this number was obtainedexperimentally? Maybe you should note that in the comment. <br /><br /><br /> And the last, but not least.<br /><br/> + nelem_alloc = ((int) nelem) / partitions_number; <br /> + if (nelem_alloc == 0) <br /> + nelem_alloc = 1; <br /> + <br /> + for (i = 0; i < partitions_number; i++) <br /> + if (!element_alloc(hashp,nelem_alloc, i)) <br /> + ereport(ERROR, <br /> + (errcode(ERRCODE_OUT_OF_MEMORY),<br /> + errmsg("out of memory"))); <br /><br /><br /> It looks likeyou forgot to round the result of the division.<br /> For example, if you have nelem=25 and partitions_number=6.<br />25 / 6 = 4. And then you allocate 24 nelems, while 1 is lost. <br /><br /> What about productivity, I did a test on mynotebook. 
Patched version shows small performance improvement.<br /><br /> pgbench -j 1 -c 1 -f pgbench.sql -T 300 postgres<br/> base patched<br /> 127tps 145tps<br /> pgbench -j 8 -c 8 -f pgbench.sql -T 300 postgres<br /> base patched<br/> 247tps 270tps<br /><br /> But I haven't any big machine to perform tests, where the patch is supposed to makesignificant changes.<br /> So I would rely on your and the other reviewers results.<br /><br /> Except mentioned notes,I suppose the patch is good enough to pass it to a reviewer.<br /></div><pre class="moz-signature" cols="72">-- Anastasia Lubennikova Postgres Professional: <a class="moz-txt-link-freetext" href="http://www.postgrespro.com">http://www.postgrespro.com</a> The Russian Postgres Company</pre>
> This comment certainly requires some changes.

Fixed.

> BTW, could you explain why init_table_size was two times less than
> max_table_size?

I have no clue. My best guess is that it was a reasonable thing to do in the past. Then somebody changed the code and now there is little reason to use init_table_size for partitioned tables.

> Why did you delete these two lines? I wonder if you should rewrite
> them instead?

```
 MemSet(hctl, 0, sizeof(HASHHDR));

- hctl->nentries = 0;
- hctl->freeList = NULL;
```

These fields were initialized with zero values twice. It makes little sense to me.

> As far as I understood, this number was obtained experimentally?
> Maybe you should note that in the comment.

These numbers are very platform-specific and will be outdated very soon. I recall that my code was criticized for including exact numbers not long ago. So I suggest keeping this part as is.

> For example, if you have nelem=25 and partitions_number=6: 25 / 6 = 4,
> and then you allocate 24 nelems, while 1 is lost.

Agree. Fixed.

> Except for the mentioned notes, I suppose the patch is good enough

I guess I will mark this patch as "Ready for Committer" then.
Attachment
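For reference, the rounding issue Anastasia spotted above comes down to using a ceiling rather than a truncating division. A tiny standalone illustration with her numbers (the variable names are borrowed from the quoted patch, otherwise hypothetical):

```
#include <stdio.h>

int
main(void)
{
    long        nelem = 25;             /* example from the review */
    int         partitions_number = 6;

    /* Truncating division loses elements: 25 / 6 = 4, and 4 * 6 = 24. */
    long        truncated = nelem / partitions_number;

    /* Rounding up guarantees nelem_alloc * partitions_number >= nelem. */
    long        nelem_alloc = (nelem + partitions_number - 1) / partitions_number;

    printf("truncated:  %ld per partition, %ld total\n",
           truncated, truncated * partitions_number);
    printf("rounded up: %ld per partition, %ld total\n",
           nelem_alloc, nelem_alloc * partitions_number);
    return 0;
}
```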
<div dir="ltr"><div class="gmail_extra"><br /><div class="gmail_quote">On Wed, Jan 27, 2016 at 5:15 PM, Aleksander Alekseev<span dir="ltr"><<a href="mailto:a.alekseev@postgrespro.ru" target="_blank">a.alekseev@postgrespro.ru</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0px 0px0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="" id=":2o6" style="overflow:hidden">Mostlikely HASHHDR.mutex is not only bottleneck in your case so my<br /> patch doesn't help much.Unfortunately I don't have access to any<br /> POWER8 server so I can't investigate this issue. I suggest to use a<br/> gettimeofday trick I described in a first message of this thread. Its<br /> time consuming but it gives a clear understandingwhich code is keeping<br /> a lock.</div></blockquote></div><br />I have also tested the pgbench Readonly testwhen data don't fit into shared buffer,<br />Because in this case HASHHDR.mutex access will be quite frequent.<br />Andin this case i do see very good improvement in POWER8 server.<br /><br />Test Result:<br />------------<br />Scale Factor: 300 <br />Shared Buffer: 512MB <br />pgbench -c$ -j$ -S -M Prepared postgres<br /> <br />Client base patch<br />64 222173 318318<br />128 195805 262290<br clear="all" /><br />-- <br /><divclass="gmail_signature"><div dir="ltr"><span style="color:rgb(80,0,80);font-size:12.8px">Regards,</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">Dilip Kumar</span><br style="color:rgb(80,0,80);font-size:12.8px"/><span style="color:rgb(80,0,80);font-size:12.8px">EnterpriseDB: </span><a href="http://www.enterprisedb.com/"style="color:rgb(17,85,204);font-size:12.8px" target="_blank">http://www.enterprisedb.com</a><br/></div></div></div></div>
On Thu, Jan 21, 2016 at 9:03 AM, Anastasia Lubennikova <a.lubennikova@postgrespro.ru> wrote: > First of all, why not merge both patches into one? They aren't too big > anyway. So looking over this patch, it institutes a new coding rule that all shared hash tables must have the same number of partitions, and that number is hard-coded at 128. The number of freelist partitions is also hardcoded at 128. I find this design surprising. First, I would not have expected that we'd need 128 freelist partitions. We get by just fine with only 16 lock manager partitions, at least AFAIK, and there's no reason some future hashtable might not have little enough contention that 16 partitions works just fine. Even for the buffer manager, which does have 128 partitions, I wouldn't assume that we need 128 partitions for the free list just because we need 128 partitions for the locks themselves. It's possible we do, but I'd expect that existing buffer mapping locking might dissipate some of that contention - e.g. if somebody's doing a buffer lookup on a particular partition, they have a shared lock on that partition, so nobody has an exclusive lock to be able to hit the freelist. So: do we have clear evidence that we need 128 partitions here, or might, say, 16 work just fine? Second, I don't see any reason why the number of free list partitions needs to be a constant, or why it needs to be equal to the number of hash bucket partitions. That just seems like a completely unnecessary constraint. You can take the hash value, or whatever else you use to pick the partition, modulo any number you want. I don't really want to increase the number of lock manager partition locks to 128 just for the convenience of this patch, and even less do I want to impose the requirement that every future shared hash table must use an equal number of partitions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello, Robert. > So: do we have clear evidence that we need 128 partitions here, or > might, say, 16 work just fine? Yes, this number of partitions was chosen based on this benchmark (see "spinlock array" column): http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu In fact we can choose 16, 32, 64 or 128 partitions - any number works better than current implementation with a single spinlock. But according to this benchmark 128 partitions give twice as much TPS then 16 partitions, almost as if we didn't have any locks at all (see "no locks" column). Still I'm a bit concerned that 128 partitions require (128-16)* ( sizeof(slock_t) + sizeof(void*) + sizeof(long) ) bytes of additional memory per hash table which is: (gdb) p (128-16)*(sizeof(slock_t) + sizeof(long) + sizeof(void*)) $1 = 1904 ... bytes of shared memory on amd64. I'm not sure if this is considered OK. If it is I suggest to leave number 128. If it's not, perhaps we could agree on say 64 partitions as a compromise. Described benchmark doesn't represent any possible workload anyway. I personally believe that we don't have so many small hash tables that 1.8 Kb of additional shared memory per hash table would be an issue. > I don't really want to increase the number of lock manager partition > locks to 128 just for the convenience of this patch, and even less do > I want to impose the requirement that every future shared hash table > must use an equal number of partitions. May I ask why? Does it create a real problem? > I don't see any reason why the number of free list partitions needs > to be a constant, or why it needs to be equal to the number of hash > bucket partitions. That just seems like a completely unnecessary > constraint. You can take the hash value, or whatever else you use to > pick the partition, modulo any number you want. True, number of spinlocks / freeLists could be a variable. I believe it's an old-good simplicity vs flexibility question. Lets say we allow user (i.e. calling side) to choose spinlocks and free lists number. So now we should use freeList[hash % items_number] formula which means division instead of binary shift. Do we really need such an expensive operation just to calculate index of a free list? Probably not. Let limit possible items_number values so it could be only power of two. Now there is only four possible values user would more likely to choose (16, 32, 64, 128), which is not as flexible anymore. What about memory? If we allocate mutex, nentries and freeList dynamically depending on items_number value, HASHHDR should store pointers to allocated memory which means additional indirection for almost every access to mutex, nentries and freeList fields. Probably a bad idea. Otherwise we still waste shared memory, probably even more memory since we don't want maximum items_number value to be too low. Etc. I believe such a change will cause only additional code complexity (code is already far from trivial in dynahash.c) so next time it will be much harder to change anything, probably some bugs we will miss and will not solve any real problem. Why just not to choose a constant which seems to be good enough instead?
On Mon, Feb 8, 2016 at 5:39 AM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote: >> So: do we have clear evidence that we need 128 partitions here, or >> might, say, 16 work just fine? > > Yes, this number of partitions was chosen based on this benchmark (see > "spinlock array" column): > > http://www.postgresql.org/message-id/20151229184851.1bb7d1bd@fujitsu It's very hard to understand what that email is trying to say. It uses terms like PN and AN without defining them. There's certainly not enough detail for somebody else to replicate what you did. > In fact we can choose 16, 32, 64 or 128 partitions - any number works > better than current implementation with a single spinlock. But > according to this benchmark 128 partitions give twice as much TPS then > 16 partitions, almost as if we didn't have any locks at all (see "no > locks" column). OK. > Still I'm a bit concerned that 128 partitions require (128-16)* > ( sizeof(slock_t) + sizeof(void*) + sizeof(long) ) bytes of additional > memory per hash table which is: > > (gdb) p (128-16)*(sizeof(slock_t) + sizeof(long) + sizeof(void*)) > $1 = 1904 > > ... bytes of shared memory on amd64. > > I'm not sure if this is considered OK. If it is I suggest to leave > number 128. If it's not, perhaps we could agree on say 64 partitions as > a compromise. Described benchmark doesn't represent any possible > workload anyway. I personally believe that we don't have so many small > hash tables that 1.8 Kb of additional shared memory per hash table would > be an issue. I don't particularly care about using slightly more shared memory per hash table. >> I don't really want to increase the number of lock manager partition >> locks to 128 just for the convenience of this patch, and even less do >> I want to impose the requirement that every future shared hash table >> must use an equal number of partitions. > > May I ask why? Does it create a real problem? It's hard to be 100% sure about the future, but suppose that in another 10 years we discover that the buffer mapping hash table now needs to have 1024 partitions, but the lock manager partition hash table still only needs 16. Do we really want to drag the lock manager and any future shared hash tables up to 1024 just because the buffer manager needs it? That sounds like a bad idea from here. For example, one advantage of having fewer partitions is that it takes less time to lock them all. That matters to - for example - the cost of running the deadlock detector. Is the cost enough to matter in that particular case with the code as it stands? I don't know. Could having more partitions ever be worse than having fewer? I'm pretty sure that the answer is yes. >> I don't see any reason why the number of free list partitions needs >> to be a constant, or why it needs to be equal to the number of hash >> bucket partitions. That just seems like a completely unnecessary >> constraint. You can take the hash value, or whatever else you use to >> pick the partition, modulo any number you want. > > True, number of spinlocks / freeLists could be a variable. I believe > it's an old-good simplicity vs flexibility question. > > Lets say we allow user (i.e. calling side) to choose spinlocks and free > lists number. So now we should use freeList[hash % items_number] formula > which means division instead of binary shift. Do we really need such an > expensive operation just to calculate index of a free list? Probably > not. Let limit possible items_number values so it could be only power > of two. 
> Now there is only four possible values user would more likely to choose
> (16, 32, 64, 128), which is not as flexible anymore.

Sure, I don't see anything wrong with requiring the number of partitions to be a power of two. On the other hand, I also suspect that on modern hardware the cost of a shift versus a modulo operator is almost irrelevant. What's really going to make this fast or slow is the degree to which it has, or does not have, cache line contention.

> What about memory? If we allocate mutex, nentries and freeList
> dynamically depending on items_number value, HASHHDR should store
> pointers to allocated memory which means additional indirection for
> almost every access to mutex, nentries and freeList fields. Probably a
> bad idea. Otherwise we still waste shared memory, probably even more
> memory since we don't want maximum items_number value to be too low.

These seem like problems that admit of more than one good solution, and which you are certainly capable of solving. For example, you could always include space for 128 freelists but not tie the number of freelists to the number of partition locks, or you could always include space for 128 freelists and have the system only use them all if there are 128 or more partitions. If there are fewer than 128 partitions, it uses only as many freelists as there are partitions. Neither of these things sounds very hard to implement.

Basically, the burden for you to impose a new coding rule on everybody who uses shared hash tables in the future is very high. You would need to show that that really isn't going to be an issue for anyone in any reasonable situation. I don't think you can show that, because I don't think it's true. I don't even like the idea of raising the number of partitions for the lock manager for the convenience of your code. If that turns out to have negative performance consequences in some situation, we'd either have to live with or revert your entire patch. Those kinds of trade-offs aren't good for you as a patch author, or for the product.

I'm not going to commit this patch if it imposes new constraints on choosing the number of partitions in a dynamic hash table, and I'm going to oppose anyone else committing it either. If this problem couldn't be solved without imposing such constraints, we might decide (after long soul-searching) that it was worth it. But it seems pretty clear that they aren't necessary in this case, so I'm against them.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
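The shift-versus-modulo point is easy to check: for any power-of-two free list count the modulo reduces to a bit mask, while a non-power-of-two count simply needs a real modulo. A small standalone demonstration (the hash value is arbitrary):

```
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint32_t    hash = 0x9e3779b9U;     /* arbitrary hash value */

    /* For a power-of-two n, hash % n and hash & (n - 1) are the same. */
    for (uint32_t n = 16; n <= 128; n <<= 1)
        assert((hash % n) == (hash & (n - 1)));

    /* A non-power-of-two count works too, it just needs a real modulo. */
    printf("17 free lists -> index %u\n", (unsigned) (hash % 17));
    return 0;
}
```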
I've closed this as returned-with-feedback. Please resubmit once you have found answers to the questions posed; from the submitted benchmark numbers this looks very exciting, but it needs a bit more work. You don't necessarily have to agree with everything Robert says, but you need to have well reasoned explanations for the points where you disagree. Thanks, -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello, Robert

> Basically, the burden for you to impose a new coding rule on everybody
> who uses shared hash tables in the future is very high.

I fixed an issue you described. Number of spinlocks doesn't depend of NUM_LOCK_PARTITIONS anymore and could be specified for each hash table on a calling side.

I did the benchmark described in the first message of this thread again. Currently I don't have access to the same 60-core server so I used a more common 12-core server (24 with HT). According to this benchmark the TPS increment depends on NUM_LOCK_PARTITIONS and the default number of spinlocks this way:

pgbench -f pgbench.sql -T 150 -P 1 -c 40 -j 12

DMN  | NLP = 16 | NLP = 32 | NLP = 64 | NLP = 128
-----|----------|----------|----------|----------
  8  |  +15.1%  |  +28.2%  |  +34.1%  |  +33.7%
 16  |  +16.6%  |  +30.9%  |  +37.0%  |  +40.8%
 32  |  +15.1%  |  +33.9%  |  +39.5%  |  +41.9%
 64  |  +15.0%  |  +31.9%  |  +40.1%  |  +42.9%
128  |   +7.7%  |  +24.7%  |  +29.6%  |  +31.6%

* NLP = NUM_LOCK_PARTITIONS
* DMN = DEFAULT_MUTEXES_NUM

I realize this benchmark doesn't represent any possible workload, so for the attached patch I chose NUM_LOCK_PARTITIONS = DEFAULT_MUTEXES_NUM = 32. It seems to be a reasonable compromise between the speedup according to a "synthetic and meaningless in practice" benchmark and the number of used locks, which could mean quite a lot in practice. Still, these values could easily be changed at any moment.

Here are before/after benchmark results for this concrete patch.

BEFORE

pgbench (default):
tps = 1295.798531 (including connections establishing)
tps = 1295.858295 (excluding connections establishing)

pgbench -f pgbench.sql:
tps = 1020.072172 (including connections establishing)
tps = 1020.116888 (excluding connections establishing)

AFTER

pgbench (default):
tps = 1299.369807 (including connections establishing)
tps = 1299.429764 (excluding connections establishing)

pgbench -f pgbench.sql:
tps = 1365.749333 (including connections establishing)
tps = 1365.814384 (excluding connections establishing)

So as I understand it, this patch solves the lock contention problem and doesn't make anything else worse.
Attachment
On Wed, Feb 10, 2016 at 3:24 AM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote: >> Basically, the burden for you to impose a new coding rule on everybody >> who uses shared hash tables in the future is very high. > > I fixed an issue you described. Number of spinlocks doesn't depend of > NUM_LOCK_PARTITIONS anymore and could be specified for each hash table > on a calling side. Thanks, this looks way better. Some more comments: - I don't think there's any good reason for this patch to change the calling convention for ShmemInitHash(). Maybe that's a good idea and maybe it isn't, but it's a separate issue from what this patch is doing, and if we're going to do it at all, it should be discussed separately. Let's leave it out of this patch. - I would not provide an option to change the number of freelist mutexes. Let's dump DEFAULT_MUTEXES_NUM and MAX_MUTEXES_NUM and have FREELIST_MUTEXES_NUM. The value of 32 which you selected is fine with me. - The change to LOG2_NUM_LOCK_PARTITIONS in lwlock.h looks like an independent change. Again, leave it out of this patch and submit it separately with its own benchmarking data if you want to argue for it. I think if you make these changes this patch will be quite a bit smaller and in pretty good shape. Thanks for working on this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello, Robert > Thanks, this looks way better. Some more comments: > > - I don't think there's any good reason for this patch to change the > calling convention for ShmemInitHash(). Maybe that's a good idea and > maybe it isn't, but it's a separate issue from what this patch is > doing, and if we're going to do it at all, it should be discussed > separately. Let's leave it out of this patch. > > - I would not provide an option to change the number of freelist > mutexes. Let's dump DEFAULT_MUTEXES_NUM and MAX_MUTEXES_NUM and have > FREELIST_MUTEXES_NUM. The value of 32 which you selected is fine with > me. > > - The change to LOG2_NUM_LOCK_PARTITIONS in lwlock.h looks like an > independent change. Again, leave it out of this patch and submit it > separately with its own benchmarking data if you want to argue for it. > > I think if you make these changes this patch will be quite a bit > smaller and in pretty good shape. > > Thanks for working on this. > Here is a corrected patch. I decided to keep a small change in InitLocks procedure (lock.c) since ShmemInitHash description clearly states "For a table whose maximum size is certain, [init_size] should be equal to max_size". Also I added two items to my TODO list to send more patches as soon (and if) this patch will be applied. Thanks for your contribution to this patch.
Attachment
On Thu, Feb 11, 2016 at 9:11 AM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote: >> Thanks, this looks way better. Some more comments: >> >> - I don't think there's any good reason for this patch to change the >> calling convention for ShmemInitHash(). Maybe that's a good idea and >> maybe it isn't, but it's a separate issue from what this patch is >> doing, and if we're going to do it at all, it should be discussed >> separately. Let's leave it out of this patch. >> >> - I would not provide an option to change the number of freelist >> mutexes. Let's dump DEFAULT_MUTEXES_NUM and MAX_MUTEXES_NUM and have >> FREELIST_MUTEXES_NUM. The value of 32 which you selected is fine with >> me. >> >> - The change to LOG2_NUM_LOCK_PARTITIONS in lwlock.h looks like an >> independent change. Again, leave it out of this patch and submit it >> separately with its own benchmarking data if you want to argue for it. >> >> I think if you make these changes this patch will be quite a bit >> smaller and in pretty good shape. >> >> Thanks for working on this. > > Here is a corrected patch. I decided to keep a small change in > InitLocks procedure (lock.c) since ShmemInitHash description clearly > states "For a table whose maximum size is certain, [init_size] should > be equal to max_size". Also I added two items to my TODO list to send > more patches as soon (and if) this patch will be applied. > > Thanks for your contribution to this patch. The fact that InitLocks() doesn't do this has been discussed before and there's no consensus on changing it. It is, at any rate, a separate issue. I'll go through the rest of this patch again now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Feb 11, 2016 at 9:53 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The fact that InitLocks() doesn't do this has been discussed before
> and there's no consensus on changing it. It is, at any rate, a
> separate issue. I'll go through the rest of this patch again now.

I did a little bit of cleanup on this patch and the results are attached. I also did some benchmarking of this version. I tested this on two systems, in each case using five-minute, read-only pgbench runs at scale factor 3000, varying the client count and taking the median of three runs. First, I tested it on hydra, a 2-socket, 16-processor, 64-thread POWER box. This is a community resource hosted at OSUOSL. Second, I tested it on cthulhu, an 8-socket, 64-processor, 128-thread x86_64 box. This is an EnterpriseDB resource. Here are the results:

hydra, master vs. patched
1 client: 8042.872109 vs. 7839.587491 (-2.5%)
64 clients: 213311.852043 vs. 214002.314071 (+0.3%)
96 clients: 219551.356907 vs. 221908.397489 (+1.1%)
128 clients: 210279.022760 vs. 217974.079171 (+3.7%)

cthulhu, master vs. patched
1 client: 3407.705820 vs. 3645.129360 (+7.0%)
64 clients: 88667.681890 vs. 82636.914343 (-6.8%)
96 clients: 79303.750250 vs. 105442.993869 (+33.0%)
128 clients: 74684.510668 vs. 120984.938371 (+62.0%)

Obviously, the high-client count results on cthulhu are stellar. I'm *assuming* that the regressions are just random variation. I am however wondering if it to set the freelist affinity based on something other than the hash value, like say the PID, so that the same process doesn't keep switching to a different freelist for every buffer eviction.

It also strikes me that it's probably quite likely that slock_t mutex[NUM_FREELISTS] is a poor way to lay out this data in memory. For example, on a system where slock_t is just one byte, most likely all of those mutexes are going to be in the same cache line, which means you're going to get a LOT of false sharing. It seems like it would be sensible to define a new struct that contains an slock_t, a long, and a HASHELEMENT *, and then make an array of those structs. That wouldn't completely eliminate false sharing, but it would reduce it quite a bit. My guess is that if you did that, you could reduce the number of freelists to 8 or less and get pretty much the same performance benefit that you're getting right now with 32. And that, too, seems likely to be good for single-client performance.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
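A rough standalone sketch of the layout Robert suggests above: one struct per free list, padded so that neighbouring free lists do not end up in the same cache line. The 64-byte line size, the struct names and the pthread spinlock standing in for slock_t are assumptions for illustration only; a real implementation would also want the array itself cache-line aligned.

```
/*
 * Sketch only: group each free list's spinlock, counter and list head in
 * one struct and pad it to a full cache line, so concurrent updates of
 * neighbouring free lists do not falsely share a line.
 */
#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE_SIZE 64              /* assumed */
#define NUM_FREELISTS   32

typedef struct HashElement
{
    struct HashElement *link;
} HashElement;

typedef union FreeListPadded
{
    struct
    {
        pthread_spinlock_t mutex;
        long        nentries;
        HashElement *freeList;
    }           data;
    char        pad[CACHE_LINE_SIZE];   /* forces one line per free list */
} FreeListPadded;

int
main(void)
{
    FreeListPadded lists[NUM_FREELISTS];

    printf("one free list occupies %zu bytes\n", sizeof(FreeListPadded));
    printf("the whole array occupies %zu bytes\n", sizeof(lists));
    return 0;
}
```

With a layout like this the number of free lists becomes an independent knob: the padding keeps the lists apart from each other no matter how many of them there are.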
On Thu, Feb 11, 2016 at 3:26 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Feb 11, 2016 at 9:53 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> The fact that InitLocks() doesn't do this has been discussed before >> and there's no consensus on changing it. It is, at any rate, a >> separate issue. I'll go through the rest of this patch again now. > > I did a little bit of cleanup on this patch and the results are > attached. I also did some benchmarking of this version. I tested > this on two systems, in each case using five-minute, read-only pgbench > runs at scale factor 3000, varying the client count and taking the > median of three runs. First, I tested it on hydra, a 2-socket, > 16-processor, 64-thread POWER box. This is a community resource > hosted at OSUOSL. Second, I tested it on cthulhu, an 8-socket, > 64-processor, 128-thread x86_64 box. This is an EnterpriseDB > resource. Here are the results: > > hydra, master vs. patched > 1 client: 8042.872109 vs. 7839.587491 (-2.5%) > 64 clients: 213311.852043 vs. 214002.314071 (+0.3%) > 96 clients: 219551.356907 vs. 221908.397489 (+1.1%) > 128 clients: 210279.022760 vs. 217974.079171 (+3.7%) > > cthulhu, master vs. patched > 1 client: 3407.705820 vs. 3645.129360 (+7.0%) > 64 clients: 88667.681890 vs. 82636.914343 (-6.8%) > 96 clients: 79303.750250 vs. 105442.993869 (+33.0%) > 128 clients: 74684.510668 vs. 120984.938371 (+62.0%) > > Obviously, the high-client count results on cthulhu are stellar. I'm > *assuming* that the regressions are just random variation. I am > however wondering if it to set the freelist affinity based on > something other than the hash value, like say the PID, so that the > same process doesn't keep switching to a different freelist for every > buffer eviction. It also strikes me that it's probably quite likely > that slock_t mutex[NUM_FREELISTS] is a poor way to lay out this data > in memory. For example, on a system where slock_t is just one byte, > most likely all of those mutexes are going to be in the same cache > line, which means you're going to get a LOT of false sharing. It > seems like it would be sensible to define a new struct that contains > an slock_t, a long, and a HASHELEMENT *, and then make an array of > those structs. That wouldn't completely eliminate false sharing, but > it would reduce it quite a bit. My guess is that if you did that, you > could reduce the number of freelists to 8 or less and get pretty much > the same performance benefit that you're getting right now with 32. > And that, too, seems likely to be good for single-client performance. One other thing: I don't see what prevents the nentries count for a particular freelist from going negative, which would subsequently cause an Assertion failure. I wonder if we should restructure the code so that we keep track of the total number of elements allocated and the number on each freelist, so that the count of elements = elements allocation - sum(elements on each freelist). This would require a little more code churn but it seems that the result would be more clearly correct. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello, Robert

> It also strikes me that it's probably quite likely that slock_t
> mutex[NUM_FREELISTS] is a poor way to lay out this data in memory.
> For example, on a system where slock_t is just one byte, most likely
> all of those mutexes are going to be in the same cache line, which
> means you're going to get a LOT of false sharing. It seems like it
> would be sensible to define a new struct that contains an slock_t, a
> long, and a HASHELEMENT *, and then make an array of those structs.
> That wouldn't completely eliminate false sharing, but it would reduce
> it quite a bit. My guess is that if you did that, you could reduce
> the number of freelists to 8 or less and get pretty much the same
> performance benefit that you're getting right now with 32. And that,
> too, seems likely to be good for single-client performance.

I experimented for a while trying to fit every spinlock in a separate cache line. Indeed we can gain some speedup this way. Here are benchmark results on a 12-core server for NUM_LOCK_PARTITIONS = 32 (in this case the difference is more notable):

| FREELISTS | SIZE = 32 | SIZE = 64 | SIZE = 128 | SIZE = 256 |
|-----------|-----------|-----------|------------|------------|
| 4         | +25.4%    | +28.7%    | +28.4%     | +28.3%     |
| 8         | +28.2%    | +29.4%    | +31.7%     | +31.4%     |
| 16        | +29.9%    | +32.6%    | +31.6%     | +30.8%     |
| 32        | +33.7%    | +34.2%    | +33.6%     | +32.6%     |

Here SIZE is short for FREELIST_BUFFER_SIZE (part of a hack I used to align the FREELIST structure, see attached patch). The cache line on this CPU is 64 bytes according to /sys/devices/system/cpu/cpu0/cache/index0/*

I would like to remind that without these changes we had the following picture (please note the "NLP = 32" column):

pgbench -f pgbench.sql -T 150 -P 1 -c 40 -j 12

DMN  | NLP = 16 | NLP = 32 | NLP = 64 | NLP = 128
-----|----------|----------|----------|----------
  8  |  +15.1%  |  +28.2%  |  +34.1%  |  +33.7%
 16  |  +16.6%  |  +30.9%  |  +37.0%  |  +40.8%
 32  |  +15.1%  |  +33.9%  |  +39.5%  |  +41.9%
 64  |  +15.0%  |  +31.9%  |  +40.1%  |  +42.9%
128  |   +7.7%  |  +24.7%  |  +29.6%  |  +31.6%

* NLP = NUM_LOCK_PARTITIONS
* DMN = DEFAULT_MUTEXES_NUM

So it's +34.2% vs +33.9% and the number of freeLists remains the same.

> I am however wondering if it to set the freelist affinity based on
> something other than the hash value, like say the PID, so that the
> same process doesn't keep switching to a different freelist for every
> buffer eviction.

Also I tried to use the PID to determine the freeList number:

```
#include <sys/types.h>
#include <unistd.h>

...

#define FREELIST_IDX(hctl, hashcode) \
    (IS_PARTITIONED(hctl) ? ((uint32) getpid()) % NUM_FREELISTS : 0)

...

// now nentries could be negative in this case
// Assert(FREELIST(hctl, freelist_idx).nentries > 0);
```

Unfortunately this approach gives +33.9% TPS in the best case. Also there is a possibility that nentries will overflow. So far I don't have a clear idea how to solve this issue effectively. I'm not sure if further improvements are worth the effort.
Attachment
On Fri, Feb 12, 2016 at 9:04 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
In which case, do you think entries can go negative? I think the nentries is incremented and decremented in the way as without patch, so I am not getting what can make it go negative.
Hello, Robert
> It also strikes me that it's probably quite likely that slock_t
> mutex[NUM_FREELISTS] is a poor way to lay out this data in memory.
> For example, on a system where slock_t is just one byte, most likely
> all of those mutexes are going to be in the same cache line, which
> means you're going to get a LOT of false sharing. It seems like it
> would be sensible to define a new struct that contains an slock_t, a
> long, and a HASHELEMENT *, and then make an array of those structs.
> That wouldn't completely eliminate false sharing, but it would reduce
> it quite a bit. My guess is that if you did that, you could reduce
> the number of freelists to 8 or less and get pretty much the same
> performance benefit that you're getting right now with 32. And that,
> too, seems likely to be good for single-client performance.
I experimented for a while trying to fit every spinlock in a separate
cache line. Indeed we can gain some speedup this way. Here are
benchmark results on 12-core server for NUM_LOCK_PARTITIONS = 32 (in
this case difference is more notable):
| FREELISTS | SIZE = 32 | SIZE = 64 | SIZE = 128 | SIZE = 256 |
|-----------|------------|------------|------------|------------|
| 4 | +25.4% | +28.7% | +28.4% | +28.3% |
| 8 | +28.2% | +29.4% | +31.7% | +31.4% |
| 16 | +29.9% | +32.6% | +31.6% | +30.8% |
| 32 | +33.7% | +34.2% | +33.6% | +32.6% |
Here SIZE is short for FREELIST_BUFFER_SIZE (part of a hack I used to
align FREELIST structure, see attached patch).
I am not sure, if this is exactly what has been suggested by Robert, so it is not straightforward to see if his suggestion can allow us to use NUM_FREELISTS as 8 rather than 32. I think instead of trying to use FREELISTBUFF, why not do it as Robert has suggested and try with different values of NUM_FREELISTS?
> I am however wondering if it to set the freelist affinity based on
> something other than the hash value, like say the PID, so that the
> same process doesn't keep switching to a different freelist for every
> buffer eviction.
Also I tried to use PID to determine freeList number:
```
#include <sys/types.h>
#include <unistd.h>
...
#define FREELIST_IDX(hctl, hashcode) \
(IS_PARTITIONED(hctl) ? ((uint32)getpid()) % NUM_FREELISTS : 0)
...
// now nentries could be negative in this case
// Assert(FREELIST(hctl, freelist_idx).nentries > 0);
Hello, Amit

> I am not sure, if this is exactly what has been suggested by Robert,
> so it is not straightforward to see if his suggestion can allow us to
> use NUM_FREELISTS as 8 rather than 32. I think instead of trying to
> use FREELISTBUFF, why not do it as Robert has suggested and try with
> different values of NUM_FREELISTS?

Well, what I did is in fact a bit more general solution then Robert
suggested. At first I just joined freeList and mutex arrays into one
array and tried different NUM_FREELISTS values (see FREELISTS column).
That was Robert's idea. Unfortunately it didn't give any considerable
speedup comparing with previous approach.

Then I tried to manually control sizeof every array item (see
different SIZE values). By making it larger we can put every array item
into a separate cache line. This approach helped a bit in terms of TPS
(before: +33.9%, after: +34.2%) but only if NUM_FREELISTS remains the
same (32).

So answering your question - it turned out that we _can't_ reduce
NUM_FREELISTS this way.

Also I don't believe that +0.3% TPS according to synthetic benchmark
that doesn't represent any possible workload worth it considering
additional code complexity.

> In which case, do you think entries can go negative? I think the
> nentries is incremented and decremented in the way as without patch,
> so I am not getting what can make it go negative.

In this context we are discussing a quick and dirty fix "what would
happen if FREELIST_IDX would be implemented as getpid() % NUM_FREE_LIST
instead of hash % NUM_FREELIST". The code is:

int freelist_idx = FREELIST_IDX(hctl, hashvalue);

// ...

switch (action)
{

// ...

case HASH_REMOVE:
if (currBucket != NULL)
{
if (IS_PARTITIONED(hctl))
SpinLockAcquire(&(FREELIST(hctl, freelist_idx).mutex));

// Assert(FREELIST(hctl, freelist_idx).nentries > 0);
FREELIST(hctl, freelist_idx).nentries--;

/* remove record from hash bucket's chain. */
*prevBucketPtr = currBucket->link;

// ...

No matter what freelist was used when element was added to a hash table.
We always try to return free memory to the same freelist number getpid()
% FREELIST_ITEMS and decrease number of elements in a hash table using
corresponding nentries field. For this reason nentries could become
negative. Still, the sum of all nentries remains valid.

I realize that this thread is quite long already and that it's been a
while since the previous commitfest ended. So I would like to provide a
short summary of this thread from my perspective:

1. The last version of the patch is here:

http://www.postgresql.org/message-id/CA+Tgmobtf9nH566_jjs=jrTyMq5HdQdaRF5j7o+AbdOWQHE_Ow@mail.gmail.com

It's reviewed, tested, well-commented and apparently nobody has strong
objections against it.

2. Robert suggested a few possible improvements. Experiments showed that
in the best case the resulting patches are as good as the patch from
item 1, but the code becomes considerably more complex.

3. A number of further changes (changing the calling convention for
ShmemInitHash(), changing the LOG2_NUM_LOCK_PARTITIONS constant) will be
discussed separately when and if the patch from this thread is applied.

>
> > > I am however wondering if it to set the freelist affinity based on
> > > something other than the hash value, like say the PID, so that the
> > > same process doesn't keep switching to a different freelist for
> > > every buffer eviction.
> >
> > Also I tried to use PID to determine freeList number:
> >
> > ```
> > #include <sys/types.h>
> > #include <unistd.h>
> >
> > ...
> >
> > #define FREELIST_IDX(hctl, hashcode) \
> > 	(IS_PARTITIONED(hctl) ? ((uint32)getpid()) % NUM_FREELISTS : 0)
> >
> > ...
> >
> > // now nentries could be negative in this case
> > // Assert(FREELIST(hctl, freelist_idx).nentries > 0);
>
> In which case, do you think entries can go negative? I think the
> nentries is incremented and decremented in the way as without patch,
> so I am not getting what can make it go negative.
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
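A short hypothetical sketch to make the cache-line experiment above concrete; the names (PaddedFreeList, sketch_spinlock, CACHE_LINE_SIZE) are invented for this illustration and are not taken from the patch:

```c
/*
 * Sketch only: pad each freelist slot to a full cache line so that two
 * backends hammering the mutexes of neighbouring slots never share a
 * cache line.  PostgreSQL's real slock_t is replaced by a plain int here.
 */
#define CACHE_LINE_SIZE 64
#define NUM_FREELISTS   32

typedef volatile int sketch_spinlock;   /* stand-in for slock_t */

typedef union
{
	struct
	{
		sketch_spinlock mutex;      /* protects freeList and nentries */
		long            nentries;   /* entries charged to this freelist */
		void           *freeList;   /* head of the chain of free elements */
	} data;
	char pad[CACHE_LINE_SIZE];      /* rounds sizeof() up to one cache line */
} PaddedFreeList;

static PaddedFreeList freelists[NUM_FREELISTS];
```

Whether such padding pays off is exactly what the SIZE column in the benchmarks mentioned above was probing; as reported, it helped by only about 0.3% TPS.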
On Tue, Mar 1, 2016 at 8:13 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
>
> Hello, Amit
>
> > I am not sure, if this is exactly what has been suggested by Robert,
> > so it is not straightforward to see if his suggestion can allow us to
> > use NUM_FREELISTS as 8 rather than 32. I think instead of trying to
> > use FREELISTBUFF, why not do it as Robert has suggested and try with
> > different values of NUM_FREELISTS?
>
> Well, what I did is in fact a bit more general solution then Robert
> suggested. At first I just joined freeList and mutex arrays into one
> array and tried different NUM_FREELISTS values (see FREELISTS column).
> That was Robert's idea. Unfortunately it didn't give any considerable
> speedup comparing with previous approach.
>
Okay, now I can understand the data better and I think your tests indicate that it is better to keep NUM_FREELISTS as 32.
>
> Then I tried to manually control sizeof every array item (see
> different SIZE values). By making it larger we can put every array item
> into a separate cache line. This approach helped a bit in terms of TPS
> (before: +33.9%, after: +34.2%) but only if NUM_FREELISTS remains the
> same (32).
>
> So answering your question - it turned out that we _can't_ reduce
> NUM_FREELISTS this way.
>
> Also I don't believe that +0.3% TPS according to synthetic benchmark
> that doesn't represent any possible workload worth it considering
> additional code complexity.
>
I think the data structure HASHHDR is looking simpler, however the code is looking more complex, especially with the addition of FREELISTBUFF for cacheline purposes. I think you might want to try exactly what has been suggested and see if the code looks better that way, but if you are not convinced, then let's leave it to the committer unless somebody else wants to try that suggestion.
> > In which case, do you think entries can go negative? I think the
> > nentries is incremented and decremented in the way as without patch,
> > so I am not getting what can make it go negative.
>
> In this context we are discussing a quick and dirty fix "what would
> happen if FREELIST_IDX would be implemented as getpid() % NUM_FREE_LIST
> instead of hash % NUM_FREELIST". The code is:
>
> int freelist_idx = FREELIST_IDX(hctl, hashvalue);
>
> // ...
>
> switch (action)
> {
>
> // ...
>
> case HASH_REMOVE:
> if (currBucket != NULL)
> {
> if (IS_PARTITIONED(hctl))
> SpinLockAcquire(&(FREELIST(hctl, freelist_idx).mutex));
>
> // Assert(FREELIST(hctl, freelist_idx).nentries > 0);
> FREELIST(hctl, freelist_idx).nentries--;
>
> /* remove record from hash bucket's chain. */
> *prevBucketPtr = currBucket->link;
>
> // ...
>
> No matter what freelist was used when element was added to a hash table.
> We always try to return free memory to the same freelist number getpid()
> % FREELIST_ITEMS and decrease number of elements in a hash table using
> corresponding nentries field.
>
Won't it always use the same freelist to remove and add the entry from freelist as for both cases it will calculate the freelist_idx in same way?
> Won't it always use the same freelist to remove and add the entry from
> freelist as for both cases it will calculate the freelist_idx in same
> way?

No. If "our" freelist is empty when we try to remove an item from it we
borrow item from another freelist. Then this borrowed item will be
returned to "our" freelist instead of original. Without some sort of
additional logic there is no way to figure out what freelist was
original.

Generally speaking negative nentries value is not something that
couldn't be solved. But I would like to remind that in this context we
are discussing a quick and dirty solution created for benchmark
purposes in a first place.

> You are not convinced, then lets leave it to committer unless
> somebody else wants to try that suggestion.

Agree. Frankly I'm tired of rewriting this patch over and over and over
again. So I would like to avoid rewriting it once again unless there is
a clear indication that this way we would gain something. Benchmarks
shows that this is not a case thus it's only a matter of taste and
intuition. We know from experience that both are bad advisers in
optimization matters.

I suggest we merge this patch already:

http://www.postgresql.org/message-id/CA+Tgmobtf9nH566_jjs=jrTyMq5HdQdaRF5j7o+AbdOWQHE_Ow@mail.gmail.com

... and get done with it. Clearly this patch is good enough --- it is
reviewed, tested, discussed, and has a proven performance benefit. When
and if dynahash becomes a bottleneck again, nothing will stop us from
optimizing it one more time for the hardware we will have by then. Who
knows what this hardware will be like in 5-10 years? Don't we have
better things to do now than arguing about imaginary benchmarks and
taste?

Let's see what the committer's opinion is on this.
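The borrowing behaviour described above can be pictured with a small, self-contained toy model (invented names, no locking, no allocation of brand-new elements; this is not the real dynahash code). The point is that the element may be taken from any partition's freelist, but the nentries counter that changes is always the one for the indicated freelist_idx:

```c
#define NUM_FREELISTS 32

typedef struct SketchElement
{
	struct SketchElement *link;             /* next free element */
} SketchElement;

typedef struct
{
	SketchElement *freeList[NUM_FREELISTS]; /* per-partition free chains */
	long           nentries[NUM_FREELISTS]; /* per-partition usage counters */
} SketchHashHdr;

static SketchElement *
sketch_take(SketchHashHdr *h, int freelist_idx)
{
	int			idx = freelist_idx;

	/* borrow from some other partition if the indicated one is empty */
	if (h->freeList[idx] == NULL)
	{
		for (idx = 0; idx < NUM_FREELISTS; idx++)
			if (h->freeList[idx] != NULL)
				break;
		if (idx == NUM_FREELISTS)
			return NULL;			/* nothing free anywhere */
	}

	SketchElement *elem = h->freeList[idx];

	h->freeList[idx] = elem->link;		/* unlink from wherever it was found */
	h->nentries[freelist_idx]++;		/* but charge it to the indicated list */
	return elem;
}

static void
sketch_put(SketchHashHdr *h, SketchElement *elem, int freelist_idx)
{
	/* freed elements always land on the indicated list */
	elem->link = h->freeList[freelist_idx];
	h->freeList[freelist_idx] = elem;
	h->nentries[freelist_idx]--;
}
```

Which index counts as "the indicated one" is the whole question in this sub-thread: derived from the hash value it is the same on insert and remove, derived from getpid() it need not be.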
On Thu, Mar 3, 2016 at 1:50 PM, Aleksander Alekseev <a.alekseev@postgrespro.ru> wrote:
> Won't it always use the same freelist to remove and add the entry from
> freelist as for both cases it will calculate the freelist_idx in same
> way?
> No. If "our" freelist is empty when we try to remove an item from it we
> borrow item from another freelist.
I think the way the patch works is: if the indicated freelist is empty, then it tries to allocate new elements in that list, and if it can't allocate, then it tries to borrow from another freelist; in both cases the element removed from the freelist is considered to be an element of the indicated freelist (basically it always increments nentries[freelist_idx]).
> Then this borrowed item will be
> returned to "our" freelist instead of original. Without some sort of
> additional logic there is no way to figure out what freelist was
> original.
Since the patch always operates on nentries[freelist_idx], it seems to me that in both cases (remove, add) the element is considered to belong to the same freelist. Have you faced this problem of negative entries in any of your tests? If so, please share the test so that I can also understand the scenario.
> Generally speaking negative nentries value is not something that
> couldn't be solved. But I would like to remind that in this context we
> are discussing a quick and dirty solution created for benchmark
> purposes in a first place.
I thought negative entries could be a problem for the patch as indicated by Robert[1], but maybe I am missing something.
> You are not convinced, then lets leave it to committer unless
> somebody else wants to try that suggestion.
> Agree. Frankly I'm tired of rewriting this patch over and over and over
> again. So I would like to avoid rewriting it once again unless there is
> a clear indication that this way we would gain something. Benchmarks
> shows that this is not a case thus it's only a matter of taste and
> intuition.
I can understand your feeling here and agree with you that it is a matter of taste. However, many times, if we change the patch according to the committer's taste, the chances of the patch getting accepted early are better. I am not sure if that applies here, though, so feel free to go the way you think is better.
On Tue, Mar 1, 2016 at 9:43 AM, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:
>> I am not sure, if this is exactly what has been suggested by Robert,
>> so it is not straightforward to see if his suggestion can allow us to
>> use NUM_FREELISTS as 8 rather than 32. I think instead of trying to
>> use FREELISTBUFF, why not do it as Robert has suggested and try with
>> different values of NUM_FREELISTS?
>
> Well, what I did is in fact a bit more general solution then Robert
> suggested. At first I just joined freeList and mutex arrays into one
> array and tried different NUM_FREELISTS values (see FREELISTS column).
> That was Robert's idea. Unfortunately it didn't give any considerable
> speedup comparing with previous approach.
>
> Then I tried to manually control sizeof every array item (see
> different SIZE values). By making it larger we can put every array item
> into a separate cache line. This approach helped a bit in terms of TPS
> (before: +33.9%, after: +34.2%) but only if NUM_FREELISTS remains the
> same (32).
>
> So answering your question - it turned out that we _can't_ reduce
> NUM_FREELISTS this way.

That's perplexing. I would have expected that with all of the mutexes
packed in back-to-back like this, we would end up with a considerable
amount of false sharing. I don't know why it ever helps to have an
array of bytes all in the same cache line of which each one is a
heavily-trafficked mutex. Anybody else have a theory?

One other thing that concerns me mildly is the way we're handling
nentries. It should be true, with this patch, that the individual
nentries sum to the right value modulo 2^32. But I don't think
there's any guarantee that the values are positive any more, and in
theory after running long enough one of them could manage to overflow
or underflow. So at a very minimum I think we need to remove the
Assert() the value is not negative. But really, I wonder if we
shouldn't rework things a little more than that.

One idea is to jigger things so that we maintain a count of the total
number of entries that doesn't change except when we allocate, and
then for each freelist partition we maintain the number of entries in
that freelist partition. So then the size of the hash table, instead
of being sum(nentries) is totalsize - sum(nfree).

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[ P.S. Sorry it's taken me a while to circle back around to this. ]
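Robert's alternative bookkeeping could look roughly like this (a sketch with invented names, not actual PostgreSQL code): one allocation counter that only grows, plus a per-partition count of elements currently sitting on that partition's freelist. Each nfree only ever changes together with the list it describes, so it cannot go negative, and the table size falls out as totalsize - sum(nfree):

```c
#define NUM_FREELISTS 32

typedef struct
{
	long	nfree;			/* free elements on this partition's list */
	/* the spinlock and the free-element chain itself are omitted here */
} CountingFreeList;

typedef struct
{
	long			total_allocated;	/* grows only when new elements are carved out */
	CountingFreeList freelist[NUM_FREELISTS];
} CountingHashHdr;

static long
sketch_num_entries(const CountingHashHdr *hctl)
{
	long	nfree = 0;

	for (int i = 0; i < NUM_FREELISTS; i++)
		nfree += hctl->freelist[i].nfree;

	/* size of the table: totalsize - sum(nfree) */
	return hctl->total_allocated - nfree;
}
```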
On Sat, Mar 19, 2016 at 1:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Mar 1, 2016 at 9:43 AM, Aleksander Alekseev
> <a.alekseev@postgrespro.ru> wrote:
> >
> > So answering your question - it turned out that we _can't_ reduce
> > NUM_FREELISTS this way.
>
> That's perplexing. I would have expected that with all of the mutexes
> packed in back-to-back like this, we would end up with a considerable
> amount of false sharing. I don't know why it ever helps to have an
> array of bytes all in the same cache line of which each one is a
> heavily-trafficked mutex. Anybody else have a theory?
>
> One other thing that concerns me mildly is the way we're handling
> nentries. It should be true, with this patch, that the individual
> nentries sum to the right value modulo 2^32. But I don't think
> there's any guarantee that the values are positive any more, and in
> theory after running long enough one of them could manage to overflow
> or underflow.
>
Won't in theory, without patch as well nentries can overflow after running for very long time? I think with patch it is more prone to overflow because we start borrowing from other free lists as well.
> So at a very minimum I think we need to remove the
> Assert() the value is not negative. But really, I wonder if we
> shouldn't rework things a little more than that.
>
> One idea is to jigger things so that we maintain a count of the total
> number of entries that doesn't change except when we allocate, and
> then for each freelist partition we maintain the number of entries in
> that freelist partition. So then the size of the hash table, instead
> of being sum(nentries) is totalsize - sum(nfree).
>
To me, your idea sounds much better than current code in terms of understanding the free list concept as well. So, +1 for changing the code in this way.
On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Won't in theory, without patch as well nentries can overflow after running
> for very long time? I think with patch it is more prone to overflow because
> we start borrowing from other free lists as well.

Uh, I don't think so. Without the patch, there is just one entries
counter and it goes up and down. How would it ever overflow?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Mar 19, 2016 at 7:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Won't in theory, without patch as well nentries can overflow after running
> > for very long time? I think with patch it is more prone to overflow because
> > we start borrowing from other free lists as well.
>
> Uh, I don't think so. Without the patch, there is just one entries
> counter and it goes up and down. How would it ever overflow?
>
I thought it can overflow because we haven't kept any upper limit on incrementing it unless the memory finishes (ofcourse that is just a theoretical assumption, as the decrements will keep the number in control), so are you thinking about the risk of overflow with patch because we have to use sum of all the nentries from all the arrays for total or is there any thing else which makes you think that changing nentries into arrays of nentries can make it prone to overflow?
On Sun, Mar 20, 2016 at 3:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Mar 19, 2016 at 7:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > Won't in theory, without patch as well nentries can overflow after
>> > running
>> > for very long time? I think with patch it is more prone to overflow
>> > because
>> > we start borrowing from other free lists as well.
>>
>> Uh, I don't think so. Without the patch, there is just one entries
>> counter and it goes up and down. How would it ever overflow?
>
> I thought it can overflow because we haven't kept any upper limit on
> incrementing it unless the memory finishes (ofcourse that is just a
> theoretical assumption, as the decrements will keep the number in control),
> so are you thinking about the risk of overflow with patch because we have to
> use sum of all the nentries from all the arrays for total or is there any
> thing else which makes you think that changing nentries into arrays of
> nentries can make it prone to overflow?

Well, I mean, perhaps nentries could overflow if you had more than
2^32 elements, but I'm not even positive we support that. If you
assume a fixed table with a million entries, the nentries value can
vary only between 0 and a million. But now split that into a bunch of
separate counters. The increment when you allocate an entry and the
decrement when you put one back don't have to hit the same bucket, so
I'm not sure there's anything that prevents the counter for one bucket
from getting arbitrarily large and the counter for another bucket
getting arbitrarily small while still summing to a value between 0 and
a million.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
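Robert's point about unbounded individual counters is easy to see with a deliberately extreme toy program (plain C, not PostgreSQL code): if the increment on allocation and the decrement on release may hit different counters, each counter alone is unbounded even though their sum stays put:

```c
#include <stdio.h>

int
main(void)
{
	long	nentries[2] = {0, 0};

	for (long i = 0; i < 1000000; i++)
	{
		nentries[0]++;			/* allocation always charged to counter 0 */
		nentries[1]--;			/* release always credited to counter 1 */
	}

	/* prints: sum = 0, nentries[0] = 1000000, nentries[1] = -1000000 */
	printf("sum = %ld, nentries[0] = %ld, nentries[1] = %ld\n",
		   nentries[0] + nentries[1], nentries[0], nentries[1]);
	return 0;
}
```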
On Mon, Mar 21, 2016 at 5:12 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Mar 20, 2016 at 3:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sat, Mar 19, 2016 at 7:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Sat, Mar 19, 2016 at 12:28 AM, Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> >> > Won't in theory, without patch as well nentries can overflow after
> >> > running
> >> > for very long time? I think with patch it is more prone to overflow
> >> > because
> >> > we start borrowing from other free lists as well.
> >>
> >> Uh, I don't think so. Without the patch, there is just one entries
> >> counter and it goes up and down. How would it ever overflow?
> >
> > I thought it can overflow because we haven't kept any upper limit on
> > incrementing it unless the memory finishes (ofcourse that is just a
> > theoretical assumption, as the decrements will keep the number in control),
> > so are you thinking about the risk of overflow with patch because we have to
> > use sum of all the nentries from all the arrays for total or is there any
> > thing else which makes you think that changing nentries into arrays of
> > nentries can make it prone to overflow?
>
> Well, I mean, perhaps nentries could overflow if you had more than
> 2^32 elements, but I'm not even positive we support that. If you
> assume a fixed table with a million entries, the nentries value can
> vary only between 0 and a million. But now split that into a bunch of
> separate counters. The increment when you allocate an entry and the
> decrement when you put one back don't have to hit the same bucket,
>
This is the point where I think I am missing something about patch. As far as I can understand, it uses the same freelist index (freelist_idx) for allocating and putting back the entry, so I think the chance of increment in one list and decrement in another is there when the value of freelist_idx is calculated differently for the same input, is it so, or there is something else in patch which I am missing?
> This is the point where I think I am missing something about patch.
> As far as I can understand, it uses the same freelist index
> (freelist_idx) for allocating and putting back the entry, so I think
> the chance of increment in one list and decrement in another is there
> when the value of freelist_idx is calculated differently for the same
> input, is it so, or there is something else in patch which I am
> missing?

You are right, nentries _can't_ be negative unless we are using getpid()
for calculating freelist_idx, since same index of nentries[] is used
when we add (increment) and remove (decrement) element from/to hash
table. The fact that we also borrow elements from other freelists if
there is no more elements in our freelist doesn't change anything.

> One idea is to jigger things so that we maintain a count of the total
> number of entries that doesn't change except when we allocate, and
> then for each freelist partition we maintain the number of entries in
> that freelist partition. So then the size of the hash table, instead
> of being sum(nentries) is totalsize - sum(nfree).

This is an interesting idea. Still, I strongly disagree that it should
be implemented in this concrete patch and discussed in this concrete
thread. Obviously this type of change deserves separate research and
discussion, since it has nothing to do with performance and since we
agreed that the "change LOG2_NUM_LOCK_PARTITIONS value" and "change the
calling convention for ShmemInitHash()" patches should be implemented
separately. I added it to my TODO list.

Once again I suggest we merge this patch already:

http://www.postgresql.org/message-id/CA+Tgmobtf9nH566_jjs=jrTyMq5HdQdaRF5j7o+AbdOWQHE_Ow@mail.gmail.com

I have a strong feeling that we are just wasting our time here.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
On Mon, Mar 21, 2016 at 4:48 AM, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:
>> This is the point where I think I am missing something about patch.
>> As far as I can understand, it uses the same freelist index
>> (freelist_idx) for allocating and putting back the entry, so I think
>> the chance of increment in one list and decrement in another is there
>> when the value of freelist_idx is calculated differently for the same
>> input, is it so, or there is something else in patch which I am
>> missing?
>
> You are right, nentries _can't_ be negative unless we are using getpid()
> for calculating freelist_idx, since same index of nentries[] is used
> when we add (increment) and remove (decrement) element from/to hash
> table. The fact that we also borrow elements from other freelists if
> there is no more elements in our freelist doesn't change anything.

Ah. OK, I missed that.

> Once again I suggest we merge this patch already:
>
> http://www.postgresql.org/message-id/CA+Tgmobtf9nH566_jjs=jrTyMq5HdQdaRF5j7o+AbdOWQHE_Ow@mail.gmail.com
>
> I have a strong feeling that we are just wasting our time here.

That is possible. However, I would like it if you would give me the
benefit of the doubt and assume that, if I seem to be more cautious
than you would be were you a committer, there might possibly be some
good reasons for that. The fact is that, despite being more cautious
than some people think I should be, I still manage to introduce quite
a number of bugs via the patches I commit - see the thread 'Missing
rows with index scan when collation is not "C"' on pgsql-bugs for just
the very latest example of that. Nobody thinks that will happen with
*their* patch, of course, but it does all the same.

I'd still like an answer to the question of why this helps so much
when there must be huge amounts of false sharing between the different
mutexes. Maybe it doesn't matter, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> > I have a strong feeling that we are just wasting our time here.
>
> That is possible. However, I would like it if you would give me the
> benefit of the doubt and assume that, if I seem to be more cautious
> than you would be were you a committer, there might possibly be some
> good reasons for that. The fact is that, despite being more cautious
> than some people think I should be, I still manage to introduce quite
> a number of bugs via the patches I commit - see the thread 'Missing
> rows with index scan when collation is not "C"' on pgsql-bugs for just
> the very latest example of that. Nobody thinks that will happen with
> *their* patch, of course, but it does all the same.

Oh, that explains a lot! You see, I couldn't figure out what was
happening. The patch was discussed and reviewed a long time ago,
everyone seems to be reasonably happy with it, etc. But for some reason
it has been ignored for weeks and no one says why. Now it makes sense.

I should probably mention that this patch was merged into the PostgresPro
build a few months ago. Several of our clients are already using this
code in a production environment (guess who discovered this issue and
wanted it to be fixed). There have been no complaints so far.

> I'd still like an answer to the question of why this helps so much
> when there must be huge amounts of false sharing between the different
> mutexes. Maybe it doesn't matter, though.

Well, the reason is that there is no bottleneck here any more. The code
spends something like 1% of its time here and 99% somewhere else. There
is false sharing, but it's not as expensive as the other 99% of the
code. Thus optimizing this 1% any further doesn't give a noticeable
performance improvement.

I still believe that optimizing 1% blindly without considering possible
side effects this optimization can bring (other data alignment, some
additional machine instructions - just to name a few) and having no
way to measure these side effects is a bad idea. But I also admit a
possibility that I can somehow be wrong on this. So I rewrote this
patch once again :'( the way you suggested (without that alignment
related hack I tried, it's just too ugly). I also attached original
hashhdr-rmh.patch just to have both patches in one message so it would
be easier to find both patches in this thread.

If there are any other questions or doubts regarding these patches,
please don't hesitate to ask.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/
Attachment
On Wed, Mar 23, 2016 at 5:49 AM, Aleksander Alekseev
<a.alekseev@postgrespro.ru> wrote:
> I still believe that optimizing 1% blindly without considering possible
> side effects this optimization can bring (other data alignment, some
> additional machine instructions - just to name a few) and having no
> way to measure these side effects is a bad idea. But I also admit a
> possibility that I can somehow be wrong on this. So I rewrote this
> patch once again :'( the way you suggested (without that alignment
> related hack I tried, it's just too ugly). I also attached original
> hashhdr-rmh.patch just to have both patches in one message so it would
> be easier to find both patches in this thread.

Thanks! I can't think of anything else to worry about with regards to
that version, so I have committed it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> Thanks! I can't think of anything else to worry about with regards to
> that version, so I have committed it.
>

Thanks, Robert. And thanks everyone for contributing to this patch.

-- 
Best regards,
Aleksander Alekseev
http://eax.me/