Thread: [HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into multiple parts
Hi,
I have been experimenting with splitting the ProcArrayLock into parts. That is, to acquire the ProcArrayLock in shared mode, it is only necessary to acquire one of the parts in shared mode; to acquire the lock in exclusive mode, all of the parts must be acquired in exclusive mode. For those interested, I have attached a design description of the change.
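In outline, the locking discipline looks something like the following. This is a minimal standalone sketch using pthread read-write locks rather than LWLocks; the names and the two-part split are illustrative only, not taken from the prototype.

#include <pthread.h>

#define NPARTS 2                        /* illustrative partition count */

typedef struct
{
    pthread_rwlock_t part[NPARTS];
} PartitionedLock;

void
part_lock_init(PartitionedLock *l)
{
    for (int i = 0; i < NPARTS; i++)
        pthread_rwlock_init(&l->part[i], NULL);
}

/*
 * Shared acquire: each backend takes only "its" part, chosen by a stable
 * per-backend key, so shared acquirers spread across cache lines instead
 * of all hammering one lock word.
 */
void
part_lock_shared(PartitionedLock *l, int backend_id)
{
    pthread_rwlock_rdlock(&l->part[backend_id % NPARTS]);
}

void
part_unlock_shared(PartitionedLock *l, int backend_id)
{
    pthread_rwlock_unlock(&l->part[backend_id % NPARTS]);
}

/*
 * Exclusive acquire: take every part, always in the same order, so two
 * exclusive acquirers cannot deadlock against each other.
 */
void
part_lock_exclusive(PartitionedLock *l)
{
    for (int i = 0; i < NPARTS; i++)
        pthread_rwlock_wrlock(&l->part[i]);
}

void
part_unlock_exclusive(PartitionedLock *l)
{
    for (int i = NPARTS - 1; i >= 0; i--)
        pthread_rwlock_unlock(&l->part[i]);
}

The shared path still takes exactly one lock, while the exclusive path takes all NPARTS of them; that extra exclusive cost is where the single-socket downside described below comes from.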
This approach has been quite successful on large systems with the hammerdb benchmark. With a prototype based on the 10 master source, running on POWER8 (model 8335-GCA, 2 sockets, 20 cores), hammerdb improved by 16%. On Intel (Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 2 sockets, 44 cores) with a 9.6 base and the prototype, hammerdb improved by 4%. (Attached is a set of spreadsheets for POWER8.)
The down side is that on smaller configurations (single socket), where there is less "lock thrashing" in the storage subsystem and there are multiple LWLocks to take for an exclusive acquire, there is a decided downturn in performance. On hammerdb, the prototype was 6% worse than the base on a single-socket POWER configuration.
If there is interest in this approach, I will submit a patch.
Jim Van Fleet
Hi, Jim.
How do you ensure transaction order?
Example:
- You lock shard A and gather info. You find transaction T1 in progress.
- Then you unlock shard A.
- T1 completes. T2, which depends on T1, also completes. But T2 was on shard B.
- You lock shard B and gather info from it.
- You didn't see T2 as in progress, so you will then look it up in the clog and find it committed.
Now you see T2 as committed but T1 as in progress: a clear violation of transaction order.
Probably you've already solved this issue. If so, it would be great to learn the solution.
Excuse me, Jim.
I was tired and misunderstood the proposal: I thought of ProcArray sharding, but the proposal is about ProcArrayLock sharding.
BTW, I just posted an improvement to LWLock:
https://www.postgresql.org/message-id/2968c0be065baab8865c4c95de3f435c%40postgrespro.ru
Would you mind testing against that, and together with that?
NP, Sokolov --
pgsql-hackers-owner@postgresql.org wrote on 06/05/2017 03:26:46 PM:
> From: Sokolov Yura <y.sokolov@postgrespro.ru>
> To: Jim Van Fleet <vanfleet@us.ibm.com>
> Cc: pgsql-hackers@postgresql.org
> Date: 06/05/2017 03:28 PM
> Subject: Re: [HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into
> multiple parts
> Sent by: pgsql-hackers-owner@postgresql.org
>
> Excuse me, Jim.
>
> I was tired and misunderstood the proposal: I thought of ProcArray
> sharding, but the proposal is about ProcArrayLock sharding.
>
> BTW, I just posted an improvement to LWLock:
>
> https://www.postgresql.org/message-id/2968c0be065baab8865c4c95de3f435c%40postgrespro.ru
>
> Would you mind testing against that, and together with that?
I will give them a try ..
Jim
Hi Sokolov --
I tried your patch. I only had time to do a few data points on POWER8. pgbench rw on two sockets is awesome! It keeps getting more throughput as threads are added -- in contrast to the base and my prototype. I did not run single-socket pgbench.
Hammerdb on 1 socket was in the same ballpark as the base, but slightly lower. 2 sockets was also in the same ballpark as the base, again slightly lower. I did not do a series of points (just one at the previous "sweet spot"), so the "final" results may be better. The ProcArrayLock multiple-parts prototype was lower except in the two-socket case. The performance data I collected for your patch on hammerdb showed the same sort of issues as the base.
I don't see much point in combining the two because of the ProcArrayLock downside -- that is, poor single-socket performance. Unless we could come up with some heuristic to use one part on light loads and two on heavy (and still stay correct), I don't see it ... With the combination, what I think we would see is awesome pgbench rw, awesome hammerdb two-socket performance, and degraded single-socket hammerdb.
Jim
From: Sokolov Yura <y.sokolov@postgrespro.ru>
To: Jim Van Fleet <vanfleet@us.ibm.com>
Cc: pgsql-hackers@postgresql.org
Date: 06/05/2017 03:28 PM
Subject: Re: [HACKERS] HACKERS[PROPOSAL] split ProcArrayLock into multiple parts
Sent by: pgsql-hackers-owner@postgresql.org
Excuse me, Jim.
I was tired and misunderstood the proposal: I thought of ProcArray sharding, but the proposal is about ProcArrayLock sharding.
BTW, I just posted an improvement to LWLock:
https://www.postgresql.org/message-id/2968c0be065baab8865c4c95de3f435c%40postgrespro.ru
Would you mind testing against that, and together with that?
On Tue, Jun 6, 2017 at 1:00 AM, Jim Van Fleet <vanfleet@us.ibm.com> wrote:
> ...
> The down side is that on smaller configurations (single socket) where
> there is less "lock thrashing" in the storage subsystem and there are
> multiple LWLocks to take for an exclusive acquire, there is a decided
> downturn in performance. On hammerdb, the prototype was 6% worse than
> the base on a single-socket POWER configuration.

I think any patch having a 6% regression on one machine configuration and a 16% improvement on another machine configuration is not a net win. However, if there is a way to address the regression, then it will look much more attractive.

> If there is interest in this approach, I will submit a patch.

The basic idea is clear from your description, but it will be better if you share the patch as well. It will not only help people to review and provide you feedback but also allow them to test and see if they can reproduce the numbers you have mentioned in the mail.

There is some related work which was previously proposed in this area ("Cache the snapshot") [1], and it claims to reduce contention around ProcArrayLock. I am not sure if that patch still applies; however, if you find it relevant and you are interested in evaluating the same, then we can request the author to post a rebased version if it doesn't apply.

[1] - https://www.postgresql.org/message-id/CAD__OuiwEi5sHe2wwQCK36Ac9QMhvJuqG3CfPN%2BOFCMb7rdruQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Amit Kapila <amit.kapila16@gmail.com> wrote on 06/07/2017 07:34:06 AM:
...
> > The down side is that on smaller configurations (single socket) where there
> > is less "lock thrashing" in the storage subsystem and there are multiple
> > LWLocks to take for an exclusive acquire, there is a decided downturn in
> > performance. On hammerdb, the prototype was 6% worse than the base on a
> > single-socket POWER configuration.
> >
>
> I think any patch having a 6% regression on one machine configuration
> and a 16% improvement on another machine configuration is not a net win.
> However, if there is a way to address the regression, then it will
> look much more attractive.
I have to agree.
>
> > If there is interest in this approach, I will submit a patch.
> >
>
> The basic idea is clear from your description, but it will be better
> if you share the patch as well. It will not only help people to
> review and provide you feedback but also allow them to test and see if
> they can reproduce the numbers you have mentioned in the mail.
OK -- would love the feedback and any suggestions on how to mitigate the low end problems.
>
> There is some related work which was previously proposed in this area
> ("Cache the snapshot") [1] and it claims to reduce contention around
> ProcArrayLock. I am not sure if that patch still applies, however, if
> you find it relevant and you are interested in evaluating the same,
> then we can request the author to post a rebased version if it doesn't
> apply.
Sokolov Yura has a patch which, to me, looks good for pgbench rw performance. Does not do so well with hammerdb (about the same as base) on single socket and two socket.
>
> [1] - https://www.postgresql.org/message-id/
> CAD__OuiwEi5sHe2wwQCK36Ac9QMhvJuqG3CfPN%2BOFCMb7rdruQ%40mail.gmail.com
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
On Wed, Jun 7, 2017 at 12:29 PM, Jim Van Fleet <vanfleet@us.ibm.com> wrote:
>> The basic idea is clear from your description, but it will be better
>> if you share the patch as well. It will not only help people to
>> review and provide you feedback but also allow them to test and see if
>> they can reproduce the numbers you have mentioned in the mail.
>
> OK -- would love the feedback and any suggestions on how to mitigate the low
> end problems.

Did you intend to attach a patch?

> Sokolov Yura has a patch which, to me, looks good for pgbench rw
> performance. Does not do so well with hammerdb (about the same as base) on
> single socket and two socket.

Any idea why? I think we will have to understand *why* certain things help in some situations and not others, not just *that* they do, in order to come up with a good solution to this problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote on 06/07/2017 12:12:02 PM:
> > OK -- would love the feedback and any suggestions on how to mitigate the low
> > end problems.
>
> Did you intend to attach a patch?
Yes I do -- tomorrow or Thursday -- needs a little cleaning up ...
> > Sokolov Yura has a patch which, to me, looks good for pgbench rw
> > performance. Does not do so well with hammerdb (about the same as base) on
> > single socket and two socket.
>
> Any idea why? I think we will have to understand *why* certain things
> help in some situations and not others, not just *that* they do, in
> order to come up with a good solution to this problem.
Looking at the data now -- the LWLockAcquire philosophy is different -- at first glance I would have guessed "about the same" as the base, but I cannot yet explain why we have super pgbench rw performance and "the same" hammerdb performance.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
Good day Robert, Jim, and everyone.

On 2017-06-08 00:06, Jim Van Fleet wrote:
> Robert Haas <robertmhaas@gmail.com> wrote on 06/07/2017 12:12:02 PM:
>> Any idea why? I think we will have to understand *why* certain things
>> help in some situations and not others, not just *that* they do, in
>> order to come up with a good solution to this problem.

My patch improves acquiring a contended/blocking LWLock on NUMA because:

a. The patched procedure generates far fewer writes, especially because taking the WaitListLock is unified with acquiring the lock itself. Access to modified memory is very expensive on NUMA, so fewer writes means less wasted time.

b. It spins several times on lock->state in attempts to acquire the lock before starting to queue itself onto the wait list. This is really the cause of some of the speedup; without spinning, the patch just removes degradation under contention. I don't know why spinning doesn't improve single-socket performance, though :-) Probably because all the algorithmic overhead (system calls, sleeping and awakening a process) is not too expensive until NUMA is involved.

> Looking at the data now -- the LWLockAcquire philosophy is different --
> at first glance I would have guessed "about the same" as the base, but
> I cannot yet explain why we have super pgbench rw performance and "the
> same" hammerdb performance.

My patch improves only blocking contention, i.e. when a lot of EXCLUSIVE locks are involved. pgbench rw generates a lot of write traffic, so there is a lot of contention and waiting on WALInsertLocks (in XLogInsertRecord, and waiting in XLogFlush), WALWriteLock (in XLogFlush), and CLogControlLock (in TransactionIdSetTreeStatus). The case where SHARED locks are much more common than EXCLUSIVE ones is not affected by the patch, because SHARED is then acquired on the fast path in both the original and patched versions. So it looks like hammerdb doesn't produce much EXCLUSIVE contention on LWLocks, and therefore it is not improved by the patch.

Splitting ProcArrayLock helps with acquiring a SHARED lock on NUMA in the absence of EXCLUSIVE locks for the same reason my patch improves acquiring a blocking lock: fewer writes to the same memory. Since every process writes to just one part of ProcArrayLock, there are far fewer writes to each part, so acquiring the SHARED lock pays less for access to modified memory on NUMA.

Probably I'm mistaken somewhere.

--
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
pgsql-hackers-owner@postgresql.org wrote on 06/07/2017 04:06:57 PM:
...
> >
> > Did you intend to attach a patch?
> Yes I do -- tomorrow or Thursday -- needs a little cleaning up ...
meant Friday
>
> > > Sokolov Yura has a patch which, to me, looks good for pgbench rw
> > > performance. Does not do so well with hammerdb (about the same
> > > as base) on single socket and two socket.
> >
> > Any idea why? I think we will have to understand *why* certain things
> > help in some situations and not others, not just *that* they do, in
> > order to come up with a good solution to this problem.
> Looking at the data now -- the LWLockAcquire philosophy is different --
> at first glance I would have guessed "about the same" as the base,
> but I cannot yet explain why we have super pgbench rw performance
> and "the same" hammerdb performance.
(data taken from perf cycles when I invoked the performance data gathering script, generally in the middle of the run)
In two-socket hammerdb, ProcArrayLock is the bottleneck in LWLockAcquire (GetSnapshotData accounts for about 75% of the calls to LWLockAcquire). With Sokolov's patch, LWLockAcquire (with LWLockAttemptLock included) is a little over 9%; pgbench, on the other hand, has LWLockAcquire at 1.3%, with GetSnapshotData making only 11% of the calls to LWLockAcquire.
What I think that means is that there is no ProcArrayLock bottleneck in pgbench. GetSnapshotData walks the entire proc chain of PGXACTs, so the lock is held a rather long time. My guess is that the other locks are held for much shorter times; Sokolov's patch handles the other locks better because of spinning. We see much more time in LWLockAcquire with hammerdb because of the spinning -- with ProcArrayLock, spinning does not help much because of the longer hold time.
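In simplified form, the shape of the problem is something like this (a standalone model of the GetSnapshotData pattern, not the actual PostgreSQL code; the names here are made up):

#include <pthread.h>

#define MAX_BACKENDS 1024

typedef struct
{
    int xid;                        /* 0 means no transaction in progress */
} FakeXact;

static pthread_rwlock_t proc_array_lock = PTHREAD_RWLOCK_INITIALIZER;
static FakeXact xacts[MAX_BACKENDS];
static int      nbackends;

/*
 * The shared lock is held across a scan of every backend's transaction
 * state, so hold time grows with the number of backends -- at high
 * thread counts the lock is both taken constantly and held
 * comparatively long.
 */
static int
get_snapshot(int *in_progress)
{
    int n = 0;

    pthread_rwlock_rdlock(&proc_array_lock);
    for (int i = 0; i < nbackends; i++)
    {
        if (xacts[i].xid != 0)
            in_progress[n++] = xacts[i].xid;
    }
    pthread_rwlock_unlock(&proc_array_lock);
    return n;
}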
The spin count is relatively high (100/2), so I made it much smaller (20/2) in the hope that the spin would still handle the shorter-hold-time locks but not be a bother with long hold times.
Running pgbench with 96 users, throughput was slightly lower at 70K tps vs 75K tps (vs the base's 40K tps at 96 threads and peak of 58K tps at 64 threads); two-socket hammerdb was slightly better (about 3%) than the peak base.
What all this tells me is that LWLockAcquire would (probably) benefit from some spinning.
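For reference, the pattern I mean is roughly the following. This is a standalone sketch of spin-then-queue, not Sokolov's actual patch, and my reading of the "100/2" spin count as an outer attempt limit is an assumption:

#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_ATTEMPTS 100               /* tried 20 as the smaller setting */

typedef struct
{
    atomic_uint state;                  /* 0 = free, 1 = held exclusive */
} SpinLWLock;

static bool
try_exclusive(SpinLWLock *lock)
{
    unsigned expected = 0;
    return atomic_compare_exchange_strong(&lock->state, &expected, 1);
}

/*
 * Stand-in for the real slow path, which would queue the process on the
 * lock's wait list and sleep; here we just yield until the lock frees up.
 */
static void
queue_self_and_sleep(SpinLWLock *lock)
{
    while (!try_exclusive(lock))
        sched_yield();
}

void
acquire_exclusive(SpinLWLock *lock)
{
    /*
     * Spin briefly on the lock word first: for short hold times this
     * avoids the system-call cost of sleeping and waking a process.
     */
    for (int i = 0; i < SPIN_ATTEMPTS; i++)
    {
        if (try_exclusive(lock))
            return;
        sched_yield();                  /* brief pause between attempts */
    }

    /*
     * Long hold times (e.g. ProcArrayLock during a snapshot scan) make
     * spinning wasted work; fall back to queueing.
     */
    queue_self_and_sleep(lock);
}

void
release_exclusive(SpinLWLock *lock)
{
    atomic_store(&lock->state, 0);
}

With long hold times the spin budget is burned with nothing to show for it, which matches the hammerdb ProcArrayLock profile above.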
>
> >
> > --
> > Robert Haas
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company
> >