I have been experimenting with splitting the ProcArrayLock into parts. That is, to acquire the ProcArrayLock in shared mode, it is only necessary to acquire one of the parts in shared mode; to acquire the lock in exclusive mode, all of the parts must be acquired in exclusive mode. For those interested, I have attached a design description of the change.
This approach has been quite successful on large systems with the HammerDB benchmark. With a prototype based on the PostgreSQL 10 master source and running on POWER8 (model 8335-GCA with 2 sockets, 20 cores), HammerDB improved by 16%. On Intel (Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 2 sockets, 44 cores) with a 9.6 base and prototype, HammerDB improved by 4%. A set of spreadsheets for POWER8 is attached.
The downside is that on smaller configurations (single socket), where there is less "lock thrashing" in the storage subsystem and there are multiple LWLocks to take for an exclusive acquire, there is a decided downturn in performance. On HammerDB, the prototype was 6% worse than the base on a single-socket POWER configuration.
If there is interest in this approach, I will submit a patch.