Re: Reduce ProcArrayLock contention - Mailing list pgsql-hackers

From Pavan Deolasee
Subject Re: Reduce ProcArrayLock contention
Msg-id CABOikdM6oyr25AkAyVhhpC1vO7amwbS3rjdZj3tjsGS7L-n6xQ@mail.gmail.com
In response to Reduce ProcArrayLock contention  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers


On Mon, Jun 29, 2015 at 8:57 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:


pgbench setup
------------------------
scale factor - 300
Data is on magnetic disk and WAL on ssd.
pgbench -M prepared tpc-b

Head : commit 51d0fe5d
Patch-1 : group_xid_clearing_at_trans_end_rel_v1


Client Count/TPS      1      8     16     32     64    128
HEAD                814   6092  10899  19926  23636  17812
Patch-1            1086   6483  11093  19908  31220  28237

The graph for the data is attached.


The numbers look impressive and definitely show that the idea is worth pursuing. I tried the patch on my laptop. Unfortunately, at least for 4 and 8 clients, I did not see any improvement. In fact, averages over 2 runs showed a slight 2-4% decline in tps. Having said that, there is no reason to disbelieve your numbers, and on much more powerful machines we might well see the gains.

BTW I ran the tests with: pgbench -s 10 -c 4 -T 300
 

Points about performance data
---------------------------------------------
1.  Gives good performance improvement at or greater than 64 clients
and gives somewhat moderate improvement at lower client counts.  The
reason is that the contention around ProcArrayLock is mainly
seen at higher client counts.  I have checked that at higher client counts
it starts behaving lockless (which means performance with the patch is
equivalent to just commenting out ProcArrayLock in
ProcArrayEndTransaction()).

Well, I am not entirely sure if that's the correct way of looking at it. Sure, you would see less contention on ProcArrayLock, because the fact is that there are far fewer backends trying to acquire it. But those who don't get the lock will sleep, and hence the contention moves somewhere else, at least partially.
 
2. There is some noise in this data (at 1 client count, I don't expect
much difference).
3. I have done similar tests on power-8 m/c and found similar gains.

As I said, I'm not seeing the benefits on my laptop (MacBook Pro, quad core, SSD). But then I ran with a much lower scale factor and far fewer clients.
 
4. The gains are visible when the data fits in shared_buffers as for other
workloads I/O starts dominating.

That seems perfectly expected.
 
5. I have seen that the effect of the patch is much more visible if we keep
autovacuum = off (do manual vacuum after each run) and keep
wal_writer_delay at a lower value (say 20ms).

Do you know why that happens? Is it because the contention moves somewhere else with autovacuum on?
 
Regarding the design itself, I have an idea: maybe we can create a general-purpose infrastructure to use this technique. If it's useful here, I'm sure there are other places where it can be applied with similar effect.

For example, how about adding an API such as LWLockDispatchWork(lock, mode, function_ptr, data_ptr)? Here data_ptr points to somewhere in shared memory that function_ptr can work on once the lock is available. If the lock is available in the requested mode, then function_ptr is executed with the given data_ptr and the function returns. If the lock is not available, the work is dispatched to some queue (tracked on a per-lock basis?) and the process goes to sleep. Whenever the lock becomes available in the requested mode, the work is executed by some other backend and the original process is woken up. This will most likely happen in the LWLockRelease() path, when the last holder is about to give up the lock so that it becomes available in the requested mode.

There is a lot of handwaving here, and I'm not sure the LWLock infrastructure permits us to add something like this easily, but I thought I would put the idea forward anyway. In fact, I remember trying something of this sort a long time back, but I can't recollect why I gave up on it. Maybe I did not see much benefit in the whole approach of clubbing work-pieces together and doing them in a single process. But then I probably did not have access to machines powerful enough to measure the benefits correctly. So I'm not willing to give up on the idea, especially given your test results.

BTW, maybe LWLockDispatchWork() makes sense only for EXCLUSIVE locks, because in read mode we tend to read from shared memory and populate local structures, and that can only happen in the original backend itself.
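To make this less handwavy, here is a rough standalone sketch of what I mean. It is not PostgreSQL code: a plain pthread mutex/condvar stands in for the LWLock, only the exclusive case is modelled, and all the names (DispatchLock, WorkItem, even the LWLockDispatchWork signature) are made up for illustration:

/*
 * dispatch_sketch.c -- standalone toy model of the proposed
 * LWLockDispatchWork(lock, mode, function_ptr, data_ptr) idea.  This is
 * NOT PostgreSQL code: a pthread mutex/condvar stands in for the LWLock
 * and only the EXCLUSIVE case is modelled.  All names are hypothetical.
 *
 * Build: cc -pthread dispatch_sketch.c -o dispatch_sketch
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef void (*WorkFn) (void *data);

typedef struct WorkItem
{
	WorkFn		fn;
	void	   *data;
	bool		done;
	struct WorkItem *next;
} WorkItem;

typedef struct DispatchLock
{
	pthread_mutex_t mutex;
	pthread_cond_t done_cv;
	bool		held;			/* models "taken in EXCLUSIVE mode" */
	WorkItem   *queue;			/* work queued by callers that lost the race */
} DispatchLock;

/*
 * If the lock is free: take it, run our own work, then drain and run any
 * work queued meanwhile.  If it is busy: queue our work and sleep until
 * the current holder has executed it on our behalf.
 */
static void
LWLockDispatchWork(DispatchLock *lock, WorkFn fn, void *data)
{
	WorkItem	item = {fn, data, false, NULL};

	pthread_mutex_lock(&lock->mutex);
	if (!lock->held)
	{
		lock->held = true;
		pthread_mutex_unlock(&lock->mutex);
		fn(data);				/* our own work, under the "lock" */
		pthread_mutex_lock(&lock->mutex);
		while (lock->queue != NULL)
		{
			WorkItem   *w = lock->queue;

			lock->queue = w->next;
			w->fn(w->data);		/* work done on behalf of a sleeping waiter */
			w->done = true;
		}
		lock->held = false;
		pthread_cond_broadcast(&lock->done_cv);
		pthread_mutex_unlock(&lock->mutex);
	}
	else
	{
		item.next = lock->queue;
		lock->queue = &item;
		while (!item.done)
			pthread_cond_wait(&lock->done_cv, &lock->mutex);
		pthread_mutex_unlock(&lock->mutex);
	}
}

/* Tiny demo: eight threads funnel increments through one dispatch lock. */
static DispatchLock lk = {PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER};
static long counter = 0;

static void
bump(void *arg)
{
	counter += *(int *) arg;
}

static void *
worker(void *arg)
{
	int			one = 1;

	for (int i = 0; i < 100000; i++)
		LWLockDispatchWork(&lk, bump, &one);
	return NULL;
}

int
main(void)
{
	pthread_t	th[8];

	for (int i = 0; i < 8; i++)
		pthread_create(&th[i], NULL, worker, NULL);
	for (int i = 0; i < 8; i++)
		pthread_join(th[i], NULL);
	printf("counter = %ld (expected 800000)\n", counter);
	return 0;
}

Obviously a real LWLock version would have to do this with the lock's own wait list and internal spinlock rather than a separate mutex, but the shape of the API is what I had in mind.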

Regarding the patch, the compare-and-exchange function calls that you've used would work only on 64-bit machines, right? You would need to use the equivalent 32-bit calls on a 32-bit machine.
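Just to illustrate: if the pending list were linked by 32-bit process numbers rather than 64-bit pointers, the same compare-and-exchange would work on either kind of build (something like pg_atomic_compare_exchange_u32() on the PostgreSQL side, if I'm reading atomics.h correctly). Here is a rough sketch using C11 atomics in place of the pg_atomic_* wrappers; this is not your patch, and all names are invented:

/*
 * Sketch only, not the actual patch: push ourselves onto a "pending
 * clear" list with a compare-and-exchange.  Linking by 32-bit process
 * number (with a sentinel for "empty") instead of by 64-bit pointer
 * keeps the same code working on 32-bit builds.  C11 atomics stand in
 * for the pg_atomic_* wrappers; all names here are made up.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_PROC_NUMBER UINT32_MAX	/* end-of-list / empty marker */

typedef struct FakeProc
{
	_Atomic uint32_t next_waiter;	/* next proc number in the list */
} FakeProc;

typedef struct GroupClearState
{
	_Atomic uint32_t first_waiter;	/* head of the pending list */
} GroupClearState;

/*
 * Add ourselves to the list.  Returns true if the list was empty, i.e.
 * we are the "leader" who should take ProcArrayLock and clear XIDs for
 * the whole group.
 */
static bool
push_self(GroupClearState *state, FakeProc *procs, uint32_t my_proc_no)
{
	uint32_t	head = atomic_load(&state->first_waiter);

	for (;;)
	{
		atomic_store(&procs[my_proc_no].next_waiter, head);
		/* on failure the CAS refreshes "head" and we retry */
		if (atomic_compare_exchange_weak(&state->first_waiter,
										 &head, my_proc_no))
			return head == INVALID_PROC_NUMBER;
	}
}

int
main(void)
{
	GroupClearState state = {INVALID_PROC_NUMBER};
	FakeProc	procs[4] = {{0}};

	/* proc 2 pushes onto the empty list and becomes the leader */
	return push_self(&state, procs, 2) ? 0 : 1;
}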

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
