Re: BufFreelistLock - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: BufFreelistLock
Date
Msg-id AANLkTimUr3KHCXXk6TmbpsAFqOj20_M3qAZsSSk6s7TL@mail.gmail.com
In response to Re: BufFreelistLock  (Jim Nasby <jim@nasby.net>)
Responses Re: BufFreelistLock
List pgsql-hackers
On Tue, Dec 14, 2010 at 1:42 PM, Jim Nasby <jim@nasby.net> wrote:
>
> On Dec 14, 2010, at 11:08 AM, Jeff Janes wrote:
>

>> I wouldn't expect an increase in shared_buffers to make contention on
>> BufFreelistLock worse.  If the increased buffers are used to hold
>> heavily-accessed data, then you will find the pages you want in
>> shared_buffers more often, and so need to run the clock-sweep less
>> often.  That should make up for longer sweeps.  But if the increased
>> buffers are used to hold data that is just read once and thrown away,
>> then the clock sweep shouldn't need to sweep very far before finding a
>> candidate.
>
> Well, we're talking about a working set that's between 96 and 192G, but
> only 8G (or 28G) of shared buffers. So there's going to be a pretty
> large amount of buffer replacement happening. We also have
> 210 tables where the ratio of heap buffer hits to heap reads is
> over 1000, so the stuff that is in shared buffers probably keeps
> usage_count quite high. Put these two together, and we're probably
> spending a fairly significant amount of time running the clock sweep.

The thing that makes me think the bottleneck is elsewhere is that
increasing from 8G to 28G made it worse.  If buffer unpins are
happening at about the same rate, then my gut feeling is that the
clock sweep has to do about the same amount of decrementing before it
gets to a free buffer under steady state conditions.  Whether it has
to decrement 8G of buffers three and a half times each, or 28G of
buffers one time each, it would do about the same amount of work.
This is all hand waving, of course.
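
To put a little more structure on that hand waving: the sweep is
essentially the loop below (a simplified sketch, not the real
StrategyGetBuffer(), which also deals with pinned buffers, the free
list, and BufFreelistLock).  In steady state, every usage_count bump
from a buffer hit has to be paid for by one decrement in this loop, so
the total decrement work tracks the access rate more than the number
of buffers.

    /* Simplified clock-sweep sketch; NOT the real StrategyGetBuffer(). */
    #define NBUFFERS 1024

    typedef struct
    {
        int usage_count;
        int refcount;
    } FakeBufferDesc;

    static FakeBufferDesc buffers[NBUFFERS];
    static int next_victim = 0;

    static int
    clock_sweep(void)
    {
        for (;;)
        {
            int             idx = next_victim;
            FakeBufferDesc *buf = &buffers[idx];

            next_victim = (next_victim + 1) % NBUFFERS;

            if (buf->refcount == 0 && buf->usage_count == 0)
                return idx;             /* found a victim */

            if (buf->usage_count > 0)
                buf->usage_count--;     /* age it and keep sweeping */
        }
    }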


> Even excluding our admittedly unusual workload, there is still significant overhead in running the clock sweep vs
> just grabbing something off of the free list (assuming we had separate locks for the two operations).

But do we actually know that?  Doing a clock sweep is only a lot of
overhead if it has to pass over many buffers in order to find a good
one, and we don't know the numbers on that.  I think you can sweep a
lot of buffers for the overhead of a single contended lock.

If the sweep and the freelist had separate locks, you would still need
to take the freelist lock to add the buffers discovered during the
sweep to the freelist.
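
Even then, the interaction would look roughly like this (a purely
hypothetical sketch with made-up names, just to show where the
freelist lock still gets taken):

    #include <pthread.h>

    #define NBUFFERS 1024

    static pthread_mutex_t sweep_lock    = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;

    static int freelist[NBUFFERS];
    static int freelist_len = 0;

    /* Publish a buffer found by the sweep onto the free list.  Even
     * with a dedicated sweep lock, this still has to take the
     * freelist lock, so the two locks don't fully decouple. */
    static void
    publish_free_buffer(int buf_id)
    {
        pthread_mutex_lock(&freelist_lock);
        freelist[freelist_len++] = buf_id;
        pthread_mutex_unlock(&freelist_lock);
    }

    static void
    sweep_one(void)
    {
        int     victim;

        pthread_mutex_lock(&sweep_lock);
        /* ... advance the clock hand and find a reusable buffer ... */
        victim = 0;                     /* placeholder for the sweep result */
        pthread_mutex_unlock(&sweep_lock);

        publish_free_buffer(victim);
    }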


> Does anyone know what the overhead of getting a block from the filesystem cache is?

I did tests on this a few days ago.  It took on average 20
microseconds per row to select one row via primary key when everything
was in shared buffers.
When everything was in RAM but not in shared buffers, it took 40
microseconds.  Of the extra 20 microseconds, about 10 went to the
kernel calls to seek and read from the OS cache into shared_buffers,
and the other 10 were some kind of PG overhead, I don't know where.
The timings are per select, not per page, and one select usually reads
two pages, one for the index leaf and one for the table.

This was all single-client usage on a 2.8GHz AMD Opteron.  Not all the
components of the timings will scale equally with additional clients
on additional CPUs of course.  I think the time spent in the kernel
calls to do the seek and read will scale better than most other parts.
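
The harness was essentially just timing single-row primary-key selects
in a loop, something along these lines (not the exact program I used;
the table, column, and key range here are made up):

    /* Rough microbenchmark sketch: time single-row primary-key selects.
     * Build with: cc bench.c -lpq */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("");   /* settings from environment */
        const int   nloops = 100000;
        struct timeval start, stop;
        int         i;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        PQclear(PQprepare(conn, "pk_select",
                          "SELECT val FROM bench_table WHERE id = $1",
                          1, NULL));

        gettimeofday(&start, NULL);
        for (i = 0; i < nloops; i++)
        {
            char        key[32];
            const char *params[1];
            PGresult   *res;

            snprintf(key, sizeof(key), "%d", rand() % 1000000 + 1);
            params[0] = key;
            res = PQexecPrepared(conn, "pk_select", 1, params,
                                 NULL, NULL, 0);
            PQclear(res);
        }
        gettimeofday(&stop, NULL);

        printf("%.1f microseconds per select\n",
               ((stop.tv_sec - start.tv_sec) * 1e6 +
                (stop.tv_usec - start.tv_usec)) / nloops);

        PQfinish(conn);
        return 0;
    }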


> BTW, given our workload I can't see any way of running at debug2 without having a large impact on performance.

As long as you are adding #define BGW_DEBUG and recompiling, you might
as well promote all the DEBUG2 messages in
src/backend/storage/buffer/bufmgr.c to DEBUG1 or LOG.  I think this
will only generate a couple of log messages per bgwriter_delay cycle.
That should be tolerable, especially for
testing purposes.
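
For example, something along these lines (the message text is only a
placeholder; the actual DEBUG2 calls in bufmgr.c say something
different):

    before:   elog(DEBUG2, "bgwriter: ...");
    after:    elog(LOG, "bgwriter: ...");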

Cheers,

Jeff

