Re: Page replacement algorithm in buffer cache - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: Page replacement algorithm in buffer cache
Msg-id CAMkU=1zVSyNRR_AQh4j_w6h37+qyvAz4fY8A+QP8e0dsuBg7Fw@mail.gmail.com
In response to Re: Page replacement algorithm in buffer cache  (Ants Aasma <ants@cybertec.at>)
Responses Re: Page replacement algorithm in buffer cache  (Merlin Moncure <mmoncure@gmail.com>)
List pgsql-hackers
On Friday, March 22, 2013, Ants Aasma wrote:
On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> well if you do a non-locking test first you could at least avoid some
> cases (and, if you get the answer wrong, so what?) by jumping to the
> next buffer immediately.  if the non locking test comes good, only
> then do you do a hardware TAS.
>
> you could in fact go further and dispense with all locking in front of
> usage_count, on the premise that it's only advisory and not a real
> refcount.  so you only then lock if/when it's time to select a
> candidate buffer, and only then when you did a non locking test first.
>  this would of course require some amusing adjustments to various
> logical checks (usage_count <= 0, heh).
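The test-before-TAS idea described above can be sketched roughly as follows. This is a hypothetical, simplified illustration, not PostgreSQL's actual bufmgr.c code: `BufDesc`, `sweep_try_evict`, and the `spinlock` flag here are stand-ins for the real buffer header and its spinlock. The dirty read of `usage_count` is allowed to be stale; a wrong answer just means one buffer is skipped or revisited, which is harmless for an advisory counter.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified buffer header; not PostgreSQL's BufferDesc. */
typedef struct BufDesc
{
    atomic_int  usage_count;    /* advisory popularity counter */
    atomic_flag spinlock;       /* stands in for the buffer-header spinlock */
} BufDesc;

/*
 * One clock-sweep step with a non-locking pre-check: peek at usage_count
 * without taking any lock, and only pay for the hardware TAS (spinlock
 * acquire) when the peek suggests the buffer might be evictable.
 */
static bool
sweep_try_evict(BufDesc *buf)
{
    /* Dirty read, no lock: if the buffer looks popular, decrement and move on. */
    if (atomic_load_explicit(&buf->usage_count, memory_order_relaxed) > 0)
    {
        atomic_fetch_sub_explicit(&buf->usage_count, 1, memory_order_relaxed);
        return false;           /* not a candidate on this pass */
    }

    /* Only now do the expensive TAS to confirm under the lock. */
    while (atomic_flag_test_and_set(&buf->spinlock))
        ;                       /* spin */

    bool evict = (atomic_load(&buf->usage_count) <= 0);
    atomic_flag_clear(&buf->spinlock);
    return evict;
}
```

Note that the unsynchronized `fetch_sub` can drive the counter below zero under races, which is exactly why the `usage_count <= 0` check (rather than `== 0`) comes up above.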

Moreover, if a buffer happens to miss a decrement due to a data race, there's a good chance that it is heavily used and wouldn't need to be evicted soon anyway. (If you arrange it as a read-test-inc/dec-store operation, you will never go out of bounds.) However, clock sweep and usage_count maintenance are not what is causing contention, because that workload is distributed. The issue is pinning and unpinning.
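The bounded read-test-modify-store pattern mentioned above could be sketched with a compare-and-swap loop. This is only an illustration under assumed names (`usage_count_dec_clamped` and `usage_count_inc_clamped` are hypothetical helpers, not PostgreSQL functions); the point is that the test and the store happen on the same observed value, so a lost race costs at most one missed update and the counter can never leave its bounds.

```c
#include <stdatomic.h>

#define BM_MAX_USAGE_COUNT 5    /* upper clamp; 5 matches PostgreSQL's limit */

/*
 * Lock-free, clamped decrement: load the current value, stop if it is
 * already zero, and try to install old-1 with CAS.  A failed CAS reloads
 * 'old' and re-tests the bound, so the counter never goes below zero.
 */
static void
usage_count_dec_clamped(atomic_int *uc)
{
    int old = atomic_load_explicit(uc, memory_order_relaxed);

    while (old > 0 &&
           !atomic_compare_exchange_weak_explicit(uc, &old, old - 1,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
    {
        /* 'old' was refreshed by the failed CAS; the loop re-tests the bound. */
    }
}

/* Symmetric clamped increment: never exceeds BM_MAX_USAGE_COUNT. */
static void
usage_count_inc_clamped(atomic_int *uc)
{
    int old = atomic_load_explicit(uc, memory_order_relaxed);

    while (old < BM_MAX_USAGE_COUNT &&
           !atomic_compare_exchange_weak_explicit(uc, &old, old + 1,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
        ;
}
```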

That is one of multiple issues.  Contention on the BufFreelistLock is another.  I agree that usage_count maintenance is unlikely to become a bottleneck unless one or both of those is fixed first (and maybe not even then).

...

 
The issue with the current buffer management algorithm is that it
seems to scale badly with increasing shared_buffers.

I do not think that this is the case.  Neither of the SELECT-only contention points (pinning/unpinning of index root blocks when all data is in shared_buffers, and BufFreelistLock when it is not) is made worse by increasing shared_buffers, in anything I have seen.  They do scale badly with the number of concurrent processes, though.

The reports of write-heavy workloads not scaling well with shared_buffers do not seem to be driven by the buffer management algorithm, or at least not by the freelist part of it.  They mostly seem to center on the kernel and the I/O controllers.

 Cheers,

Jeff
