Thread: our buffer replacement strategy is kind of lame
While I was poking around at the index-only scans patch, I couldn't help noticing that our buffer replacement algorithm leaves something to be desired. Here's the query again:

select sum(aid) from sample_data a1 where exists
  (select * from pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);

As sample_data is not indexed, this sequentially scans that table and does an index-only scan of pgbench_accounts for each aid. When this is done, here's what pg_buffercache has to say:

rhaas=# select usagecount, sum(1) from pg_buffercache group by 1 order by 1;
 usagecount |  sum
------------+-------
          1 |   136
          2 |    12
          3 |    72
          4 |     7
          5 | 13755
            | 37218
(6 rows)

Only 96 of the 14286 buffers in sample_data are in shared_buffers, despite the fact that we have 37,218 *completely unused* buffers lying around. That sucks, because it means that the sample query did a whole lot of buffer eviction that was completely needless. The ring buffer is there to prevent trashing the shared buffer arena, but here it would have been fine to trash the arena: there wasn't anything there we would have minded losing (or, indeed, anything at all).

On the other hand, the buffer manager has *no problem at all* trashing the buffer arena if we're faulting in pages for an index scan rather than a sequential scan. If you manage to get all of sample_data into memory (by running many copies of the above query in parallel, you can get each one to allocate its own ring buffer, and eventually pull in all the pages), and then run some query that probes an index which is too large to fit in shared_buffers, it cheerfully blows the whole sample_data table out without a second thought. Had you sequentially scanned a big table, of course, there would be some protection, but an index scan can stomp all over everything with complete impunity.

The general problem here is that we are not very smart about handling workloads with weak locality - i.e. the working set is larger than shared buffers.
If the working set fits in shared_buffers, we will keep it there, and it will be strongly protected against eviction. But if the working set *doesn't* fit in shared_buffers, then we fall back on some not-very-smart heuristics to decide what to do: keep pages involved in index scans, ditch those involved in sequential scans.

It seems to me that we need a smarter algorithm. NetBSD has apparently adopted Clock-Pro, and there are some discussions of it being used in Linux as well, though I'm not sure whether that's actually (still?) the case. The paper is here, although I find the description of the algorithm to be approximately as clear as mud:

http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.pdf

One of the important notions which many of the algorithms in the literature seem to hit on in one way or another is that you mustn't go and decide that all of your buffers - or nearly all of your buffers - are hot. That's really equivalent to saying that none of your buffers are hot; and it's also pretty easy to see that such a scenario would be disastrous for our current algorithm, which - if everything in shared_buffers has usage count 5 all the time - will have to clock sweep a long way each time it needs to allocate a buffer. In fact, the more of your buffers want to be hot (and thus hard to evict), the fewer of them should actually be allowed to acquire that status.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
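The pathology described above - every buffer at the maximum usage count, so the clock hand must circle the arena repeatedly before finding a victim - is easy to see in a toy model. This is an illustrative Python sketch, not the actual bufmgr.c code; `sweep_for_victim` is an invented name for the example.

```python
# Toy model of the clock sweep described above (illustrative only).
# Each buffer has a usage count capped at 5; the sweep decrements
# counts as the hand passes until it finds a zero, which becomes the
# eviction victim.

MAX_USAGE = 5

def sweep_for_victim(usage_counts, hand):
    """Return (victim_index, buffers_inspected, new_hand)."""
    n = len(usage_counts)
    inspected = 0
    while True:
        i = hand
        hand = (hand + 1) % n
        inspected += 1
        if usage_counts[i] == 0:
            return i, inspected, hand
        usage_counts[i] -= 1  # the passing hand decays the count

# Worst case: every buffer sits at the maximum count, so the hand must
# decrement all of them five times before any victim appears.
n_buffers = 8
counts = [MAX_USAGE] * n_buffers
_, inspected, _ = sweep_for_victim(counts, 0)
print(inspected)  # 41 = 5 * n_buffers + 1 inspections for one eviction
```

With everything pinned hot, one eviction costs on the order of 5 * shared_buffers inspections, which is the worst case discussed later in the thread.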
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On the other hand, the buffer manager has *no problem at all* trashing
> the buffer arena if we're faulting in pages for an index scan rather
> than a sequential scan. If you manage to get all of sample_data into
> memory (by running many copies of the above query in parallel, you can
> get each one to allocate its own ring buffer, and eventually pull in
> all the pages), and then run some query that probes an index which is
> too large to fit in shared_buffers, it cheerfully blows the whole
> sample_data table out without a second thought. Had you sequentially
> scanned a big table, of course, there would be some protection, but an
> index scan can stomp all over everything with complete impunity.

That's a good observation, and I think we should do this:

* Make an IndexScan use a ring buffer once it has used 32 blocks. The vast majority won't do that, so we avoid overhead on the common path.

* Make a BitmapIndexScan use a ring buffer when we know that the index is larger than 32 blocks. (Ignore upper parts of the tree for that calc.)

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> rhaas=# select usagecount, sum(1) from pg_buffercache group by 1 order by 1;
>  usagecount |  sum
> ------------+-------
>           1 |   136
>           2 |    12
>           3 |    72
>           4 |     7
>           5 | 13755
>             | 37218
> (6 rows)
>
> Only 96 of the 14286 buffers in sample_data are in shared_buffers,
> despite the fact that we have 37,218 *completely unused* buffers lying
> around. That sucks, because it means that the sample query did a
> whole lot of buffer eviction that was completely needless. The ring
> buffer is there to prevent trashing the shared buffer arena, but here
> it would have been fine to trash the arena: there wasn't anything
> there we would have minded losing (or, indeed, anything at all).

You're missing an important point. The SeqScan is measurably faster when using the ring buffer because of the effects of L2 caching on the buffers.

Also, your logic is a little off, since you're doing the scan on an otherwise quiet machine soon after startup, when there are some completely unused buffers. That isn't the typical state of the buffer cache, and so spoiling the cache is not acceptable in the general case. The above case doesn't suck, because it carefully reserved parts of shared buffers for other users at a time when spoiling the cache would be of no benefit to the user - so it worked great from my perspective.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The general problem here is that we are not very smart about handling
> workloads with weak locality - i.e. the working set is larger than
> shared buffers. If the working set fits in shared_buffers, we will
> keep it there, and it will be strongly protected against eviction.
> But if the working set *doesn't* fit in shared buffers, then we fall
> back on some not-very-smart heuristics to decide what to do: keep
> pages involved in index scans, ditch those involved in sequential
> scans.
>
> It seems to me that we need a smarter algorithm. It seems that NetBSD
> has adopted the use of Clock-Pro, and there are some discussions of it
> being used in Linux as well, though I'm not sure whether that's
> actually (still?) the case. Paper is here, although I find the
> description of the algorithm to be approximately clear as mud:
>
> http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.pdf
>
> One of the important notions which many of the algorithms in the
> literature seem to hit on in one way or another is that you mustn't go
> and decide that all of your buffers - or nearly all your buffers - are
> hot. That's really equivalent to saying that none of your buffers are
> hot; and it's also pretty easy to see that such a scenario would be
> disastrous for our current algorithm, which - if everything in
> shared_buffers has usage-count 5 all the time - will have to clock
> sweep a long way each time it needs to allocate a buffer. In fact,
> the more of your buffers want to be hot (and thus hard to evict), the
> fewer of them should actually be allowed to acquire that status.

This is something I've been actively working on. I was on the trail of possible reasons that increasing the size of shared buffers would slow down PostgreSQL.

The worst case behaviour of the current freelist code is that it can take up to 5 * shared_buffers checks before identifying a victim buffer.
That occurs when we have a working set exactly matching the size of shared buffers. There are problems when large portions of shared buffers are very popular, so that the free-ish buffers are a small proportion of the total buffers - this case gets worse if the distribution of buffer allocations is non-random, so you get say 10 free-ish blocks together, followed by N-10 very popular ones. That makes every 11th freelist request take ~O(shared_buffers). These problems will show themselves as sporadic BufFreelistLock contention. Sporadic is the problem here, since it may make the contention hard to measure in aggregate, yet with real impact on performance.

For us to benefit from increased shared_buffers we need an algorithm that is O(k) in its worst case behaviour and O(1) in its best case behaviour - the latter aspect is provided by a correctly functioning bgwriter. At the moment, we do nothing to enforce O(k) behaviour.

The following patch implements O(k) behaviour using a heuristic limit. My first attempt was a fixed heuristic limit, but that was less suitable because there are actually 2 requirements. The value of k must be small to have a measurable impact on the smoothness of StrategyGetBuffer(), but k must also be fairly large to ensure the choice of victim is a relatively good one. So the way to balance those conflicting objectives is to have a progressively increasing search window. And just to repeat, k is not a % of shared buffers, which would still give O(N) behaviour.

I've not been reading "the literature", given the problems we had in 2004/5 regarding patents in this area. I also think that since we rely on the underlying filesystem for caching, we don't have exactly the same problem as other systems.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
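The patch itself is attached rather than quoted, so the following Python sketch is only a guess at the shape of the "progressively increasing search window" idea described above: inspect at most a window of k buffers per round, widen the window when nothing good turns up, and settle for the least-used buffer seen instead of sweeping the whole arena. `bounded_sweep` and its thresholds are invented for illustration.

```python
# Guess at a bounded, O(k) victim search (illustrative only; not the
# attached patch). The window doubles on each failed round, balancing
# the two requirements quoted above: small k for smoothness, large k
# for victim quality.

def bounded_sweep(usage_counts, hand, start_window=16):
    """Return (victim_index, new_hand) after a bounded search."""
    n = len(usage_counts)
    window = start_window
    while True:
        best_i, best_count = None, None
        for _ in range(min(window, n)):
            i = hand
            hand = (hand + 1) % n
            if usage_counts[i] == 0:
                return i, hand            # ideal victim: evict immediately
            usage_counts[i] -= 1
            if best_count is None or usage_counts[i] < best_count:
                best_i, best_count = i, usage_counts[i]
        if best_count is not None and best_count <= 1:
            return best_i, hand           # good-enough victim within k checks
        window *= 2                       # progressively widen the search

victim, _ = bounded_sweep([0, 4, 4, 4], 0)
print(victim)  # 0: a free buffer found within the first window
```

Each call touches at most a few windows' worth of buffers rather than the whole arena, which is where the O(k) bound comes from.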
On Fri, Aug 12, 2011 at 11:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I've not been reading "the literature", given the problems we had in
> 2004/5 regarding patents in this area. I also think that since we rely
> on the underlying filesystem for caching that we don't have exactly
> the same problem as other systems.

Reading the link you gave, I'm unimpressed by Clock-Pro. The measurements made are all on caches of a few MB, and the results show that Clock is roughly the same or better for large caches, though one might read them multiple ways. The zone of interest for us is above 1GB, and the issues we care about are contention, so even if we think Clock-Pro has a chance of improving things, we really don't have any measurements that support the cases we care about.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> You're missing an important point. The SeqScan is measurably faster
> when using the ring buffer because of the effects of L2 caching on
> the buffers.

I hadn't thought of that, but I think that's only true if the relation won't fit in shared_buffers (or whatever portion of shared_buffers is reasonably available, given the other activity on the machine). In this particular case, it's almost 20% faster if the relation is all in shared_buffers; I tested it. I think what's going on here is that initscan() has a heuristic that tries to use a BufferAccessStrategy if the relation is larger than 1/4 of shared_buffers. That's probably a pretty good heuristic in general, but in this case I have a relation which just so happens to be 27.9% of shared_buffers but will still fit. As you say below, that may not be typical in real life, although there are probably data warehousing systems where it's normal to have only one big query running at a time.

> Also, your logic is a little off, since you're doing the scan on an
> otherwise quiet machine soon after startup and there are some
> completely unused buffers. That isn't the typical state of the buffer
> cache and so spoiling the cache is not acceptable in the general case.
> The above case doesn't suck because it carefully reserved parts of
> shared buffers for other users when spoiling the cache would be of no
> benefit to the user, so it worked great from my perspective.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 12, 2011 at 4:36 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> On the other hand, the buffer manager has *no problem at all* trashing
>> the buffer arena if we're faulting in pages for an index scan rather
>> than a sequential scan. If you manage to get all of sample_data into
>> memory (by running many copies of the above query in parallel, you can
>> get each one to allocate its own ring buffer, and eventually pull in
>> all the pages), and then run some query that probes an index which is
>> too large to fit in shared_buffers, it cheerfully blows the whole
>> sample_data table out without a second thought. Had you sequentially
>> scanned a big table, of course, there would be some protection, but an
>> index scan can stomp all over everything with complete impunity.
>
> That's a good observation and I think we should do this
>
> * Make an IndexScan use a ring buffer once it has used 32 blocks. The
> vast majority won't do that, so we avoid overhead on the common path.
>
> * Make a BitmapIndexScan use a ring buffer when we know that the
> index is larger than 32 blocks. (Ignore upper parts of the tree for
> that calc.)

We'd need to think about what happens to the internal pages of the btree, the leaf pages, and then the heap pages from the underlying relation; those probably shouldn't all be treated the same.

Also, I think the tricky part is figuring out when to apply the optimization in the first place. Once we decide we need a ring buffer, a very small one (like 32 blocks) is probably the way to go. But it will be a loser to apply the optimization to data sets that would otherwise have fit in shared_buffers. This is a classic case of the LRU/MRU problem.
You want to evict buffers in LRU fashion until the working set gets larger than you can cache, and then you want to switch to MRU to avoid uselessly caching pages that you'll never manage to revisit before they're evicted. The point of algorithms like Clock-Pro is to try to have the system work that out on the fly, based on the actual workload, rather than using heuristics.

I agree with you that there's no convincing evidence that Clock-Pro would be better for us; I mostly thought it was interesting because the NetBSD and Linux guys seem to find it interesting, and they're managing much larger caches than the ones we're dealing with.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 12, 2011 at 1:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> You're missing an important point. The SeqScan is measurably faster
>> when using the ring buffer because of the effects of L2 caching on
>> the buffers.
>
> I hadn't thought of that, but I think that's only true if the relation
> won't fit in shared_buffers (or whatever portion of shared_buffers is
> reasonably available, given the other activity on the machine). In
> this particular case, it's almost 20% faster if the relation is all in
> shared_buffers; I tested it. I think what's going on here is that
> initscan() has a heuristic that tries to use a BufferAccessStrategy if
> the relation is larger than 1/4 of shared_buffers. That's probably a
> pretty good heuristic in general, but in this case I have a relation
> which just so happens to be 27.9% of shared_buffers but will still
> fit. As you say below, that may not be typical in real life, although
> there are probably data warehousing systems where it's normal to have
> only one big query running at a time.

I think there are reasonable arguments to make:

* prefer_cache = off (default) | on, a table-level storage parameter; =on will disable the use of BufferAccessStrategy

* cache_spoil_threshold, a parameter with default 0.25

Considering the world of very large RAMs in which we now live, some control of the above makes sense.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 6:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> The worst case behaviour of the current freelist code is that it can
> take up to 5 * shared_buffers checks before identifying a victim
> buffer. That occurs when we have a working set exactly matching size
> of shared buffers. There are problems when large portions of shared
> buffers are very popular, so that the free-ish buffers are a small
> proportion of the total buffers - this case gets worse if the
> distribution of buffer allocations is non-random so you get say 10
> free-ish blocks together, followed by N-10 very popular ones. That
> makes every 11th freelist request take ~O(shared_buffers). These
> problems will show themselves as sporadic BufFreelistLock contention.
> Sporadic is the problem here since it may make the contention hard to
> measure in aggregate, yet with real impact on performance.

I've been thinking about this exact same set of problems.

> For us to benefit from increased shared_buffers we need to have an
> algorithm that is O(k) in its worst case behaviour and O(1) in its
> best case behaviour - the latter aspect is provided by a correctly
> functioning bgwriter. At the moment, we do nothing to enforce O(k)
> behaviour.

Check.

> The following patch implements O(k) behaviour using a heuristic limit.
> My first attempt was a fixed heuristic limit, but that was less
> suitable because there are actually 2 requirements. The value of k
> must be small to have a measurable impact on the smoothness of
> StrategyGetBuffer(), but k must also be fairly large to ensure the
> choice of victim is a relatively good one. So the way to balance those
> conflicting objectives is to have a progressively increasing search
> window. And just to repeat, k is not a % of shared buffers, which
> would still give O(N) behaviour.

This is a very interesting idea, and I think it has a lot of potential.
One additional thought I had is that right now I think it's far too easy for buffers to get pushed all the way up to usage count 5, because we bump the reference count every time the buffer is pinned, which can easily happen many times per clock sweep. I think we would be better off having a "used" flag on each buffer. Each pin sets the used bit, but does nothing if it is already set. The usage count is only changed by the clock sweep, which does the following on each buffer:

- If the "used" flag is set, then the flag is cleared and the usage count increases by one, to a maximum of 5.

- If the "used" flag is not set, then the usage count decreases by one.

That way, buffers can't get hot because of a fast flurry of accesses followed by nothing - to get up to usage_count 5, they've actually got to stay interesting for a period of time.

As a side effect, I believe we could get rid of the spinlock that we currently take and release for every buffer the clock sweep hits, because we could regard the usage_count as protected by the BufFreelistLock rather than by the buffer header spinlock; and the "used" flag doesn't require perfect synchronization. We'd only pin the buffer we actually plan to evict (and could recheck the "used" flag before doing so, in case a memory-ordering effect made us miss the fact that it had been recently set).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
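The proposal above can be sketched in a few lines of Python (illustrative only - the real implementation would be C in the buffer manager): a pin merely sets a flag, and only the clock sweep converts flags into usage-count changes.

```python
# Direct sketch of the two sweep rules described above. `Buffer`,
# `pin` and `sweep_visit` are invented names for the example.

MAX_USAGE = 5

class Buffer:
    def __init__(self):
        self.used = False   # set on every pin, cleared by the sweep
        self.usage = 0      # changed only by the clock sweep

def pin(buf):
    buf.used = True         # repeated pins between sweeps add nothing

def sweep_visit(buf):
    """One clock-hand visit, per the two rules above."""
    if buf.used:
        buf.used = False
        buf.usage = min(buf.usage + 1, MAX_USAGE)
    else:
        buf.usage = max(buf.usage - 1, 0)

# A burst of pins between two sweeps counts as one "interesting" tick:
b = Buffer()
for _ in range(100):
    pin(b)                  # hot flurry of accesses...
sweep_visit(b)
print(b.usage)  # 1, not 5: the flurry bought only a single increment
```

To reach the maximum count, a buffer must be touched in each of five successive sweep intervals, which is exactly the "stay interesting for a period of time" property being argued for.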
On Fri, Aug 12, 2011 at 1:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> But it will be
> a loser to apply the optimization to data sets that would otherwise
> have fit in shared_buffers.

Spoiling the cache is a bad plan, even if it makes the current query faster.

I think we should make the optimisation stronger still and use POSIX_FADV_DONTNEED on blocks that fall from the ring buffer. Spoiling the OS cache is a problem as well.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 12, 2011 at 8:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, Aug 12, 2011 at 1:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Aug 12, 2011 at 4:33 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> You're missing an important point. The SeqScan is measurably faster
>>> when using the ring buffer because of the effects of L2 caching on
>>> the buffers.
>>
>> I hadn't thought of that, but I think that's only true if the relation
>> won't fit in shared_buffers (or whatever portion of shared_buffers is
>> reasonably available, given the other activity on the machine). In
>> this particular case, it's almost 20% faster if the relation is all in
>> shared_buffers; I tested it. I think what's going on here is that
>> initscan() has a heuristic that tries to use a BufferAccessStrategy if
>> the relation is larger than 1/4 of shared_buffers. That's probably a
>> pretty good heuristic in general, but in this case I have a relation
>> which just so happens to be 27.9% of shared_buffers but will still
>> fit. As you say below, that may not be typical in real life, although
>> there are probably data warehousing systems where it's normal to have
>> only one big query running at a time.
>
> I think there are reasonable arguments to make
>
> * prefer_cache = off (default) | on a table level storage parameter,
> =on will disable the use of BufferAccessStrategy
>
> * make cache_spoil_threshold a parameter, with default 0.25

Yeah, something like that might make sense. Of course, a completely self-tuning system would be better, but I'm not sure we're going to find one of those.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Aug 12, 2011 at 8:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Fri, Aug 12, 2011 at 1:26 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> But it will be
>> a loser to apply the optimization to data sets that would otherwise
>> have fit in shared_buffers.
>
> Spoiling the cache is a bad plan, even if it makes the current query faster.
>
> I think we should make the optimisation stronger still and use
> POSIX_FADV_DONTNEED on blocks that fall from the ring buffer. Spoiling
> the OS cache is a problem as well.

That would probably be better for really big tables, but it will be worse for those where the entire table fits in the OS cache. Cache spoiling means you're putting things into the cache which won't still be there the next time they're needed. If the data in question fits in cache (and I don't think we can regard that as uncommon, particularly for the OS cache, which can be half a terabyte), then you don't want the anti-spoiling logic to kick in.

One thing we could consider for large sequential scans is doing POSIX_FADV_SEQUENTIAL before beginning the scan (and maybe one more time if the scan wraps). That's basically throwing the problem of whether to go LRU or MRU back on the OS, but the OS may well have a better idea whether there's enough cache available to hold the whole thing than we do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
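For reference, the two hints discussed in this exchange map onto posix_fadvise(). Here is a hedged, Unix-only Python sketch (Python exposes the call as os.posix_fadvise; `read_sequentially` is an invented helper name, and PostgreSQL itself would make the equivalent posix_fadvise() calls from C):

```python
import os

def read_sequentially(path):
    """Read a file front to back with the kernel hints discussed above."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Advise the kernel that the file will be read sequentially, so
        # it can ramp up readahead (length 0 means "to end of file").
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        data = b"".join(iter(lambda: os.read(fd, 65536), b""))
        # Afterwards, drop the pages so a big scan doesn't spoil the
        # OS cache - the analogue of the ring-buffer suggestion.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return data
    finally:
        os.close(fd)
```

Note the trade-off argued above applies directly: issuing POSIX_FADV_DONTNEED unconditionally is a loss whenever the file would otherwise have stayed resident in the OS cache.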
On Fri, Aug 12, 2011 at 01:28:49PM +0100, Simon Riggs wrote:
> I think there are reasonable arguments to make
>
> * prefer_cache = off (default) | on a table level storage parameter,
> =on will disable the use of BufferAccessStrategy
>
> * make cache_spoil_threshold a parameter, with default 0.25
>
> Considering the world of very large RAMs in which we now live, some
> control of the above makes sense.

As long as we are discussing cache settings for tables, I have a client who would like to be able to lock specific tables and indexes into cache, as they have strict response-time requirements for particular queries. At the moment they are running postgres with a tablespace on ramfs and taking frequent backups, but this is not optimal.

-dg

--
David Gould       daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Only 96 of the 14286 buffers in sample_data are in shared_buffers,
> despite the fact that we have 37,218 *completely unused* buffers lying
> around. That sucks, because it means that the sample query did a
> whole lot of buffer eviction that was completely needless. The ring
> buffer is there to prevent trashing the shared buffer arena, but here
> it would have been fine to trash the arena: there wasn't anything
> there we would have minded losing (or, indeed, anything at all).

I don't disagree with the general thrust of your point, but I just wanted to point out one case where not using free buffers, even though they're available, might make sense:

If you execute a large batch delete or update, or even just set lots of hint bits, you'll dirty a lot of buffers. The ring buffer forces the query that is actually dirtying all these buffers to also do the I/O to write them out. Otherwise you leave them behind to slow down other queries. This was one of the problems with the old vacuum code, which the ring buffer replaced: it left behind lots of dirtied buffers for other queries to do I/O on.

--
greg
On Fri, Aug 12, 2011 at 10:51 PM, Greg Stark <stark@mit.edu> wrote:
> On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Only 96 of the 14286 buffers in sample_data are in shared_buffers,
>> despite the fact that we have 37,218 *completely unused* buffers lying
>> around. That sucks, because it means that the sample query did a
>> whole lot of buffer eviction that was completely needless. The ring
>> buffer is there to prevent trashing the shared buffer arena, but here
>> it would have been fine to trash the arena: there wasn't anything
>> there we would have minded losing (or, indeed, anything at all).
>
> I don't disagree with the general thrust of your point, but I just
> wanted to point out one case where not using free buffers even though
> they're available might make sense:
>
> If you execute a large batch delete or update or even just set lots of
> hint bits you'll dirty a lot of buffers. The ring buffer forces the
> query that is actually dirtying all these buffers to also do the i/o
> to write them out. Otherwise you leave them behind to slow down other
> queries. This was one of the problems with the old vacuum code which
> the ring buffer replaced. It left behind lots of dirtied buffers for
> other queries to do i/o on.

Interesting point.

After thinking about this some more, I'm coming around to the idea that we need to distinguish between:

1. Ensuring a sufficient supply of evictable buffers, and

2. Evicting a buffer.

The second obviously needs to be done only when needed, but the first one should really be done as background work. Currently, the clock sweep serves both functions, and that's not good. We shouldn't ever let ourselves get to the point where there are no buffers at all with reference count zero, so that the next guy who needs a buffer has to spin the clock hand around until the reference counts get low enough.
Maintaining a sufficient supply of refcount-zero buffers should be done as a background task; and possibly we ought to put them all in a linked list so that the next guy who needs a buffer can just pop one off.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
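The division of labour proposed above - a background task keeps a list of usage-count-zero buffers topped up, and the foreground allocator just pops one off - might look roughly like this Python sketch. `refill_freelist` and `pop_victim` are invented names, not PostgreSQL functions, and the recheck in `pop_victim` handles an entry going stale because the buffer was pinned again after being listed.

```python
from collections import deque

def refill_freelist(usage_counts, freelist, hand, target):
    """Background work: scan forward, listing evictable buffers."""
    n = len(usage_counts)
    scanned = 0
    while len(freelist) < target and scanned < n:
        if usage_counts[hand] == 0:
            freelist.append(hand)
        hand = (hand + 1) % n
        scanned += 1
    return hand

def pop_victim(usage_counts, freelist):
    """Foreground allocation: pop entries, skipping any gone stale."""
    while freelist:
        i = freelist.popleft()
        if usage_counts[i] == 0:    # still evictable?
            return i
        # stale: the buffer got hot again after being listed; skip it
    return None                      # empty list: fall back to a sweep

counts = [3, 0, 0, 5, 0]
fl = deque()
refill_freelist(counts, fl, 0, target=3)   # background work lists 1, 2, 4
counts[1] = 2                              # buffer 1 gets pinned again
print(pop_victim(counts, fl))  # 2: the stale entry for buffer 1 is skipped
```

The foreground path is O(1) when the list is well stocked; the cost of passing over unevictable buffers is paid in the background instead.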
On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> and possibly we ought to put them all in a
> linked list so that the next guy who needs a buffer can just pop one

The whole point of the clock sweep algorithm is to approximate an LRU without needing to maintain a linked list. The problem with a linked list is that you need to serialize access to it, so every time you reference a buffer you need to wait on a lock for the list so you can move that buffer around in it.

It does kind of seem like your numbers indicate we're missing part of the picture, though. The idea with the clock sweep algorithm is that you keep approximately 1/nth of the buffers at each of the n values. If we're allowing nearly all the buffers to reach a reference count of 5, then you're right that we've lost any information about which buffers have been referenced most recently. I wonder if we're supposed to be doing something like advancing the clock hand each time we increment a reference count as well?

--
greg
On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote:
> On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> and possibly we ought to put them all in a
>> linked list so that the next guy who needs a buffer can just pop one
>
> The whole point of the clock sweep algorithm is to approximate an LRU
> without needing to maintain a linked list. The problem with a linked
> list is that you need to serialize access to it so every time you
> reference a buffer you need to wait on a lock for the list so you can
> move that buffer around in the list.

Right, but the same doesn't apply to what I'm talking about. You just put each guy into the linked list when his reference count goes down to 0. When you want to allocate a buffer, you pop somebody off. If his reference count is still 0, you forget about him and pop the next guy, and keep going until you find a suitable victim.

The problem with just running the clock sweep every time you need a victim buffer is that you may have to pass over a large number of unevictable buffers before you find someone you can actually kick out. That's work that should really be getting done in the background, not at the absolute last moment when we discover, hey, we need a buffer.

> It does kind of seem like your numbers indicate we're missing part of
> the picture though. The idea with the clock sweep algorithm is that
> you keep approximately 1/nth of the buffers with each of the n values.
> If we're allowing nearly all the buffers to reach a reference count of
> 5 then you're right that we've lost any information about which
> buffers have been referenced most recently.

It doesn't seem right to have 1/nth of the buffers at each of n values, because that might not match the actual workload. For example, you might have lightning-hot buffers that fill 50% of your available cache. Trying to artificially force some of those down to usage count 4 or 3 doesn't actually get you anywhere.
In fact, AFAICS, there's no direct reason to care about how many buffers have usage count 1 vs. usage count 5. What IS important for performance is that there are enough with usage count 0. Any other distinction is only useful to the extent that it allows us to make a better decision about which buffers we should push down to 0 next.

> I wonder if we're supposed to be doing something like advancing the
> clock hand each time we increment a reference count as well?

That precise thing seems far too expensive, but I agree that something's missing. Right now we decrease usage counts when the clock hand moves (which is driven by buffer allocation), and we increase them when buffers are pinned (which is driven by access to already-resident buffers). I feel like that's comparing apples and oranges. If some subset of shared_buffers is very hot relative to the uncached pages, then everything's going to converge to 5 (or whatever arbitrary maximum we care to set in lieu of 5).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote:
>> On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> and possibly we ought to put them all in a
>>> linked list so that the next guy who needs a buffer can just pop one
>>
>> The whole point of the clock sweep algorithm is to approximate an LRU
>> without needing to maintain a linked list. The problem with a linked
>> list is that you need to serialize access to it so every time you
>> reference a buffer you need to wait on a lock for the list so you can
>> move that buffer around in the list.
>
> Right, but the same doesn't apply to what I'm talking about. You just
> put each guy into the linked list when his reference count goes down
> to 0. When you want to allocate a buffer, you pop somebody off. If
> his reference count is still 0, you forget about him and pop the next
> guy, and keep going until you find a suitable victim.
>
> The problem with just running the clock sweep every time you need a
> victim buffer is that you may have to pass over a large number of
> unevictable buffers before you find someone you can actually kick out.
> That's work that should really be getting done in the background, not
> at the absolute last moment when we discover, hey, we need a buffer.

I think Greg is right here. Those suggested changes just bring back the LRU. If you keep a separate list then you need to serialize access to it.

usage_count == 0 and "unevictable" are dynamic states, so if the bgwriter sees those states, they can change before anyone uses the buffer. The only state which is unchangeable is a recently filled block, such as from a COPY, which is why I originally suggested a filled-block list - though I don't think we need it now, because of the buffer cycle strategy.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I agree that
> something's missing.

I'm quoting you completely out of context here, but yes, something is missing.

We can't credibly do one test on usage count in shared buffers and then start talking about how buffer management is all wrong. My own patch requires more test evidence before we can commit it, which is why I hadn't published it before now. I'll endeavour to do that now it's on the table.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Aug 14, 2011 at 6:57 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote: >>> On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> and possibly we ought to put them all in a >>>> linked list so that the next guy who needs a buffer can just pop one >>> >>> The whole point of the clock sweep algorithm is to approximate an LRU >>> without needing to maintain a linked list. The problem with a linked >>> list is that you need to serialize access to it so every time you >>> reference a buffer you need to wait on a lock for the list so you can >>> move that buffer around in the list. >> >> Right, but the same doesn't apply to what I'm talking about. You just >> put each guy into the linked list when his reference count goes down >> to 0. When you want to allocate a buffer, you pop somebody off. If >> his reference count is still 0, you forget about him and pop the next >> guy, and keep going until you find a suitable victim. >> >> The problem with just running the clock sweep every time you need a >> victim buffer is that you may have to pass over a large number of >> unevictable buffers before you find someone you can actually kick out. >> That's work that should really be getting done in the background, not >> at the absolute last moment when we discover, hey, we need a buffer. > > I think Greg is right here. Those suggested changes just bring back the LRU. Since the problem with LRU is that it requires moving each buffer to the front of the list every time it's touched, and since the idea that I proposed does not require that, I don't know what you mean by this. > If you keep a separate list then you need to serialize access to it. That is true. However, there are some reasons to think it would be better than what we're doing right now. 
Right now, we acquire the BufFreelistLock, and then take and release some number of spinlocks. It seems fairly easy to construct cases where that number is routinely over 100; even on my laptop, I could construct cases where it was routinely in the 50-100 range. A linked list that we can just dequeue things from could be protected by a single spinlock, which would hopefully be somewhat faster. (In the interest of credit where credit is due, this is pretty much the point Jim Nasby has been making every time this argument comes up, and I've been dismissing it, but now that I understand how much work gets done in that clock sweep code, I'm starting to think he might be right.)

However, if it turns out that there's still too much contention over that spinlock, we can probably fix that by maintaining N lists instead of one, distributing buffers between them in round-robin fashion, and having each backend choose a list based on its backend ID modulo N. The testing I've done so far seems to indicate that spinlock contention doesn't really become noticeable until you have 32-40 CPUs pounding hard on the same spinlock, so even N = 4 is probably enough to make the problem go away. But I don't even think we're going to have to go that far right at the outset, because there's another major source of contention in the buffer eviction path: the buffer mapping locks.

> usage_count == 0 and "unevictable" are dynamic states, so if the
> bgwriter sees those states they can change before anyone uses the
> buffer.

Yep. That's already a problem. When we pin the buffer to be evicted, we're not yet holding the relevant buffer mapping locks. By the time we do, someone else can have pinned the page. So there's already retry logic here. In theory, you can imagine a backend getting starved for an arbitrarily long period of time, because it keeps picking a victim buffer that someone else wrenches away before we actually manage to kick it out.
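A minimal sketch of the pop-and-revalidate scheme described above (the class and field names are invented for illustration, and a Python lock stands in for the spinlock; this is not code from any posted patch):

```python
import threading
from collections import deque

class Buffer:
    """Toy stand-in for a shared buffer descriptor."""
    def __init__(self, buf_id):
        self.id = buf_id
        self.usage_count = 0
        self.pinned = False

class VictimList:
    """Side-list of eviction candidates: buffers are enqueued when their
    usage count drops to zero; a consumer pops entries until it finds one
    that is still evictable, since the state may have changed since the
    buffer was listed."""
    def __init__(self):
        self.lock = threading.Lock()   # stand-in for a single spinlock
        self.queue = deque()

    def note_zero(self, buf):
        # called when a buffer's usage count reaches 0
        with self.lock:
            self.queue.append(buf)

    def pop_victim(self):
        with self.lock:
            while self.queue:
                buf = self.queue.popleft()
                # revalidate: forget buffers that were touched again
                if buf.usage_count == 0 and not buf.pinned:
                    return buf
            return None
```

The N-way variant is just N instances of this structure, with each backend choosing one by backend ID modulo N.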
I haven't yet seen any evidence that that's a problem in practice, but if it is, this idea will make it worse. It seems hard to know without testing it, though. The big problem with this idea is that it pretty much requires that the work you mentioned in one of your other emails - separating the background writer and checkpoint machinery into two separate processes - to happen first. So I'm probably going to have to code that up to see whether this works. If you're planning to post that patch soon I'll wait for it. Otherwise, I'm going to have to cobble together something that is at least good enough for testing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Aug 13, 2011 at 09:40:15PM +0100, Greg Stark wrote: > On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > > and possibly we ought to put them all in a > > linked list so that the next guy who needs a buffer can just pop one > > The whole point of the clock sweep algorithm is to approximate an LRU > without needing to maintain a linked list. The problem with a linked > list is that you need to serialize access to it so every time you > reference a buffer you need to wait on a lock for the list so you can > move that buffer around in the list. Well, there are such things as lock-free linked lists. Whether they'd help here is the question though. http://www.cse.yorku.ca/~ruppert/papers/lfll.pdf Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
On Sun, Aug 14, 2011 at 1:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: > The big problem with this idea is that it pretty much requires that > the work you mentioned in one of your other emails - separating the > background writer and checkpoint machinery into two separate processes > - to happen first. So I'm probably going to have to code that up to > see whether this works. If you're planning to post that patch soon > I'll wait for it. Otherwise, I'm going to have to cobble together > something that is at least good enough for testing. No, the big problem with the idea is that regrettably it is just an idea on your part and has no basis in external theory or measurement. I would not object to you investigating such a path and I think you are someone that could invent something new and original there, but it seems much less likely to be fruitful, or at least would require significant experimental results to demonstrate an improvement in a wide range of use cases to the rest of the hackers. As to you not being able to work on your idea until I've split bgwriter/checkpoint, that's completely unnecessary, and you know it. A single ifdef is sufficient there, if at all. The path I was working on (as shown in the earlier patch) was to apply some corrections to the existing algorithm to reduce its worst case behaviour. That's something I've seen mention of people doing for RedHat kernels. Overall, I think a minor modification is the appropriate path. If Linux or other OS already use ClockPro then we already benefit from it. It seems silly to track blocks that recently left shared buffers when they are very likely still actually in memory in the filesystem cache. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Aug 14, 2011 at 1:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Aug 14, 2011 at 6:57 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Sat, Aug 13, 2011 at 4:40 PM, Greg Stark <stark@mit.edu> wrote: >>>> On Sat, Aug 13, 2011 at 8:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>>> and possibly we ought to put them all in a >>>>> linked list so that the next guy who needs a buffer can just pop one >>>> >>>> The whole point of the clock sweep algorithm is to approximate an LRU >>>> without needing to maintain a linked list. The problem with a linked >>>> list is that you need to serialize access to it so every time you >>>> reference a buffer you need to wait on a lock for the list so you can >>>> move that buffer around in the list. >>> >>> Right, but the same doesn't apply to what I'm talking about. You just >>> put each guy into the linked list when his reference count goes down >>> to 0. When you want to allocate a buffer, you pop somebody off. If >>> his reference count is still 0, you forget about him and pop the next >>> guy, and keep going until you find a suitable victim. >>> >>> The problem with just running the clock sweep every time you need a >>> victim buffer is that you may have to pass over a large number of >>> unevictable buffers before you find someone you can actually kick out. >>> That's work that should really be getting done in the background, not >>> at the absolute last moment when we discover, hey, we need a buffer. >> >> I think Greg is right here. Those suggested changes just bring back the LRU. > > Since the problem with LRU is that it requires moving each buffer to > the front of the list every time it's touched, and since the idea that > I proposed does not require that, I don't know what you mean by this. > >> If you keep a separate list then you need to serialize access to it. > > That is true. 
However, there are some reasons to think it would be > better than what we're doing right now. Right now, we acquire the > BufFreelistLock, and then take and release some number of spinlocks. > It seems fairly easy to construct cases where that number is routinely > over 100; even on my laptop, I could construct cases where it was > routinely 50-100 range. A linked list that we can just dequeue things > from could be protected by a single spinlock, which would hopefully be > somewhat faster. (In the interest of credit where credit is due, this > is pretty much the point Jim Nasby has been making every time this > argument comes up, and I've been dismissing it, but now that I > understand how much work gets done in that clock sweep code, I'm > starting to think he might be right.) > > However, if it turns out that that there's still too much contention > over that spinlock, we can probably fix that by maintaining N lists > instead of one, distributing buffers between them in round-robin > fashion, and having each backend choose a list based on its backend ID > modulo N. The testing I've done so far seems to indicate that > spinlock contention doesn't really become noticeable until you have > 32-40 CPUs pounding hard on the same spinlock, so even N = 4 is > probably enough to make the problem go away. But I don't even think > we're going to have to go that far right at the outset, because > there's another major source of contention in the buffer eviction > path: the buffer mapping locks. All of the above presumes that we have a list of best-next-victims. We don't have that list, so describing how we would optimise access to it means nothing. Yes, if we had it, we could dequeue them quickly - that is not the problem. The problem is that creating the best-next-victim list AND making it accurate is the act that causes contention. Yes, we can maintain an inaccurate list cheaply. 
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Aug 14, 2011 at 10:35 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sun, Aug 14, 2011 at 1:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The big problem with this idea is that it pretty much requires that
>> the work you mentioned in one of your other emails - separating the
>> background writer and checkpoint machinery into two separate processes
>> - to happen first. So I'm probably going to have to code that up to
>> see whether this works. If you're planning to post that patch soon
>> I'll wait for it. Otherwise, I'm going to have to cobble together
>> something that is at least good enough for testing.
>
> No, the big problem with the idea is that regrettably it is just an
> idea on your part and has no basis in external theory or measurement.
> I would not object to you investigating such a path and I think you
> are someone that could invent something new and original there, but it
> seems much less likely to be fruitful, or at least would require
> significant experimental results to demonstrate an improvement in a
> wide range of use cases to the rest of the hackers.

All right, well, I'll mull over whether it's worth pursuing. Unless I or someone else comes up with an idea I like better, I think it probably is.

> As to you not being able to work on your idea until I've split
> bgwriter/checkpoint, that's completely unnecessary, and you know it. A
> single ifdef is sufficient there, if at all.

Hmm. Well, it might be unnecessary, but if I knew it were unnecessary, I wouldn't have said that I thought it was necessary.

> The path I was working on (as shown in the earlier patch) was to apply
> some corrections to the existing algorithm to reduce its worst case
> behaviour. That's something I've seen mention of people doing for
> RedHat kernels.

Yeah. Your idea is appealing because it bounds the amount of time spent searching for a victim. There is some chance that you might kick out a really hot buffer if there is a long series of such buffers in a row.
Without testing, I don't know whether that's a serious problem or not.

> Overall, I think a minor modification is the appropriate path. If
> Linux or other OS already use ClockPro then we already benefit from
> it. It seems silly to track blocks that recently left shared buffers
> when they are very likely still actually in memory in the filesystem
> cache.

You may be right. Basically, my concern is that buffer eviction is too slow. On a 32-core system, it's easy to construct a workload where the whole system bottlenecks on the rate at which buffers can be evicted and replaced - not because the system is fundamentally incapable of copying data around that quickly, but because everything piles up behind BufFreelistLock, and to a lesser extent the buffer mapping locks. Your idea may help with that, but I doubt that it's a complete solution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Simon Riggs <simon@2ndQuadrant.com> writes: > On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> I agree that something's missing. > I'm quoting you completely out of context here, but yes, something is missing. > We can't credibly do one test on usage count in shared buffers and > then start talking about how buffer management is all wrong. More generally: the originally presented facts suggest that there might be value in improving the "buffer access strategy" code that keeps particular operations from using all of shared_buffers. It seems to me to be a giant and unsubstantiated leap from that to the conclusion that there's anything wrong with the clock sweep algorithm. Moreover, several of the proposed "fixes" amount to reversion to methods that we already know are less good than the clock sweep, because we already tried them years ago. So I've been quite unimpressed with the quality of discussion in this thread. regards, tom lane
On Sun, Aug 14, 2011 at 1:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> On Sat, Aug 13, 2011 at 11:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I agree that something's missing.
>
>> I'm quoting you completely out of context here, but yes, something is missing.
>
>> We can't credibly do one test on usage count in shared buffers and
>> then start talking about how buffer management is all wrong.
>
> More generally: the originally presented facts suggest that there might
> be value in improving the "buffer access strategy" code that keeps
> particular operations from using all of shared_buffers.

Possibly. Since I've realized that we only switch to a ring buffer when the size of the relation is more than 25% of shared buffers, I'm less concerned about that problem. My test case only demonstrated a ~20% performance improvement from getting rid of the ring buffer, and that was with a relation that happened to be 27% of the size of shared_buffers, so it was a bit unlucky. I think it'd be interesting to try to make this smarter in some way, but it's not bugging me as much now that I've realized that I was unlucky to fall down that particular well.

> It seems to me
> to be a giant and unsubstantiated leap from that to the conclusion that
> there's anything wrong with the clock sweep algorithm. Moreover,
> several of the proposed "fixes" amount to reversion to methods that
> we already know are less good than the clock sweep, because we already
> tried them years ago. So I've been quite unimpressed with the quality
> of discussion in this thread.

Well, here's the problem I'm worried about: if 99% of shared_buffers is filled with a very hot working set, every new page that gets brought in will need to scan, on average, 100 buffers before finding something to evict. That seems slow.
Simon is proposing to bound the really bad case where you flip through the entire ring multiple times before you find a buffer, and that may well be worth doing. But I think even scanning 100 buffers every time you need to bring something in is too slow. What's indisputable is that a SELECT-only workload which is larger than shared_buffers can be very easily rate-limited by the speed at which BufFreelistLock can be taken and released. If you have a better idea for solving that problem, I'm all ears... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
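The "scan 100 buffers on average" figure follows from a simple geometric argument, sketched below (a deliberate simplification: it treats the fraction of unevictable buffers as fixed and ignores the decrements the sweep itself performs):

```python
def expected_sweep_length(hot_fraction):
    """If a fraction p of buffers is unevictable at any instant, each
    buffer the clock hand touches is a usable victim with probability
    (1 - p), so the number of buffers touched per eviction is
    geometrically distributed with mean 1 / (1 - p)."""
    if not 0 <= hot_fraction < 1:
        raise ValueError("hot_fraction must be in [0, 1)")
    return 1.0 / (1.0 - hot_fraction)
```

At 99% hot this gives 100 buffers per eviction, matching the estimate above; even at 90% it is still 10 buffers, and therefore 10 spinlock acquire/release cycles, all performed while holding BufFreelistLock.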
On 08/12/2011 10:51 PM, Greg Stark wrote:
> If you execute a large batch delete or update or even just set lots of
> hint bits you'll dirty a lot of buffers. The ring buffer forces the
> query that is actually dirtying all these buffers to also do the i/o
> to write them out. Otherwise you leave them behind to slow down other
> queries. This was one of the problems with the old vacuum code which
> the ring buffer replaced. It left behind lots of dirtied buffers for
> other queries to do i/o on.

I ran into the other side of this when trying to use Linux's relatively new dirty_background_bytes tunable to constrain the OS write cache. When running with the current VACUUM ring buffer code, if there isn't also a large OS write cache backing it, performance suffers quite a bit. I've been adding test rigging to quantify this into pgbench-tools recently, and I fear that one of the possible outcomes could be pressure toward making the database's ring buffer bigger. Just a theory--waiting on some numbers still.

Anyway, I think every idea thrown out here so far needs about an order of magnitude more types of benchmarking test cases before it can be evaluated at all. The case I just mentioned is a good example of why. Every other test I ran showed a nice improvement with the kernel tuning I tried, but VACUUM was massively detuned in the process.

I have an entire file folder filled with notes on ways to reorganize the buffer cache, from my background writer work for 8.3. In my mind they're all sitting stuck behind the problem of getting enough standardized benchmark workloads to have a performance regression suite. The idea of touching any of this code without a look at a large number of different tests is a bit optimistic. What I expect to happen here is that all initially proposed changes will end up tuning for one workload at the expense of other, unmeasured ones.
-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
Greg Smith <greg@2ndQuadrant.com> wrote:
> Anyway, I think every idea thrown out here so far needs about an
> order of magnitude more types of benchmarking test cases before it
> can be evaluated at all.

Right. I'm very excited about all the optimizations going in, and can't see where the ones committed so far are very likely to cause problems for workloads other than those tested, but here we're heading into territory where the performance farm is very desperately needed.

That said, I'll throw out one more idea, as something that worked very well for me in a similar situation, and played well with external caching which knew lots about the storage media but nothing about the database side. It's very similar to the clock sweep algorithm, but uses a weighted average rather than a simple count. It didn't tend to leave as many buffers clustered at one end or the other, and the gradation did seem to be useful.

I started out this post trying to describe it more generally, but it all came out sounding too vague and hand-wavy; so I'll give a description with more particulars which could, of course, be tweaked without tossing the concept. This particular description modifies the techniques I've used to try to better fit the particulars of PostgreSQL. It may fall down on contention issues, but since it benchmarked better for me than the alternatives, I thought it might at least help spin off other useful ideas.

(1) Each buffer would have a one-byte count, incremented on access, similar to the current count. Of course, 255 would not wrap on access, but be left unchanged.

(2) We would have 256 int counters for how many buffers were available with each access count. These would be maintained when the access counts changed.

(3) While sweeping, we wouldn't decrement the access counters for buffers we couldn't use; rather, there would be a boolean flag to say whether to divide the access counts for non-chosen buffers by two.
(4) The "reduce" flag would be set when any access count goes to 255, and cleared after one complete sweep through the buffers. (So, obviously, we note the location when we set the flag.) (5) To find the next buffer to evict, we would find out what is the lowest count in use, and sweep forward until we found one. (6) We would have a background process doing this sweeping and count reduction which would try to keep a few evicted buffers in a queue for quick pickup. This queue would be the only source for getting a buffer. Getting a buffer would always be a very lightweight operation if there's something in the queue. The background task would try very hard to keep the queue non-empty, only coming to rest when the queue was "full" -- whatever that was chosen to mean (possibly based on the sort of adaptive calculations currently used by the background writer). There are a lot of interesting ways this could interact with the background writer. One of the more intriguing to me would be for the "sweep" process to estimate what current access count levels will need to be used on its next sweep and queue up any at those levels which are dirty for the background writer. This is vaguely conceptually similar to the "wash marker" used in some LRU lists. -Kevin
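Points (1) through (6) can be put in a rough single-threaded sketch (the names are invented; as simplifications of the description above, the reduce flag here is never cleared after a full sweep, and an evicted slot is simply reset to a count of zero):

```python
class WeightedClock:
    """Toy sketch of the weighted clock scheme: a one-byte access count
    per buffer (1), a histogram of counts (2), halving of non-chosen
    counts during a sweep when the flag is set (3)/(4), and eviction of
    the lowest count in use (5)."""
    MAX = 255

    def __init__(self, n):
        self.counts = [0] * n      # (1) one-byte access count per buffer
        self.hist = [0] * 256      # (2) buffers at each access count
        self.hist[0] = n
        self.reduce = False        # (3)/(4) halve counts during next sweep
        self.hand = 0

    def access(self, i):
        c = self.counts[i]
        if c < self.MAX:           # 255 does not wrap; it stays put
            self.hist[c] -= 1
            self.hist[c + 1] += 1
            self.counts[i] = c + 1
            if c + 1 == self.MAX:
                self.reduce = True # (4) saturated: schedule reduction

    def evict_one(self):
        # (5) find the lowest access count in use, sweep forward to it
        lowest = next(c for c in range(256) if self.hist[c] > 0)
        n = len(self.counts)
        for _ in range(n):
            i = self.hand
            self.hand = (self.hand + 1) % n
            c = self.counts[i]
            if c <= lowest:        # victim found: reset its slot
                self.hist[c] -= 1
                self.hist[0] += 1
                self.counts[i] = 0
                return i
            if self.reduce:        # (3) halve non-chosen counts
                self.hist[c] -= 1
                self.hist[c // 2] += 1
                self.counts[i] = c // 2
        return None
```

The background process of point (6) would simply call evict_one() ahead of demand and keep the results in a queue for backends to pop.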
On Aug 13, 2011, at 3:40 PM, Greg Stark wrote:
> It does kind of seem like your numbers indicate we're missing part of
> the picture though. The idea with the clock sweep algorithm is that
> you keep approximately 1/nth of the buffers with each of the n values.
> If we're allowing nearly all the buffers to reach a reference count of
> 5 then you're right that we've lost any information about which
> buffers have been referenced most recently.

One possible missing piece here is that OS clock sweeps depend on the clock hand to both increment and decrement the usage count. The hardware sets a bit any time a page is accessed; as the clock sweeps, it increases the usage count if the bit is set and decreases it if it's clear. I believe someone else in the thread suggested this, and I definitely think it's worth an experiment. Presumably this would also ease some lock contention issues.

There is another piece that might be relevant... many (most?) OSes keep multiple lists of pages. FreeBSD for example contains these page lists (http://www.freebsd.org/doc/en/articles/vm-design/article.html). Full description follows, but I think the biggest take-away is that there is a difference in how pages are handled once they are no longer active, based on whether the page is dirty or not.

Active: These pages are actively in use and are not currently under consideration for eviction. This is roughly equivalent to all of our buffers with a usage count of 5. When an active page's usage count drops to its minimum value, it will get unmapped from process space and moved to one of two queues:

Inactive: DIRTY pages that are eligible for eviction once they've been written out.

Cache: CLEAN pages that may be immediately reclaimed.

Free: A small set of pages that are basically the tail of the Cache list. The OS *must* maintain some pages on this list to support memory needed during interrupt handling. The size of this list is typically kept very small, and I'm not sure if non-interrupt processing will pull from this list.

It's important to note that the OS can pull a page back out of the Inactive and Cache lists back into Active very cheaply.

I think there are two interesting points here. First: after a page has been determined to no longer be in active use, it goes into Inactive or Cache based on whether it's dirty. ISTM that allows for much better scheduling of the flushing of dirty pages. That said, I'm not sure how much that would help us due to checkpoint requirements.

Second: AFAIK only the Active list has a clock sweep. I believe the others are LRUs (the mentioned URL refers to them as queues). I believe this works well because if a page faults it just needs to be removed from whichever queue it is in, added to the Active queue, and mapped back into process space.

-- 
Jim C. Nasby, Database Architect
jim@nasby.net
512.569.9461 (cell)
http://jim.nasby.net
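The queue transitions in that description can be summarized in a few lines (a simplified sketch of the FreeBSD behavior as described above, not of any PostgreSQL code; the function names are mine):

```python
from enum import Enum, auto

class Q(Enum):
    ACTIVE = auto()     # in use, clock-swept, not an eviction candidate
    INACTIVE = auto()   # dirty pages awaiting writeback before reclaim
    CACHE = auto()      # clean pages, immediately reclaimable
    FREE = auto()       # small reserve for interrupt-time allocation

def deactivate(dirty):
    """Where a page goes when its usage count bottoms out: dirty pages
    queue for writeback, clean pages become immediately reclaimable."""
    return Q.INACTIVE if dirty else Q.CACHE

def writeback_complete(q):
    # once an inactive page is flushed it is clean, so it joins Cache
    return Q.CACHE if q is Q.INACTIVE else q

def fault(q):
    # pulling a page back from Inactive or Cache is cheap: just requeue
    return Q.ACTIVE
```

The key property for the discussion here is the first function: eviction candidates are split by dirtiness up front, so writeback scheduling never has to hunt through a mixed pool.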
On Sun, Aug 14, 2011 at 7:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Simon is proposing to bound the
> really bad case where you flip through the entire ring multiple times
> before you find a buffer, and that may well be worth doing. But I
> think even scanning 100 buffers every time you need to bring something
> in is too slow. What's indisputable is that a SELECT-only workload
> which is larger than shared_buffers can be very easily rate-limited by
> the speed at which BufFreelistLock can be taken and released. If you
> have a better idea for solving that problem, I'm all ears...

I felt we were on the right track here for a while.

Does anyone dispute that BufFreelistLock is a problem? Shared buffer replacement is *not* O(1) and it definitely needs to be.

Does anyone have a better idea for reducing BufFreelistLock contention? Something simple that will work for 9.2?

What steps are there between here and committing the freelist_ok.v2.patch?

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes: > Does anyone have a better idea for reducing BufFreelistLock > contention? Something simple that will work for 9.2? Get rid of the freelist? Once shared buffers are full, it's just about useless anyway. But you'd need to think about the test cases that you pay attention to, as there might be scenarios where it remains useful. regards, tom lane
On Mon, Jan 2, 2012 at 5:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
>> Does anyone have a better idea for reducing BufFreelistLock
>> contention? Something simple that will work for 9.2?
>
> Get rid of the freelist? Once shared buffers are full, it's just about
> useless anyway. But you'd need to think about the test cases that you
> pay attention to, as there might be scenarios where it remains useful.

Agree the freelist is mostly useless, but startup and dropping objects require it. When it's full it's just an integer > 0 test, which is cheap, and it's on the same cacheline as the nextVictimBuffer anyway, so we have to fetch it.

The clock sweep is where all the time goes, in its current form.

We have 2 problems:

1. StrategySyncStart() requests exclusive lock on BufFreelistLock, so bgwriter has to fight with backends to find out where it should clean. As a result it spends lots of time waiting and insufficient time cleaning.

2. When a backend can't find a free buffer, it spins for a long time while holding the lock. This makes the buffer strategy O(N) in its worst case, which slows everything down. Notably, while this is happening the bgwriter sits doing nothing at all, right at the moment when it is most needed.

The Clock algorithm is an approximation of an LRU, so is already suboptimal as a "perfect cache". Tweaking to avoid worst case behaviour makes sense. How much to tweak? Well,...

(1) is fixed by buffreelistlock_reduction.v1.patch
(2) is fixed by freelist_ok.v2.patch

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 2, 2012 at 2:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> Get rid of the freelist? Once shared buffers are full, it's just about >> useless anyway. But you'd need to think about the test cases that you >> pay attention to, as there might be scenarios where it remains useful. > > Agree freelist is mostly useless, but startup and dropping objects requires it. Not really. If someone drops an object, we must invalidate all the buffers immediately, but adding them to the free list is just an optimization to make sure we reuse the invalidated buffers in preference to evicting buffers that still have valid contents. But I suspect that in practice this is not a very important optimization, because (1) most people probably don't drop permanent tables or databases very often and (2) buffers that are being heavily used should have a positive reference count, in which case the clock sweep will pass over them and land on one of the newly-invalidated buffers anyway. > When its full its just an integer > 0 test, which is cheap and its on > the same cacheline as the nextVictimBuffer anyway, so we have to fetch > it. > > The clock sweep is where all the time goes, in its current form. ...but I agree with this. In its current form, the clock sweep has to acquire a spinlock for every buffer it touches. That's really expensive, and I think we need to either get rid of it altogether or at least find some way to move it into the background. Ideally, in the common case, a backend that wants to allocate a buffer shouldn't need to do more than acquire a spinlock, pop a buffer off some sort of linked list of allocation targets, and release the spinlock. Whatever other computation is needed should get pushed out of foreground processing. > We have 2 problems > > 1. StrategySyncStart() requests exclusive lock on BufFreelistLock, so > bgwriter has to fight with backends to find out where it should clean. 
> As a result it spends lots of time waiting and insufficient time > cleaning. I have trouble accepting that this is really the problem we should be solving. There's only one background writer process, and that process is waking up every 200ms by default. Many people probably tune that down quite a bit, but unless I'm quite mistaken, it should still be a drop in the bucket next to what the backends are doing. > 2. When a backend can't find a free buffer, it spins for a long time > while holding the lock. This makes the buffer strategy O(N) in its > worst case, which slows everything down. Notably, while this is > happening the bgwriter sits doing nothing at all, right at the moment > when it is most needed. The Clock algorithm is an approximation of an > LRU, so is already suboptimal as a "perfect cache". Tweaking to avoid > worst case behaviour makes sense. How much to tweak? Well,... I generally agree with this analysis, but I don't think the proposed patch is going to solve the problem. It may have some merit as a way of limiting the worst case behavior. For example, if every shared buffer has a reference count of 5, the first buffer allocation that misses is going to have a lot of work to do before it can actually come up with a victim. But I don't think it's going to provide good scaling in general. Even if the background writer only spins through, on average, ten or fifteen buffers before finding one to evict, that still means we're acquiring ten or fifteen spinlocks while holding BufFreelistLock. I don't currently have the measurements to prove that's too expensive, but I bet it is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 3, 2012 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> The clock sweep is where all the time goes, in its current form. > > ...but I agree with this. In its current form, the clock sweep has to > acquire a spinlock for every buffer it touches. That's really > expensive, and I think we need to either get rid of it altogether or > at least find some way to move it into the background. Ideally, in > the common case, a backend that wants to allocate a buffer shouldn't > need to do more than acquire a spinlock, pop a buffer off some sort of > linked list of allocation targets, and release the spinlock. Whatever > other computation is needed should get pushed out of foreground > processing. So you don't think a freelist is worth having, but you want a list of allocation targets. What is the practical difference? >> We have 2 problems >> >> 1. StrategySyncStart() requests exclusive lock on BufFreelistLock, so >> bgwriter has to fight with backends to find out where it should clean. >> As a result it spends lots of time waiting and insufficient time >> cleaning. > > I have trouble accepting that this is really the problem we should be > solving. There's only one background writer process, and that process > is waking up every 200ms by default. Many people probably tune that > down quite a bit, but unless I'm quite mistaken, it should still be a > drop in the bucket next to what the backends are doing. That doesn't follow. Forcing the bgwriter to wait makes no sense. Yes, most of the contention is caused by the other backends, but the bgwriter experiences that contention currently when it has no need to do so. Presumably if you did have an allocation list maintained in the background by the bgwriter you wouldn't expect the bgwriter to wait behind a lock for no reason when it could be doing useful work. >> 2. When a backend can't find a free buffer, it spins for a long time >> while holding the lock. 
This makes the buffer strategy O(N) in its >> worst case, which slows everything down. Notably, while this is >> happening the bgwriter sits doing nothing at all, right at the moment >> when it is most needed. The Clock algorithm is an approximation of an >> LRU, so is already suboptimal as a "perfect cache". Tweaking to avoid >> worst case behaviour makes sense. How much to tweak? Well,... > > I generally agree with this analysis, but I don't think the proposed > patch is going to solve the problem. It may have some merit as a way > of limiting the worst case behavior. For example, if every shared > buffer has a reference count of 5, the first buffer allocation that > misses is going to have a lot of work to do before it can actually > come up with a victim. But I don't think it's going to provide good > scaling in general. Even if the background writer only spins through, > on average, ten or fifteen buffers before finding one to evict, that > still means we're acquiring ten or fifteen spinlocks while holding > BufFreelistLock. I don't currently have the measurements to prove > that's too expensive, but I bet it is. I think it's worth reducing the cost of scanning, but that has little to do with solving the O(N) problem. I think we need both. I've left the way open for you to redesign freelist management in many ways. Please take the opportunity and go for it, though we must realise that major overhauls require significantly more testing to prove their worth. Reducing spinlocking alone sounds like a good way to proceed for this release. If you don't have time in 9.2, then these two small patches are worth having. The bgwriter locking patch needs less formal evidence to show its worth. We simply don't need to have a bgwriter that just sits waiting doing nothing. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 3, 2012 at 10:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On Tue, Jan 3, 2012 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> The clock sweep is where all the time goes, in its current form. >> >> ...but I agree with this. In its current form, the clock sweep has to >> acquire a spinlock for every buffer it touches. That's really >> expensive, and I think we need to either get rid of it altogether or >> at least find some way to move it into the background. Ideally, in >> the common case, a backend that wants to allocate a buffer shouldn't >> need to do more than acquire a spinlock, pop a buffer off some sort of >> linked list of allocation targets, and release the spinlock. Whatever >> other computation is needed should get pushed out of foreground >> processing. > > So you don't think a freelist is worth having, but you want a list of > allocation targets. > What is the practical difference? I think that our current freelist is practically useless, because it is almost always empty, and the cases where it's not empty (startup, and after a table or database drop) are so narrow that we don't really get any benefit out of having it. However, I'm not opposed to the idea of a freelist in general: I think that if we actually put in some effort to keep the freelist in a non-empty state it would help a lot, because backends would then have much less work to do at buffer allocation time. >> I have trouble accepting that this is really the problem we should be >> solving. There's only one background writer process, and that process >> is waking up every 200ms by default. Many people probably tune that >> down quite a bit, but unless I'm quite mistaken, it should still be a >> drop in the bucket next to what the backends are doing. > > That doesn't follow. Forcing the bgwriter to wait makes no sense. 
Yes, > most of the contention is caused by the other backends, but the > bgwriter experiences that contention currently when it has no need to > do so. Sure, but if that contention is a negligible fraction of the total and doesn't cost anything measurable, then it doesn't make sense to add code to eliminate it. There are all sorts of places in the system where we have contention at a level that doesn't affect performance. Many of those could probably be fixed by adding more complicated locking, but that would just complicate the code to no purpose. If there's a demonstrable performance benefit to this change then it's worth a harder look, but my gut feeling is that there won't be. >>> 2. When a backend can't find a free buffer, it spins for a long time >>> while holding the lock. This makes the buffer strategy O(N) in its >>> worst case, which slows everything down. Notably, while this is >>> happening the bgwriter sits doing nothing at all, right at the moment >>> when it is most needed. The Clock algorithm is an approximation of an >>> LRU, so is already suboptimal as a "perfect cache". Tweaking to avoid >>> worst case behaviour makes sense. How much to tweak? Well,... >> >> I generally agree with this analysis, but I don't think the proposed >> patch is going to solve the problem. It may have some merit as a way >> of limiting the worst case behavior. For example, if every shared >> buffer has a reference count of 5, the first buffer allocation that >> misses is going to have a lot of work to do before it can actually >> come up with a victim. But I don't think it's going to provide good >> scaling in general. Even if the background writer only spins through, >> on average, ten or fifteen buffers before finding one to evict, that >> still means we're acquiring ten or fifteen spinlocks while holding >> BufFreelistLock. I don't currently have the measurements to prove >> that's too expensive, but I bet it is. 
> > I think it's worth reducing the cost of scanning, but that has little > to do with solving the O(N) problem. I think we need both. Maybe. I would really like a unified solution to both problems, but I'd be willing to accept a solution for one of them if we're confident that we've actually solved it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jan 3, 2012, at 11:15 AM, Robert Haas wrote: >> So you don't think a freelist is worth having, but you want a list of >> allocation targets. >> What is the practical difference? > > I think that our current freelist is practically useless, because it > is almost always empty, and the cases where it's not empty (startup, > and after a table or database drop) are so narrow that we don't really > get any benefit out of having it. However, I'm not opposed to the > idea of a freelist in general: I think that if we actually put in some > effort to keep the freelist in a non-empty state it would help a lot, > because backends would then have much less work to do at buffer > allocation time. This is exactly what the FreeBSD VM system does (which is at least one of the places where the idea of a clock sweep for PG came from ages ago). There is a process that does nothing but attempt to keep X amount of memory on the free list, where it can immediately be grabbed by anything that needs memory. Pages on the freelist are guaranteed to be clean (as in not dirty), but not zero'd. In fact, IIRC if a page on the freelist gets referenced again it can be pulled back out of the free list and put back into an active state. The one downside I see to this is that we'd need some heuristic to determine how many buffers we want to keep on the freelist. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
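[Editor's illustration] The division of labor Jim describes - a background process keeping the freelist stocked so the foreground path is a bare pop - can be sketched like this. It is a hypothetical simplification, not PostgreSQL's actual freelist code: locking, victim selection, and write-back of dirty pages are all elided, and the names are invented.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 16
#define FREELIST_TARGET 4   /* the "X amount" heuristic Jim mentions */

/* Invented, simplified page descriptor. */
typedef struct Page
{
    int on_freelist;
    struct Page *next_free;
} Page;

static Page pool[POOL_SIZE];
static Page *freelist_head = NULL;
static int freelist_len = 0;

/* Background maintainer: push clean pages until the freelist holds
 * FREELIST_TARGET entries.  Victim selection here is a stub; the real
 * work (clock sweep, writing dirty pages) would happen in this
 * process, off the allocation fast path. */
static void
maintain_freelist(void)
{
    for (int i = 0; i < POOL_SIZE && freelist_len < FREELIST_TARGET; i++)
    {
        Page *p = &pool[i];

        if (!p->on_freelist)
        {
            p->on_freelist = 1;
            p->next_free = freelist_head;
            freelist_head = p;
            freelist_len++;
        }
    }
}

/* Foreground fast path: in Robert's sketch this is just a spinlock
 * acquire, a pop, and a release -- O(1) no matter how many buffers a
 * sweep would otherwise have had to scan. */
static Page *
allocate_page(void)
{
    Page *p = freelist_head;

    if (p == NULL)
        return NULL;            /* fall back to the slow path */
    freelist_head = p->next_free;
    p->on_freelist = 0;
    freelist_len--;
    return p;
}
```

A page "pulled back out of the free list" on re-reference, as in FreeBSD, would just be unlinked and have on_freelist cleared, which is cheap in this structure.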
On Tue, Jan 3, 2012 at 6:22 PM, Jim Nasby <jim@nasby.net> wrote: > On Jan 3, 2012, at 11:15 AM, Robert Haas wrote: >>> So you don't think a freelist is worth having, but you want a list of >>> allocation targets. >>> What is the practical difference? >> >> I think that our current freelist is practically useless, because it >> is almost always empty, and the cases where it's not empty (startup, >> and after a table or database drop) are so narrow that we don't really >> get any benefit out of having it. However, I'm not opposed to the >> idea of a freelist in general: I think that if we actually put in some >> effort to keep the freelist in a non-empty state it would help a lot, >> because backends would then have much less work to do at buffer >> allocation time. > > This is exactly what the FreeBSD VM system does (which is at least one of the places where the idea of a clock sweep for PG came from ages ago). There is a process that does nothing but attempt to keep X amount of memory on the free list, where it can immediately be grabbed by anything that needs memory. Pages on the freelist are guaranteed to be clean (as in not dirty), but not zero'd. In fact, IIRC if a page on the freelist gets referenced again it can be pulled back out of the free list and put back into an active state. > > The one downside I see to this is that we'd need some heuristic to determine how many buffers we want to keep on the freelist. Fortuitously, I believe the background writer already has most of the necessary logic: it attempts to predict how many buffers are about to be needed - I think based on a decaying average. Actually, I think that logic could use some improvement, because I believe I've heard Greg Smith comment that it's often necessary to tune bgwriter_delay downward. It'd be nice to make the delay adaptive somehow, to avoid the need for manual tuning (and unnecessary wake-ups when the system goes idle). But possibly the existing logic is good enough for a first cut. 
However, in the interest of full disclosure, I'll admit that I've done no testing in this area at all and am talking mostly out of my posterior. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
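[Editor's illustration] The decaying average Robert alludes to can be sketched as simple exponential smoothing. This is only in the spirit of what BgBufferSync() does - the real logic is more elaborate, and the function and parameter names here are invented. smoothing_samples plays the role of the decay horizon: larger values react more slowly to changes in demand.

```c
#include <assert.h>

/* Fold one interval's observed allocation count into a running
 * exponentially decaying estimate of buffers needed per interval. */
static double
update_smoothed_alloc(double smoothed, int recent_allocs,
                      int smoothing_samples)
{
    /* Each update moves the estimate 1/smoothing_samples of the way
     * toward the latest observation, so old intervals decay away
     * geometrically. */
    return smoothed + ((double) recent_allocs - smoothed) / smoothing_samples;
}
```

The background writer (or a freelist maintainer) would then aim to keep roughly that many clean buffers ahead of the backends each cycle, which is exactly the heuristic Jim said the freelist approach would need.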
On 01/03/2012 06:22 PM, Jim Nasby wrote: > On Jan 3, 2012, at 11:15 AM, Robert Haas wrote: > >> I think that our current freelist is practically useless, because it >> is almost always empty, and the cases where it's not empty (startup, >> and after a table or database drop) are so narrow that we don't really >> get any benefit out of having it. However, I'm not opposed to the >> idea of a freelist in general: I think that if we actually put in some >> effort to keep the freelist in a non-empty state it would help a lot, >> because backends would then have much less work to do at buffer >> allocation time. >> > This is exactly what the FreeBSD VM system does (which is at least one of the places where the idea of a clock sweep for PG came from ages ago). There is a process that does nothing but attempt to keep X amount of memory on the free list, where it can immediately be grabbed by anything that needs memory. Pages on the freelist are guaranteed to be clean (as in not dirty), but not zero'd. In fact, IIRC if a page on the freelist gets referenced again it can be pulled back out of the free list and put back into an active state. > > The one downside I see to this is that we'd need some heuristic to determine how many buffers we want to keep on the freelist. > http://wiki.postgresql.org/wiki/Todo#Background_Writer has "Consider adding buffers the background writer finds reusable to the free list" and "Automatically tune bgwriter_delay based on activity rather than using a fixed interval", which both point to my 8.3 musing and other suggestions starting at http://archives.postgresql.org/pgsql-hackers/2007-04/msg00781.php I could write both those in an afternoon. The auto-tuning stuff already in the background writer was originally intended to tackle this issue, but I dropped that in favor of shipping something simpler first. There's even a prototype somewhere on an old drive here. 
The first missing piece needed before this was useful was separating out the background writer and checkpointer processes. Once I realized the checkpoints were monopolizing so much time, especially when they hit bad states, it was obvious the writer couldn't be relied upon for this job. That's much better now since Simon's 806a2aee3791244bf0f916729bfdb5489936e068 "Split work of bgwriter between 2 processes: bgwriter and checkpointer", which just became available in November to build on. The second missing piece blocking this work in my mind was how exactly we're going to benchmark the result, mainly to prove it doesn't hurt some workloads. I haven't fully internalized the implications of Robert's upthread comments, in terms of being able to construct a benchmark stressing both the best and worst case situation here. That's really the hardest part of this whole thing, by a lot. Recent spending has brought me an 8 HyperThread core laptop that can also run DTrace, so I expect to have better visibility into this soon too. I think here in 2011 the idea of having a background writer process that could potentially occupy most of a whole core doing work so backends don't have to is an increasingly attractive one. So long as that comes along with an auto-tuning delay, it shouldn't hurt the power management work either. Might even help really, by allowing larger values of bgwriter_delay than you'd want to use during busy periods. I was planning to mimic the sort of fast attack/slow decay logic already used for the auto-tuned timing, so that you won't fall behind by more than one bgwriter_delay worth of activity. Then it should realize a burst is here and the writer has to start moving faster. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
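[Editor's illustration] Greg's fast attack/slow decay idea might look roughly like this - the constants and names are invented for illustration, not taken from the bgwriter. A burst of allocations snaps the delay to its minimum immediately (so the writer never lags a burst by more than one interval), while idleness only creeps the delay back up toward the 200ms default, saving wake-ups on a quiet system.

```c
#include <assert.h>

#define MIN_DELAY_MS 10     /* busy: wake up often */
#define MAX_DELAY_MS 200    /* idle: the default bgwriter_delay */

/* Compute the next sleep interval from the current one and the number
 * of buffer allocations seen during the last cycle. */
static int
next_bgwriter_delay(int current_delay_ms, int buffers_allocated)
{
    if (buffers_allocated > 0)
        return MIN_DELAY_MS;                    /* fast attack */

    /* Slow decay: lengthen the sleep by 25% per idle cycle, capped. */
    current_delay_ms += current_delay_ms / 4;
    return current_delay_ms > MAX_DELAY_MS ? MAX_DELAY_MS
                                           : current_delay_ms;
}
```

One activity spike drops an idle 200ms writer straight to 10ms; it then takes around fifteen consecutive idle cycles to drift back to 200ms, which is the asymmetry the fast attack/slow decay phrase describes.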
On 03.01.2012 17:56, Simon Riggs wrote: > On Tue, Jan 3, 2012 at 3:18 PM, Robert Haas<robertmhaas@gmail.com> wrote: > >>> 2. When a backend can't find a free buffer, it spins for a long time >>> while holding the lock. This makes the buffer strategy O(N) in its >>> worst case, which slows everything down. Notably, while this is >>> happening the bgwriter sits doing nothing at all, right at the moment >>> when it is most needed. The Clock algorithm is an approximation of an >>> LRU, so is already suboptimal as a "perfect cache". Tweaking to avoid >>> worst case behaviour makes sense. How much to tweak? Well,... >> >> I generally agree with this analysis, but I don't think the proposed >> patch is going to solve the problem. It may have some merit as a way >> of limiting the worst case behavior. For example, if every shared >> buffer has a reference count of 5, the first buffer allocation that >> misses is going to have a lot of work to do before it can actually >> come up with a victim. But I don't think it's going to provide good >> scaling in general. Even if the background writer only spins through, >> on average, ten or fifteen buffers before finding one to evict, that >> still means we're acquiring ten or fifteen spinlocks while holding >> BufFreelistLock. I don't currently have the measurements to prove >> that's too expensive, but I bet it is. > > I think its worth reducing the cost of scanning, but that has little > to do with solving the O(N) problem. I think we need both. > > I've left the way open for you to redesign freelist management in many > ways. Please take the opportunity and go for it, though we must > realise that major overhauls require significantly more testing to > prove their worth. Reducing spinlocking only sounds like a good way to > proceed for this release. > > If you don't have time in 9.2, then these two small patches are worth > having. The bgwriter locking patch needs less formal evidence to show > its worth. 
We simply don't need to have a bgwriter that just sits > waiting doing nothing. I'd like to see some benchmarks that show a benefit from these patches, before committing something like this that complicates the code. These patches are fairly small, but nevertheless. Once we have a test case, we can argue whether the benefit we're seeing is worth the extra code, or if there's some better way to achieve it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, Jan 20, 2012 at 9:29 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > I'd like to see some benchmarks that show a benefit from these patches, > before committing something like this that complicates the code. These > patches are fairly small, but nevertheless. Once we have a test case, we can > argue whether the benefit we're seeing is worth the extra code, or if > there's some better way to achieve it. Agreed. I don't really like either of these patches stylistically: when you start sometimes locking things and sometimes not locking things, it gets hard to reason about how the code will behave, and you limit what you can do in the future inside the critical section because you have to keep in mind that you don't really have full mutual exclusion. There are places where it may be worth making that trade-off if we can show a large performance benefit, but I am wary of quick hacks that may get in the way of more complete solutions we want to implement later, especially if the performance benefit is marginal. I'm in the process of trying to pull together benchmark numbers for the various performance-related patches that are floating around this CommitFest, but a big problem is that many of them need to be tested in different ways, which are frequently not explicitly laid out in the emails in which they are submitted. Some of them are calculated to improve latency, and others throughput. Some require pgbench run with a small scale factor, some require pgbench with a large scale factor, and some need some completely other type of testing altogether. Some need short tests, others need long tests, still others require testing with very specific parameters, and some don't really need more testing so much as they need profiling. Many (like the double-write patch) need more thought about worst case scenarios, like a sequential scan on a large, unhinted table. 
Some only really need testing in one or two scenarios, but others could affect a broad variety of workloads and really ought to be tested in many different ways to fully understand the behavior. Some of them may be interdependent in complex ways. My current belief, based on the testing I did before Christmas and my gut feelings about the remaining patches, is that the patches which have the best chance of making a major difference on general workloads are your XLOG insert scaling patch and the group commit patch. Those are, in fact, virtually the only ones that have had more than zero test results posted to the list so far, and those test results are extremely promising. The various checkpoint-related patches also seem somewhat promising to me, despite the lack (so far) of any test results. Everything else strikes me as likely to make a difference that is either very small or only applicable in a fairly circumscribed set of cases. This preliminary opinion is subject to change as more evidence comes in, of course. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company