Thread: RC2 and open issues

RC2 and open issues

From

Bruce Momjian

Date:

21 December 2004, 02:12:38

We are now packaging RC2. If nothing comes up after RC2 is released, we
can move to final release.

The open items list is attached. The doc changes can be easily
completed before final. The only code issue left is with bgwriter. We
always knew we needed to find better defaults for its parameters, but we
are only now finding more fundamental issues.

I think the summary I have seen recently pegs it right --- our use of %
of dirty buffers requires a scan of the entire buffer cache, and the
current delay of bgwriter is too high, but we can't lower it because the
buffer cache scan will become too expensive if done too frequently.

I think the ideal solution would be to remove bgwriter_percent or change
it to be a percentage of all buffers, not just dirty buffers, so we
don't have to scan the entire list. If we set the new value to 10% with
a delay of 1 second, and the bgwriter remembers the place it stopped
scanning the buffer cache, you will clean out the buffer cache
completely every 10 seconds.
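
To make the arithmetic concrete, here is a toy Python sketch. It is not PostgreSQL code: `sweep_step` and the list-of-flags model of the buffer cache are made up for illustration. It assumes the bgwriter scans a fixed percentage of the buffer array per run and resumes where the previous run stopped:

```python
def sweep_step(dirty, start, percent):
    """Scan `percent` of the buffer array beginning at `start`.

    `dirty` is a list of booleans, one per buffer (hypothetical model).
    Dirty buffers found are marked clean, standing in for a page write.
    Returns (next_start, pages_written).
    """
    n = len(dirty)
    to_scan = max(1, n * percent // 100)
    written = 0
    pos = start
    for _ in range(to_scan):
        if dirty[pos]:
            dirty[pos] = False
            written += 1
        pos = (pos + 1) % n  # wrap around at the end of the array
    return pos, written


# 100 buffers, all dirty, 10% per run: one full pass takes ten runs,
# i.e. ten seconds at a delay of 1 second.
dirty = [True] * 100
pos, runs = 0, 0
while any(dirty):
    pos, _ = sweep_step(dirty, pos, 10)
    runs += 1
print(runs)  # prints 10
```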

Right now it seems no one can find proper values. We were clear that
this was an issue but it is bad news that we are only addressing it
during RC.

The 8.1 solution is to have some feedback system so writes by individual
backends cause the bgwriter to work more frequently.

The big question is what to do during RC2. Do we just leave it as
suboptimal, knowing we will revisit it in 8.1, or try an incremental
solution for 8.0 that might work better?

We have to decide now.

---------------------------------------------------------------------------
PostgreSQL 8.0 Open Items
=========================

Current version at http://candle.pha.pa.us/cgi-bin/pgopenitems.

Changes
-------
* change bgwriter buffer scan behavior?
* adjust bgwriter defaults

Documentation
-------------
* synchronize supported encodings and docs
* improve external interfaces documentation section
* manual pages

Fixed Since Last Beta
---------------------

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: RC2 and open issues

From

Tom Lane

Date:

21 December 2004, 02:35:50

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I think the ideal solution would be to remove bgwriter_percent or change
> it to be a percentage of all buffers, not just dirty buffers, so we
> don't have to scan the entire list.  If we set the new value to 10% with
> a delay of 1 second, and the bgwriter remembers the place it stopped
> scanning the buffer cache, you will clean out the buffer cache
> completely every 10 seconds.

But we don't *want* it to clean out the buffer cache completely.
There's no point in writing a "hot" page every few seconds.  So I don't
think I believe in remembering where we stopped anyway.

I think there's a reasonable case to be made for redefining
bgwriter_percent as the max percent of the total buffer list to scan
(not the max percent of the list to return --- Jan correctly pointed out
that the latter is useless).  Then we could modify
StrategyDirtyBufferList so that the percent and maxpages parameters are
passed in, so it can stop as soon as either one is satisfied.  This
would be a fairly small/safe code change and I wouldn't have a problem
doing it even at this late stage of the cycle.
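
A toy sketch of that stopping rule, in Python for brevity. This is not the real StrategyDirtyBufferList; the function name, its arguments, and the list-of-tuples model of the LRU list are assumptions made for illustration:

```python
def dirty_buffer_list(lru_list, percent, maxpages):
    """Collect dirty buffers from the LRU end of the list.

    `lru_list` is a list of (buffer_id, is_dirty) tuples ordered from the
    LRU end (hypothetical model).  The scan stops as soon as either
    `percent` of the whole list has been examined or `maxpages` dirty
    buffers have been collected.
    """
    scan_limit = len(lru_list) * percent // 100
    found = []
    for buf_id, is_dirty in lru_list[:scan_limit]:
        if len(found) >= maxpages:
            break
        if is_dirty:
            found.append(buf_id)
    return found


# 100 buffers, every second one dirty.
lru = [(i, i % 2 == 0) for i in range(100)]
print(len(dirty_buffer_list(lru, 50, 100)))  # prints 25
print(dirty_buffer_list(lru, 50, 5))         # prints [0, 2, 4, 6, 8]
```

The scan never looks past the first `percent` of the list, so its cost is bounded no matter how few dirty pages it finds.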

However ... we would have to crank up the default bgwriter_percent,
and I don't know if we have any better idea what to set it to after
such a change than we do now ...
        regards, tom lane

Re: RC2 and open issues

From

Bruce Momjian

Date:

21 December 2004, 03:47:40

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I think the ideal solution would be to remove bgwriter_percent or change
> > it to be a percentage of all buffers, not just dirty buffers, so we
> > don't have to scan the entire list.  If we set the new value to 10% with
> > a delay of 1 second, and the bgwriter remembers the place it stopped
> > scanning the buffer cache, you will clean out the buffer cache
> > completely every 10 seconds.
> 
> But we don't *want* it to clean out the buffer cache completely.

You are only cleaning it out in pieces over a 10 second period, so it is
getting dirty again in the meantime.  You are not scanning the entire
buffer cache at one time.

> There's no point in writing a "hot" page every few seconds.  So I don't
> think I believe in remembering where we stopped anyway.

I was thinking if you are doing this scanning every X milliseconds then
after a while the front of the buffer cache will be mostly clean and the
end will be dirty so you will always be going over the same early ones
to get to the later dirty ones.  Remembering the location gives the scan
more uniform coverage of the buffer cache.

You need a "clock sweep" like BSD uses (and probably others).

> I think there's a reasonable case to be made for redefining
> bgwriter_percent as the max percent of the total buffer list to scan
> (not the max percent of the list to return --- Jan correctly pointed out
> that the latter is useless).  Then we could modify
> StrategyDirtyBufferList so that the percent and maxpages parameters are
> passed in, so it can stop as soon as either one is satisfied.  This
> would be a fairly small/safe code change and I wouldn't have a problem
> doing it even at this late stage of the cycle.
> 
> However ... we would have to crank up the default bgwriter_percent,
> and I don't know if we have any better idea what to set it to after
> such a change than we do now ...

Once we make the change we will have to get our testers working on it.
We need those figures to change over time based on backends doing writes,
but that isn't going to happen for 8.0.


Re: RC2 and open issues

From

Tom Lane

Date:

21 December 2004, 04:02:18

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> You need a "clock sweep" like BSD uses (and probably others).

No, that's *fundamentally* wrong.

The reason we are going to the trouble of maintaining a complicated
cache algorithm like ARC is so that we can tell the heavily used pages
from the lesser used ones.  To throw away that knowledge in favor of
doing I/O with a plain clock sweep algorithm is just wrong.

What's more, I don't even understand what clock sweep would mean given
that the ordering of the list is constantly changing.
        regards, tom lane

Re: RC2 and open issues

From

Tom Lane

Date:

21 December 2004, 04:10:26

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I am confused.  If we change the percentage to be X% of the entire
> buffer cache, and we set it to 1%, and we exit when either the dirty
> pages or % are reached, don't we end up just scanning the first 1% of
> the cache over and over again?

Exactly.  But 1% would be uselessly small with this definition.  Offhand
I'd think something like 50% might be a starting point; maybe even more.
What that says is that a page isn't a candidate to be written out by the
bgwriter until it's fallen halfway down the LRU list.
        regards, tom lane

Re: RC2 and open issues

From

Bruce Momjian

Date:

21 December 2004, 04:11:38

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > I am confused.  If we change the percentage to be X% of the entire
> > buffer cache, and we set it to 1%, and we exit when either the dirty
> > pages or % are reached, don't we end up just scanning the first 1% of
> > the cache over and over again?
> 
> Exactly.  But 1% would be uselessly small with this definition.  Offhand
> I'd think something like 50% might be a starting point; maybe even more.
> What that says is that a page isn't a candidate to be written out by the
> bgwriter until it's fallen halfway down the LRU list.

So we are not scanning by buffer address but using the LRU list?  Are we
sure they are mostly dirty?


Re: RC2 and open issues

From

Tom Lane

Date:

21 December 2004, 04:21:15

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> Exactly.  But 1% would be uselessly small with this definition.  Offhand
>> I'd think something like 50% might be a starting point; maybe even more.
>> What that says is that a page isn't a candidate to be written out by the
>> bgwriter until it's fallen halfway down the LRU list.

> So we are not scanning by buffer address but using the LRU list?  Are we
> sure they are mostly dirty?

No.  The entire point is to keep the LRU end of the list mostly clean.

Now that you mention it, it might be interesting to try the approach of
doing a clock scan on the buffer array and ignoring the ARC lists
entirely.  That would be a fundamentally different way of envisioning
what the bgwriter is supposed to do, though.  I think the main reason
Jan didn't try that was he wanted to be sure the LRU page was usually
clean so that backends would seldom end up doing writes for themselves
when they needed to get a free buffer.

Maybe we need a hybrid approach: clean a few percent of the LRU end of
the ARC list in order to keep backends from blocking on writes, plus run
a clock scan to keep checkpoints from having to do much.  But that's way
beyond what we have time for in the 8.0 cycle.
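
Purely as an illustration of the hybrid idea, here is a toy Python model. It is not a proposal for the actual code; `hybrid_round` and all of its parameters are hypothetical:

```python
def hybrid_round(dirty, lru_order, hand, lru_percent, sweep_pages):
    """One bgwriter round serving two goals (toy model).

    `dirty` is a list of booleans indexed by buffer id, `lru_order` lists
    buffer ids from the LRU end, and `hand` is the clock position in the
    buffer array.  Returns (new_hand, buffer_ids_written).
    """
    written = []

    # Goal 1: keep the LRU end clean so backends rarely have to write.
    tail = lru_order[: len(lru_order) * lru_percent // 100]
    for buf in tail:
        if dirty[buf]:
            dirty[buf] = False
            written.append(buf)

    # Goal 2: advance a clock scan over the buffer array so the next
    # checkpoint has less left to do.
    n = len(dirty)
    for _ in range(min(sweep_pages, n)):
        if dirty[hand]:
            dirty[hand] = False
            written.append(hand)
        hand = (hand + 1) % n
    return hand, written


dirty = [True] * 10
lru_order = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
hand, written = hybrid_round(dirty, lru_order, 0, 20, 3)
print(hand, written)  # prints 3 [9, 8, 0, 1, 2]
```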
        regards, tom lane

Re: RC2 and open issues

From

Bruce Momjian

Date:

21 December 2004, 04:51:02

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> Exactly.  But 1% would be uselessly small with this definition.  Offhand
> >> I'd think something like 50% might be a starting point; maybe even more.
> >> What that says is that a page isn't a candidate to be written out by the
> >> bgwriter until it's fallen halfway down the LRU list.
> 
> > So we are not scanning by buffer address but using the LRU list?  Are we
> > sure they are mostly dirty?
> 
> No.  The entire point is to keep the LRU end of the list mostly clean.
> 
> Now that you mention it, it might be interesting to try the approach of
> doing a clock scan on the buffer array and ignoring the ARC lists
> entirely.  That would be a fundamentally different way of envisioning
> what the bgwriter is supposed to do, though.  I think the main reason
> Jan didn't try that was he wanted to be sure the LRU page was usually
> clean so that backends would seldom end up doing writes for themselves
> when they needed to get a free buffer.
> 
> Maybe we need a hybrid approach: clean a few percent of the LRU end of
> the ARC list in order to keep backends from blocking on writes, plus run
> a clock scan to keep checkpoints from having to do much.  But that's way
> beyond what we have time for in the 8.0 cycle.

OK, so we scan from the end of the LRU.  If we scan X% and find _no_
dirty buffers perhaps we should start where we left off last time.

If we don't start where we left off, I am thinking if you do a lot of
writes then do nothing, the next checkpoint would be huge because a lot
of the LRU will be dirty because the bgwriter never got to it.


Re: RC2 and open issues

From

Gavin Sherry

Date:

21 December 2004, 05:12:19

On Mon, 20 Dec 2004, Tom Lane wrote:

> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> Exactly.  But 1% would be uselessly small with this definition.  Offhand
> >> I'd think something like 50% might be a starting point; maybe even more.
> >> What that says is that a page isn't a candidate to be written out by the
> >> bgwriter until it's fallen halfway down the LRU list.
>
> > So we are not scanning by buffer address but using the LRU list?  Are we
> > sure they are mostly dirty?
>
> No.  The entire point is to keep the LRU end of the list mostly clean.
>
> Now that you mention it, it might be interesting to try the approach of
> doing a clock scan on the buffer array and ignoring the ARC lists
> entirely.  That would be a fundamentally different way of envisioning
> what the bgwriter is supposed to do, though.  I think the main reason
> Jan didn't try that was he wanted to be sure the LRU page was usually
> clean so that backends would seldom end up doing writes for themselves
> when they needed to get a free buffer.

Neil and I spoke with Jan briefly last week and he mentioned a few
different approaches he'd been tossing over. Firstly, for alternative
runs, start X% on from the LRU, so that we aren't scanning clean buffers
all the time. Secondly, follow something like the approach you've
mentioned above but remember the offset. So, if we're scanning 10%, after
10 runs we will have written out all buffers.

I was also thinking of benchmarking the effect of changing the algorithm
in StrategyDirtyBufferList(): currently, for each iteration of the loop we
read a buffer from each of T1 and T2. I was wondering what effect reading
T1 first then T2 and vice versa would have on performance. I haven't
thought about this too hard, though, so it might be wrong headed.
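
To show what the two orderings mean, here is a toy Python sketch. It does not reproduce the real StrategyDirtyBufferList loop; T1 and T2 are modelled as plain lists ordered from the LRU end, and both function names are invented:

```python
from itertools import chain, zip_longest


def interleaved(t1, t2):
    """Visit one buffer from T1, then one from T2, per loop iteration."""
    skip = object()  # marker for the shorter list running out
    return [
        buf
        for pair in zip_longest(t1, t2, fillvalue=skip)
        for buf in pair
        if buf is not skip
    ]


def sequential(first, second):
    """Visit all of one list before touching the other."""
    return list(chain(first, second))


t1 = ["a1", "a2", "a3"]
t2 = ["b1", "b2"]
print(interleaved(t1, t2))  # prints ['a1', 'b1', 'a2', 'b2', 'a3']
print(sequential(t1, t2))   # prints ['a1', 'a2', 'a3', 'b1', 'b2']
print(sequential(t2, t1))   # prints ['b1', 'b2', 'a1', 'a2', 'a3']
```

With a cap on the number of pages written per run, the ordering decides which list's buffers get cleaned first, which is what a benchmark would be measuring.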

>
> Maybe we need a hybrid approach: clean a few percent of the LRU end of
> the ARC list in order to keep backends from blocking on writes, plus run
> a clock scan to keep checkpoints from having to do much.  But that's way
> beyond what we have time for in the 8.0 cycle.

Definitely.

>
>             regards, tom lane

Thanks,

Gavin

Re: RC2 and open issues

From

Bruce Momjian

Date:

21 December 2004, 05:26:10

Gavin Sherry wrote:
> Neil and I spoke with Jan briefly last week and he mentioned a few
> different approaches he'd been tossing over. Firstly, for alternative
> runs, start X% on from the LRU, so that we aren't scanning clean buffers
> all the time. Secondly, follow something like the approach you've
> mentioned above but remember the offset. So, if we're scanning 10%, after
> 10 runs we will have written out all buffers.
> 
> I was also thinking of benchmarking the effect of changing the algorithm
> in StrategyDirtyBufferList(): currently, for each iteration of the loop we
> read a buffer from each of T1 and T2. I was wondering what effect reading
> T1 first then T2 and vice versa would have on performance. I haven't
> thought about this too hard, though, so it might be wrong headed.

So we are all thinking in the same direction.  We might have only a few
days to finalize this before final release.


Re: RC2 and open issues

From

Tom Lane

Date:

21 December 2004, 06:33:12

Gavin Sherry <swm@linuxworld.com.au> writes:
> I was also thinking of benchmarking the effect of changing the algorithm
> in StrategyDirtyBufferList(): currently, for each iteration of the loop we
> read a buffer from each of T1 and T2. I was wondering what effect reading
> T1 first then T2 and vice versa would have on performance.

Looking at StrategyGetBuffer, it definitely seems like a good idea to
try to keep the bottom end of both T1 and T2 lists clean.  But we should
work at T1 a bit harder.

The insight I take away from today's discussion is that there are two
separate goals here: try to keep backends that acquire a buffer via
StrategyGetBuffer from being fed a dirty buffer they have to write,
and try to keep the next upcoming checkpoint from having too much work
to do.  Those are both laudable goals but I hadn't really seen before
that they may require different strategies to achieve.  I'm liking the
idea that bgwriter should alternate between doing writes in pursuit of
the one goal and doing writes in pursuit of the other.
        regards, tom lane

Re: RC2 and open issues

From

"Zeugswetter Andreas DAZ SD"

Date:

21 December 2004, 11:19:38

> If we don't start where we left off, I am thinking if you do a lot of
> writes then do nothing, the next checkpoint would be huge because a lot
> of the LRU will be dirty because the bgwriter never got to it.

I think the problem is that we don't see whether a "read hot"
page is also "write hot". We would want to write dirty "read hot" pages,
but not "write hot" pages. It does not make sense to write a "write hot"
page since it will be dirty again when the checkpoint comes.

Andreas

Bgwriter behavior

From

Bruce Momjian

Date:

21 December 2004, 15:25:06

Tom Lane wrote:
> Gavin Sherry <swm@linuxworld.com.au> writes:
> > I was also thinking of benchmarking the effect of changing the algorithm
> > in StrategyDirtyBufferList(): currently, for each iteration of the loop we
> > read a buffer from each of T1 and T2. I was wondering what effect reading
> > T1 first then T2 and vice versa would have on performance.
> 
> Looking at StrategyGetBuffer, it definitely seems like a good idea to
> try to keep the bottom end of both T1 and T2 lists clean.  But we should
> work at T1 a bit harder.
> 
> The insight I take away from today's discussion is that there are two
> separate goals here: try to keep backends that acquire a buffer via
> StrategyGetBuffer from being fed a dirty buffer they have to write,
> and try to keep the next upcoming checkpoint from having too much work
> to do.  Those are both laudable goals but I hadn't really seen before
> that they may require different strategies to achieve.  I'm liking the
> idea that bgwriter should alternate between doing writes in pursuit of
> the one goal and doing writes in pursuit of the other.

It seems we have added a new limitation to bgwriter by not doing a full
scan.  With a full scan we could easily grab the first X pages starting
from the end of the LRU list and write them.  By not scanning the full
list we are opening the possibility of not seeing some of the front-most
LRU dirty pages.  And the full scan was removed so we can run bgwriter
more frequently, but we might end up with other problems.

I have a new proposal.  The idea is to cause bgwriter to increase its
frequency based on how quickly it finds dirty pages.

First, we remove the GUC bgwriter_maxpages because I don't see a good
way to set a default for that.  A default value needs to be based on a
percentage of the full buffer cache size.  Second, we make
bgwriter_percent cause the bgwriter to stop its scan once it has found a
number of dirty buffers that matches X% of the buffer cache size.  So,
if it is set to 5%, the bgwriter scan stops once it finds enough dirty
buffers to equal 5% of the buffer cache size. 

Bgwriter continues to scan starting from the end of the LRU list, just
like it does now.

Now, to control the bgwriter frequency we multiply the percent of the
list it had to scan by the bgwriter_delay value to determine when to run
bgwriter next.  For example, if you find enough dirty pages by looking
at only 10% of the buffer cache you multiply 10% (0.10) * bgwriter_delay
and that is when you run next.  If you have to scan 50%, bgwriter runs
next at 50% (0.50) * bgwriter_delay, and if it has to scan the entire
list it is 100% (1.00) * bgwriter_delay.

What this does is to cause bgwriter to run more frequently when there
are a lot of dirty buffers on the end of the LRU _and_ when the bgwriter
scan will be quick.  When there are few writes, bgwriter will run less
frequently but will write dirty buffers nearer to the head of the LRU.
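
A toy Python sketch of this proposal. The function names and the list-of-flags model of the LRU list are assumptions for illustration; only bgwriter_percent and bgwriter_delay come from the discussion:

```python
def scan_until_quota(lru_dirty_flags, bgwriter_percent):
    """Scan from the LRU end until enough dirty buffers are found.

    `lru_dirty_flags` is a list of booleans ordered from the LRU end
    (hypothetical model).  The quota is `bgwriter_percent` of the whole
    cache size.  Returns (dirty_found, buffers_scanned).
    """
    n = len(lru_dirty_flags)
    quota = max(1, n * bgwriter_percent // 100)
    found = 0
    scanned = 0
    for is_dirty in lru_dirty_flags:
        scanned += 1
        if is_dirty:
            found += 1
            if found >= quota:
                break
    return found, scanned


def next_delay_ms(scanned, total, bgwriter_delay_ms):
    """Scale the delay by the fraction of the list that had to be scanned."""
    return bgwriter_delay_ms * scanned // total


# Dirty pages concentrated at the LRU end: short scan, short delay.
busy = [True] * 5 + [False] * 95
found, scanned = scan_until_quota(busy, 5)
print(next_delay_ms(scanned, len(busy), 1000))  # prints 50

# No dirty pages at all: full scan, full delay.
idle = [False] * 100
found, scanned = scan_until_quota(idle, 5)
print(next_delay_ms(scanned, len(idle), 1000))  # prints 1000
```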

