From: Jeff Janes
Subject: Re: Page replacement algorithm in buffer cache
Msg-id: CAMkU=1x6F9Syts67bH68iX3hOLkt+YR=ijq+M5qSCBQrbNCSiA@mail.gmail.com
In response to: Re: Page replacement algorithm in buffer cache (Atri Sharma <atri.jiit@gmail.com>)
Responses: Re: Page replacement algorithm in buffer cache (Atri Sharma <atri.jiit@gmail.com>)
List: pgsql-hackers
On Fri, Mar 22, 2013 at 4:06 AM, Atri Sharma <atri.jiit@gmail.com> wrote:

> Not yet, I figured this might be a problem and am designing test cases
> for the same. I would be glad for some help there please.

Perhaps this isn't the help you were looking for, but I spent a long time looking into this a few years ago.  Then I stopped and decided to work on other things.  I would recommend you do so too.

If I have to struggle to come up with an artificial test case that shows that there is a problem, then why should I believe that there actually is a problem?  If you take a well-known problem (like, say, bad performance at shared_buffers > 8GB (or even lower, on Windows)) and create an artificial test case to exercise and investigate that, that is one thing.  But why invent pathological test cases with no known correspondence to reality?  There are plenty of real problems to work on, and some of them are just as intellectually interesting as the artificial problems are.

My conclusions were:

1) If everything fits in shared_buffers, then the replacement policy doesn't matter.

2) If shared_buffers is much smaller than RAM (the most common case, I believe), then what mostly matters is your OS's replacement policy, not pgsql's.  Not much a pgsql hacker can do about this, other than turn into a kernel hacker.

3) If little of the highly-used data fits in RAM, then any non-absurd replacement policy is about as good as any other non-absurd one.

4) If most, but not quite all, of the highly-used data fits in shared_buffers, and shared_buffers takes most of RAM (or at least, most of the RAM not already needed for other things like work_mem and executables), then the replacement policy matters a lot.  But different policies suit different workloads, and there is little reason to think we can come up with a way to choose between them.  (Also, in these conditions, performance is very chaotic: you can run the same algorithm for a long time, and it can suddenly switch from good to bad or the other way around, and then stay in that new mode for a long time.)  Also, even if you come up with a good algorithm, if you make the data set 20% smaller or 20% larger, it is no longer a good algorithm.

5) Having buffers enter with usage_count=0 rather than 1 (see the toy sketch below) would probably be slightly better most of the time under the conditions described in 4, but there is no way to get enough evidence of this over enough conditions to justify making a change.  And besides, how often do people run with shared_buffers being most of RAM, and the hot data just barely not fitting in it?
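(For anyone who hasn't stared at the clock sweep recently: here is a toy, self-contained simulation of the usage_count mechanics, not the actual code from src/backend/storage/buffer/freelist.c.  The buffer count, the access pattern, and names like ENTRY_USAGE_COUNT are made up for illustration.  Flipping ENTRY_USAGE_COUNT from 1 to 0 models the alternative in point 5: a page that is never touched again gets evicted the first time the hand reaches it, instead of surviving one full revolution of the clock.)

/* Toy clock-sweep simulation; NOT PostgreSQL code. */
#include <stdio.h>

#define NBUFFERS 8
#define MAX_USAGE_COUNT 5     /* PostgreSQL's BM_MAX_USAGE_COUNT is also 5 */
#define ENTRY_USAGE_COUNT 1   /* set to 0 to model the alternative in (5) */

static int usage_count[NBUFFERS];
static int page_in_buf[NBUFFERS]; /* which page each buffer holds; -1 = empty */
static int clock_hand = 0;

/* Sweep until we find a victim with usage_count == 0, decrementing the
 * counts of buffers we pass, as the real StrategyGetBuffer() does. */
static int
find_victim(void)
{
    for (;;)
    {
        int b = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (usage_count[b] == 0)
            return b;
        usage_count[b]--;     /* earn one more trip around the clock */
    }
}

/* Access a page: bump its count on a hit, or evict a victim on a miss. */
static void
access_page(int page)
{
    int b, victim;

    for (b = 0; b < NBUFFERS; b++)
    {
        if (page_in_buf[b] == page)
        {
            if (usage_count[b] < MAX_USAGE_COUNT)
                usage_count[b]++;
            printf("hit  page %d in buffer %d (usage_count=%d)\n",
                   page, b, usage_count[b]);
            return;
        }
    }
    victim = find_victim();
    if (page_in_buf[victim] >= 0)
        printf("miss page %d: evict page %d from buffer %d\n",
               page, page_in_buf[victim], victim);
    else
        printf("miss page %d: take empty buffer %d\n", page, victim);
    page_in_buf[victim] = page;
    usage_count[victim] = ENTRY_USAGE_COUNT;
}

int
main(void)
{
    int b, i;

    for (b = 0; b < NBUFFERS; b++)
        page_in_buf[b] = -1;
    /* One hot page (0) re-read while scanning cold pages 1..11. */
    for (i = 1; i <= 11; i++)
    {
        access_page(0);
        access_page(i);
    }
    return 0;
}

Run it both ways and watch whether the cold scan pages manage to push anything hot out before they themselves are evicted.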


If you want some known problems that are in this general area, we have:

1) If all data fits in RAM but not shared_buffers, and you have a very large number of CPUs and a read-only or read-mostly workload, then BufFreelistLock can be a major bottleneck.  (But on an Amazon high-CPU instance, I did not see this very much.  I suspect the degree of the problem depends a lot on whether you have a lot of sockets with a few CPUs each, versus one chip with many CPUs.)  This is very easy to come up with model cases for; pgbench -S -c8 -j8, for example, can often show it.  (A toy illustration of the serialization point follows at the end of this list.)

2) A major reason that people run with shared_buffers much lower than RAM is that performance seems to suffer with shared_buffers > 8GB under write-heavy workloads, even with spread-out checkpoints.  This is frequently reported as a real-world problem, but as far as I know it has never been reduced to a simple reproducible test case.  (Although there was a recent thread, maybe "High CPU usage / load average after upgrading to Ubuntu 12.04", that I thought might be relevant to this.  I haven't had the time to seriously study the thread, or the hardware to investigate it myself.)
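(To make the BufFreelistLock point in 1 concrete, here is a toy pthreads sketch, again not PostgreSQL code: every "backend" that wants a victim buffer must take one global lock before advancing the shared clock hand, which is roughly the serialization StrategyGetBuffer() suffers under BufFreelistLock.  The thread and iteration counts are arbitrary; time it under varying NTHREADS and watch throughput flatten as threads pile up on the one lock.)

/* Toy single-lock contention sketch; NOT PostgreSQL code. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define NALLOCS  1000000

static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long clock_hand = 0;

static void *
backend(void *arg)
{
    int i;

    (void) arg;
    for (i = 0; i < NALLOCS; i++)
    {
        /* Like BufFreelistLock: only one "backend" may sweep at a time. */
        pthread_mutex_lock(&freelist_lock);
        clock_hand++;         /* stand-in for the sweep itself */
        pthread_mutex_unlock(&freelist_lock);
    }
    return NULL;
}

int
main(void)
{
    pthread_t threads[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, backend, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    printf("clock hand advanced %lu times\n", clock_hand);
    return 0;
}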

Cheers,

Jeff
