Re: [HACKERS] Clock with Adaptive Replacement - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: [HACKERS] Clock with Adaptive Replacement
Msg-id: CA+TgmoafeZ0FvnGB7QOK3TDkkQWwpJY7PDpbc54VaDfjX0x1gQ@mail.gmail.com
In response to: Re: [HACKERS] Clock with Adaptive Replacement (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Tue, May 1, 2018 at 6:37 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> This seems to be an old idea.

I'm not too surprised ... this area has been well-studied.

> I've always had a problem with the 8GB/16GB upper limit on the size of
> shared_buffers. Not because it's wrong, but because it's not something
> that has ever been explained. I strongly suspect that it has something
> to do with usage_count saturation, since it isn't reproducible with
> any synthetic workload that I'm aware of. Quite possibly because there
> are few bursty benchmarks.

I've seen customers have very good luck going higher if it lets all the
data fit in shared_buffers, or at least all the data that is accessed
with any frequency. I think it's useful to imagine a series of
concentric working sets -- maybe you have 1GB of the hottest data, 3GB
of data that is at least fairly hot, 10GB of data that is at least
somewhat hot, and another 200GB of basically cold data. Increasing
shared_buffers in a way that doesn't let the next "ring" fit in
shared_buffers isn't likely to help very much. If you have 8GB of
shared_buffers on this workload, going to 12GB is probably going to
help -- that should be enough for the 10GB of somewhat-hot stuff plus a
little extra, so that the somewhat-hot stuff doesn't immediately start
getting evicted when some of the cold data is accessed. Similarly,
going from 2GB to 4GB should be a big help, because now the fairly-hot
stuff should stay in cache. But going from 4GB to 6GB, or from 12GB to
16GB, may not do very much. It may even hurt, because the duplication
between shared_buffers and the OS page cache means an overall reduction
in available cache space. If, for example, you've got 16GB of memory
and shared_buffers=2GB, you *may* be fitting all of the somewhat-hot
data into cache someplace; bumping shared_buffers to 4GB almost
certainly means that will no longer happen, causing performance to
tank.

I don't really think that the 8GB rule of thumb originates in any
technical limitation of PostgreSQL or Linux. First, it's just a rule of
thumb -- the best value in a given installation can easily be something
completely different. Second, to the extent that it is a useful rule of
thumb, I think it's really a guess about what people's working sets
look like: going from 4GB to 8GB, say, significantly increases the
chances of fitting the next-larger, next-cooler working set entirely in
shared_buffers; going from 8GB to 16GB is less likely to accomplish
this; and going from 16GB to 32GB probably won't. To a lesser extent,
it reflects the point where scanning shared buffers to process relation
drops gets painful, and the point where an immediate checkpoint
suddenly dumping that much data out to the OS all at once starts to
overwhelm the I/O subsystem for a significant period of time. But I
think those really are lesser effects. My guess is that the big effect
is balancing increased hit ratio vs. increased double buffering.
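To put rough numbers on that double-buffering trade-off, here is a
crude back-of-the-envelope sketch. It only reuses the made-up
working-set sizes from the previous paragraph; the 3GB of non-cache
overhead and the worst case, in which every page in shared_buffers is
also sitting in the OS page cache, are guesses on my part, not
measurements:

/*
 * Crude back-of-the-envelope model of double buffering on a hypothetical
 * 16GB box, using the made-up working-set sizes from the text (a 10GB
 * "somewhat hot" ring).  The 3GB overhead figure and the worst case, in
 * which every page in shared_buffers is also sitting in the OS page
 * cache, are assumptions for illustration only.
 */
#include <stdio.h>

static double
min2(double a, double b)
{
    return a < b ? a : b;
}

int
main(void)
{
    const double ram_gb = 16.0;
    const double overhead_gb = 3.0;   /* kernel, backend memory, etc.: a guess */
    const double ring_gb = 10.0;      /* the somewhat-hot working set */
    const double settings[] = {2.0, 4.0, 8.0, 12.0};

    for (int i = 0; i < 4; i++)
    {
        double  sb = settings[i];
        double  os_cache = ram_gb - overhead_gb - sb;
        double  best = sb + os_cache;                       /* no page cached twice */
        double  worst = sb + os_cache - min2(sb, os_cache); /* maximum duplication */

        printf("shared_buffers=%2.0fGB: %4.1f-%4.1fGB of distinct data cacheable; "
               "the %.0fGB ring %s\n",
               sb, worst, best, ring_gb,
               worst >= ring_gb ? "fits even with heavy double buffering" :
               best >= ring_gb ? "fits only if double buffering stays low" :
               "does not fit");
    }
    return 0;
}

Under those assumptions the toy model lines up with the description
above: the 2GB and 12GB settings cover the 10GB ring even with maximal
duplication, while 4GB and 8GB cover it only if double buffering stays
low.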
> I agree that wall-clock time is a bad approach, actually. If we were
> to use wall-clock time, we'd only do so because it can be used to
> discriminate against a thing that we actually care about in an
> approximate, indirect way. If there is a more direct way of
> identifying correlated accesses, which I gather that there is, then we
> should probably use it.

For a start, I think it would be cool if somebody just gathered traces
for some simple cases. For example, consider a pgbench transaction: if
somebody produced a trace showing the buffer lookups in order,
annotated as heap, index leaf, index root, VM page, FSM root page, or
whatever, that would be a useful place to begin. Examining some other
simple, common cases would probably help us understand whether it's
normal to bump the usage count more than once per buffer for a single
scan, and if so, exactly why that happens. If the code knows that it's
accessing the same buffer a second (or subsequent) time, it could pass
down a flag saying not to bump the usage count again.
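To make that last idea concrete, here is a toy sketch of what such a
flag might look like. To be clear, this is not bufmgr code: the
structure and function names below are invented purely for
illustration, and the real change would have to thread the flag through
the actual buffer-access paths.

/*
 * Toy sketch of the "don't bump the usage count twice in one scan" idea.
 * This is NOT PostgreSQL's bufmgr code: ToyBuffer, toy_pin_buffer and the
 * rest are invented for illustration.  The caller remembers which block it
 * touched last and passes bump_usage = false when it comes back to it.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_USAGE_COUNT 5

typedef struct ToyBuffer
{
    int     blkno;          /* which block is cached here */
    int     usage_count;    /* clock-sweep weight */
    int     pin_count;      /* how many callers currently hold the buffer */
} ToyBuffer;

static void
toy_pin_buffer(ToyBuffer *buf, bool bump_usage)
{
    buf->pin_count++;
    if (bump_usage && buf->usage_count < MAX_USAGE_COUNT)
        buf->usage_count++;
}

static void
toy_unpin_buffer(ToyBuffer *buf)
{
    buf->pin_count--;
}

int
main(void)
{
    ToyBuffer   root = {0, 0, 0};   /* pretend this caches an index root page */
    int         last_blkno = -1;

    /* One statement that descends from the same index root three times. */
    for (int probe = 0; probe < 3; probe++)
    {
        bool    first_touch = (root.blkno != last_blkno);

        toy_pin_buffer(&root, first_touch);   /* bump only on the first visit */
        last_blkno = root.blkno;
        toy_unpin_buffer(&root);
    }

    printf("usage_count after 3 probes: %d (it would be 3 without the flag)\n",
           root.usage_count);
    return 0;
}

The point is just that the caller, which already knows it is revisiting
the same block within one scan, is the natural place to decide whether
the usage count should move.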
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company