Thread: Warm-up cache may have its virtue
Hinted by this thread: http://archives.postgresql.org/pgsql-performance/2006-01/msg00016.php I wonder if we should really implement the file-system-cache-warmup strategy which we have discussed before. There are two natural good places to do this:

(1) sequential scan
(2) bitmap index scan

We can consider (2) as a generalized version of (1). For (1), we have mentioned several heuristics, like keeping a scan interval to avoid competition. These strategies are also applicable to (2).

Question: why the file-system level, instead of the buffer pool level? For two reasons: (1) notice that in the above thread, the user just uses "shared_buffers = 8192", which suggests that the file-system level is already good enough; (2) it is easy to implement.

Use t*h*r*e*a*d? Well, I am a little bit afraid of mentioning this word. But we can have some dedicated backends to do this - like bgwriter. Let's dirty our hands!

Comments?

Regards,
Qingqing
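To make the idea concrete, here is a minimal sketch of what a file-system-level warm-up of one relation segment might look like, assuming POSIX posix_fadvise() is available; the function name warmup_segment is invented for illustration and is not existing backend code:

/*
 * Hypothetical sketch: ask the kernel to pull one relation segment
 * into the file-system cache.  POSIX_FADV_WILLNEED is only a hint,
 * so the kernel stays free to ignore it under memory pressure.
 */
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

static int
warmup_segment(const char *path)
{
    int         fd;
    struct stat st;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) == 0 && st.st_size > 0)
        (void) posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);
    close(fd);
    return 0;
}

A dedicated backend in the bgwriter style could walk the files of the hot relations and call something like this on each segment.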
Qingqing Zhou <zhouqq@cs.toronto.edu> writes: > Hinted by this thread: > http://archives.postgresql.org/pgsql-performance/2006-01/msg00016.php > I wonder if we should really implement file-system-cache-warmup strategy > which we have discussed before. The difference between the cached and non-cached states is that the kernel has seen fit to remove those pages from its cache. It is reasonable to suppose that it did so because there was a more immediate use for the memory. Trying to override that behavior will therefore result in de-optimizing the global performance of the machine. If the machine is actually dedicated to Postgres, I'd expect disk pages to stay in cache without our taking any heroic measures to keep them there. If they don't, that's a matter for kernel configuration tuning, not "warmup" processes. regards, tom lane
On Thu, 5 Jan 2006, Tom Lane wrote:
>
> The difference between the cached and non-cached states is that the
> kernel has seen fit to remove those pages from its cache.  It is
> reasonable to suppose that it did so because there was a more immediate
> use for the memory.  Trying to override that behavior will therefore
> result in de-optimizing the global performance of the machine.
>

Yeah, so in other words, warming up the cache is just a waste of time if the pages are already in the OS cache. I agree with this. But does this mean it may be worth experimenting with another strategy: a big-stomach Postgres, i.e., one with a big shared_buffers value? With this strategy, (1) almost all the buffers are under our control, so we will know when a pre-read is needed; (2) we avoid double-buffering: though people are advised not to use a very big shared_buffers value, in practice I have seen people gain performance by increasing it to 200000 or more.

Feasibility: our bufmgr lock rewrite already makes this possible. But to enable it, we may need more work: (w1) make the buffer pool relation-wise, which makes our estimation of data page residence easier and more reliable; (w2) add aggressive pre-read at the buffer pool level. Another benefit of w1 is that the query planner can estimate query cost more precisely.

Regards,
Qingqing
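To make (w2) a bit more concrete, a rough sketch of buffer-pool-level pre-read for a sequential scan, assuming ReadBuffer()/ReleaseBuffer() keep roughly their current signatures; this is an untested illustration, not a patch:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

#define PREREAD_WINDOW  32      /* blocks to stay ahead of the scan */

/*
 * Hypothetical sketch: pull the next few blocks of a heap into shared
 * buffers ahead of the current scan position.  ReadBuffer() does the
 * actual read; ReleaseBuffer() drops our pin immediately so the page
 * just sits in the pool until the scan proper reaches it.
 */
static void
preread_blocks(Relation reln, BlockNumber next, BlockNumber nblocks)
{
    BlockNumber limit = next + PREREAD_WINDOW;
    BlockNumber blk;

    if (limit > nblocks)
        limit = nblocks;

    for (blk = next; blk < limit; blk++)
        ReleaseBuffer(ReadBuffer(reln, blk));
}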
On Thu, 5 Jan 2006, Qingqing Zhou wrote:
>
> Feasibility: our bufmgr lock rewrite already makes this possible. But to
> enable it, we may need more work: (w1) make the buffer pool relation-wise,
> which makes our estimation of data page residence easier and more reliable;
> (w2) add aggressive pre-read at the buffer pool level. Another benefit of
> w1 is that the query planner can estimate query cost more precisely.
>

"w1" is doable by introducing a shared-memory bitmap indicating which pages of a relation are in the buffer pool (we may want to add a hash to manage the relations). Theoretically, O(shared_buffers) bits are enough, so this will not use a lot of space.

When we maintain the SharedBufHash, we maintain this bitmap. When we do query cost estimation or preread, we just need a rough number, so this can be done by scanning the bitmap without a lock. Thus there is also almost no extra cost.

Regards,
Qingqing
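For concreteness, a minimal sketch of such a residency map; all names are invented, and a real version would live in shared memory and, as suggested above, hang off a per-relation hash entry:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED_BLOCKS  (1024 * 1024)       /* illustration only */
#define BITMAP_WORDS        (MAX_TRACKED_BLOCKS / 32)

typedef struct RelResidencyMap
{
    uint32_t    words[BITMAP_WORDS];    /* one bit per relation block */
} RelResidencyMap;

/* Called while SharedBufHash is being updated, i.e. under its lock. */
static void
residency_set(RelResidencyMap *map, uint32_t blkno, bool resident)
{
    uint32_t   *w = &map->words[blkno / 32];
    uint32_t    bit = (uint32_t) 1 << (blkno % 32);

    if (resident)
        *w |= bit;
    else
        *w &= ~bit;
}

/*
 * Unlocked, approximate reader: good enough for cost estimation or for
 * deciding which pages still need a pre-read, since only a rough number
 * is wanted and slightly stale bits are acceptable.
 */
static uint32_t
residency_rough_count(const RelResidencyMap *map)
{
    uint32_t    count = 0;
    size_t      i;

    for (i = 0; i < BITMAP_WORDS; i++)
    {
        uint32_t    w = map->words[i];

        while (w)
        {
            count += w & 1;
            w >>= 1;
        }
    }
    return count;
}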
"Qingqing Zhou" <zhouqq@cs.toronto.edu> wrote > >> Feasibility: Our bufmgr lock rewrite already makes this possible. But to >> enable it, we may need more work: (w1) make bufferpool relation-wise, >> which makes our estimation of data page residence more easy and reliable. >> (w2) add aggresive pre-read on buffer pool level. Also, another benefit >> of >> w1 will make our query planner can estimate query cost more precisely. > > "w1" is doable by introducing a shared-memory bitmap indicating which > pages of a relation are in buffer pool (We may want to add a hash to > manage the relations). Theoretically, O(shared_buffer) bits are enough. So > this will not use a lot of space. > > When we maintain the SharedBufHash, we maintain this bitmap. When we do > query cost estimation or preread, we just need a rough number, so this can > be done by scanning the bitmap without lock. Thus there is also almost no > extra cost. After some research, I come to the conclusion that the bitmap idea is bad - I hope I am wrong :-(. The benefits of adding a bitmap can enable us knowing current buffer residence: (b1) Plan stage: give a more accurate estimation of sequential scan; (b2) Execution stage: provide another way to let sequential scan/bitmap scan to identify the pages that need pre-read. For b1, it actually doesn't matter much though. With bitmap we definitely can give a better EXPLAIN numbers for seqscan only, but without the bitmap, we seldom make wrong choice of choosing/not choosing sequential scan. Another other cost estimation can get benefits? I am afraid no since before execution, we simply don't know what to read. For b2, the bitmap does provide another way without contenting the BufMappingLock to know the buffers we should preread, but since the contention of BufMappingLock is not intensive, this does marginal benefits. My previous estimation of the trouble/cost of maintaining this bitmap is too optimistic, for one thing, we need compress the bitmap since many of them are sparse. Different from uncompressed bitmap, reading without lock can cause core dump or totally wrong result instead of just some lossy one. Thus to visit a bitmap, we have to at least grab two locks as I can envision, one for relation mapping hash, the other for bitmap content protection. If no more possible benefits to expect, I don't think adding a bitmap is a good idea. Any other benefits that you can foresee? Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > For b1, it actually doesn't matter much though. With bitmap we definitely > can give a better EXPLAIN numbers for seqscan only, but without the bitmap, > we seldom make wrong choice of choosing/not choosing sequential scan. I think you have a more severe problem than that. It's not sequential scans that we have trouble estimating. Most of their blocks will be uncached and they'll be read sequentially. Both of these factors make estimating their costs pretty straightforward. It's the index scans that are the problem. Index scans look bad to the optimizer because they're random access, but they often have very high cache hit rates because they access relatively few blocks and often they're hot (the DBA did after all feel compelled to create the index in the first place). Moreover they're often inside Nested Loop plans which causes many of those blocks to be accessed repeatedly within the loop. And the cache hit rate matters *a lot* for index scans since a cache hit means the block won't be affected by the random access penalty. That is, it the cache speedup will help both sequential and index scans but skipping the seek only helps the index scan. And that's true regardless of whether it's found in Postgres's buffer cache or has to be read in from the filesystem cache. So you won't really be able to tell how many seeks are avoided without knowing whether the block is in the filesystem cache. In other words, the difference between being in Postgres's buffer cache and being in the filesystem cache, while not insignificant, isn't really relevant to the planner since it affects sequential scans and index scans equally. It's the difference between being in either cache versus requiring disk i/o that affects index scans disproportionately. And worse, it doesn't really matter whether it's in the cache when the query is planned. It matters whether it'll be in the cache when the access is made. If the node is inside a Nested Loop then subsequent trips through the loop the same blocks may end up being read and they may all be cached. -- greg
On Sat, 7 Jan 2006, Greg Stark wrote:
>
> "Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
>
> > For b1, it actually doesn't matter much. With the bitmap we can definitely
> > give better EXPLAIN numbers for a seqscan, but without the bitmap we seldom
> > make the wrong choice of choosing/not choosing a sequential scan.
>
> I think you have a more severe problem than that.
>
> It's not sequential scans that we have trouble estimating.
> It's the index scans that are the problem.

Exactly, we are saying the same thing.

>
> In other words, the difference between being in Postgres's buffer cache and
> being in the filesystem cache, while not insignificant, isn't really relevant
> to the planner since it affects sequential scans and index scans equally.

The bitmap was proposed because I think it is time to use a dominating shared_buffers size. Thus, if a page is not in the buffer cache, it is not in the OS cache either.

Regards,
Qingqing
Qingqing Zhou <zhouqq@cs.toronto.edu> writes:

> > In other words, the difference between being in Postgres's buffer cache and
> > being in the filesystem cache, while not insignificant, isn't really relevant
> > to the planner since it affects sequential scans and index scans equally.
>
> The bitmap was proposed because I think it is time to use a dominating
> shared_buffers size. Thus, if a page is not in the buffer cache, it is not
> in the OS cache either.

Hm. Personally I have a hunch you're right. But so far we have no actual evidence. The first thing that needs to happen is changes to use O_DIRECT for everything, and then benchmarking one of those big TPC tests with the O_DIRECT build and a large buffer cache versus a normal build with a traditional buffer cache size.

If it's anywhere close, even with no prefetching, then it ought to be clear that the costs of double buffering are becoming substantial.

As far as predicting cache hits, I think the best Postgres could do is track the average cache hit rate, either overall for the whole system or perhaps even per table and index. The first problem I see with that is that most systems have a mix of OLTP and DSS queries, and the two might have different patterns. Perhaps keep track of cache hit rates in multiple buckets based on the estimated number of rows? Maybe exponentially growing buckets of "1-10", "10-100", "100-1k", "1k-10k", ...

-- 
greg
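A hedged sketch of that bucket idea (all names invented, and the smoothing or aging a real implementation would need is omitted): the bucket is just the decimal order of magnitude of the estimated row count, i.e. 1-10, 10-100, 100-1k, 1k-10k, ...

#include <math.h>
#include <stdbool.h>

#define HIT_BUCKETS 8

typedef struct HitStats
{
    double      hits[HIT_BUCKETS];
    double      accesses[HIT_BUCKETS];
} HitStats;

static int
bucket_for_rows(double est_rows)
{
    int         b = (est_rows <= 1.0) ? 0 : (int) floor(log10(est_rows));

    return (b >= HIT_BUCKETS) ? HIT_BUCKETS - 1 : b;
}

/* Bump the counters for one block access made by a scan of est_rows size. */
static void
record_access(HitStats *stats, double est_rows, bool was_hit)
{
    int         b = bucket_for_rows(est_rows);

    stats->accesses[b] += 1.0;
    if (was_hit)
        stats->hits[b] += 1.0;
}

/* What the planner would consult: observed hit rate for queries this size. */
static double
predicted_hit_rate(const HitStats *stats, double est_rows)
{
    int         b = bucket_for_rows(est_rows);

    return (stats->accesses[b] > 0.0)
        ? stats->hits[b] / stats->accesses[b]
        : 0.5;                  /* arbitrary prior when no data yet */
}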
"Greg Stark" <gsstark@mit.edu> wrote in message news:87ek3k2c9j.fsf@stark.xeocode.com... > > Hm. Personally I have a hunch you're right. But there we have no actual > evidence. The first thing that needs to happen is changes to use O_DIRECT > for > everything and then benchmarking one of those big TPC tests with the > O_DIRECT > build and a large buffer cache versus a normal build with an traditional > buffer cache size. > A nice thing is that we can have both. User can choose to use small shared_buffer or big shared_buffer. According to user's choice, we will use different IO/buffering strategy. > If it's anywhere close, even with no prefetching then it ought to be clear > that the costs of double buffering are becoming substantial. > AFAIU double buffering only hurts when we use big shared_buffer value. > As far as predicting cache hits I think the best Postgres could do is > track > the average cache hit rate, either overall for the whole system or perhaps > even per table and index. > There is a linux kernel implementation of pre-read: http://glide.stanford.edu/lxr/source/mm/readahead.c?v=linux-2.6.5#L306 We have better hints for it: seqscan and bitmap scan. Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> wrote > > I wonder if we should really implement file-system-cache-warmup strategy > which we have discussed before. There are two natural good places to do > this: > > (1) sequentail scan > (2) bitmap index scan > For the sake of memory, there is a third place a warm-up cache or pre-read is beneficial (OS won't help us): (3) xlog recovery Regards, Qingqing
On Sat, Jan 14, 2006 at 04:13:56PM -0500, Qingqing Zhou wrote:
>
> "Qingqing Zhou" <zhouqq@cs.toronto.edu> wrote
> >
> > I wonder if we should really implement the file-system-cache-warmup strategy
> > which we have discussed before. There are two natural good places to do
> > this:
> >
> > (1) sequential scan
> > (2) bitmap index scan
> >
>
> For the record, there is a third place where a warm-up cache or pre-read
> is beneficial (the OS won't help us here):
>
> (3) xlog recovery

Wouldn't it be better to improve pre-reading of data instead, i.e., making sure things like seqscan and bitmap scan always keep the IO system busy?

-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461