Thread: Warm-up cache may have its virtue

Warm-up cache may have its virtue

Qingqing Zhou
Hinted by this thread:

I wonder if we should really implement the file-system-cache-warmup strategy
which we have discussed before. There are two natural good places to do
this:
(1) sequential scan
(2) bitmap index scan

We can consider (2) as a generalized version of (1). For (1), we have
mentioned several heuristics, like keeping a scan interval to avoid competition.
These strategies are also applicable to (2).

Question: why file-system level, instead of buffer pool level?  For two
reasons:  (1) Notice that in the above thread, the user just uses
"shared_buffers = 8192", which suggests that the file-system level is already
good enough; (2) it is easy to implement.
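
A minimal sketch of what file-system-level warm-up could look like, assuming a POSIX system where the kernel accepts read-ahead hints. The names `warm_range`, `demo` and `BLOCK_SIZE` are hypothetical and are not PostgreSQL code; the hint is advisory, so the kernel is free to ignore it under memory pressure.

```python
import os
import tempfile

BLOCK_SIZE = 8192  # assumed page size, matching PostgreSQL's default


def warm_range(path, first_block, nblocks, block_size=BLOCK_SIZE):
    """Hint the kernel to load a block range of `path` into its page cache.

    Returns the number of bytes covered by the hint (clamped to file size).
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = min(first_block * block_size, size)
        length = min(nblocks * block_size, size - offset)
        if length == 0:
            return 0
        if hasattr(os, "posix_fadvise"):
            # Non-blocking hint; the actual reads happen in the background.
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
        else:
            # Fallback for platforms without posix_fadvise: read the range.
            os.lseek(fd, offset, os.SEEK_SET)
            remaining = length
            while remaining > 0:
                chunk = os.read(fd, min(remaining, 1 << 20))
                if not chunk:
                    break
                remaining -= len(chunk)
        return length
    finally:
        os.close(fd)


def demo():
    """Warm parts of a scratch file that holds 3 blocks plus 100 bytes."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\0" * (3 * BLOCK_SIZE + 100))
        path = f.name
    try:
        return warm_range(path, 1, 10), warm_range(path, 100, 1)
    finally:
        os.unlink(path)
```

A dedicated helper process could call something like `warm_range` a fixed distance ahead of a running scan, which is one way to read the "scan interval" heuristic mentioned above.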

Use t*h*r*e*a*d? Well, I am a little bit afraid to mention this word.
But we can have some dedicated backends to do this - like bgwriter.

Let's get our hands dirty!



Re: Warm-up cache may have its virtue

Tom Lane
Qingqing Zhou writes:
> Hinted by this thread:
> I wonder if we should really implement file-system-cache-warmup strategy
> which we have discussed before.

The difference between the cached and non-cached states is that the
kernel has seen fit to remove those pages from its cache.  It is
reasonable to suppose that it did so because there was a more immediate
use for the memory.  Trying to override that behavior will therefore
result in de-optimizing the global performance of the machine.

If the machine is actually dedicated to Postgres, I'd expect disk pages
to stay in cache without our taking any heroic measures to keep them
there.  If they don't, that's a matter for kernel configuration tuning,
not "warmup" processes.
        regards, tom lane

Re: Warm-up cache may have its virtue

Qingqing Zhou

On Thu, 5 Jan 2006, Tom Lane wrote:
> The difference between the cached and non-cached states is that the
> kernel has seen fit to remove those pages from its cache.  It is
> reasonable to suppose that it did so because there was a more immediate
> use for the memory.  Trying to override that behavior will therefore
> result in de-optimizing the global performance of the machine.

Yeah, so in other words, warming up the cache is just a waste of time if the
pages are already in the OS cache. I agree with this. But does this mean it
may be worth experimenting with another strategy: big-stomach Postgres, i.e.,
with a big shared_buffers value? With this strategy, (1) almost all the buffers
are under our control, and we will know when a pre-read is needed; (2) we avoid
double-buffering: though people are advised not to use a very big
shared_buffers value, in practice I see people gain performance by
increasing it to 200000 or more.

Feasibility: Our bufmgr lock rewrite already makes this possible. But to
enable it, we may need more work: (w1) make the buffer pool relation-wise,
which makes our estimation of data page residence easier and more reliable;
(w2) add aggressive pre-read at the buffer pool level. Another benefit of
w1 is that our query planner can estimate query cost more precisely.


Re: Warm-up cache may have its virtue

Qingqing Zhou

On Thu, 5 Jan 2006, Qingqing Zhou wrote:
> Feasibility: Our bufmgr lock rewrite already makes this possible. But to
> enable it, we may need more work: (w1) make the buffer pool relation-wise,
> which makes our estimation of data page residence easier and more reliable;
> (w2) add aggressive pre-read at the buffer pool level. Another benefit of
> w1 is that our query planner can estimate query cost more precisely.

"w1" is doable by introducing a shared-memory bitmap indicating which
pages of a relation are in the buffer pool (we may want to add a hash to
manage the relations). Theoretically, O(shared_buffers) bits are enough, so
this will not use a lot of space.

When we maintain the SharedBufHash, we also maintain this bitmap. When we do
query cost estimation or pre-read, we just need a rough number, so this can
be done by scanning the bitmap without a lock. Thus there is almost no
extra cost.
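
The bitmap idea can be sketched as follows, assuming the simplest uncompressed layout with one bit per block of the relation. The class and method names are hypothetical and only illustrate the bookkeeping. Note that this naive layout costs one bit per relation block rather than O(shared_buffers) bits; reaching that bound would need a sparse or compressed form.

```python
class ResidencyBitmap:
    """One bit per relation block: 1 means the block is in the buffer pool."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.bits = bytearray((nblocks + 7) // 8)

    def mark_loaded(self, blkno):
        # Would be called wherever the buffer mapping gains an entry.
        self.bits[blkno >> 3] |= 1 << (blkno & 7)

    def mark_evicted(self, blkno):
        # Would be called wherever the buffer mapping loses an entry.
        self.bits[blkno >> 3] &= ~(1 << (blkno & 7)) & 0xFF

    def is_resident(self, blkno):
        return bool(self.bits[blkno >> 3] & (1 << (blkno & 7)))

    def resident_count(self):
        # A rough number is enough for cost estimation, so a scan that
        # races with concurrent updates is tolerable for this layout.
        return sum(bin(byte).count("1") for byte in self.bits)

    def missing_in_range(self, start, end):
        # Candidate blocks for pre-read in [start, end).
        stop = min(end, self.nblocks)
        return [b for b in range(start, stop) if not self.is_resident(b)]
```

For example, after loading blocks 3 and 10 of a 20-block relation, `missing_in_range(8, 12)` lists blocks 8, 9 and 11 as pre-read candidates.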


Re: Warm-up cache may have its virtue

"Qingqing Zhou"
"Qingqing Zhou" wrote
>> Feasibility: Our bufmgr lock rewrite already makes this possible. But to
>> enable it, we may need more work: (w1) make the buffer pool relation-wise,
>> which makes our estimation of data page residence easier and more reliable;
>> (w2) add aggressive pre-read at the buffer pool level. Another benefit of
>> w1 is that our query planner can estimate query cost more precisely.
> "w1" is doable by introducing a shared-memory bitmap indicating which
> pages of a relation are in buffer pool (We may want to add a hash to
> manage the relations). Theoretically, O(shared_buffer) bits are enough. So
> this will not use a lot of space.
> When we maintain the SharedBufHash, we maintain this bitmap. When we do
> query cost estimation or preread, we just need a rough number, so this can
> be done by scanning the bitmap without lock. Thus there is also almost no
> extra cost.

After some research, I have come to the conclusion that the bitmap idea is bad -
I hope I am wrong :-(.

The benefit of adding a bitmap is that it lets us know the current buffer
residence: (b1) Plan stage: give a more accurate cost estimate for sequential
scans; (b2) Execution stage: provide another way to let sequential
scans/bitmap scans identify the pages that need pre-read.

For b1, it actually doesn't matter much though. With the bitmap we definitely
can give better EXPLAIN numbers, but for seqscan only; and even without the
bitmap, we seldom make the wrong choice of choosing/not choosing a sequential
scan. Can any other cost estimation benefit? I am afraid not, since before
execution we simply don't know what to read. For b2, the bitmap does
provide another way, without contending for the BufMappingLock, to know which
buffers we should pre-read, but since contention on BufMappingLock is not
intensive, this brings only marginal benefit.

My previous estimate of the trouble/cost of maintaining this bitmap was too
optimistic. For one thing, we need to compress the bitmaps since many of them
are sparse. Unlike with an uncompressed bitmap, reading a compressed one
without a lock can cause a core dump or a totally wrong result instead of just
a lossy one. Thus to visit a bitmap, we have to grab at least two locks as far
as I can envision: one for the relation mapping hash, the other for bitmap
content protection.
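
To illustrate the two-lock cost, here is a rough sketch that uses a chunked sparse map as a stand-in for a compressed bitmap. Everything here (`SparseResidencyMap`, `map_for`, the chunk size) is a hypothetical illustration, not a proposed design. Because chunks appear and disappear as blocks come and go, a reader can no longer scan the structure safely without holding the content lock.

```python
import threading


class SparseResidencyMap:
    """Sparse per-relation residency map; only non-empty chunks are stored."""

    CHUNK_BITS = 1024  # assumed chunk size, chosen arbitrarily

    def __init__(self):
        self._lock = threading.Lock()  # lock 2: protects bitmap content
        self._chunks = {}              # chunk number -> integer bitmask

    def mark_loaded(self, blkno):
        chunk, bit = divmod(blkno, self.CHUNK_BITS)
        with self._lock:
            self._chunks[chunk] = self._chunks.get(chunk, 0) | (1 << bit)

    def mark_evicted(self, blkno):
        chunk, bit = divmod(blkno, self.CHUNK_BITS)
        with self._lock:
            mask = self._chunks.get(chunk, 0) & ~(1 << bit)
            if mask:
                self._chunks[chunk] = mask
            else:
                # Empty chunks are dropped, so the structure changes shape.
                self._chunks.pop(chunk, None)

    def resident_count(self):
        with self._lock:
            return sum(bin(m).count("1") for m in self._chunks.values())

    def chunk_count(self):
        with self._lock:
            return len(self._chunks)


_registry_lock = threading.Lock()  # lock 1: protects the relation mapping
_registry = {}


def map_for(relation):
    """Look up (or create) the map for a relation under the registry lock."""
    with _registry_lock:
        return _registry.setdefault(relation, SparseResidencyMap())
```

Every lookup now pays for `_registry_lock` and then the per-map lock, which is the overhead the paragraph above weighs against the marginal benefit.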

If there are no more possible benefits to expect, I don't think adding a bitmap
is a good idea. Any other benefits that you can foresee?


Re: Warm-up cache may have its virtue

Greg Stark
"Qingqing Zhou" writes:

> For b1, it actually doesn't matter much though. With bitmap we definitely 
> can give a better EXPLAIN numbers for seqscan only, but without the bitmap, 
> we seldom make wrong choice of choosing/not choosing sequential scan. 

I think you have a more severe problem than that. 

It's not sequential scans that we have trouble estimating. Most of their
blocks will be uncached and they'll be read sequentially. Both of these
factors make estimating their costs pretty straightforward.

It's the index scans that are the problem. Index scans look bad to the
optimizer because they're random access, but they often have very high cache
hit rates because they access relatively few blocks and often they're hot (the
DBA did after all feel compelled to create the index in the first place).
Moreover they're often inside Nested Loop plans which causes many of those
blocks to be accessed repeatedly within the loop.

And the cache hit rate matters *a lot* for index scans since a cache hit means
the block won't be affected by the random access penalty. That is, the
cache speedup will help both sequential and index scans but skipping the seek
only helps the index scan.

And that's true regardless of whether it's found in Postgres's buffer cache or
has to be read in from the filesystem cache. So you won't really be able to
tell how many seeks are avoided without knowing whether the block is in the
filesystem cache. 

In other words, the difference between being in Postgres's buffer cache and
being in the filesystem cache, while not insignificant, isn't really relevant
to the planner since it affects sequential scans and index scans equally. It's
the difference between being in either cache versus requiring disk i/o that
affects index scans disproportionately.
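
A toy cost model makes the asymmetry concrete. The constants below are illustrative assumptions (a 4:1 random-to-sequential ratio and an arbitrary cost for a page found in either cache), and `io_cost` and `speedup_from_caching` are hypothetical helpers, not the planner's actual formulas.

```python
SEQ_PAGE_COST = 1.0     # assumed cost of a page read sequentially from disk
RANDOM_PAGE_COST = 4.0  # assumed cost of a page read randomly from disk
CACHED_PAGE_COST = 0.1  # hypothetical cost of a page found in either cache


def io_cost(pages, hit_rate, miss_cost, hit_cost=CACHED_PAGE_COST):
    """Expected I/O cost when a fraction `hit_rate` of pages is cached."""
    return pages * ((1.0 - hit_rate) * miss_cost + hit_rate * hit_cost)


def speedup_from_caching(pages, hit_rate, miss_cost):
    """How many times cheaper the scan gets compared with a cold cache."""
    return io_cost(pages, 0.0, miss_cost) / io_cost(pages, hit_rate, miss_cost)
```

Under these assumptions a 90% hit rate makes a sequential scan roughly 5 times cheaper but an index scan roughly 8 times cheaper, because each hit also avoids the seek penalty.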

And worse, it doesn't really matter whether it's in the cache when the query
is planned. It matters whether it'll be in the cache when the access is made.
If the node is inside a Nested Loop then on subsequent trips through the loop
the same blocks may end up being read and they may all be cached.


Re: Warm-up cache may have its virtue

Qingqing Zhou

On Sat, 7 Jan 2006, Greg Stark wrote:

> "Qingqing Zhou" writes:
> > For b1, it actually doesn't matter much though. With bitmap we definitely
> > can give a better EXPLAIN numbers for seqscan only, but without the bitmap,
> > we seldom make wrong choice of choosing/not choosing sequential scan.
> I think you have a more severe problem than that.
> It's not sequential scans that we have trouble estimating.
> It's the index scans that are the problem.

Exactly, we are saying the same thing.

> In other words, the difference between being in Postgres's buffer cache and
> being in the filesystem cache, while not insignificant, isn't really relevant
> to the planner since it affects sequential scans and index scans equally.

The bitmap was proposed because I think it is time to let the shared_buffers
size dominate. Then, if a page is not in the buffer cache, it is not in the OS
cache either.


Re: Warm-up cache may have its virtue

Greg Stark
Qingqing Zhou writes:

> > In other words, the difference between being in Postgres's buffer cache and
> > being in the filesystem cache, while not insignificant, isn't really relevant
> > to the planner since it affects sequential scans and index scans equally.
> The bitmap was proposed since I think it is time to use dominated
> shared_buffer size. Thus, if it is not in buffer cache, it is not in OS
> cache either.

Hm. Personally I have a hunch you're right. But we have no actual
evidence. The first thing that needs to happen is changes to use O_DIRECT for
everything and then benchmarking one of those big TPC tests with the O_DIRECT
build and a large buffer cache versus a normal build with a traditional
buffer cache size.

If it's anywhere close, even with no prefetching then it ought to be clear
that the costs of double buffering are becoming substantial.

As far as predicting cache hits I think the best Postgres could do is track
the average cache hit rate, either overall for the whole system or perhaps
even per table and index. 

The first problem I see with that is that most systems have a mix of OLTP and
DSS queries and the two might have different patterns. Perhaps keeping track
of cache hit rates in multiple buckets based on the estimated number of rows?
Maybe exponentially growing buckets of "1-10" "10-100" "100-1k" "1k-10k", ...
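
The bucketing idea might look something like this sketch, assuming bucket k covers estimated row counts in [10^k, 10^(k+1)). `bucket_for` and `HitRateTracker` are hypothetical names, and how hits and reads would actually be counted is left open.

```python
from collections import defaultdict


def bucket_for(est_rows):
    """Return k such that 10**k <= est_rows < 10**(k + 1), with a floor of 0."""
    n = max(int(est_rows), 1)
    k = 0
    while n >= 10:  # integer arithmetic avoids floating-point log rounding
        n //= 10
        k += 1
    return k


class HitRateTracker:
    """Accumulates cache hits and block reads per row-count bucket."""

    def __init__(self):
        self.hits = defaultdict(int)
        self.reads = defaultdict(int)

    def record(self, est_rows, hits, reads):
        bucket = bucket_for(est_rows)
        self.hits[bucket] += hits
        self.reads[bucket] += reads

    def hit_rate(self, est_rows, default=0.0):
        # Fall back to `default` when nothing has been observed in the bucket.
        bucket = bucket_for(est_rows)
        reads = self.reads[bucket]
        return self.hits[bucket] / reads if reads else default
```

OLTP-style queries with a handful of rows and DSS-style queries with many rows then accumulate separate hit rates, so one pattern does not distort the estimate for the other.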


Re: Warm-up cache may have its virtue

"Qingqing Zhou"
"Greg Stark" wrote in message
> Hm. Personally I have a hunch you're right. But we have no actual
> evidence. The first thing that needs to happen is changes to use O_DIRECT for
> everything and then benchmarking one of those big TPC tests with the O_DIRECT
> build and a large buffer cache versus a normal build with a traditional
> buffer cache size.

A nice thing is that we can have both. The user can choose to use a small
shared_buffers or a big shared_buffers. According to the user's choice, we will
use a different IO/buffering strategy.

> If it's anywhere close, even with no prefetching then it ought to be clear
> that the costs of double buffering are becoming substantial.

AFAIU double buffering only hurts when we use a big shared_buffers value.

> As far as predicting cache hits I think the best Postgres could do is track
> the average cache hit rate, either overall for the whole system or perhaps
> even per table and index.

There is a Linux kernel implementation of pre-read:

We have better hints for it: seqscan and bitmap scan.


Re: Warm-up cache may have its virtue

"Qingqing Zhou"
"Qingqing Zhou" wrote
> I wonder if we should really implement file-system-cache-warmup strategy
> which we have discussed before. There are two natural good places to do
> this:
> (1) sequential scan
> (2) bitmap index scan

For the sake of memory, there is a third place a warm-up cache or pre-read 
is beneficial (OS won't help us):
(3) xlog recovery


Re: Warm-up cache may have its virtue

"Jim C. Nasby"
On Sat, Jan 14, 2006 at 04:13:56PM -0500, Qingqing Zhou wrote:
> "Qingqing Zhou" wrote
> >
> > I wonder if we should really implement file-system-cache-warmup strategy
> > which we have discussed before. There are two natural good places to do
> > this:
> >
> > (1) sequential scan
> > (2) bitmap index scan
> >
> For the sake of memory, there is a third place a warm-up cache or pre-read 
> is beneficial (OS won't help us):
> (3) xlog recovery

Wouldn't it be better to improve pre-reading data instead, i.e., making
sure things like seqscan and bitmap scan always keep the IO system busy?
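
One way to picture "keeping the IO system busy" is a scan that issues read hints a fixed distance ahead of the block it is about to read. This is only a sketch: `scan_with_prefetch`, its callbacks and the `distance` parameter are hypothetical, and the callbacks stand in for whatever read and hint primitives the executor would really use.

```python
def scan_with_prefetch(blocks, read_block, prefetch_block, distance=4):
    """Yield read_block(b) for each block, hinting up to `distance` ahead.

    `blocks` is the ordered block list of a seqscan or bitmap scan.
    """
    blocks = list(blocks)
    next_hint = 1  # index of the next block to hint; block 0 is read directly
    for i, blk in enumerate(blocks):
        limit = min(i + distance, len(blocks) - 1)
        while next_hint <= limit:
            # Each block is hinted exactly once, before it is needed.
            prefetch_block(blocks[next_hint])
            next_hint += 1
        yield read_block(blk)
```

Because a bitmap scan knows its full block list up front, the same loop works for non-contiguous blocks, where kernel read-ahead heuristics have less to go on.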
Jim C. Nasby, Sr. Engineering Consultant