From: Robert Haas
Subject: Re: cache estimates, cache access cost
Msg-id: BANLkTinKk5z659bd1-YX=r0iehB6iSJNYQ@mail.gmail.com
In response to: Re: cache estimates, cache access cost (Cédric Villemain <cedric.villemain.debian@gmail.com>)
List: pgsql-hackers
On Thu, May 19, 2011 at 8:19 AM, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
> 2011/5/19 Robert Haas <robertmhaas@gmail.com>:
>> On Tue, May 17, 2011 at 6:11 PM, Cédric Villemain
>> <cedric.villemain.debian@gmail.com> wrote:
>>> The point is to get ratio in cache, not the distribution of the data
>>> in cache (pgfincore also allows you to see this information).
>>> I don't see how a stable (a server in production) system can have its
>>> ratio moving up and down so fast without known pattern.
>>
>> Really?  It doesn't seem that hard to me.  For example, your nightly
>> reports might use a different set of tables than are active during the
>> day....
>
> yes, this is known pattern, I believe we can work with it.

I guess the case where I agree that this would be relatively static
is something like a busy OLTP system.  If different users access
different portions of the main tables, which parts of each relation
are hot might move around, but overall the percentage of that relation
in cache probably won't move around a ton, except perhaps just after
running a one-off reporting query, or when the system is first
starting up.
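
Just to illustrate the kind of measurement I mean: with the contrib
pg_buffercache module you can eyeball that per-relation ratio with
something like the query below.  This is only a sketch - it sees
shared_buffers but not the OS cache (that's what pgfincore is for),
and it assumes the default 8192-byte block size.

-- Sketch: what fraction of each relation in the current database
-- is sitting in shared_buffers.  Assumes 8192-byte blocks.
SELECT c.relname,
       count(*) AS buffers,
       round(100.0 * count(*) * 8192
             / nullif(pg_relation_size(c.oid), 0), 1) AS pct_cached
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
JOIN pg_database d ON b.reldatabase = d.oid
                  AND d.datname = current_database()
GROUP BY c.relname, c.oid
ORDER BY buffers DESC
LIMIT 10;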

But that's not everybody's workload.  Imagine a system that is
relatively lightly used.  Every once in a while someone comes along
and runs a big reporting query.  Well, the contents of the buffer
caches might vary considerably depending on *which* big reporting
queries ran most recently.

Also, even if we knew what was going to be in cache at the start of
the query, the execution of the query might change things greatly as
it runs.  For example, imagine a join between some table and itself.
If we estimate that none of the data is in cache, we will almost
certainly be wrong, because it's likely both sides of the join are
going to access some of the same pages.  Exactly how many depends on
the details of the join condition and whether we choose to implement
it by merging, sorting, or hashing.  But it's likely going to be more
than zero.  This problem can also arise in other contexts - for
example, if a query accesses a bunch of large tables, the tables that
are accessed later in the computation might be less cached than the
ones accessed earlier in the computation, because the earlier accesses
pushed parts of the tables accessed later out of cache.  Or, if a
query requires a large sort, and the value of work_mem is very high
(say 1GB), the sort might evict data from cache.  Now maybe none of
this matters a bit in practice, but it's something to think about.
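
If you want to watch this happening, EXPLAIN (ANALYZE, BUFFERS) on
9.0 or later is handy - for the self-join case, something like the
sketch below, where the table and join column are invented.  The
"shared hit" vs. "read" counts show how many pages the second scan
found already in shared_buffers.

-- Sketch only; table and column names are made up.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM some_big_table a
JOIN some_big_table b ON a.parent_id = b.id;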

There was an interesting report on a problem along these lines from
Kevin Grittner a while back.  He found he needed to set seq_page_cost
and random_page_cost differently for the database user that ran the
nightly reports, precisely because the degree of caching was very
different than it was for the daily activity, and he got bad plans
otherwise.
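
That kind of fix looks roughly like the sketch below - the role names
and cost values here are invented, not Kevin's actual settings.

-- Sketch: per-role planner costs.  The reporting role plans as if
-- the data is mostly uncached; the daytime role as if it's hot.
ALTER ROLE nightly_reports SET seq_page_cost = 1.0;
ALTER ROLE nightly_reports SET random_page_cost = 4.0;
ALTER ROLE daytime_app SET seq_page_cost = 0.1;
ALTER ROLE daytime_app SET random_page_cost = 0.1;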

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

