Re: cache estimates, cache access cost - Mailing list pgsql-hackers

From Cédric Villemain
Subject Re: cache estimates, cache access cost
Msg-id BANLkTikJmneM45UTgbw8tpFC+Lke=gQfJA@mail.gmail.com
In response to Re: cache estimates, cache access cost  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: cache estimates, cache access cost  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
2011/5/17 Robert Haas <robertmhaas@gmail.com>:
> On Sun, May 15, 2011 at 11:52 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>> Cédric Villemain wrote:
>>>
>>> http://git.postgresql.org/gitweb?p=users/c2main/postgres.git;a=shortlog;h=refs/heads/analyze_cache
>>
>> This rebases easily to make Cedric's changes move to the end; I just pushed
>> a version with that change to
>> https://github.com/greg2ndQuadrant/postgres/tree/analyze_cache if anyone
>> wants a cleaner one to browse.  I've attached a patch too if that's more
>> your thing.
>
> Thank you.  I don't much like sucking in other people's git repos - it
> tends to take a lot longer than just opening a patch file, and if I
> add the repo as a remote then my git repo ends up bloated.  :-(
>
>> The more important question is how to store the data collected and
>> then use it for optimizing queries.
>
> Agreed, but unless I'm missing something, this patch does nothing
> about that.  I think the first step needs to be to update all the
> formulas that are based on random_page_cost and seq_page_cost to
> properly take cache_page_cost into account - and in some cases it may
> be a bit debatable what the right mathematics are.

Yes, I provided the branch only in case someone wants to hack the
costsize code, and to close the problem of getting the stats.
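
To illustrate what "taking cache_page_cost into account" could look
like in costsize.c, here is a rough sketch; cached_page_cost and the
per-relation cache fraction are hypothetical names, not what the
branch implements, and the clamping and default choices are only my
assumptions:

/*
 * Sketch: charge cached pages at cached_page_cost and the rest at the
 * usual per-page cost.  cache_fraction would come from the collected
 * per-relation stats, or from a default derived from the relation size
 * relative to effective_cache_size.  Cost is the planner's double typedef.
 */
static Cost
page_cost_with_cache(Cost uncached_page_cost, double cache_fraction,
                     double pages_fetched)
{
    Cost        per_page;

    /* clamp a stale or bogus fraction into [0, 1] */
    if (cache_fraction < 0.0)
        cache_fraction = 0.0;
    else if (cache_fraction > 1.0)
        cache_fraction = 1.0;

    per_page = cache_fraction * cached_page_cost +
               (1.0 - cache_fraction) * uncached_page_cost;

    return per_page * pages_fetched;
}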

>
> For what it's worth, I don't believe for a minute that an analyze
> process that may only run on a given table every six months has a
> chance of producing useful statistics about the likelihood that a
> table will be cached.  The buffer cache can turn over completely in
> under a minute, and a minute is a lot less than a month.  Now, if we
> measured this information periodically for a long period of time and
> averaged it, that might be a believable basis for setting an optimizer

The point is to get the ratio in cache, not the distribution of the data
in cache (pgfincore also lets you see that information).
I don't see how a stable system (a server in production) can have its
ratio moving up and down so fast without a known pattern.
Maybe it is a data warehouse, so data moves a lot; then just update your
per-relation stats before starting your queries, as suggested in other
threads. Maybe it is just a matter of the frequency of stats updates, or
of an explicit request like we *used to do* (ANALYZE foo;) to handle
those situations.

> parameter.  But I think we should take the approach recently discussed
> on performance: allow it to be manually set by the administrator on a
> per-relation basis, with some reasonable default (maybe based on the
> size of the relation relative to effective_cache_size) if the
> administrator doesn't intervene.  I don't want to be excessively
> negative about the approach of examining the actual behavior of the
> system and using that to guide system behavior - indeed, I think there
> are quite a few places where we would do well to incorporate that
> approach to a greater degree than we do currently.  But I think that
> it's going to take a lot of research, and a lot of work, and a lot of
> performance testing, to convince ourselves that we've come up with an
> appropriate feedback mechanism that will actually deliver better
> performance across a large variety of workloads.  It would be much
> better, IMHO, to *first* get a cached_page_cost parameter added, even
> if the mechanism by which caching percentages are set is initially
> quite crude - that will give us a clear-cut benefit that people can
> begin enjoying immediately.

The plugin I provided is just meant to allow a first analysis of how
the OS cache usage moves. You can either use pgfincore to monitor that
per table, or apply the patch and monitor the *cache column values.

I took the hooks approach because it lets you do exactly what you want :)
You can set up a hook that supplies the values you want to see; that
allows, for example, fixing cold-start values, or permanent values set
by the DBA, or ... whatever you want here.
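
As an illustration of what such a hook makes possible (the name and
signature below are only a sketch, the real hook in the branch may
differ): an extension can override the fraction the planner will use,
for example to pin a cold-start value chosen by the DBA instead of the
last collected number:

/*
 * Sketch of a hook letting an extension override the cache fraction
 * used for a relation; illustrative only.
 */
typedef double (*cache_fraction_hook_type) (Oid relid, double stored_value);
extern cache_fraction_hook_type cache_fraction_hook;

/* example hook implementation in an extension */
static double
dba_cache_fraction(Oid relid, double stored_value)
{
    /* big_fact_table_oid is hypothetical, looked up at load time */
    if (relid == big_fact_table_oid)
        return 0.05;            /* force "mostly uncached" after a restart */
    return stored_value;        /* otherwise trust the collected stats */
}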

The question is: do we need more parameters to improve our planner?
1/ cache_page_cost
2/ cache information, arbitrarily set or not.

Starting with 1/ is OK for me, but I would prefer to try both at once if
possible, to avoid having to hack costsize.c twice.
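
For 1/, the parameter itself is mostly mechanical; a minimal sketch of
where it would sit, mirroring seq_page_cost and random_page_cost (the
name and default value are only assumptions and would need discussion):

/* optimizer/cost.h, next to DEFAULT_SEQ_PAGE_COST and friends */
#define DEFAULT_CACHED_PAGE_COST   0.1
extern double cached_page_cost;

/* costsize.c */
double      cached_page_cost = DEFAULT_CACHED_PAGE_COST;

/* plus the usual real-valued GUC entry in guc.c (QUERY_TUNING_COST
 * group), so it can be tuned like the other page costs. */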

Several items remain to be discussed after that: formulas to handle
'small' tables, use of the data distribution (which hits the old topic
of auto-partitioning, while we are at it), cold state, hot state, ...

PS: there is a very good blocker for the pg_class changes: what happens
on a standby? Maybe it just opens the door to working out how to unlock
that, or to finding another option that keeps the information per table
but distinct per server. (Or we don't care, at least for a first
implementation, as with other parameters.)
--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL: Expertise, Training and Support

