Re: A costing analysis tool - Mailing list pgsql-hackers

From Kevin Grittner
Subject Re: A costing analysis tool
Date
Msg-id s34e9be4.028@gwmta.wicourts.gov
In response to A costing analysis tool  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses Re: A costing analysis tool
List pgsql-hackers
Thanks for the well wishes.

It sounds like you were addressing a slightly different problem --
more ambitious than what I propose to tackle as a first step.
If I understand you, you were trying to develop your own
predictive costing formulas based on plans.  I'm merely talking
about a tool to evaluate the relative accuracy of the predictions
generated by PostgreSQL.  So for the purposes of the proposed
tool, the PostgreSQL estimates are central.

The ultimate goal is to be able to spot where the current
calculations are least accurate, so that any adjustments can
be made where they are most needed.  You will notice that my
working assumptions start with the observation that most of
the time, PostgreSQL does very well.  I have certainly found
that to be the case, compared to a commercial product running
the same queries against the same data.  I'm operating on the
assumption that relatively minor adjustments to current
techniques can take off what rough edges there are.
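For concreteness, the raw comparison I have in mind is just the
estimate/actual pair that EXPLAIN ANALYZE already reports for every
plan node.  A minimal sketch (the table and predicate here are
invented for illustration):

  -- Each node in the output carries the planner's prediction
  -- (cost=startup..total rows=n) next to the measurement
  -- (actual time=startup..total rows=n loops=n).  The tool would
  -- collect many such pairs across plan types and flag the node
  -- types where predicted cost and measured time diverge the most.
  EXPLAIN ANALYZE
  SELECT * FROM orders WHERE order_date >= DATE '2005-01-01';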

That said, there's certainly overlap between your effort and
what I'm going to be developing.  Do you have anything from
your work which might save me some time?

Your point regarding a convenient way for people to submit
results from diverse environments, with a nice web
presentation of collated results, is well taken.  I'm not sure
my "off-the-cuff" estimate from an earlier post allows enough
time for that -- I'll have to be sure to include something at
least adequate and expandable in the design.  I probably won't
try for anything too fancy.  I'm at my best on frameworky
internal sorts of things.

There's a chance that I may be able to talk my client into
putting a web app guy on this for a few days to make it pretty.
You never know.  Would it make more sense for my client to
host something like that on their servers, or would it be more
appropriate to have something which would install on a
postgresql.org server?  If the latter, what sorts of
technologies are supported?  (My client is partial to cocoon.)
-Kevin


>>> Martijn van Oosterhout <kleptog@svana.org> 10/13/05 2:41 PM >>>
On Thu, Oct 13, 2005 at 01:52:10PM -0500, Kevin Grittner wrote:
> Thanks, Josh, for the feedback.
>
> It sounds as though you are more focused on picking up costing
> problems which happen during production -- which is clearly
> valuable, but addresses a somewhat different set of needs than
> I was looking at.  That said, it seems like there is potential to share
> significant code between the two techniques.  We'll have to see if
> we can work that out.

Firstly, I really hope you get further with this than I did a while ago
when I attempted it. It's certainly a worthy goal.

Secondly, while checking for problems in production systems is good,
it's not going to help with fixing the cost model. For that you need
raw data.

My basic plan was to set up tables of different sizes and attempt to run
queries such as the following (a rough sketch is below the list):

- Index Scan on each table with different types of keys and coverage.
- Seq Scan
- Nested loop, etc...
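
A rough sketch of the kind of fixture I mean (table and column names
here are just made up, and generate_series() keeps the sizes easy to
control):

  -- Build tables of known, different sizes and analyze them.
  CREATE TABLE probe_small AS
    SELECT g AS id, g % 100 AS grp FROM generate_series(1, 10000) g;
  CREATE TABLE probe_large AS
    SELECT g AS id, g % 100 AS grp FROM generate_series(1, 1000000) g;
  CREATE INDEX probe_large_id ON probe_large (id);
  ANALYZE probe_small;
  ANALYZE probe_large;

  -- Seq Scan: predicate with no usable index.
  EXPLAIN ANALYZE SELECT count(*) FROM probe_large WHERE grp = 7;

  -- Index Scan: selective range on the indexed column.
  EXPLAIN ANALYZE SELECT * FROM probe_large WHERE id BETWEEN 1000 AND 1100;

  -- Join that should come out as a nested loop with an inner index scan.
  EXPLAIN ANALYZE
  SELECT * FROM probe_small s JOIN probe_large l ON l.id = s.id
  WHERE s.id < 500;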

I did reach the point where I was wishing I could just give PostgreSQL
the plan and tell it to execute it. :) The point of the exercise is to
be able to derive correlations so you could calculate the actual costs
from the plan. For example, run a nested loop with an inner index scan
once, twice, three times etc so we can actually *see* what the cache
effects are.

I got stuck on working out how to force the optimiser to produce the
plan I want. I didn't try too hard though. The enable_xxx options
should be enough, hopefully. Of course you want to run it with different
numbers of shared buffers to see how they affect the results.
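
For example, something like this (reusing the hypothetical probe_*
tables from the sketch above) should pin the plan down well enough,
with the caveat that the enable_* settings only discourage a plan
rather than forbid it:

  -- Push the planner toward a nested loop with an inner index scan.
  SET enable_hashjoin  = off;
  SET enable_mergejoin = off;
  SET enable_seqscan   = off;

  -- Run this several times in a row and compare the actual times to
  -- see how much of the cost is really cache effects.
  EXPLAIN ANALYZE
  SELECT * FROM probe_small s JOIN probe_large l ON l.id = s.id
  WHERE s.id < 500;

  RESET enable_hashjoin;
  RESET enable_mergejoin;
  RESET enable_seqscan;

  -- shared_buffers can't be changed per session; seeing its effect
  -- means editing postgresql.conf and restarting between runs.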

And then you ideally want the results for several different machines,
different disk subsystems, memory types, etc and placed on a nice web
page so other people can run correlations on the data themselves.

This is essentially what you already came up with. Note that for these
purposes the actual estimates by PostgreSQL are irrelevant. However, I
strongly suggest finding a way of collating the results publicly from
lots of people because digging for correlations is something lots of
people can hammer on and is really hard to program.

Hope this helps,


