A costing analysis tool - Mailing list pgsql-hackers

From Kevin Grittner
Subject A costing analysis tool
Date
Msg-id s34d55ef.014@gwmta.wicourts.gov
Whole thread Raw
Responses Re: A costing analysis tool
Re: A costing analysis tool
Re: A costing analysis tool
List pgsql-hackers
I'm looking at trying to fix some clear flaws in costing which cause
of our real-world queries to choose sub-optimal plans under PostgreSQL.
It's clear that there needs to be a tool to analyze the accuracy of
costing for a variety of queries, both to direct any efforts to fix
problems and to test for possible costing regressions.  As far as I can
tell, no such tool currently exists.  If I've missed something, please
let me know, even if it's ad hoc or incomplete.

Note that I'm talking about a tool strictly to check the accuracy of
the estimated costs of plans chosen by the planner, nothing else.

I'm at the rough planning stage, and would appreciate any feedback on
my thoughts before I start actual development.  Considerations, in no
particular order (but numbered for easy reference):

(1)  Most of the time, on a well-configured database, PostgreSQL
chooses a plan which performs very well.  Many test cases need to cover
these normal cases, both to set a baseline and to catch any regression.

(2)  A large database must be created for these tests, since many
issues don't show up in small tables.  The same data must be generated
in every database, so results are comparable and reproducable.

(3)  Developers should be able to easily add test cases, either for
their own use or contributed to the community.

(4)  The same test query should be able to easily run in different
permutations of the following:
 (a)  With no data cached before the run, or as fully cached as possible. (b)  With various enable_xxx settings on or
off.(c)  With or without significant dead space in the tables/indexes. 

(5)  The tool needs to be able to run in a variety of OS environments.
At a minimum, this means some way to pick up configuration information
to specify how to start and stop the back end, and how to flush the
system cache.

(6)  The relative costs of various plans shifts dramatically when C
asserts are enabled.  To avoid misleading results, the tool should warn
the user when run on a build configured with --enable-cassert, and all
results from such an environment should be conspicuously identified as
such.

(7)  I envision a process to create a test database, populate it, run a
series of test cases with EXPLAIN ANALYZE, capture the results, parse
the results and store them in a database, analyze the results to find
means and standard deviations both overall and for each type of plan,
and report summaries and outliers -- with references to the test cases.
The primary statistic of interest is actual time divided by cost.  This
seems like it would be of interest overall, and within the permutations
mentioned above for a single query.

A reasonable set of test cases would illuminate where costing
adjustments would provide the most benefit for the least risk.  By
reining in the most extreme outliers, we could allow the planner to
recognize the best plan among the options it is considering with
greater accuracy.  This tool could be used as a regression test to
ensure that a costing adjustment didn't distort the cost of something
which had been accurately costed.

So, what do you think?

-Kevin



pgsql-hackers by date:

Previous
From: "Kevin Grittner"
Date:
Subject: Re: Are cost estimates based on asserts?
Next
From: David Fetter
Date:
Subject: Re: Comments on columns in the pg_catalog tables/views