Re: A costing analysis tool - Mailing list pgsql-hackers

From Josh Berkus
Subject Re: A costing analysis tool
Msg-id 434E98A0.2080404@agliodbs.com
In response to A costing analysis tool  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List pgsql-hackers
Kevin,

> I'm looking at trying to fix some clear flaws in costing which cause
> some of our real-world queries to choose sub-optimal plans under
> PostgreSQL.
> It's clear that there needs to be a tool to analyze the accuracy of
> costing for a variety of queries, both to direct any efforts to fix
> problems and to test for possible costing regressions.  As far as I can
> tell, no such tool currently exists.  If I've missed something, please
> let me know, even if it's ad hoc or incomplete.

Actually, this is pretty much exactly what I've been thinking about for 
the last year.   I'm very happy that someone else is interested in 
working on it.

> (2)  A large database must be created for these tests, since many
> issues don't show up in small tables.  The same data must be generated
> in every database, so results are comparable and reproducible.
> 
> (3)  Developers should be able to easily add test cases, either for
> their own use or contributed to the community.

Sure.  However, I think it's important to separate the test cases from 
the cost collection tool.  Our *best* test cases will be real production 
applications.   For synthetic test cases, we can look to improving 
DBT-OSDL, Jan-TPCW, OSDBB and eDB's test (if they ever publish it).  The 
only thing that mess of tests is lacking is easy setup and portability.


> (7)  I envision a process to create a test database, populate it, run a
> series of test cases with EXPLAIN ANALYZE, capture the results, parse
> the results and store them in a database, analyze the results to find
> means and standard deviations both overall and for each type of plan,
> and report summaries and outliers -- with references to the test cases.
> The primary statistic of interest is actual time divided by cost.  This
> seems like it would be of interest overall, and within the permutations
> mentioned above for a single query.

I would actually like to do this differently.   I think an asynchronous 
logging mechanism is more useful, because there are cost estimation 
problems which don't show up except under conditions of concurrency and 
heavy server load.  For this reason, it's very important that this kind 
of cost collection could be performed on a production application.

What that would mean is some process whereby the system could sample, 
say, 5% of the queries being run (at random) and run EXPLAIN ANALYZEs 
against them, logging the results in a way that could be tabularized.
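To make the idea concrete, here's a rough sketch of the sampling
decision in Python (SAMPLE_RATE and maybe_explain are names I just made
up; a real version would hook into the backend rather than rewrite
query text):

```python
import random

SAMPLE_RATE = 0.05  # instrument roughly 5% of queries at random

def maybe_explain(query, rng=random):
    """Return an instrumented form of `query` for a sampled fraction.

    Most queries pass through unchanged; a random ~5% get wrapped in
    EXPLAIN ANALYZE so estimated costs can be logged alongside actual
    execution times, even under production concurrency and load.
    """
    if rng.random() < SAMPLE_RATE:
        return "EXPLAIN ANALYZE " + query
    return query
```

The point of sampling rather than instrumenting everything is that
EXPLAIN ANALYZE itself adds timing overhead, so you want the cost data
without distorting the production workload too much.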

Speaking of which, I think you're missing an important first step: 
tabular output for EXPLAIN ANALYZE.   A whole host of query testing 
tools could be developed if it were easy to shove EA results into a 
format where statistics could be run on them.  Without it, it's pretty 
hard to do the rest of the testing.
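As a sketch of what "tabularizing" could mean, here's one way to pull
the (cost=..) and (actual time=..) annotations off a plan node line and
compute the time/cost ratio; the regex and field names are my own
invention, not anything that exists today:

```python
import re

# Matches the estimated-cost and actual-time annotations that
# EXPLAIN ANALYZE prints after each plan node.
NODE_RE = re.compile(
    r"\(cost=[\d.]+\.\.(?P<total_cost>[\d.]+) "
    r"rows=(?P<est_rows>\d+) width=\d+\) "
    r"\(actual time=[\d.]+\.\.(?P<actual_ms>[\d.]+) "
    r"rows=(?P<actual_rows>\d+) loops=(?P<loops>\d+)\)"
)

def parse_node(line):
    """Turn one EXPLAIN ANALYZE plan line into a tabular row.

    Returns a dict with the node's estimated total cost, actual total
    time, row counts, and the actual-time-per-cost-unit ratio that the
    proposed analysis would aggregate across many queries.
    """
    m = NODE_RE.search(line)
    if m is None:
        return None
    total_cost = float(m.group("total_cost"))
    actual_ms = float(m.group("actual_ms"))
    return {
        "total_cost": total_cost,
        "est_rows": int(m.group("est_rows")),
        "actual_ms": actual_ms,
        "actual_rows": int(m.group("actual_rows")),
        "loops": int(m.group("loops")),
        "time_per_cost": actual_ms / total_cost if total_cost else None,
    }
```

Rows like these, dumped into a table, are exactly what you'd need to
compute the per-plan-type means and standard deviations Kevin describes.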

> So, what do you think?

How much time do you have to spend on this?

I'd like to offer you the TestPerf project on pgfoundry 
(www.pgfoundry.org/projects/testperf) as a container for your work on 
this idea.   I also have access to a variety of test machines for 
performance tests.

--Josh


