
From Kevin Grittner
Subject Re: A costing analysis tool
Msg-id s34fab58.030@gwmta.wicourts.gov
I think I get your point now.  If I understand it, you could
accomplish what you want under my rough ("exists only in my
head so far") design by creating your own test cases and putting
together a script to run just those.  I would be extremely leery of
comparing tests against a database under load with tests run
against an otherwise idle system -- you would expect the time
per cost unit to increase -- but as long as the load remained
relatively constant, you could get useful information from
comparing the various tests from that system under that load.

Are we tracking now?

I propose capturing only three values from the output of explain
analyze, and saving it with many columns of context information.
Below is a sample script to capture the three columns and display
them with the derived ratio and two very limited bits of context.
Don't worry about the lack of proper context just yet, we're talking
about what I propose to capture from the output files.
Here's the script:

------------------------------------------------------------
# simplistic example of parsing through PostgreSQL analyze verbose output
import sys

def printHuman(filename, lineNum, operation, cost, actualTime):
    # one detail line: file, plan line number, node type,
    # estimated total cost, total actual time, actual/cost ratio
    print "%-60s  %3s  %-20s  %10.2f  %12.3f  %f" \
        % (filename, lineNum, operation, float(cost), float(actualTime),
           float(actualTime) / float(cost))

for inputfilename in sys.argv[1:]:
    inputfile = open(inputfilename, 'r')
    inputlines = inputfile.readlines()
    inputfile.close()
    for indx in range(len(inputlines)):
        line = inputlines[indx].lstrip()
        if line[:2] == '->':
            # node type is everything up to the opening paren, minus
            # any trailing ' on <table>' or ' using <index>'
            opName = line[3:line.find('(')].strip()
            tailstripwords = (' on ', ' using ')
            for word in tailstripwords:
                onpos = opName.find(word)
                if onpos > -1:
                    opName = opName[:onpos]
            # total estimated cost follows the first '..'
            dotspos = line.find('..')
            cost = line[dotspos + 2: line.find(' ', dotspos)]
            # total actual time follows the '..' after 'actual time'
            nextdotspos = line.find('..', line.find('actual time'))
            actTime = line[nextdotspos + 2: line.find(' ', nextdotspos)]
            printHuman(inputfilename, indx, opName, cost, actTime)
------------------------------------------------------------
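
For reference, the script keys on EXPLAIN ANALYZE plan lines of the general
form below (the numbers here are only illustrative); it takes the value after
the first '..' as the total estimated cost and the value after the '..'
following 'actual time' as the total actual time:

------------------------------------------------------------
->  Merge Join  (cost=1234.56..176218.09 rows=100000 width=32) (actual time=98.765..4500.785 rows=100000 loops=1)
------------------------------------------------------------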

and here's the output based on the plans shown in an earlier thread
( http://archives.postgresql.org/pgsql-hackers/2005-10/msg00303.php ):

------------------------------------------------------------
[kgrittn@WCS45792 kgrittn]$ python parse_plan.py *.eplain.analyze
milw-cal-daterange-bare-cached.eplain.analyze                   1  Merge Join             176218.09      4500.785  0.025541
milw-cal-daterange-bare-cached.eplain.analyze                   3  Sort                    48975.72       812.160  0.016583
milw-cal-daterange-bare-cached.eplain.analyze                   5  Bitmap Heap Scan        47081.47       427.463  0.009079
milw-cal-daterange-bare-cached.eplain.analyze                   7  Bitmap Index Scan         297.07        19.089  0.064258
milw-cal-daterange-bare-cached.eplain.analyze                   9  Sort                   126945.94      3124.966  0.024617
milw-cal-daterange-bare-cached.eplain.analyze                  11  Bitmap Heap Scan       119071.84      1195.302  0.010038
milw-cal-daterange-bare-cached.eplain.analyze                  13  Bitmap Index Scan        1037.84        54.917  0.052915
milw-cal-daterange-bare-noncached.eplain.analyze                1  Merge Join             176218.09     32071.776  0.182000
milw-cal-daterange-bare-noncached.eplain.analyze                3  Sort                    48975.72      5954.706  0.121585
milw-cal-daterange-bare-noncached.eplain.analyze                5  Bitmap Heap Scan        47081.47      5560.895  0.118112
milw-cal-daterange-bare-noncached.eplain.analyze                7  Bitmap Index Scan         297.07        52.675  0.177315
milw-cal-daterange-bare-noncached.eplain.analyze                9  Sort                   126945.94     25551.291  0.201277
milw-cal-daterange-bare-noncached.eplain.analyze               11  Bitmap Heap Scan       119071.84     23588.122  0.198100
milw-cal-daterange-bare-noncached.eplain.analyze               13  Bitmap Index Scan        1037.84       146.412  0.141074
milw-cal-daterange-nomergejoin-cached.eplain.analyze            1  Nested Loop            298740.94      1464.720  0.004903
milw-cal-daterange-nomergejoin-cached.eplain.analyze            2  Index Scan              49419.45       412.268  0.008342
milw-cal-daterange-nomergejoin-cached.eplain.analyze            4  Index Scan                  9.93         0.019  0.001913
milw-cal-daterange-nomergejoin-noncached.eplain.analyze         1  Nested Loop            298740.94     14277.124  0.047791
milw-cal-daterange-nomergejoin-noncached.eplain.analyze         2  Index Scan              49419.45     11510.723  0.232919
milw-cal-daterange-nomergejoin-noncached.eplain.analyze         4  Index Scan                  9.93         0.062  0.006244
------------------------------------------------------------

Besides the additional context info, I expect to be storing the log
of the ratio, since it seems to make more sense to average and
look for outliers based on that than the raw ratio.  Is there anything
else you think should be captured?  (Keep in mind that I'm trying
to keep the scope of this very focused, so it can actually get done,
but I'm not averse to capturing something else that's just sitting
there if it's useful to someone.)
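
To illustrate what I mean (a rough sketch only, not something I'm proposing
verbatim): averaging on log(actual_time / cost) and flagging entries that
fall more than a couple of standard deviations from the mean might look
something like this, using the first few ratios from the cached run above:

------------------------------------------------------------
# rough sketch: work with log(actual_time / cost) rather than the raw ratio
import math

ratios = [0.025541, 0.016583, 0.009079, 0.064258, 0.024617]  # sample ratios
logs = [math.log(r) for r in ratios]
mean = sum(logs) / len(logs)
sd = math.sqrt(sum((x - mean) ** 2 for x in logs) / len(logs))
for r, x in zip(ratios, logs):
    if abs(x - mean) > 2 * sd:
        print "possible outlier: ratio %f (log %f)" % (r, x)
------------------------------------------------------------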

-Kevin


>>> Josh Berkus <josh@agliodbs.com> 10/14/05 11:20 AM >>>
Kevin,

> It sounds as though you are more focused on picking up costing
> problems which happen during production -- which is clearly
> valuable, but addresses a somewhat different set of needs than
> I was looking at.  That said, it seems like there is potential to share
> significant code between the two techniques.  We'll have to see if
> we can work that out.

Hmmmm.  I think we're not communicating yet.  I wanted to get across two
points:

1) The "cost accuracy statistics collector" should be a *separate* tool from
the performance test.  This will allow the collector to be used with a
variety of different test frameworks, increasing its use and our general
knowledge of costing under different systems.   It will also allow us to use
pre-built tests which will save time.

2) The collector should be designed in such a way as to allow collection of
data from production databases, for two reasons:

  a) There are factors which affect cost, like concurrency, system load and
swapping, that tend not to occur on test systems.  Ignoring these factors
will make our cost model fragile and no improvement over the current code.

  b) Far more (like 10x) people in the community would be willing to run a
relatively non-intrusive tool against their production system than would be
willing to set up a full-blown performance test harness.  The more
information about the greater variety of systems and application designs we
collect, the more accurate our cost model is.  Therefore we want a tool which
can be used by as many people as possible, which means that production
systems need to be an option.

> I didn't want to broach the subject of the programming language
> for this at the early conceptual stages, but if we're talking about
> code sharing, it can't wait too long, so I'll jump in with it now.  I was
> considering using python to program the tool I was discussing.

I see nothing wrong with using Python.

> If
> python is used, I don't care whether there is any change to EXPLAIN
> ANALYZE -- it only takes a few lines of code to pull out what I need
> in the current form.

Hmmm ... I think you're collecting less data than I would consider necessary
for full analysis of complex queries.  Could you give an example?

> My concern is whether python is supported on
> all of the target platforms.

I think python support is broad enough that using it is not an inhibitor.
We're not talking Haskell, after all.

>  I think I will be significantly more
> productive at this in python than if I used C or perl, but if it's not
> accessible to the PostgreSQL community as a whole, I'll cope.
> Comments, anyone?

There are actually lots of Python people in the community, and Python is easy
to learn/read if you are experienced with C, Perl, or Ruby.  So I have no
objections.

> I need to have somewhere for the work to live, and I quite frankly
> would just as soon dodge the overhead of setting up and maintaining
> something, so if no one has objections or other suggestions, I'm
> inclined to take you up on your offer to use your testperf project.
> Does anyone think some other location would be more appropriate?

More appropriate than pgFoundry?  I can't imagine.  You'll need to register as
a user on pgFoundry, and send me your user name.

> If we get into much more detail, I assume we should take this
> off-list.

Well, once you get going we'll use the testperf-general list, which doesn't
get much traffic these days.  pgFoundry also supports bug tracking, task
management, and document sharing; you should check it out.

--Josh

--
Josh Berkus
Aglio Database Solutions
San Francisco


