Call for alpha testing: planner statistics revisions - Mailing list pgsql-general

From Tom Lane
Subject Call for alpha testing: planner statistics revisions
Date
Msg-id 11872.992810825@sss.pgh.pa.us
Whole thread Raw
Responses Re: Call for alpha testing: planner statistics revisions  (Einar Karttunen <ekarttun@cs.Helsinki.FI>)
List pgsql-general
I have finished a first pass at the planner statistics and cost
estimation changes that I want to do for 7.2.  It's now time to see
how well the new code does in the real world.  Accordingly, if you've
had problems in the past with silly choices of plans, I'd like to ask
you to load your data into a test installation of current sources and
see if the planner is any brighter than before.

I realize that loading a bunch of data into a temporary installation is
a pain in the neck, but it'd be really great to get some feedback about
performance of the new code now, while we're still early enough in the
7.2 development cycle to do something about any problems that turn up.

If you're willing to help out, you can get current sources from the
CVS server, or from the nightly snapshot tarball (see the dev/
directory on your favorite Postgres FTP mirror).

Some highlights of the new code include:

* ANALYZE is now available as a separate command; you can run it without
also doing a VACUUM.  (Of course, VACUUM ANALYZE still works.)

* On large tables, ANALYZE uses a random sample of rows rather than
examining every row, so that it should take a reasonably short time
even on very large tables.  Possible downside: inaccurate stats.
We need to find out if the sample size is large enough.

* Statistics now include the "top ten" most common values, not just
the single most common value, plus an estimate of the total number of
distinct values in a column.  This should mean that selectivity
estimates for "column = something" estimates are a lot better than
before, especially for highly skewed data distributions.

* Statistics also include (for scalar datatypes) a histogram that
gives the boundary values dividing the data into ten
roughly-equal-population bins.  This should allow much better estimation
for inequality and range queries, again especially for skewed data
distributions.  (Note that "range queries" include such things as
anchored LIKE and regexp searches, plus now inet subnet searches thanks
to Alex Pilosov.)

* The magic number "ten" mentioned above is controllable via
ALTER TABLE tab ALTER COLUMN col SET STATISTICS statstarget.
Adjusting it gives a tradeoff between estimation accuracy and
time/space taken by ANALYZE.  We need to find out if ten is a good
default or not ... it might be too high or too low.

* There's also a physical-order-correlation statistic that should help
the planner deal with clustered indices better.  Whether it's good
enough, and whether the costs are correctly interpolated using it,
remain to be seen.

For more info see my original proposal at
http://fts.postgresql.org/db/mw/msg.html?mid=112714
and followup discussion.

            regards, tom lane

pgsql-general by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: primary/secondary index, difference
Next
From: GH
Date:
Subject: Re: sequence, set value of