Thread: Increasing statistics results in worse estimates

From: Shelby Cain
I'm having a problem with ANALYZE in 8.0.2 that is
really bothering me.  I have a table that contains
address, city and state with about 7.8m rows in it.

On that table I have two non-unique indexes, on city
and zipcode respectively.  This table was loaded in a
manner such that it is essentially sorted by zipcode.
Sorting by zipcode implies that there should be a
pretty strong correlation on the city column as well.

With a statistics target of 50 on city I'm getting
good estimates of row counts for arbitrary cities
(i.e., explain select count(*) from addresses where
city = 'DALLAS' estimates 474k rows out of 500k
actual) but a poor estimate for the correlation
(0.13 according to pg_stats).  This seems to be
causing the planner to pick a table scan for "select
count(*) from test_zipc where city = 'DALLAS'"
(est. 474k rows) vs. picking an index scan for
"select count(*) from test_zipc where zipcode like
'75%'" (est. 2m rows).
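
(For reference, the row-count and correlation figures above come
straight out of pg_stats; a query along these lines shows them,
using the table name from the queries above:)

    SELECT attname, n_distinct, correlation
    FROM pg_stats
    WHERE tablename = 'test_zipc'
      AND attname IN ('city', 'zipcode');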

Increasing the statistics target on the city column
and re-analyzing the table seems to make the
correlation estimate better (it shows about 0.5),
but the row count estimates are thrown off by two
orders of magnitude in some cases.  Repeating the
above queries, I get a row estimate of 8k for
"select count(*) from test_zipc where city =
'DALLAS'" and a row estimate of 6m for "select
count(*) from test_zipc where zipcode like '75%'".
In this case the planner picked an index scan for
the city = 'X' condition, but for what I feel are
the wrong reasons, because it underestimated the row
count.  Re-analyzing the table multiple times always
shows about an 8k estimate.
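
(The adjust-and-re-analyze cycle described here looks roughly like
the following; the target value of 100 is only an illustrative
number, not the exact setting used in the thread:)

    -- Raise the per-column statistics target, then rebuild stats.
    ALTER TABLE test_zipc ALTER COLUMN city SET STATISTICS 100;
    ANALYZE test_zipc;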

Is my data set that problematic?  Has anyone seen
similar behavior?  Any suggestions on how to improve
these stats?

Regards,

Shelby Cain

Re: Increasing statistics results in worse estimates

From: Tom Lane
Shelby Cain <alyandon@yahoo.com> writes:
> Increasing the statistics target on the city column
> and re-analyzing the table seems to make the
> correlation estimate better (it shows about 0.5),
> but the row count estimates are thrown off by two
> orders of magnitude in some cases.

What did you increase it to, exactly?  Could we see the contents of
pg_stats for these two columns at both target settings?

            regards, tom lane

Re: Increasing statistics results in worse estimates

From: Tom Lane
Shelby Cain <alyandon@yahoo.com> writes:
> --- Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What did you increase it to, exactly?  Could we see the contents of
>> pg_stats for these two columns at both target settings?

> Generally, the more I increased the stats target, the
> better the correlation estimate and the worse the row estimate.

Hm.  I should have realized why correlation wouldn't be high for the
city name: given the ordering by zipcode, city name values may be
pretty well clumped, but they aren't in any kind of alphabetical
order --- and it's the overall ordering, not the clumping, that
correlation measures.
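
(A toy illustration of the distinction, with invented values:
suppose the city column is physically stored as

    DALLAS, DALLAS, HOUSTON, HOUSTON, AUSTIN, AUSTIN

The values are perfectly clumped, yet the physical order D, H, A
has little to do with the alphabetical order A, D, H, so the
correlation between physical position and sort order comes out
low.)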

However, there is something absolutely wacko about the stats collection
process here ... you've got fairly reasonable looking results for
most-common-values of city name at the lower end of the stats settings
(HOUSTON and DALLAS are the most common, sounds about right) ... but at
the higher settings the ordering of most-common entries just goes nuts.
We've got some kind of bug there.
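
(The entries in question can be inspected directly in pg_stats;
for example, using the table name from earlier in the thread:)

    SELECT most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'test_zipc'
      AND attname = 'city';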

What exactly are you changing in the different cases ---
default_statistics_target, or are you doing an ALTER TABLE on some
of the columns (if so which)?

It might be easier to debug this if you could send me the test case.
Any problem with sending just the city name and zipcode columns
of the table (offlist of course)?  COPY TO with a column list can
extract that for you.
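
(For example, something like the following, writing just those two
columns out; the file path is only illustrative:)

    COPY test_zipc (city, zipcode) TO '/tmp/city_zip.copy';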

            regards, tom lane

Re: Increasing statistics results in worse estimates

From: Shelby Cain
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm.  I should have realized why correlation wouldn't be high for the
> city name: given the ordering by zipcode, city name values may be
> pretty well clumped, but they aren't in any kind of alphabetical
> order --- and it's the overall ordering, not the clumping, that
> correlation measures.

Ah.  Localized clumping != Overall ordering.  Thanks
for the clarification.

> However, there is something absolutely wacko about the stats collection
> process here ... you've got fairly reasonable looking results for
> most-common-values of city name at the lower end of the stats settings
> (HOUSTON and DALLAS are the most common, sounds about right) ... but at
> the higher settings the ordering of most-common entries just goes nuts.
> We've got some kind of bug there.

I had noticed that as well but wasn't sure whether
MCV really meant what I thought it did.

> What exactly are you changing in the different cases ---
> default_statistics_target, or are you doing an ALTER TABLE on some
> of the columns (if so which)?

I have default_statistics_target set to 30 and am
manipulating the statistics target for city via
ALTER TABLE.
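
(The per-column targets can be double-checked against the catalog;
a sketch, assuming the table name from earlier in the thread:)

    SELECT attname, attstattarget
    FROM pg_attribute
    WHERE attrelid = 'test_zipc'::regclass
      AND attname IN ('city', 'zipcode');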

> It might be easier to debug this if you could send me the test case.
> Any problem with sending just the city name and zipcode columns
> of the table (offlist of course)?  COPY TO with a column list can
> extract that for you.

I had already removed proprietary data to try and
whittle down the number of columns I needed to
demonstrate the weirdness, so I can host a dump of
the table.  However, before I take that step I
should mention that this is the native Windows port,
so if that changes anything let me know.

Regards,

Shelby Cain

Re: Increasing statistics results in worse estimates

From: Tom Lane
Shelby Cain <alyandon@yahoo.com> writes:
> I had already removed proprietary data to try and
> whittle down the number of columns I needed to
> demonstrate the weirdness, so I can host a dump of
> the table.  However, before I take that step I
> should mention that this is the native Windows port,
> so if that changes anything let me know.

Well, if I can't reproduce the misbehavior on a Unix machine then
there'll be reason to suspect a platform-specific bug ... but it's
still a bug.

It'll be important to know the exact datatypes of the columns,
as well as the database locale and encoding you are using.
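
(One way to gather those from psql, with the table name assumed
from earlier in the thread:)

    \d test_zipc    -- column datatypes

    SHOW lc_collate;    -- database locale
    SELECT pg_encoding_to_char(encoding)
      FROM pg_database
     WHERE datname = current_database();    -- database encoding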

            regards, tom lane