Thread: Increasing statistics results in worse estimates
I'm having a problem with ANALYZE in 8.0.2 that is really bothering me. I have a table that contains address, city and state, with about 7.8M rows in it. On that table I have two non-unique indexes, on city and zipcode respectively. This table was loaded in a manner such that it is essentially sorted by zipcode. Sorting by zipcode implies that there should be a pretty strong correlation on the city column as well.

With a statistics target of 50 on city I'm getting good row count estimates for arbitrary cities (e.g. "explain select count(*) from addresses where city = 'DALLAS'" estimates 474k rows out of 500k actual) but a poor estimate for the correlation (0.13 according to pg_stats). This seems to be causing the planner to pick a table scan for "select count(*) from test_zipc where city = 'DALLAS'" (est 474k rows) vs picking an index scan for "select count(*) from test_zipc where zipcode like '75%'" (est 2M rows).

Increasing the statistics target on the city column and re-analyzing the table makes the correlation estimate better (shows about 0.5), but the row count estimates are thrown off by two orders of magnitude in some cases. Repeating the above queries, I get a row estimate of 8k for "select count(*) from test_zipc where city = 'DALLAS'" and a row estimate of 6M for "select count(*) from test_zipc where zipcode like '75%'". In this case the planner picked an index scan for the city = 'X' condition, but for what I feel are the wrong reasons, because it underestimated the row count. Re-analyzing the table multiple times always shows about an 8k estimate.

Is my data set that problematic? Has anyone seen similar behavior? Any suggestions on how to improve these stats?

Regards,

Shelby Cain

__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
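[Editor's note: for reference, the workflow being described looks roughly like this. Table and column names are taken from the queries above; the target value 200 is an arbitrary illustration, since the thread doesn't state the exact targets tried.]

```sql
-- Raise the statistics target for the city column only,
-- then recollect statistics and recheck the planner's estimate.
ALTER TABLE test_zipc ALTER COLUMN city SET STATISTICS 200;
ANALYZE test_zipc;
EXPLAIN SELECT count(*) FROM test_zipc WHERE city = 'DALLAS';
```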
Shelby Cain <alyandon@yahoo.com> writes:
> Increasing the statistics target on the city column and re-analyzing
> the table seems to make the correlation estimate better (shows about
> 0.5) but the row count estimates are thrown off by 2 orders of
> magnitude in some cases.

What did you increase it to, exactly? Could we see the contents of pg_stats for these two columns at both target settings?

			regards, tom lane
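[Editor's note: a query along these lines would pull the statistics being asked for; the table name is taken from the EXPLAIN examples above.]

```sql
-- Inspect the collected statistics for the two indexed columns.
SELECT attname, n_distinct, correlation,
       most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'test_zipc'
  AND attname IN ('city', 'zipcode');
```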
Shelby Cain <alyandon@yahoo.com> writes:
> --- Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What did you increase it to, exactly? Could we see the contents of
>> pg_stats for these two columns at both target settings?

> Generally, the more I increased the stats target the better the
> correlation estimate and the worse the row estimate.

Hm. I should have realized why correlation wouldn't be high for the city name: given the ordering by zipcode, city name values may be pretty well clumped, but they aren't in any kind of alphabetical order --- and it's the overall ordering, not the clumping, that correlation measures.

However, there is something absolutely wacko about the stats collection process here ... you've got fairly reasonable-looking results for most-common-values of city name at the lower end of the stats settings (HOUSTON and DALLAS are the most common, sounds about right) ... but at the higher settings the ordering of most-common entries just goes nuts. We've got some kind of bug there. What exactly are you changing in the different cases --- default_statistics_target, or are you doing an ALTER TABLE on some of the columns (if so, which)?

It might be easier to debug this if you could send me the test case. Any problem with sending just the city name and zipcode columns of the table (offlist of course)? COPY TO with a column list can extract that for you.

			regards, tom lane
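[Editor's note: the COPY suggestion might look like this; the output path is hypothetical.]

```sql
-- Server-side copy: the path is on the database server and
-- requires superuser privileges. File name is illustrative.
COPY test_zipc (city, zipcode) TO '/tmp/test_zipc.dat';
```

From psql, `\copy test_zipc (city, zipcode) to 'test_zipc.dat'` writes the same extract to a client-side file without needing superuser rights.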
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm. I should have realized why correlation wouldn't be high for the
> city name: given the ordering by zipcode, city name values may be
> pretty well clumped, but they aren't in any kind of alphabetical
> order --- and it's the overall ordering, not the clumping, that
> correlation measures.

Ah. Localized clumping != overall ordering. Thanks for the clarification.

> However, there is something absolutely wacko about the stats
> collection process here ... you've got fairly reasonable looking
> results for most-common-values of city name at the lower end of the
> stats settings (HOUSTON and DALLAS are the most common, sounds about
> right) ... but at the higher settings the ordering of most-common
> entries just goes nuts. We've got some kind of bug there.

I had noticed that as well but wasn't sure whether MCV really meant what I thought it did.

> What exactly are you changing in the different cases ---
> default_statistics_target, or are you doing an ALTER TABLE on some
> of the columns (if so which)?

I have a setting of 30 for default_statistics_target and I am manipulating the statistics target for city by ALTER TABLE.

> It might be easier to debug this if you could send me the test case.
> Any problem with sending just the city name and zipcode columns
> of the table (offlist of course)? COPY TO with a column list can
> extract that for you.

I had already removed proprietary data to try and whittle down the number of columns I needed to demonstrate the weirdness, so I can host a dump of the table. However, before I take that step I should mention that this is the native Windows port, so if that changes anything let me know.

Regards,

Shelby Cain
Shelby Cain <alyandon@yahoo.com> writes:
> I had already removed proprietary data to try and whittle down the
> number of columns I needed to demonstrate the weirdness so I can host
> a dump of the table. However, before I take that step I should
> mention that this is the native Windows port so if that changes
> anything let me know.

Well, if I can't reproduce the misbehavior on a Unix machine then there'll be reason to suspect a platform-specific bug ... but it's still a bug. It'll be important to know the exact datatypes of the columns, as well as the database locale and encoding you are using.

			regards, tom lane