Re: Query performance PLEASE HELP - Mailing list pgsql-general
From | Dmitry Tkach |
---|---|
Subject | Re: Query performance PLEASE HELP |
Date | |
Msg-id | 3E3AE5DF.4070408@openratings.com Whole thread Raw |
In response to | Query performance PLEASE HELP (Dmitry Tkach <dmitry@openratings.com>) |
List | pgsql-general |
Tom Lane wrote: >Dmitry Tkach <dmitry@openratings.com> writes: > >>Well... Yes. I am doing that daily. Actually, I was wonderring about >>those estimates too, but that's not my primary concern right now - >>perhaps, it should be, but, as far as I understand, the estimate only >>matter for the query plan selection, and I don't have a problem with the >>query plan >> > >You should. If the query can be done any faster, it will be by picking >a different query plan. I'm not sure what would be a better plan, but >certainly a large part of the problem is that the planner is so far off >about the rowcount estimates. > So, what would be the alternative query plan be to make it quicker? I tried forcing it to do the join the other way around (by doing set enable_sort=off; - this forces it to use tradestyle for the outer loop), but that doesn't make it any better... What other possibilities are there that would be likely to improve things? > > >One thing I'm wondering is if the index on tradestyle.name could be helpful. >How selective is "ts.name like 'POST%'", exactly --- does that eliminate >a lot of rows, or not? Is the thing able to use that as an indexqual >(ie, are you in C locale)? > Yes. There is an index on ts.name. I mentioned that in the 'PS' of the original message - if I use another parameter for the name criteria, it decides to use that index, but, as I said, it does not help very much. In that particular case I mentioned, the execution took about 5 minutes - still a lot, although a little better then the 7 I had for the first query, but I believe, this "improvement" is because of caching, not really due to a different query plan. > >To tell you the truth, I do not believe your assertion that these tables >have been analyzed. I don't see how the rowcount estimates could be so >small if the planner were aware of the true table statistics. > Well... It doesn't actually COUNT all the rows when you do the analyze, right? Those are just estimates, based on subsampling... Business names isn't something perfectly ditributed statistically, so, those estimates being off doesn't surprise me much really... > What does >pg_stats show for the columns used in the query? > select * from pg_stats where tablename in ('tradestyle', 'managed_supplier') and attname in ('name', 'duns', 'subscriber'); -[ RECORD 1 ]-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- tablename | tradestyle attname | duns null_frac | 0 avg_width | 4 n_distinct | -1 most_common_vals | most_common_freqs | histogram_bounds | {1145200,22833532,49988608,79446436,102189545,135526259,196661250,612015735,806455895,876539727,969917566} correlation | -0.18264 -[ RECORD 2 ]-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- tablename | tradestyle attname | name null_frac | 0 avg_width | 24 n_distinct | 385825 most_common_vals | {"UNITED STATES POSTAL SERVICE","G N C","H & R BLOCK","CARL'S JR","EDWARD D JONES & CO L P","FARMERS INSURANCE","UNITED STATES DEPARTMENT OF THE AIR FORCE","WOMENS INTERNATIONAL BOWL"} most_common_freqs | {0.00233333,0.001,0.001,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667} histogram_bounds | {".COM LLC","BEAU OLSON JOHN","COLONIAL LIFE & ACCIDENT INSURANCE COMPANY INC","EISENHOWER PARTNERSHIP","GROWTH & OPPORTUNITY INC","JUNCTION AUTOLAND","MC EXPORT","PANTS HANGER","SALISBURY NEWS","THE BACK STORE II LLC","ZR ENTERPRISES, INC"} correlation | 0.010456 -[ RECORD 3 ]-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- tablename | managed_supplier attname | duns null_frac | 0 avg_width | 4 n_distinct | -0.688082 most_common_vals | {1021435,1213214,1307974,2190528,2593051,3292620,17543877,27399237,41869355,49591852} most_common_freqs | {0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667} histogram_bounds | {1002229,6557045,22423490,50341502,73508350,112086095,168224033,252940564,627776784,847204914,2100000158} correlation | -0.151041 -[ RECORD 4]-----+----------------------------------------------------------------------------------------------- tablename | managed_supplier attname | subscriber null_frac | 0 avg_width | 4 n_distinct | 44 most_common_vals | {74,81,20,111,67,45,66,108,75,68} most_common_freqs | {0.181333,0.169667,0.138333,0.124,0.0396667,0.0386667,0.0376667,0.0336667,0.0333333,0.0303333} histogram_bounds | {3,63,72,78,98,107,110,112,118,124,1001} correlation | -0.912398 What does NEGATIVE n_distinct mean (for managed_supplier.duns)? :-) It lies about many things... For example - n_distinct for tradestyle.name = 385825 is about 100 times small than the actual number (which is a little over 30 million) Thanks! Dima
pgsql-general by date: