Thread: query slows down with more accurate stats
In the process of optimizing some queries, I have found that the following query seems to degrade in performance the more accurate I make the statistics on the table... whether by using increased alter table ... set statistics or by using vacuum..

SELECT count( cl.caller_id ), npanxx.city, npanxx.state
FROM cl
LEFT OUTER JOIN npanxx
  on substr( cl.caller_id, 1, 3 ) = npanxx.npa
 and substr( cl.caller_id, 4, 3 ) = npanxx.nxx
LEFT OUTER JOIN cp ON cl.caller_id = cp.caller_id
WHERE cl.ivr_system_id = 130
AND cl.call_time > '2004-03-01 06:00:00.0 CST'
AND cl.call_time < '2004-04-01 06:00:00.0 CST'
AND cp.age >= 18
AND cp.age <= 24
AND cp.gender = 'm'
GROUP BY npanxx.city, npanxx.state

live=# analyze cl;
ANALYZE
live=# select reltuples from pg_class where relname = 'cl';
 reltuples
-----------
     53580
(1 row)

live=# select count(*) from cl;
  count
---------
 1140166
(1 row)

The plan I get under these conditions is actually pretty good...

 HashAggregate  (cost=28367.22..28367.66 rows=174 width=32) (actual time=1722.060..1722.176 rows=29 loops=1)
   ->  Nested Loop  (cost=0.00..28365.92 rows=174 width=32) (actual time=518.592..1716.254 rows=558 loops=1)
         ->  Nested Loop Left Join  (cost=0.00..20837.33 rows=1248 width=32) (actual time=509.991..1286.755 rows=13739 loops=1)
               ->  Index Scan using cl_ivr_system_id on cl  (cost=0.00..13301.15 rows=1248 width=14) (actual time=509.644..767.421 rows=13739 loops=1)
                     Index Cond: (ivr_system_id = 130)
                     Filter: ((call_time > '2004-03-01 07:00:00-05'::timestamp with time zone) AND (call_time < '2004-04-01 07:00:00-05'::timestamp with time zone))
               ->  Index Scan using npanxx_pkey on npanxx  (cost=0.00..6.02 rows=1 width=32) (actual time=0.025..0.027 rows=1 loops=13739)
                     Index Cond: ((substr(("outer".caller_id)::text, 1, 3) = (npanxx.npa)::text) AND (substr(("outer".caller_id)::text, 4, 3) = (npanxx.nxx)::text))
         ->  Index Scan using cp_pkey on cp  (cost=0.00..6.02 rows=1 width=14) (actual time=0.027..0.027 rows=0 loops=13739)
               Index Cond: (("outer".caller_id)::text = (cp.caller_id)::text)
               Filter: ((age >= 18) AND (age <= 24) AND (gender = 'm'::bpchar))
 Total runtime: 1722.489 ms
(12 rows)

but when I do

live=# vacuum cl;
VACUUM
live=# select reltuples from pg_class where relname = 'cl';
  reltuples
-------------
 1.14017e+06
(1 row)

(or alternatively increase the stats target on the table) I now get the following plan:

 HashAggregate  (cost=80478.74..80482.41 rows=1471 width=32) (actual time=8132.261..8132.422 rows=29 loops=1)
   ->  Merge Join  (cost=79951.95..80467.70 rows=1471 width=32) (actual time=7794.338..8130.041 rows=558 loops=1)
         Merge Cond: ("outer"."?column4?" = "inner"."?column2?")
         ->  Sort  (cost=55719.06..55745.42 rows=10546 width=32) (actual time=4031.827..4052.526 rows=13739 loops=1)
               Sort Key: (cl.caller_id)::text
               ->  Merge Right Join  (cost=45458.30..55014.35 rows=10546 width=32) (actual time=2944.441..3796.787 rows=13739 loops=1)
                     Merge Cond: ((("outer".npa)::text = "inner"."?column2?") AND (("outer".nxx)::text = "inner"."?column3?"))
                     ->  Index Scan using npanxx_pkey on npanxx  (cost=0.00..8032.99 rows=132866 width=32) (actual time=0.200..461.767 rows=130262 loops=1)
                     ->  Sort  (cost=45458.30..45484.67 rows=10546 width=14) (actual time=2942.994..2967.935 rows=13739 loops=1)
                           Sort Key: substr((cl.caller_id)::text, 1, 3), substr((cl.caller_id)::text, 4, 3)
                           ->  Seq Scan on cl  (cost=0.00..44753.60 rows=10546 width=14) (actual time=1162.423..2619.662 rows=13739 loops=1)
                                 Filter: ((ivr_system_id = 130) AND (call_time > '2004-03-01 07:00:00-05'::timestamp with time zone) AND (call_time < '2004-04-01 07:00:00-05'::timestamp with time zone))
         ->  Sort  (cost=24232.89..24457.06 rows=89667 width=14) (actual time=3761.703..3900.340 rows=98010 loops=1)
               Sort Key: (cp.caller_id)::text
               ->  Seq Scan on cp  (cost=0.00..15979.91 rows=89667 width=14) (actual time=0.128..1772.215 rows=100302 loops=1)
                     Filter: ((age >= 18) AND (age <= 24) AND (gender = 'm'::bpchar))
 Total runtime: 8138.607 ms
(17 rows)

So I guess I am wondering if there is something I should be doing to help get the better plan at the more accurate stats levels, and/or why it doesn't stick with the original plan (I noticed disabling merge joins does seem to push it back to the original plan).  Alternatively, if anyone has any general suggestions on speeding up the query, I'd be open to that too :-)

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
Robert Treat <xzilla@users.sourceforge.net> writes:
> live=# analyze cl;
> ANALYZE
> live=# select reltuples from pg_class where relname = 'cl';
>  reltuples
> -----------
>      53580
> (1 row)

> live=# vacuum cl;
> VACUUM
> live=# select reltuples from pg_class where relname = 'cl';
>   reltuples
> -------------
>  1.14017e+06
> (1 row)

Well, the first problem is why is ANALYZE's estimate of the total row count so bad :-( ?  I suspect you are running into the situation where the initial pages of the table are thinly populated and ANALYZE mistakenly assumes the rest are too.  Manfred is working on a revised sampling method for ANALYZE that should fix this problem in 7.5 and beyond, but for now it seems like a VACUUM FULL might be in order.

> so i guess i am wondering if there is something I should be doing to
> help get the better plan at the more accurate stats levels and/or why it
> doesn't stick with the original plan (I noticed disabling merge joins
> does seem to push it back to the original plan).

With the larger number of estimated rows it's figuring the nestloop will be too expensive.  The row estimate for the cl scan went up from 1248 to 10546, so the estimated cost for the nestloop plan would go to about 240000 units vs 80000 for the mergejoin plan.  This is obviously off rather badly when the true runtimes are 1.7 vs 8.1 seconds :-(.

I think this is an example of a case where we really need better estimation of nestloop costs --- it's drastically overestimating the relative cost of the nestloop because it's not accounting for the cache benefits of the repeated index searches.  You could probably force the nestloop to be chosen by lowering random_page_cost, but that's just a kluge solution ... the real problem is the model is wrong.

I have a to-do item to work on this, and will try to bump up its priority a bit.

			regards, tom lane
[Just a quick note here; a more thorough discussion of my test results will be posted to -hackers]

On Tue, 13 Apr 2004 15:18:42 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Well, the first problem is why is ANALYZE's estimate of the total row
>count so bad :-( ?  I suspect you are running into the situation where
>the initial pages of the table are thinly populated and ANALYZE
>mistakenly assumes the rest are too.  Manfred is working on a revised
>sampling method for ANALYZE that should fix this problem

The new method looks very promising with respect to row count estimation: I got estimation errors of +/- 1% where the old method was off by up to 60%.  (My test methods might be a bit biased though :-))

My biggest concern at the moment is that the new sampling method violates the contract of returning each possible sample with the same probability: getting several tuples from the same page is more likely than with the old method.

Servus
 Manfred
Manfred Koizar <mkoi-pg@aon.at> writes:
> My biggest concern at the moment is that the new sampling method
> violates the contract of returning each possible sample with the same
> probability: getting several tuples from the same page is more likely
> than with the old method.

Hm, are you sure?  I recall objecting to your original proposal because I thought that would happen, but after further thought it seemed not.

Also, I'm not at all sure that the old method satisfies that constraint completely in the presence of nonuniform numbers of tuples per page, so we'd not necessarily be going backwards anyhow ...

			regards, tom lane
On Thu, 15 Apr 2004 20:18:49 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> getting several tuples from the same page is more likely
>> than with the old method.
>
>Hm, are you sure?

Almost sure.  Let's look at a corner case: what is the probability of getting a sample with no two tuples from the same page?  To simplify the problem, assume that each page contains the same number of tuples c.  If the number of pages is B and the sample size is n, a perfect sampling method collects a sample where all tuples come from different pages with probability (in OpenOffice.org syntax):

	p = prod from{i = 0} to{n - 1} {{c(B - i)} over {cB - i}}

or in C:

	p = 1.0;
	for (i = 0; i < n; ++i)
		p *= c * (B - i) / (c*B - i);

This probability grows with increasing B.

>Also, I'm not at all sure that the old method satisfies that constraint
>completely in the presence of nonuniform numbers of tuples per page,
>so we'd not necessarily be going backwards anyhow ...

Yes, it boils down to a decision whether we want to replace one not quite perfect sampling method with another not quite perfect method.  I'm still working on putting together the pros and cons ...

Servus
 Manfred
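[For a feel for how fast this probability changes with B, the following small, self-contained C sketch (an added illustration, not part of the original thread) evaluates the product above in log space; c = 100 and n = 3000 are arbitrary example values.]

	#include <stdio.h>
	#include <math.h>

	/* log of p = prod_{i=0}^{n-1} c(B - i) / (cB - i), the probability
	 * that a perfect random sample of n tuples from a table of B pages
	 * (c tuples per page) has all its tuples on distinct pages.  Working
	 * in log space avoids underflow for small B, where p is tiny.
	 */
	static double log_p_distinct(double c, double B, double n)
	{
	    double lp = 0.0;
	    double i;

	    for (i = 0.0; i < n; i += 1.0)
	        lp += log(c * (B - i)) - log(c * B - i);
	    return lp;
	}

	int main(void)
	{
	    double c = 100.0, n = 3000.0;
	    double B[] = {3000.0, 6000.0, 9000.0, 30000.0, 100000.0, 1000000.0};
	    int k;

	    for (k = 0; k < 6; k++)
	        printf("B = %8.0f   log10(p) = %12.2f\n",
	               B[k], log_p_distinct(c, B[k], n) / log(10.0));
	    return 0;
	}

[log10(p) climbs steadily toward zero as B grows, which is the behaviour claimed above: the more pages the table has, the more likely a truly random sample is to touch n distinct pages.]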
Manfred Koizar <mkoi-pg@aon.at> writes:
> If the number of pages is B and the sample size is n, a perfect sampling
> method collects a sample where all tuples come from different pages with
> probability (in OpenOffice.org syntax):
> p = prod from{i = 0} to{n - 1} {{c(B - i)} over {cB - i}}

So?  You haven't proven that either sampling method fails to do the same.

The desired property can also be phrased as "every tuple should be equally likely to be included in the final sample".  What we actually have in the case of your revised algorithm is "every page is equally likely to be sampled, and of the pages included in the sample, every tuple is equally likely to be chosen".  Given that there are B total pages of which we sample b pages that happen to contain T tuples (in any distribution), the probability that a particular tuple gets chosen is

	(b/B) * (n/T)

assuming that the two selection steps are independent and unbiased.  Now b, B, and n are not dependent on which tuple we are talking about.

You could argue that a tuple on a heavily populated page is statistically likely to see a higher T when it's part of the page sample pool than a tuple on a near-empty page is likely to see, and therefore there is some bias against selection of the former tuple.  But given a sample over a reasonably large number of pages, the contribution of any one page to T should be fairly small and so this effect ought to be small.  In fact, because T directly determines our estimate of the total number of tuples in the relation, your experiments showing that the new method gives a reliable tuple count estimate directly prove that T is pretty stable regardless of exactly which pages get included in the sample.  So I think this method is effectively unbiased at the tuple level.  The variation in probability of selection of individual tuples can be no worse than the variation in the overall tuple count estimate.

			regards, tom lane
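[To put a rough number on the per-tuple bias being discussed, here is a Monte Carlo sketch in C (again an added illustration, not the actual ANALYZE code; the page layout, b = 100 and n = 3000 are made-up parameters).  It builds a synthetic table with alternating thin and dense pages, runs the two-stage scheme many times, and reports how often one probe tuple on a thin page and one on a dense page land in the sample, compared with the uniform target n/N.]

	#include <stdio.h>
	#include <stdlib.h>

	#define NPAGES   1000              /* B: pages in the synthetic table  */
	#define PSAMPLE   100              /* b: pages drawn in stage one      */
	#define TSAMPLE  3000              /* n: tuples drawn in stage two     */
	#define TRIALS  20000

	int main(void)
	{
	    int  tuples_on[NPAGES];        /* tuples stored on each page       */
	    int  page_ids[NPAGES];
	    int  pool[PSAMPLE * 100];      /* big enough even if every sampled
	                                      page is a dense (100-tuple) page */
	    long total_tuples = 0;
	    long hits_thin = 0, hits_dense = 0;
	    int  t, i, j, k, tmp;

	    srand(54321);
	    for (i = 0; i < NPAGES; i++)
	    {
	        tuples_on[i] = (i % 2 == 0) ? 20 : 100;  /* thin / dense pages */
	        total_tuples += tuples_on[i];
	    }

	    for (t = 0; t < TRIALS; t++)
	    {
	        int pool_size = 0;

	        /* stage one: pick PSAMPLE distinct pages (partial Fisher-Yates;
	           the slight modulo bias of rand() is ignored here) */
	        for (i = 0; i < NPAGES; i++)
	            page_ids[i] = i;
	        for (i = 0; i < PSAMPLE; i++)
	        {
	            j = i + rand() % (NPAGES - i);
	            tmp = page_ids[i]; page_ids[i] = page_ids[j]; page_ids[j] = tmp;

	            /* pour the page's tuples into the pool; mark only slot 0 of
	               each page so the two probe tuples can be recognized below */
	            for (k = 0; k < tuples_on[page_ids[i]]; k++)
	                pool[pool_size++] = (k == 0) ? page_ids[i] : -1;
	        }

	        /* stage two: pick TSAMPLE distinct tuples from the pooled pages */
	        for (i = 0; i < TSAMPLE && i < pool_size; i++)
	        {
	            j = i + rand() % (pool_size - i);
	            tmp = pool[i]; pool[i] = pool[j]; pool[j] = tmp;
	            if (pool[i] == 0)          /* slot 0 of page 0: thin probe  */
	                hits_thin++;
	            else if (pool[i] == 1)     /* slot 0 of page 1: dense probe */
	                hits_dense++;
	        }
	    }

	    printf("uniform target n/N    : %.5f\n", (double) TSAMPLE / total_tuples);
	    printf("thin-page probe tuple : %.5f\n", (double) hits_thin / TRIALS);
	    printf("dense-page probe tuple: %.5f\n", (double) hits_dense / TRIALS);
	    return 0;
	}

[With these parameters both probe frequencies come out close to n/N = 0.05, and the residual bias against the dense-page tuple is smaller than the simulation noise, in line with the argument above that the effect is small.]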
On Tue, 2004-04-13 at 15:18, Tom Lane wrote:
> Robert Treat <xzilla@users.sourceforge.net> writes:
> Well, the first problem is why is ANALYZE's estimate of the total row
> count so bad :-( ?  I suspect you are running into the situation where
> the initial pages of the table are thinly populated and ANALYZE
> mistakenly assumes the rest are too.

That was my thinking, which is somewhat confirmed after a vacuum full on the table; now analyze gives pretty accurate stats.  Of course the downside is that now the query is consistently slower.

> > so i guess i am wondering if there is something I should be doing to
> > help get the better plan at the more accurate stats levels and/or why it
> > doesn't stick with the original plan (I noticed disabling merge joins
> > does seem to push it back to the original plan).
>
> With the larger number of estimated rows it's figuring the nestloop will
> be too expensive.  The row estimate for the cl scan went up from 1248
> to 10546, so the estimated cost for the nestloop plan would go to about
> 240000 units vs 80000 for the mergejoin plan.  This is obviously off
> rather badly when the true runtimes are 1.7 vs 8.1 seconds :-(.
>
> I think this is an example of a case where we really need better
> estimation of nestloop costs --- it's drastically overestimating the
> relative cost of the nestloop because it's not accounting for the cache
> benefits of the repeated index searches.  You could probably force the
> nestloop to be chosen by lowering random_page_cost, but that's just a
> kluge solution ... the real problem is the model is wrong.

Unfortunately playing with random_page_cost doesn't seem to be enough to get it to favor the nested loop... though setting it down to 2 does help overall.  I played with cpu_index_tuple_cost a bit but that seemed even less useful.  It's aggravating when you know there is a better plan it could pick but there's no (clean) way to get it to do so...

> I have a to-do item to work on this, and will try to bump up its
> priority a bit.

I'll keep an eye out, thanks Tom.

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
On Fri, 16 Apr 2004 10:34:49 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> p = prod from{i = 0} to{n - 1} {{c(B - i)} over {cB - i}}
>
>So?  You haven't proven that either sampling method fails to do the
>same.

On the contrary, I believe that the above formula is more or less valid for both methods.  The point is in what I said next:

| This probability grows with increasing B.

For the one-stage sampling method B is the number of pages of the whole table.  With two-stage sampling we have to use n instead of B and get a smaller probability (for n < B, of course).  So this merely shows that the two sampling methods are not equivalent.

>The desired property can also be phrased as "every tuple should be
>equally likely to be included in the final sample".

Only at first sight.  You really expect more from random sampling.  Otherwise I'd just put one random tuple and its n - 1 successors (modulo N) into the sample.  This satisfies your condition, but you wouldn't call it a random sample.

Random sampling is more like "every possible sample is equally likely to be collected", and two-stage sampling doesn't satisfy this condition.

But if in your opinion the difference is not significant, I'll stop complaining against my own idea.  Is there anybody else who cares?

>You could argue that a tuple on a heavily populated page is
>statistically likely to see a higher T when it's part of the page sample
>pool than a tuple on a near-empty page is likely to see, and therefore
>there is some bias against selection of the former tuple.  But given a
>sample over a reasonably large number of pages, the contribution of any
>one page to T should be fairly small and so this effect ought to be
>small.

It is even better: storing a certain number of tuples on heavily populated pages takes fewer pages than storing them on sparsely populated pages (due to tuple size or to dead tuples).  So heavily populated pages are less likely to be selected in stage one, and this exactly offsets the effect of increasing T.

>So I think this method is effectively unbiased at the tuple level.

Servus
 Manfred
Manfred Koizar <mkoi-pg@aon.at> writes:
> Random sampling is more like "every possible sample is equally likely to
> be collected", and two-stage sampling doesn't satisfy this condition.

Okay, I finally see the point here: in the limit as the number of pages B goes to infinity, you'd expect the probability that each tuple in your sample came from a different page to go to 1.  But this doesn't happen in the two-stage sampling method: the probability doesn't increase beyond the value it would have for B = n.  On the average each sample page would supply one tuple, but the odds that this holds *exactly* would be pretty low.

However the existing sampling method has glaring flaws of its own, in particular having to do with the fact that a tuple whose slot is preceded by N empty slots is N times more likely to be picked than one that has no empty-slot predecessors.  The fact that the two-stage method artificially constrains the sample to come from only n pages seems like a minor problem by comparison; I'd happily accept it to get rid of the empty-slot bias.

A possible compromise is to limit the number of pages sampled to something a bit larger than n, perhaps 2n or 3n.  I don't have a feeling for the shape of the different-pages probability function; would this make a significant difference, or would it just waste cycles?

			regards, tom lane
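[The empty-slot bias can be illustrated with a toy C sketch (a simplified stand-in added here for illustration, not the actual sampling code): assume a sampler that picks a uniformly random slot number and then takes the first live tuple at or after it, wrapping around at the end of the array.]

	#include <stdio.h>
	#include <stdlib.h>

	#define NSLOTS 10
	#define DRAWS  1000000

	int main(void)
	{
	    /* 1 = live tuple, 0 = empty slot; the tuple in slot 5 sits behind
	       four empty slots, every other live tuple has a live predecessor */
	    int  live[NSLOTS]  = {1, 0, 0, 0, 0, 1, 1, 1, 1, 1};
	    long picks[NSLOTS] = {0};
	    int  d, s;

	    srand(7);
	    for (d = 0; d < DRAWS; d++)
	    {
	        /* pick a random slot, then advance (wrapping) to a live tuple */
	        s = rand() % NSLOTS;
	        while (!live[s])
	            s = (s + 1) % NSLOTS;
	        picks[s]++;
	    }

	    for (s = 0; s < NSLOTS; s++)
	        if (live[s])
	            printf("slot %d picked %5.2f%% of the time\n",
	                   s, 100.0 * picks[s] / DRAWS);
	    return 0;
	}

[In this toy layout the tuple in slot 5, which follows four empty slots, is returned about five times as often as any other live tuple, which is the kind of skew described above.]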
Number of pages in a random sample (was: query slows down with more accurate stats)
From: Manfred Koizar
On Mon, 19 Apr 2004 12:00:10 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>A possible compromise is to limit the number of pages sampled to
>something a bit larger than n, perhaps 2n or 3n.  I don't have a feeling
>for the shape of the different-pages probability function; would this
>make a significant difference, or would it just waste cycles?

I would have replied earlier, if I had a good answer.  What I have so far contains at least one, probably two flaws.  Knowing not much more than the four basic arithmetic operations I was not able to improve my model.  So I post what I have:

As usual we assume a constant number c of tuples per page.  If we have a table of size B pages and want to collect a sample of n tuples, the number of possible samples is (again in OOo syntax)

	left( binom{cB}{n} right)

If we select an arbitrary page, the number of possible samples that do NOT contain any tuple from this page is

	left( binom {c (B-1)} {n} right)

Let's forget about our actual implementations of sampling methods and pretend we have a perfect random sampling method.  So the probability Pnot(c, B, n) that a certain page is not represented in a random sample is

	left( binom {c (B-1)} {n} right) over left( binom{cB}{n} right)

which can be transformed into the more computing-friendly form

	prod from{i=0} to{n-1} {{cB-c - i} over {cB - i}}

Clearly the probability that a certain page *is* represented in a sample is

	Pyes(c, B, n) = 1 - Pnot(c, B, n)

The next step assumes that these probabilities are independent for different pages, which in reality they are not.  We simply estimate the number of pages represented in a random sample as

	numPag(c, B, n) = B * Pyes(c, B, n)

Here are some results for n = 3000:

	 B \ c |    10 |   100 |   200
	-------+-------+-------+-------
	   100 |   --- |   100 |   100
	  1000 |   972 |   953 |   951
	  2000 |  1606 |  1559 |  1556
	  3000 |  1954 |  1902 |  1899
	  6000 |  2408 |  2366 |  2363
	  9000 |  2588 |  2555 |  2553
	 20000 |  2805 |  2788 |  2787
	 30000 |  2869 |  2856 |  2856
	100000 |  2960 |  2956 |  2956

This doesn't look to depend heavily on the number of tuples per page, which sort of justifies the assumption that c is constant.

In the next step I tried to estimate the number of pages that contain exactly 1, 2, ... tuples of the sample.  My naive procedure works as follows (I'm not sure whether it is even valid as a rough approximation; constructive criticism is very welcome):

For c=100, B=3000, n=3000 we expect 1902 pages to contain at least 1 tuple of the sample.  There are 1098 more tuples than pages; these tuples lie somewhere in those 1902 pages from the first step.  numPag(99, 1902, 1098) = 836 pages contain at least a second tuple.  So the number of pages containing exactly 1 tuple is 1902 - 836 = 1066.  Repeating these steps we get 611 pages with 2 tuples, 192 with 3, 30 with 4, and 3 pages with 5 tuples.

Here are some more numbers for c = 100 and n = 3000:

	     B | pages with 1, 2, ... tuples
	-------+--------------------------------------------------------
	   100 | 1 to 24 tuples: 0, then 1, 2, 4, 10, 18, 26, 24, 11, 4
	  1000 | 108, 201, 268, 229, 113, 29, 5
	  2000 | 616, 555, 292, 83, 12, 1
	  3000 | 1066, 611, 192, 30, 3
	  6000 | 1809, 484, 68, 5
	  9000 | 2146, 374, 32, 2
	 20000 | 2584, 196, 8
	 30000 | 2716, 138, 3
	100000 | 2912, 44

A small C program to experimentally confirm or refute these calculations is attached.  Its results are fairly compatible with the above numbers, IMHO.

Servus
 Manfred
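[The attached program is not reproduced in the archive, but the calculation described above is easy to re-derive.  The following standalone C sketch (a reconstruction of the described formulas, not the attached program) computes numPag(c, B, n) for the c = 100 column of the first table and then iterates the "exactly 1, 2, ... tuples" estimate for B = 3000; small differences from the quoted figures can come from rounding.]

	#include <stdio.h>
	#include <math.h>

	/* Pnot(c, B, n) = prod_{i=0}^{n-1} (cB - c - i) / (cB - i): probability
	 * that one given page contributes nothing to a perfect random sample
	 * of n tuples taken from B pages of c tuples each.
	 */
	static double pnot(double c, double B, double n)
	{
	    double p = 1.0;
	    double i;

	    for (i = 0.0; i < n; i += 1.0)
	        p *= (c * B - c - i) / (c * B - i);
	    return p;
	}

	/* numPag(c, B, n) = B * Pyes(c, B, n): estimated number of pages that
	 * are represented in the sample at all.
	 */
	static double numpag(double c, double B, double n)
	{
	    return B * (1.0 - pnot(c, B, n));
	}

	int main(void)
	{
	    double Bs[] = {1000, 2000, 3000, 6000, 9000, 20000, 30000, 100000};
	    double c = 100.0, n = 3000.0;
	    double pages, tuples;
	    int    k, level;

	    /* the c = 100 column of the first table */
	    for (k = 0; k < 8; k++)
	        printf("B = %6.0f   numPag = %7.1f\n", Bs[k], numpag(c, Bs[k], n));

	    /* the "pages with exactly 1, 2, ... tuples" estimate for B = 3000,
	     * rounding to whole pages at each level: at every step the pages
	     * holding at least one more sample tuple are estimated by applying
	     * numPag to the leftover tuples, with c reduced by one each time.
	     */
	    pages  = floor(numpag(c, 3000.0, n) + 0.5);   /* pages with >= 1 tuple */
	    tuples = n - pages;                           /* tuples beyond the 1st */
	    for (level = 1; tuples > 0.5; level++)
	    {
	        double deeper = floor(numpag(c - level, pages, tuples) + 0.5);

	        printf("pages with exactly %d tuple(s): %.0f\n",
	               level, pages - deeper);
	        tuples -= deeper;
	        pages = deeper;
	    }
	    printf("pages with exactly %d tuple(s): %.0f\n", level, pages);
	    return 0;
	}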
On Mon, 26 Apr 2004 08:08:16 -0700, Sailesh Krishnamurthy <sailesh@cs.berkeley.edu> wrote:
> "A Bi-Level Bernoulli Scheme for Database Sampling"
> Peter Haas, Christian Koenig (SIGMOD 2004)

Does this apply to our problem?  AFAIK with Bernoulli sampling you don't know the sample size in advance.

Anyway, thanks for the hint.  Unfortunately I couldn't find the document.  Do you have a link?

Servus
 Manfred