Thread: query slows down with more accurate stats
In the process of optimizing some queries, I have found that the following query seems to degrade in performance the more accurate I make the statistics on the table... whether by using increased alter table ... set statistics or by using vacuum..

SELECT count( cl.caller_id ), npanxx.city, npanxx.state
FROM cl
LEFT OUTER JOIN npanxx
  on substr( cl.caller_id, 1, 3 ) = npanxx.npa
 and substr( cl.caller_id, 4, 3 ) = npanxx.nxx
LEFT OUTER JOIN cp ON cl.caller_id = cp.caller_id
WHERE cl.ivr_system_id = 130
AND cl.call_time > '2004-03-01 06:00:00.0 CST'
AND cl.call_time < '2004-04-01 06:00:00.0 CST'
AND cp.age >= 18
AND cp.age <= 24
AND cp.gender = 'm'
GROUP BY npanxx.city, npanxx.state

live=# analyze cl;
ANALYZE
live=# select reltuples from pg_class where relname = 'cl';
 reltuples
-----------
     53580
(1 row)

live=# select count(*) from cl;
  count
---------
 1140166
(1 row)

The plan I get under these conditions is actually pretty good...

 HashAggregate  (cost=28367.22..28367.66 rows=174 width=32) (actual time=1722.060..1722.176 rows=29 loops=1)
   ->  Nested Loop  (cost=0.00..28365.92 rows=174 width=32) (actual time=518.592..1716.254 rows=558 loops=1)
         ->  Nested Loop Left Join  (cost=0.00..20837.33 rows=1248 width=32) (actual time=509.991..1286.755 rows=13739 loops=1)
               ->  Index Scan using cl_ivr_system_id on cl  (cost=0.00..13301.15 rows=1248 width=14) (actual time=509.644..767.421 rows=13739 loops=1)
                     Index Cond: (ivr_system_id = 130)
                     Filter: ((call_time > '2004-03-01 07:00:00-05'::timestamp with time zone) AND (call_time < '2004-04-01 07:00:00-05'::timestamp with time zone))
               ->  Index Scan using npanxx_pkey on npanxx  (cost=0.00..6.02 rows=1 width=32) (actual time=0.025..0.027 rows=1 loops=13739)
                     Index Cond: ((substr(("outer".caller_id)::text, 1, 3) = (npanxx.npa)::text) AND (substr(("outer".caller_id)::text, 4, 3) = (npanxx.nxx)::text))
         ->  Index Scan using cp_pkey on cp  (cost=0.00..6.02 rows=1 width=14) (actual time=0.027..0.027 rows=0 loops=13739)
               Index Cond: (("outer".caller_id)::text = (cp.caller_id)::text)
               Filter: ((age >= 18) AND (age <= 24) AND (gender = 'm'::bpchar))
 Total runtime: 1722.489 ms
(12 rows)

but when I do

live=# vacuum cl;
VACUUM
live=# select reltuples from pg_class where relname = 'cl';
  reltuples
-------------
 1.14017e+06
(1 row)

(or alternatively increase the stats target on the table) I now get the following plan:

 HashAggregate  (cost=80478.74..80482.41 rows=1471 width=32) (actual time=8132.261..8132.422 rows=29 loops=1)
   ->  Merge Join  (cost=79951.95..80467.70 rows=1471 width=32) (actual time=7794.338..8130.041 rows=558 loops=1)
         Merge Cond: ("outer"."?column4?" = "inner"."?column2?")
         ->  Sort  (cost=55719.06..55745.42 rows=10546 width=32) (actual time=4031.827..4052.526 rows=13739 loops=1)
               Sort Key: (cl.caller_id)::text
               ->  Merge Right Join  (cost=45458.30..55014.35 rows=10546 width=32) (actual time=2944.441..3796.787 rows=13739 loops=1)
                     Merge Cond: ((("outer".npa)::text = "inner"."?column2?") AND (("outer".nxx)::text = "inner"."?column3?"))
                     ->  Index Scan using npanxx_pkey on npanxx  (cost=0.00..8032.99 rows=132866 width=32) (actual time=0.200..461.767 rows=130262 loops=1)
                     ->  Sort  (cost=45458.30..45484.67 rows=10546 width=14) (actual time=2942.994..2967.935 rows=13739 loops=1)
                           Sort Key: substr((cl.caller_id)::text, 1, 3), substr((cl.caller_id)::text, 4, 3)
                           ->  Seq Scan on cl  (cost=0.00..44753.60 rows=10546 width=14) (actual time=1162.423..2619.662 rows=13739 loops=1)
                                 Filter: ((ivr_system_id = 130) AND (call_time > '2004-03-01 07:00:00-05'::timestamp with time zone) AND (call_time < '2004-04-01 07:00:00-05'::timestamp with time zone))
         ->  Sort  (cost=24232.89..24457.06 rows=89667 width=14) (actual time=3761.703..3900.340 rows=98010 loops=1)
               Sort Key: (cp.caller_id)::text
               ->  Seq Scan on cp  (cost=0.00..15979.91 rows=89667 width=14) (actual time=0.128..1772.215 rows=100302 loops=1)
                     Filter: ((age >= 18) AND (age <= 24) AND (gender = 'm'::bpchar))
 Total runtime: 8138.607 ms
(17 rows)

So I guess I am wondering if there is something I should be doing to help get the better plan at the more accurate stats levels, and/or why it doesn't stick with the original plan (I noticed disabling merge joins does seem to push it back to the original plan).  Alternatively, if anyone has any general suggestions on speeding up the query, I'd be open to that too :-)

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
Robert Treat <xzilla@users.sourceforge.net> writes:
> live=# analyze cl;
> ANALYZE
> live=# select reltuples from pg_class where relname = 'cl';
>  reltuples
> -----------
>      53580
> (1 row)

> live=# vacuum cl;
> VACUUM
> live=# select reltuples from pg_class where relname = 'cl';
>   reltuples
> -------------
>  1.14017e+06
> (1 row)

Well, the first problem is why is ANALYZE's estimate of the total row count so bad :-( ?  I suspect you are running into the situation where the initial pages of the table are thinly populated and ANALYZE mistakenly assumes the rest are too.  Manfred is working on a revised sampling method for ANALYZE that should fix this problem in 7.5 and beyond, but for now it seems like a VACUUM FULL might be in order.

> so i guess i am wondering if there is something I should be doing to
> help get the better plan at the more accurate stats levels and/or why it
> doesn't stick with the original plan (I noticed disabling merge joins
> does seem to push it back to the original plan).

With the larger number of estimated rows it's figuring the nestloop will be too expensive.  The row estimate for the cl scan went up from 1248 to 10546, so the estimated cost for the nestloop plan would go to about 240000 units vs 80000 for the mergejoin plan.  This is obviously off rather badly when the true runtimes are 1.7 vs 8.1 seconds :-(.

I think this is an example of a case where we really need better estimation of nestloop costs --- it's drastically overestimating the relative cost of the nestloop because it's not accounting for the cache benefits of the repeated index searches.  You could probably force the nestloop to be chosen by lowering random_page_cost, but that's just a kluge solution ... the real problem is the model is wrong.

I have a to-do item to work on this, and will try to bump up its priority a bit.

			regards, tom lane
[Just a quick note here; a more thorough discussion of my test results will be posted to -hackers]

On Tue, 13 Apr 2004 15:18:42 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Well, the first problem is why is ANALYZE's estimate of the total row
>count so bad :-( ?  I suspect you are running into the situation where
>the initial pages of the table are thinly populated and ANALYZE
>mistakenly assumes the rest are too.  Manfred is working on a revised
>sampling method for ANALYZE that should fix this problem

The new method looks very promising with respect to row count estimation: I got estimation errors of +/- 1% where the old method was off by up to 60%.  (My test methods might be a bit biased though :-))

My biggest concern at the moment is that the new sampling method violates the contract of returning each possible sample with the same probability: getting several tuples from the same page is more likely than with the old method.

Servus
 Manfred
Manfred Koizar <mkoi-pg@aon.at> writes:
> My biggest concern at the moment is that the new sampling method
> violates the contract of returning each possible sample with the same
> probability: getting several tuples from the same page is more likely
> than with the old method.

Hm, are you sure?  I recall objecting to your original proposal because I thought that would happen, but after further thought it seemed not.

Also, I'm not at all sure that the old method satisfies that constraint completely in the presence of nonuniform numbers of tuples per page, so we'd not necessarily be going backwards anyhow ...

			regards, tom lane
On Thu, 15 Apr 2004 20:18:49 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> getting several tuples from the same page is more likely
>> than with the old method.
>
>Hm, are you sure?

Almost sure.  Let's look at a corner case: what is the probability of getting a sample with no two tuples from the same page?  To simplify the problem, assume that each page contains the same number of tuples c.  If the number of pages is B and the sample size is n, a perfect sampling method collects a sample where all tuples come from different pages with probability (in OpenOffice.org syntax):

	p = prod from{i = 0} to{n - 1} {{c(B - i)} over {cB - i}}

or in C:

	p = 1.0;
	for (i = 0; i < n; ++i)
		p *= c * (B - i) / (c*B - i);

This probability grows with increasing B.

>Also, I'm not at all sure that the old method satisfies that constraint
>completely in the presence of nonuniform numbers of tuples per page,
>so we'd not necessarily be going backwards anyhow ...

Yes, it boils down to a decision whether we want to replace one not quite perfect sampling method with another not quite perfect method.  I'm still working on putting together the pros and cons ...

Servus
 Manfred
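[For a feel for how fast this probability changes with B, the following small, self-contained C sketch (an added illustration, not part of the original thread) evaluates the product above in log space; c = 100 and n = 3000 are arbitrary example values.]

	#include <stdio.h>
	#include <math.h>

	/* log of p = prod_{i=0}^{n-1} c(B - i) / (cB - i), the probability
	 * that a perfect random sample of n tuples from a table of B pages
	 * (c tuples per page) has all its tuples on distinct pages.  Working
	 * in log space avoids underflow for small B, where p is tiny.
	 */
	static double log_p_distinct(double c, double B, double n)
	{
	    double lp = 0.0;
	    double i;

	    for (i = 0.0; i < n; i += 1.0)
	        lp += log(c * (B - i)) - log(c * B - i);
	    return lp;
	}

	int main(void)
	{
	    double c = 100.0, n = 3000.0;
	    double B[] = {3000.0, 6000.0, 9000.0, 30000.0, 100000.0, 1000000.0};
	    int k;

	    for (k = 0; k < 6; k++)
	        printf("B = %8.0f   log10(p) = %12.2f\n",
	               B[k], log_p_distinct(c, B[k], n) / log(10.0));
	    return 0;
	}

[log10(p) climbs steadily toward zero as B grows, which is the behaviour claimed above: the more pages the table has, the more likely a truly random sample is to touch n distinct pages.]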
Manfred Koizar <mkoi-pg@aon.at> writes:
> If the number of pages is B and the sample size is n, a perfect sampling
> method collects a sample where all tuples come from different pages with
> probability (in OpenOffice.org syntax):
> p = prod from{i = 0} to{n - 1} {{c(B - i)} over {cB - i}}

So?  You haven't proven that either sampling method fails to do the same.

The desired property can also be phrased as "every tuple should be equally likely to be included in the final sample".  What we actually have in the case of your revised algorithm is "every page is equally likely to be sampled, and of the pages included in the sample, every tuple is equally likely to be chosen".  Given that there are B total pages of which we sample b pages that happen to contain T tuples (in any distribution), the probability that a particular tuple gets chosen is

	(b/B) * (n/T)

assuming that the two selection steps are independent and unbiased.  Now b, B, and n are not dependent on which tuple we are talking about.

You could argue that a tuple on a heavily populated page is statistically likely to see a higher T when it's part of the page sample pool than a tuple on a near-empty page is likely to see, and therefore there is some bias against selection of the former tuple.  But given a sample over a reasonably large number of pages, the contribution of any one page to T should be fairly small and so this effect ought to be small.  In fact, because T directly determines our estimate of the total number of tuples in the relation, your experiments showing that the new method gives a reliable tuple count estimate directly prove that T is pretty stable regardless of exactly which pages get included in the sample.  So I think this method is effectively unbiased at the tuple level.  The variation in probability of selection of individual tuples can be no worse than the variation in the overall tuple count estimate.

			regards, tom lane
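[To put a rough number on the per-tuple bias being discussed, here is a Monte Carlo sketch in C (again an added illustration, not the actual ANALYZE code; the page layout, b = 100 and n = 3000 are made-up parameters).  It builds a synthetic table with alternating thin and dense pages, runs the two-stage scheme many times, and reports how often one probe tuple on a thin page and one on a dense page land in the sample, compared with the uniform target n/N.]

	#include <stdio.h>
	#include <stdlib.h>

	#define NPAGES   1000              /* B: pages in the synthetic table  */
	#define PSAMPLE   100              /* b: pages drawn in stage one      */
	#define TSAMPLE  3000              /* n: tuples drawn in stage two     */
	#define TRIALS  20000

	int main(void)
	{
	    int  tuples_on[NPAGES];        /* tuples stored on each page       */
	    int  page_ids[NPAGES];
	    int  pool[PSAMPLE * 100];      /* big enough even if every sampled
	                                      page is a dense (100-tuple) page */
	    long total_tuples = 0;
	    long hits_thin = 0, hits_dense = 0;
	    int  t, i, j, k, tmp;

	    srand(54321);
	    for (i = 0; i < NPAGES; i++)
	    {
	        tuples_on[i] = (i % 2 == 0) ? 20 : 100;  /* thin / dense pages */
	        total_tuples += tuples_on[i];
	    }

	    for (t = 0; t < TRIALS; t++)
	    {
	        int pool_size = 0;

	        /* stage one: pick PSAMPLE distinct pages (partial Fisher-Yates;
	           the slight modulo bias of rand() is ignored here) */
	        for (i = 0; i < NPAGES; i++)
	            page_ids[i] = i;
	        for (i = 0; i < PSAMPLE; i++)
	        {
	            j = i + rand() % (NPAGES - i);
	            tmp = page_ids[i]; page_ids[i] = page_ids[j]; page_ids[j] = tmp;

	            /* pour the page's tuples into the pool; mark only slot 0 of
	               each page so the two probe tuples can be recognized below */
	            for (k = 0; k < tuples_on[page_ids[i]]; k++)
	                pool[pool_size++] = (k == 0) ? page_ids[i] : -1;
	        }

	        /* stage two: pick TSAMPLE distinct tuples from the pooled pages */
	        for (i = 0; i < TSAMPLE && i < pool_size; i++)
	        {
	            j = i + rand() % (pool_size - i);
	            tmp = pool[i]; pool[i] = pool[j]; pool[j] = tmp;
	            if (pool[i] == 0)          /* slot 0 of page 0: thin probe  */
	                hits_thin++;
	            else if (pool[i] == 1)     /* slot 0 of page 1: dense probe */
	                hits_dense++;
	        }
	    }

	    printf("uniform target n/N    : %.5f\n", (double) TSAMPLE / total_tuples);
	    printf("thin-page probe tuple : %.5f\n", (double) hits_thin / TRIALS);
	    printf("dense-page probe tuple: %.5f\n", (double) hits_dense / TRIALS);
	    return 0;
	}

[With these parameters both probe frequencies come out close to n/N = 0.05, and the residual bias against the dense-page tuple is smaller than the simulation noise, in line with the argument above that the effect is small.]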
On Tue, 2004-04-13 at 15:18, Tom Lane wrote:
> Robert Treat <xzilla@users.sourceforge.net> writes:
> Well, the first problem is why is ANALYZE's estimate of the total row
> count so bad :-( ?  I suspect you are running into the situation where
> the initial pages of the table are thinly populated and ANALYZE
> mistakenly assumes the rest are too.

That was my thinking, which is somewhat confirmed after a vacuum full on the table; now analyze gives pretty accurate stats.  Of course the downside is that now the query is consistently slower.

> > so i guess i am wondering if there is something I should be doing to
> > help get the better plan at the more accurate stats levels and/or why it
> > doesn't stick with the original plan (I noticed disabling merge joins
> > does seem to push it back to the original plan).
>
> With the larger number of estimated rows it's figuring the nestloop will
> be too expensive.  The row estimate for the cl scan went up from 1248
> to 10546, so the estimated cost for the nestloop plan would go to about
> 240000 units vs 80000 for the mergejoin plan.  This is obviously off
> rather badly when the true runtimes are 1.7 vs 8.1 seconds :-(.
>
> I think this is an example of a case where we really need better
> estimation of nestloop costs --- it's drastically overestimating the
> relative cost of the nestloop because it's not accounting for the cache
> benefits of the repeated index searches.  You could probably force the
> nestloop to be chosen by lowering random_page_cost, but that's just a
> kluge solution ... the real problem is the model is wrong.

Unfortunately playing with random_page_cost doesn't seem to be enough to get it to favor the nested loop... though setting it down to 2 does help overall.  I played with cpu_index_tuple_cost a bit but that seemed even less useful.  It's aggravating when you know there is a better plan it could pick but there's no (clean) way to get it to do so...

> I have a to-do item to work on this, and will try to bump up its
> priority a bit.

I'll keep an eye out, thanks Tom.

Robert Treat
--
Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL
On Fri, 16 Apr 2004 10:34:49 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> p = prod from{i = 0} to{n - 1} {{c(B - i)} over {cB - i}}
>
>So?  You haven't proven that either sampling method fails to do the
>same.

On the contrary, I believe that the above formula is more or less valid for both methods.  The point is in what I said next:

| This probability grows with increasing B.

For the one-stage sampling method B is the number of pages of the whole table.  With two-stage sampling we have to use n instead of B and get a smaller probability (for n < B, of course).  So this merely shows that the two sampling methods are not equivalent.

>The desired property can also be phrased as "every tuple should be
>equally likely to be included in the final sample".

Only at first sight.  You really expect more from random sampling.  Otherwise I'd just put one random tuple and its n - 1 successors (modulo N) into the sample.  This satisfies your condition, but you wouldn't call it a random sample.

Random sampling is more like "every possible sample is equally likely to be collected", and two-stage sampling doesn't satisfy this condition.

But if in your opinion the difference is not significant, I'll stop complaining against my own idea.  Is there anybody else who cares?

>You could argue that a tuple on a heavily populated page is
>statistically likely to see a higher T when it's part of the page sample
>pool than a tuple on a near-empty page is likely to see, and therefore
>there is some bias against selection of the former tuple.  But given a
>sample over a reasonably large number of pages, the contribution of any
>one page to T should be fairly small and so this effect ought to be
>small.

It is even better: storing a certain number of tuples on heavily populated pages takes fewer pages than storing them on sparsely populated pages (due to tuple size or to dead tuples).  So heavily populated pages are less likely to be selected in stage one, and this exactly offsets the effect of increasing T.

>So I think this method is effectively unbiased at the tuple level.

Servus
 Manfred
Manfred Koizar <mkoi-pg@aon.at> writes:
> Random sampling is more like "every possible sample is equally likely to
> be collected", and two-stage sampling doesn't satisfy this condition.

Okay, I finally see the point here: in the limit as the number of pages B goes to infinity, you'd expect the probability that each tuple in your sample came from a different page to go to 1.  But this doesn't happen in the two-stage sampling method: the probability doesn't increase beyond the value it would have for B = n.  On the average each sample page would supply one tuple, but the odds that this holds *exactly* would be pretty low.

However the existing sampling method has glaring flaws of its own, in particular having to do with the fact that a tuple whose slot is preceded by N empty slots is N times more likely to be picked than one that has no empty-slot predecessors.  The fact that the two-stage method artificially constrains the sample to come from only n pages seems like a minor problem by comparison; I'd happily accept it to get rid of the empty-slot bias.

A possible compromise is to limit the number of pages sampled to something a bit larger than n, perhaps 2n or 3n.  I don't have a feeling for the shape of the different-pages probability function; would this make a significant difference, or would it just waste cycles?

			regards, tom lane
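[The empty-slot bias can be illustrated with a toy C sketch (a simplified stand-in added here for illustration, not the actual sampling code): assume a sampler that picks a uniformly random slot number and then takes the first live tuple at or after it, wrapping around at the end of the array.]

	#include <stdio.h>
	#include <stdlib.h>

	#define NSLOTS 10
	#define DRAWS  1000000

	int main(void)
	{
	    /* 1 = live tuple, 0 = empty slot; the tuple in slot 5 sits behind
	       four empty slots, every other live tuple has a live predecessor */
	    int  live[NSLOTS]  = {1, 0, 0, 0, 0, 1, 1, 1, 1, 1};
	    long picks[NSLOTS] = {0};
	    int  d, s;

	    srand(7);
	    for (d = 0; d < DRAWS; d++)
	    {
	        /* pick a random slot, then advance (wrapping) to a live tuple */
	        s = rand() % NSLOTS;
	        while (!live[s])
	            s = (s + 1) % NSLOTS;
	        picks[s]++;
	    }

	    for (s = 0; s < NSLOTS; s++)
	        if (live[s])
	            printf("slot %d picked %5.2f%% of the time\n",
	                   s, 100.0 * picks[s] / DRAWS);
	    return 0;
	}

[In this toy layout the tuple in slot 5, which follows four empty slots, is returned about five times as often as any other live tuple, which is the kind of skew described above.]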
Number of pages in a random sample (was: query slows down with more accurate stats)
From: Manfred Koizar
On Mon, 19 Apr 2004 12:00:10 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>A possible compromise is to limit the number of pages sampled to
>something a bit larger than n, perhaps 2n or 3n.  I don't have a feeling
>for the shape of the different-pages probability function; would this
>make a significant difference, or would it just waste cycles?

I would have replied earlier, if I had a good answer.  What I have so far contains at least one, probably two flaws.  Knowing not much more than the four basic arithmetic operations I was not able to improve my model.  So I post what I have:

As usual we assume a constant number c of tuples per page.  If we have a table of size B pages and want to collect a sample of n tuples, the number of possible samples is (again in OOo syntax)

	left( binom{cB}{n} right)

If we select an arbitrary page, the number of possible samples that do NOT contain any tuple from this page is

	left( binom {c (B-1)} {n} right)

Let's forget about our actual implementations of sampling methods and pretend we have a perfect random sampling method.  So the probability Pnot(c, B, n) that a certain page is not represented in a random sample is

	left( binom {c (B-1)} {n} right) over left( binom{cB}{n} right)

which can be transformed into the more computing-friendly form

	prod from{i=0} to{n-1} {{cB-c - i} over {cB - i}}

Clearly the probability that a certain page *is* represented in a sample is

	Pyes(c, B, n) = 1 - Pnot(c, B, n)

The next step assumes that these probabilities are independent for different pages, which in reality they are not.  We simply estimate the number of pages represented in a random sample as

	numPag(c, B, n) = B * Pyes(c, B, n)

Here are some results for n = 3000:

	 B \ c |    10 |   100 |   200
	-------+-------+-------+-------
	   100 |   --- |   100 |   100
	  1000 |   972 |   953 |   951
	  2000 |  1606 |  1559 |  1556
	  3000 |  1954 |  1902 |  1899
	  6000 |  2408 |  2366 |  2363
	  9000 |  2588 |  2555 |  2553
	 20000 |  2805 |  2788 |  2787
	 30000 |  2869 |  2856 |  2856
	100000 |  2960 |  2956 |  2956

This doesn't look to depend heavily on the number of tuples per page, which sort of justifies the assumption that c is constant.

In the next step I tried to estimate the number of pages that contain exactly 1, 2, ... tuples of the sample.  My naive procedure works as follows (I'm not sure whether it is even valid as a rough approximation; constructive criticism is very welcome):

For c=100, B=3000, n=3000 we expect 1902 pages to contain at least 1 tuple of the sample.  There are 1098 more tuples than pages; these tuples lie somewhere in those 1902 pages from the first step.  numPag(99, 1902, 1098) = 836 pages contain at least a second tuple.  So the number of pages containing exactly 1 tuple is 1902 - 836 = 1066.  Repeating these steps we get 611 pages with 2 tuples, 192 with 3, 30 with 4, and 3 pages with 5 tuples.

Here are some more numbers for c = 100 and n = 3000:

	     B | pages with 1, 2, ... tuples
	-------+--------------------------------------------------------
	   100 | 1 to 24 tuples: 0, then 1, 2, 4, 10, 18, 26, 24, 11, 4
	  1000 | 108, 201, 268, 229, 113, 29, 5
	  2000 | 616, 555, 292, 83, 12, 1
	  3000 | 1066, 611, 192, 30, 3
	  6000 | 1809, 484, 68, 5
	  9000 | 2146, 374, 32, 2
	 20000 | 2584, 196, 8
	 30000 | 2716, 138, 3
	100000 | 2912, 44

A small C program to experimentally confirm or refute these calculations is attached.  Its results are fairly compatible with the above numbers, IMHO.

Servus
 Manfred
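[The attached program is not reproduced in the archive, but the calculation described above is easy to re-derive.  The following standalone C sketch (a reconstruction of the described formulas, not the attached program) computes numPag(c, B, n) for the c = 100 column of the first table and then iterates the "exactly 1, 2, ... tuples" estimate for B = 3000; small differences from the quoted figures can come from rounding.]

	#include <stdio.h>
	#include <math.h>

	/* Pnot(c, B, n) = prod_{i=0}^{n-1} (cB - c - i) / (cB - i): probability
	 * that one given page contributes nothing to a perfect random sample
	 * of n tuples taken from B pages of c tuples each.
	 */
	static double pnot(double c, double B, double n)
	{
	    double p = 1.0;
	    double i;

	    for (i = 0.0; i < n; i += 1.0)
	        p *= (c * B - c - i) / (c * B - i);
	    return p;
	}

	/* numPag(c, B, n) = B * Pyes(c, B, n): estimated number of pages that
	 * are represented in the sample at all.
	 */
	static double numpag(double c, double B, double n)
	{
	    return B * (1.0 - pnot(c, B, n));
	}

	int main(void)
	{
	    double Bs[] = {1000, 2000, 3000, 6000, 9000, 20000, 30000, 100000};
	    double c = 100.0, n = 3000.0;
	    double pages, tuples;
	    int    k, level;

	    /* the c = 100 column of the first table */
	    for (k = 0; k < 8; k++)
	        printf("B = %6.0f   numPag = %7.1f\n", Bs[k], numpag(c, Bs[k], n));

	    /* the "pages with exactly 1, 2, ... tuples" estimate for B = 3000,
	     * rounding to whole pages at each level: at every step the pages
	     * holding at least one more sample tuple are estimated by applying
	     * numPag to the leftover tuples, with c reduced by one each time.
	     */
	    pages  = floor(numpag(c, 3000.0, n) + 0.5);   /* pages with >= 1 tuple */
	    tuples = n - pages;                           /* tuples beyond the 1st */
	    for (level = 1; tuples > 0.5; level++)
	    {
	        double deeper = floor(numpag(c - level, pages, tuples) + 0.5);

	        printf("pages with exactly %d tuple(s): %.0f\n",
	               level, pages - deeper);
	        tuples -= deeper;
	        pages = deeper;
	    }
	    printf("pages with exactly %d tuple(s): %.0f\n", level, pages);
	    return 0;
	}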
On Mon, 26 Apr 2004 08:08:16 -0700, Sailesh Krishnamurthy <sailesh@cs.berkeley.edu> wrote:
> "A Bi-Level Bernoulli Scheme for Database Sampling"
> Peter Haas, Christian Koenig (SIGMOD 2004)

Does this apply to our problem?  AFAIK with Bernoulli sampling you don't know the sample size in advance.

Anyway, thanks for the hint.  Unfortunately I couldn't find the document.  Do you have a link?

Servus
 Manfred