Thread: Incorrect assumptions with low LIMITs

Incorrect assumptions with low LIMITs

From: Simon Riggs
I have a query where the LIMIT clause is incorrectly reducing the cost
of the query. This results in queries doing LIMIT m running much longer
when m is small (e.g. 1-3) than when m is larger. (PG 9.0) (And yes,
ANALYZE was run and, no, increasing stats_target for the columns
doesn't help.)

Notice that the total predicted cost is a small fraction of the cost
of the IndexScan. It's a small cost because we assume that the LIMIT
reduces the cost of the query in a linear manner, which just ain't so.

Limit  (cost=0.00..8320.49 rows=25 width=28) (actual time=0.024..239.044 rows=25 loops=1)
  ->  Index Scan Backward using blah_pkey on blah  (cost=0.00..4112652.70 rows=12357 width=28) (actual time=0.022..239.009 rows=25 loops=1)
        Index Cond: (blah_id <= 72572020)
        Filter: (user_id = ANY ('{....list....}'::integer[]))

What we are assuming is that if doing the whole scan would reveal N
rows, then we only need to do an m/N fraction of the scan to produce
the m output rows. So if the LIMIT is very small, the cost of this
plan looks really low. Low enough that it beats the plan you might
expect to see, which is an IndexScan or a BitmapIndexScan on user_id.
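
To make that concrete, the Limit cost in the plan above can be reproduced
with simple arithmetic, scaling the index scan's total cost by the fraction
of its estimated output that the LIMIT asks for (numbers taken straight
from the plan):

-- Linear-scaling assumption: limit_cost ~= scan_total_cost * (m / N)
-- m = 25 (LIMIT), N = 12357 (estimated rows), scan cost = 4112652.70
SELECT round(4112652.70 * 25 / 12357, 2) AS assumed_limit_cost;
-- => 8320.49, exactly the Limit node's total cost shown above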

This inappropriate assumption is causing bad plans for the two worst
query types on this large database, so it's not just an idle annoyance.
I've also seen this before in other cases at other release levels. I'm
not reporting this because I need help finding a solution, but because
it's a clear bug. So it's not a topic for the performance list.

Other examples

Limit 1-3    Total runtime: 3394.080 ms
Example plan:

Limit  (cost=0.00..1719.89 rows=2 width=1757) (actual time=3394.000..3394.000 rows=0 loops=1)
  ->  Nested Loop  (cost=0.00..769649.42 rows=895 width=1757) (actual time=3393.998..3393.998 rows=0 loops=1)
        ->  Seq Scan on user  (cost=0.00..763051.97 rows=895 width=1607) (actual time=3393.996..3393.996 rows=0 loops=1)
              Filter: ((user_id <> ALL ('{...about 10 users...}'::integer[])) AND (thing_id = ANY ('{array of integers}'::bigint[])))
        ->  Index Scan using auth_user_pkey on auth_user  (cost=0.00..7.36 rows=1 width=150) (never executed)
              Index Cond: (id = user.user_id)

Limit 4+   Total runtime: 2.132 ms

Limit  (cost=3052.13..3100.82 rows=4 width=1757) (actual time=2.053..2.053 rows=0 loops=1)
  ->  Nested Loop  (cost=3052.13..13946.44 rows=895 width=1757) (actual time=2.050..2.050 rows=0 loops=1)
        ->  Bitmap Heap Scan on user  (cost=3052.13..7348.98 rows=895 width=1607) (actual time=2.048..2.048 rows=0 loops=1)
              Recheck Cond: (thing_id = ANY ('{...lots...}'::bigint[]))
              Filter: (user_id <> ALL ('{...~10 users...}'::integer[]))
              ->  Bitmap Index Scan on user_thing_id  (cost=0.00..3051.90 rows=895 width=0) (actual time=2.026..2.026 rows=6 loops=1)
                    Index Cond: (thing_id = ANY ('{...lots...}'::bigint[]))
        ->  Index Scan using auth_user_pkey on auth_user  (cost=0.00..7.36 rows=1 width=150) (never executed)
              Index Cond: (id = user.user_id)

2000x slower is enough for me to call this a bug, rather than
euphemistically saying "this is sub-optimal".

So there are a few problems:

1. We assume that all values exist within the table. There is no
measure of sparsity. Any value between min and max is assumed to exist
and is assigned the averaged selectivity for a non-MFV.

2. We assume that if values do exist, they have rows uniformly
distributed across the whole table like rungs on a ladder. So the
fewer rows you want, the cheaper it is to access them. This mistake is
made any time we propagate the LIMIT clause onto a SeqScan.

3. We take no account of the potential for error in the plan. The plan
chosen is very risky, but has a lower cost. It would be better to
favour less risky plans for most cases.

So (1) leads us to make an overestimate of the selectivity, and then we
use (2) to deduce that the overall cost is therefore incredibly low.
So the "worst case" selectivity actually leads us to make a best-case
assumption: that rows are frequent and also uniformly distributed.
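
A toy example (hypothetical table, nothing to do with the client schema)
shows how badly assumption (2) can miss when the matching rows are clumped
at the far end of the heap:

-- ~1,000 matching rows out of 1,000,000, all clumped at the end
CREATE TABLE clumped AS
  SELECT g AS id,
         CASE WHEN g > 999000 THEN 1 ELSE 0 END AS flag
  FROM generate_series(1, 1000000) g;
ANALYZE clumped;

-- The planner assumes the matches are evenly spaced, so it expects the
-- SeqScan to hit one after roughly 1,000 rows; in reality the scan has
-- to read past 999,000 non-matching rows before returning anything.
EXPLAIN ANALYZE SELECT * FROM clumped WHERE flag = 1 LIMIT 1;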


In terms of proposed solutions, (2) is the only one that seems able to
be altered with any ease. If we have a reasonably large selectivity,
it's OK to assume that we don't have to scan much of a table before we
find a matching row. But taking that assumption to extremes causes
these bad plans.

Any time we apply a LIMIT clause to a plan with a SeqScan or
unqualified IndexScan, we shouldn't assume the scan will do less than
say 10% of the table. It might, but it's an unsafe assumption because
as the selectivity decreases so does the safety of the assumption that
rows are uniformly distributed. So basically put a clamp on the
ability of this assumption to scale downwards when m/N is very small.
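
As a rough sketch of that clamp, again using the estimates from the first
plan (nothing like this exists in the planner today; it's just the proposal
expressed as arithmetic):

-- Clamp the assumed scan fraction at 10% instead of m/N = 25/12357
SELECT round(4112652.70 * GREATEST(25.0 / 12357, 0.10), 2) AS clamped_limit_cost;
-- => 411265.27 rather than 8320.49, so the risky backward index scan
-- would no longer look so much cheaper than a bitmap scan on user_id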

We could also improve on (1) for some data types. Some measure of
sparsity could easily be gained by taking NDistinct/(Max - Min) for
integer types. That is useful because integers are commonly used as
keys and so some special attention there is not unwarranted.
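
For example, something along these lines (hypothetical table and column
names; the planner would of course use the stored n_distinct statistic
rather than scanning the table):

-- Fraction of the integer key range that is actually populated.
-- A value well below 1.0 means "between min and max" does not imply "exists".
SELECT count(DISTINCT blah_id)::float8
       / (max(blah_id) - min(blah_id) + 1) AS key_density
FROM blah;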

Other ideas welcome.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Incorrect assumptions with low LIMITs

From: Tom Lane
Simon Riggs <simon@2ndQuadrant.com> writes:
> 2. We assume that if values do exist, they have rows uniformly
> distributed across the whole table like rungs on a ladder.

Well, yeah.  That's sometimes wrong, but not always.  In the absence
of evidence to the contrary, I think it's a better assumption than
most others.

> Any time we apply a LIMIT clause to a plan with a SeqScan or
> unqualified IndexScan, we shouldn't assume the scan will do less than
> say 10% of the table.

This is a horrid idea, because it will stop the planner from using
fast-start plans in many situations where they are wins.

> Other ideas welcome.

You are not the first person to have run into this type of problem.
If it were easily solved by some minor hack, we would have done that
long since.  The problem is to not throw the baby out with the
bathwater.

I can see an argument for downgrading the assumed effectiveness of
restrictions that are applied as qpquals (ie, not indexquals), but
the LIMIT node coster doesn't have easy access to the information
needed for that, especially not in situations more complicated than
a single-table scan.
        regards, tom lane


Re: Incorrect assumptions with low LIMITs

From: Jeff Davis
On Fri, 2012-03-16 at 18:25 +0000, Simon Riggs wrote:
> Any time we apply a LIMIT clause to a plan with a SeqScan or
> unqualified IndexScan, we shouldn't assume the scan will do less than
> say 10% of the table. It might, but it's an unsafe assumption because
> as the selectivity decreases so does the safety of the assumption that
> rows are uniformly distributed.

Just trying to follow along. You mean "as the selectivity _increases_
the safety of the assumption that the rows are uniformly distributed
decreases", right?

Regards,
    Jeff Davis



Re: Incorrect assumptions with low LIMITs

From: Simon Riggs
On Fri, Mar 16, 2012 at 9:39 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2012-03-16 at 18:25 +0000, Simon Riggs wrote:
>> Any time we apply a LIMIT clause to a plan with a SeqScan or
>> unqualified IndexScan, we shouldn't assume the scan will do less than
>> say 10% of the table. It might, but it's an unsafe assumption because
>> as the selectivity decreases so does the safety of the assumption that
>> rows are uniformly distributed.
>
> Just trying to follow along. You mean "as the selectivity _increases_
> the safety of the assumption that the rows are uniformly distributed
> decreases", right?

By selectivity I mean the value between 0 and 1 that describes, in the
planner, the fraction of rows we will get back from a scan; 1.0 means
100% of rows. When selectivity is low, that means very few rows will
come back. I think you are using "high selectivity" to mean "returns
few of the rows", so you understand me, but just flip the meaning of
the words.

When you have lots of rows, it's a good assumption that they are spread
out and a scan will find some of them quickly.

When you have very few rows, assuming they are evenly spaced is just
weird. Clumpiness of some kind seems much more likely, and that is much
easier to see if the values are dates, for example.

Given that the estimated number of rows is deliberately a worst case
(i.e. high), that sounds like the scan will work. Yet the reality is
that doing the scan is incredibly costly when those assumptions break,
which they do, often. Especially when the values don't exist at all
because the table is sparse.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Incorrect assumptions with low LIMITs

From: Simon Riggs
On Fri, Mar 16, 2012 at 9:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Simon Riggs <simon@2ndQuadrant.com> writes:
>> 2. We assume that if values do exist, they have rows uniformly
>> distributed across the whole table like rungs on a ladder.
>
> Well, yeah.  That's sometimes wrong, but not always.  In the absence
> of evidence to the contrary, I think it's a better assumption than
> most others.

The evidence I showed proves it is 1000x worse. I'm not sure how you
can say it is "better than most" or that there is no evidence.


> You are not the first person to have run into this type of problem.

Yes - if I thought it was an isolated problem I would not have brought it up.

Also, I don't bring it up because I need help; the actual problem has
workarounds. But rewriting SQL because of a planner problem is not
something that highlights our strengths.


> If it were easily solved by some minor hack, we would have done that
> long since.  The problem is to not throw the baby out with the
> bathwater.
>
> I can see an argument for downgrading the assumed effectiveness of
> restrictions that are applied as qpquals (ie, not indexquals), but
> the LIMIT node coster doesn't have easy access to the information
> needed for that, especially not in situations more complicated than
> a single-table scan.

My wish was to register this as both a common and significant bug,
whatever the solution is.

The bug appears in very simple queries, so we can't hide behind some
doubt as to the exact cause.

Few planning errors are wrong by more than 1000 times; this plan gets
worse every time the client gains new customers, so it's not just bad,
it's getting worse. Multiple different queries show the same symptoms,
so it's common also.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Incorrect assumptions with low LIMITs

From: Greg Stark
On Sat, Mar 17, 2012 at 9:34 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> My wish was to register this as both a common and significant bug,

It has definitely come up here before many times.

However at root the problem is part of the general class of not
understanding how two different columns are related. Postgres is
assuming they're entirely independent and therefore all the values of
x are uniformly distributed over values of y.

To plan this better in your case it would have to know that blah_id <=
72572020 is not equally likely for user_id = ANY
('{....list....}'::integer[]) as it is for the table as a whole.
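
For instance, with invented numbers: if 90% of the rows satisfy blah_id <=
72572020 and 0.1% match the user_id list, the planner simply multiplies the
two selectivities:

-- Independence assumption: joint selectivity = product of the parts
SELECT 0.90 * 0.001 AS assumed_joint_selectivity;  -- 0.0009
-- If the listed users have few or no rows in the scanned blah_id range,
-- the true joint selectivity there is far lower than the product suggests,
-- and the scan runs far longer than it was costed for.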


-- 
greg


Re: Incorrect assumptions with low LIMITs

From: Simon Riggs
On Sat, Mar 17, 2012 at 11:33 AM, Greg Stark <stark@mit.edu> wrote:
> On Sat, Mar 17, 2012 at 9:34 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> My wish was to register this as both a common and significant bug,
>
> It has definitely come up here before many times.
>
> However at root the problem is part of the general class of not
> understanding how two different columns are related. Postgres is
> assuming they're entirely independent and therefore all the values of
> x are uniformly distributed over values of y.
>
> To plan this better in your case it would have to know that blah_id <=
> 72572020 is not equally likely for user_id = ANY
> ('{....list....}'::integer[]) as it is for the table as a whole.

That is not the root cause in this case, though I agree that is also a
problem. The first plan is more complex because of an ORDER BY that
favours the index scan, but that's actually irrelevant to the case;
the blah_id condition is just the user exploiting the fact that ids are
allocated in date order, to avoid doing a date sort.

The problems are as I described them

(1) no account made for sparsity, and other factors leading to an
overestimate of rows (N)

(2) inappropriate assumption of the effect of LIMIT m, which causes a
costly SeqScan to appear better than an IndexScan for low m/N, when in
fact that is seldom the case.

Overestimating N in (1) inverts the problem, so that an overestimate
isn't the safe thing at all.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Incorrect assumptions with low LIMITs

From: Jeff Davis
On Sat, 2012-03-17 at 12:48 +0000, Simon Riggs wrote:
> The problems are as I described them
> 
> (1) no account made for sparsity, and other factors leading to an
> overestimate of rows (N)
> 
> (2) inappropriate assumption of the effect of LIMIT m, which causes a
> costly SeqScan to appear better than an IndexScan for low m/N, when in
> fact that is seldom the case.
> 
> Overestimating N in (1) inverts the problem, so that an overestimate
> isn't the safe thing at all.

I think the actual problem has more to do with risk. The planner doesn't
know how uniform the distribution of the table is, which introduces risk
for the table scan.

I would tend to agree that for low selectivity fraction and a very low
limit (e.g. 1-3 in your example) and a large table, it doesn't seem like
a good risk to use a table scan. I don't know how that should be modeled
or implemented though.

Regards,
    Jeff Davis



Re: Incorrect assumptions with low LIMITs

From: Daniel Farina
On Mon, Mar 19, 2012 at 6:19 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Sat, 2012-03-17 at 12:48 +0000, Simon Riggs wrote:
>> The problems are as I described them
>>
>> (1) no account made for sparsity, and other factors leading to an
>> overestimate of rows (N)
>>
>> (2) inappropriate assumption of the effect of LIMIT m, which causes a
>> costly SeqScan to appear better than an IndexScan for low m/N, when in
>> fact that is seldom the case.
>>
>> Overestimating N in (1) inverts the problem, so that an overestimate
>> isn't the safe thing at all.
>
> I think the actual problem has more to do with risk. The planner doesn't
> know how uniform the distribution of the table is, which introduces risk
> for the table scan.
>
> I would tend to agree that for low selectivity fraction and a very low
> limit (e.g. 1-3 in your example) and a large table, it doesn't seem like
> a good risk to use a table scan. I don't know how that should be modeled
> or implemented though.

FWIW, we have been bitten by the uniformity assumption a number of
times, but we kind of know what to look for.  It also has an
interesting interpretation as applied to garbage (inasmuch as every
query carries a conceptual "WHERE xmin...xmax..." visibility
predicate).  Sometimes a table is mostly garbage for a particular
value, and the index scan constantly has to seek the next tuple in its
search to find non-garbage.

I don't know how many times the uniformity assumption has held up fine
(guess: many), but cases where an index scan has that little "Filter:"
footnote have been a ticking time bomb for a number of people I've
interacted with.  It can be subtle because the amount of scanned data
may not be enormous, so upon re-running the query it is not nearly as
pathological as the original run, thanks to cache effects.  The only
way to make it very visible is to look at EXPLAIN with the BUFFERS
option and note how many pages of the underlying relation and indexes
are being processed only to be discarded by the filter.
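
For example, against a query shaped like the one in Simon's first plan
(table name and values reconstructed from that plan, so treat the exact
SQL as illustrative):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM blah
WHERE blah_id <= 72572020
  AND user_id = ANY ('{1,2,3}'::integer[])   -- placeholder user ids
ORDER BY blah_id DESC
LIMIT 25;
-- The "Buffers: shared hit=... read=..." lines show how many pages the
-- backward index scan really touched; set against the 25 rows returned,
-- the work thrown away by the Filter becomes visible.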

Were I writing an OLTP-query performance linting tool, queries with
partially matched index conditions might even be on my list of things
to warn about.  That is unfortunate for cases where the unmatched part
of the index condition may only qualify, say, five records in all
situations, but the database has no construct to know and enforce this
AFAIK.

-- 
fdr