From: Tom Lane
Subject: Re: Why is explain horribly optimistic for sorts?
Msg-id: 7107.983645760@sss.pgh.pa.us
In response to: Re: Why is explain horribly optimistic for sorts?  (Ben <bench@silentmedia.com>)
List: pgsql-general

Ben <bench@silentmedia.com> writes:
> After enabling the new likeplanning code, things only get worse... the
> estimated cost for the same query is now:

> Sort  (cost=4.95..4.95 rows=1 width=136)
>   ->  Index Scan using jennyann_target_key on jennyann  (cost=0.00..4.94 rows=1 width=136)

Yeah, I was afraid of that: the new code is smarter but also subject to
the same range-estimation problem that plagues us in cases like "WHERE a
BETWEEN b AND c".  The only stats we have for range estimation are the
column minimum and maximum values collected by VACUUM ANALYZE, so we
just do a linear interpolation.  If your data isn't fairly uniformly
distributed between the min and max then range estimates will be all
wet.  In this example, it seems that 80% of your target entries start
with "/music/", but I'll bet the min and max cover a much wider range.
So the estimator mistakenly thinks that LIKE '/music/%' will select
a fairly small proportion of rows, rather than 80%, and this leads the
planner to choose a plan that would be appropriate if that were true.
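
To make that concrete (schematically; I'm writing out the rewritten
form by hand here, and the formula is only the gist of it), the new
code effectively turns the pattern into a range test on the prefix:

    -- LIKE '/music/%' is treated, more or less, as
    SELECT * FROM jennyann
    WHERE target >= '/music/' AND target < '/music0';
    -- ('/music0' is the prefix with its last character bumped by one)
    --
    -- With only min/max stats, selectivity is interpolated linearly,
    -- roughly:
    --   ('/music0' - '/music/') / (max(target) - min(target))
    -- which comes out tiny whenever min and max span far more than
    -- the prefix range does.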

There has been talk of keeping more extensive stats to allow better
estimation in cases like this, but I don't foresee it happening for
a release or two yet...

> Out of curiosity, why does it take so long to order data by a datetime
> field?

AFAIK it shouldn't be materially different from sorting on a float8
field.  You're still blaming the wrong thing: this isn't the sorter's
fault; it's doing the best it can.  The reason this is so slow, I
believe, is that an indexscan is being used to select 80% of the data.
Visiting the tuples in strict target order means a lot of useless disk
thrashing, fetching heap pages at random all over the table.  A simple
sequential scan would've been much faster.

It's possible that an indexscan on logtime would work better, because it
could stop as soon as it had retrieved 1000 matching tuples.  This would
be particularly likely if the tuples are more or less in logtime order,
so that there's not a lot of disk thrashing to retrieve them.  Otherwise
a plain old sequential scan and sort is likely to be the fastest way.
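
In other words (I'm guessing at the exact shape of your query here,
and jennyann_logtime_key is a made-up index name), the query I have
in mind looks something like:

    -- assumes an index on the datetime column exists, e.g.:
    --   CREATE INDEX jennyann_logtime_key ON jennyann (logtime);
    SELECT * FROM jennyann
    WHERE target LIKE '/music/%'
    ORDER BY logtime
    LIMIT 1000;
    -- Scanning jennyann_logtime_key in logtime order, the executor
    -- can stop after the first 1000 rows that pass the LIKE filter,
    -- instead of fetching 80% of the table in target order and
    -- sorting it afterwards.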

Unfortunately the planner is never going to figure this out as long as
it's so far off about the selectivity of the LIKE.  You could try
forcing its hand with various combinations of
    SET enable_sort TO off;
    SET enable_indexscan TO off;
... just remember to turn these things back on for subsequent queries,
else you'll get really bad plans ...
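
Concretely, a session to force the seqscan-and-sort plan might look
like this (again assuming the query shape above):

    SET enable_indexscan TO off;  -- planner can't pick the indexscan now
    EXPLAIN SELECT * FROM jennyann
    WHERE target LIKE '/music/%' ORDER BY logtime LIMIT 1000;
    SET enable_indexscan TO on;   -- restore the default afterwards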

            regards, tom lane
