Re: Planner wrongly shuns multi-column index for select .. order by col1, col2 limit 1 - Mailing list pgsql-performance

From John Surcombe
Subject Re: Planner wrongly shuns multi-column index for select .. order by col1, col2 limit 1
Date
Msg-id C0B87DEF8EF4F94C9D8FBD3151B627B26C0C47@digimapserver.digimap.local
Whole thread Raw
In response to Re: Planner wrongly shuns multi-column index for select .. order by col1, col2 limit 1  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Planner wrongly shuns multi-column index for select .. order by col1, col2 limit 1  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
> >> When we 'EXPLAIN' this query, PostgreSQL says it is using the index
> >> idx_receiveddatetime.  The way the application is designed means
that
> >> in virtually all cases the query will have to scan a very long way
> >> into idx_receiveddatetime to find the first record where userid =
> 311369000.
> >> If however we delete the idx_receiveddatetime index, the query uses
> >> the idx_userid_receiveddatetime index, and the query only takes a
few
> >> milliseconds.
>
> > That's just bizarre ... it knows the index is applicable, and the
cost
> > estimates clearly favor the better index, so why did it pick the
worse
> > one?
>
> No, scratch that, I misread the plans.  It *is* picking the plan it
thinks has
> lower cost; it's just a mistaken cost estimate.  It's strange though
that the less
> selective indexscan is getting a lower cost estimate.  I wonder
whether your
> table is (almost) perfectly ordered by receiveddatetime, such that the
one-
> column index has correlation close to 1.0.  That could possibly lower
the cost
> estimate to the point where it'd appear to dominate the other index.
It'd be
> useful to see the pg_stats.correlation value for both the userid and
> receiveddatetime columns.

Yes, the table is indeed nearly perfectly ordered by receiveddatetime
(correlation 0.998479).  correlation on userid is -0.065556.  n_distinct
on userid is also low: 1097.

Is the problem perhaps something like the following:  PostgreSQL is
thinking that because there are not many userids and there is low
correlation, that if it just scans the table from the top in date order,
this will be cheap (because receiveddatetime correlation is high so it
won't have to seek randomly), and it won't have to scan very far before
it finds the first row with a matching userid.

The problem is though that in our application the userids are not
scattered randomly.  There are small regions of the table where a
particular userid appears frequently, interspersed with much larger
regions (perhaps millions or tens of millions of rows) where it doesn't
appear at all.  So in fact the planner's preferred solution is often
pathologically bad.

Is there a solution?

pgsql-performance by date:

Previous
From: Julius Tuskenis
Date:
Subject: Re: unexpected stable function behavior
Next
From: Tom Lane
Date:
Subject: Re: unexpected stable function behavior