Re: BUG #17540: Prepared statement: PG switches to a generic query plan which is consistently much slower - Mailing list pgsql-bugs

From: Tom Lane
Subject: Re: BUG #17540: Prepared statement: PG switches to a generic query plan which is consistently much slower
Msg-id: 3011319.1657216944@sss.pgh.pa.us
In response to: Re: BUG #17540: Prepared statement: PG switches to a generic query plan which is consistently much slower (Andres Freund <andres@anarazel.de>)
List: pgsql-bugs
Andres Freund <andres@anarazel.de> writes:
> On 2022-07-06 23:13:18 -0400, Tom Lane wrote:
>> comparing the cost estimate for the generic plan to the cost estimate
>> for a custom plan is not really logically valid, because those estimates
>> are founded on different statistics.  I don't know how to fix that :-(.

> I think there's something more fundamentally wrong - somehow we end up with
> assuming > 50% selectivity on both the min and the max initplan, for the same
> condition!

Well, sure, because it *is* the same condition.  AFAICS this is operating
as designed.  Do I wish it were better?  Sure, but there is no simple fix
here.

The reasoning that's being applied in the generic plan is:

(1) default selectivity estimate for a scalar inequality is
#define DEFAULT_INEQ_SEL  0.3333333333333333

(2) therefore, the filter condition on the indexscan will select a random
one-third of the table;

(3) therefore, the LIMIT will be able to stop after about three rows,
whichever direction we scan in.

The information that is lacking is that the "id" and "timestamp"
columns are heavily correlated, so that we may have to scan far more
than three rows in "id" order before finding a row satisfying the
inequality on "timestamp".  This is a problem we've understood for
a long time --- I recall talking about it at PGCon a decade ago.
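
(We do compute a per-column ordering statistic today:
pg_stats.correlation measures how well a column's values track the
physical row order.  In the sketch above both columns would show ~1.0,
but nothing records that the two columns' orderings track *each other*,
which is what this estimate would need:)

    SELECT attname, correlation
    FROM pg_stats
    WHERE tablename = 'events' AND attname IN ('id', 'ts');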

The extended stats machinery provides a framework wherein we could
calculate and save the ordering correlation between the two columns,
but I don't believe it actually calculates that number yet --- I think
the functional-dependency stuff is close but not the right thing.
Even if we had the stats, it's not very clear where to fit this
type of consideration into the planner's estimates.
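
As a point of comparison, here's what the existing machinery can be
asked to do against the sketch table above.  The functional-dependency
statistic answers "does id determine ts?", which is not the same
question as "do id order and ts order agree?":

    CREATE STATISTICS events_dep (dependencies) ON id, ts FROM events;
    ANALYZE events;
    -- reports a dependency degree, not an ordering correlation:
    SELECT statistics_name, dependencies
    FROM pg_stats_ext
    WHERE statistics_name = 'events_dep';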

> In this case the whole generic plan part seems like a red herring. The generic
> plan is *awful* and would still be awful if the value were known, but
> somewhere around the middle of the value range.

If the value were somewhere around the middle (which is more or less
what we're assuming for the generic plan), then an indexscan on the
timestamp column isn't going to be that great either; you'd still
be scanning half the table.
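
(You can put numbers on that with the sketch table: a mid-range
parameter qualifies roughly 500k of the 1M rows, so any plan driven by
the ts index still visits about half the table before min(id) or
max(id) can be computed:)

    SELECT count(*) FROM events
    WHERE ts > timestamptz '2022-01-06 18:00';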

> FWIW, manually writing the min/max as ORDER BY timestamp ASC/DESC LIMIT 1
> queries yields a *vastly* better plan:

Those queries give the wrong answers.  We're looking for the min or max
id, not the id associated with the min or max timestamp.  (They're
accidentally the same with this toy dataset.)
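
Spelled out against the sketch table, the difference is:

    -- Andres's rewrite: the id of the row with the smallest
    -- qualifying ts
    SELECT id FROM events WHERE ts > timestamptz '2022-01-12'
    ORDER BY ts ASC LIMIT 1;
    -- what the query actually asks for: the smallest qualifying id
    SELECT id FROM events WHERE ts > timestamptz '2022-01-12'
    ORDER BY id ASC LIMIT 1;
    -- (likewise with DESC for the max; the two forms agree here only
    -- because the sketch data makes id and ts perfectly correlated)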

            regards, tom lane


