Re: BUG #17964: Missed query planner optimization - Mailing list pgsql-bugs

From David Rowley
Subject Re: BUG #17964: Missed query planner optimization
Date
Msg-id CAApHDvq_5v=G=9aS3QQwRUH-h1nf09Ycnzh=GQeDJtsYsQvd_g@mail.gmail.com
Whole thread Raw
In response to Re: BUG #17964: Missed query planner optimization  (Mathias Kunter <mathiaskunter@gmail.com>)
Responses Re: BUG #17964: Missed query planner optimization
List pgsql-bugs
On Wed, 7 Jun 2023 at 23:48, Mathias Kunter <mathiaskunter@gmail.com> wrote:
> Yes, it's not a bug, but it's something which can be improved. If I
> simply change the original query from this:
>
> > SELECT ... WHERE id IN (SELECT ...);
>
> into this:
>
> > SELECT ... WHERE id = ANY(ARRAY(SELECT ...));
>
> then Postgres uses an index scan, and the query is orders of magnitude
> faster. Note that the planner actually correctly computes the estimated
> costs for both variants, since I get:

What's going on here is that there is code which will convert
supported IN clauses into semi-joins. The first of your queries has
this done, but the 2nd query does not.  The 2nd query, since the
semi-join conversion is not done, the qual later becomes eligible to
be pushed down into the union subquery meaning the non-matching rows
get filtered before the UNION is evaluated.  We'll never attempt to
push joins (semi or otherwise) down into subqueries, so this is not
done for the first query.

The main problem here is that in some cases the first of your queries
will be faster and in other cases the 2nd will be faster. It'll depend
on how many rows are in each table. So, really to find the fastest
plan, the planner would have to consider both options.  That would
unfortunately mean that we'd have to perform the join search once
without the semi-join pushed down and again with the semi-join pushed
down.  The join search is going to be the slowest part of planning
when there are a few tables to join, so doing that twice could add
quite a bit of overhead to the planner.   You might also then consider
how many times you'd need to perform the join search if there were,
say, 5 IN clauses.  To exhaustively find the best plan we'd need to
try all possible combinations of converting each IN clause to a
semi-join vs leaving the qual alone.  If the main query already had,
say 5 tables to join then that suddenly becomes a hugely costly query
to plan.  Given that, I'm not all that sure you're likely to see us
making any improvements here.  I suggest just rewriting the query in a
way that it executes more quickly for you.

David



pgsql-bugs by date:

Previous
From: Mathias Kunter
Date:
Subject: Re: BUG #17964: Missed query planner optimization
Next
From: David Rowley
Date:
Subject: Re: BUG #17967: [msvc-x86] Failed to build with error LNK2019: unresolved external symbol __BitScanReverse64 referen