Re: Explain plan changes - IN CLAUSE ( Passing direct values Vs INNERQuery ) - Mailing list pgsql-performance

From David Rowley
Subject Re: Explain plan changes - IN CLAUSE ( Passing direct values Vs INNERQuery )
Date
Msg-id CAApHDvoTSiiYDuEY3gxZ_j6pnxtp6jGwguvLet4+ix6L1Z7aQg@mail.gmail.com
Whole thread Raw
In response to Re: Explain plan changes - IN CLAUSE ( Passing direct values Vs INNERQuery )  ("David G. Johnston" <david.g.johnston@gmail.com>)
List pgsql-performance
On Fri, 8 May 2020 at 10:00, David G. Johnston
<david.g.johnston@gmail.com> wrote:
>
> On Thu, May 7, 2020 at 11:07 AM Amarendra Konda <amar.vijaya@gmail.com> wrote:
>>
>> EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)  SELECT pa.process_activity_id AS pa_process_activity_id  FROM
process_activitypa WHERE pa.app_id = '126502930200650' AND pa.created > '1970-01-01 00:00:00'  AND EXISTS ( SELECT 1
FROMprocess_instance pi where pi.app_id = pa.app_id AND pi.process_instance_id = pa.process_instance_id  AND pi.user_id
='137074931866340') ORDER BY pa.process_instance_id,  pa.created limit 50; 
>>
>>
>>                ->  Index Scan using process_activity_process_instance_id_app_id_created_idx on
public.process_activitypa  (cost=0.70..1061.62 rows=1436 width=32) (actual time=0.011..20.320 rows=23506 loops=2) 
>
> > Index Cond: ((m.process_instance_id = pi.process_instance_id) AND (m.app_id = '126502930200650'::bigint) AND
(m.created> '1970-01-01 00:00:00'::timestamp without time zone)) 
>
> I suppose during the nested loop the inner index scan could limit itself to the first 50 entries it finds (since the
firsttwo index columns are being held constant on each scan, m.created should define the traversal order...) so that
theoutput of the nested loop ends up being (max 2 x 50) 100 entries which are then sorted and only the top 50 returned. 
>
> Whether the executor could but isn't doing that here or isn't programmed to do that (or my logic is totally off) I do
notknow. 

I think the planner is likely not putting the process_activity table
on the outer side of the nested loop join due to the poor row
estimates.  If it knew that so many rows would match the join then it
likely would have done that to save from having to perform the sort at
all.  However, because the planner has put the process_instance on the
outer side of the nested loop join, it's the pathkeys from that path
that the nested loop node has, which is not the same as what the ORDER
BY needs, so the planner must add a sort step, which means that all
rows from the nested loop plan must be read so that they can be
sorted.

It might be worth trying: create index on process_instance
(user_id,app_id); as that might lower the cost of performing the join
in the opposite order and have the planner prefer that order instead.
If doing that, the OP could then ditch the
fki_conv_konotor_user_user_id index to save space.

If that's not enough to convince the planner that the opposite order
is better then certainly SET enable_sort TO off; would.

David



pgsql-performance by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: Explain plan changes - IN CLAUSE ( Passing direct values Vs INNERQuery )
Next
From: Laurenz Albe
Date:
Subject: Re: pg_attribute, pg_class, pg_depend grow huge in count and sizewith multiple tenants.