Re: IN vs EXIIST - Mailing list pgsql-general

From Greg Stark
Subject Re: IN vs EXIIST
Date
Msg-id 87ofaqrj9a.fsf@stark.dyndns.tv
Whole thread Raw
In response to Re: IN vs EXIIST  (Jean-Christian Imbeault <jc@mega-bucks.co.jp>)
List pgsql-general
Caveat, I'm new to Postgres. However extrapolating from my experience with
Oracle I think some of you are going about this optimization work the wrong
way around.

Don't focus on IN vs EXISTS, focus on the query plans they generate. Which of
two identical queries is faster is purely a quirk of optimizer (and it's
arguably always a bug if they aren't both the same). It's like debating which
of two identical cars will be faster without actually checking who is driving.

And don't focus on the estimated costs from explain plan. If they were always
right then there wouldn't be any disputes like these. Well not quite.

So in the cases below as Jean-Christian showed the NOT IN query leads the
optimizer to do two sequential scans and a full sort and unique operation.
Whereas the EXISTS syntax query leads the optimizer to use the index on one of
the tables.

Usually doing a full table scan and an extra sort and unique operation will be
slower, unless you were going to be reading a lot of the table anyways. So
which of these two plans will be fastest should hinge on how much of the truth
table pertains to the attribute being checked.

Exactly where the breakeven point is depends on the details of the database
engine. For Oracle the traditional rule of thumb is a (surprisingly low) 10%.

Someone implied that EXISTS might be more influenced by a VACUUM ANALYZE. That
wouldn't be surprising if the optimizer is changing strategies if it has
enough data to determine that an index scan will read more than the magic
breakeven percentage of the full table and switches to a full table scan.

In the query plans below I stripped out the distracting cost estimates:

Jean-Christian Imbeault <jc@mega-bucks.co.jp> writes:

> psql TMP -c "explain select count(distinct invoice_id) from invoice_li where
> received='true' AND shipped='false' AND cancelled='false'
>   AND
>    (invoice_id not in
...
>
> NOTICE:  QUERY PLAN:
>
> Aggregate
>    ->  Seq Scan on invoice_li
>          SubPlan
>            ->  Materialize
>                  ->  Unique
>                        ->  Sort
>                              ->  Seq Scan on invoice_li

> $ psql TMP -c "explain select count(distinct invoice_id) from invoice_li where
> received='true'
>   AND shipped='false' AND cancelled='false'
>   AND
>     (NOT EXISTS
...
>
> NOTICE:  QUERY PLAN:
>
> Aggregate
>    ->  Seq Scan on invoice_li
>          SubPlan
>            ->  Index Scan using invoice_li_pkey on invoice_li sq_inv_li


--
greg

pgsql-general by date:

Previous
From: Lee Harr
Date:
Subject: Re: executing sql statements from file
Next
From: ljb
Date:
Subject: Re: Permissions with multiple groups...