IN not handled very well? - Mailing list pgsql-performance

From Ben
Subject IN not handled very well?
Date
Msg-id 37536ADC-0D18-485C-9F77-B875D69FBCAA@silentmedia.com
Whole thread Raw
Responses Re: IN not handled very well?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
I've got this query with an IN clause:

select count(*),public.album.gid,public.album.name,public.album.id
from public.album,public.albumjoin,public.puid,public.puidjoin where
albumjoin.album = public.album.id and public.puidjoin.track =
public.albumjoin.track and public.puid.id = public.puidjoin.puid and
public.puid.puid
IN (select umdb.puid.name from umdb.puid,umdb.node where umdb.puid.id
= umdb.node.puid and umdb.node.dir=5886)
  group by gid,name,public.album.id having count(*) >= 6 order by
count(*) desc;

It gives me a rather expensive plan:

  Sort  (cost=35729.07..35729.75 rows=272 width=69)
    Sort Key: count(*)
    ->  HashAggregate  (cost=35713.31..35718.07 rows=272 width=69)
          Filter: (count(*) >= 6)
          ->  Nested Loop  (cost=51.67..35709.91 rows=272 width=69)
                ->  Nested Loop  (cost=51.67..34216.30 rows=272 width=4)
                      ->  Nested Loop  (cost=51.67..33338.04 rows=272
width=4)
                            ->  Hash IN Join  (cost=51.67..31794.72
rows=218 width=4)
                                  Hash Cond: (("outer".puid)::text =
"inner".name)
                                  ->  Seq Scan on puid
(cost=0.00..23495.21 rows=1099421 width=44)
                                  ->  Hash  (cost=51.63..51.63
rows=15 width=40)
                                        ->  Nested Loop
(cost=0.00..51.63 rows=15 width=40)
                                              ->  Index Scan using
node_dir on node  (cost=0.00..3.22 rows=16 width=4)
                                                    Index Cond: (dir
= 5886)
                                              ->  Index Scan using
puid_pkey on puid  (cost=0.00..3.01 rows=1 width=44)
                                                    Index Cond:
(puid.id = "outer".puid)
                            ->  Index Scan using puidjoin_puidtrack
on puidjoin  (cost=0.00..7.05 rows=2 width=8)
                                  Index Cond: ("outer".id =
puidjoin.puid)
                      ->  Index Scan using albumjoin_trackindex on
albumjoin  (cost=0.00..3.22 rows=1 width=8)
                            Index Cond: ("outer".track =
albumjoin.track)
                ->  Index Scan using album_pkey on album
(cost=0.00..5.48 rows=1 width=69)
                      Index Cond: ("outer".album = album.id)

If I'm reading this right, it looks like it's expensive because it's
doing a sequential scan on public.puid.puid to satisfy the IN clause.
(Although why it's doing that I'm not sure, given that there's a
recently analyzed index on public.puid.puid.) Interestingly, if I
replace that IN subselect with the 15 values it will return, my plan
improves by two orders of magnitude:

  Sort  (cost=235.53..235.56 rows=12 width=69)
    Sort Key: count(*)
    ->  HashAggregate  (cost=235.11..235.32 rows=12 width=69)
          Filter: (count(*) >= 6)
          ->  Nested Loop  (cost=20.03..234.96 rows=12 width=69)
                ->  Nested Loop  (cost=20.03..169.06 rows=12 width=4)
                      ->  Nested Loop  (cost=20.03..130.32 rows=12
width=4)
                            ->  Bitmap Heap Scan on puid
(cost=20.03..59.52 rows=10 width=4)
                                  Recheck Cond: ((puid =
'f68dcf86-992c-2e4a-21fb-2fc8c56edfeb'::bpchar) OR (puid =
'7716dbcf-56ab-623b-ab33-3b2e67a0727c'::bpchar) OR (puid =
'724d6a39-0d15-a296-2dd2-127c34f13809'::bpchar) OR (puid =
'02f0cd9f-9fa5-abda-06cd-5dbb13826243'::bpchar) OR (puid = '165d5bea-
b21f-9302-b991-0927f491787b'::bpchar) OR (puid = '4223dbc8-85af-a92e-
b63d-72a726475e2c'::bpchar) OR (puid = '2d43ef9a-
c7ee-2425-7fac-8b937cbed178'::bpchar) OR (puid = '9ff81c2f-04b7-
cf5d-705f-7b944a5ae093'::bpchar) OR (puid = 'deaddddd-dfaf-18dd-6d4d-
c483e8ba60f7'::bpchar) OR (puid = '20939b69-
ff98-770a-1444-3b0e9892712f'::bpchar))
                                  ->  BitmapOr  (cost=20.03..20.03
rows=10 width=0)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'f68dcf86-992c-2e4a-21fb-2fc8c56edfeb'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'7716dbcf-56ab-623b-ab33-3b2e67a0727c'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'724d6a39-0d15-a296-2dd2-127c34f13809'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'02f0cd9f-9fa5-abda-06cd-5dbb13826243'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'165d5bea-b21f-9302-b991-0927f491787b'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'4223dbc8-85af-a92e-b63d-72a726475e2c'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'2d43ef9a-c7ee-2425-7fac-8b937cbed178'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'9ff81c2f-04b7-cf5d-705f-7b944a5ae093'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'deaddddd-dfaf-18dd-6d4d-c483e8ba60f7'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'20939b69-ff98-770a-1444-3b0e9892712f'::bpchar)
                            ->  Index Scan using puidjoin_puidtrack
on puidjoin  (cost=0.00..7.05 rows=2 width=8)
                                  Index Cond: ("outer".id =
puidjoin.puid)
                      ->  Index Scan using albumjoin_trackindex on
albumjoin  (cost=0.00..3.22 rows=1 width=8)
                            Index Cond: ("outer".track =
albumjoin.track)
                ->  Index Scan using album_pkey on album
(cost=0.00..5.48 rows=1 width=69)
                      Index Cond: ("outer".album = album.id)

I guess my question is: if postgres is (correctly) estimating that
only 15 rows will come out of the subselect, and it knows it can
choose a much better plan with bitmap index scans, should it be able
to choose the bitmap plan over the sequential scan? Or should I run
the subselect myself and then rewrite my query to push in the
constant values?

pgsql-performance by date:

Previous
From: Ron Mayer
Date:
Subject: Re: Large tables (was: RAID 0 not as fast as
Next
From: Tom Lane
Date:
Subject: Re: IN not handled very well?