Thread: IN not handled very well?

IN not handled very well?

From
Ben
Date:
I've got this query with an IN clause:

select count(*),public.album.gid,public.album.name,public.album.id
from public.album,public.albumjoin,public.puid,public.puidjoin where
albumjoin.album = public.album.id and public.puidjoin.track =
public.albumjoin.track and public.puid.id = public.puidjoin.puid and
public.puid.puid
IN (select umdb.puid.name from umdb.puid,umdb.node where umdb.puid.id
= umdb.node.puid and umdb.node.dir=5886)
  group by gid,name,public.album.id having count(*) >= 6 order by
count(*) desc;

It gives me a rather expensive plan:

  Sort  (cost=35729.07..35729.75 rows=272 width=69)
    Sort Key: count(*)
    ->  HashAggregate  (cost=35713.31..35718.07 rows=272 width=69)
          Filter: (count(*) >= 6)
          ->  Nested Loop  (cost=51.67..35709.91 rows=272 width=69)
                ->  Nested Loop  (cost=51.67..34216.30 rows=272 width=4)
                      ->  Nested Loop  (cost=51.67..33338.04 rows=272
width=4)
                            ->  Hash IN Join  (cost=51.67..31794.72
rows=218 width=4)
                                  Hash Cond: (("outer".puid)::text =
"inner".name)
                                  ->  Seq Scan on puid
(cost=0.00..23495.21 rows=1099421 width=44)
                                  ->  Hash  (cost=51.63..51.63
rows=15 width=40)
                                        ->  Nested Loop
(cost=0.00..51.63 rows=15 width=40)
                                              ->  Index Scan using
node_dir on node  (cost=0.00..3.22 rows=16 width=4)
                                                    Index Cond: (dir
= 5886)
                                              ->  Index Scan using
puid_pkey on puid  (cost=0.00..3.01 rows=1 width=44)
                                                    Index Cond:
(puid.id = "outer".puid)
                            ->  Index Scan using puidjoin_puidtrack
on puidjoin  (cost=0.00..7.05 rows=2 width=8)
                                  Index Cond: ("outer".id =
puidjoin.puid)
                      ->  Index Scan using albumjoin_trackindex on
albumjoin  (cost=0.00..3.22 rows=1 width=8)
                            Index Cond: ("outer".track =
albumjoin.track)
                ->  Index Scan using album_pkey on album
(cost=0.00..5.48 rows=1 width=69)
                      Index Cond: ("outer".album = album.id)

If I'm reading this right, it looks like it's expensive because it's
doing a sequential scan on public.puid.puid to satisfy the IN clause.
(Although why it's doing that I'm not sure, given that there's a
recently analyzed index on public.puid.puid.) Interestingly, if I
replace that IN subselect with the 15 values it will return, my plan
improves by two orders of magnitude:

  Sort  (cost=235.53..235.56 rows=12 width=69)
    Sort Key: count(*)
    ->  HashAggregate  (cost=235.11..235.32 rows=12 width=69)
          Filter: (count(*) >= 6)
          ->  Nested Loop  (cost=20.03..234.96 rows=12 width=69)
                ->  Nested Loop  (cost=20.03..169.06 rows=12 width=4)
                      ->  Nested Loop  (cost=20.03..130.32 rows=12
width=4)
                            ->  Bitmap Heap Scan on puid
(cost=20.03..59.52 rows=10 width=4)
                                  Recheck Cond: ((puid =
'f68dcf86-992c-2e4a-21fb-2fc8c56edfeb'::bpchar) OR (puid =
'7716dbcf-56ab-623b-ab33-3b2e67a0727c'::bpchar) OR (puid =
'724d6a39-0d15-a296-2dd2-127c34f13809'::bpchar) OR (puid =
'02f0cd9f-9fa5-abda-06cd-5dbb13826243'::bpchar) OR (puid = '165d5bea-
b21f-9302-b991-0927f491787b'::bpchar) OR (puid = '4223dbc8-85af-a92e-
b63d-72a726475e2c'::bpchar) OR (puid = '2d43ef9a-
c7ee-2425-7fac-8b937cbed178'::bpchar) OR (puid = '9ff81c2f-04b7-
cf5d-705f-7b944a5ae093'::bpchar) OR (puid = 'deaddddd-dfaf-18dd-6d4d-
c483e8ba60f7'::bpchar) OR (puid = '20939b69-
ff98-770a-1444-3b0e9892712f'::bpchar))
                                  ->  BitmapOr  (cost=20.03..20.03
rows=10 width=0)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'f68dcf86-992c-2e4a-21fb-2fc8c56edfeb'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'7716dbcf-56ab-623b-ab33-3b2e67a0727c'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'724d6a39-0d15-a296-2dd2-127c34f13809'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'02f0cd9f-9fa5-abda-06cd-5dbb13826243'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'165d5bea-b21f-9302-b991-0927f491787b'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'4223dbc8-85af-a92e-b63d-72a726475e2c'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'2d43ef9a-c7ee-2425-7fac-8b937cbed178'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'9ff81c2f-04b7-cf5d-705f-7b944a5ae093'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'deaddddd-dfaf-18dd-6d4d-c483e8ba60f7'::bpchar)
                                        ->  Bitmap Index Scan on
puid_puidindex  (cost=0.00..2.00 rows=1 width=0)
                                              Index Cond: (puid =
'20939b69-ff98-770a-1444-3b0e9892712f'::bpchar)
                            ->  Index Scan using puidjoin_puidtrack
on puidjoin  (cost=0.00..7.05 rows=2 width=8)
                                  Index Cond: ("outer".id =
puidjoin.puid)
                      ->  Index Scan using albumjoin_trackindex on
albumjoin  (cost=0.00..3.22 rows=1 width=8)
                            Index Cond: ("outer".track =
albumjoin.track)
                ->  Index Scan using album_pkey on album
(cost=0.00..5.48 rows=1 width=69)
                      Index Cond: ("outer".album = album.id)

I guess my question is: if postgres is (correctly) estimating that
only 15 rows will come out of the subselect, and it knows it can
choose a much better plan with bitmap index scans, should it be able
to choose the bitmap plan over the sequential scan? Or should I run
the subselect myself and then rewrite my query to push in the
constant values?

Re: IN not handled very well?

From
Tom Lane
Date:
Ben <bench@silentmedia.com> writes:
>                             ->  Hash IN Join  (cost=51.67..31794.72
> rows=218 width=4)
>                                   Hash Cond: (("outer".puid)::text =
> "inner".name)
>                                   ->  Seq Scan on puid
> (cost=0.00..23495.21 rows=1099421 width=44)

>                             ->  Bitmap Heap Scan on puid
> (cost=20.03..59.52 rows=10 width=4)
>                                   Recheck Cond: ((puid =
> 'f68dcf86-992c-2e4a-21fb-2fc8c56edfeb'::bpchar) OR (puid =
> '7716dbcf-56ab-623b-ab33-3b2e67a0727c'::bpchar) OR (puid =


Apparently you've got a datatype mismatch: name is text while puid is
char(N).  The comparisons to name can't be converted into indexscans
on puid because the semantics aren't the same for text and char
comparisons.

            regards, tom lane

Re: IN not handled very well?

From
Ben
Date:
Ah, so I do. Thanks, that helps an awful lot.

But the plan is still twice as expensive as when I put in the static
values. Is it just unreasonable to expect the planner to see that
there aren't many rows in the subselect, so to use the bitmap scans
after all?

On Sep 24, 2006, at 10:57 AM, Tom Lane wrote:

> Ben <bench@silentmedia.com> writes:
>>                             ->  Hash IN Join  (cost=51.67..31794.72
>> rows=218 width=4)
>>                                   Hash Cond: (("outer".puid)::text =
>> "inner".name)
>>                                   ->  Seq Scan on puid
>> (cost=0.00..23495.21 rows=1099421 width=44)
>
>>                             ->  Bitmap Heap Scan on puid
>> (cost=20.03..59.52 rows=10 width=4)
>>                                   Recheck Cond: ((puid =
>> 'f68dcf86-992c-2e4a-21fb-2fc8c56edfeb'::bpchar) OR (puid =
>> '7716dbcf-56ab-623b-ab33-3b2e67a0727c'::bpchar) OR (puid =
>
>
> Apparently you've got a datatype mismatch: name is text while puid is
> char(N).  The comparisons to name can't be converted into indexscans
> on puid because the semantics aren't the same for text and char
> comparisons.
>
>             regards, tom lane


Re: IN not handled very well?

From
Jim Nasby
Date:
On Sep 24, 2006, at 2:12 PM, Ben wrote:
> Ah, so I do. Thanks, that helps an awful lot.
>
> But the plan is still twice as expensive as when I put in the
> static values. Is it just unreasonable to expect the planner to see
> that there aren't many rows in the subselect, so to use the bitmap
> scans after all?

Based on your initial post, it probably should know that it's only
getting 15 rows (since it did in your initial plan), so it's unclear
why it's not choosing the bitmap scan.

Can you post the results of EXPLAIN ANALYZE?
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)