Thread: IN subquery not using a hash

IN subquery not using a hash

From
Paul Tillotson
Date:
For the following query, postgres is running the IN subquery over and
over again (once for each row scanned in the parent table.)


I would have expected it to run the whole query once and create a hash
which would then be probed once for every row scanned in the parent
table.  I assumed that it was not doing so because it thought that the
resulting hash table would exceed sort_mem, but setting sort_mem to half
a gigabyte did not make any difference.  Is there some other reason that
the optimizer is not using a hash table?


563 pages * 8 KB per page * 296 tuples fetched / 52085 tuples in the
whole table = 25 KB.  Shouldn't the optimizer think that the subquery
will only fetch 25 KB worth of rows?

(Later, I realized that the official name for "sort_mem" is now
work_mem.  Now, does this mean that my set sort_mem = 500000 did not do
anything?)

Regards,
Paul Tillotson

omnis=> select version();
version
---------------------------------------------------------------------------------------------------------
PostgreSQL 8.0.3 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.3
20040412 (Red Hat Linux 3.3.3-7)
(1 row)
omnis=> show sort_mem;
20000
omnis=> set sort_mem = 500000;
SET
omnis=> explain analyze select 1 from parents where lname = 'SMITH' OR
parentid IN (SELECT parentid FROM child where childlast = 'SMITH');
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on parents (cost=357.78..226649.83 rows=59785 width=0) (actual
time=127.644..104568.639 rows=855 loops=1)
    Filter: (((lname)::text = 'SMITH'::text) OR (subplan))
    SubPlan
        -> Materialize (cost=357.78..360.74 rows=296 width=4) (actual
time=0.001..0.257 rows=313 loops=117943)
            -> Index Scan using child_childlast_index on child
(cost=0.00..357.48 rows=296 width=4) (actual time=0.073..1.325 rows=313
loops=1)
            Index Cond: ((childlast)::text = 'SMITH'::text)
Total runtime: 104569.800 ms
(7 rows)

omnis=> select reltuples, relpages from pg_class where relname = 'child';
reltuples | relpages
-----------+----------
52085 | 563
(1 row)


Re: IN subquery not using a hash

From
Tom Lane
Date:
Paul Tillotson <spam1011@adelphia.net> writes:
> For the following query, postgres is running the IN subquery over and
> over again (once for each row scanned in the parent table.)
> I would have expected it to run the whole query once and create a hash
> which would then be probed once for every row scanned in the parent
> table.  I assumed that it was not doing so because it thought that the
> resulting hash table would exceed sort_mem,

Hardly likely, considering it's estimating only 296 rows in the subquery
output.  My bet is that you've chosen a datatype whose comparisons are
not hashable (like char(n)).  What is the datatype of parentid in these
tables, anyway?

            regards, tom lane

Re: IN subquery not using a hash

From
Paul Tillotson
Date:
Tom Lane wrote:

>Paul Tillotson <spam1011@adelphia.net> writes:
>
>
>>For the following query, postgres is running the IN subquery over and
>>over again (once for each row scanned in the parent table.)
>>I would have expected it to run the whole query once and create a hash
>>which would then be probed once for every row scanned in the parent
>>table.  I assumed that it was not doing so because it thought that the
>>resulting hash table would exceed sort_mem,
>>
>>
>
>Hardly likely, considering it's estimating only 296 rows in the subquery
>output.  My bet is that you've chosen a datatype whose comparisons are
>not hashable (like char(n)).  What is the datatype of parentid in these
>tables, anyway?
>
>            regards, tom lane
>
>
>
I don't have access to the machine now, but my memory is that
parent.parentid is numeric(10,2) and child.parentid is int.    If
child.parentid is int and parent.parentid is numeric, would that cause
this?  (Not good database design, I know.)

I am 100% certain that neither of these are char(n), and 99% certain
that they are either numeric or int.

Paul Tillotson


Re: IN subquery not using a hash

From
Michael Fuhr
Date:
On Wed, Jul 20, 2005 at 08:11:46PM -0400, Paul Tillotson wrote:
> Tom Lane wrote:
> >Hardly likely, considering it's estimating only 296 rows in the subquery
> >output.  My bet is that you've chosen a datatype whose comparisons are
> >not hashable (like char(n)).  What is the datatype of parentid in these
> >tables, anyway?
> >
> I don't have access to the machine now, but my memory is that
> parent.parentid is numeric(10,2) and child.parentid is int.

Numeric isn't hashable, but I don't know enough about the internals
to say why.  Tom?

Why different types, and why numeric for one of them?  Why not
integer for both?

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

Re: IN subquery not using a hash

From
Tom Lane
Date:
Paul Tillotson <spam1011@adelphia.net> writes:
> Tom Lane wrote:
>> Hardly likely, considering it's estimating only 296 rows in the subquery
>> output.  My bet is that you've chosen a datatype whose comparisons are
>> not hashable (like char(n)).  What is the datatype of parentid in these
>> tables, anyway?
>>
> I don't have access to the machine now, but my memory is that
> parent.parentid is numeric(10,2) and child.parentid is int.

Offhand I don't believe there are any hashable crosstype comparisons.
In this case the int is probably getting promoted to numeric, but I
think numeric comparison isn't hashable either (because for example
'0.0' = '0.000' but the internal representations are different).

            regards, tom lane

Re: IN subquery not using a hash

From
Paul Tillotson
Date:
Tom Lane wrote:

>Paul Tillotson <spam1011@adelphia.net> writes:
>
>
>>Tom Lane wrote:
>>
>>
>>>Hardly likely, considering it's estimating only 296 rows in the subquery
>>>output.  My bet is that you've chosen a datatype whose comparisons are
>>>not hashable (like char(n)).  What is the datatype of parentid in these
>>>tables, anyway?
>>>
>>>
>>>
>>I don't have access to the machine now, but my memory is that
>>parent.parentid is numeric(10,2) and child.parentid is int.
>>
>>
>
>Offhand I don't believe there are any hashable crosstype comparisons.
>In this case the int is probably getting promoted to numeric, but I
>think numeric comparison isn't hashable either (because for example
>'0.0' = '0.000' but the internal representations are different).
>
>
This is apparently the trouble.  This query doesn't use a hash:

SELECT * FROM table1
WHERE <condition> OR numeric1 IN (SELECT int1 FROM table2)

But, this query (identical except for the cast) does:

SELECT * FROM table1
WHERE <condition> OR numeric1::int IN (SELECT int1 FROM table2)

Thanks for the help, Tom and others.

Paul Tillotson