Thread: IN subquery not using a hash
For the following query, postgres is running the IN subquery over and over again (once for each row scanned in the parent table.) I would have expected it to run the whole query once and create a hash which would then be probed once for every row scanned in the parent table. I assumed that it was not doing so because it thought that the resulting hash table would exceed sort_mem, but setting sort_mem to half a gigabyte did not make any difference. Is there some other reason that the optimizer is not using a hash table? 563 pages * 8 KB per page * 296 tuples fetched / 52085 tuples in the whole table = 25 KB. Shouldn't the optimizer think that the subquery will only fetch 25 KB worth of rows? (Later, I realized that the official name for "sort_mem" is now work_mem. Now, does this mean that my set sort_mem = 500000 did not do anything?) Regards, Paul Tillotson omnis=> select version(); version --------------------------------------------------------------------------------------------------------- PostgreSQL 8.0.3 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.3 20040412 (Red Hat Linux 3.3.3-7) (1 row) omnis=> show sort_mem; 20000 omnis=> set sort_mem = 500000; SET omnis=> explain analyze select 1 from parents where lname = 'SMITH' OR parentid IN (SELECT parentid FROM child where childlast = 'SMITH'); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------- Seq Scan on parents (cost=357.78..226649.83 rows=59785 width=0) (actual time=127.644..104568.639 rows=855 loops=1) Filter: (((lname)::text = 'SMITH'::text) OR (subplan)) SubPlan -> Materialize (cost=357.78..360.74 rows=296 width=4) (actual time=0.001..0.257 rows=313 loops=117943) -> Index Scan using child_childlast_index on child (cost=0.00..357.48 rows=296 width=4) (actual time=0.073..1.325 rows=313 loops=1) Index Cond: ((childlast)::text = 'SMITH'::text) Total runtime: 104569.800 ms (7 rows) omnis=> select reltuples, relpages from pg_class where relname = 'child'; reltuples | relpages -----------+---------- 52085 | 563 (1 row)
Paul Tillotson <spam1011@adelphia.net> writes: > For the following query, postgres is running the IN subquery over and > over again (once for each row scanned in the parent table.) > I would have expected it to run the whole query once and create a hash > which would then be probed once for every row scanned in the parent > table. I assumed that it was not doing so because it thought that the > resulting hash table would exceed sort_mem, Hardly likely, considering it's estimating only 296 rows in the subquery output. My bet is that you've chosen a datatype whose comparisons are not hashable (like char(n)). What is the datatype of parentid in these tables, anyway? regards, tom lane
Tom Lane wrote: >Paul Tillotson <spam1011@adelphia.net> writes: > > >>For the following query, postgres is running the IN subquery over and >>over again (once for each row scanned in the parent table.) >>I would have expected it to run the whole query once and create a hash >>which would then be probed once for every row scanned in the parent >>table. I assumed that it was not doing so because it thought that the >>resulting hash table would exceed sort_mem, >> >> > >Hardly likely, considering it's estimating only 296 rows in the subquery >output. My bet is that you've chosen a datatype whose comparisons are >not hashable (like char(n)). What is the datatype of parentid in these >tables, anyway? > > regards, tom lane > > > I don't have access to the machine now, but my memory is that parent.parentid is numeric(10,2) and child.parentid is int. If child.parentid is int and parent.parentid is numeric, would that cause this? (Not good database design, I know.) I am 100% certain that neither of these are char(n), and 99% certain that they are either numeric or int. Paul Tillotson
On Wed, Jul 20, 2005 at 08:11:46PM -0400, Paul Tillotson wrote: > Tom Lane wrote: > >Hardly likely, considering it's estimating only 296 rows in the subquery > >output. My bet is that you've chosen a datatype whose comparisons are > >not hashable (like char(n)). What is the datatype of parentid in these > >tables, anyway? > > > I don't have access to the machine now, but my memory is that > parent.parentid is numeric(10,2) and child.parentid is int. Numeric isn't hashable, but I don't know enough about the internals to say why. Tom? Why different types, and why numeric for one of them? Why not integer for both? -- Michael Fuhr http://www.fuhr.org/~mfuhr/
Paul Tillotson <spam1011@adelphia.net> writes: > Tom Lane wrote: >> Hardly likely, considering it's estimating only 296 rows in the subquery >> output. My bet is that you've chosen a datatype whose comparisons are >> not hashable (like char(n)). What is the datatype of parentid in these >> tables, anyway? >> > I don't have access to the machine now, but my memory is that > parent.parentid is numeric(10,2) and child.parentid is int. Offhand I don't believe there are any hashable crosstype comparisons. In this case the int is probably getting promoted to numeric, but I think numeric comparison isn't hashable either (because for example '0.0' = '0.000' but the internal representations are different). regards, tom lane
Tom Lane wrote: >Paul Tillotson <spam1011@adelphia.net> writes: > > >>Tom Lane wrote: >> >> >>>Hardly likely, considering it's estimating only 296 rows in the subquery >>>output. My bet is that you've chosen a datatype whose comparisons are >>>not hashable (like char(n)). What is the datatype of parentid in these >>>tables, anyway? >>> >>> >>> >>I don't have access to the machine now, but my memory is that >>parent.parentid is numeric(10,2) and child.parentid is int. >> >> > >Offhand I don't believe there are any hashable crosstype comparisons. >In this case the int is probably getting promoted to numeric, but I >think numeric comparison isn't hashable either (because for example >'0.0' = '0.000' but the internal representations are different). > > This is apparently the trouble. This query doesn't use a hash: SELECT * FROM table1 WHERE <condition> OR numeric1 IN (SELECT int1 FROM table2) But, this query (identical except for the cast) does: SELECT * FROM table1 WHERE <condition> OR numeric1::int IN (SELECT int1 FROM table2) Thanks for the help, Tom and others. Paul Tillotson