Thread: index not used with subselect in where clause ?
Good evening (GMT+0200) everybody... I have a table with an indexed integer attribute used to determine which rows to fetch from my data base. (pgsql version is 7.1RC4) When doing a SELECT attr1 FROM table1 WHERE attr1 IN (<list of ints>) AND <more conditions>; where <list of ints> is entered explicitly (or generated by a program) everything works fine and fast using the index on attr1. But when I try doing a SELECT attr1 FROM table1 WHERE attr1 IN (<SUBSELECT returning list of ints>) AND <more conditions>; then the SELECT on table1 uses a sequential scan running 'endlessly'. Is this really how it's meant to be? If so, why? If not, how can I get the second version working??? Thanks in advance... Christian -- "The sky above the port was the color of television, tuned to a dead channel." -- W.G. --
On Mon, 16 Apr 2001, Christian Fritze wrote: > SELECT attr1 FROM table1 WHERE attr1 IN (<list of ints>) > AND <more conditions>; > > where <list of ints> is entered explicitly (or generated by a > program) everything works fine and fast using the index on attr1. > > But when I try doing a > > SELECT attr1 FROM table1 WHERE attr1 IN (<SUBSELECT returning list of ints>) > AND <more conditions>; > > then the SELECT on table1 uses a sequential scan running 'endlessly'. From the FAQ: 4.23) Why are my subqueries using IN so slow? Currently, we join subqueries to outer queries by sequentially scanning the result of the subquery for each row of the outer query. A workaround is to replace IN with EXISTS: SELECT * FROM tab WHERE col1 IN (SELECT col2 FROM TAB2) to: SELECT * FROM tab WHERE EXISTS (SELECT col2 FROM TAB2 WHERE col1 = col2) We hope to fix this limitation in a future release.
Hello and thank you for your quick reply... > > But when I try doing a > > > > SELECT attr1 FROM table1 WHERE attr1 IN (<SUBSELECT returning list of ints>) > > AND <more conditions>; > > > > then the SELECT on table1 uses a sequential scan running 'endlessly'. > > >From the FAQ: > > 4.23) Why are my subqueries using IN so slow? OOOPS, I confess that I didn't think of revisiting the source tree after installing the complete HTML docs. I don't think the FAQ is in there, is it? Maybe this would be helpful... Anyway the solution from the FAQ doesn't seem to help me. Maybe I wasn't clear enough on what I'm trying to do, so here is a 'real world' example: 1. That's my situation now: finn@chatsubo:~ > /Projekte/ENDEBIT/packages/pgsql-Test/bin/psql -p 7432 zewtest_7.1 -c "explain select * from allmain where idn in (select distinct dokids_as_int from allslwfull where wort_nouml_lower like 'gen%')" NOTICE: QUERY PLAN: Seq Scan on allmain (cost=0.00..69328.08 rows=19619 width=556) SubPlan -> Materialize (cost=3.45..3.45 rows=1 width=4) -> Unique (cost=3.45..3.45 rows=1 width=4) -> Sort (cost=3.45..3.45 rows=1 width=4) -> Index Scan using allslwfull_low_idx on allslwfull (cost=0.00..3.44 rows=1 width=4) EXPLAIN 2. FAQ 4.23 now seems to suggest something like finn@chatsubo:~ > /Projekte/ENDEBIT/packages/pgsql-Test/bin/psql -p 7432 zewtest_7.1 -c "explain select * from allmain where exists (select distinct dokids_as_int from allslwfull where dokids_as_int = idn and wort_nouml_lower like 'gen%')" NOTICE: QUERY PLAN: Seq Scan on allmain (cost=0.00..69328.08 rows=19619 width=556) SubPlan -> Unique (cost=3.45..3.46 rows=1 width=4) -> Sort (cost=3.45..3.45 rows=1 width=4) -> Index Scan using allslwfull_low_idx on allslwfull (cost=0.00..3.44 rows=1 width=4) EXPLAIN Doesn't seem to make much of a difference... ;-) 3. On the other hand: finn@chatsubo:~ > /Projekte/ENDEBIT/packages/pgsql-Test/bin/psql -p 7432 zewtest_7.1 -c "explain select distinct dokids_as_int from allslwfull where wort_nouml_lower like 'gen%'" NOTICE: QUERY PLAN: Unique (cost=3.45..3.45 rows=1 width=4) -> Sort (cost=3.45..3.45 rows=1 width=4) -> Index Scan using allslwfull_low_idx on allslwfull (cost=0.00..3.44 rows=1 width=4) EXPLAIN finn@chatsubo:~ > /Projekte/ENDEBIT/packages/pgsql-Test/bin/psql -p 7432 zewtest_7.1 -c "explain select * from allmain where idn in (677676, 34487, 45353)" NOTICE: QUERY PLAN: Index Scan using allmainidn_idx, allmainidn_idx, allmainidn_idx on allmain (cost=0.00..10.44 rows=1 width=556) EXPLAIN Being far from knowlegeable in terms of database theory and looking just at the figures returned by EXPLAIN I wondered if it wouldn't be much faster to simply evaluate the inner query and hand the result over to the outer query instead of performing an expensive join behind the scenes... Am I thinking too naively here? greetings... Christian -- "The sky above the port was the color of television, tuned to a dead channel." -- W.G. --
Christian Fritze <The.Finn@sprawl.de> writes: > explain select * from allmain where exists (select distinct > dokids_as_int from allslwfull where dokids_as_int = idn and > wort_nouml_lower like 'gen%') Try dropping the "distinct" on the inner select. As a moment's thought will reveal, it's not buying you anything; and it's costing you sort and unique passes over the subplan result. regards, tom lane
Adding a LIMIT 1 in the subplan may also help -- as you only need a single match to make it true so additional finds are useless -- it'll stop sooner or will be more likely to use an index than a full table scan. -- Rod Taylor There are always four sides to every story: your side, their side, the truth, and what really happened. ----- Original Message ----- From: "Tom Lane" <tgl@sss.pgh.pa.us> To: "Christian Fritze" <The.Finn@sprawl.de> Cc: "Stephan Szabo" <sszabo@megazone23.bigpanda.com>; <pgsql-general@postgresql.org> Sent: Tuesday, April 17, 2001 10:33 AM Subject: Re: [GENERAL] index not used with subselect in where clause ? > Christian Fritze <The.Finn@sprawl.de> writes: > > explain select * from allmain where exists (select distinct > > dokids_as_int from allslwfull where dokids_as_int = idn and > > wort_nouml_lower like 'gen%') > > Try dropping the "distinct" on the inner select. As a moment's thought > will reveal, it's not buying you anything; and it's costing you sort > and unique passes over the subplan result. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/users-lounge/docs/faq.html >
"Rod Taylor" <rod.taylor@inquent.com> writes: > Adding a LIMIT 1 in the subplan may also help -- as you only need a > single match to make it true so additional finds are useless -- it'll > stop sooner or will be more likely to use an index than a full table > scan. I believe this is not necessary; the WHERE EXISTS(...) context is sufficient to cue the planner that only one tuple will be retrieved, and it will alter the subselect's plan accordingly. regards, tom lane
> Adding a LIMIT 1 in the subplan may also help -- as you only need a > single match to make it true so additional finds are useless -- it'll > stop sooner or will be more likely to use an index than a full table > scan. > -- > Rod Taylor I'm not sure if I understand you correctly here: the subplan uses an index scan already. It's the seq. scan in the outer query that makes me whine. > There are always four sides to every story: your side, their side, the > truth, and what really happened. Off Topic: I like that .sig, thoughts of "The man who shot Liberty Valance" come to mind... :-) > From: "Tom Lane" <tgl@sss.pgh.pa.us> [...] > > Christian Fritze <The.Finn@sprawl.de> writes: > > > explain select * from allmain where exists (select distinct > > > dokids_as_int from allslwfull where dokids_as_int = idn and > > > wort_nouml_lower like 'gen%') > > > > Try dropping the "distinct" on the inner select. As a moment's Yep, that increases performance... ...by about 0.35 % according to EXPLAIN :-{ Well, what I'm doing right now is the following: I perform the inner query (which is reasonably fast) and pump the result through the JDBC driver into my application. There I build the outer query with an explicit list of integers for the WHERE clause and hand that query back to the data base. But that doesn't seem very smart either: in cases where the inner query returns only a few results it's not really necessary. In cases where it returns a few thousands, I need to split the outer query in order not to run into a 'query too long' error (which comes from the jdbc driver rather than from pgsql if I'm right? Maybe I should try to tweak that driver? Hmmm...) That splitting however eats away much (if not all) of the intended performance gain. greetings... Christian -- "The sky above the port was the color of television, tuned to a dead channel." -- W.G. --