Thread: Avoiding sequential scans with OR join condition
Hello. I have a query like: SELECT big_table.* FROM little_table, big_table WHERE little_table.x = 10 AND little_table.y IN (big_table.y1, big_table.y2); I have indexes on both big_table.y1 and big_table.y2 and on little_table.x and little_table.y. The result is a sequential scan of big_table. In order to prevent this, I've rewritten the query as: SELECT big_table.* FROM little_table, big_table WHERE little_table.x = 10 AND little_table.y = big_table.y1 UNION SELECT big_table.* FROM little_table, big_table WHERE little_table.x = 10 AND little_table.y = big_table.y2 which does allow an index scan, but suffers from two separate queries along with a unique sort, which, from the data, represents 90% of the tuples returned by both queries. Is there any way to write the first query such that indexes will be used? Mike Mascari
Am Samstag, 16. Oktober 2004 07:23 schrieb Mike Mascari: > Hello. I have a query like: > > SELECT big_table.* > FROM little_table, big_table > WHERE little_table.x = 10 AND > little_table.y IN (big_table.y1, big_table.y2); > > I have indexes on both big_table.y1 and big_table.y2 and on > little_table.x and little_table.y. The result is a sequential scan of > big_table. In order to prevent this, Maybe the postgres planner decided to choose a seq scan because the planner thinks it is faster, and often it is right. Did you vacuum analyze before? try: VACCUM ANALYZE; SET enable_seq_scan to off; EXPLAIN ANALYZE <your query> SET enable_seq_scan to on; EXPLAIN ANALYZE <your query> you will see why postgres planner did choose a seq scan and if it was right to do so (but never disable seq scan on production environment, not even for one query. you do not want it.) (i hope syntax is correct otherwise consult the manual) > I've rewritten the query as: > SELECT big_table.* > FROM little_table, big_table > WHERE little_table.x = 10 AND > little_table.y = big_table.y1 > UNION > SELECT big_table.* > FROM little_table, big_table > WHERE little_table.x = 10 AND > little_table.y = big_table.y2 > > which does allow an index scan, but suffers from two separate queries > along with a unique sort, which, from the data, represents 90% of the > tuples returned by both queries. this is the reason it seems why postgres choose a seq scan in the first query. if it has to scan 90% of data anyway, it is faster than doing two index lookups before. > Is there any way to write the first query such that indexes will be used? i do not know your db design but it looks queer to me to have a big_table with two columns y1 and y2 which seems to have the same meaning (some value which is compared to another value of little_table). why dont you put just one column "y" in your big_table? kind regards, janning > Mike Mascari
If the problem is the sort, use UNION ALL. As for the query restructuring, I don't know if there is a way of restructuring the query to do it in a single query. You would be able to contruct a query plan that would do it, something like: -> Nested Loop -> Append -> Index Scan on big_table.y1 -> Index Scan on big_table.y2 -> Index Scan on little_table But I have no idea how to get PostgreSQL to produce this... On Sat, Oct 16, 2004 at 01:23:09AM -0400, Mike Mascari wrote: > Hello. I have a query like: > > SELECT big_table.* > FROM little_table, big_table > WHERE little_table.x = 10 AND > little_table.y IN (big_table.y1, big_table.y2); > > I have indexes on both big_table.y1 and big_table.y2 and on > little_table.x and little_table.y. The result is a sequential scan of > big_table. In order to prevent this, I've rewritten the query as: > > SELECT big_table.* > FROM little_table, big_table > WHERE little_table.x = 10 AND > little_table.y = big_table.y1 > UNION > SELECT big_table.* > FROM little_table, big_table > WHERE little_table.x = 10 AND > little_table.y = big_table.y2 > > which does allow an index scan, but suffers from two separate queries > along with a unique sort, which, from the data, represents 90% of the > tuples returned by both queries. > > Is there any way to write the first query such that indexes will be used? > > Mike Mascari > > ---------------------------(end of broadcast)--------------------------- > TIP 3: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
Mike Mascari <mascarm@mascari.com> writes: > SELECT big_table.* > FROM little_table, big_table > WHERE little_table.x = 10 AND > little_table.y IN (big_table.y1, big_table.y2); > Is there any way to write the first query such that indexes will be used? I'm afraid you're stuck with the UNION workaround. The planner's treatment of OR indexscans is entirely separate from its treatment of join indexscans, so it's just not capable of forming the sort of plan you are envisioning. It'd be nice to improve that someday, but it'd take either a pile of duplicate code, or a fairly thorough rewrite of indxpath.c/orindxpath.c. regards, tom lane
I would use 2 left joins and use the where condition to make sure one of them is true, such as: select big_table.* from big_table left join little_table as l1 on big_table.y1=l1.y and l1.x=10 left join little_table as l2 on big_table.y2=l2.y and l1.x=10 where l1.p_key is not null and l2.p_key is not null I have never tried this in postgresql, but in my experience with various other DB engines it is a lot faster then using an or in the join and faster then a union. Thank You Sim Zacks IT Manager CompuLab 04-829-0145 - Office 04-832-5251 - Fax ________________________________________________________________________________ Hello. I have a query like: SELECT big_table.* FROM little_table, big_table WHERE little_table.x = 10 AND little_table.y IN (big_table.y1, big_table.y2); I have indexes on both big_table.y1 and big_table.y2 and on little_table.x and little_table.y. The result is a sequential scan of big_table. In order to prevent this, I've rewritten the query as: SELECT big_table.* FROM little_table, big_table WHERE little_table.x = 10 AND little_table.y = big_table.y1 UNION SELECT big_table.* FROM little_table, big_table WHERE little_table.x = 10 AND little_table.y = big_table.y2 which does allow an index scan, but suffers from two separate queries along with a unique sort, which, from the data, represents 90% of the tuples returned by both queries. Is there any way to write the first query such that indexes will be used? Mike Mascari ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly
Sim Zacks wrote: > I would use 2 left joins and use the where condition to make sure one > of them is true, such as: > > select big_table.* from > big_table left join little_table as l1 on big_table.y1=l1.y and > l1.x=10 > left join little_table as l2 on big_table.y2=l2.y and l1.x=10 > where l1.p_key is not null and l2.p_key is not null > > I have never tried this in postgresql, but in my experience with > various other DB engines it is a lot faster then using an or in the > join and faster then a union. Wow! Thanks! That certainly did the trick. Mike Mascari
Mike Mascari wrote: > Sim Zacks wrote: > >> I would use 2 left joins and use the where condition to make sure one >> of them is true, such as: >> >> select big_table.* from >> big_table left join little_table as l1 on big_table.y1=l1.y and >> l1.x=10 >> left join little_table as l2 on big_table.y2=l2.y and l1.x=10 >> where l1.p_key is not null and l2.p_key is not null >> >> I have never tried this in postgresql, but in my experience with >> various other DB engines it is a lot faster then using an or in the >> join and faster then a union. > > Wow! Thanks! That certainly did the trick. I'm thinking that the WHERE clauses condition should read: WHERE l1.p_pkey is not null OR l2.p_key is not null; My condition for a given selection of a big_table tuple is that either y1 or y2 exist as a valid x from little_table. So I think I need an OR instead of an AND. And AND condition would require that both y1 and y2 for the sample tuple of big_table be a valid x from little_table. Correct? Mike Mascari
Mike, You are probably correct, I was thinking in English, not SQL. That's what happens when I bang code too early in the morning. Thank You Sim Zacks IT Manager CompuLab 04-829-0145 - Office 04-832-5251 - Fax ________________________________________________________________________________ Mike Mascari wrote: > Sim Zacks wrote: > >> I would use 2 left joins and use the where condition to make sure one >> of them is true, such as: >> >> select big_table.* from >> big_table left join little_table as l1 on big_table.y1=l1.y and >> l1.x=10 >> left join little_table as l2 on big_table.y2=l2.y and l1.x=10 >> where l1.p_key is not null and l2.p_key is not null >> >> I have never tried this in postgresql, but in my experience with >> various other DB engines it is a lot faster then using an or in the >> join and faster then a union. > > Wow! Thanks! That certainly did the trick. I'm thinking that the WHERE clauses condition should read: WHERE l1.p_pkey is not null OR l2.p_key is not null; My condition for a given selection of a big_table tuple is that either y1 or y2 exist as a valid x from little_table. So I think I need an OR instead of an AND. And AND condition would require that both y1 and y2 for the sample tuple of big_table be a valid x from little_table. Correct? Mike Mascari
On Sun, Oct 17, 2004 at 03:30:32 -0400, Mike Mascari <mascarm@mascari.com> wrote: > > I'm thinking that the WHERE clauses condition should read: > > WHERE l1.p_pkey is not null OR l2.p_key is not null; That seems to make more sense. I was puzzling about that condition myself. If both keys where not null, there wouldn't even be a need for "left" joins. Note that the output is different than you were originally getting as well, since previously rows looked like one big table row combined with one little table row. Now you are getting one big table row combined with two little table rows (one of which might be null). This is probably still faster, but you will need to change how you use the output.