Thread: Idea for speeding up uncorrelated subqueries
Someone was just complaining over in the sql list about the poor performance of select name,description from descriptions where name in (select name from descriptions where description like '%Bankverbindung%'); Since the inner query is uncorrelated with the outer, there's really no need to execute it more than once, but currently it's re-executed each time through the outer plan. I wonder whether it wouldn't be a good idea to force a Materialize node to be added to the top of an uncorrelated subplan? Then at least the re-executions would be pretty cheap... regards, tom lane
> Someone was just complaining over in the sql list about the poor > performance of > > select name,description from descriptions > where name in (select name > from descriptions > where description like '%Bankverbindung%'); > > Since the inner query is uncorrelated with the outer, there's really > no need to execute it more than once, but currently it's re-executed > each time through the outer plan. > > I wonder whether it wouldn't be a good idea to force a Materialize > node to be added to the top of an uncorrelated subplan? Then at > least the re-executions would be pretty cheap... Yes, the subqueries need work. We don't even do index lookups into the inner plan, only sequential. Already on TODO. The multiple query executions are not on the TODO list. Not sure why this is happening here. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian <maillist@candle.pha.pa.us> writes: > Yes, the subqueries need work. We don't even do index lookups into the > inner plan, only sequential. Already on TODO. Huh? I don't follow that at all... > The multiple query executions are not on the TODO list. Not sure why > this is happening here. After looking at subselect.c I think I understand why --- InitPlans are only for subqueries that are known to return a *single* reslt. When you have a subquery that might potentially return many, many tuples, you need to scan through those tuples, so we use SubPlan tactics even if there's not a query correlation. But this neglects the cost of re-executing the subplan over and over. Materializing the result should help, no? (Of course there are cases where it won't, such as when the subplan is just an unqualified select, but most of the time it should be a win, I think...) regards, tom lane
> Bruce Momjian <maillist@candle.pha.pa.us> writes: > > Yes, the subqueries need work. We don't even do index lookups into the > > inner plan, only sequential. Already on TODO. > > Huh? I don't follow that at all... Suppose you have a subquery that returns 1000 rows. There is no code so lookups of the inner table are indexed: select *from tabwhere col in (select col2 from tab2) In this case, a sequential scan of the subquery results are required. I didn't think the subquery was executed every time it needed to see if col1 was in the subquery. > > > The multiple query executions are not on the TODO list. Not sure why > > this is happening here. > > After looking at subselect.c I think I understand why --- InitPlans are > only for subqueries that are known to return a *single* reslt. When you > have a subquery that might potentially return many, many tuples, you > need to scan through those tuples, so we use SubPlan tactics even if > there's not a query correlation. > > But this neglects the cost of re-executing the subplan over and over. > Materializing the result should help, no? (Of course there are cases > where it won't, such as when the subplan is just an unqualified select, > but most of the time it should be a win, I think...) No what Vadim is done MVCC, I would like to bug him to improve subquery performance. We are tweeking the optimizer, but we have this huge subquery performance problem here. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian <maillist@candle.pha.pa.us> writes: >> Bruce Momjian <maillist@candle.pha.pa.us> writes: >>>> Yes, the subqueries need work. We don't even do index lookups into the >>>> inner plan, only sequential. Already on TODO. >> >> Huh? I don't follow that at all... > Suppose you have a subquery that returns 1000 rows. There is no code so > lookups of the inner table are indexed: > select * > from tab > where col in (select col2 from tab2) > In this case, a sequential scan of the subquery results are required. Well, yes, the subquery is a sequential scan. I guess what you are envisioning is rewriting this into some kind of nested-loop join? For simple cases that might be possible... regards, tom lane
> Bruce Momjian <maillist@candle.pha.pa.us> writes: > >> Bruce Momjian <maillist@candle.pha.pa.us> writes: > >>>> Yes, the subqueries need work. We don't even do index lookups into the > >>>> inner plan, only sequential. Already on TODO. > >> > >> Huh? I don't follow that at all... > > > Suppose you have a subquery that returns 1000 rows. There is no code so > > lookups of the inner table are indexed: > > > select * > > from tab > > where col in (select col2 from tab2) > > > In this case, a sequential scan of the subquery results are required. > > Well, yes, the subquery is a sequential scan. I guess what you are > envisioning is rewriting this into some kind of nested-loop join? > For simple cases that might be possible... Yes, or mergejoin/hashjoin. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian wrote: > > Yes, the subqueries need work. We don't even do index lookups into the > inner plan, only sequential. Already on TODO. The multiple query ^^^^^^^^^^^^^^^ What? Indices are used when appropriate. Vadim
> Bruce Momjian wrote: > > > > Yes, the subqueries need work. We don't even do index lookups into the > > inner plan, only sequential. Already on TODO. The multiple query > ^^^^^^^^^^^^^^^ > What? Indices are used when appropriate. Sorry, bad wording. My English should be better. :-) I meant to say that joins from the outer plan to subplan are always nested loops, not hash or mergejoins. If you have something like: delete from tab where col in (select col2 from largetable) it takes a long time, when it really should be quick. That is why people are encouraged to use EXISTS. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian wrote: > > Suppose you have a subquery that returns 1000 rows. There is no code so > lookups of the inner table are indexed: > > select * > from tab > where col in (select col2 from tab2) > > In this case, a sequential scan of the subquery results are required. I > didn't think the subquery was executed every time it needed to see if > col1 was in the subquery. Oh, well. These are cases when query may be rewritten with EXISTS. This is possible when there are no aggregates in subquery. > > After looking at subselect.c I think I understand why --- InitPlans are > > only for subqueries that are known to return a *single* reslt. When you > > have a subquery that might potentially return many, many tuples, you > > need to scan through those tuples, so we use SubPlan tactics even if > > there's not a query correlation. Yes. But as I said already, you can use InitPlan (i.e. execute subquery first) after removing duplicates from subquery results. > > But this neglects the cost of re-executing the subplan over and over. > > Materializing the result should help, no? (Of course there are cases We could not only cache subquery results (materialization) but also hash them. > > where it won't, such as when the subplan is just an unqualified select, > > but most of the time it should be a win, I think...) In such cases, if there are no aggregates in subquery then EXISTS could be used else materialization will still help. > No what Vadim is done MVCC, I would like to bug him to improve subquery > performance. We are tweeking the optimizer, but we have this huge > subquery performance problem here. No, Bruce. I'm in WAL now. I think that we need in recovery (remember that you'll lose indices being updated when some crash took place), fast backup (it's easy to copy part of log than dump 1Gb table), fast commits (<= 1 fsync per commit using group commit, instead of >= 2 fsyncs now), savepoints AND buffered logging, which you, Bruce, want so much, and so long -:). Vadim
> > > where it won't, such as when the subplan is just an unqualified select, > > > but most of the time it should be a win, I think...) > > In such cases, if there are no aggregates in subquery then EXISTS > could be used else materialization will still help. > > > No what Vadim is done MVCC, I would like to bug him to improve subquery > > performance. We are tweeking the optimizer, but we have this huge > > subquery performance problem here. > > No, Bruce. I'm in WAL now. I think that we need in recovery > (remember that you'll lose indices being updated when some > crash took place), fast backup (it's easy to copy part of log > than dump 1Gb table), fast commits (<= 1 fsync per commit > using group commit, instead of >= 2 fsyncs now), savepoints > AND buffered logging, which you, Bruce, want so much, > and so long -:). Oh, no, I have been outmaneuvered by Vadim. Help. Isn't it something that takes only a few hours to implement. We can't keep telling people to us EXISTS, especially because most SQL people think correlated queries are slower that non-correlated ones. Can we just on-the-fly rewrite the query to use exists? -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian wrote: > > Isn't it something that takes only a few hours to implement. We can't > keep telling people to us EXISTS, especially because most SQL people > think correlated queries are slower that non-correlated ones. Can we > just on-the-fly rewrite the query to use exists? This seems easy to implement. We could look does subquery have aggregates or not before calling union_planner() in subselect.c:_make_subplan() and rewrite it (change slink->subLinkType from IN to EXISTS and add quals). Without caching implemented IN-->EXISTS rewriting always has sence. After implementation of caching we probably should call union_planner() for both original/modified subqueries and compare costs/sizes of EXISTS/IN_with_caching plans and maybe even make decision what plan to use after parent query is planned and we know for how many parent rows subplan will be executed. Vadim
Bruce Momjian <maillist@candle.pha.pa.us> writes: > Isn't it something that takes only a few hours to implement. We can't > keep telling people to us EXISTS, especially because most SQL people > think correlated queries are slower that non-correlated ones. Can we > just on-the-fly rewrite the query to use exists? I was just about to suggest exactly that. The "IN (subselect)" notation seems to be a lot more intuitive --- at least, people keep coming up with it --- so why not rewrite it to the EXISTS form, if we can handle that more efficiently? regards, tom lane
> Bruce Momjian <maillist@candle.pha.pa.us> writes: > > Isn't it something that takes only a few hours to implement. We can't > > keep telling people to us EXISTS, especially because most SQL people > > think correlated queries are slower that non-correlated ones. Can we > > just on-the-fly rewrite the query to use exists? > > I was just about to suggest exactly that. The "IN (subselect)" > notation seems to be a lot more intuitive --- at least, people > keep coming up with it --- so why not rewrite it to the EXISTS > form, if we can handle that more efficiently? Yes, we have the nice subselect feature. I just hate to see it not completely finished, performance-wise. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026