Thread: Good Optimization
There is a simple way to optimize SQL queries involving joins to PostgreSQL that I think should be handled by Postgre? If one is joining a tables a,b on attribute "x" and if one has something like x=3 then it helps A LOT to say: a.x=3 and b.x=3 in addition to saying a.x=b.x ... The example below shoulds the radical speed gain of doing this, and I think it isn't something real obvious to most people... Of course it could just be a common thing to do in SQL, anyway, just thought I'd let you all know what I discovered. Here is an example: ftc=> explain select * from po,tickets where po_id=material_po and po_id=8888 ; NOTICE: QUERY PLAN: Nested Loop (cost=401.34 size=6146 width=158) -> Index Scan using ipo_po_id_units on po (cost=2.05 size=2 width=94) -> Index Scan using itickets_mpou on tickets (cost=199.64 size=70650 width=6 4) EXPLAIN ftc=> ftc=> explain select * from po,tickets where po_id=material_po and po_id=8888 an d material_po=8888; NOTICE: QUERY PLAN: Nested Loop (cost=21.42 size=268 width=158) -> Index Scan using ipo_po_id_units on po (cost=2.05 size=2 width=94) -> Index Scan using itickets_material_po on tickets (cost=9.68 size=3073 wid th=64) EXPLAIN ftc=>
On Wed, 7 Jul 1999, secret wrote: > > There is a simple way to optimize SQL queries involving joins to > PostgreSQL that I think should be handled by Postgre? If one is joining > a tables a,b on attribute "x" and if one has something like x=3 then it > helps A LOT to say: a.x=3 and b.x=3 in addition to saying a.x=b.x ... > The example below shoulds the radical speed gain of doing this, and I > think it isn't something real obvious to most people... > Also if the optimizer works similar to Oracle then the order of where statements would also help in actual performance. I'm mostly a lurker on this list but have never seen anything about this issue. When using Oracle, and won't it be nice when people mentionPostgreSQL functionality when comparing other databases, the statements at the bottom should be the most restrictive and joins should be near the top. select * from po,tickets where po_id=material_po and po_id=8888 ; Would be the best. Whereas select * from po,tickets where po_id=8888 and po_id=material_po; would do the join then select those tuples that have a po_id of 8888. Of course this is probably what PostgreSQL is doing already without a requirement for the positioning. Rod -- Roderick A. Anderson raanders@altoplanos.net Altoplanos Information Systems, Inc. Voice: 208.765.6149 212 S. 11th Street, Suite 5 FAX: 208.664.5299 Coeur d'Alene, ID 83814
secret <secret@kearneydev.com> writes: > There is a simple way to optimize SQL queries involving joins to > PostgreSQL that I think should be handled by Postgre? If one is joining > a tables a,b on attribute "x" and if one has something like x=3 then it > helps A LOT to say: a.x=3 and b.x=3 in addition to saying a.x=b.x ... > The example below shoulds the radical speed gain of doing this, and I > think it isn't something real obvious to most people... How much *actual* speedup is there? I don't trust the optimizer's numbers as anything more than relative measures ;-) I'm a bit surprised that you are getting a nested-loop plan and not a merge or hash join. With a merge join, at least, there ought not be a large difference from providing the additional qual clause (I think). What Postgres version are you using? regards, tom lane
"Roderick A. Anderson" <raanders@altoplanos.net> writes: > Also if the optimizer works similar to Oracle then the order of > where statements would also help in actual performance. I'm mostly a > lurker on this list but have never seen anything about this issue. > When using Oracle, and won't it be nice when people mention PostgreSQL > functionality when comparing other databases, the statements at the > bottom should be the most restrictive and joins should be near the top. The Postgres optimizer doesn't care about the order of where clauses; it'll pick out the clauses it thinks are most effective regardless of the order you write them in. regards, tom lane
On Wed, 7 Jul 1999, Tom Lane wrote: > The Postgres optimizer doesn't care about the order of where clauses; > it'll pick out the clauses it thinks are most effective regardless > of the order you write them in. > > regards, tom lane As I figured it would be as soon as I opened my big mouth. Rod -- Roderick A. Anderson raanders@altoplanos.net Altoplanos Information Systems, Inc. Voice: 208.765.6149 212 S. 11th Street, Suite 5 FAX: 208.664.5299 Coeur d'Alene, ID 83814
> Also if the optimizer works similar to Oracle then the order of > where statements would also help in actual performance. I'm mostly a > lurker on this list but have never seen anything about this issue. > When using Oracle, and won't it be nice when people mention PostgreSQL > functionality when comparing other databases, the statements at the > bottom should be the most restrictive and joins should be near the top. > > select * > from po,tickets > where po_id=material_po > and po_id=8888 ; > > Would be the best. Whereas > > select * > from po,tickets > where po_id=8888 > and po_id=material_po; > > would do the join then select those tuples that have a po_id of 8888. > > Of course this is probably what PostgreSQL is doing already without a > requirement for the positioning. This is not your mother's database. We don't care about statement ordering. :-) -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> There is a simple way to optimize SQL queries involving joins to > PostgreSQL that I think should be handled by Postgre? If one is joining > a tables a,b on attribute "x" and if one has something like x=3 then it > helps A LOT to say: a.x=3 and b.x=3 in addition to saying a.x=b.x ... > The example below shoulds the radical speed gain of doing this, and I > think it isn't something real obvious to most people... > > Of course it could just be a common thing to do in SQL, anyway, just > thought I'd let you all know what I discovered. > > > > Here is an example: > [...] Hmmmmm, wouldn't it be nice to do such things automagically (rewriter? planner?). Even if I don't think that it is that trivial as it might look like. The entire queries qualification is ONE big expression of cascaded AND and OR nodes. Deciding which constant expressions could be added again with substituted Var nodes because there's some "VarX == VarY" and don't falling into recursion traps isn't easy. But no expression comparing two Var's of different RTE's can be put down into an execution subtree. And that is what reduces the amount of data to get joined. I think this is worth the efford. I think the best place for it will be at the top of the optimizer. It's more an optimization issue even if implemented in rewriting technique. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #========================================= wieck@debis.com (Jan Wieck) #
> > There is a simple way to optimize SQL queries involving joins to > > PostgreSQL that I think should be handled by Postgre? If one is joining > > a tables a,b on attribute "x" and if one has something like x=3 then it > > helps A LOT to say: a.x=3 and b.x=3 in addition to saying a.x=b.x ... > > The example below shoulds the radical speed gain of doing this, and I > > think it isn't something real obvious to most people... > > > > Of course it could just be a common thing to do in SQL, anyway, just > > thought I'd let you all know what I discovered. > > > > > > > > Here is an example: > > [...] > > Hmmmmm, > > wouldn't it be nice to do such things automagically > (rewriter? planner?). Added to TODO: * In WHERE x=3 AND x=y, add y=3 -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
wieck@debis.com (Jan Wieck) writes: > I think the best place for it will be at the top of the > optimizer. It's more an optimization issue even if > implemented in rewriting technique. I have thought for some time that we need an additional phase in between the rewriter and the planner/optimizer proper. This phase would be the right place to do constant-subexpression folding, cnfify(), and any other rearrangements of the parse tree that would be likely to help the planning process. So far, the only specific example I had in mind was the index restriction clauses that are added for LIKE and ~ matches with a constant initial pattern --- those are currently done by gram.y but I think they need to happen much later. If this idea of propagating restriction clauses across a join is really worthwhile, then this new optimization phase would be the right place for that too. The reason I'd prefer to keep it separate from the rewriter is that the rewriter is mostly concerned with functionality (implementing views etc) whereas this new module would be concerned with optimization. Obviously you do not want to start worrying about optimization until you've got the fully expanded query available. regards, tom lane
> The reason I'd prefer to keep it separate from the rewriter is that the > rewriter is mostly concerned with functionality (implementing views etc) > whereas this new module would be concerned with optimization. Obviously > you do not want to start worrying about optimization until you've got > the fully expanded query available. Sounds good. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian wrote: > > Added to TODO: > > * In WHERE x=3 AND x=y, add y=3 I don't know if I'm way off, but wouldn't removing "x=y" improve performance further? -- ech`echo xiun|tr nu oc|sed 'sx\([sx]\)\([xoi]\)xo un\2\1 is xg'`ol Känns det oklart? Fråga på!
> > Bruce Momjian wrote: > >=20 > > Added to TODO: > >=20 > > * In WHERE x=3D3 AND x=3Dy, add y=3D3 > > I don't know if I'm way off, but wouldn't removing "x=3Dy" improve > performance further? Maybe in the simple case above. But it will probably change the result set if someone issues SELECT a.x, b.y FROM a, b WHERE a.x >= 10 AND a.x <= 19 AND a.x = b.x; which should then get rewritten into SELECT a.x, b.y FROM a, b WHERE a.x >= 10 AND a.x <= 19 AND a.x = b.x AND b.x >= 10 AND b.x <= 19; This time the "a.x = b.x" is important! IMHO we're improving optimization more and more on the cost of query parse/rewrite/optimize/plan time. Thus performance of statements that EXECUTE fast slows down more and more. Isn't it time to think about some (maybe shared) "parsetree->plan" cache that provides ready to use plans if only Const values have changed? Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #========================================= wieck@debis.com (Jan Wieck) #
Jan Wieck wrote: > > > > > Bruce Momjian wrote: > > >=20 > > > Added to TODO: > > >=20 > > > * In WHERE x=3D3 AND x=3Dy, add y=3D3 > > > > I don't know if I'm way off, but wouldn't removing "x=3Dy" improve > > performance further? > > Maybe in the simple case above. But it will probably change > the result set if someone issues > > SELECT a.x, b.y FROM a, b WHERE a.x >= 10 > AND a.x <= 19 > AND a.x = b.x; > > which should then get rewritten into > > SELECT a.x, b.y FROM a, b WHERE a.x >= 10 > AND a.x <= 19 > AND a.x = b.x > AND b.x >= 10 > AND b.x <= 19; > > This time the "a.x = b.x" is important! Ouch... didn't think that far. However, if an cache is implemented it might be worth the extra overhead to optimize special cases like the one I commented on. Since most systems use a fairly limited set of queries (I suppose...) most of the queries will be cached quite fast. BTW, I just love the work all of you are doing! Postgres is awsome! /Kudo
Patrik Kudo wrote: > Jan Wieck wrote: > > > > This time the "a.x = b.x" is important! > > Ouch... didn't think that far. However, if an cache is implemented > it might be worth the extra overhead to optimize special cases like > the one I commented on. Since most systems use a fairly limited set > of queries (I suppose...) most of the queries will be cached quite > fast. Exactly the fact that most system use a limited set of queries made me think that a shared query->plan cache, that in the best of all worlds isn't lost on backend termination, is worth the efford. And it would lower the need of client side prepared execution plans as discussed sometimes. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #========================================= wieck@debis.com (Jan Wieck) #
wieck@debis.com (Jan Wieck) writes: > IMHO we're improving optimization more and more on the cost > of query parse/rewrite/optimize/plan time. Thus performance > of statements that EXECUTE fast slows down more and more. > Isn't it time to think about some (maybe shared) > "parsetree->plan" cache that provides ready to use plans if > only Const values have changed? Easier said than done, because the plan chosen by the optimizer may well depend on the values of the constants. For example, in SELECT * FROM t1, t2 WHERE t1.a = t2.a AND t1.b < 12; the optimizer will attempt to estimate the fraction of t1's rows that will pass the restriction "t1.b < 12", and then it will choose the type of join depending on how many rows it thinks there will be. If we require plans to be chosen without dependence on the values of constants, we will have to give up a great deal of optimization. I do not object to letting the user specifically say PREPARE xyz... and then using that prepared plan; there are plenty of times when trading off planning time against getting a narrowly-tailored plan is a useful thing to do. But we mustn't pre-empt the user's choice. Note also that in a PREPARE context, it is known which items are substitutable parameters and which are plain constants, so some amount of optimization can still go on. regards, tom lane
[Charset iso-8859-1 unsupported, filtering to ASCII...] > Bruce Momjian wrote: > > > > Added to TODO: > > > > * In WHERE x=3 AND x=y, add y=3 > > I don't know if I'm way off, but wouldn't removing "x=y" improve > performance further? Let me rewrite it. TODO list updated: * In WHERE tab1.x=3 AND tab1.x=tab2.y, add tab2.y=3 -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> IMHO we're improving optimization more and more on the cost > of query parse/rewrite/optimize/plan time. Thus performance > of statements that EXECUTE fast slows down more and more. > Isn't it time to think about some (maybe shared) > "parsetree->plan" cache that provides ready to use plans if > only Const values have changed? TODO list has: * Cache most recent query plan(s?) -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Tom Lane wrote: > secret <secret@kearneydev.com> writes: > > There is a simple way to optimize SQL queries involving joins to > > PostgreSQL that I think should be handled by Postgre? If one is joining > > a tables a,b on attribute "x" and if one has something like x=3 then it > > helps A LOT to say: a.x=3 and b.x=3 in addition to saying a.x=b.x ... > > The example below shoulds the radical speed gain of doing this, and I > > think it isn't something real obvious to most people... > > How much *actual* speedup is there? I don't trust the optimizer's > numbers as anything more than relative measures ;-) > > I'm a bit surprised that you are getting a nested-loop plan and not > a merge or hash join. With a merge join, at least, there ought not be > a large difference from providing the additional qual clause (I think). > What Postgres version are you using? > > regards, tom lane The actual performance difference is HUGE. Hours vs minutes or Minutes vs Seconds... David Secret MIS Director Kearney Development Co., Inc.
Bruce Momjian wrote: > > IMHO we're improving optimization more and more on the cost > > of query parse/rewrite/optimize/plan time. Thus performance > > of statements that EXECUTE fast slows down more and more. > > Isn't it time to think about some (maybe shared) > > "parsetree->plan" cache that provides ready to use plans if > > only Const values have changed? > > TODO list has: > > * Cache most recent query plan(s?) > > -- > Bruce Momjian | http://www.op.net/~candle > maillist@candle.pha.pa.us | (610) 853-3000 > + If your life is a hard drive, | 830 Blythe Avenue > + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 Why not let the user select how much time he wants the DB to spend on optimization? maybe an explicit "FAST SELECT ..." will skip the costly steps for subsecond queries, or "OPTIMIZE SELECT" for the long(hour) queries that really need a lot of help :)... Or even a SET OPTION= for turning it on/off? --David
> Why not let the user select how much time he wants the DB to spend on > optimization? maybe an explicit "FAST SELECT ..." will skip the costly steps > for subsecond queries, or "OPTIMIZE SELECT" for the long(hour) queries that > really need a lot of help :)... Or even a SET OPTION= for turning it on/off? In most cases, the optimizer knows much better how long to optimize, and the time it spends is usually very small. Most time is spent in executor. Pre-6.5 optimizer had problems, so this was not true then. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
secret <secret@kearneydev.com> writes: > Tom Lane wrote: >> How much *actual* speedup is there? I don't trust the optimizer's >> numbers as anything more than relative measures ;-) >> >> I'm a bit surprised that you are getting a nested-loop plan and not >> a merge or hash join. With a merge join, at least, there ought not be >> a large difference from providing the additional qual clause (I think). >> What Postgres version are you using? > The actual performance difference is HUGE. Hours vs minutes or Minutes vs > Seconds... Well, yeah, it could be huge in a nested-loop scenario; in a mergejoin I think it would make little difference. Actually, if the inner path is indexed then this shouldn't make any difference for a nestloop either; each probe into the inner path *ought* to be using the value of the current outer tuple's join variable as an indexqual constraint, which would have the same limiting effect as the explicit restriction you propose adding. There is code in the optimizer that claims to be making that happen. Sounds like it is broken :-( What are the data types of po_id and material_po in your example? And, again, which Postgres version exactly? regards, tom lane
Patrik Kudo wrote: > Jan Wieck wrote: > > > > This time the "a.x = b.x" is important! > > Ouch... didn't think that far. However, if an cache is implemented > it might be worth the extra overhead to optimize special cases like > the one I commented on. Since most systems use a fairly limited set > of queries (I suppose...) most of the queries will be cached quite > fast. Exactly the fact that most system use a limited set of queries made me think that a shared query->plan cache, that in the best of all worlds isn't lost on backend termination, is worth the efford. And it would lower the need of client side prepared execution plans as discussed sometimes. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #========================================= wieck@debis.com (Jan Wieck) #
Bruce Momjian wrote: > > Added to TODO: > > * In WHERE x=3 AND x=y, add y=3 I don't know if I'm way off, but wouldn't removing "x=y" improve performance further? -- ech`echo xiun|tr nu oc|sed 'sx\([sx]\)\([xoi]\)xo un\2\1 is xg'`ol Känns det oklart? Fråga på!