Thread: temporary indexes
Just a "wouldn't it be nice if" sort of feature request. I'm not sure how practical it is. Someone in our organization wrote a data fix query, which has sort of odd logic, but it does what they need. The problem is that it ran for 14 hours in a test against a copy of the data. I looked at it and figured it could do better with an extra index. The index took five minutes to build, and the run time for the query dropped to five minutes. The index is not needed for production, so it was then dropped. It struck me that it would be outstanding if the planner could recognize this sort of situation, and build a temporary index based on the snapshot of the data visible to the transaction. It seems to me that the obvious downside of this would be the explosion in the number of permutations the planner would need to examine -- based not just on what indexes ARE there, but which ones it could build. At a minimum, there would need to be a cost threshold below which it would not even consider the option. (In this case, as long as the optimizer spent less than 13 hours and 50 minutes considering its options, we would have come out ahead.) I'm not sure the details of this particular incident are that relevant, but I've attached the query and the two plans. -Kevin
Attachment
On Tue, Feb 28, 2006 at 09:44:08AM -0600, Kevin Grittner wrote: > It struck me that it would be outstanding if the planner could > recognize this sort of situation, and build a temporary index based on > the snapshot of the data visible to the transaction. It seems to me > that the obvious downside of this would be the explosion in the number > of permutations the planner would need to examine -- based not just on > what indexes ARE there, but which ones it could build. At a minimum, > there would need to be a cost threshold below which it would not even > consider the option. (In this case, as long as the optimizer spent less > than 13 hours and 50 minutes considering its options, we would have come > out ahead.) FWIW, Sybase supported something similar a long time ago. It had the ability to build a temporary 'clustered table' (think index organized table) when there was enough benefit to do so. This is actually much easier to make happen inside a transaction for us, because we don't need to keep visibility information around. There's probably also some index metadata that could be done away with. Perhaps the materialize node could be made to allow this. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > It struck me that it would be outstanding if the planner could > recognize this sort of situation, and build a temporary index based on > the snapshot of the data visible to the transaction. I don't think that's an appropriate solution at all. What it looks like to me (assuming that explain's estimated row counts are reasonably on-target) is that the time is all going into the EXISTS subplans. The real problem here is that we aren't doing anything to convert correlated EXISTS subqueries into some form of join that's smarter than a raw nestloop. regards, tom lane
"Jim C. Nasby" <jnasby@pervasive.com> writes: > FWIW, Sybase supported something similar a long time ago. It had the > ability to build a temporary 'clustered table' (think index organized > table) when there was enough benefit to do so. This is actually > much easier to make happen inside a transaction for us, because we don't > need to keep visibility information around. There's probably also some > index metadata that could be done away with. Perhaps the materialize > node could be made to allow this. How does what you describe differ from a merge join? Or a hash join, if you imagine the temp index as being a hash rather than btree index? The issue at hand really has nothing to do with temp indexes, it's with the constrained way that the planner deals with EXISTS subplans. The subplans themselves are cheap enough, even in the poorly-indexed variant, that the planner would certainly never have decided to create an index to use for them. The problem only becomes apparent at the next level up, where those subplans are going to be repeated a huge number of times ---- but the subplan plan is already chosen and won't be changed. So even if we invented a temp-index facility, it would fail to be applied in Kevin's example. The limiting factor is that EXISTS subplans aren't flattened ... and once that's fixed, I doubt the example would need any new kind of join support. regards, tom lane
>>> On Tue, Feb 28, 2006 at 11:05 am, in message <16076.1141146348@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > The issue at hand really has nothing to do with temp indexes, it's with > the constrained way that the planner deals with EXISTS subplans. Yet when the index exists, the query is optimized well. > The > subplans themselves are cheap enough, even in the poorly- indexed > variant, that the planner would certainly never have decided to create > an index to use for them. That depends. If the planner was able to generate hypothetical index descriptions which might be useful, and analyze everything based on those (adding in creation cost, of course) -- why would it not be able to come up with the plan which it DID use when the index existed. > The limiting factor is that EXISTS subplans > aren't flattened ... and once that's fixed, I doubt the example would > need any new kind of join support. <digression> I'm all for that. So far, we've been going after the low-hanging fruit in our use of PostgreSQL. When we get to the main applications, we're going to be dealing with a lot more in the way of EXISTS clauses. The product we're moving from usually optimized an IN test the same as the logically equivalent EXISTS test, and where a difference existed, it almost always did better with the EXISTS -- so we encouraged application programmers to use that form. Also, EXISTS works in situations where you need to compare on multiple columns, so it is useful in many situations where EXISTS or MIN/MAX techniques just don't work. </digression> If fixing this would allow hash or merge techniques to cover this as well as the index did, and that is true in a more general sense (not just for this one example), then temporary indexes would clearly not have any value. -Kevin
>>> On Tue, Feb 28, 2006 at 11:36 am, in message <440435BC.EE98.0025.0@wicourts.gov>, "Kevin Grittner" > Also, EXISTS works in situations where > you need to compare on multiple columns, so it is useful in many > situations where EXISTS or MIN/MAX techniques just don't work. Sorry. That should have read: EXISTS works in situations where you need to compare on multiple columns, so it is useful in many situations where IN or MIN/MAX techniques just don't work.
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > EXISTS works in situations where > you need to compare on multiple columns, so it is useful in many > situations where IN or MIN/MAX techniques just don't work. IN works fine on multiple columns: (foo, bar, baz) IN (SELECT x, y, z FROM ...) regards, tom lane
>>> On Tue, Feb 28, 2006 at 12:06 pm, in message <16797.1141150016@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote: > IN works fine on multiple columns: > > (foo, bar, baz) IN (SELECT x, y, z FROM ...) Thanks for pointing that out. I recognize it as valid ANSI/ISO syntax, using a row value constructor list. Unfortunately, row value constructor lists failed to make the portability cut for allowed syntax here, because it was not supported by all candidate products at the time. It is still not supported by the product we're moving from, and we'll have about a one year window when our code will need to run on both. -Kevin
On Tue, Feb 28, 2006 at 11:36:28AM -0600, Kevin Grittner wrote: > <digression> > I'm all for that. So far, we've been going after the low-hanging fruit > in our use of PostgreSQL. When we get to the main applications, we're > going to be dealing with a lot more in the way of EXISTS clauses. The > product we're moving from usually optimized an IN test the same as the > logically equivalent EXISTS test, and where a difference existed, it > almost always did better with the EXISTS -- so we encouraged application > programmers to use that form. Also, EXISTS works in situations where > you need to compare on multiple columns, so it is useful in many > situations where EXISTS or MIN/MAX techniques just don't work. > </digression> Maybe it's just the way my twisted mind thinks, but I generally prefer using a JOIN when possible... -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
>>> On Tue, Feb 28, 2006 at 11:05 am, in message <16076.1141146348@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The limiting factor is that EXISTS subplans > aren't flattened ... and once that's fixed, I doubt the example would > need any new kind of join support. I rewrote the query to use IN predicates rather than EXISTS predicates, and the cost estimates look like this: EXISTS, no index: 1.6 billion EXISTS, with index: 0.023 billion IN, no index: 13.7 billion IN, with index: 10.6 billion At least for the two EXISTS cases, the estimates were roughly accurate. These plans were run against the data after the fix, but analyze has not been run since then, so the estimates should be comparable with the earlier post. I'm not used to using the IN construct this way, so maybe someone can spot something horribly stupid in how I tried to use it. -Kevin
Attachment
Kevin Grittner wrote: > I rewrote the query to use IN predicates rather than EXISTS predicates, > and the cost estimates look like this: > > EXISTS, no index: 1.6 billion > EXISTS, with index: 0.023 billion > IN, no index: 13.7 billion > IN, with index: 10.6 billion > > At least for the two EXISTS cases, the estimates were roughly accurate. > These plans were run against the data after the fix, but analyze has > not been run since then, so the estimates should be comparable with the > earlier post. > > I'm not used to using the IN construct this way, so maybe someone can > spot something horribly stupid in how I tried to use it. I will have a look at your queries tomorrow. Some general advice (rdbms agnostic) on when to use IN and when to use EXISTS taken from "SQL performance tuning": - if the inner table has few rows and the outer has many then IN is preferred - if however you have a restrictive expression on the outer query you should preferr EXISTS - use NOT EXISTS instead of NOT IN (break out early) regards, Lukas
>>> On Tue, Feb 28, 2006 at 3:02 pm, in message <20060228210232.GW82012@pervasive.com>, "Jim C. Nasby" <jnasby@pervasive.com> wrote: > > Maybe it's just the way my twisted mind thinks, but I generally prefer > using a JOIN when possible... Definitely. But sometimes you don't want one row from a table for each qualifying row in another table, you want one row from the table if one or more qualifying rows exist in the other table. Those are the cases in question here. Don't suggest that I just let the duplicates happen and use DISTINCT, that is much more prone to logic errors in complex queries, and typically optimizes worse. -Kevin