Thread: temporary indexes

temporary indexes

From
"Kevin Grittner"
Date:
Just a "wouldn't it be nice if" sort of feature request.  I'm not sure
how practical it is.

Someone in our organization wrote a data fix query, which has sort of
odd logic, but it does what they need.  The problem is that it ran for
14 hours in a test against a copy of the data.  I looked at it and
figured it could do better with an extra index.  The index took five
minutes to build, and the run time for the query dropped to five
minutes.  The index is not needed for production, so it was then
dropped.

It struck me that it would be outstanding if the planner could
recognize this sort of situation, and build a temporary index based on
the snapshot of the data visible to the transaction.  It seems to me
that the obvious downside of this would be the explosion in the number
of permutations the planner would need to examine -- based not just on
what indexes ARE there, but which ones it could build.  At a minimum,
there would need to be a cost threshold below which it would not even
consider the option.  (In this case, as long as the optimizer spent less
than 13 hours and 50 minutes considering its options, we would have come
out ahead.)

I'm not sure the details of this particular incident are that relevant,
but I've attached the query and the two plans.

-Kevin


Attachment

Re: temporary indexes

From
"Jim C. Nasby"
Date:
On Tue, Feb 28, 2006 at 09:44:08AM -0600, Kevin Grittner wrote:
> It struck me that it would be outstanding if the planner could
> recognize this sort of situation, and build a temporary index based on
> the snapshot of the data visible to the transaction.  It seems to me
> that the obvious downside of this would be the explosion in the number
> of permutations the planner would need to examine -- based not just on
> what indexes ARE there, but which ones it could build.  At a minimum,
> there would need to be a cost threshold below which it would not even
> consider the option.  (In this case, as long as the optimizer spent less
> than 13 hours and 50 minutes considering its options, we would have come
> out ahead.)

FWIW, Sybase supported something similar a long time ago. It had the
ability to build a temporary 'clustered table' (think index organized
table)  when there was enough benefit to do so. This is actually
much easier to make happen inside a transaction for us, because we don't
need to keep visibility information around. There's probably also some
index metadata that could be done away with. Perhaps the materialize
node could be made to allow this.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: [PERFORM] temporary indexes

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> It struck me that it would be outstanding if the planner could
> recognize this sort of situation, and build a temporary index based on
> the snapshot of the data visible to the transaction.

I don't think that's an appropriate solution at all.  What it looks like
to me (assuming that explain's estimated row counts are reasonably
on-target) is that the time is all going into the EXISTS subplans.  The
real problem here is that we aren't doing anything to convert correlated
EXISTS subqueries into some form of join that's smarter than a raw
nestloop.

            regards, tom lane

Re: [PERFORM] temporary indexes

From
Tom Lane
Date:
"Jim C. Nasby" <jnasby@pervasive.com> writes:
> FWIW, Sybase supported something similar a long time ago. It had the
> ability to build a temporary 'clustered table' (think index organized
> table)  when there was enough benefit to do so. This is actually
> much easier to make happen inside a transaction for us, because we don't
> need to keep visibility information around. There's probably also some
> index metadata that could be done away with. Perhaps the materialize
> node could be made to allow this.

How does what you describe differ from a merge join?  Or a hash join,
if you imagine the temp index as being a hash rather than btree index?

The issue at hand really has nothing to do with temp indexes, it's with
the constrained way that the planner deals with EXISTS subplans.  The
subplans themselves are cheap enough, even in the poorly-indexed
variant, that the planner would certainly never have decided to create
an index to use for them.  The problem only becomes apparent at the next
level up, where those subplans are going to be repeated a huge number of
times ---- but the subplan plan is already chosen and won't be changed.
So even if we invented a temp-index facility, it would fail to be
applied in Kevin's example.  The limiting factor is that EXISTS subplans
aren't flattened ... and once that's fixed, I doubt the example would
need any new kind of join support.

            regards, tom lane

Re: [PERFORM] temporary indexes

From
"Kevin Grittner"
Date:
>>> On Tue, Feb 28, 2006 at 11:05 am, in message
<16076.1141146348@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> The issue at hand really has nothing to do with temp indexes, it's
with
> the constrained way that the planner deals with EXISTS subplans.

Yet when the index exists, the query is optimized well.

> The
> subplans themselves are cheap enough, even in the poorly- indexed
> variant, that the planner would certainly never have decided to
create
> an index to use for them.

That depends.  If the planner was able to generate hypothetical index
descriptions which might be useful, and analyze everything based on
those (adding in creation cost, of course) -- why would it not be able
to come up with the plan which it DID use when the index existed.

> The limiting factor is that EXISTS subplans
> aren't flattened ... and once that's fixed, I doubt the example
would
> need any new kind of join support.

<digression>
I'm all for that.  So far, we've been going after the low-hanging fruit
in our use of PostgreSQL.  When we get to the main applications, we're
going to be dealing with a lot more in the way of EXISTS clauses.  The
product we're moving from usually optimized an IN test the same as the
logically equivalent EXISTS test, and where a difference existed, it
almost always did better with the EXISTS -- so we encouraged application
programmers to use that form.  Also, EXISTS works in situations where
you need to compare on multiple columns, so it is useful in many
situations where EXISTS or MIN/MAX techniques just don't work.
</digression>

If fixing this would allow hash or merge techniques to cover this as
well as the index did, and that is true in a more general sense (not
just for this one example), then temporary indexes would clearly not
have any value.

-Kevin



Re: [PERFORM] temporary indexes

From
"Kevin Grittner"
Date:
>>> On Tue, Feb 28, 2006 at 11:36 am, in message
<440435BC.EE98.0025.0@wicourts.gov>, "Kevin Grittner"
> Also, EXISTS works in situations where
> you need to compare on multiple columns, so it is useful in many
> situations where EXISTS or MIN/MAX techniques just don't work.

Sorry.  That should have read:

EXISTS works in situations where
you need to compare on multiple columns, so it is useful in many
situations where IN or MIN/MAX techniques just don't work.


Re: [PERFORM] temporary indexes

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> EXISTS works in situations where
> you need to compare on multiple columns, so it is useful in many
> situations where IN or MIN/MAX techniques just don't work.

IN works fine on multiple columns:

    (foo, bar, baz) IN (SELECT x, y, z FROM ...)

            regards, tom lane

Re: [PERFORM] temporary indexes

From
"Kevin Grittner"
Date:
>>> On Tue, Feb 28, 2006 at 12:06 pm, in message
<16797.1141150016@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> IN works fine on multiple columns:
>
>     (foo, bar, baz) IN (SELECT x, y, z FROM ...)

Thanks for pointing that out.  I recognize it as valid ANSI/ISO syntax,
using a row value constructor list.  Unfortunately, row value
constructor lists failed to make the portability cut for allowed syntax
here, because it was not supported by all candidate products at the
time.  It is still not supported by the product we're moving from, and
we'll have about a one year window when our code will need to run on
both.

-Kevin


Re: [PERFORM] temporary indexes

From
"Jim C. Nasby"
Date:
On Tue, Feb 28, 2006 at 11:36:28AM -0600, Kevin Grittner wrote:
> <digression>
> I'm all for that.  So far, we've been going after the low-hanging fruit
> in our use of PostgreSQL.  When we get to the main applications, we're
> going to be dealing with a lot more in the way of EXISTS clauses.  The
> product we're moving from usually optimized an IN test the same as the
> logically equivalent EXISTS test, and where a difference existed, it
> almost always did better with the EXISTS -- so we encouraged application
> programmers to use that form.  Also, EXISTS works in situations where
> you need to compare on multiple columns, so it is useful in many
> situations where EXISTS or MIN/MAX techniques just don't work.
> </digression>

Maybe it's just the way my twisted mind thinks, but I generally prefer
using a JOIN when possible...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: [PERFORM] temporary indexes

From
"Kevin Grittner"
Date:
>>> On Tue, Feb 28, 2006 at 11:05 am, in message
<16076.1141146348@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The limiting factor is that EXISTS subplans
> aren't flattened ... and once that's fixed, I doubt the example
would
> need any new kind of join support.

I rewrote the query to use IN predicates rather than EXISTS predicates,
and the cost estimates look like this:

EXISTS, no index:  1.6 billion
EXISTS, with index:  0.023 billion
IN, no index:  13.7 billion
IN, with index:  10.6 billion

At least for the two EXISTS cases, the estimates were roughly accurate.
 These plans were run against the data after the fix, but analyze has
not been run since then, so the estimates should be comparable with the
earlier post.

I'm not used to using the IN construct this way, so maybe someone can
spot something horribly stupid in how I tried to use it.

-Kevin


Attachment

Re: [PERFORM] temporary indexes

From
Lukas Smith
Date:
Kevin Grittner wrote:

> I rewrote the query to use IN predicates rather than EXISTS predicates,
> and the cost estimates look like this:
> 
> EXISTS, no index:  1.6 billion
> EXISTS, with index:  0.023 billion
> IN, no index:  13.7 billion
> IN, with index:  10.6 billion
> 
> At least for the two EXISTS cases, the estimates were roughly accurate.
>  These plans were run against the data after the fix, but analyze has
> not been run since then, so the estimates should be comparable with the
> earlier post.
> 
> I'm not used to using the IN construct this way, so maybe someone can
> spot something horribly stupid in how I tried to use it.

I will have a look at your queries tomorrow. Some general advice (rdbms 
agnostic) on when to use IN and when to use EXISTS taken from "SQL 
performance tuning":

- if the inner table has few rows and the outer has many then IN is 
preferred
- if however you have a restrictive expression on the outer query you 
should preferr EXISTS
- use NOT EXISTS instead of NOT IN (break out early)

regards,
Lukas


Re: [PERFORM] temporary indexes

From
"Kevin Grittner"
Date:
>>> On Tue, Feb 28, 2006 at  3:02 pm, in message
<20060228210232.GW82012@pervasive.com>, "Jim C. Nasby"
<jnasby@pervasive.com>
wrote:
>
> Maybe it's just the way my twisted mind thinks, but I generally
prefer
> using a JOIN when possible...

Definitely.  But sometimes you don't want one row from a table for each
qualifying row in another table, you want one row from the table if one
or more qualifying rows exist in the other table.  Those are the cases
in question here.  Don't suggest that I just let the duplicates happen
and use DISTINCT, that is much more prone to logic errors in complex
queries, and typically optimizes worse.

-Kevin