Thread: Re: [GENERAL] Yet another btree gotcha

Re: [GENERAL] Yet another btree gotcha

From
Mike Mascari
Date:
I realize that this is somewhat off topic, but I
have seen far better improvments in query speeds
when using a multifield index instead of individual
ones. So, for the given select below:

SELECT km.km, su.su
FROM km, su
WHERE km.id = su.id
AND km.rel = su.rel
AND km.item = su.item
AND su ~ '^ethanol';

multifield indexes should be created:

CREATE INDEX k_km on km(id, rel, item);
CREATE INDEX k_su on su(id, rel, item);

and the query should be rewritten as:

SELECT km.km, su.su
FROM km, su
WHERE
(km.id,km.rel,km.item) = (su.id,su.rel,su.item)
AND su ~ '^ethanol';

Perhaps the distribution of your data is more
"normal" using multikey indexes (I hope).  Note
that if you have a multifield index, and you still
submit your original query, the planner/optimizer
doesn't appear smart enough to use the multikey
index except for the first join condition, so to
realize all of the benefits, you have to use the

WHERE
(t1.field1,t1.field2) = (t2.field1,t2.field2)

construction instead of:

WHERE
t1.field1 = t2.field1 AND
t1.field2 = t2.field2

On another note, looking through the backend
sources reveals that, at one point, partial indices
were once allowed:

CREATE INDEX k_km on km(item) WHERE item <> 1;

but the grammer has been removed and is now
illegal (although the code to handle it still
exists). Perhaps someday this feature will be
restored.

Hope the above helps some,

Mike Mascari
(mascarim@yahoo.com)

--- "Gene Selkov, Jr." <selkovjr@mcs.anl.gov> wrote:
> I am wondering whether it's normal to see dropping a
> btree improve the
> query which could never complete enough that it
> completes in a blink?
>
> I realize the data I have here represent the worst
> possible input to
> btree, and I'm probably better off without any index
> at all, but I
> guess it is something that the optimizer or the
> access method itself
> should be able to decide.
>
> I am joining two tables, "km" and "su" on an int2
> attribute, "item".
> Just take a look at the value histograms for item:
>
> emp=> SELECT item, count (*) AS count FROM km GROUP
> BY item;
> item|count
> ----+-----
>    1|31262
>    2|  110
>    3|    3
>    4|    1
> (4 rows)
>
> emp=> SELECT item, count (*) AS count FROM su GROUP
> BY item;
> item|count
> ----+-----
>    1|94108
>    2| 1697
>    3|  773
>    4|  482
>    5|  237
>    6|  146
>    7|  105
>    8|   68
>    9|   41
>   10|   29
>   11|   22
>   12|   15
>   13|   13
>   14|    8
>   15|    7
>   16|    6
>   17|    5
>   18|    4
>   19|    4
>   20|    4
>   21|    4
>   22|    3
>   23|    3
>   24|    3
>   25|    1
>   26|    1
>   27|    1
>   28|    1
>   29|    1
>   30|    1
>   31|    1
>
> As a default rule, I used to create the btree
> indices for all integer
> types, regardless of their values. Not anymore. It
> took me quite a
> while to figure that the following query did not
> work because of the
> faulty btree index on "item" (other joined
> attributes are char()):
>
> SELECT km.km, su.su
>   FROM km, su
>   WHERE km.id = su.id
>     AND km.rel = su.rel
>     AND km.item = su.item
>     AND su ~ '^ethanol';
>
> Can the btree or any other AM be smart enough and
> bail out from CREATE
> INDEX saying, "your data isn't worth indexing"?
>
> --Gene


=====

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com

Re: [GENERAL] Yet another btree gotcha

From
"Gene Selkov, Jr."
Date:
Mike Mascari wrote:
>
> I realize that this is somewhat off topic, but I
> have seen far better improvments in query speeds
> when using a multifield index instead of individual
> ones.

I only saw an infinitesimal improvement in my case (I guess it
already was as good as it could be), but I appreciate pointing this
option out to me. I believe it can account for better performance in
some cases, but it's already useful in keeping the number of indices
low and it also makes the queries shorter and more readable.

> Perhaps the distribution of your data is more
> "normal" using multikey indexes (I hope).  Note
> that if you have a multifield index, and you still
> submit your original query, the planner/optimizer
> doesn't appear smart enough to use the multikey
> index except for the first join condition,

It does that now:

explain SELECT km.km, su.su FROM km, su WHERE (km.id,km.rel,km.item) = (su.id,su.rel,su.item) AND su ~ '^ethanol'
NOTICE:  QUERY PLAN:

Nested Loop  (cost=5071.34 rows=5 width=76)
  ->  Seq Scan on su  (cost=5039.20 rows=16 width=38)
  ->  Index Scan using k_km on km  (cost=2.01 rows=31376 width=38)

explain SELECT km.km, su.su FROM km, su WHERE km.id = su.id AND km.rel = su.rel AND km.item = su.item AND su ~
'^ethanol';"
NOTICE:  QUERY PLAN:

Nested Loop  (cost=5071.34 rows=5 width=76)
  ->  Seq Scan on su  (cost=5039.20 rows=16 width=38)
  ->  Index Scan using k_km on km  (cost=2.01 rows=31376 width=38)


and it uses only one individual index when the composite index is dropped:

explain SELECT km.km, su.su FROM km, su WHERE km.id = su.id AND km.rel = su.rel AND km.item = su.item AND su ~
'^ethanol';"
NOTICE:  QUERY PLAN:

Nested Loop  (cost=5075.54 rows=5 width=76)
  ->  Seq Scan on su  (cost=5039.20 rows=16 width=38)
  ->  Index Scan using kmidix on km  (cost=2.27 rows=31376 width=38)

explain SELECT km.km, su.su FROM km, su WHERE (km.id,km.rel,km.item) = (su.id,su.rel,su.item) AND su ~ '^ethanol'"
NOTICE:  QUERY PLAN:

Nested Loop  (cost=5075.54 rows=5 width=76)
  ->  Seq Scan on su  (cost=5039.20 rows=16 width=38)
  ->  Index Scan using kmidix on km  (cost=2.27 rows=31376 width=38)

Note that (1) it does not seem to treat the AND clauses and list
comparisons differently (and that's great!), and (2), in this case,
the composite index is just as good as one of the individual indices
-- the one with the largest number of bins (km.id and su.id are unique
for each record, while "rel" and "item" aren't)


> On another note, looking through the backend
> sources reveals that, at one point, partial indices
> were once allowed:
>
> CREATE INDEX k_km on km(item) WHERE item <> 1;
>
> but the grammer has been removed and is now
> illegal (although the code to handle it still
> exists). Perhaps someday this feature will be
> restored.

I like this idea, too.

Thanks a lot,

--Gene