Thread: visualizing B-tree index coverage


From: TJ O'Donnell
Does anyone know of a tool that allows one to visualize
the tree created by a multi-column B-tree index?
A picture of a tree with branches, showing how "branchy" the
tree is, would be great.
I'm wondering how well I've "clustered" the data in my table
using the multi-column index.  In other words, do my
multiple columns sufficiently, but not overly, discriminate rows from each other?
Do I have too many rows with the same index value? (not enough branches)
Do I have a unique index value for each row? (way too many branches)

Thanks,
TJ



Re: visualizing B-tree index coverage

From: Dann Corbit
Useful explanation of PostgreSQL index format:
http://www.faqs.org/docs/ppbook/c13329.htm

I think you are aiming for the wrong thing.
The worst possible index is one with every value the same.
The second worst (still basically useless) is one with only two values.
The greater the differentiation of the data, the more workload is
reduced on a search.

Since it isn't a straight binary tree, I don't think that having highly
dissimilar data in the index should be a problem.

Do you have data or experience that shows otherwise?


Re: visualizing B-tree index coverage

From: TJ O'Donnell
Since I'm using a multi-column index, I can greatly influence
the nature of the index created, depending on which columns I use
and how many.  I'm searching for an optimal set
of columns that creates an index that certainly does not have
every value the same, nor only two values.  Instead, I want to see
how well I've spread the index out over the data (if that phrasing makes sense).

More specifically, I have character data representing molecular structures.
I've written (rather slow) search functions.  I can create any number of
columns that "fingerprint" each structure, e.g. # carbon atoms, # N atoms,
# single bonds, etc.  I expect my fingerprints will not be unique (fingerprint
may be a poor analogy), but rather will classify similar structures together.
I create a multi-column index on these counts and currently get about a
2-3 times speedup using 13 columns.
For example:

select count(smiles) from structure
 where oe_matches(smiles,'c1ccccc1CC(=O)NC');            -- about 15 seconds

select count(smiles) from structure
 where (_c, _n, _o, _s, _p, _halo,
        _arom_c, _arom_n, _arom_o, _arom_s,
        _atoms, _single_bonds, _other_bonds) >=
       (3, 1, 1, 0, 0, 0, 6, 0, 0, 0, 11, 4, 7)
   and oe_matches(smiles,'c1ccccc1CC(=O)NC');            -- about 6 seconds

when (_c, etc.) is covered by a multi-column index.
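The index itself is on those same count columns; for reference, a sketch of
its definition (the index name here is made up):

create index structure_fp on structure
 (_c, _n, _o, _s, _p, _halo,
  _arom_c, _arom_n, _arom_o, _arom_s,
  _atoms, _single_bonds, _other_bonds);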

The data isn't inherently structured in any way that invites some particular
number of columns for indexing.  I don't want to use too many, nor too few,
columns.  I also want to optimize the nature (which atom types, bond types,
etc.) of the count columns.  While I could do this and use the speedup as
the measure of success, I think that if my B-tree were "covering" the data
well, I would get the best results.  By covering I mean finding that optimal
situation where there is not one index value for all rows, and also not a
unique index value for every row - something in between would be ideal.
Or is that basically a wrong idea?

TJ




Re: visualizing B-tree index coverage

From: Dann Corbit
Normally, a unique, clustered index is the silver bullet to the best
performance (but my experience with unique clustered indexes is largely
in non-PostgreSQL database systems -- so take it with a grain of salt).

I do not see any extra expense for a unique index versus one that is
mostly unique.  Further, if an index is unique, that should be an
excellent optimizer hint for query acceleration.

If you know what queries you run most frequently, I would tailor the
index for optimal query execution via the join columns and columns often
involved in where clause filtering.

If it is easily possible to make a unique index I would definitely time
the queries with a unique index as well.   I suspect that the unique
index will fare better unless there is something odd about your data.
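For example, if no naturally unique column combination exists, a unique
variant can be made by appending the primary key as the last column (a
sketch, assuming a primary key column named id; the index name is made up):

CREATE UNIQUE INDEX structure_fp_uniq ON structure
  (_c, _n, _o, _s, _p, _halo, _arom_c, _arom_n, _arom_o, _arom_s,
   _atoms, _single_bonds, _other_bonds, id);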


Re: visualizing B-tree index coverage

From: Dann Corbit
Some other things that are important:
How much is the data in transition (updates/deletes/inserts)?  If the
data is mostly static, you can add many special-case indexes
with little penalty.  The biggest cost of indexes (besides disk space
consumed) is in the slowdown of inserts, updates, and deletes.  If the
data hardly changes, you can throw on index after index with little
cost.  But if the data is in huge flux, you will have to be careful
about performance targets for each index you add.

This stuff may prove to be of great value:
http://www.postgresql.org/docs/8.0/interactive/monitoring-stats.html

I would also run EXPLAIN against every distinct sort of query you plan
to execute (unless it is for ad-hoc reporting in which case such a
requirement cannot be met).
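
For example, a sketch against the query posted earlier in this thread (only
three of the fingerprint columns shown):

EXPLAIN SELECT count(smiles) FROM structure
 WHERE (_c, _n, _o) >= (3, 1, 1)
   AND oe_matches(smiles, 'c1ccccc1CC(=O)NC');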


Re: visualizing B-tree index coverage

From: Oleg Bartunov
Excuse me for bothering you, but what kind of search engine have you
developed?  Does it look like set comparison?

     Oleg


     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: visualizing B-tree index coverage

From: PFC
    I think you missed an important "feature" of multicolumn indexes, namely
that you had better not use 'OR' in your expressions. You seem to want only
to use '>=', so this should be OK.

    Suppose you have 3 columns a,z,e containing values linearly distributed
between the bounds below:

select min(a),max(a),min(z),max(z),min(e),max(e) from test;
  min | max | min | max | min | max
-----+-----+-----+-----+-----+-----
    0 |  13 |   0 |  99 |   0 |  99


For instance, the following query is fully indexed:
explain analyze select * from test where a>=0 and z>=90 and e>=0;
                                                        QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------
  Index Scan using testa on test  (cost=0.00..1637.56 rows=11345 width=16)
(actual time=0.085..51.441 rows=13000 loops=1)
    Index Cond: ((a >= 0) AND (z >= 90) AND (e >= 0))
  Total runtime: 56.307 ms

The following is only partially indexed:

explain analyze select * from test where (a=1 or a=2) and (z=1 or z=8) and
e>=0;
                                                          QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------
  Index Scan using testa, testa on test  (cost=0.00..3269.06 rows=346
width=16) (actual time=0.328..52.961 rows=400 loops=1)
    Index Cond: ((a = 1) OR (a = 2))
    Filter: (((z = 1) OR (z = 8)) AND (e >= 0))
  Total runtime: 53.297 ms

    You see the 'Index Cond' field, which is what determines the fetched rows;
these are then filtered with the 'Filter' expression. Having the most
selective index cond is important because it diminishes the number of rows
to be fetched. However, in your case the filter expression is also
beneficial, because any row eliminated by the filter will not need to go
through your expensive matching function.

In this case:

SELECT count(*) FROM test;
    => 131072

SELECT count(*) FROM test WHERE ((a = 1) OR (a = 2));
    => 20000

SELECT count(*) FROM test WHERE (a=1 or a=2) and (z=1 or z=8) and e>=0;
    => 400

In this case the index fetches 20k rows out of 131072 but only 400 are
used...

If you don't use OR, index use is more likely:

explain analyze select * from test where (a,z,e) >= (0,50,80);
                                                        QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------
  Index Scan using testa on test  (cost=0.00..1669.78 rows=12627 width=16)
(actual time=0.087..58.316 rows=13000 loops=1)
    Index Cond: ((a >= 0) AND (z >= 50) AND (e >= 80))
  Total runtime: 63.049 ms

Here you have a full index scan.

To determine the efficiency of your indexes, you can thus use this method:
look at the 'Index Cond' and 'Filter' expressions, and count the rows
matched by each.

Re: visualizing B-tree index coverage

From: Florian G. Pflug
TJ O'Donnell wrote:
> More specifically, I have character data representing molecular structures.
> I've written (rather slow) search functions.  I can create any number of
> columns that "fingerprint" each structure, e.g. # Carbon atoms, # N atoms,
> # single bonds, etc.  I expect my fingerprints will not be unique (fingerprint may
> be a poor analogy), but rather will classify similar structures together.
> I create a multi-column index on these counts and
> get about 2-3 times speedup using 13 columns right now.
> For example:
> select count(smiles) from structure where  oe_matches(smiles,'c1ccccc1CC(=O)NC')  about 15 sec.
>
> select count(smiles) from structure where
>  (_c, _n, _o, _s, _p, _halo,
>   _arom_c, _arom_n, _arom_o, _arom_s,
>   _atoms, _single_bonds, _other_bonds)  >=
>  ( 3,1,1,0,0,0, 6,0,0,0, 11,4,7 )
>  and oe_matches(smiles,'c1ccccc1CC(=O)NC')   about 6 seconds
> when the (_c, etc.) is a multi-column index.
Is 3 really only a lower bound for the number of carbon atoms (which, I
guess, is what _c represents), or will there be exactly 3 carbon atoms in
the molecules found? If it is only a lower bound, can you estimate an upper
bound as well? If you can, then you could further optimize this by using a
range query (i.e. select ... from ... where (...) >= (...) and (...) <= (...)
and oe_matches(...)).
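
For example (a sketch; the upper-bound tuple here is invented):

select count(smiles) from structure
 where (_c, _n, _o, _s, _p, _halo,
        _arom_c, _arom_n, _arom_o, _arom_s,
        _atoms, _single_bonds, _other_bonds) >=
       (3, 1, 1, 0, 0, 0, 6, 0, 0, 0, 11, 4, 7)
   and (_c, _n, _o, _s, _p, _halo,
        _arom_c, _arom_n, _arom_o, _arom_s,
        _atoms, _single_bonds, _other_bonds) <=
       (9, 3, 3, 0, 0, 0, 12, 1, 1, 0, 30, 20, 15)
   and oe_matches(smiles, 'c1ccccc1CC(=O)NC');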

If you can only determine bounds (upper _and_ lower) for some of the
"fingerprint" columns, then I would suggest you choose those for the
index, and use them as the "first" columns (because otherwise a
range scan won't work).

"analyze" gathers some statistics about the distribution of values in a
table - I guess you could extract at least an approximation of "index
coverage" from there.

Since the cost of index inserts/updates/deletes is afaik largely
independent of the number of columns the index consists of, I'd
suggest that you add columns until the stats show that it's nearly a
unique index - but prefer columns for which you can compute strong upper
& lower bounds, and put those "in front".

> The data isn't inherently structured in any way that invites some particular number of columns
> for indexing.  I don't want to use too many, nor too few columns.  I also
> want to optimize the nature(which atom types, bond types, etc.)
> of the count columns.  While I could do this
> and use the speedup as the measure of success, I think
> that if my B-tree were "covering" the data well, I would get the best results.
> Covering means finding that optimal situation where there is not one index for all rows
> and also not a unique index for every row - something inbetween would be ideal,
> or is that basically a wrong idea?
I believe it depends on the cost of executing one oe_matches call -
according to you, those costs are quite heavy, so every call you can
omit by adding more columns to the index is worth the performance penalty.

If you need fast adding of data, as well as fast searches, you could add
the data into a "pending" table that is not indexed, and union the
searches over both tables. At low-traffic times (say, at night), you
could move the data from the "pending" table to the "structure" table.
This of course only works if the number of tuples inserted between two
move-runs is much smaller than the number of tuples in "structure".
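
For example (a sketch; "pending" is assumed to have the same columns as
"structure", and the subselect alias is required):

select count(smiles) from (
  select smiles from structure
   where (_c, _n, _o) >= (3, 1, 1)
     and oe_matches(smiles, 'c1ccccc1CC(=O)NC')
  union all
  select smiles from pending
   where oe_matches(smiles, 'c1ccccc1CC(=O)NC')
) as s;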

greetings, Florian Pflug


Re: visualizing B-tree index coverage

From: TJ O'Donnell
I realize that using OR will not result in an index scan.
I will never be interested in an OR condition for the kinds
of searches I use.  In my Select statements, I always name
every column of the multi-column index, in the same order that
they were named when creating the index.  I always use
the >= condition, and very rarely, the = condition.

However, I am concerned that I must place
the most selective column first in my index.  I cannot tell,
a priori, which column will be most selective.  That depends on the
nature of the search, which can vary widely each time.
Are you saying that if my first column is not selective, even though the
remaining columns are, the planner may choose not to use the index after
seeing that the first column is not very selective?
That seems like an oversight, IMHO.  Shouldn't the overall effect of
using all the columns be considered before choosing not to use an
index scan?

Since I'm using every column of my multi-column index for every search,
and I always use >=, Explain Analyze always shows that every column
is considered in the index scan.  However, that is only when the
index scan is used.  Sometimes, Explain Analyze shows it is not used.
That appears to happen when my search condition is very general.
This is to be expected, so I am not worried.  Most of my searches will
be intermediate, namely not VERY selective, but also not VERY general.
So the idea of the multi-column index is to "characterize" each row
sufficiently, even when it is a perfectly ordinary row with no ONE
feature being distinctive, but rather several features together giving
it its distinctive character.  That is my interpretation of the
multi-column index.

TJ

Re: visualizing B-tree index coverage

From: Greg Stark
"TJ O'Donnell" <tjo@acm.org> writes:

> However, I am concerned that I must place the most selective column first in
> my index. I cannot tell, a priori, which column will be most selective. That
> depends on the nature of search, which can vary widely each time.

If you're always using > operators then perhaps you would be better off with
three separate indexes on <x>, <y>, and <z> rather than one on <x,y,z>. It
depends on your use case.
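
A sketch, using the column names from earlier in the thread (index names are
made up):

CREATE INDEX structure_c_idx ON structure (_c);
CREATE INDEX structure_n_idx ON structure (_n);
CREATE INDEX structure_o_idx ON structure (_o);

The planner can then pick whichever single index is most selective for a
given query.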

It's also possible you would be better off with a GiST index on the array
[x,y,z], using the GiST indexable operators in contrib/intarray. However,
there are disadvantages to these add-ons as well.
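
A sketch of that approach, assuming contrib/intarray is installed and the
counts are packed into an integer-array column fp (the opclass is
gist__int_ops; operator spellings have varied between versions):

CREATE INDEX structure_fp_gist ON structure USING gist (fp gist__int_ops);
-- containment searches, e.g. WHERE fp @> ARRAY[3,1,11], can then use
-- the GiST index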

> Are you saying that if my first column is not selective, even though the
> remaining columns are, the planner may choose not to use the index after
> seeing that the first column is not very selective? That seems like an
> oversight, IMHO. Shouldn't the overall effect of using all the columns be
> considered before choosing not to use an index scan?

If your index is on <x,y,z> and you say x=? and y=? and z=? then it's best to
have x be the most selective column but it's only a question of degree. The
index will be useful in basically the same situations, it'll just perform
somewhat better.

On the other hand if your index is on <x,y,z> and you say "x>? and y=? and
z=?" then the remaining qualifications after x>? are useless. Postgres simply
cannot use them in the index lookup. If your index was on <y,z,x> then
Postgres can use it to go straight to the start of the records that match and
read them from the index until they stop matching.

Consider the analogous situation if you were using the index of a book. I
tell you to look up everything in the encyclopedia that starts with a letter
after "p". And by the way, the second letter has to be an "x" and the third
letter a "y". You will more or less have to skim through everything after "p"
even though you know the second and third letters, because the matching
entries will be scattered throughout the remaining pages.

Actually that's not entirely true: you could check each letter after "p" and
look up specifically "qxy...", "rxy...", "sxy...", "txy...", etc. right up to
"zxy...". But unfortunately that's just not something Postgres currently knows
how to do. Even then you can see that it's much less efficient than just
reading a contiguous section of records out in order.

--
greg

Re: visualizing B-tree index coverage

From: PFC
> I realize that using OR will not result in an index scan.
> I will never be interested in a OR condition for the kinds
> of searches I use.  In my Select statements, I always name
> every column of the multi-column index in same order that
> they were named when creating the index.  I always use
> the >= condition, and very rarely, the = condition.

    All the leftmost index columns must be named (in the WHERE clause), but
the order you write them in is unimportant.
    You can use (a BETWEEN x AND y) instead of (a>=x AND a<=y); it is cleaner.

> However, I am concerned that I must place
> the most selective column first in my index.  I cannot tell,
> a priori, which column will be most selective.  That depends on the
> nature of search, which can vary widely each time.
> Are you saying that if my first column is not selective, even though the
> remaining
> columns are, the planner may choose not to use the index after
> seeing that the first column is not very selective?

    I thought this was true, but I ran some tests, and the index scanner is
smart.

    Try this:

CREATE TABLE test (id serial primary key, a INTEGER, z INTEGER, e INTEGER,
r INTEGER, t INTEGER, y INTEGER) WITHOUT OIDS;

Then insert 1M rows, with a,z,e,r,t,y each being floor(random()*10), for
instance.
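
A minimal sketch of that insert, using generate_series (available in 8.0)
instead of a plpgsql function:

INSERT INTO test (a, z, e, r, t, y)
SELECT floor(random()*10), floor(random()*10), floor(random()*10),
       floor(random()*10), floor(random()*10), floor(random()*10)
FROM generate_series(1, 1000000);
ANALYZE test;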

    Then you can try various selects. a,z,e,r,t,y follow a linear distribution
between 0 and 9 inclusive, so:
a>=A AND z>=Z ... y>=Y gives a result set of about
(10-A)*(10-Z)*...*(10-Y) rows. You'll see the planner will use an index
scan when needed. You can try the easiest case (a>=9), which just explores
one part of the tree, and the worst case (y>=9), which explores a part of
all the leaves. Both should yield about the same number of results, but the
first should be faster. To know how much, just try ;)

> That seems like an oversight, IMHO.  Shouldn't the overall effect of
> using all the columns be considered before choosing not to use an
> index scan?

    I think it is. There are no cross-column correlation stats, though.

> Since I'm using every column of my multi-column index for every search,
> and I always use >=, Explain Analyze always shows that every column
> is considered in the index scan.  However, that is only when the
> index scan is used.  Sometimes, Explain Analyze shows it is not used.
> That appears to happen when my search condition is very general.
> This it to be expected, so I am not worried.  Most of my searches will
> be intermediate, namely not VERY selective, but also not VERY general.
> So the idea of the multi-column index is to "characterize" each row
> sufficiently, even when it is a perfectly ordinary row with no ONE
> feature being distinctive, but rather several features together giving
> it it's distinctive character.  That is my interpretation of the
> multi-column index.

    If you have some features which are highly selective, you can create a
single-column index on them. It won't be used often, but when it is, it
will really help.




Re: visualizing B-tree index coverage

From: Dann Corbit
Rather than having the most selective item first, put the most
frequently used item first.

So (for instance) if your database is an organic chemistry database and
the columns were elements, then carbon might be a logical first choice,
hydrogen second, etc. for your index.

What you want in creating the index is to choose one where the
items requested for filtering (in either the WHERE clause or join columns)
are very likely to appear in the early part of the index.

A big problem will arise if, for all indexes, there is no instance where
the first column of the index is referenced in the where clause.

That means you will do a complete table scan.

So create an index where the most probable things to be asked for
are earliest in the index.  More important than selectivity is the
probability of the column being included in the filter part of the
request.

Does indexing help >= as well as = for integer columns?

From: TJ O'Donnell
I have a table of about 5 million rows, 24 columns.
Integer column _c is B-tree indexed (as are _n, _o, and 3 others).

This I understand and like:
Explain Analyze Select count(smiles) from structure where _c = 30

Aggregate  (cost=105595.11..105595.11 rows=1 width=32) (actual time=17.722..17.724 rows=1 loops=1)
  ->  Index Scan using "Nc" on structure  (cost=0.00..105528.89 rows=26486 width=32) (actual time=0.098..16.095 rows=734 loops=1)
        Index Cond: (_c = 30)
Total runtime: 18.019 ms

This I don't get.  Why is an index scan not used?  Isn't an index supposed
to help when using > < >= <= too?
Explain Analyze Select count(smiles) from structure where _c >= 30

Aggregate  (cost=196033.74..196033.74 rows=1 width=32) (actual time=42133.432..42133.434 rows=1 loops=1)
  ->  Seq Scan on structure  (cost=0.00..191619.56 rows=1765669 width=32) (actual time=8050.437..42117.062 rows=1569 loops=1)
        Filter: (_c >= 30)
Total runtime: 42133.746 ms


TJ



Re: Does indexing help >= as well as = for integer columns?

From: PFC
> This I don't get.  Why is an index scan not used?  Isn't an index
> supposed
> to help when using > < >= <= too?

    It should!

> Explain Analyze Select count(smiles) from structure where _c >= 30
> Aggregate  (cost=196033.74..196033.74 rows=1 width=32) (actual
> time=42133.432..42133.434 rows=1
> loops=1)
>   ->  Seq Scan on structure  (cost=0.00..191619.56 rows=1765669
> width=32) (actual
> time=8050.437..42117.062 rows=1569 loops=1)
>         Filter: (_c >= 30)
> Total runtime: 42133.746 ms


    See these:

->  Index Scan using "Nc" on structure  (cost=0.00..105528.89 rows=26486 width=32) (actual time=0.098..16.095 rows=734 loops=1)
->  Seq Scan on structure  (cost=0.00..191619.56 rows=1765669 width=32) (actual time=8050.437..42117.062 rows=1569 loops=1)

    In the index scan case, Planner thinks it'll get "rows=26486" but in
reality only gets 734 rows.
    In the seq scan case, Planner thinks it'll get "rows=1765669" but in
reality only gets 1569 rows.

    Both estimates are way off the mark. 26486 still makes it choose an index
scan, because that is a small fraction of the table, but 1765669 is not.

    Analyze, use more precise statistics (alter table set statistics),
whatever... but you gotta get the planner correctly estimating these
rowcounts.
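
For example (a sketch; 100 is just a larger-than-default statistics target):

ALTER TABLE structure ALTER COLUMN _c SET STATISTICS 100;
ANALYZE structure;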

Re: Does indexing help >= as well as = for integer columns?

From: Tom Lane
"TJ O'Donnell" <tjo@acm.org> writes:
> This I don't get.  Why is an index scan not used?  Isn't an index supposed
> to help when using > < >= <= too?
> Explain Analyze Select count(smiles) from structure where _c >= 30
> Aggregate  (cost=196033.74..196033.74 rows=1 width=32) (actual time=42133.432..42133.434 rows=1
> loops=1)
>   ->  Seq Scan on structure  (cost=0.00..191619.56 rows=1765669 width=32) (actual
> time=8050.437..42117.062 rows=1569 loops=1)
>         Filter: (_c >= 30)

Have you ANALYZEd the table lately?  That rowcount estimate is off by
about three orders of magnitude :-(

            regards, tom lane

Re: Does indexing help >= as well as = for integer columns?

From: Greg Stark

> "TJ O'Donnell" <tjo@acm.org> writes:
>
> >   ->  Seq Scan on structure  (cost=0.00..191619.56 rows=1765669 width=32) (actual
> > time=8050.437..42117.062 rows=1569 loops=1)
         ^^^^^^^^

And do you vacuum regularly? Have you done batch updates or deletes and then
never inserted enough records to reuse that dead space? You might have to
vacuum full (or alternatively use the cluster command) to reclaim this dead
space.
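
A sketch of both options, reusing the index name from the plans above:

VACUUM FULL VERBOSE structure;
-- or rewrite the table in index order, which also reclaims the dead space:
CLUSTER "Nc" ON structure;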

I don't recall what the original motivation to rewrite the analyze sampling
was; did having lots of dead tuples cause bad estimates in 7.4?

--
greg

Re: Does indexing help >= as well as = for integer columns?

From: TJ O'Donnell
I had thought that creating the index would do something
equivalent to Analyze.  I tried Analyze Verbose, and it improved
the planner's ability to predict when an index would be useful.

Last week, I asked about visualizing B-tree "coverage".  I think
I meant "Can I see the histograms that Analyze creates?"
Are they available anywhere?  The docs mention them (bins) and I
was hoping Analyze Verbose would show them to me.

TJ

Re: Does indexing help >= as well as = for integer columns?

From: Martijn van Oosterhout
On Wed, Feb 02, 2005 at 06:51:13AM -0800, TJ O'Donnell wrote:
> I had thought that the Creation of the Index would do something
> equivalent to Analyze.  I tried Analyze Verbose and it improved
> the scanner's ability to predict when an index would be useful.

Create index creates an index, analyze collects statistics. Neither
happens without being asked for...

> Last week, I asked about visualizing B-tree "coverage".  I think
> I meant "Can I see the histograms that Analyze creates?"
> Are they available anywhere?  The docs mention them (bins) and I
> was hoping Analyze Verbose would show them to me.

Maybe pg_statistic? You may need the oid of the column definition to
work out what goes where...
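
For example, through the pg_stats view, which decodes pg_statistic into
readable columns (a sketch using the table and column named earlier in the
thread):

SELECT attname, n_distinct, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'structure' AND attname = '_c';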

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


Re: Does indexing help >= as well as = for integer columns?

From: Tom Lane
"TJ O'Donnell" <tjo@acm.org> writes:
> Last week, I asked about visualizing B-tree "coverage".  I think
> I meant "Can I see the histograms that Analyze creates?"
> Are they available anywhere?

See pg_stats

            regards, tom lane

Re: Does indexing help >= as well as = for integer columns?

From: Manfred Koizar
On 02 Feb 2005 00:58:17 -0500, Greg Stark <gsstark@mit.edu> wrote:
>I don't recall what the original motivation to rewrite the analyze sampling
>was, did having lots of dead tuples cause bad estimates in 7.4?

The original intention was to make analyzing huge tables faster.  But
the patch turned out to be _slower_ for mid-sized tables.  So it was
sold to the community as a way to produce better statistics for
relations with many dead tuples near the start.

Servus
 Manfred