Re: visualizing B-tree index coverage - Mailing list pgsql-general

From: Florian G. Pflug
Subject: Re: visualizing B-tree index coverage
Msg-id: 41F8EB19.305@phlo.org
In response to: Re: visualizing B-tree index coverage ("TJ O'Donnell" <tjo@acm.org>)
List: pgsql-general

TJ O'Donnell wrote:
> More specifically, I have character data representing molecular structures.
> I've written (rather slow) search functions.  I can create any number of
> columns that "fingerprint" each structure, e.g. # Carbon atoms, # N atoms,
> # single bonds, etc.  I expect my fingerprints will not be unique (fingerprint may
> be a poor analogy), but rather will classify similar structures together.
> I create a multi-column index on these counts and
> get about 2-3 times speedup using 13 columns right now.
> For example:
> select count(smiles) from structure where  oe_matches(smiles,'c1ccccc1CC(=O)NC')  about 15 sec.
>
> select count(smiles) from structure where
>  (_c, _n, _o, _s, _p, _halo,
>   _arom_c, _arom_n, _arom_o, _arom_s,
>   _atoms, _single_bonds, _other_bonds)  >=
>  ( 3,1,1,0,0,0, 6,0,0,0, 11,4,7 )
>  and oe_matches(smiles,'c1ccccc1CC(=O)NC')   about 6 seconds
> when the (_c, etc.) is a multi-column index.
Is 3 really only a lower bound for the number of carbon atoms (which, I
guess, is what _c represents), or will there be exactly 3 carbon atoms in
the molecules found? If it is only a lower bound, can you estimate an
upper bound as well as a lower bound? If you can, then you could further
optimize this by using a range query (i.e. select ... from ... where
(...) >= (...) and (...) <= (...) and oe_matches(...)).
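
For example, a range query might look like this (the column subset and
the upper bounds here are made up for illustration; you'd derive the
real bounds from the query structure):

    select count(smiles) from structure
    where (_c, _n, _o) >= (3, 1, 1)
      and (_c, _n, _o) <= (9, 4, 3)
      and oe_matches(smiles, 'c1ccccc1CC(=O)NC');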

If you can determine both upper _and_ lower bounds for only some of the
"fingerprint" columns, then I would suggest you choose those for the
index, and use them as the "first" columns (because otherwise a
range-scan won't work).
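
For example (the index name and column order are purely illustrative),
if _c and _n are the columns with known upper and lower bounds, they
would go first:

    create index structure_fp_idx
        on structure (_c, _n, _o, _s, _atoms);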

"analyze" gathers some statistics about the distribution of values in a
table - I guess you could extract at least an approximation of "index
coverage" from there.

Since the cost of index inserts/updates/deletes is afaik largely
independent of the number of columns the index consists of, I'd
suggest that you add columns until the stats show that it's nearly a
unique index - but prefer columns for which you can compute strong
upper and lower bounds, and put those "in front".

> The data isn't inherently structured in any way that invites some particular number of columns
> for indexing.  I don't want to use too many, nor too few columns.  I also
> want to optimize the nature(which atom types, bond types, etc.)
> of the count columns.  While I could do this
> and use the speedup as the measure of success, I think
> that if my B-tree were "covering" the data well, I would get the best results.
> Covering means finding that optimal situation where there is not one index for all rows
> and also not a unique index for every row - something inbetween would be ideal,
> or is that basically a wrong idea?
I believe it depends on the cost of executing one oe_matches call -
according to you, that cost is quite heavy, so every call you can
omit by adding more columns to the index is worth the performance
penalty.

If you need fast insertion of data as well as fast searches, you could
add new rows to a "pending" table that is not indexed, and union the
searches over both tables. At low-traffic times (say, at night), you
could move the data from the "pending" table to the "structure" table.
This of course only works if the number of tuples inserted between two
move-runs is much smaller than the number of tuples in "structure".
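
A rough sketch of that scheme (table names assumed from your examples):

    -- unindexed staging table with the same columns as "structure"
    create table pending (like structure);

    -- searches consult both tables
    select smiles from structure
     where (_c, _n, _o) >= (3, 1, 1)
       and oe_matches(smiles, 'c1ccccc1CC(=O)NC')
    union all
    select smiles from pending
     where oe_matches(smiles, 'c1ccccc1CC(=O)NC');

    -- nightly move, in one transaction
    begin;
    insert into structure select * from pending;
    delete from pending;
    commit;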

greetings, Florian Pflug

