Re: visualizing B-tree index coverage - Mailing list pgsql-general
From | TJ O'Donnell |
---|---|
Subject | Re: visualizing B-tree index coverage |
Date | |
Msg-id | 1540.209.223.166.104.1106697005.squirrel@gnova.com Whole thread Raw |
In response to | Re: visualizing B-tree index coverage ("Dann Corbit" <DCorbit@connx.com>) |
Responses |
Re: visualizing B-tree index coverage
Re: visualizing B-tree index coverage Re: visualizing B-tree index coverage |
List | pgsql-general |
Since I'm using a multi-column index, I can greatly influence the nature of the index created, depending on which columns I use and how many. I'm searching for an optimal set of columns that creates an index that, for sure does not have every value the same, nor only two values. Instead, I want to see how well I've spread the index out over the data (if that phrasing makes sense). More specifically, I have character data representing molecular structures. I've written (rather slow) search functions. I can create any number of columns that "fingerprint" each structure, e.g. # Carbon atoms, # N atoms, # single bonds, etc. I expect my fingerprints will not be unique (fingerprint may be a poor analogy), but rather will classify similar structures together. I create a multi-column index on these counts and get about 2-3 times speedup using 13 columns right now. For example: select count(smiles) from structure where oe_matches(smiles,'c1ccccc1CC(=O)NC') about 15 sec. select count(smiles) from structure where (_c, _n, _o, _s, _p, _halo, _arom_c, _arom_n, _arom_o, _arom_s, _atoms, _single_bonds, _other_bonds) >= ( 3,1,1,0,0,0, 6,0,0,0, 11,4,7 ) and oe_matches(smiles,'c1ccccc1CC(=O)NC') about 6 seconds when the (_c, etc.) is a multi-column index. The data isn't inherently structured in any way that invites some particular number of columns for indexing. I don't want to use too many, nor too few columns. I also want to optimize the nature(which atom types, bond types, etc.) of the count columns. While I could do this and use the speedup as the measure of success, I think that if my B-tree were "covering" the data well, I would get the best results. Covering means finding that optimal situation where there is not one index for all rows and also not a unique index for every row - something inbetween would be ideal, or is that basically a wrong idea? TJ > Useful explanation of PostgreSQL index format: > http://www.faqs.org/docs/ppbook/c13329.htm > > I think you are aiming for the wrong thing. > The worst possible index is one with every value the same. > The second worst (still basically useless) is one with only two values. The greater the > differentiation of the data, the more workload is > reduced on a search. > > Since it isn't a straight binary tree, I don't think that having highly dissimilar data in the > index should be a problem. > > Do you have data or experience that shows otherwise? > > -----Original Message----- > From: pgsql-general-owner@postgresql.org > [mailto:pgsql-general-owner@postgresql.org] On Behalf Of TJ O'Donnell Sent: Tuesday, January 25, > 2005 2:19 PM > To: pgsql-general@postgresql.org > Cc: tjo@acm.org > Subject: [GENERAL] visualizing B-tree index coverage > > Does anyone know of a tools that allows one to visualize > the tree created by a multi-column B-tree index? > A picture of a tree with branches, showing how "branchy" the > tree is would be great. > I'm wondering how well I've "clustered" the data in my table > using the multi-column index. In other words, do my > multi-columns sufficiently but not overly discriminate rows from each other? > Do I have too many with the same index? (not enough branches) > Do I have a unique index for each row? (way too many branches) > > Thanks, > TJ > > > > ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to > increase your free space map settings
pgsql-general by date: