Re: [GENERAL] visualizing B-tree index coverage - Mailing list pgsql-hackers

From Dann Corbit
Subject Re: [GENERAL] visualizing B-tree index coverage
Date
Msg-id D425483C2C5C9F49B5B7A41F89441547055869@postal.corporate.connx.com
Whole thread Raw
List pgsql-hackers
Speaking of bitmapped indexes for PostgreSQL ...

Someone is working on adding bitmapped indexes into PostgreSQL:
http://new.foo-baz.com/en/index.php/archives/2004/12/06/id/25/

This also is interesting:
http://www.it.iitb.ac.in/~rvijay/dbms/proj/

-----Original Message-----
From: TJ O'Donnell [mailto:tjo@acm.org]
Sent: Wednesday, January 26, 2005 9:46 AM
To: Dann Corbit
Subject: Re: [GENERAL] visualizing B-tree index coverage

My indexed columns will never be used in joins with other tables.
They don't really represent useful data at all!
Instead, they are used only behind the scenes as an incomplete,
all-integer representation of the real string data.
Their purpose is to quickly (hence the index)
come up with a set of rows that actually needs to be searched using the
slower full search.  For example, there will be many molecules that
contain >17 carbons, >4 nitrogens, >3 oxygens, >22 single bonds, etc.
but probably no more than 25%.  So a slow search of 25% of the rows
is a 4x speedup (neglecting the cost of the index scan in the first
place).
My use of the index is also rarely for an exact lookup, where I'd want
to find compounds with exactly 17C, 4N, 3O, etc. atoms.  Instead,
it is >= in 99% of the cases, since I am looking for structures which
contain my substructure.

It might be similar to an ordinary string search where an index is
created
not just on the initial letter of each word, but also the frequency of
vowels, consonants, number of doubled-letters, etc. within each word.
But in my case, the letters(atoms) are not always in a fixed order
within
the word(molecule).

I think my goal is to find an optimal set of columns.  There will always
be
pathological cases where the indexing is not helpful.  For example, if
the user searches for 'C', that means any structure containing at least
one carbon atom.  That is well over 95% of the database.  Indexing won't
help there.  These types of searches are silly, really, and will
not ever be seen in the real world.

So, the more complex my substructure pattern is, the fewer rows I'll
have
to search....that is if my multi-column index sufficiently represents
the
complexity of the data.

TJ


Dann Corbit wrote:
> It is certainly an interesting problem and it does sound a bit
atypical.
>
> Two things are important for the indexes:
> 1.  Join columns.
> 2.  Where clause references.
>
> For join columns, if there is an index on all the columns in a join,
or
> on the most significant columns in a join it will significantly impact
> performance.
>
> For instance,
> SELECT * from a, b
> WHERE a.c1 = b.c1 and a.c2 = b.c2 and a.c3 = b.c3 and a.c5 = b.c5
>
> If we have an index on either of the tables on
> (c5, c3, c4, c1)
> Then the index will seek to c5,c3 and greatly reduce the work.  Since
c4
> is missing, that is the end of the value and if c4 were added to the
> index it would become more effective.
>
> For a where clause, if the most significant columns in an index are
> contained in the where clause, then the index will be used.
>
> For example, if I have a table a with an index on (c1, c2, c3, c4)
> And I have a query:
> SELECT * FROM a where c1 = 'Joe' AND c2 = 'Smith' and c4 =
'555-11-1234'
> Then c1 and c2 will be used from the index to filter the search
greatly.
> But this query:
> SELECT * FROM a where c2 = 'Smith' and c3 = 'MS' and c4 =
'555-11-1234'
> Will not use the index at all, because c1 is the most significant
column
> of the index and it is not used in the where clause.
>
> -----Original Message-----
> From: TJ O'Donnell [mailto:tjo@acm.org]
> Sent: Tuesday, January 25, 2005 7:28 PM
> To: Dann Corbit
> Subject: Re: [GENERAL] visualizing B-tree index coverage
>
>
> I think I am using indexes differently than most people.
> For example, there will NEVER be a search directly on
> an indexed column (or columns).  It will only be used in
> conjunction with a search on the smiles/structure column
> to aid in eliminating rows which could never match using the
> true match function.  And I will ALWAYS use
> every column of the multi-column index, in "left to right"
> order, when I use the index at all.  If some counts are 0,
> for example 0 nitrogen atoms, that is as useful a guide as
> a number > 0, so my index will be something like (1,2,0,0,0,8,4,...)
>
> This seems to argue for my using one multi-column index
> or reasonable (15-20?) size.  Using 13 helps lots.  Using
> 2 does not help much.  Aside from that, I guess I'll have to
> experiment more.
>
> Thanks for your insights, Dann.
>
> TJ
>
>
> Dann Corbit wrote:
>
>>-----Original Message-----
>>From: TJ [mailto:tjo@acm.org]
>>Sent: Tuesday, January 25, 2005 5:55 PM
>>To: Dann Corbit
>>Subject: Re: [GENERAL] visualizing B-tree index coverage
>>
>>Is it highly desirable to have as
>>few rows as possible with identical index, even if I'm not using an
>>index for a unique lookup (or at least very rarely doing that)?
>>
>>There, I cannot say.  I would benchmark more than one approach.
>><<
>>
>>I understand the great advantage of a unique index when a lookup
>>of one particular row is desired.  Most of my searches, using my
>>oe_matches functions are expected to return many rows which
>>match, kind of like a regexp string match or even a LIKE, but
>>more complex for molecular structures.  I am basically counting
>>on the index to rule out using the expensive search for as many
>>rows as possible, yet include all rows that would match after
>>an exhaustive search of the whole table.
>>
>>Unfortuately, I don't know which queries my user's run most often.
>>Think of it, again, like a regexp search.  You can't tell what kinds
>>of words or sub-words a user might be interested in - in my case
>>molecular fragments or substructures.
>>
>>I think I should expand my multi-column index up to the limit
>>of 32 columns as a limiting case.  This would almost surely make
>>a unique index for each row.  Then if my timings are much improved,
>>I could go with that.
>>
>>If you have to go to 32 columns, I doubt if there will be much
>
> benefit.
>
>>The staggering size of an index entry will be too much.
>>
>>If you can add one or two columns to what you have, that would be
>>better, perhaps.  But the extra columns will help if and only if they
>>are actually used in the query and if all the preceding columns are
>
> also
>
>>used.
>><<
>>
>>With my current 13 column index, in a 250,000 row table, there are
>>193 rows with an identical index.  That is the maximum set of rows
>>with identical index.  There are many other sets, most with about
>>10 rows having identical index.  Again, is it highly desirable to have
>>as
>>few rows as possible with identical index, even if I'm not using an
>>index for a unique lookup (or at least very rarely doing that)?
>>
>>If you have to add a large number of columns, it will probably be
>
> worse
>
>>than not doing it.
>><<
>>
>>As for updates and deletes, these are relatively rare.  The db
>>would grow (INSERT) by less than 1% per day.  There would be virtually
>>no
>>DELETES, perhaps some UPDATES.  The inserts would typically happen
>>overnight and could be coordinates with a re-indexing if needed.
>>Even dropping the index and re-creating it would be workable.
>>I expect 1-3 million rows would be the maximum.
>>
>>This is important, then.
>>
>>The most important column in any index is the first column.  So, for
>>every column that you think will receive a large volume of
>
> transactions
>
>>it might be good to add a single column index.  And if you know of
>
> some
>
>>columns that will be used very frequently in combination in queries,
>
> it
>
>>would be good to make multiple (but shorter) indexes on these.
>>
>>If you create a large, compound index, and you do not include the
>
> first
>
>>column of the index in your query, then the index is useless.
>>
>>So (for instance) if I made an index on (protons, neutrons) and
>
> someone
>
>>made a query:
>>SELECT * FROM atoms WHERE neutrons = 127
>>The index would not be used at all.
>>
>>Therefore, a number of auxiliary indexes may turn out to be a nice
>>benefit for performance.
>>
>>Probably, the statistics collector will tell you which ones are the
>
> most
>
>>valuable over time.
>>
>>So maybe, when you are done, you will have one index with 15 columns,
>
> 5
>
>>indexes with 5 columns, 8 indexes with 3 columns, and 10 indexes with
>
> 2
>
>>columns.  As long as the transaction load for updates is light, a huge
>>number of indexes would not be too painful.   But projections,
>>experiments and measurements are the only real way to know what is
>
> best.
>
>><<


pgsql-hackers by date:

Previous
From: Darcy Buskermolen
Date:
Subject: cvs TIP, tsearch2 and Solaris 8 Sparc
Next
From: Tom Lane
Date:
Subject: Re: cvs TIP, tsearch2 and Solaris 8 Sparc