Thread: testing/predicting optimization using indexes

testing/predicting optimization using indexes

From: TJ O'Donnell
Date:
I have several questions regarding the kind of increase in speed I can
expect when I use a multi-column index.  Here's what I've done so far.

I've written some search functions which operate on character varying
data used to represent molecular structures.  We call this a Smiles string.
I want to optimize the search using an index.  As a test, I've created
9 integer columns in the table containing atom counts, e.g.
number of carbon atoms, oxygen, aromatic carbon, etc.
I then made a multi-column index.  Here are some sample times:

1. When the table contains only smiles, no 9 integer columns and no index:
Select count(smiles) from structure where oe_matches(smiles,'c1ccccc1CC(=O)NC');
1313 rows in about 15 seconds.

2. When the table contains smiles plus the 9 integer columns and an index on them,
but the query does not reference them:
Select count(smiles) from structure where oe_matches(smiles,'c1ccccc1CC(=O)NC');
1313 rows in about 20 seconds.

3. Same table and index as in 2, with the query also filtering on the indexed columns:
Select smiles,id from structure where (nc,nn,no,ns,"n-arom-c","n-arom-n","n-arom-o","n-arom-s",nhalo) >=
(3,1,1,0,6,0,0,0,0)
and oe_matches(smiles,'c1ccccc1CC(=O)NC');
1313 rows in about 7 seconds.
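
(The index definition itself isn't shown here; a multi-column index along
these lines, using the column names from query 3 and a made-up index name,
is assumed:)

CREATE INDEX structure_atom_counts_idx ON structure
    (nc, nn, no, ns, "n-arom-c", "n-arom-n", "n-arom-o", "n-arom-s", nhalo);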

I'm quite happy with the speedup in 3, but puzzled over the slowdown in 2.
Here are my questions.
1. Why does the search slow down after I've created the extra columns and
the index, even when I don't ask to use the index in the SQL, as in case 2?
2. Since I get such a nice speedup in 3, should I go to the trouble of
creating a new datatype (smiles) and defining how it should be indexed, in
a way analogous to the 9 integer columns?  In other words, could I expect
an even greater speedup using a new datatype and index?

Thanks,
TJ



Re: testing/predicting optimization using indexes

From: PFC
Date:
> I'm quite happy with the speedup in 3, but puzzled over the slowdown
> in 2.

    Could you provide:

    - SELECT count(*) FROM structure;
    => NRows
    - SELECT avg(length(smiles)) FROM structure;

    Then VACUUM FULL ANALYZE structure.
    Redo your timings and this time post the EXPLAIN ANALYZE output.

    Also, your query returns 1313 rows, so can you post:

EXPLAIN ANALYZE SELECT oe_matches(smiles,'c1ccccc1CC(=O)NC') FROM  
structure;
=> time T1
EXPLAIN ANALYZE SELECT smiles FROM structure;
=> time T2

(T1-T2)/(NRows) will give you an estimate of the time spent in each  
oe_matches call.
Also note that for postgres (a,b) > (c,d) means ((a>c) and (b>d)), which  
can be misleading, but I think that's what you wanted.
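For example, under that interpretation (2,5) > (1,9) comes out as
((2>1) and (5>9)), i.e. false, whereas the SQL-standard row comparison is
lexicographic and would call it true. With >= on every column, the
pairwise behaviour is exactly the "each count at least this much" filter
used in query 3.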


Re: testing/predicting optimization using indexes

From: TJ O'Donnell
Date:
I was puzzled as to why my search slowed down when I added columns.
The VACUUM did not restore the former speed,
which I had obtained before adding the columns.
So, I rebuilt the table with only the smiles column and my original
speed was again obtained (not surprising).
After I added the extra columns, it slowed down again.
Finally, I built the table with all the additional columns created
during the initial creation of the table.  The original speed was obtained!
I conclude that the addition of columns after building all the rows of
a table somehow makes the table access less efficient.  Is this generally
true?  Is there a more efficient way to add columns to a table after its
initial construction?

The secondary issue was one of using an index on the additional columns.
This greatly speeds up the overall search, by limiting the number of
rows needing to use oe_matches.  I am currently working on optimizing the
number and nature of these extra columns.  However, my initial question
still remains.  Once I find a good set of columns to use as an index,
will I then get even greater speed by defining a new data type and an
index method equivalent to my multi-column index?

Here are the data you requested.  I think this is less important now that
I know I should create all my columns from the beginning.
Thanks for the tip on how to compute average time spent in my
oe_matches functions.  This will be very useful for future optimization.

SELECT count(*) FROM structure
237597

SELECT avg(length(smiles)) FROM structure
37.6528912402092619

VACUUM FULL ANALYZE structure
(no output)

EXPLAIN ANALYZE SELECT oe_matches(smiles,'c1ccccc1CC(=O)NC') FROM  structure
Seq Scan on structure  (cost=0.00..7573.96 rows=237597 width=41) (actual time=17.443..15025.974 rows=237597 loops=1)
Total runtime: 16786.542 ms

EXPLAIN ANALYZE SELECT smiles FROM structure
Seq Scan on structure  (cost=0.00..6979.97 rows=237597 width=41) (actual time=0.067..735.884 rows=237597 loops=1)
Total runtime: 1200.661 ms


TJ


PFC wrote:
> 
>> I'm quite happy with the speedup in 3, but puzzled over the slowdown 
>> in  2.
> 
>     Could you provide :
> 
>     - SELECT count(*) FROM structure;
>     => NRows
>     - SELECT avg(length(smiles)) FROM structure;
> 
>     Then VACUUM FULL ANALYZE structure
>     Redo your timings and this time post EXPLAIN ANALYZE
> 
>     Also your query returns 1313 rows, so can you post :
> 
> EXPLAIN ANALYZE SELECT oe_matches(smiles,'c1ccccc1CC(=O)NC') FROM  
> structure;
> => time T1
> EXPLAIN ANALYZE SELECT smiles FROM structure;
> => time T2
> 
> (T1-T2)/(NRows) will give you an estimate of the time spent in each  
> oe_matches call.
> 
>     Also note that for postgres (a,b) > (c,d) means ((a>c) and (b>d)), 
> which  can be misleading, but I think that's what you wanted.


Re: testing/predicting optimization using indexes

From: PFC
Date:
> Finally, I built the table with all the additional columns created
> during the initial creation of the table.  The original speed was  
> obtained!
    Quite strange! Did you VACUUM FULL? ANALYZE? Did you set a default
value for the columns? Mmm... maybe it's not the fact of adding the
columns, but the fact of filling them with values, which screws up the
vacuum if your fsm setting is too small? Try VACUUM VERBOSE; good luck
parsing the results ;)
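
    One way to read that hint, as a sketch (count_atoms() is a hypothetical
stand-in for however the counts were actually computed):

ALTER TABLE structure ADD COLUMN nc integer;
UPDATE structure SET nc = count_atoms(smiles, 'C');
    -- the UPDATE rewrites every row, leaving the old versions behind as
    -- dead tuples, so the table roughly doubles on disk
VACUUM VERBOSE structure;
    -- reports the dead tuples/pages; a plain VACUUM makes the space
    -- reusable but does not shrink the file
VACUUM FULL ANALYZE structure;
    -- compacts the table and refreshes the planner statistics

If only a plain VACUUM was run after filling the columns, the bloated table
could explain the slower sequential scan seen in case 2.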

> The secondary issue was one of using an index on the additional columns.
> This greatly speeds up the overall search, by limiting the number of
> rows needing to use oe_matches.  I am currently working on optimizing the
> number and nature of these extra columns.  However, my initial question
> still remains.  Once I find a good set of columns to use as an index,
> will I then get even greater speed by defining a new data type and an
> index method equivalent to my multi-column index?
    You'll know that by counting the rows matched by the pre-filter (your
columns) and counting the rows actually matched: that tells you how many
oe_match calls you saved, and how many were wasted on false positives.
Then look at the mean time for an oe_match call...
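
For the test query, those two counts could be obtained like this (column
names taken from the earlier message):

SELECT count(*) FROM structure
 WHERE (nc,nn,no,ns,"n-arom-c","n-arom-n","n-arom-o","n-arom-s",nhalo)
       >= (3,1,1,0,6,0,0,0,0);
    -- rows that get past the pre-filter, i.e. oe_matches calls still made

SELECT count(*) FROM structure
 WHERE (nc,nn,no,ns,"n-arom-c","n-arom-n","n-arom-o","n-arom-s",nhalo)
       >= (3,1,1,0,6,0,0,0,0)
   AND oe_matches(smiles,'c1ccccc1CC(=O)NC');
    -- rows that really match (1313 for this query)

Anything in the first count but not the second is a wasted oe_matches call;
anything in the 237597-row table but not in the first count is a call saved.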

> SELECT count(*) FROM structure
> 237597
>
> SELECT avg(length(smiles)) FROM structure
> 37.6528912402092619
Well, your rows have a 26-byte header, then about 45 bytes of TEXT, and 4
bytes per integer column... I don't think the bytes spent in your extra
columns are significant... They could have been, had your smiles strings
been shorter.

> EXPLAIN ANALYZE SELECT oe_matches(smiles,'c1ccccc1CC(=O)NC') FROM   
> structure
> Seq Scan on structure  (cost=0.00..7573.96 rows=237597 width=41) (actual  
> time=17.443..15025.974 rows=237597 loops=1)
> Total runtime: 16786.542 ms
>
> EXPLAIN ANALYZE SELECT smiles FROM structure
> Seq Scan on structure  (cost=0.00..6979.97 rows=237597 width=41) (actual  
> time=0.067..735.884 rows=237597 loops=1)
> Total runtime: 1200.661 ms

OK, so it takes 1.2 secs to actually read the data, and 16.8 secs to run
oe_match... so a call is about 65 microseconds... Note that this time
could depend a lot on the smiles column and also on the query string!
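
(Spelled out from the timings above: (16786.542 ms - 1200.661 ms) / 237597
rows is about 0.066 ms per row, i.e. roughly 65 microseconds per oe_matches
call.)
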
What you need now is to estimate the selectivity of your pre-filtering
columns, to be able to select the best possible ones: for various smiles
queries, compute the row count which gets past the filter and the row
count that actually matches oe_match. Ideally you want the first to be as
close as possible to the second, but for your test query, as you return
0.5% of the table, even an inefficient pre-filter which lets 10% of the
rows through would yield a 10x speed improvement. You'd want to get below
the 2-3% bar so that postgres will use an index scan, which will be even
faster. Don't forget to do a sanity check that all the rows that match
your smiles query also match your column filter!
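
A sketch of that sanity check, using the test query's filter (it should
return 0):

SELECT count(*) FROM structure
 WHERE oe_matches(smiles,'c1ccccc1CC(=O)NC')
   AND NOT ((nc,nn,no,ns,"n-arom-c","n-arom-n","n-arom-o","n-arom-s",nhalo)
            >= (3,1,1,0,6,0,0,0,0));
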
Also, using several columns (say a,b,c,d) is not optimal. Say a,b,c,d each
contain integers between 0 and 10 with a linear distribution; then a query
starting with 'a>=0' will automatically match more than 90% of the data
and not use the index. You'll get a seq scan. So, either you can always
make the condition on your first column very selective, or you'll have to
use a GiST index and integer arrays.
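
A sketch of that last idea, assuming the contrib/intarray module and a
hypothetical integer[] column "features" holding per-molecule feature
codes (a different encoding than the plain per-element counts):

CREATE INDEX structure_features_gist ON structure
    USING gist (features gist__int_ops);

SELECT smiles, id FROM structure
 WHERE features @> ARRAY[101, 205, 317]
    -- intarray "contains" pre-filter; the codes here are placeholders
   AND oe_matches(smiles,'c1ccccc1CC(=O)NC');
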
If you get times that you like, then you're done; else there may be
another path for optimization, getting your hands dirty in the code, but
not to the point of creating index types:

You'll have noted that the 'c1ccccc1CC(=O)NC' string gets reparsed for
every processed row. You should benchmark how much time is lost in this
parsing. You probably won't be able to do this with postgres (maybe by
matching 'c1ccccc1CC(=O)NC' against an empty smiles string?), so you may
have to call the C++ functions directly.

If this time is significant, you might want to create a datatype which
will contain a compiled query string. You'll have to write a few C
functions for that (don't ask me), but it should be a lot simpler than
coding a new index type. Then you'd create a special version of oe_match
which would take a precompiled query string. Depending on the time
necessary to parse it, it may work.
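
One possible form of that empty-smiles benchmark, as a sketch (it assumes
oe_matches accepts an empty smiles string):

EXPLAIN ANALYZE SELECT oe_matches('', 'c1ccccc1CC(=O)NC') FROM structure;
    -- each call still parses the pattern but has essentially nothing to
    -- match it against, so most of the per-call time should be parsing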