Re: testing/predicting optimization using indexes - Mailing list pgsql-sql

From PFC
Subject Re: testing/predicting optimization using indexes
Date
Msg-id opsk8ft0qzth1vuj@musicbox
In response to Re: testing/predicting optimization using indexes  (TJ O'Donnell <tjo@acm.org>)
List pgsql-sql
> Finally, I built the table with all the additional columns created
> during the initial creation of the table.  The original speed was  
> obtained!
Quite strange! Did you VACUUM FULL? ANALYZE? Did you set a default value for the
columns? Hmm... maybe it's not the fact of adding the columns, but the
fact of filling them with values, which bloats the table if your fsm
setting is too small? Try VACUUM VERBOSE; good luck parsing the results ;)
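The checks above could be sketched like this (a minimal sketch; "structure" is the table name from this thread, and the fsm settings live in postgresql.conf as max_fsm_pages / max_fsm_relations):

```sql
-- Rebuild the table compactly and refresh planner stats in one go:
VACUUM FULL VERBOSE ANALYZE structure;

-- If the new columns were backfilled with UPDATEs, every updated row
-- leaves a dead tuple behind; the VERBOSE output shows how many pages
-- and tuples were reclaimed, which hints at whether fsm was too small.
```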

> The secondary issue was one of using an index on the additional columns.
> This greatly speeds up the overall search, by limiting the number of
> rows needing to use oe_matches.  I am currently working on optimizing the
> number and nature of these extra columns.  However, my initial question
> still remains.  Once I find a good set of columns to use as an index,
> will I then get even greater speed by defining a new data type and an
> index method equivalent to my multi-column index?
You'll know that by counting the rows matched by the pre-filter (your
columns) and the rows actually matched by oe_match; the difference is the
number of oe_match calls you saved. Then look at the mean time per
oe_match call...
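For instance, something along these lines (the pre-filter columns atoms and rings are hypothetical stand-ins for whatever columns you end up deriving):

```sql
-- Rows that get past the pre-filter:
SELECT count(*) FROM structure
 WHERE atoms >= 10 AND rings >= 1;

-- Rows that actually match:
SELECT count(*) FROM structure
 WHERE atoms >= 10 AND rings >= 1
   AND oe_matches(smiles, 'c1ccccc1CC(=O)NC');

-- The difference between the two counts is the number of
-- wasted oe_matches calls for that query.
```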

> SELECT count(*) FROM structure
> 237597
>
> SELECT avg(length(smiles)) FROM structure
> 37.6528912402092619
Well, your rows have 26 bytes header + then about 45 bytes of TEXT, and 4  
bytes per integer column... I don't think the bytes spent in your columns  
are significant... They could have been if your smiles string had been  
shorter.

> EXPLAIN ANALYZE SELECT oe_matches(smiles,'c1ccccc1CC(=O)NC') FROM   
> structure
> Seq Scan on structure  (cost=0.00..7573.96 rows=237597 width=41) (actual  
> time=17.443..15025.974 rows=237597 loops=1)
> Total runtime: 16786.542 ms
>
> EXPLAIN ANALYZE SELECT smiles FROM structure
> Seq Scan on structure  (cost=0.00..6979.97 rows=237597 width=41) (actual  
> time=0.067..735.884 rows=237597 loops=1)
> Total runtime: 1200.661 ms

OK, so it takes 1.2 secs to actually read the data, and 16.8 secs to run
oe_match... so a call is about 65 microseconds. Note that this time
could depend a lot on the smiles column and also on the query string!
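The back-of-envelope arithmetic, using the timings from the EXPLAIN ANALYZE output above:

```sql
-- (16786.542 ms total - 1200.661 ms to just read the data) / 237597 rows
SELECT (16786.542 - 1200.661) / 237597 AS ms_per_call;
-- about 0.066 ms, i.e. roughly 65 microseconds per oe_matches call
```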
What you need now is to estimate the selectivity of your pre-filtering
columns, to be able to select the best possible ones: for various
smiles queries, compute the row count that gets past the filter, and the
row count that actually matches oe_match. Ideally you want the first
to be as close as possible to the second, but for your test query, since you
return 0.5% of the table, even an inefficient pre-filter which lets
10% of the rows through would yield a 10x speed improvement. You'd want to
get below the 2-3% bar so that postgres will use an index scan, which will
be even faster. Don't forget to do a sanity check that all the rows that
match your smiles query also match your columns filter!
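The sanity check could look like this (again with the hypothetical pre-filter columns atoms and rings standing in for yours):

```sql
-- Must return 0: no true match may be rejected by the pre-filter.
SELECT count(*) FROM structure
 WHERE oe_matches(smiles, 'c1ccccc1CC(=O)NC')
   AND NOT (atoms >= 10 AND rings >= 1);
```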
Also, using several columns (say a, b, c, d) is not optimal. Say a, b, c, d
each contain integers between 0 and 10 with a linear distribution; then a
query starting with 'a>=0' will automatically match more than 90% of the
data and not use the index, so you'll get a seq scan. So, either you can
always make your first column very selective, or you'll have to use a gist
index and integer arrays.
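A sketch of the gist route, assuming you fold your feature counts into a single int[] column (here called "features", a hypothetical name) and have the contrib/intarray module installed; the array "contains" operator can then use the index no matter which feature is the most selective one:

```sql
-- GiST index over the integer array, using intarray's opclass:
CREATE INDEX structure_features_gist
    ON structure USING gist (features gist__int_ops);

-- Pre-filter on any combination of features through the one index:
SELECT smiles FROM structure WHERE features @> '{3,7}';
```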
If you get times that you like, then you're done; else there may be
another path for optimization, getting your hands dirty in the code, but
not to the point of creating index types:
You'll have noticed that the 'c1ccccc1CC(=O)NC' string gets reparsed for
every processed row. You should benchmark how much time is lost in this
parsing. You probably won't be able to do this with postgres (maybe
matching 'c1ccccc1CC(=O)NC' against an empty smiles string?), so you may
have to call the C++ functions directly.
If this time is significant, you might want to create a datatype which
holds a compiled query string. You'll have to write a few C
functions for that (don't ask me), but it should be a lot simpler than
coding a new index type. Then you'd create a special version of oe_match
which takes a precompiled query string. Depending on the time
necessary to parse it, it may work.
