Thread: testing/predicting optimization using indexes
I have several questions regarding the kind of speedup I can expect when I use a multi-column index. Here's what I've done so far.

I've written some search functions which operate on character varying data used to represent molecular structures. We call this a SMILES string. I want to optimize the search using an index. As a test, I've created 9 integer columns in the table containing atom counts, e.g. number of carbon atoms, oxygen, aromatic carbon, etc., and then made a multi-column index on them. Here are some sample times.

1. When the table contains only smiles, without the 9 integer columns or the index:

   Select count(smiles) from structure where oe_matches(smiles,'c1ccccc1CC(=O)NC');
   1313 rows in about 15 seconds.

2. When the table contains smiles plus the 9 integer columns and the index, but the query does not reference them:

   Select count(smiles) from structure where oe_matches(smiles,'c1ccccc1CC(=O)NC');
   1313 rows in about 20 seconds.

3. When the table contains smiles plus the 9 integer columns and the index, and the query uses them as a pre-filter:

   Select smiles,id from structure where
     (nc,nn,no,ns,"n-arom-c","n-arom-n","n-arom-o","n-arom-s",nhalo) >= (3,1,1,0,6,0,0,0,0)
     and oe_matches(smiles,'c1ccccc1CC(=O)NC');
   1313 rows in about 7 seconds.

I'm quite happy with the speedup in 3, but puzzled over the slowdown in 2. Here are my questions.

1. Why does the search slow down after I've created the extra columns and index, even when the SQL doesn't ask to use the index, as in 2?

2. Since I get such a nice speedup in 3, should I go to the trouble of creating a new datatype (smiles) and defining how it should be indexed, in a way analogous to the 9 integer columns? In other words, could I expect an even greater speedup using a new datatype and index?

Thanks,
TJ
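[For reference, the atom-count columns and the multi-column index used in case 3 would have been created along these lines. The DDL below is a reconstruction, not taken from the original post; the column names come from the query in case 3, and the id/smiles column types are assumed.]

  CREATE TABLE structure (
      id      integer,
      smiles  character varying,
      nc integer, nn integer, no integer, ns integer,
      "n-arom-c" integer, "n-arom-n" integer,
      "n-arom-o" integer, "n-arom-s" integer,
      nhalo integer
  );

  CREATE INDEX structure_counts_idx ON structure
      (nc, nn, no, ns, "n-arom-c", "n-arom-n", "n-arom-o", "n-arom-s", nhalo);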
> I'm quite happy with the speedup in 3, but puzzled over the slowdown
> in 2.

Could you provide:

- SELECT count(*) FROM structure;            => NRows
- SELECT avg(length(smiles)) FROM structure;

Then VACUUM FULL ANALYZE structure, redo your timings, and this time post the EXPLAIN ANALYZE output.

Also, your query returns 1313 rows, so can you post:

EXPLAIN ANALYZE SELECT oe_matches(smiles,'c1ccccc1CC(=O)NC') FROM structure;
=> time T1
EXPLAIN ANALYZE SELECT smiles FROM structure;
=> time T2

(T1-T2)/NRows will give you an estimate of the time spent in each oe_matches call.

Also note that for postgres (a,b) > (c,d) means ((a>c) and (b>d)), which can be misleading, but I think that's what you wanted.
I was puzzled as to why my search slowed down when I added columns. The VACUUM did not restore the former speed, which I had obtained before adding the columns. So, I rebuilt the table with only the smiles column and my original speed was again obtained (not surprising). After I added the extra columns, it slowed down again. Finally, I built the table with all the additional columns created during the initial creation of the table. The original speed was obtained!

I conclude that adding columns after all the rows of a table have been built somehow makes access to the table less efficient. Is this generally true? Is there a more efficient way to add columns to a table after its initial construction?

The secondary issue was one of using an index on the additional columns. This greatly speeds up the overall search, by limiting the number of rows that need to call oe_matches. I am currently working on optimizing the number and nature of these extra columns. However, my initial question still remains: once I find a good set of columns to use as an index, will I then get even greater speed by defining a new data type and an index method equivalent to my multi-column index?

Here are the data you requested. I think this is less important now that I know I should create all my columns from the beginning. Thanks for the tip on how to compute the average time spent in my oe_matches function; this will be very useful for future optimization.

SELECT count(*) FROM structure;
 237597

SELECT avg(length(smiles)) FROM structure;
 37.6528912402092619

VACUUM FULL ANALYZE structure;
 (no output)

EXPLAIN ANALYZE SELECT oe_matches(smiles,'c1ccccc1CC(=O)NC') FROM structure;
 Seq Scan on structure  (cost=0.00..7573.96 rows=237597 width=41) (actual time=17.443..15025.974 rows=237597 loops=1)
 Total runtime: 16786.542 ms

EXPLAIN ANALYZE SELECT smiles FROM structure;
 Seq Scan on structure  (cost=0.00..6979.97 rows=237597 width=41) (actual time=0.067..735.884 rows=237597 loops=1)
 Total runtime: 1200.661 ms

TJ

PFC wrote:
>
>> I'm quite happy with the speedup in 3, but puzzled over the slowdown
>> in 2.
>
> Could you provide :
>
> - SELECT count(*) FROM structure;
> => NRows
> - SELECT avg(length(smiles)) FROM structure;
>
> Then VACUUM FULL ANALYZE structure
> Redo your timings and this time post EXPLAIN ANALYZE
>
> Also your query returns 1313 rows, so can you post :
>
> EXPLAIN ANALYZE SELECT oe_matches(smiles,'c1ccccc1CC(=O)NC') FROM
> structure;
> => time T1
> EXPLAIN ANALYZE SELECT smiles FROM structure;
> => time T2
>
> (T1-T2)/(NRows) will give you an estimate of the time spent in each
> oe_matches call.
>
> Also note that for postgres (a,b) > (c,d) means ((a>c) and (b>d)),
> which can be misleading, but I think that's what you wanted.
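[Plugging these timings into the (T1-T2)/NRows estimate from the previous message gives a rough per-call figure; this back-of-the-envelope calculation is an editorial addition, not from the original post:]

  (16786.542 ms - 1200.661 ms) / 237597 rows  ~=  0.0656 ms  ~=  66 microseconds per oe_matches call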
> Finally, I built the table with all the additional columns created
> during the initial creation of the table. The original speed was
> obtained!

Quite strange! Did you vacuum full? Analyze? Did you set a default value for the columns? Hmm... maybe it's not the fact of adding the columns, but the fact of filling them with values afterwards, which screws up the vacuum if your fsm setting is too small. Try VACUUM VERBOSE; good luck parsing the results ;)

> The secondary issue was one of using an index on the additional
> columns. This greatly speeds up the overall search, by limiting the
> number of rows needing to use oe_matches. I am currently working on
> optimizing the number and nature of these extra columns. However, my
> initial question still remains. Once I find a good set of columns to
> use as an index, will I then get even greater speed by defining a new
> data type and an index method equivalent to my multi-column index?

You'll know that by counting the rows matched by the pre-filter (your columns) and the rows actually matched; the difference is the number of oe_match calls you saved, and you can then multiply by the mean time of an oe_match call...

> SELECT count(*) FROM structure
> 237597
>
> SELECT avg(length(smiles)) FROM structure
> 37.6528912402092619

Well, your rows have a 26-byte header plus about 45 bytes of TEXT, and 4 bytes per integer column... I don't think the bytes spent in your extra columns are significant. They could have been, had your smiles strings been shorter.

> EXPLAIN ANALYZE SELECT oe_matches(smiles,'c1ccccc1CC(=O)NC') FROM structure
> Seq Scan on structure (cost=0.00..7573.96 rows=237597 width=41)
> (actual time=17.443..15025.974 rows=237597 loops=1)
> Total runtime: 16786.542 ms
>
> EXPLAIN ANALYZE SELECT smiles FROM structure
> Seq Scan on structure (cost=0.00..6979.97 rows=237597 width=41)
> (actual time=0.067..735.884 rows=237597 loops=1)
> Total runtime: 1200.661 ms

OK, so it takes 1.2 seconds to actually read the data, and 16.8 seconds to run oe_match, so a call costs about 65 microseconds. Note that this time could depend a lot on the smiles column and also on the query string!

What you need now is to estimate the selectivity of your pre-filtering columns, so you can select the best possible columns: for various smiles queries, compute the row count that gets past the filter and the row count that actually matches oe_match. Ideally you want the first to be as close as possible to the second; but since your test query returns about 0.5% of the table, even an inefficient pre-filter that lets 10% of the rows through would yield a 10x speed improvement. You'd want to get below the 2-3% bar so that postgres will use an index scan, which will be even faster. Don't forget to do a sanity check that all the rows that match your smiles query also match your columns filter!

Also, using several columns (say a,b,c,d) is not optimal. Say a,b,c,d each contain integers between 0 and 10 with a linear distribution; then a query whose leading condition is 'a>=0' will automatically match more than 90% of the data and not use the index, so you'll get a seq scan. So either you can always make your first column very selective, or you'll have to use a gist index and integer arrays.

If you get times that you like, then you're done; otherwise there may be another path for optimization, getting your hands dirty in the code, but not to the point of creating index types. You'll have noted that the 'c1ccccc1CC(=O)NC' string gets reparsed for every processed row.
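[As a concrete illustration of the selectivity measurement and sanity check described above, using the test query from the first message; a sketch only, and the counts it would return are not part of the original thread:]

  -- rows that pass the pre-filter alone:
  SELECT count(*) FROM structure
   WHERE (nc,nn,no,ns,"n-arom-c","n-arom-n","n-arom-o","n-arom-s",nhalo)
         >= (3,1,1,0,6,0,0,0,0);

  -- rows that actually match (already known to be 1313 for this query):
  SELECT count(*) FROM structure
   WHERE oe_matches(smiles,'c1ccccc1CC(=O)NC');

  -- sanity check: true matches rejected by the pre-filter (should be 0):
  SELECT count(*) FROM structure
   WHERE oe_matches(smiles,'c1ccccc1CC(=O)NC')
     AND NOT ((nc,nn,no,ns,"n-arom-c","n-arom-n","n-arom-o","n-arom-s",nhalo)
              >= (3,1,1,0,6,0,0,0,0));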
You should benchmark how much time is lost in this parsing. You probably won't be able to do this from within postgres (maybe by matching 'c1ccccc1CC(=O)NC' against an empty smiles string?), so you may have to call the C++ functions directly.

If this time is significant, you might want to create a datatype which holds a compiled query string. You'd have to write a few C functions for that (don't ask me), but it should be a lot simpler than coding a new index type. Then you'd create a special version of oe_match which takes a precompiled query string. Depending on the time necessary to parse it, it may work.
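[One rough way to attempt that measurement from inside postgres, following the empty-string idea above; this is a sketch that assumes oe_matches accepts an empty smiles string and simply fails to match rather than raising an error:]

  -- Force one oe_matches call per row with an empty target string.
  -- Deriving the empty string from the row (substr of smiles) avoids the
  -- planner folding the call into a constant if oe_matches is marked IMMUTABLE.
  EXPLAIN ANALYZE
  SELECT oe_matches(substr(smiles, 1, 0), 'c1ccccc1CC(=O)NC') FROM structure;

Comparing its runtime against the plain SELECT smiles FROM structure baseline (about 1.2 seconds above) would give a rough upper bound on the per-row cost of parsing the query string.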