Re: testing/predicting optimization using indexes - Mailing list pgsql-sql
From: PFC
Subject: Re: testing/predicting optimization using indexes
Msg-id: opsk8ft0qzth1vuj@musicbox
In response to: Re: testing/predicting optimization using indexes (TJ O'Donnell <tjo@acm.org>)
List: pgsql-sql
> Finally, I built the table with all the additional columns created
> during the initial creation of the table. The original speed was
> obtained! Quite strange!

Did you VACUUM FULL? ANALYZE? Did you set a default value for the columns? Maybe it's not the fact of adding the columns, but the fact of filling them with values, which bloats the table if your FSM setting is too small. Try VACUUM VERBOSE; good luck parsing the results ;)

> The secondary issue was one of using an index on the additional columns.
> This greatly speeds up the overall search, by limiting the number of
> rows needing to use oe_matches. I am currently working on optimizing the
> number and nature of these extra columns. However, my initial question
> still remains. Once I find a good set of columns to use as an index,
> will I then get even greater speed by defining a new data type and an
> index method equivalent to my multi-column index?

You'll know by counting the rows matched by the pre-filter (your columns) and the rows actually matched; the difference is the number of calls to oe_matches you saved. Then look at the mean time per oe_matches call...

> SELECT count(*) FROM structure
> 237597
>
> SELECT avg(length(smiles)) FROM structure
> 37.6528912402092619

Well, your rows have a 26-byte header, then about 45 bytes of TEXT, and 4 bytes per integer column... I don't think the bytes spent in your extra columns are significant. They could have been, had your smiles strings been shorter.

> EXPLAIN ANALYZE SELECT oe_matches(smiles,'c1ccccc1CC(=O)NC') FROM
> structure
> Seq Scan on structure (cost=0.00..7573.96 rows=237597 width=41) (actual
> time=17.443..15025.974 rows=237597 loops=1)
> Total runtime: 16786.542 ms
>
> EXPLAIN ANALYZE SELECT smiles FROM structure
> Seq Scan on structure (cost=0.00..6979.97 rows=237597 width=41) (actual
> time=0.067..735.884 rows=237597 loops=1)
> Total runtime: 1200.661 ms

OK, so it takes 1.2 seconds to actually read the data and 16.8 seconds to read it and run oe_matches; the difference, about 15.6 seconds over 237597 rows, works out to roughly 65 microseconds per call. Note that this time could depend a lot on the smiles column and also on the query string!

What you need now is to estimate the selectivity of your pre-filtering columns, to be able to select the best possible columns: for various smiles queries, compute the row count which gets past the filter and the row count which actually matches oe_matches. Ideally you want the first to be as close as possible to the second, but for your test query, as you return only 0.5% of the table, even an inefficient pre-filter which let 10% of the rows through would yield a 10x speed improvement. You'd want to get below the 2-3% bar so that postgres will use an index scan, which will be even faster. Don't forget a sanity check that all the rows which match your smiles query also match your columns filter! (See the first sketch below.)

Also, using several columns (say a, b, c, d) is not optimal. Say a, b, c, d each contain integers between 0 and 10, uniformly distributed; then a query whose first clause is 'a>=0' automatically matches more than 90% of the data, so the index won't be used and you'll get a seq scan. So either you can always make your first column very selective, or you'll have to use a GiST index and integer arrays (see the second sketch below).
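A rough sketch of the selectivity measurement, assuming oe_matches returns a boolean; the pre-filter condition on columns a and b is made up, so substitute your own:

    -- rows let through by the pre-filter
    SELECT count(*) FROM structure WHERE a >= 2 AND b >= 1;

    -- rows that really match
    SELECT count(*) FROM structure
    WHERE oe_matches(smiles, 'c1ccccc1CC(=O)NC');

    -- sanity check: true matches that the pre-filter would wrongly
    -- exclude; this must return 0
    SELECT count(*) FROM structure
    WHERE oe_matches(smiles, 'c1ccccc1CC(=O)NC')
      AND NOT (a >= 2 AND b >= 1);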
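For the integer-array route, a minimal sketch using the contrib/intarray module; the integer[] column fp holding your pre-filter values is hypothetical:

    ALTER TABLE structure ADD COLUMN fp integer[];
    CREATE INDEX structure_fp_idx ON structure USING gist (fp gist__int_ops);

    -- '@' is intarray's "contains" operator: the row's array must
    -- contain all the listed values
    SELECT count(*) FROM structure WHERE fp @ '{3,7}'::integer[];

Unlike a multi-column btree, this stays indexable no matter which of your pre-filter values appear in the query.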
If you get times that you like, then you're done; else there may be another path for optimization, getting your hands dirty in the code, but not to the point of creating index types.

You'll have noticed that the 'c1ccccc1CC(=O)NC' string gets reparsed for every processed row. You should benchmark how much time is lost in this parsing. You probably won't be able to do this within postgres (maybe by matching 'c1ccccc1CC(=O)NC' against an empty smiles string?), so you may have to call the C++ functions directly. If this time is significant, you might want to create a datatype which holds a compiled query string. You'll have to write a few C functions for that (don't ask me), but it should be a lot simpler than coding a new index type. Then you'd create a special version of oe_matches which takes a precompiled query string. Depending on the time necessary to parse it, it may work.
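A very rough sketch of what the SQL-level declarations could look like; the type name smiles_query and the C functions behind it are all hypothetical, and the C code itself is left as the hard part:

    CREATE FUNCTION smiles_query_in(cstring) RETURNS smiles_query
        AS 'MODULE_PATHNAME' LANGUAGE C STRICT;
    CREATE FUNCTION smiles_query_out(smiles_query) RETURNS cstring
        AS 'MODULE_PATHNAME' LANGUAGE C STRICT;

    CREATE TYPE smiles_query (
        INPUT = smiles_query_in,    -- parses/compiles the query string
        OUTPUT = smiles_query_out,
        INTERNALLENGTH = VARIABLE
    );

    -- oe_matches variant taking the precompiled query
    CREATE FUNCTION oe_matches(text, smiles_query) RETURNS boolean
        AS 'MODULE_PATHNAME' LANGUAGE C STRICT;

The constant would then be compiled once when the statement is parsed, instead of once per row:

    SELECT count(*) FROM structure
    WHERE oe_matches(smiles, 'c1ccccc1CC(=O)NC'::smiles_query);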