Re: large tables and simple "= constant" queries using indexes - Mailing list pgsql-performance
From | John Beaver |
---|---|
Subject | Re: large tables and simple "= constant" queries using indexes |
Date | |
Msg-id | 47FE27EB.8050804@gmail.com |
In response to | Re: large tables and simple "= constant" queries using indexes (Matthew <matthew@flymine.org>) |
Responses | Re: large tables and simple "= constant" queries using indexes |
List | pgsql-performance |
Thanks a lot, all of you - this is excellent advice. With the data clustered and statistics at a more reasonable value of 100, it now reproducibly takes even less time - 20-57 ms per query.

After reading the section on "Statistics Used by the Planner" in the manual, I was a little concerned that, while raising the statistics target sped up the queries I tried dramatically, the speedup was coming from the most_common_vals array, and that values which don't fit in that array wouldn't be sped up. (A sketch for inspecting that array is at the end of this message.) Though I couldn't offhand find an example where this occurred, the clustering approach seems intuitively like a much more complete and scalable solution, at least for a read-only table like this.

As to whether the entire index/table was getting into RAM between my statistics calls, I don't think this was the case. Here's the behavior that I found:

- With statistics at 10, the query took 25 (or so) seconds no matter how many times I tried different values. The query plan was the same as for the 200 and 800 statistics settings below.
- Trying the same constant a second time gave an instantaneous result, I'm guessing because of query/result caching.
- Immediately on increasing the statistics to 200, the query reproducibly took less time. I tried about 10 different values.
- Immediately on increasing the statistics to 800, the query reproducibly took less than a second every time. I tried about 30 different values.
- Decreasing the statistics to 100 and running the cluster command brought it to 57 ms per query.
- The Activity Monitor (OS X) lists the relevant postgres process as taking a little less than 500 megs.
- I didn't try decreasing the statistics back to 10 before I ran the cluster command, so I can't show the search times going up because of that. But I tried killing the 500 meg process. The new process uses less than 5 megs of RAM, and still reproducibly returns a result in less than 60 ms. Again, this is with a statistics value of 100 and the data clustered by gene_prediction_view_gene_ref_key. (A generic way to check the plan and timing for each run is sketched below as well.)

And I'll consider the idea of using triggers with an ancillary table for other purposes; it seems like it could be a useful solution for something. (A rough sketch of that pattern follows too.)

Matthew wrote:
> On Thu, 10 Apr 2008, PFC wrote:
>
> ... Lots of useful advice ...
>
>> - If you often query rows with the same gene_ref, consider using
>> CLUSTER to physically group those rows on disk. This way you can get
>> all rows with the same gene_ref in 1 seek instead of 2000. Clustered
>> tables also make Bitmap scan happy.
>
> In my opinion this is the one that will make the most difference. You
> will need to run:
>
> CLUSTER gene_prediction_view USING gene_prediction_view_gene_ref_key;
>
> after you insert significant amounts of data into the table. This
> re-orders the table according to the index, but new data is always
> written out of order, so after adding lots more data the table will
> need to be re-clustered again.
>
>> - Switch to a RAID10 (4 times the IOs per second, however zero gain
>> if you're single-threaded, but massive gain when concurrent)
>
> Greg Stark has a patch in the pipeline that will change this, for
> bitmap index scans, by using fadvise(), so a single thread can utilise
> multiple discs in a RAID array.
>
> Matthew
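For anyone following along, here's a minimal sketch of the statistics change and of inspecting most_common_vals. The column name gene_ref is inferred from the index name above; your output will vary:

    -- Raise the per-column statistics target to 100 (the value I settled
    -- on above) and re-analyze so the planner picks it up:
    ALTER TABLE gene_prediction_view ALTER COLUMN gene_ref SET STATISTICS 100;
    ANALYZE gene_prediction_view;

    -- See what was collected; an "= constant" query for a value missing
    -- from most_common_vals falls back to a generic estimate based on
    -- n_distinct:
    SELECT n_distinct, most_common_vals, most_common_freqs
    FROM pg_stats
    WHERE tablename = 'gene_prediction_view'
      AND attname = 'gene_ref';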
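Checking the plan and timing for any given constant is just an EXPLAIN ANALYZE away; 12345 here is a placeholder, not a real gene_ref from my data:

    -- Shows the chosen plan (e.g. a bitmap index scan) along with the
    -- actual execution time:
    EXPLAIN ANALYZE
    SELECT *
    FROM gene_prediction_view
    WHERE gene_ref = 12345;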
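And a rough, hypothetical sketch of the trigger-plus-ancillary-table pattern mentioned above - all names are invented, and concurrent inserts for the same gene_ref would contend on the count row, so treat this as a starting point rather than a finished implementation:

    -- Hypothetical summary table: one row per gene_ref holding its row
    -- count, so counting never has to touch the big table.
    CREATE TABLE gene_ref_counts (
        gene_ref integer PRIMARY KEY,
        n        bigint NOT NULL
    );

    CREATE OR REPLACE FUNCTION maintain_gene_ref_counts() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            UPDATE gene_ref_counts SET n = n + 1
             WHERE gene_ref = NEW.gene_ref;
            IF NOT FOUND THEN
                INSERT INTO gene_ref_counts (gene_ref, n)
                VALUES (NEW.gene_ref, 1);
            END IF;
        ELSIF TG_OP = 'DELETE' THEN
            UPDATE gene_ref_counts SET n = n - 1
             WHERE gene_ref = OLD.gene_ref;
        END IF;
        RETURN NULL;  -- return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER gene_ref_counts_trg
        AFTER INSERT OR DELETE ON gene_prediction_view
        FOR EACH ROW EXECUTE PROCEDURE maintain_gene_ref_counts();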