Re: index v. seqscan for certain values - Mailing list pgsql-performance
From | Jeremy Dunn |
---|---|
Subject | Re: index v. seqscan for certain values |
Date | |
Msg-id | 000801c42165$68a67750$4f01a8c0@jeremydunn Whole thread Raw |
In response to | Re: index v. seqscan for certain values (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: index v. seqscan for certain values
|
List | pgsql-performance |
> > When I just tried it again with a value of 300, analyze, > then run the query, I get a *worse* result for an estimate. I don't understand > > this. > > That's annoying. How repeatable are these results --- if you > do ANALYZE over again several times, how much does the row > count estimate change each time? (It should change somewhat, > since ANALYZE is taking a random sample, but one would like > to think not a whole lot.) Is the variance more or less at > the higher stats target? Take a look at a few different CID > values to get a sense of the accuracy, don't look at just one ... Yes, it's repeatable. I tried a bunch of times, and there are only small variations in the stats for the higher stat targets. > (Actually, you might find it more profitable to look at the > pg_stats entry for the CID column rather than > reverse-engineering the stats via ANALYZE. Look at how well > the most-common-values list and associated frequency numbers > track reality.) I checked the accuracy of the stats for various values, and there is a wide variation. I see some values where the estimate is 1.75x the actual; and others where the estimate is .44x the actual. > Also, can you think of any reason for the distribution of CID > values to be nonuniform within the table? For instance, do > rows get inserted in order of increasing CID, or is there any > clustering of rows with the same CID? This is almost certainly the answer. The data is initially inserted in chunks for each CID, and later on there is a more normal distribution of insert/update/deletes across all CIDs; and then again a new CID will come with a large chunk of rows, etc. Interestingly, I tried increasing the stat size for the CID column to 2000, analyzing, and checking the accuracy of the stats again. Even with this relatively high value, the accuracy of the stats is not that close. The value giving .44x previously nows gives an estimate .77x of actual. Another value which was at 1.38x of actual is now at .71x of actual! Then just for kicks I set the statistics size to 100,000 (!), analyzed, and ran the query again. For the same CID I still got an estimated row count that is .71x the actual rows returned. Why is this not better? I wonder how high I'd have to set the statistics collector to get really good data, given the uneven data distribution of this table. Is there any other technique that works better to get good estimates, given uneven distribution of values? So I think this explains the inaccurate stats; and the solution as far as I'm concerned is to increase the two params mentioned yesterday (effective_cache_size & random_page_cost). Thanks again for the help! - Jeremy
pgsql-performance by date: