Re: index v. seqscan for certain values - Mailing list pgsql-performance

From Jeremy Dunn
Subject Re: index v. seqscan for certain values
Date
Msg-id 000801c42165$68a67750$4f01a8c0@jeremydunn
Whole thread Raw
In response to Re: index v. seqscan for certain values  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: index v. seqscan for certain values
List pgsql-performance
> > When I just tried it again with a value of 300, analyze,
> then run the query, I get a *worse* result for an estimate.  I don't
understand
> > this.
>
> That's annoying.  How repeatable are these results --- if you
> do ANALYZE over again several times, how much does the row
> count estimate change each time?  (It should change somewhat,
> since ANALYZE is taking a random sample, but one would like
> to think not a whole lot.)  Is the variance more or less at
> the higher stats target?  Take a look at a few different CID
> values to get a sense of the accuracy, don't look at just one ...

Yes, it's repeatable.  I tried a bunch of times, and there are only
small variations in the stats for the higher stat targets.

> (Actually, you might find it more profitable to look at the
> pg_stats entry for the CID column rather than
> reverse-engineering the stats via ANALYZE.  Look at how well
> the most-common-values list and associated frequency numbers
> track reality.)

I checked the accuracy of the stats for various values, and there is a
wide variation.  I see some values where the estimate is 1.75x the
actual; and others where the estimate is .44x the actual.

> Also, can you think of any reason for the distribution of CID
> values to be nonuniform within the table?  For instance, do
> rows get inserted in order of increasing CID, or is there any
> clustering of rows with the same CID?

This is almost certainly the answer.  The data is initially inserted in
chunks for each CID, and later on there is a more normal distribution of
insert/update/deletes across all CIDs; and then again a new CID will
come with a large chunk of rows, etc.

Interestingly, I tried increasing the stat size for the CID column to
2000, analyzing, and checking the accuracy of the stats again.  Even
with this relatively high value, the accuracy of the stats is not that
close.   The value giving .44x previously nows gives an estimate .77x of
actual.  Another value which was at 1.38x of actual is now at .71x of
actual!

Then just for kicks I set the statistics size to 100,000 (!), analyzed,
and ran the query again.  For the same CID I still got an estimated row
count that is .71x the actual rows returned.  Why is this not better?  I
wonder how high I'd have to set the statistics collector to get really
good data, given the uneven data distribution of this table.  Is there
any other technique that works better to get good estimates, given
uneven distribution of values?

So I think this explains the inaccurate stats; and the solution as far
as I'm concerned is to increase the two params mentioned yesterday
(effective_cache_size & random_page_cost).

Thanks again for the help!
- Jeremy


pgsql-performance by date:

Previous
From: Rajesh Kumar Mallah
Date:
Subject: Re: Deleting certain duplicates
Next
From: Tom Lane
Date:
Subject: Re: index v. seqscan for certain values