I wrote:
> (Note to hackers: Ole sent me a 1000-row test case off list.)
>> oletest=> explain select * from av_parts where partnumber = '123456';
>> NOTICE: QUERY PLAN:
>>
>> Index Scan using av_parts_partnumber_index on av_parts (cost=2.04 rows=1
>> width=124)
>>
>> EXPLAIN
>> oletest=> explain select * from av_parts where nsn = '123456';
>> NOTICE: QUERY PLAN:
>>
>> Seq Scan on av_parts (cost=48.00 rows=995 width=124)
> It looks like the highly skewed distribution of nsn values (what you
> sent me had 997 '' entries, only 3 non-empty strings) is confusing the
> selectivity estimation code somehow, such that the system thinks that
> the query is going to match most of the rows. Notice it is estimating
> 995 returned rows for the nsn select! Under these circumstances it will
> prefer a sequential scan, since the more-expensive-per-tuple index scan
> doesn't look like it will be able to avoid reading most of the table.
> That logic is OK, it's the 0.995 selectivity estimate that's wrong...
It turns out that the selectivity estimate for an "=" comparison is just
the attdisbursion statistic calculated by VACUUM ANALYZE, which can be
roughly defined as the frequency of the most common value in the column.
(I took statistics too long ago to recall the exact definition.)
Anyway, given that the test data Ole sent me contains nearly all ''
entries, I'd say that the 0.995 value is about right for disbursion.
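
To make that rough definition concrete, here is a toy C sketch (not
backend code; the per-value counts are just Ole's sample, with the three
non-empty nsn values assumed distinct) that computes the most common
value's share of the column:

#include <stdio.h>

/*
 * Toy sketch, not the backend's estimator: "disbursion" is roughly the
 * most common value's share of the column.  Counts are from Ole's
 * 1000-row sample: 997 empty nsn strings plus 3 non-empty ones (assumed
 * distinct here).  The real formula is a bit more elaborate, which is
 * presumably why VACUUM came up with 0.995 rather than exactly 997/1000.
 */
int
main(void)
{
    int     counts[] = {997, 1, 1, 1};  /* rows per distinct nsn value */
    int     nvalues = sizeof(counts) / sizeof(counts[0]);
    int     total = 0;
    int     most_common = 0;
    int     i;

    for (i = 0; i < nvalues; i++)
    {
        total += counts[i];
        if (counts[i] > most_common)
            most_common = counts[i];
    }

    /* prints "rough disbursion = 0.997" */
    printf("rough disbursion = %.3f\n", (double) most_common / total);
    return 0;
}
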
Indeed, if one were to do a "select * from av_parts where nsn = ''",
then a sequential scan would be the most efficient way to execute it.
The system has no clue that that's not a query you'd be likely to run
very often.
By using this estimate, the system is effectively assuming that the
frequency of occurrence of values in the table corresponds to the
probability that you will be searching for each particular value.
So, the selectivity that a search for the most common value would
have is a reasonable estimate for the selectivity of a search for any
value. That's a bogus assumption in this case --- but it's hard to
justify making any other assumption in general.
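
To put numbers on it: with 1000 rows and disbursion 0.995, the planner
guesses about 0.995 * 1000 = 995 matching rows for any "nsn = something"
query (just what the EXPLAIN above shows), even though a search for one
of the three non-empty values would really match only a row or so.
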
My inclination is to hack up eqsel() to never return a selectivity
estimate larger than, say, 0.5, even when the measured disbursion
is more. I am not sure that this is a good idea, however. Comments?
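
For concreteness, the sort of thing I mean (a schematic sketch only,
not the real eqsel() code or its call signature; the 0.5 ceiling is just
the straw-man number above):

#include <stdio.h>

/*
 * Schematic illustration of the proposed clamp; NOT the actual eqsel()
 * code or call signature.  The idea is simply: use the measured
 * disbursion as the "=" selectivity, but never report more than some
 * ceiling.
 */
#define MAX_EQ_SELECTIVITY 0.5  /* the "say, 0.5" cap; exact value debatable */

static double
clamped_eq_selectivity(double disbursion)
{
    if (disbursion > MAX_EQ_SELECTIVITY)
        return MAX_EQ_SELECTIVITY;
    return disbursion;
}

int
main(void)
{
    /*
     * Ole's case: a measured disbursion of 0.995 would be reported as
     * 0.5, i.e. a guess of about 500 rows out of 1000 rather than 995.
     */
    printf("clamped selectivity = %g\n", clamped_eq_selectivity(0.995));
    return 0;
}
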
regards, tom lane