Ben <bench@silentmedia.com> writes:
> music=# explain ... where ... playlist.stream=1
> -> Seq Scan on playlist (cost=0.00..300.81 rows=2321 width=20)
> music=# explain ... where ... playlist.stream=2
> -> Seq Scan on playlist (cost=0.00..300.81 rows=205 width=20)
I am betting that 1 is the most common value in playlist.stream, or at
least is being chosen as the most common value by VACUUM ANALYZE's
not-totally-accurate estimation process. The 2321 row count estimate
then falls out of the stored statistic for the most common value's
frequency. In the second case, the estimator knows that 2 is *not* the
most common value, but it has absolutely no statistical basis on which
to guess what the frequency really is. I think it uses 1/10th of the
most common frequency for anything that's not the most common value
(look in utils/adt/selfuncs.c to be sure). There's probably also some
contribution from the "playlist.played is null" clause, else the row
count estimate would be exactly 1/10th as much (2321/10 would be about
232, not 205). However, I don't believe the estimator currently makes
any serious effort to gauge the selectivity of IS NULL, which is a
shame because that would critically
affect the results here. (You did say some thousands of rows matching
the stream=N clause, but only a few matching IS NULL, right?)
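If you want to check, something along these lines (just a sketch, using
the table and column names visible in your EXPLAIN output) would show
the real distribution, which you can then compare against the planner's
estimates:

    SELECT stream, count(*) FROM playlist GROUP BY stream;
    SELECT count(*) FROM playlist WHERE played IS NULL;
    SELECT count(*) FROM playlist WHERE stream = 2 AND played IS NULL;

The gap between those actual counts and the estimated row counts in the
EXPLAIN output is a direct measure of how far off the stored statistics
are.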
Given the fundamental difference in this initial row count estimate,
the large difference in the subsequent join plan structure is not too
surprising.
In short: another demonstration of the limitations of our current
statistics about data frequencies.
BTW, you didn't actually say which plan was faster. Since the second
one was closer to the true statistic (only a few rows returned from
playlist), I'm hoping it was faster...
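A quick way to find out (only a sketch; substitute your actual query
text for the elided parts) is to time each variant from the shell:

    time psql music -c "SELECT ... WHERE ... playlist.stream = 1 AND playlist.played IS NULL"
    time psql music -c "SELECT ... WHERE ... playlist.stream = 2 AND playlist.played IS NULL"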
regards, tom lane