On 7 March 2016 at 13:23, Victor Yegorov <vyegorov@gmail.com> wrote:
> Your `sc_id` and `scdate` columns are correlated.
Actually not necessarily, although in the majority case that's mostly true.
> Planner has no such knowledge and assumes columns being independent. Your
> `scdate` predicate is
> estimate to return 14394 rows (based on the EXPLAIN of your first post). I
> think, that this corresponds to
> a quite small portion of your table, less than 1% (based on `Rows Removed by
> Filter: 4068865` from the
> same EXPLAIN). Under uniform distribution, these 14394 rows can be anywhere
> in the table.
Oh I see! To make sure I understand what you're saying: given a fully
random distribution of scdate across the table, searching the whole
table for the first instance of scdate=(20160219|20160220|20160221)
will be fastest, because using the scdate index could end up doing
lots of random-accesses across the whole table to get sc_id values,
whereas assuming a random distribution of scdate values I would only
expect to have to scan (sequentially) 3% of the table before I find
one of the scdate values that match, yes?
> Right now (9.5 and earlier versions) I do not know of any options that would
> not require fixing your queries.
Well I can cope with using (sc_id,scdate) index (which I guess wins
purely by being index-only and smaller than the table), but it makes
me unhappy.
I think I realised why the planner won't use (scdate,sc_id): the
multicolumn index page
(http://www.postgresql.org/docs/current/static/indexes-multicolumn.html)
suggests that:
The exact rule is that equality constraints on leading columns,
plus any inequality constraints on the first column that does not
have an equality constraint, will be used to limit the portion of
the index that is scanned.
So since the first column is an inequality constraint, PG won't use
the second column from index(scdate,sc_id) to retrieve the MIN()
value. This explains why it happily uses the index for the
LEAST(MIN(),MIN(),MIN()) version (since each one is an equality
condition); it seems like that rule is just too restrictive, because a
range constraint on the first key, when used in conjunction with an
index-only scan, is almost always going to win, even if the second
constraint matches all of the rows and even if the range constraint
returns all the values in the table (unless the index is larger than
the table itself, I suppose).
I might suggest that perhaps the rule should be relaxed so that an
inequality constraint does not stop the subsequent columns being used
in an index-only scan.
That assumes that I've not completely misunderstood, of course :)
Geoff