Re: Stats for multi-column indexes - Mailing list pgsql-hackers

From Jim C. Nasby
Subject Re: Stats for multi-column indexes
Date
Msg-id 20070320190549.GH83523@nasby.net
Whole thread Raw
In response to Re: Stats for multi-column indexes  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Stats for multi-column indexes  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Mon, Mar 19, 2007 at 06:55:56PM -0700, Jeff Davis wrote:
> On Mon, 2007-03-19 at 21:24 -0400, Tom Lane wrote:
> > Jeff Davis <pgsql@j-davis.com> writes:
> > > We can already keep stats for a functional index. Is there a reason we
> > > can't keep stats for a multi-column index?
> > 
> > The questions that need to be answered are (1) what stats are you gonna
> > collect, and (2) exactly what are you going to do with them when you
> > have 'em?
> > 
> > All the previous discussions have stalled on the question of how to
> > avoid trying to collect stats about an exponentially large number of
> > column combinations; we've never even reached the question of what
> > stats we'd actually want given that a particular combination has been
> > determined to be interesting.  Perhaps that's a trivial question,
> > but it's been a mighty long time since I took statistics ...
> > 
> 
> I know we can't keep stats on every combination of columns. My initial
> idea would be to only keep stats about a multi-column index (and
> probably optional for those, too).
> 
> My thinking was that we could keep a histogram (and MCVs, etc.) of the
> non-scalar key in the multi-column index. That would provide the data
> the planner needs to answer a query like "WHERE a = 1 and b < 1000" if a
> and b are dependent and you have an index on (a,b).
<snip> 
> AndrewSN pointed out on IRC that keeping a histogram of non-scalar
> values is not as easy as I thought, because PostgreSQL doesn't allow
> arrays of composite types, among other problems.

I don't think the array problem is that big a deal, since PostgreSQL
doesn't enforce array dimensions at all. You can just make the arrays
for multi-column stats 2 dimensional, though handling indexes with
different data types among the columns would be a bit tricky... right
now the only choice I can think of would be to require that values could
be cast to and from text and just store text in the array. Though
obviously it'd be better to just allow arrays of composite types...

The other challenge is that you can't make all the same assumptions with
a multi-field histogram that you can with a single-field one. For
example, if this is our index:

a   b
-   -
1   1
1   2
...
1   1000
2   500
2   501
...
3   5000

The histogram would likely position the buckets such that 1,1000 and
2,500 would fall within one bucket, which means the planner has no idea
that b doesn't exceed 1000 when a is 1. I'm not sure how big of an issue
that is in reality, though, because the planner does know that the
bucket can only represent so many rows.

It might be worth coming up with a different means to store the
histogram for the multi-column case.

> Is this a worthwhile area of exploration?

ISTM it trips people up often enough to make it worth at least
exploring...
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


pgsql-hackers by date:

Previous
From: "Joshua D. Drake"
Date:
Subject: Re: Buildfarm feature request: some way to track/classify failures
Next
From: Andrew Dunstan
Date:
Subject: Re: Buildfarm feature request: some way to track/classify failures