Thread: Stats for multi-column indexes

Stats for multi-column indexes

From
Jeff Davis
Date:
I know the idea has come up a few times to do cross-column statistics to
improve plans when the data distributions are dependent. I found a
couple references in the archives:

http://archives.postgresql.org/pgsql-hackers/2006-09/msg02118.php
http://archives.postgresql.org/pgsql-hackers/2006-08/msg00924.php

We can already keep stats for a functional index. Is there a reason we
can't keep stats for a multi-column index?

Out of the two situations outlined by Jim Nasby here:

http://archives.postgresql.org/pgsql-hackers/2006-08/msg00948.php

It would not help with joins, but it would help with columns that are
referred to as a group (e.g. a=1 AND b<20).

Regards,Jeff Davis



Re: Stats for multi-column indexes

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> We can already keep stats for a functional index. Is there a reason we
> can't keep stats for a multi-column index?

The questions that need to be answered are (1) what stats are you gonna
collect, and (2) exactly what are you going to do with them when you
have 'em?

All the previous discussions have stalled on the question of how to
avoid trying to collect stats about an exponentially large number of
column combinations; we've never even reached the question of what
stats we'd actually want given that a particular combination has been
determined to be interesting.  Perhaps that's a trivial question,
but it's been a mighty long time since I took statistics ...
        regards, tom lane


Re: Stats for multi-column indexes

From
Jeff Davis
Date:
On Mon, 2007-03-19 at 21:24 -0400, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > We can already keep stats for a functional index. Is there a reason we
> > can't keep stats for a multi-column index?
> 
> The questions that need to be answered are (1) what stats are you gonna
> collect, and (2) exactly what are you going to do with them when you
> have 'em?
> 
> All the previous discussions have stalled on the question of how to
> avoid trying to collect stats about an exponentially large number of
> column combinations; we've never even reached the question of what
> stats we'd actually want given that a particular combination has been
> determined to be interesting.  Perhaps that's a trivial question,
> but it's been a mighty long time since I took statistics ...
> 

I know we can't keep stats on every combination of columns. My initial
idea would be to only keep stats about a multi-column index (and
probably optional for those, too).

My thinking was that we could keep a histogram (and MCVs, etc.) of the
non-scalar key in the multi-column index. That would provide the data
the planner needs to answer a query like "WHERE a = 1 and b < 1000" if a
and b are dependent and you have an index on (a,b).

It seemed within reach to me initially because I could use a functional
index (in which the function turns multiple values into a comparable
scalar) and postgresql would index that and keep stats. And when it has
those stats, it makes the correct plan. Of course, I have to litter the
SQL with unnecessary function calls (so that it can use the functional
index), which makes this undesirable.

AndrewSN pointed out on IRC that keeping a histogram of non-scalar
values is not as easy as I thought, because PostgreSQL doesn't allow
arrays of composite types, among other problems.

Is this a worthwhile area of exploration?

Regards,Jeff Davis



Re: Stats for multi-column indexes

From
Mark Kirkwood
Date:
Jeff Davis wrote:
>
> I know we can't keep stats on every combination of columns. My initial
> idea would be to only keep stats about a multi-column index (and
> probably optional for those, too).
> 

Maybe you would want to keep single column indexes too, so that (more) 
accurate estimates for bitmap-and type plans are possible.

Cheers

Mark


Re: Stats for multi-column indexes

From
"Simon Riggs"
Date:
On Tue, 2007-03-20 at 14:14 +1200, Mark Kirkwood wrote:
> Jeff Davis wrote:
> >
> > I know we can't keep stats on every combination of columns. My initial
> > idea would be to only keep stats about a multi-column index (and
> > probably optional for those, too).
> > 
> 
> Maybe you would want to keep single column indexes too, so that (more) 
> accurate estimates for bitmap-and type plans are possible.

We should allow the DBA to specify which groups of cols to keep
statistics on, if there is no index on that group.

That solves the combinatorial explosion problem.


--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Stats for multi-column indexes

From
Richard Huxton
Date:
Simon Riggs wrote:
> On Tue, 2007-03-20 at 14:14 +1200, Mark Kirkwood wrote:
>> Jeff Davis wrote:
>>> I know we can't keep stats on every combination of columns. My initial
>>> idea would be to only keep stats about a multi-column index (and
>>> probably optional for those, too).
>>>
>> Maybe you would want to keep single column indexes too, so that (more) 
>> accurate estimates for bitmap-and type plans are possible.
> 
> We should allow the DBA to specify which groups of cols to keep
> statistics on, if there is no index on that group.
> 
> That solves the combinatorial explosion problem.

This is one hint I think everyone can agree on. Being able to say that 
values in different columns are related just gives the planner more 
information to work with.

--   Richard Huxton  Archonet Ltd


Re: Stats for multi-column indexes

From
Alvaro Herrera
Date:
Richard Huxton wrote:
> Simon Riggs wrote:
> >On Tue, 2007-03-20 at 14:14 +1200, Mark Kirkwood wrote:
> >>Jeff Davis wrote:
> >>>I know we can't keep stats on every combination of columns. My initial
> >>>idea would be to only keep stats about a multi-column index (and
> >>>probably optional for those, too).
> >>>
> >>Maybe you would want to keep single column indexes too, so that (more) 
> >>accurate estimates for bitmap-and type plans are possible.
> >
> >We should allow the DBA to specify which groups of cols to keep
> >statistics on, if there is no index on that group.
> >
> >That solves the combinatorial explosion problem.
> 
> This is one hint I think everyone can agree on. Being able to say that 
> values in different columns are related just gives the planner more 
> information to work with.

It was also suggested that column pairs in FK relationship could be
automatically enabled.  So you don't need to specify those manually.

Now, the hard question is deciding what to keep track of.  I don't think
MCV makes much sense, because what's the MCV of two columns?  Some sort
of correlation index would seem to make certain sense.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Stats for multi-column indexes

From
Peter Eisentraut
Date:
Alvaro Herrera wrote:
> Now, the hard question is deciding what to keep track of.  I don't
> think MCV makes much sense, because what's the MCV of two columns? 

The combination that occurs most often.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/


Re: Stats for multi-column indexes

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> It was also suggested that column pairs in FK relationship could be
> automatically enabled.  So you don't need to specify those manually.

Actually, I think you don't particularly need stats for that in most
cases --- if the planner simply took note that the FK relationship
exists, it would know that each row of the FK side joins to exactly
one row of the PK side, which in typical cases is sufficient.
        regards, tom lane


Re: Stats for multi-column indexes

From
Josh Berkus
Date:
Tom,

> Actually, I think you don't particularly need stats for that in most
> cases --- if the planner simply took note that the FK relationship
> exists, it would know that each row of the FK side joins to exactly
> one row of the PK side, which in typical cases is sufficient.

Is it?  What about the other direction?  Currently, doesn't the planner 
assume that the rowcount relationship is 1 to ( child total rows / 
parent total rows) ?  That's ok for tables with relatively even 
distribution, but not for skewed ones.

--Josh Berkus


Re: Stats for multi-column indexes

From
Jeff Davis
Date:
On Tue, 2007-03-20 at 09:03 +0000, Simon Riggs wrote:
> We should allow the DBA to specify which groups of cols to keep
> statistics on, if there is no index on that group.
> 
> That solves the combinatorial explosion problem.
> 

I think it would be a good first step if we could just keep stats on
multiple columns in the same table. If we can do more than that, great.

We could probably keep stats on multiple columns across different
tables, but I don't know how those statistics should be used. Using
statistics to estimate joins seems like a tricky problem. Maybe it's
already solved with known algorithms?

Regards,Jeff Davis



Re: Stats for multi-column indexes

From
Jeff Davis
Date:
On Tue, 2007-03-20 at 18:12 +0100, Josh Berkus wrote:
> Tom,
> 
> > Actually, I think you don't particularly need stats for that in most
> > cases --- if the planner simply took note that the FK relationship
> > exists, it would know that each row of the FK side joins to exactly
> > one row of the PK side, which in typical cases is sufficient.
> 
> Is it?  What about the other direction?  Currently, doesn't the planner 
> assume that the rowcount relationship is 1 to ( child total rows / 
> parent total rows) ?  That's ok for tables with relatively even 
> distribution, but not for skewed ones.
> 

In theory, the PK constrains the available values of the FK, but doesn't
provide any additional information about the relationship between the
columns. 

However, in practice there is limited space to store MCVs and limited
accuracy to n_distinct. So there may be a reason to store more
information, but I don't know what we'd store. Do we have reports of bad
estimates by the planner in this situation?

Regards,Jeff Davis



Re: Stats for multi-column indexes

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> Actually, I think you don't particularly need stats for that in most
>> cases --- if the planner simply took note that the FK relationship
>> exists, it would know that each row of the FK side joins to exactly
>> one row of the PK side, which in typical cases is sufficient.

> Is it?  What about the other direction?

I recall that we had decided at the Greenplum meeting last year that
we could use a better heuristic if we noted that a join was being done
on an FK-and-PK combination, but I don't recall the details right at
the moment.  Did anyone take notes?
        regards, tom lane


Re: Stats for multi-column indexes

From
"Jim C. Nasby"
Date:
On Mon, Mar 19, 2007 at 06:55:56PM -0700, Jeff Davis wrote:
> On Mon, 2007-03-19 at 21:24 -0400, Tom Lane wrote:
> > Jeff Davis <pgsql@j-davis.com> writes:
> > > We can already keep stats for a functional index. Is there a reason we
> > > can't keep stats for a multi-column index?
> > 
> > The questions that need to be answered are (1) what stats are you gonna
> > collect, and (2) exactly what are you going to do with them when you
> > have 'em?
> > 
> > All the previous discussions have stalled on the question of how to
> > avoid trying to collect stats about an exponentially large number of
> > column combinations; we've never even reached the question of what
> > stats we'd actually want given that a particular combination has been
> > determined to be interesting.  Perhaps that's a trivial question,
> > but it's been a mighty long time since I took statistics ...
> > 
> 
> I know we can't keep stats on every combination of columns. My initial
> idea would be to only keep stats about a multi-column index (and
> probably optional for those, too).
> 
> My thinking was that we could keep a histogram (and MCVs, etc.) of the
> non-scalar key in the multi-column index. That would provide the data
> the planner needs to answer a query like "WHERE a = 1 and b < 1000" if a
> and b are dependent and you have an index on (a,b).
<snip> 
> AndrewSN pointed out on IRC that keeping a histogram of non-scalar
> values is not as easy as I thought, because PostgreSQL doesn't allow
> arrays of composite types, among other problems.

I don't think the array problem is that big a deal, since PostgreSQL
doesn't enforce array dimensions at all. You can just make the arrays
for multi-column stats 2 dimensional, though handling indexes with
different data types among the columns would be a bit tricky... right
now the only choice I can think of would be to require that values could
be cast to and from text and just store text in the array. Though
obviously it'd be better to just allow arrays of composite types...

The other challenge is that you can't make all the same assumptions with
a multi-field histogram that you can with a single-field one. For
example, if this is our index:

a   b
-   -
1   1
1   2
...
1   1000
2   500
2   501
...
3   5000

The histogram would likely position the buckets such that 1,1000 and
2,500 would fall within one bucket, which means the planner has no idea
that b doesn't exceed 1000 when a is 1. I'm not sure how big of an issue
that is in reality, though, because the planner does know that the
bucket can only represent so many rows.

It might be worth coming up with a different means to store the
histogram for the multi-column case.

> Is this a worthwhile area of exploration?

ISTM it trips people up often enough to make it worth at least
exploring...
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Stats for multi-column indexes

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> It might be worth coming up with a different means to store the
> histogram for the multi-column case.

A separate array for each column involved seems a whole lot less
fragile than pretending we can handle mixed-type arrays.

We probably need a different catalog anyway, or at least a reimagining
of pg_statistic, since it doesn't hold more than one value of staattnum
per row.
        regards, tom lane


Re: Stats for multi-column indexes

From
Csaba Nagy
Date:
On Tue, 2007-03-20 at 18:12, Josh Berkus wrote:
> Tom,
> 
> > Actually, I think you don't particularly need stats for that in most
> > cases --- if the planner simply took note that the FK relationship
> > exists, it would know that each row of the FK side joins to exactly
> > one row of the PK side, which in typical cases is sufficient.
> 
> Is it?  What about the other direction?  Currently, doesn't the planner 
> assume that the rowcount relationship is 1 to ( child total rows / 
> parent total rows) ?  That's ok for tables with relatively even 
> distribution, but not for skewed ones.

Wouldn't that be improved if the MCVs/histogram of the FK column are
taken into account ? Considering that the FK part is unique, the
skewness in the relationship is completely determined by the FK parts
histogram. That would give at least a lower/upper bound and MCVs to the
relationship.

Cheers,
Csaba.



Re: Stats for multi-column indexes

From
Csaba Nagy
Date:
This should read:

> Considering that the FK part is unique, the                    ^^PK^^
> skewness in the relationship is completely determined by the FK part's
> histogram. That would give at least a lower/upper bound and MCVs to the
> relationship.

Cheers,
Csaba.