Thread: Cross column statistics

Cross column statistics

From
Greg Stark
Date:

Just brain storming a bit here. It seems to me there are two approaches for
cross-column statistics within a single table:

A) Treat an index on <a,b,c> the same way Postgres treats an expression index  on ROW(a,b,c). That is, gather full
statisticson the distribution of that  ntuple of columns. I think this would be the easiest option for the  analyzer.
But:   a) The optimizer would then need to do work to detect when all columns are  present in the constraints and
deducethe ntuple to look for in the  statistics.
 
  b) It would only help if all the columns are used. I'm not sure how easy it  would be to generalize this to queries
using<a,b> or worse, <b,c>.
 
  c) It would only work if you create an index on the set of columns being  queried. Often people have things like 
  SELECT * FROM tab WHERE indexed_col = ? AND deleted_flag IS false
  where deleted_flag *isn't* indexed or is a where clause on a partial index.

B) gather a full matrix of the level of "correlation" between each column and  each other column. If this were a single
floatingpoint number per pair  then it might be feasible. It would still obviously be n^2 in the number of  columns
though,so there would have to be some way to limit on how many  columns would be analyzed this way.
 
  The problem is that's it's *very* unclear how to gather this information  using a sample in any remotely efficient
manner.It's not even clear what  this number would be measuring.
 
  It's not actually "correlation" that Postgres usually needs. It's "How many  distinct values of b do we expect to
findgiven a=a_0". Or rather how many  do we expect to find relative to how many we would normally expect to find  if
thecolumns were independent.
 

-- 
greg



Re: Cross column statistics

From
Christopher Kings-Lynne
Date:
> B) gather a full matrix of the level of "correlation" between each column and
>    each other column. If this were a single floating point number per pair
>    then it might be feasible. It would still obviously be n^2 in the number of
>    columns though, so there would have to be some way to limit on how many
>    columns would be analyzed this way.

Use foreign keys to just record those cross-correlations.

Chris


Re: Cross column statistics

From
Greg Stark
Date:
Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

> > B) gather a full matrix of the level of "correlation" between each column and
> >    each other column. If this were a single floating point number per pair
> >    then it might be feasible. It would still obviously be n^2 in the number of
> >    columns though, so there would have to be some way to limit on how many
> >    columns would be analyzed this way.
> 
> Use foreign keys to just record those cross-correlations.

My email was about cross-column intra-table "correlations". inter-table
"correlations" are a whole other ball of wax.

-- 
greg