Re: pg_upgrade and statistics - Mailing list pgsql-hackers

From Ants Aasma
Subject Re: pg_upgrade and statistics
Date
Msg-id CA+CSw_tEYivSXMLqr7j0S298Mgg1FciLVfnSd=DgTEHt-UnDNQ@mail.gmail.com
Whole thread Raw
In response to Re: pg_upgrade and statistics  (Alvaro Herrera <alvherre@commandprompt.com>)
List pgsql-hackers
On Thu, Mar 15, 2012 at 8:48 PM, Alvaro Herrera <alvherre@commandprompt.com>
> What Peter proposed seems to me pretty reasonable, in the sense that it
> should be possible to come up with a function that creates some text
> representation of whatever is in pg_statistic, and another function to
> load that data into the new catalog(s).  There's no need to keep
> pg_statistic binary-compatible, or even continue to have only
> pg_statistic (IIRC Zoltan/Hans-Jurgen patch for cross-column stats adds
> a new catalog, pg_statistic2 or similar).  If the upgrade target release
> has room for more/improved stats, that's fine -- they'll be unused after
> loading the stats from the dump.  And the old stats can be
> reacommodated, if necessary.

I have been reading up on selectivity estimation research for the last
few days. I must say that I also think that having a text
representation as an intermediate won't create a huge maintenance
burden. The basic concepts that are there are pretty solid.
Conceptually MCV's and histograms continue to be essential even with
the more complex approaches. Trying to maintain binary compatibility
is probably a bad idea, as Tom noted with the array selectivity patch
- encoding of the information could be better. But given a textual
format it won't be too hard to just massage the data to the new
format. Making it possible to dump and load stats has the additional
bonus of enabling more experimentation with custom stats collectors.
One could easily prototype the stats collection with R, scipy, etc. Of
course the proof will be in the pudding.

Re, the patch, current posted WIP cross-col patch doesn't create a new
catalog,. It repurposes the stat slots mechanism to store multiple
dimensions. But I'll most likely rewrite it to use a separate catalog
because the storage requirements are rather different. I'll post a
proposal in the appropriate thread when I have decently clear idea how
this should work. One thing that seems clear is that multi-dimensional
histograms will want this mechanism even more, optimal histogram
construction is NP-hard in the multi-dimensional case and so people
will want to try different algorithms, or make different tradeoffs on
effort spent on constructing the histogram. Or even build one by hand.

Cheers,
Ants Aasma


pgsql-hackers by date:

Previous
From: Daniel Farina
Date:
Subject: Re: pg_terminate_backend for same-role
Next
From: Yeb Havinga
Date:
Subject: Re: [v9.2] Add GUC sepgsql.client_label