Re: estimating # of distinct values - Mailing list pgsql-hackers

From Robert Haas
Subject Re: estimating # of distinct values
Date
Msg-id AANLkTimLhW4nS-DrAM-=CvUnhqKqf=eXJFXt4j4kWH2a@mail.gmail.com
In response to Re: estimating # of distinct values  (Tomas Vondra <tv@fuzzy.cz>)
Responses Re: estimating # of distinct values  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Re: estimating # of distinct values  (Tomas Vondra <tv@fuzzy.cz>)
List pgsql-hackers
On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>> Regarding the crash scenario - if the commit fails, just throw away the
>>> local estimator copy, it's not needed. I'm not sure how to take care of
>>> the case when commit succeeds and the write of the merged estimator
>>> fails, but I think it might be possible to write the estimator to xlog
>>> or something like that. So it would be replayed during recovery etc. Or
>>> is it a stupid idea?
>>
>> It's not stupid, in the sense that that is what you'd need to do if
>> you want to avoid ever having to rescan the table.  But it is another
>> thing that I think is going to be way too expensive.
>
> Way too expensive? All you need to put into the logfile is a copy of the
> estimator, which is a few kBs. How is that 'way too expensive'?

At this point, this is all a matter of religion, right?  Neither of us
has a working implementation we've benchmarked.  But yes, I believe
you're going to find that implementing some kind of streaming
estimator is going to impose a...  <pulls number out of rear end> 6%
performance penalty, even after you've optimized the living daylights
out of it.  Now you might say... big deal, it improves my problem
queries by 100x.  OK, but if you could get the same benefit by doing
an occasional full table scan during off hours, you could have the
same performance with a *0%* performance penalty.  Even better, the
code changes would be confined to ANALYZE rather than spread out all
over the system, which has positive implications for robustness and
likelihood of commit.

I'm not trying to argue you out of working on this.  It's obviously
your time to spend, and if it works better than I think it will, great!
I'm merely offering you an opinion on what will probably happen if you
go this route - namely, it'll carry an unpalatable run-time penalty.
That opinion may be worth no more than what you paid for it, but there
you have it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

