Re: estimating # of distinct values - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: estimating # of distinct values
Date
Msg-id 4D38AC55.70402@fuzzy.cz
In response to Re: estimating # of distinct values  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 20.1.2011 03:36, Robert Haas wrote:
> On Wed, Jan 19, 2011 at 5:13 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>>>> Regarding the crash scenario - if the commit fails, just throw away the
>>>> local estimator copy, it's not needed. I'm not sure how to take care of
>>>> the case when commit succeeds and the write of the merged estimator
>>>> fails, but I think it might be possible to write the estimator to xlog
>>>> or something like that. So it would be replayed during recovery etc. Or
>>>> is it a stupid idea?
>>>
>>> It's not stupid, in the sense that that is what you'd need to do if
>>> you want to avoid ever having to rescan the table.  But it is another
>>> thing that I think is going to be way too expensive.
>>
>> Way too expensive? All you need to put into the logfile is a copy of the
>> estimator, which is a few kBs. How is that 'way too expensive'?
> 
> At this point, this is all a matter of religion, right?  Neither of us
> has a working implementation we've benchmarked.  But yes, I believe
> you're going to find that implementing some kind of streaming
> estimator is going to impose a...  <pulls number out of rear end> 6%
> performance penalty, even after you've optimized the living daylights
> out of it.  Now you might say... big deal, it improves my problem
> queries by 100x.  OK, but if you could get the same benefit by doing
> an occasional full table scan during off hours, you could have the
> same performance with a *0%* performance penalty.  Even better, the
> code changes would be confined to ANALYZE rather than spread out all
> over the system, which has positive implications for robustness and
> likelihood of commit.

Good point. What I was trying to do was to continuously update the
estimator with new data - that was the whole idea behind collecting the
new values (which, as you've pointed out, might lead to memory problems
etc.) and updating a local copy of the estimator (which I still think is
a good idea).
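To make the "local copy, merged at commit" idea concrete, here is a minimal sketch (not PostgreSQL code, just an illustration) using a KMV ("k minimum values") sketch as the distinct-value estimator. The class name and parameters are mine; the point is only that such an estimator supports both incremental updates and a cheap merge, which is what the per-transaction copy relies on:

```python
# Sketch of a mergeable streaming distinct-value estimator (KMV).
# A transaction updates its local copy; at commit the local copy is
# merged into the shared one. Names here are illustrative only.
import hashlib

class KMV:
    def __init__(self, k=256):
        self.k = k
        self.mins = set()          # the k smallest normalized hashes seen

    def _hash(self, value):
        h = hashlib.sha1(repr(value).encode()).digest()
        return int.from_bytes(h[:8], "big") / 2**64   # uniform in [0, 1)

    def add(self, value):
        self.mins.add(self._hash(value))
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def merge(self, other):
        # Union the two sketches, keep only the k smallest hashes -
        # this is how a local per-transaction copy folds into the
        # shared estimator at commit time.
        self.mins |= other.mins
        while len(self.mins) > self.k:
            self.mins.remove(max(self.mins))

    def estimate(self):
        if len(self.mins) < self.k:
            return len(self.mins)  # saw fewer than k distinct values
        return int((self.k - 1) / max(self.mins))

global_est, local_est = KMV(), KMV()
for i in range(10000):
    local_est.add(i % 1000)        # 1000 distinct values inserted
global_est.merge(local_est)        # "commit": fold the local copy in
print(global_est.estimate())       # roughly 1000
```

The sketch is a few kB at most (k hash values), which is also why writing it to the xlog on commit would be cheap in terms of volume; the contested question in this thread is the per-row update cost, not the size.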

But this might be another option - let the user decide whether he wants
to continuously update the estimates (and pay the price) or to do it
during off hours (and pay almost nothing). That sounds like a very good
solution to me.

> I'm not trying to argue you out of working on this.  It's obviously
> your time to spend, and if works better than I think it will, great!
> I'm merely offering you an opinion on what will probably happen if you
> go this route - namely, it'll carry an unpalatable run-time penalty.
> That opinion may be worth no more than what you paid for it, but there
> you have it.

Yes, and I appreciate all the feedback. But I still believe this can be
done so that users who don't need the feature don't pay for it.

regards
Tomas

