Re: estimating # of distinct values - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: estimating # of distinct values
Date
Msg-id 4D3761F6.3060006@fuzzy.cz
In response to Re: estimating # of distinct values  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: estimating # of distinct values  (Nathan Boley <npboley@gmail.com>)
Re: estimating # of distinct values  (Robert Haas <robertmhaas@gmail.com>)
Re: estimating # of distinct values  (Csaba Nagy <ncslists@googlemail.com>)
List pgsql-hackers
On 19.1.2011 20:25, Robert Haas wrote:
> On Tue, Jan 18, 2011 at 5:16 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> Yes, I was aware of this problem (amount of memory consumed with lots of
>> updates), and I kind of hoped someone will come up with a reasonable
>> solution.
> 
> As far as I can see, periodically sampling some or all of the table is
> really the only thing that has a chance of working OK.  You could try
> to track changes only when they're not too big, but I think that's
> still going to have nasty performance consequences.

Yes, sampling the whole table is the only way to get reliable ndistinct
estimates. I'm not sure what you mean by 'tracking changes' - as I've
mentioned in the previous post, this might be solved by updating a local
copy, which requires constant space (a few kB, see the previous post).
Is that acceptable? I don't know - that's up to the user to decide
whether the price is worth paying.

>> So the algorithm would be something like this:
>>
>> 1. create copy for all distinct estimators influenced by the INSERT
>> 2. update the local copy
>> 3. commit and merge the local copies back into the original estimator
> 
> Well, maybe.  But DELETEs seem like a problem.  And it's still adding
> foreground work to every query, which I just have a hard time
> believing is ever going to work out

Yes, deletes are difficult to handle. My idea was to compute something
like ((tuples changed + tuples deleted) / tuples total), and indicate
that a rebuild of the estimator is needed once this coefficient grows
too large - e.g. by logging a message or something like that.
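
A rough sketch of what I mean - the struct and counter names below are
made up, in practice this would use the counters we already track in
the stats collector (n_tup_upd, n_tup_del, n_live_tup):

    #include <stdio.h>
    #include <stdbool.h>

    /* Hypothetical per-table counters - in reality something like the
     * stats collector values n_tup_upd, n_tup_del and n_live_tup. */
    typedef struct TableChangeStats
    {
        long    tuples_changed;     /* updated since the estimator was built */
        long    tuples_deleted;     /* deleted since the estimator was built */
        long    tuples_total;       /* (approximate) number of live tuples */
    } TableChangeStats;

    /* Flag the estimator as stale once the fraction of modified tuples
     * exceeds a threshold (e.g. 0.2, i.e. 20% of the table changed). */
    static bool
    estimator_needs_rebuild(const TableChangeStats *s, double threshold)
    {
        double  coeff;

        if (s->tuples_total <= 0)
            return false;

        coeff = (double) (s->tuples_changed + s->tuples_deleted)
                / (double) s->tuples_total;
        return (coeff > threshold);
    }

    int
    main(void)
    {
        TableChangeStats s = {15000, 8000, 100000};

        if (estimator_needs_rebuild(&s, 0.2))
            printf("NOTICE: ndistinct estimator is stale, rebuild it\n");
        return 0;
    }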

>> Regarding the crash scenario - if the commit fails, just throw away the
>> local estimator copy, it's not needed. I'm not sure how to take care of
>> the case when commit succeeds and the write of the merged estimator
>> fails, but I think it might be possible to write the estimator to xlog
>> or something like that. So it would be replayed during recovery etc. Or
>> is it a stupid idea?
> 
> It's not stupid, in the sense that that is what you'd need to do if
> you want to avoid ever having to rescan the table.  But it is another
> thing that I think is going to be way too expensive.

Way too expensive? All you need to put into the xlog is a copy of the
estimator, which is a few kB. How is that 'way too expensive'? Sure, it
might add up when the user does a lot of small changes in separate
transactions, that's true, but I guess we could minimize the amount of
data written to the xlog by writing a diff of the estimators instead of
a full copy.
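
Just to illustrate the diff idea - assuming the estimator state is
basically an array of counters (the structure below is made up, not the
actual estimator layout), only the entries the transaction touched need
to hit the xlog:

    #include <stdio.h>
    #include <stdint.h>

    #define NCOUNTERS 1024              /* made-up size, roughly a few kB */

    /* Made-up estimator state - just an array of counters here. */
    typedef struct Estimator
    {
        uint32_t    counters[NCOUNTERS];
    } Estimator;

    /* One delta record - which counter changed and its new value. Only
     * these records would be written to the xlog, not the whole state. */
    typedef struct EstimatorDelta
    {
        uint32_t    index;
        uint32_t    value;
    } EstimatorDelta;

    /* Compare pre-merge and post-merge snapshots, collect the delta. */
    static int
    estimator_diff(const Estimator *prev, const Estimator *curr,
                   EstimatorDelta *out, int max_out)
    {
        int     i, n = 0;

        for (i = 0; i < NCOUNTERS && n < max_out; i++)
        {
            if (prev->counters[i] != curr->counters[i])
            {
                out[n].index = (uint32_t) i;
                out[n].value = curr->counters[i];
                n++;
            }
        }
        return n;
    }

    int
    main(void)
    {
        Estimator       before = {{0}};
        Estimator       after = {{0}};
        EstimatorDelta  delta[NCOUNTERS];
        int             n;

        after.counters[7] = 3;          /* pretend the transaction bumped */
        after.counters[42] = 1;         /* two of the counters            */

        n = estimator_diff(&before, &after, delta, NCOUNTERS);
        printf("%d counters changed, %zu bytes instead of %zu\n",
               n, n * sizeof(EstimatorDelta), sizeof(Estimator));
        return 0;
    }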

>> Yes, as I've mentioned above, the multi-column stats are the reason why
>> I started working on this. And yes, it really should work like this:
>>
>> 1. user realizes there's something wrong as the plans are bad
>> 2. after analysis, the user realizes the reason is an imprecise
>>   estimate due to dependence between columns
>> 3. the user enables cross-column stats on the columns and checks
>>   if it actually solved the issue
>> 4. if the cost outweighs the gains, the user drops the stats
>>
>> Does that sound reasonable?
> 
> Yes.  The only caveat I'm trying to insert is that it's hard for me to
> see how the proposed approach could ever be cheap enough to be a win.

IMHO the question is not 'How expensive is it?' but rather 'How
expensive is it compared to the gains?' If the user gets much better
estimates and then a much better plan, this may be a perfectly
acceptable price.

> If we need some kind of more expensive kind of ANALYZE that scans the
> whole table, and it's optional, sure, why not?  But that's a one-time
> hit.  You run it, it sucks, it's over, and now your queries work.
> Odds are good you may never need to touch it again.  Now, if we can
> convince ourselves that multi-column stats are likely to require
> constant updating, then maybe there would be more to recommend the
> stream-based approach, although even then it seems dreadfully complex
> and expensive to me.  But I bet these things are pretty static.  If

No, the multi-column statistics do not require constant updating. There
are cases where sampling is perfectly fine, although you may need a
somewhat larger sample. Generally, if you can use a multi-dimensional
histogram, you don't need to scan the whole table.

But multi-dimensional histograms are not applicable in some cases -
especially the ZIP code failure case that was repeatedly discussed. So
for discrete data we need a different approach, and the solution I've
proposed derives the selectivity estimates from ndistinct estimates
(it's actually based on one of the papers listed on the wiki).
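
To make the ZIP code example concrete (all the numbers below are made
up), this shows why multiplying the per-column selectivities - i.e. the
independence assumption - underestimates so badly, and what a combined
ndistinct estimate gives instead:

    #include <stdio.h>

    int
    main(void)
    {
        double  ndistinct_city = 20000.0;
        double  ndistinct_zip = 40000.0;
        double  ndistinct_city_zip = 45000.0;   /* zip nearly implies city */
        double  ntuples = 10000000.0;           /* 10M rows */

        /* independence assumption: sel(city = ? AND zip = ?) */
        double  sel_indep = (1.0 / ndistinct_city) * (1.0 / ndistinct_zip);

        /* combined ndistinct estimate */
        double  sel_combined = 1.0 / ndistinct_city_zip;

        printf("independence      : %.2f rows expected\n", sel_indep * ntuples);
        printf("combined ndistinct: %.2f rows expected\n", sel_combined * ntuples);
        return 0;
    }

The independence estimate expects ~0.01 rows for a city + zip lookup,
the combined estimate ~222 rows - exactly the kind of difference that
flips the plan.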

> and expensive to me.  But I bet these things are pretty static.  If
> the city and state tend to imply the zip code when there are 10K rows
> in the table, they probably still will when there are 100K rows in the
> table.  If users with org_id = 1 have a higher karma score on average

OK, how will you measure the "implicativeness"?

We have discussed this in another thread, and there are a lot of
gotchas although it seems like a really simple problem. The solution
based on ndistinct estimates turned out to be a reasonable approach
that's worth a try.
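
Just to sketch one possible measure (an illustration only - the exact
formula is what needs discussing): if ndistinct(A,B) is close to
ndistinct(A), then knowing A (almost) determines B; if it approaches
ndistinct(A) * ndistinct(B), the columns behave as independent.

    #include <stdio.h>

    /* Crude "implicativeness" of A => B built from ndistinct estimates:
     * ndistinct(A) / ndistinct(A,B) is close to 1 when A (almost)
     * determines B, and drops towards 1/ndistinct(B) for independent
     * columns. The numbers below are made up. */
    static double
    implication_strength(double nd_a, double nd_ab)
    {
        return nd_a / nd_ab;
    }

    int
    main(void)
    {
        /* zip => city: nearly a functional dependence */
        printf("zip => city : %.3f\n", implication_strength(40000.0, 45000.0));

        /* two unrelated columns with ndistinct 40000 and 200 */
        printf("unrelated   : %.3f\n",
               implication_strength(40000.0, 40000.0 * 200.0));
        return 0;
    }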

regards
Tomas

