Re: estimating # of distinct values - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: estimating # of distinct values
Date
Msg-id 4D38AFE8.1040506@fuzzy.cz
In response to Re: estimating # of distinct values  (Csaba Nagy <ncslists@googlemail.com>)
List pgsql-hackers
On 20.1.2011 11:05, Csaba Nagy wrote:
> Hi Tomas,
> 
> On Wed, 2011-01-19 at 23:13 +0100, Tomas Vondra wrote:
>> No, the multi-column statistics do not require constant updating. There
>> are cases where a sampling is perfectly fine, although you may need a
>> bit larger sample. Generally if you can use a multi-dimensional
>> histogram, you don't need to scan the whole table.
> 
> In the cases where sampling is enough, you can apply the same idea to
> the updates: sample the changes, so that only every Nth change makes
> it to the estimator. If you can also tune N dynamically, growing it
> as the statistics stabilize and lowering it when you detect high
> variance, even better.
> 
> If the analyze process could be decoupled from the backends, with the
> data just passed over to be processed asynchronously, that could be a
> feasible way to keep the statistics always up to date when the
> bottleneck is IO and CPU power is in excess. If that then leads to
> better plans, the win could really exceed the overhead.
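The quoted suggestion can be sketched as follows. This is a hypothetical illustration, not the actual patch: all names are made up, and a toy exact estimator stands in for a real streaming sketch.

```python
class ToyEstimator:
    """Stand-in distinct-value estimator; a real implementation
    would use a streaming sketch rather than an exact set."""

    def __init__(self):
        self.values = set()

    def add(self, value):
        self.values.add(value)

    def ndistinct(self):
        return len(self.values)


class SampledFeed:
    """Forward only every Nth change to the estimator, so heavy
    write traffic does not turn every INSERT into estimator work.

    A fuller version could tune N dynamically: grow it once the
    estimate stabilizes, shrink it when high variance is detected.
    """

    def __init__(self, estimator, n):
        self.estimator = estimator
        self.n = n
        self.seen = 0

    def on_change(self, value):
        self.seen += 1
        if self.seen % self.n == 0:
            self.estimator.add(value)
```

With N = 3, only a third of the changes ever reach the estimator, at the cost of some lag and sampling error in the estimate.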

OK, this sounds interesting. I'm not sure how to do that, but it might be
a good solution. What about transactions, though? If the client inserts
data (which is sent asynchronously to update the estimator) and then
rolls back, is the estimator 'rolled back' too, or what happens?

This was exactly the reason why I initially wanted to collect all the
data at the backend (and send it to the estimator at commit time).
That idea was then replaced by keeping a local copy of the estimator
and merging it back into the original estimator at commit time.
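The merge-at-commit idea might look roughly like this. Again a hypothetical sketch, not the actual code: a toy exact set plays the estimator, where the real thing would be a mergeable sketch.

```python
class DistinctEstimator:
    """Toy mergeable distinct-value estimator (exact set); a real
    one would be a sketch whose merge is cheap and lossless."""

    def __init__(self):
        self.values = set()

    def add(self, value):
        self.values.add(value)

    def merge(self, other):
        # set union is the merge operation here; a sketch would
        # combine its internal state (e.g. slot-wise) instead
        self.values |= other.values

    def ndistinct(self):
        return len(self.values)


shared = DistinctEstimator()   # the original, shared estimator
local = DistinctEstimator()    # per-transaction private copy

for v in ("a", "b", "a"):      # inserts made inside the transaction
    local.add(v)

# COMMIT: merge the local state back into the shared estimator;
# a ROLLBACK would simply discard `local` and touch nothing shared
shared.merge(local)
```

The point of the design is that a rollback costs nothing: the shared estimator never sees uncommitted data, so there is nothing to undo.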

> If this analyze process (or several of them) could also just pull the
> data from the modified buffers in a cyclic way, so that the backends
> need to do nothing extra, then I don't see any performance
> disadvantage other than possible extra locking contention on the
> buffers and non-determinism of the actual time when a change makes it
> into the statistics. Then you just need more CPU power and higher
> memory bandwidth to pay for the accurate statistics.

Well, the possible locking contention sounds like quite a significant
problem to me :-(

The lag between an update and a change to the stats is not that big a
deal, I guess: we have the same behaviour with the rest of the stats
(updated by autovacuum every once in a while).

Tomas

