Re: Yet another abort-early plan disaster on 9.3 - Mailing list pgsql-performance

From Tomas Vondra
Subject Re: Yet another abort-early plan disaster on 9.3
Date
Msg-id 542F343E.40906@fuzzy.cz
Whole thread Raw
In response to Re: Yet another abort-early plan disaster on 9.3  (Peter Geoghegan <peter.geoghegan86@gmail.com>)
List pgsql-performance
On 3.10.2014 02:54, Peter Geoghegan wrote:
> On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>> fixed-size sample. If we're allowed to sample, say, 1% of the
>> table, we can get a MUCH more accurate n_distinct estimate using
>> multiple algorithms, of which HLL is one. While n_distinct will
>> still have some variance, it'll be over a much smaller range.
>
> I think that HyperLogLog, as a streaming algorithm, will always
> require that the entire set be streamed. This doesn't need to be a
> big deal - in the case of my abbreviated key patch, it appears to
> basically be free because of the fact that we were streaming
> everything anyway. It's a very cool algorithm, with fixed overhead
> and constant memory usage. It makes very useful guarantees around
> accuracy.

I think you're mixing two things here - estimating the number of
distinct values in a sample (which can be done very efficiently using
HLL) and estimating the number of distinct values in the whole table.
For that HLL is not usable, unless you process all the data.

Sadly HLL is rather incompatible with the usual estimators, because the
ones I'm aware of need to know the number of occurences for the distinct
values etc.

But couldn't we just piggyback this on autovacuum? One of the nice HLL
features is that it's additive - you can build "partial counters" for
ranges of blocks (say, a few MBs per range), and then merge them when
needed. By keeping the parts it's possible to rebuild it separately.

But maybe this idea is way too crazy ...

regards
Tomas


pgsql-performance by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Yet another abort-early plan disaster on 9.3
Next
From: Rodrigo Barboza
Date:
Subject: Re: auto vaccum is dying