Re: Yet another abort-early plan disaster on 9.3 - Mailing list pgsql-performance

From Josh Berkus
Subject Re: Yet another abort-early plan disaster on 9.3
Date
Msg-id 542DADEB.2080103@agliodbs.com
Whole thread Raw
In response to Yet another abort-early plan disaster on 9.3  (Josh Berkus <josh@agliodbs.com>)
Responses Re: Yet another abort-early plan disaster on 9.3  (Peter Geoghegan <peter.geoghegan86@gmail.com>)
Re: Yet another abort-early plan disaster on 9.3  (Jeff Janes <jeff.janes@gmail.com>)
Re: Yet another abort-early plan disaster on 9.3  (Greg Stark <stark@mit.edu>)
List pgsql-performance
On 10/02/2014 02:30 AM, Peter Geoghegan wrote:
> On Thu, Oct 2, 2014 at 1:19 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> Having read papers on it, I believe the problem is intractable. Coding
>> is not the issue. To anyone: please prove me wrong, in detail, with
>> references so it can be coded.
>
> I think it might be close to intractable if you're determined to use a
> sampling model. HyperLogLog looks very interesting for n_distinct
> estimation, though. My abbreviated key patch estimates the cardinality
> of abbreviated keys (and original strings that are to be sorted) with
> high precision and fixed overhead.  Maybe we can figure out a way to
> do opportunistic streaming of HLL. Believe it or not, the way I use
> HLL for estimating cardinality is virtually free. Hashing is really
> cheap when the CPU is bottlenecked on memory bandwidth.

Yes, it's only intractable if you're wedded to the idea of a tiny,
fixed-size sample.  If we're allowed to sample, say, 1% of the table, we
can get a MUCH more accurate n_distinct estimate using multiple
algorithms, of which HLL is one.  While n_distinct will still have some
variance, it'll be over a much smaller range.

The n_distinct algo we use in Postgres is specifically designed (by its
author) to choose the smallest reasonable number of distinct values
capable of producing the observed distribution.  This made sense when we
added it because we didn't have query plans where underestimating
n_distinct produced a penalty.  Now we do, and we ought to change algos.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


pgsql-performance by date:

Previous
From: Jeff Janes
Date:
Subject: Re: auto vaccum is dying
Next
From: Peter Geoghegan
Date:
Subject: Re: Yet another abort-early plan disaster on 9.3