Thread: HyperLogLog.c and pg_leftmost_one_pos32()

HyperLogLog.c and pg_leftmost_one_pos32()

From
Jeff Davis
Date:
Is there a reason that HyperLogLog doesn't use pg_leftmost_one_pos32()?

I tried the following patch and some brief performance tests seem to
show an improvement.

This came up because my recent commit 9878b643 uses HLL for estimating
the cardinality of spill files, which solves a few annoyances with
overpartitioning[1]. I think it's overall an improvement, but
addHyperLogLog() itself seemed to show up as a cost, so it can hurt
spilled-but-still-in-memory cases. I'd also like to backpatch this to
13 (as I already did for 9878b643), if that's acceptable.

Regards,
    Jeff Davis

[1] 
https://www.postgresql.org/message-id/CAH2-Wznidojad-zbObnFOzDA5RTCS0JLsqcZkDNu+ou1NGYQYQ@mail.gmail.com

Attachment

Re: HyperLogLog.c and pg_leftmost_one_pos32()

From
Peter Geoghegan
Date:
On Wed, Jul 29, 2020 at 10:08 AM Jeff Davis <pgsql@j-davis.com> wrote:
> Is there a reason that HyperLogLog doesn't use pg_leftmost_one_pos32()?

Yes: HyperLogLog predates pg_leftmost_one_pos32().

> I tried the following patch and some brief performance tests seem to
> show an improvement.

Makes sense.

How did you test this? What kind of difference are we talking about? I
ported this code from the upstream C++ as part of the original
abbreviated keys commit. Note that the cardinality of abbreviated keys
are displayed when you set "trace_sort = on".

> This came up because my recent commit 9878b643 uses HLL for estimating
> the cardinality of spill files, which solves a few annoyances with
> overpartitioning[1].

I think that you should change back the rhs() variable names to match
HyperLogLog upstream (as well as the existing rhs() comments).

> I think it's overall an improvement, but
> addHyperLogLog() itself seemed to show up as a cost, so it can hurt
> spilled-but-still-in-memory cases. I'd also like to backpatch this to
> 13 (as I already did for 9878b643), if that's acceptable.

I still wonder if it was ever necessary to add HLL to abbreviated
keys. It only served to avoid some pretty narrow worse cases, at the
expense of typical cases. Given that the existing users of HLL are
pretty narrow, and given the importance of preserving the favorable
performance characteristics of hash aggregate, I'm inclined to agree
that it's worth backpatching to 13 now. Assuming it is a really
measurable cost in practice.

-- 
Peter Geoghegan



Re: HyperLogLog.c and pg_leftmost_one_pos32()

From
Jeff Davis
Date:
On Wed, 2020-07-29 at 17:32 -0700, Peter Geoghegan wrote:
> How did you test this? What kind of difference are we talking about?

Essentially:
  initHyperLogLog(&hll, 5)
  for i in 0 .. one billion
    addHyperLogLog(&hll, hash(i))
  estimateHyperLogLog

The numbers are the same regardless of bwidth.

Before my patch, it takes about 15.6s. After my patch, it takes about
6.6s, so it's more than a 2X speedup (including the hash calculation).

As a part of HashAgg, when I test the worst case, it goes from a 4-5%
penalty to ~1% (within noise).

> I think that you should change back the rhs() variable names to match
> HyperLogLog upstream (as well as the existing rhs() comments).

Done.

> > I think it's overall an improvement, but
> > addHyperLogLog() itself seemed to show up as a cost, so it can hurt
> > spilled-but-still-in-memory cases. I'd also like to backpatch this
> > to
> > 13 (as I already did for 9878b643), if that's acceptable.
> 
> I still wonder if it was ever necessary to add HLL to abbreviated
> keys. It only served to avoid some pretty narrow worse cases, at the
> expense of typical cases. Given that the existing users of HLL are
> pretty narrow, and given the importance of preserving the favorable
> performance characteristics of hash aggregate, I'm inclined to agree
> that it's worth backpatching to 13 now. Assuming it is a really
> measurable cost in practice.

Yes, the difference (at least in a tight loop, on my machine) is not
subtle. I went ahead and committed and backpatched.

Regards,
    Jeff Davis






Re: HyperLogLog.c and pg_leftmost_one_pos32()

From
Tomas Vondra
Date:
On Thu, Jul 30, 2020 at 09:21:23AM -0700, Jeff Davis wrote:
>On Wed, 2020-07-29 at 17:32 -0700, Peter Geoghegan wrote:
>> How did you test this? What kind of difference are we talking about?
>
>Essentially:
>  initHyperLogLog(&hll, 5)
>  for i in 0 .. one billion
>    addHyperLogLog(&hll, hash(i))
>  estimateHyperLogLog
>
>The numbers are the same regardless of bwidth.
>
>Before my patch, it takes about 15.6s. After my patch, it takes about
>6.6s, so it's more than a 2X speedup (including the hash calculation).
>

Wow. That's a huge improvements.

How does the whole test (data + query) look like? Is it particularly
rare / special case, or something reasonable to expect in practice?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: HyperLogLog.c and pg_leftmost_one_pos32()

From
Jeff Davis
Date:
On Thu, 2020-07-30 at 19:16 +0200, Tomas Vondra wrote:
> > Essentially:
> >  initHyperLogLog(&hll, 5)
> >  for i in 0 .. one billion
> >    addHyperLogLog(&hll, hash(i))
> >  estimateHyperLogLog
> > 
> > The numbers are the same regardless of bwidth.
> > 
> > Before my patch, it takes about 15.6s. After my patch, it takes
> > about
> > 6.6s, so it's more than a 2X speedup (including the hash
> > calculation).
> > 
> 
> Wow. That's a huge improvements.

To be clear: the 2X+ speedup was on the tight loop test.

> How does the whole test (data + query) look like? Is it particularly
> rare / special case, or something reasonable to expect in practice?

The whole-query test was:

config:
  shared_buffers=8GB
  jit = off
  max_parallel_workers_per_gather=0

setup:
  create table t_1m_20(i int);
  vacuum (freeze, analyze) t_1m_20;
  insert into t_1m_20 select (random()*1000000)::int4
    from generate_series(1,20000000);

query:
  set work_mem='2048kB';
  SELECT pg_prewarm('t_1m_20', 'buffer');

  -- median of the three runs
  select distinct i from t_1m_20 offset 10000000;
  select distinct i from t_1m_20 offset 10000000;
  select distinct i
from t_1m_20 offset 10000000;

results:
  f2130e77 (before using HLL):          6787ms 
  f1af75c5 (before my recent commit):   7170ms
  fd734f38 (master now):                6990ms

My previous results before I committed the patch (and therefore not on
the same exact SHA1s) were 6812, 7158, and 6898. So my most recent
batch of results is slightly worse, but the most recent commit
(fd734f38) still does show an improvement of a couple percent.

Regards,
    Jeff Davis