On Sat, Jul 25, 2020 at 4:56 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I wrote a quick patch to use HyperLogLog to estimate the number of
> groups contained in a spill file. It seems to reduce the
> overpartitioning effect, and is a more principled approach than what I
> was doing before.
This pretty much fixes the issue that I observed with overpartitioning,
at least in the sense that the number of partitions grows more
predictably. Even when the number of partitions planned is reduced,
the change in the number of batches seems smooth-ish. It "looks nice".
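
For anyone following along who hasn't looked at HyperLogLog before,
here is a rough sketch of the idea of estimating the number of distinct
groups from the tuples written to a spill file. This is illustrative
Python, not the patch's C code: the class name, the `estimate_groups`
helper, the choice of BLAKE2b as the hash, and the register count are
all assumptions made for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (hypothetical, for illustration only)."""

    def __init__(self, p=12):
        self.p = p                  # number of index bits (assumed value)
        self.m = 1 << p             # number of registers
        self.registers = [0] * self.m

    def add(self, key: bytes) -> None:
        # 64-bit hash of the grouping key.
        h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")
        width = 64 - self.p
        idx = h >> width                    # top p bits pick a register
        rest = h & ((1 << width) - 1)       # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`
        # (width + 1 when `rest` is zero).
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)    # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return m * math.log(m / zeros)
        return raw


def estimate_groups(keys) -> float:
    """Estimate the number of distinct grouping keys seen in `keys`."""
    hll = HyperLogLog()
    for key in keys:
        hll.add(key)
    return hll.estimate()
```

The point is that memory use is fixed (one small register per bucket)
no matter how many tuples are spilled, and duplicate keys don't move the
estimate, so the estimate can stand in for a per-partition group count
when deciding how many partitions to create on the next pass.
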
> It does seem to hurt the runtime slightly when spilling to disk in some
> cases. I haven't narrowed down whether this is because we end up
> recursing multiple times, or if it's just more efficient to
> overpartition, or if the cost of doing the HLL itself is significant.
I'm glad that this more principled approach is possible. It's hard
to judge how much of a problem the runtime regression really is,
though. We'll need to think about this aspect some more.
Thanks
--
Peter Geoghegan