Re: Spilling hashed SetOps and aggregates to disk - Mailing list pgsql-hackers

From David Gershuni
Subject Re: Spilling hashed SetOps and aggregates to disk
Date
Msg-id 998FE0E3-CC36-43CC-96D0-5707959A4B0D@cs.cmu.edu
In response to Re: Spilling hashed SetOps and aggregates to disk  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
As Serge mentioned, we’ve implemented spill-to-disk for SetOps and Aggregates at Salesforce. We were hitting OOMs often enough that this became a high priority for us. However, our current spill implementation is based on dynahash from 9.6, and we’re not happy with its performance (it was primarily an OOM stop-gap, not an optimized implementation).

Because of this, we’ve spec’d out a new spill-to-disk design for hash-based aggregates (and later SetOps), and we plan to begin implementation very soon.  Since this is an important fix for the community as well, we would be happy (and would prefer) to share our spill-to-disk implementation. 

Our new spill approach is very similar to Jeff’s and Heikki’s and is designed to use simplehash.h. It’s based on an algorithm called “HybridHash with Pre-Partitioning” found in [1]. It may later make sense to implement the “HashSort” algorithm from [1] as well, which works better for highly skewed grouping keys. The optimizer could eventually choose between the two based on the stats available. We also like Heikki’s suggestions to use logtape.c to reduce disk usage and a trie-based approach to control the size of partitions dynamically.
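
To make the control flow concrete, here’s a rough sketch of the hybrid-hash loop as we understand it from [1]. All of the type and function names below (hybrid_hash_agg, lookup_or_insert_group, write_to_partition, and so on) are placeholders for illustration, not actual PostgreSQL or patch code:

/*
 * Illustrative sketch only -- all type and function names are placeholders.
 *
 * HybridHash with pre-partitioning: aggregate in memory until the hash
 * table fills up, route the rest of the input to disk partitions chosen
 * by unused hash bits, then recurse on each partition.
 */
#define NUM_SPILL_PARTITIONS 32

static void
hybrid_hash_agg(TupleSource *input, HashTable *hashtable, int hash_shift)
{
    SpillPartition parts[NUM_SPILL_PARTITIONS] = {0};
    Tuple      *tup;
    int         i;

    while ((tup = next_tuple(input)) != NULL)
    {
        uint32      hash = compute_group_hash(tup);

        /* Advance the group's transition value if it is (or fits) in memory. */
        if (lookup_or_insert_group(hashtable, tup, hash))
            continue;

        /*
         * The table is full and this group is not resident: write the raw
         * input tuple to a spill partition selected by hash bits not yet
         * used, so each partition holds a disjoint subset of groups.
         */
        write_to_partition(&parts[(hash >> hash_shift) % NUM_SPILL_PARTITIONS], tup);
    }

    /* Emit finished groups, then reuse the (now empty) hash table. */
    emit_groups_and_reset(hashtable);

    /* Recurse on each spilled partition; each should now fit in work_mem. */
    for (i = 0; i < NUM_SPILL_PARTITIONS; i++)
    {
        if (!partition_is_empty(&parts[i]))
            hybrid_hash_agg(partition_as_source(&parts[i]), hashtable,
                            hash_shift + 5);    /* 5 = log2(32) */
    }
}

With logtape.c, the per-partition temp files above could share one logical tape set to keep disk usage down, and the trie-based idea would let the partition fanout grow dynamically instead of being fixed at 32.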

We’ve also been grappling with how to handle the implementation challenges pointed out in this thread. These include:
1) tracking memory usage
2) choosing a smart eviction policy (which is touched on in [2])
3) serializing opaque user-defined transition values when eviction is required

For 1), we plan to use our WithStats memory context, which Serge mentioned.
For 2), we plan to start with a simple non-eviction policy, since we don’t have the stats to do anything smarter (i.e. insert until we fill the hash table, then spill until we finish processing the batch, with evictions happening only if a group’s transition value grows too large). A rough sketch of this policy follows below.
For 3), we don’t have a good solution yet. We could serialize/deserialize for built-in types and rely on users to provide serialize/deserialize functions for user-defined aggregates going forward.
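
For 2) and 3) together, here’s a minimal sketch of the kind of per-group check we have in mind; again, every name (transition_value_size, serialize_transition_value, write_partial_state, ...) is a placeholder. For transition values of type internal, the serialization step could lean on the same kind of serial/deserial functions that parallel aggregation already uses:

/*
 * Illustrative sketch only -- placeholder names throughout.
 *
 * Non-eviction policy: once the hash table is full, new groups spill to
 * disk as raw input tuples (as in the sketch above).  A resident group
 * is evicted only if its transition value grows too large, in which case
 * the value is serialized and written out as a partial aggregate state,
 * to be combined when its partition is re-aggregated.
 */
#define MAX_TRANS_VALUE_SIZE (1024 * 1024)  /* illustrative 1MB cap per group */

static void
advance_group(HashTable *hashtable, GroupEntry *entry, Tuple *tup)
{
    /* Apply the aggregate's transition function to this input tuple. */
    advance_transition(entry, tup);

    if (transition_value_size(entry) <= MAX_TRANS_VALUE_SIZE)
        return;

    /*
     * The transition value (e.g. an array being built up by array_agg)
     * has outgrown the cap: serialize it, write it to the group's spill
     * partition as a partial state, and drop the group from memory.
     */
    write_partial_state(entry->spill_partition, entry->group_key,
                        serialize_transition_value(entry));
    remove_group(hashtable, entry);
}

Re-aggregating a spilled partition would then need to distinguish raw input tuples from serialized partial states and apply the aggregate’s combine function to the latter.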

But we’re open to suggestions :)

Regards,
David
Salesforce

[1] Revisiting Aggregation for Data Intensive Applications: A Performance Study
https://arxiv.org/pdf/1311.0059.pdf

[2] DB2 with BLU acceleration: so much more than just a column store
http://delivery.acm.org/10.1145/2540000/2536233/p1080-raman.pdf
