From: Greg Stark
Subject: Re: Large Scale Aggregation (HashAgg Enhancement)
Date:
Msg-id: 87y81fwiq5.fsf@stark.xeocode.com
In response to: Re: Large Scale Aggregation (HashAgg Enhancement) (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
Tom Lane <tgl@sss.pgh.pa.us> writes:

> > Why would we continue to dynamically build the hash table after the
> > start of the outer scan?
> 
> The number of tuples written to a temp file might exceed what we want to
> hold in memory; we won't detect this until the batch is read back in,
> and in that case we have to split the batch at that time.  For hashagg
> this point would apply to the aggregate states not the input tuples, but
> it's still a live problem (especially if the aggregate states aren't
> fixed-size values ... consider a "concat" aggregate for instance).

For a hash aggregate, would it be possible to rescan the original table instead
of spilling to temporary files? When you run out of working memory you simply
throw out half the hash table and ignore any subsequent tuples that fall in the
discarded range of hash buckets. Afterwards you rescan the table to pick up the
hash bucket ranges you discarded.
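
To make that concrete, here is a toy standalone sketch of the scheme (nothing
resembling real executor code -- the names, the fixed-size table, and the
integer-count aggregate are all invented purely for illustration):

/*
 * Toy sketch of the rescan-instead-of-spill idea.  Nothing here is real
 * executor code and all the names are invented; it just counts duplicate
 * integer keys with a deliberately tiny in-memory table.  When the table
 * fills up mid-scan we abandon the upper half of the active hash range,
 * queue that range for a later rescan of the same input, and carry on.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SLOTS 4                           /* stand-in for work_mem */
#define MAX_PENDING 64

typedef struct { int key; long count; int used; } Slot;
typedef struct { uint32_t lo, hi; } Range;      /* inclusive hash range */

static Slot  table_[TABLE_SLOTS];
static Range pending[MAX_PENDING];
static int   npending;

static uint32_t hash_key(int key) { return (uint32_t) key * 2654435761u; }

/* fold one tuple into the aggregate state for key; 0 means out of memory */
static int accumulate(int key)
{
    int free_slot = -1;
    for (int i = 0; i < TABLE_SLOTS; i++)
    {
        if (table_[i].used && table_[i].key == key)
        {
            table_[i].count++;
            return 1;
        }
        if (!table_[i].used)
            free_slot = i;
    }
    if (free_slot < 0)
        return 0;
    table_[free_slot] = (Slot){ key, 1, 1 };
    return 1;
}

/* throw away every entry whose hash is above mid */
static void evict_above(uint32_t mid)
{
    for (int i = 0; i < TABLE_SLOTS; i++)
        if (table_[i].used && hash_key(table_[i].key) > mid)
            table_[i].used = 0;
}

int main(void)
{
    int input[] = { 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8,
                    9, 10, 1, 5, 9, 2, 6, 10 };
    int ninput = (int) (sizeof(input) / sizeof(input[0]));

    pending[npending++] = (Range){ 0, UINT32_MAX };

    while (npending > 0)                /* one pass per pending hash range */
    {
        Range r = pending[--npending];
        memset(table_, 0, sizeof(table_));

        for (int i = 0; i < ninput; i++)        /* rescan the whole input */
        {
            uint32_t h = hash_key(input[i]);
            if (h < r.lo || h > r.hi)
                continue;                       /* not this pass's problem */

            while (h <= r.hi && !accumulate(input[i]))
            {
                /* Out of memory: give up on the upper half of the active
                 * range, queue it for a later rescan, evict its entries,
                 * and retry.  A tuple whose hash lands in the discarded
                 * half is recounted from scratch on that later pass. */
                if (r.lo == r.hi)
                    break;      /* can't split one hash value; a real
                                 * implementation would have to spill here */
                uint32_t mid = r.lo + (r.hi - r.lo) / 2;
                pending[npending++] = (Range){ mid + 1, r.hi };
                evict_above(mid);
                r.hi = mid;
            }
        }

        for (int i = 0; i < TABLE_SLOTS; i++)
            if (table_[i].used)
                printf("key %d => count %ld\n",
                       table_[i].key, table_[i].count);
    }
    return 0;
}

The point to notice is that an entry thrown away mid-pass simply gets
recomputed from scratch when its hash range comes up again, so no partial
aggregate state ever has to be written out.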

This avoids doing any disk writes, at the expense of possibly doing additional
reads. In terms of i/o I think it would be much faster in most cases.

The downsides are: a) volatile aggregates or aggregates with side effects would
be confused by being executed twice over the same tuples. I'm not clear that
volatile aggregate functions make any sense anyway, though. b) I'm unclear
whether rescanning the table could potentially find tuples in a different state
than previous scans did. If so then the idea doesn't work at all, but I don't
think that's possible, is it?

The main problem is c) it may lose in terms of i/o in cases where the
cardinality is low (i.e., it's overflowing despite the low cardinality because
the table is really, really big too). But most cases will be spilling because
the cardinality is high, so even if the table is big the spill files are nearly
as big anyway, and having to write and then read them means double the i/o.
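
To put rough numbers on it (very much back-of-the-envelope, and assuming as
above that the spill files end up about as big as the table): for a table of T
pages,

    spilling:   T (scan) + ~T (write spill) + ~T (read spill)  ~= 3T page i/os
    rescanning: k * T reads for k passes, and no writes at all

so rescanning comes out ahead whenever it needs fewer than about three passes
over the table, and the i/o it avoids is the write traffic, which tends to be
the more painful kind.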

The upside of not having to write out temporary files is big. I find queries
that require temporary sort files get hit with a *huge* performance penalty,
often an order of magnitude. Part of that could probably be mitigated by having
the sort files on a separate spindle, but I think it's always going to hurt,
especially if there are multiple operations spilling to disk simultaneously.

-- 
greg


