Thread: Large Scale Aggregation (HashAgg Enhancement)

Large Scale Aggregation (HashAgg Enhancement)

From: Rod Taylor
A couple of days ago I found myself wanting to aggregate 3 Billion
tuples down to 100 Million tuples based on an integer key with six
integer values -- six sum()'s.
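
For concreteness, the query was of roughly this shape (the table and
column names here are invented for illustration):

    SELECT key,
           sum(v1), sum(v2), sum(v3), sum(v4), sum(v5), sum(v6)
    FROM   facts          -- ~3 billion rows
    GROUP  BY key;        -- ~100 million distinct keys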

PostgreSQL ran out of memory with its Hash Aggregator and doing an old
style Sort & Sum took a fair amount of time to complete (cancelled the
process after 24 hours -- small machine).


Spilling to disk would be nice but I suspect the obvious method would
thrash quite badly with non-sorted input.


One solution is to partially sort the data into various buckets. If we
know how many keys can fit into sort_mem and what the upper and lower
bounds of our keys are, then (key range / keys per sort_mem) temporary
files can be created. A sequential scan of the source data would sort
each tuple into the appropriate temporary file. From there we can loop
through the temporary files one at a time: HashAgg the contents, present
the results, and move on to the next temporary file.

For my particular problem the lower bound is 1 and the upper bound is
about 100M. The sort_mem setting allows HashAgg to handle 1M keys at a
time.  The first pass through the 3B tuples would create 100 temporary
files on disk. Temp file 1 would get 1 through 1M, temp file 2 gets keys
1M + 1 through 2M, etc. From there it is pretty easy.

This would allow for a 1000 fold increase in the number of distinct keys
PostgreSQL can simultaneously HashAgg in the default configuration at a
reasonable speed.


I've written something similar using a client and COPY with temporary
tables. Even with the Export/Import copy I still beat the Sort&Sum
method PostgreSQL falls back to.


--



Re: Large Scale Aggregation (HashAgg Enhancement)

From: Simon Riggs
On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
> A couple of days ago I found myself wanting to aggregate 3 Billion
> tuples down to 100 Million tuples based on an integer key with six
> integer values -- six sum()'s.
> 
> PostgreSQL ran out of memory with its Hash Aggregator and doing an old
> style Sort & Sum took a fair amount of time to complete (cancelled the
> process after 24 hours -- small machine).

> Spilling to disk would be nice but I suspect the obvious method would
> thrash quite badly with non-sorted input.

There is already hash table overflow (spill to disk) logic in HashJoins,
so this should be possible by reusing that code for HashAggs. That's on
my todo list, but I'd welcome any assistance.

A question: Are the rows in your 3 B row table clumped together based
upon the 100M row key? (or *mostly* so) We might also be able to
pre-aggregate the rows using a plan like
    HashAgg
        SortedAgg
or
    SortedAgg
        Sort
            SortedAgg

The first SortedAgg seems superfluous, but would reduce the row volume
considerably if incoming rows were frequently naturally adjacent, even
if the values were not actually sorted. (This could also be done during
sorting, but it's much easier to slot the extra executor step into the
plan). That might then reduce the size of the later sort, or allow it to
become a HashAgg.
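
The reason such a two-stage plan is safe for sum() and the other standard
aggregates is that their partial results can be combined again;
conceptually it computes something like the sketch below, where "batch"
stands for whatever run of adjacent rows the lower SortedAgg happens to
collapse (names invented):

    SELECT key, sum(partial_sum) AS total        -- upper aggregation
    FROM (
        SELECT key, batch, sum(v1) AS partial_sum
        FROM   facts
        GROUP  BY key, batch                     -- lower, partial aggregation
    ) AS partials
    GROUP  BY key;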

I could make that manually enabled using "enable_pre_agg" to allow us to
measure the effectiveness of that technique and decide what cost model
we'd use to make it automatic. Would that help?

> I've written something similar using a client and COPY with temporary
> tables. Even with the Export/Import copy I still beat the Sort&Sum
> method PostgreSQL falls back to.

You can get round this now by chopping the larger table into pieces with
a WHERE clause and then putting them back together with a UNION. If the
table is partitioned, then do this by partitions.
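
For illustration, that workaround might look something like this (key
ranges and names invented; each slice is small enough to HashAgg in
memory, and UNION ALL avoids a pointless duplicate-elimination step since
the key ranges are disjoint):

    SELECT key, sum(v1), sum(v2)
    FROM   facts
    WHERE  key BETWEEN 1 AND 1000000
    GROUP  BY key
    UNION ALL
    SELECT key, sum(v1), sum(v2)
    FROM   facts
    WHERE  key BETWEEN 1000001 AND 2000000
    GROUP  BY key
    -- ... one arm per key range ...
    ;

The downside, as noted in the reply below, is that each arm is a separate
scan of the source data.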

This should also help when it comes to recalculating the sums again in
the future, since you'll only need to rescan the rows that have been
added since the last summation.

Best Regards, Simon Riggs



Re: Large Scale Aggregation (HashAgg Enhancement)

From: Rod Taylor
On Mon, 2006-01-16 at 08:32 +0000, Simon Riggs wrote:
> On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
> > A couple of days ago I found myself wanting to aggregate 3 Billion
> > tuples down to 100 Million tuples based on an integer key with six
> > integer values -- six sum()'s.
> > 
> > PostgreSQL ran out of memory with its Hash Aggregator and doing an old
> > style Sort & Sum took a fair amount of time to complete (cancelled the
> > process after 24 hours -- small machine).
> 
> > Spilling to disk would be nice but I suspect the obvious method would
> > thrash quite badly with non-sorted input.
> 
> There is already hash table overflow (spill to disk) logic in HashJoins,
> so this should be possible by reusing that code for HashAggs. That's on
> my todo list, but I'd welcome any assistance.

> A question: Are the rows in your 3 B row table clumped together based
> upon the 100M row key? (or *mostly* so) We might also be able to

They are randomly distributed. Fully sorting the data is quite painful.

> pre-aggregate the rows using a plan like
>     HashAgg
>         SortedAgg
> or
>     SortedAgg
>         Sort
>             SortedAgg
> 
> The first SortedAgg seems superfluous, but would reduce the row volume
> considerably if incoming rows were frequently naturally adjacent, even
> if the values were not actually sorted. (This could also be done during
> sorting, but it's much easier to slot the extra executor step into the
> plan). That might then reduce the size of the later sort, or allow it to
> become a HashAgg.
> 
> I could make that manually enabled using "enable_pre_agg" to allow us to
> measure the effectiveness of that technique and decide what cost model
> we'd use to make it automatic. Would that help?

I don't understand how this helps. The problem isn't the 3B data source
rows but rather the 100M destination keys that are being aggregated
against.

The memory constraints of HashAgg are a result of the large number of
target keys and should be the same whether the source is 100M rows or
10B rows.

I think I need something closer to:

HashAgg
    -> HashSort (to disk)

HashSort would create a number of files on disk with "similar" data.
Grouping all similar keys into a single temporary file which HashAgg can
deal with individually (100 loops by 1M target keys instead of 1 loop by
100M target keys). The results would be the same as partitioning by
keyblock and running a HashAgg on each partition, but it would be
handled by the Executor rather than by client side code.

> > I've written something similar using a client and COPY with temporary
> > tables. Even with the Export/Import copy I still beat the Sort&Sum
> > method PostgreSQL falls back to.
> 
> You can get round this now by chopping the larger table into pieces with
> a WHERE clause and then putting them back together with a UNION. If the
> table is partitioned, then do this by partitions.

True, except this results in several sequential scans over the source
data. I can extract and sort in a single pass at client side but it
would be far better if I could get PostgreSQL to do the same. I could
probably write a plpgsql function to do that logic but it would be quite
messy.
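
Such a plpgsql function might look roughly like the sketch below (table
names, column names and the batch size are all invented, and it still
rescans the source once per key range, which is exactly the objection
above):

    CREATE OR REPLACE FUNCTION batched_summary() RETURNS void AS $$
    DECLARE
        batch_size integer := 1000000;  -- keys per slice, sized for sort_mem
        max_key    integer;
        lo         integer := 1;
    BEGIN
        SELECT max(key) INTO max_key FROM facts;
        WHILE lo <= max_key LOOP
            -- Each key-range slice is small enough to HashAgg in memory.
            INSERT INTO facts_summary (key, s1, s2)
            SELECT key, sum(v1), sum(v2)
            FROM   facts
            WHERE  key >= lo AND key < lo + batch_size
            GROUP  BY key;
            lo := lo + batch_size;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;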

> This should also help when it comes to recalculating the sums again in
> the future, since you'll only need to rescan the rows that have been
> added since the last summation.

We store the aggregated results and never do this type of calculation on
that dataset again. The original dataset comes from about 300 partitions
(time and source) and they are removed upon completion. While this
calculation is being performed additional partitions are added.

I suppose I could store source data in 300 * 1000 partitions (Approx 300
batches times 1000 segments) but that would probably run into other
problems. PostgreSQL probably has issues with that many tables.

-- 



Re: Large Scale Aggregation (HashAgg Enhancement)

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
>> A couple of days ago I found myself wanting to aggregate 3 Billion
>> tuples down to 100 Million tuples based on an integer key with six
>> integer values -- six sum()'s.

> There is already hash table overflow (spill to disk) logic in HashJoins,
> so this should be possible by reusing that code for HashAggs. That's on
> my todo list, but I'd welcome any assistance.

Yeah, I proposed something similar awhile back in conjunction with
fixing the spill logic for hash joins (which was always there, but was
not adaptive until recently).  I got the join part done but got
distracted before fixing HashAgg :-(

In principle, you just reduce the range of currently-in-memory hash
codes whenever you run low on memory.  The so-far-accumulated working
state for aggregates that are not in the range anymore goes into a temp
file, and subsequently any incoming tuples that hash outside the range
go into another temp file.  After you've completed the scan, you
finalize and emit the aggregates that are still in memory, then you pick
up the first set of dropped aggregates, rescan the associated "TODO"
file of unprocessed tuples, lather rinse repeat till done.

The tricky part is to preserve the existing guarantee that tuples are
merged into their aggregate in arrival order.  (This does not matter for
the standard aggregates but it definitely does for custom aggregates,
and there will be unhappy villagers appearing on our doorsteps if we
break it.)  I think this can work correctly under the above sketch but
it needs to be verified.  It might require different handling of the
TODO files than what hashjoin does.
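
As an illustration of why arrival order can be user-visible, consider a
simple user-defined text-concatenation aggregate (a sketch; any custom
aggregate whose result depends on input order behaves this way):

    CREATE FUNCTION concat_step(text, text) RETURNS text
        AS 'SELECT $1 || $2' LANGUAGE sql;

    CREATE AGGREGATE concat_agg (
        basetype = text,
        sfunc    = concat_step,
        stype    = text,
        initcond = ''
    );

    -- 'x' || 'y' and 'y' || 'x' differ, so the order in which rows reach
    -- the transition function shows up in the final result.
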
        regards, tom lane


Re: Large Scale Aggregation (HashAgg Enhancement)

From: Simon Riggs
On Mon, 2006-01-16 at 09:42 -0500, Rod Taylor wrote:
> On Mon, 2006-01-16 at 08:32 +0000, Simon Riggs wrote:
> > On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
> > > 
> > A question: Are the rows in your 3 B row table clumped together based
> > upon the 100M row key? (or *mostly* so) We might also be able to
> 
> They are randomly distributed. Fully sorting the data is quite painful.

...

> I don't understand how this helps. 

It wouldn't since your rows are randomly distributed. The idea was not
related to improving HashAgg, but to improving Aggregation for the case
of naturally grouped data.

> I think I need something closer to:
> 
> HashAgg
>     -> HashSort (to disk)
> 
> HashSort would create a number of files on disk with "similar" data.
> Grouping all similar keys into a single temporary file which HashAgg can
> deal with individually (100 loops by 1M target keys instead of 1 loop by
> 100M target keys). The results would be the same as partitioning by
> keyblock and running a HashAgg on each partition, but it would be
> handled by the Executor rather than by client side code.
> 
> > > I've written something similar using a client and COPY with temporary
> > > tables. Even with the Export/Import copy I still beat the Sort&Sum
> > > method PostgreSQL falls back to.

That is exactly how the spill to disk logic works for HashJoin (and
incidentally, identical to an Oracle one-pass hash join since both are
based upon the hybrid hash join algorithm). 

Multi-pass would only be required to handle very skewed hash
distributions, which HJ doesn't do yet.

So yes, this can be done. 

Best Regards, Simon Riggs



Re: Large Scale Aggregation (HashAgg Enhancement)

From: Simon Riggs
On Mon, 2006-01-16 at 12:36 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
> >> A couple of days ago I found myself wanting to aggregate 3 Billion
> >> tuples down to 100 Million tuples based on an integer key with six
> >> integer values -- six sum()'s.
> 
> > There is already hash table overflow (spill to disk) logic in HashJoins,
> > so this should be possible by reusing that code for HashAggs. That's on
> > my todo list, but I'd welcome any assistance.
> 
> Yeah, I proposed something similar awhile back in conjunction with
> fixing the spill logic for hash joins (which was always there, but was
> not adaptive until recently).  I got the join part done but got
> distracted before fixing HashAgg :-(

You've done the main work. :-)

> The tricky part is to preserve the existing guarantee that tuples are
> merged into their aggregate in arrival order.  (This does not matter for
> the standard aggregates but it definitely does for custom aggregates,
> and there will be unhappy villagers appearing on our doorsteps if we
> break it.)  I think this can work correctly under the above sketch but
> it needs to be verified.  It might require different handling of the
> TODO files than what hashjoin does.

For HJ we write each outer tuple to its own file-per-batch in the order
they arrive. Reading them back in preserves the original ordering. So
yes, caution required, but I see no difficulty, just reworking the HJ
code (nodeHashjoin and nodeHash). What else do you see?

Best Regards, Simon Riggs



Re: Large Scale Aggregation (HashAgg Enhancement)

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> For HJ we write each outer tuple to its own file-per-batch in the order
> they arrive. Reading them back in preserves the original ordering. So
> yes, caution required, but I see no difficulty, just reworking the HJ
> code (nodeHashjoin and nodeHash). What else do you see?

With dynamic adjustment of the hash partitioning, some tuples will go
through multiple temp files before they ultimately get eaten, and
different tuples destined for the same aggregate may take different
paths through the temp files depending on when they arrive.  It's not
immediately obvious that ordering is preserved when that happens.
I think it can be made to work but it may take different management of
the temp files than hashjoin uses.  (Worst case, we could use just a
single temp file for all unprocessed tuples, but this would result in
extra I/O.)
        regards, tom lane


Re: Large Scale Aggregation (HashAgg Enhancement)

From: Simon Riggs
On Mon, 2006-01-16 at 14:43 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > For HJ we write each outer tuple to its own file-per-batch in the order
> > they arrive. Reading them back in preserves the original ordering. So
> > yes, caution required, but I see no difficulty, just reworking the HJ
> > code (nodeHashjoin and nodeHash). What else do you see?
> 
> With dynamic adjustment of the hash partitioning, some tuples will go
> through multiple temp files before they ultimately get eaten, and
> different tuples destined for the same aggregate may take different
> paths through the temp files depending on when they arrive.  It's not
> immediately obvious that ordering is preserved when that happens.
> I think it can be made to work but it may take different management of
> the temp files than hashjoin uses.  (Worst case, we could use just a
> single temp file for all unprocessed tuples, but this would result in
> extra I/O.)

Sure hash table is dynamic, but we read all inner rows to create the
hash table (nodeHash) before we get the outer rows (nodeHJ).
Why would we continue to dynamically build the hash table after the
start of the outer scan? (I see that we do this, as you say, but I am
surprised).

Best Regards, Simon Riggs







Re: Large Scale Aggregation (HashAgg Enhancement)

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> Sure hash table is dynamic, but we read all inner rows to create the
> hash table (nodeHash) before we get the outer rows (nodeHJ).

But our idea of the number of batches needed can change during that
process, resulting in some inner tuples being initially assigned to the
wrong temp file.  This would also be true for hashagg.

> Why would we continue to dynamically build the hash table after the
> start of the outer scan?

The number of tuples written to a temp file might exceed what we want to
hold in memory; we won't detect this until the batch is read back in,
and in that case we have to split the batch at that time.  For hashagg
this point would apply to the aggregate states not the input tuples, but
it's still a live problem (especially if the aggregate states aren't
fixed-size values ... consider a "concat" aggregate for instance).
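
For example, an aggregate built along the lines of the familiar
array_accum sketch has a per-group state that grows with every input row,
so a batch whose tuples looked small enough on disk can still exceed the
memory budget once its aggregate states are built up:

    CREATE AGGREGATE array_accum (
        basetype = anyelement,
        sfunc    = array_append,
        stype    = anyarray,
        initcond = '{}'
    );

    -- The state array grows by one element per input row, so hash table
    -- size depends on the data itself, not just the number of groups.
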
        regards, tom lane


Re: Large Scale Aggregation (HashAgg Enhancement)

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> > Why would we continue to dynamically build the hash table after the
> > start of the outer scan?
> 
> The number of tuples written to a temp file might exceed what we want to
> hold in memory; we won't detect this until the batch is read back in,
> and in that case we have to split the batch at that time.  For hashagg
> this point would apply to the aggregate states not the input tuples, but
> it's still a live problem (especially if the aggregate states aren't
> fixed-size values ... consider a "concat" aggregate for instance).

For a hash aggregate would it be possible to rescan the original table instead
of spilling to temporary files? Then when you run out of working memory you
simply throw out half the hash table and ignore subsequent tuples that fall in
those hash buckets. Then you rescan for the discarded hash bucket regions.

This avoids having to do any disk writes at the expense possibly of additional
reads. I think in terms of i/o it would be much faster in most cases.

The downsides are: a) volatile aggregates or aggregates with side-effects
would be confused by being executed twice. I'm not clear that volatile
aggregate functions make any sense anyways though. b) I'm unclear whether
rescanning the table could potentially find tuples in a different state than
previous scans. If so then the idea doesn't work at all. But I don't think
that's possible is it?

The main problem is c) it may lose in terms of i/o for cases where the
cardinality is low (ie, it's overflowing despite having low cardinality
because the table is really really big too). But most cases will be spilling
because the cardinality is high. So the table may be big but the spill files
are nearly as big anyways and having to write and then read them means double
the i/o.

The upside of not having to write out temporary files is big. I find queries
that require temporary sort files get hit with a *huge* performance penalty.
Often an order of magnitude. Part of that could probably be mitigated by
having the sort files on a separate spindle but I think it's always going to
hurt especially if there are multiple operations spilling to disk
simultaneously.

-- 
greg



Re: Large Scale Aggregation (HashAgg Enhancement)

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> For a hash aggregate would it be possible to rescan the original table
> instead of spilling to temporary files?

Sure, but the possible performance gain is finite and the possible
performance loss is not.  The "original table" could be an extremely
expensive join.  We'd like to think that the planner gets the input size
estimate approximately right and so the amount of extra I/O caused by
hash table resizing should normally be minimal.  The cases where it is
not right are *especially* not likely to be a trivial table scan as you
are supposing.
        regards, tom lane


Re: Large Scale Aggregation (HashAgg Enhancement)

From: Simon Riggs
On Mon, 2006-01-16 at 20:02 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Sure hash table is dynamic, but we read all inner rows to create the
> > hash table (nodeHash) before we get the outer rows (nodeHJ).
> 
> But our idea of the number of batches needed can change during that
> process, resulting in some inner tuples being initially assigned to the
> wrong temp file.  This would also be true for hashagg.

So we correct that before we start reading the outer table.

> > Why would we continue to dynamically build the hash table after the
> > start of the outer scan?
> 
> The number of tuples written to a temp file might exceed what we want to
> hold in memory; we won't detect this until the batch is read back in,
> and in that case we have to split the batch at that time.  For hashagg
> this point would apply to the aggregate states not the input tuples, but
> it's still a live problem (especially if the aggregate states aren't
> fixed-size values ... consider a "concat" aggregate for instance).

OK, I see what you mean. Sounds like we should have a new definition for
Aggregates, "Sort Insensitive", that allows them to work when the input
ordering does not affect the result, since that case can be optimised
much better when using HashAgg. Since we know that applies to the common
cases of SUM, AVG etc. this will certainly help people.

For sort-sensitive aggregates sounds like we either:
1. Write to a single file, while we remember the start offset of the
first row of each batch.
2. Write to multiple files, adding a globally incrementing sequenceid.
Batches are then resorted on the sequenceid before processing.
3. We give up, delete the existing batches and restart the scan from the
beginning of the outer table.

Sounds like (1) is best, since the overflow just becomes a SortedAgg.
But all of them sound ugly.

Best Regards, Simon Riggs



Re: Large Scale Aggregation (HashAgg Enhancement)

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2006-01-16 at 20:02 -0500, Tom Lane wrote:
>> But our idea of the number of batches needed can change during that
>> process, resulting in some inner tuples being initially assigned to the
>> wrong temp file.  This would also be true for hashagg.

> So we correct that before we start reading the outer table.

Why?  That would require a useless additional pass over the data.  With
the current design, we can process and discard at least *some* of the
data in a temp file when we read it, but a reorganization pass would
mean that it *all* goes back out to disk a second time.

Also, you assume that we can accurately tell how many tuples will fit in
memory in advance of actually processing them --- a presumption clearly
false in the hashagg case, and not that easy to do even for hashjoin.
(You can tell the overall size of a temp file, sure, but how do you know
how it will split when the batch size changes?  A perfectly even split
is unlikely.)

> OK, I see what you mean. Sounds like we should have a new definition for
> Aggregates, "Sort Insensitive" that allows them to work when the input
> ordering does not affect the result, since that case can be optimised
> much better when using HashAgg.

Please don't propose pushing this problem onto the user until it's
demonstrated that there's no other way.  I don't want to become the
next Oracle, with forty zillion knobs that it takes a highly trained
DBA to deal with.

> But all of them sound ugly.

I was thinking along the lines of having multiple temp files per hash
bucket.  If you have a tuple that needs to migrate from bucket M to
bucket N, you know that it arrived before every tuple that was assigned
to bucket N originally, so put such tuples into a separate temp file
and process them before the main bucket-N temp file.  This might get a
little tricky to manage after multiple hash resizings, but in principle
it seems doable.
        regards, tom lane


Re: Large Scale Aggregation (HashAgg Enhancement)

From: Simon Riggs
On Mon, 2006-01-16 at 12:36 -0500, Tom Lane wrote:

> The tricky part is to preserve the existing guarantee that tuples are
> merged into their aggregate in arrival order.  (This does not matter for
> the standard aggregates but it definitely does for custom aggregates,
> and there will be unhappy villagers appearing on our doorsteps if we
> break it.)  I think this can work correctly under the above sketch but
> it needs to be verified.  It might require different handling of the
> TODO files than what hashjoin does.

You almost had me there... but there isn't any "arrival order". The sort
that precedes an aggregation only sorts on the GROUP BY columns, not on
additional columns - so by the SQL standard there is not a guaranteed
ordering of the data into an aggregate. That is exactly what windowed
aggregates are for. (There isn't any way of specifying an ORDER BY yet
either).

The only way of doing this is by doing a derived table
    select a, sum(b) from (select a,b order by a,b);
but AFAICS this is not part of the standard??

It is highly likely that rows are clumped together, but there just isn't
any guarantee that is the case. Any update of any row would change the
arrival order. Should we support something that has worked by luck?

I've been looking into windowed aggregates; these will provide this
functionality should people require it. I don't see how we'd be able to
do windowed aggregates and hashAgg at the same time, so this seems less
relevant. 
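
For reference, the SQL:2003 windowed-aggregate syntax being alluded to
looks roughly like this (not something PostgreSQL supported at the time
of this thread):

    SELECT a,
           sum(b) OVER (PARTITION BY a ORDER BY c) AS running_sum
    FROM   t;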

Best Regards, Simon Riggs



Re: Large Scale Aggregation (HashAgg Enhancement)

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> On Mon, 2006-01-16 at 12:36 -0500, Tom Lane wrote:
>> The tricky part is to preserve the existing guarantee that tuples are
>> merged into their aggregate in arrival order.

> You almost had me there... but there isn't any "arrival order".

The fact that it's not in the spec doesn't mean we don't support it.
Here are a couple of threads on the subject:
http://archives.postgresql.org/pgsql-general/2005-11/msg00304.php
http://archives.postgresql.org/pgsql-sql/2003-06/msg00135.php

Per the second message, this has worked since 7.4, and it was requested
fairly often before that.
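
The idiom those threads describe is feeding an aggregate from an ordered
subselect, roughly as follows (using an order-sensitive aggregate such as
the concat example earlier; table and column names invented):

    SELECT a, concat_agg(b)
    FROM  (SELECT a, b FROM t ORDER BY a, b) AS ss
    GROUP  BY a;

    -- Rows reach concat_agg's transition function in (a, b) order, which
    -- is the behaviour those threads rely on.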

> Should we support something that has worked by luck?

No luck about it, and yes people are depending on it.  You don't get to
break it just because it's not in the spec.
        regards, tom lane


Re: Large Scale Aggregation (HashAgg Enhancement)

From: Simon Riggs
On Tue, 2006-01-17 at 14:41 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2006-01-16 at 12:36 -0500, Tom Lane wrote:
> >> The tricky part is to preserve the existing guarantee that tuples are
> >> merged into their aggregate in arrival order.
> 
> > You almost had me there... but there isn't any "arrival order".
> 
> The fact that it's not in the spec doesn't mean we don't support it.
> Here are a couple of threads on the subject:
> http://archives.postgresql.org/pgsql-general/2005-11/msg00304.php
> http://archives.postgresql.org/pgsql-sql/2003-06/msg00135.php
> 
> Per the second message, this has worked since 7.4, and it was requested
> fairly often before that.

OK.... My interest was in expanding the role of HashAgg, which as Rod
says can be used to avoid the sort, so the overlap between those ideas
was low anyway.

On Tue, 2006-01-17 at 09:52 -0500, Tom Lane wrote:
> I was thinking along the lines of having multiple temp files per hash
> bucket.  If you have a tuple that needs to migrate from bucket M to
> bucket N, you know that it arrived before every tuple that was assigned
> to bucket N originally, so put such tuples into a separate temp file
> and process them before the main bucket-N temp file.  This might get a
> little tricky to manage after multiple hash resizings, but in principle
> it seems doable.

OK, so we do need to do this when we have a defined arrival order: this
idea is the best one so far. I don't see any optimization of this by
ignoring the arrival order, so it seems best to preserve the ordering
this way in all cases.

You can manage that with file naming. Rows moved from batch N to batch M
would be renamed N.M, so you'd be able to use file ordering to retrieve
all files for *.M
That scheme would work for multiple splits too, so that filenames could
grow yet retain their sort order and final target batch properties.

Best Regards, Simon Riggs



Re: Large Scale Aggregation (HashAgg Enhancement)

From: Simon Riggs
On Tue, 2006-01-17 at 21:43 +0000, Simon Riggs wrote:
> OK.... My interest was in expanding the role of HashAgg, which as Rod
> says can be used to avoid the sort, so the overlap between those ideas
> was low anyway.

Am I right in thinking that HashAgg would almost always be quicker than
SortAgg, even for large (> memory) aggregation sets? (Except where the
prior ordering has already been forced via an ORDER BY).

If that is so, then I will probably look to work on this sooner,
especially since we seem to have a clear design.

I'd originally viewed the spill-to-disk logic as a safety measure rather
than as a performance feature.

Best Regards, Simon Riggs



Re: Large Scale Aggregation (HashAgg Enhancement)

From: Simon Riggs
On Tue, 2006-01-17 at 21:43 +0000, Simon Riggs wrote:
> On Tue, 2006-01-17 at 09:52 -0500, Tom Lane wrote:
> > I was thinking along the lines of having multiple temp files per hash
> > bucket.  If you have a tuple that needs to migrate from bucket M to
> > bucket N, you know that it arrived before every tuple that was assigned
> > to bucket N originally, so put such tuples into a separate temp file
> > and process them before the main bucket-N temp file.  This might get a
> > little tricky to manage after multiple hash resizings, but in principle
> > it seems doable.

> You can manage that with file naming. Rows moved from batch N to batch M
> would be renamed N.M, so you'd be able to use file ordering to retrieve
> all files for *.M
> That scheme would work for multiple splits too, so that filenames could
> grow yet retain their sort order and final target batch properties.

This seems to lead to a super-geometric progression in the number of
files required, if we assume that the current batch could be
redistributed to all future batches each of which could be similarly
redistributed.

batches    files required
1          none
2          1
4          7
8          64
16         64,000
32         4 billion (ish)

So it does seem important whether we demand sorted input or not.

Or at least requires us to provide the executor with a starting point
for the number of batches, so we could manage that.

Best Regards, Simon Riggs



Re: Large Scale Aggregation (HashAgg Enhancement)

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> This seems to lead to a super-geometric progression in the number of
> files required,

But we double the number of batches at each step, so there are going to
be at most 20 or so levels, and that's only assuming a *horridly* wrong
initial guess by the planner.  In practice I think it's reasonable to
assume at most a couple rounds of doubling.  If you have more than that,
the extra data-shuffling is going to exhaust your patience anyway.
        regards, tom lane


Re: Large Scale Aggregation (HashAgg Enhancement)

From: Simon Riggs
On Thu, 2006-01-19 at 18:38 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > This seems to lead to a super-geometric progression in the number of
> > files required,
> 
> But we double the number of batches at each step, so there are going to
> be at most 20 or so levels, and that's only assuming a *horridly* wrong
> initial guess by the planner.  In practice I think it's reasonable to
> assume at most a couple rounds of doubling.  If you have more than that,
> the extra data-shuffling is going to exhaust your patience anyway.

What I'm saying is that if we start from 1 batch and move dynamically
upwards we quickly get an unmanageable number of files. However, if we
start at a particular number N, then we start with N-1 files, then move
to at most 2N(N-1) files, etc.

So we can only "get it wrong" and double the number of batches about
twice before we get swamped with files; i.e. if we start at 1 we can
only reasonably get to 8 batches.

So we should start at a number higher than 1, attempting to make an
accurate guess about the number of batches (N) required. If we have R rows
to aggregate and we get N correct, then the cost of the HashAgg is
2*R*(N-1)/N I/Os, which is cheaper than a sort, for *any* value of R for
both CPU and I/O costs. If we get it wrong, we have to read and re-write
more and more rows, which could eventually surpass the sort costs,
especially if we have growing transition state data from the aggregate.
I think the cost will be to re-write half of all rows already written
when we double N. If we fail early because we got Ndistinct wrong then
this could be cheap, though if we fail later on because of a growing
aggregate then this could easily be very expensive and quickly exceed
the cost of a sort.

My thought is to collect statistics about an aggregate at CREATE
AGGREGATE time: simply send the aggregate 100 data values and see whether
the output varies in size according to the input; if it does, we take
much greater care about selecting HashAgg plans with that aggregate. That
way we don't need the user to define the aggregate type directly. This
would only work with aggregates that return well-known datatypes such as
int or char.

So getting the number of groups correct would be critical to making this
work, but HashAgg could be effective for even very large aggregates.

Any holes in that thinking?

Best Regards, Simon Riggs




Re: Large Scale Aggregation (HashAgg Enhancement)

From: Tom Lane
Simon Riggs <simon@2ndquadrant.com> writes:
> Any holes in that thinking?

Only that it's about five times more complicated than is currently known
to be necessary ;-).  How about we just implement the dynamic spill to
disk first, and not bother with the other stuff until we see problems in
the field?  Saying "we have to do all this" is a good recipe for not
getting any of it done.
        regards, tom lane