Re: Large Scale Aggregation (HashAgg Enhancement) - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Large Scale Aggregation (HashAgg Enhancement)
Date
Msg-id 1137400335.3180.140.camel@localhost.localdomain
In response to Large Scale Aggregation (HashAgg Enhancement)  (Rod Taylor <pg@rbt.ca>)
Responses Re: Large Scale Aggregation (HashAgg Enhancement)
Re: Large Scale Aggregation (HashAgg Enhancement)
List pgsql-hackers
On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
> A couple of days ago I found myself wanting to aggregate 3 Billion
> tuples down to 100 Million tuples based on an integer key with six
> integer values -- six sum()'s.
> 
> PostgreSQL ran out of memory with its Hash Aggregator and doing an old
> style Sort & Sum took a fair amount of time to complete (cancelled the
> process after 24 hours -- small machine).

> Spilling to disk would be nice but I suspect the obvious method would
> thrash quite badly with non-sorted input.

There is already hash table overflow (spill to disk) logic in HashJoins,
so this should be possible by reusing that code for HashAggs. That's on
my todo list, but I'd welcome any assistance.
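To sketch the idea in SQL rather than executor code: HashJoin copes with
oversize inputs by splitting them into batches on a hash of the key,
spilling all but the current batch to temp files and processing one
batch at a time. An untested hand-rolled equivalent, assuming a table
big(key int4, v1 int4) and using the built-in hashint4():

    -- Untested sketch: split the input into 4 batches by hash of the
    -- key, aggregate each batch with a hash table small enough to fit
    -- in memory, and concatenate. The batches partition the key space,
    -- so plain UNION ALL is the final answer (no re-aggregation across
    -- batches is needed).
    SELECT key, sum(v1) AS s1 FROM big WHERE (hashint4(key) & 3) = 0 GROUP BY key
    UNION ALL
    SELECT key, sum(v1) AS s1 FROM big WHERE (hashint4(key) & 3) = 1 GROUP BY key
    UNION ALL
    SELECT key, sum(v1) AS s1 FROM big WHERE (hashint4(key) & 3) = 2 GROUP BY key
    UNION ALL
    SELECT key, sum(v1) AS s1 FROM big WHERE (hashint4(key) & 3) = 3 GROUP BY key;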

A question: Are the rows in your 3B-row table clumped together based
upon the 100M-row key? (or *mostly* so) We might also be able to
pre-aggregate the rows using a plan like

    HashAgg
        SortedAgg

or

    SortedAgg
        Sort
            SortedAgg

The first SortedAgg seems superfluous, but would reduce the row volume
considerably if incoming rows were frequently naturally adjacent, even
if the values were not actually sorted. (This could also be done during
sorting, but it's much easier to slot the extra executor step into the
plan.) That might then reduce the size of the later sort, or allow it to
become a HashAgg.
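The reason this is safe for sum() is that partial sums can themselves be
summed. In SQL terms (untested; run_no is a hypothetical marker standing
in for what the bottom SortedAgg does, i.e. numbering each run of
adjacent equal keys):

    -- Untested sketch of why the two-level plan gives the right answer:
    -- the inner level emits one partial sum per run of adjacent equal
    -- keys; the outer level sums the (far fewer) partials per key.
    SELECT key, sum(partial_s1) AS s1
      FROM (SELECT key, sum(v1) AS partial_s1
              FROM big
             GROUP BY key, run_no   -- run_no: hypothetical, one value
                                    -- per run of adjacent equal keys
           ) pre
     GROUP BY key;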

I could make that manually enabled with an "enable_pre_agg" setting, so
we could measure the effectiveness of the technique and decide what cost
model we'd use to make it automatic. Would that help?

> I've written something similar using a client and COPY with temporary
> tables. Even with the Export/Import copy I still beat the Sort&Sum
> method PostgreSQL falls back to.

You can get round this now by chopping the larger table into pieces with
a WHERE clause and then putting them back together with a UNION. If the
table is partitioned, then do this by partitions.
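For example (untested; the table, column names and split point are
invented, assuming the key is an integer):

    -- Untested sketch: aggregate disjoint key ranges separately, so
    -- each piece's sort or hash table fits in memory. The ranges don't
    -- overlap, so UNION ALL of the pieces is already the final result.
    SELECT key, sum(v1) AS s1 FROM big WHERE key <  50000000 GROUP BY key
    UNION ALL
    SELECT key, sum(v1) AS s1 FROM big WHERE key >= 50000000 GROUP BY key;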

This should also help when it comes to recalculating the sums again in
the future, since you'll only need to rescan the rows that have been
added since the last summation.
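e.g. keep the previous totals in a summary table and fold in only the
newly added rows (untested; the added_on column is invented for
illustration):

    -- Untested sketch: combine last run's totals with partial sums over
    -- the rows added since then, and re-sum. This works because sum()
    -- partials can always be summed again.
    SELECT key, sum(s1) AS s1
      FROM (SELECT key, s1 FROM summary
            UNION ALL
            SELECT key, sum(v1) FROM big
             WHERE added_on > '2006-01-15' GROUP BY key
           ) both_runs
     GROUP BY key;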

Best Regards,
Simon Riggs


