Re: Large Scale Aggregation (HashAgg Enhancement) - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Large Scale Aggregation (HashAgg Enhancement)
Date
Msg-id 1137400335.3180.140.camel@localhost.localdomain
In response to Large Scale Aggregation (HashAgg Enhancement)  (Rod Taylor <pg@rbt.ca>)
Responses Re: Large Scale Aggregation (HashAgg Enhancement)
Re: Large Scale Aggregation (HashAgg Enhancement)
List pgsql-hackers
On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
> A couple of days ago I found myself wanting to aggregate 3 Billion
> tuples down to 100 Million tuples based on an integer key with six
> integer values -- six sum()'s.
> 
> PostgreSQL ran out of memory with its Hash Aggregator and doing an old
> style Sort & Sum took a fair amount of time to complete (cancelled the
> process after 24 hours -- small machine).

> Spilling to disk would be nice but I suspect the obvious method would
> thrash quite badly with non-sorted input.

There is already hash table overflow (spill to disk) logic in HashJoins,
so this should be possible by reusing that code for HashAggs. That's on
my todo list, but I'd welcome any assistance.
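To sketch the idea in SQL rather than executor code: HashJoin copes with
oversize inputs by splitting them into batches on a hash of the key,
spilling all but the current batch to temp files and processing one
batch at a time. An untested hand-rolled equivalent, assuming a table
big(key int4, v1 int4) and using the built-in hashint4():

    -- Untested sketch: split the input into 4 batches by hash of the
    -- key, aggregate each batch with a hash table small enough to fit
    -- in memory, and concatenate. The batches partition the key space,
    -- so plain UNION ALL is the final answer (no re-aggregation across
    -- batches is needed).
    SELECT key, sum(v1) AS s1 FROM big WHERE (hashint4(key) & 3) = 0 GROUP BY key
    UNION ALL
    SELECT key, sum(v1) AS s1 FROM big WHERE (hashint4(key) & 3) = 1 GROUP BY key
    UNION ALL
    SELECT key, sum(v1) AS s1 FROM big WHERE (hashint4(key) & 3) = 2 GROUP BY key
    UNION ALL
    SELECT key, sum(v1) AS s1 FROM big WHERE (hashint4(key) & 3) = 3 GROUP BY key;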

A question: Are the rows in your 3B-row table clumped together based
upon the 100M-row key? (or *mostly* so) We might also be able to
pre-aggregate the rows using a plan like

    HashAgg
        SortedAgg

or

    SortedAgg
        Sort
            SortedAgg

The first SortedAgg seems superfluous, but would reduce the row volume
considerably if incoming rows were frequently naturally adjacent, even
if the values were not actually sorted. (This could also be done during
sorting, but it's much easier to slot the extra executor step into the
plan.) That might then reduce the size of the later sort, or allow it to
become a HashAgg.
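The reason this is safe for sum() is that partial sums can themselves be
summed. In SQL terms (untested; run_no is a hypothetical marker standing
in for what the bottom SortedAgg does, i.e. numbering each run of
adjacent equal keys):

    -- Untested sketch of why the two-level plan gives the right answer:
    -- the inner level emits one partial sum per run of adjacent equal
    -- keys; the outer level sums the (far fewer) partials per key.
    SELECT key, sum(partial_s1) AS s1
      FROM (SELECT key, sum(v1) AS partial_s1
              FROM big
             GROUP BY key, run_no   -- run_no: hypothetical, one value
                                    -- per run of adjacent equal keys
           ) pre
     GROUP BY key;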

I could make that manually enabled with an "enable_pre_agg" setting, so
we could measure the effectiveness of the technique and decide what cost
model we'd use to make it automatic. Would that help?

> I've written something similar using a client and COPY with temporary
> tables. Even with the Export/Import copy I still beat the Sort&Sum
> method PostgreSQL falls back to.

You can get round this now by chopping the larger table into pieces with
a WHERE clause and then putting them back together with a UNION. If the
table is partitioned, then do this by partitions.
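For example (untested; the table, column names and split point are
invented, assuming the key is an integer):

    -- Untested sketch: aggregate disjoint key ranges separately, so
    -- each piece's sort or hash table fits in memory. The ranges don't
    -- overlap, so UNION ALL of the pieces is already the final result.
    SELECT key, sum(v1) AS s1 FROM big WHERE key <  50000000 GROUP BY key
    UNION ALL
    SELECT key, sum(v1) AS s1 FROM big WHERE key >= 50000000 GROUP BY key;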

This should also help when it comes to recalculating the sums again in
the future, since you'll only need to rescan the rows that have been
added since the last summation.
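e.g. keep the previous totals in a summary table and fold in only the
newly added rows (untested; the added_on column is invented for
illustration):

    -- Untested sketch: combine last run's totals with partial sums over
    -- the rows added since then, and re-sum. This works because sum()
    -- partials can always be summed again.
    SELECT key, sum(s1) AS s1
      FROM (SELECT key, s1 FROM summary
            UNION ALL
            SELECT key, sum(v1) FROM big
             WHERE added_on > '2006-01-15' GROUP BY key
           ) both_runs
     GROUP BY key;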

Best Regards,
Simon Riggs


