Re: Large Scale Aggregation (HashAgg Enhancement) - Mailing list pgsql-hackers

From Rod Taylor
Subject Re: Large Scale Aggregation (HashAgg Enhancement)
Date
Msg-id 1137422526.15377.63.camel@home
In response to Re: Large Scale Aggregation (HashAgg Enhancement)  (Simon Riggs <simon@2ndquadrant.com>)
Responses Re: Large Scale Aggregation (HashAgg Enhancement)
List pgsql-hackers
On Mon, 2006-01-16 at 08:32 +0000, Simon Riggs wrote:
> On Mon, 2006-01-16 at 00:07 -0500, Rod Taylor wrote:
> > A couple of days ago I found myself wanting to aggregate 3 Billion
> > tuples down to 100 Million tuples based on an integer key with six
> > integer values -- six sum()'s.
> > 
> > PostgreSQL ran out of memory with its Hash Aggregator and doing an old
> > style Sort & Sum took a fair amount of time to complete (cancelled the
> > process after 24 hours -- small machine).
> 
> > Spilling to disk would be nice but I suspect the obvious method would
> > thrash quite badly with non-sorted input.
> 
> There is already hash table overflow (spill to disk) logic in HashJoins,
> so this should be possible by reusing that code for HashAggs. That's on
> my todo list, but I'd welcome any assistance.

> A question: Are the rows in your 3 B row table clumped together based
> upon the 100M row key? (or *mostly* so) We might also be able to

They are randomly distributed. Fully sorting the data is quite painful.

> pre-aggregate the rows using a plan like
>     HashAgg
>         SortedAgg
> or
>     SortedAgg
>         Sort
>             SortedAgg
> 
> The first SortedAgg seems superfluous, but would reduce the row volume
> considerably if incoming rows were frequently naturally adjacent, even
> if the values were not actually sorted. (This could also be done during
> sorting, but it's much easier to slot the extra executor step into the
> plan). That might then reduce the size of the later sort, or allow it to
> become a HashAgg.
> 
> I could make that manually enabled using "enable_pre_agg" to allow us to
> measure the effectiveness of that technique and decide what cost model
> we'd use to make it automatic. Would that help?
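A minimal sketch of the pre-aggregation idea above (an editor's illustration in Python, not the proposed executor code; the function name is invented): runs of adjacent equal keys are collapsed into partial sums, so clumped input shrinks even when it is not globally sorted.

```python
def pre_agg(rows):
    """Collapse runs of adjacent rows sharing a key into one partial sum.

    Input need not be sorted; only *naturally adjacent* duplicates merge,
    so the output is still valid partial state for a later HashAgg or
    SortedAgg to finish.
    """
    out = []
    for key, value in rows:
        if out and out[-1][0] == key:
            out[-1] = (key, out[-1][1] + value)  # extend the current run
        else:
            out.append((key, value))             # start a new run
    return out
```

If the input is randomly distributed, as in this thread, the step merges almost nothing, which is why it does not help here.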

I don't understand how this helps. The problem isn't the 3B data source
rows but rather the 100M destination keys that are being aggregated
against.

The memory constraints of HashAgg are a result of the large number of
target keys and would be the same whether the source was 100M rows or 10B rows.

I think I need something closer to:

HashAgg-> HashSort (to disk)

HashSort would create a number of files on disk, each holding "similar"
data: all rows for a given key would land in the same temporary file,
which HashAgg could then process individually (100 loops over 1M target
keys each instead of 1 loop over 100M target keys). The result would be
the same as partitioning by key block and running a HashAgg on each
partition, but it would be handled by the executor rather than by
client-side code.
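That scheme could be sketched as follows (a hypothetical Python illustration, not executor code; names and the partition count are invented). Rows are hashed on the group key into N spill partitions, so every occurrence of a key lands in the same partition and each partition can be aggregated with an in-memory table roughly 1/N of the full size.

```python
from collections import defaultdict

def hash_spill_agg(rows, num_partitions=4):
    """Aggregate (key, value) rows by first hash-partitioning on key."""
    # Phase 1: "spill" each row to the partition chosen by hash(key).
    # In a real executor these would be temporary files on disk.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))

    # Phase 2: HashAgg each partition separately -- one loop per file,
    # holding only that partition's keys in memory at a time.
    result = {}
    for part in partitions:
        table = defaultdict(int)
        for key, value in part:
            table[key] += value   # the sum() transition step
        result.update(table)
    return result
```

Each spill file is read exactly once, so the source data is still scanned only a single time regardless of how many partitions are used.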

> > I've written something similar using a client and COPY with temporary
> > tables. Even with the Export/Import copy I still beat the Sort&Sum
> > method PostgreSQL falls back to.
> 
> You can get round this now by chopping the larger table into pieces with
> a WHERE clause and then putting them back together with a UNION. If the
> table is partitioned, then do this by partitions.

True, except this results in several sequential scans over the source
data. I can extract and sort in a single pass on the client side, but it
would be far better if I could get PostgreSQL to do the same. I could
probably write a plpgsql function with that logic, but it would be quite
messy.
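For reference, the workaround described above might look like this (hypothetical table and column names; one branch per key range, each small enough for HashAgg to fit in memory):

```sql
-- Hypothetical schema: big(k int, v1 int, ...). Each branch scans the
-- table once, so N branches cost N sequential scans of the source data.
SELECT k, sum(v1) FROM big WHERE k BETWEEN 0       AND  999999 GROUP BY k
UNION ALL
SELECT k, sum(v1) FROM big WHERE k BETWEEN 1000000 AND 1999999 GROUP BY k;
-- ... and so on, one branch per key range
```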

> This should also help when it comes to recalculating the sums again in
> the future, since you'll only need to rescan the rows that have been
> added since the last summation.

We store the aggregated results and never do this type of calculation on
that dataset again. The original dataset comes from about 300 partitions
(time and source) and they are removed upon completion. While this
calculation is being performed additional partitions are added.

I suppose I could store the source data in 300 * 1000 partitions (approx.
300 batches times 1000 segments), but that would probably run into other
problems: PostgreSQL likely has issues with that many tables.
