Re: Large Scale Aggregation (HashAgg Enhancement) - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Large Scale Aggregation (HashAgg Enhancement)
Date
Msg-id 1137534189.3180.288.camel@localhost.localdomain
In response to Re: Large Scale Aggregation (HashAgg Enhancement)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Large Scale Aggregation (HashAgg Enhancement)  (Simon Riggs <simon@2ndquadrant.com>)
List pgsql-hackers
On Tue, 2006-01-17 at 14:41 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Mon, 2006-01-16 at 12:36 -0500, Tom Lane wrote:
> >> The tricky part is to preserve the existing guarantee that tuples are
> >> merged into their aggregate in arrival order.
> 
> > You almost had me there... but there isn't any "arrival order".
> 
> The fact that it's not in the spec doesn't mean we don't support it.
> Here are a couple of threads on the subject:
> http://archives.postgresql.org/pgsql-general/2005-11/msg00304.php
> http://archives.postgresql.org/pgsql-sql/2003-06/msg00135.php
> 
> Per the second message, this has worked since 7.4, and it was requested
> fairly often before that.

OK.... My interest was in expanding the role of HashAgg, which as Rod
says can be used to avoid the sort, so the overlap between those ideas
was low anyway.

On Tue, 2006-01-17 at 09:52 -0500, Tom Lane wrote:
> I was thinking along the lines of having multiple temp files per hash
> bucket.  If you have a tuple that needs to migrate from bucket M to
> bucket N, you know that it arrived before every tuple that was assigned
> to bucket N originally, so put such tuples into a separate temp file
> and process them before the main bucket-N temp file.  This might get a
> little tricky to manage after multiple hash resizings, but in principle
> it seems doable.

OK, so we do need to do this when we have a defined arrival order: this
idea is the best one so far. I don't see any way to optimize this by
ignoring the arrival order, so it seems best to preserve the ordering
this way in all cases.

You can manage that with file naming. Rows moved from batch N to batch M
would go into a file named N.M, so you'd be able to use file ordering to
retrieve all files matching *.M.
That scheme would work for multiple splits too, so that filenames could
grow yet retain their sort order and final target batch properties.
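
A rough sketch of that naming scheme, in Python rather than backend C,
purely for illustration. The function names are made up, and the sort
rule is an assumption: components are compared as numbers, and the plain
"M" file is always read last. Whether that order matches arrival order
after every possible sequence of splits is not something this sketch
establishes.

```python
def migrated_name(name: str, target: int) -> str:
    """Name for rows leaving file `name` for batch `target`.

    Hypothetical helper: "1.3" moved to batch 7 becomes "1.3.7".
    """
    return f"{name}.{target}"


def target_batch(name: str) -> int:
    """The final target batch is the last dotted component."""
    return int(name.rsplit(".", 1)[-1])


def _sort_key(name: str):
    parts = tuple(int(p) for p in name.split("."))
    # Assumption: migrated files (more than one component) are read
    # before the main file, and are ordered among themselves by their
    # numeric components, not by plain string comparison.
    return (len(parts) == 1, parts)


def files_for_batch(names, m: int):
    """Return every file matching *.m, in the order it would be read."""
    return sorted((n for n in names if target_batch(n) == m), key=_sort_key)


if __name__ == "__main__":
    names = ["7", "3.7", "1.3.7", "3", "1.3", "5"]
    print(files_for_batch(names, 7))
    print(files_for_batch(names, 3))
```

Comparing components as numbers matters once batch numbers reach two
digits, since a plain string sort would put "10.12" ahead of "9.12".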

Best Regards, Simon Riggs


