Re: queries with lots of UNIONed relations - Mailing list pgsql-performance

From Jon Nelson
Subject Re: queries with lots of UNIONed relations
Date
Msg-id AANLkTik3580kzdmKLy66ZX-RFufSXJ4iMQUWsabTHqVb@mail.gmail.com
Whole thread Raw
In response to Re: queries with lots of UNIONed relations  (Jon Nelson <jnelson+pgsql@jamponi.net>)
List pgsql-performance
On Fri, Jan 14, 2011 at 2:11 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> On Thu, Jan 13, 2011 at 6:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Jon Nelson <jnelson+pgsql@jamponi.net> writes:
>>> On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> If you have enough memory to de-dup them individually, you surely have
>>>> enough to de-dup all at once.
>>
>>> If everything were available up-front, sure.
>>> However, and please correct me if I'm wrong, but doesn't postgresql
>>> work in a fairly linear fashion, moving from table to table performing
>>> a series of operations on each?
>>
>> Doing a single sort+uniq works like that.  But the alternate plan you
>> are proposing we should consider involves building all the lower
>> hashtables, and then reading from them to fill the upper hashtable.
>> Max memory consumption *is* worst case here.  Remember HashAggregate
>> is incapable of swapping to disk (and if it did, you wouldn't be nearly
>> as pleased with its performance).
>
> That's not exactly what I'm proposing - but it is probably due to a
> lack of understanding some of the underlying details of how postgresql
> works. I guess I had assumed that the result of a HashAggregate or any
> other de-duplication process was a table-like structure.

And I assumed wrong, I think. I dug into the code (nodeHash.c and
others) and I think I understand now why HashAggregate works only in
certain circumstances, and I think I understand your comments a bit
better now.  Basically, HashAggregate doesn't stream unique Nodes the
way nodeUnique.c does. nodeUnique basically emits Nodes and elides
subsequent, identical Nodes, which is why it relies on the input being
sorted.  HashAggregate works only on entire input sets at once, and
nodeHash.c doesn't emit Nodes at all, really.

This makes me wonder if nodeHash.c and nodeHashjoin.c couldn't be
modified to output Nodes in a streaming fashion. The memory
requirements would not be any worse than now.

Does postgresql support any sort of merge sort?  If it did, then if
the hashtable started consuming too much memory, it could be cleared
and the nodes output from the new hashtable could be directed to
another temporary file, and then a merge sort could be performed on
all of the temporary files (and thus Unique could be used to affect
the UNION operation).

--
Jon

pgsql-performance by date:

Previous
From: Barbara Woolums
Date:
Subject: Problem with query
Next
From: Shaun Thomas
Date:
Subject: Re: Best way to get the latest revision from a table