Re: queries with lots of UNIONed relations - Mailing list pgsql-performance

From Jon Nelson
Subject Re: queries with lots of UNIONed relations
Date
Msg-id AANLkTinubA5_CFAdERCyzDy1H3v409pyVTzdvPjTaPpr@mail.gmail.com
In response to Re: queries with lots of UNIONed relations  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: queries with lots of UNIONed relations  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jan 13, 2011 at 5:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I don't believe there is any case where hashing each individual relation
>>> is a win compared to hashing them all together.  If the optimizer were
>>> smart enough to be considering the situation as a whole, it would always
>>> do the latter.
>
>> You might be right, but I'm not sure.  Suppose that there are 100
>> inheritance children, and each has 10,000 distinct values, but none of
>> them are common between the tables.  In that situation, de-duplicating
>> each individual table requires a hash table that can hold 10,000
>> entries.  But deduplicating everything at once requires a hash table
>> that can hold 1,000,000 entries.
>
>> Or am I all wet?
>
> If you have enough memory to de-dup them individually, you surely have
> enough to de-dup all at once.  It is not possible for a single hashtable
> to have worse memory consumption than N hashtables followed by a union
> hashtable, and in fact if there are no common values then the latter eats
> twice as much space because every value appears twice in two different
> hashtables.

If everything were available up-front, sure.
However, correct me if I'm wrong, but doesn't PostgreSQL
work in a fairly linear fashion, moving from table to table and
performing a series of operations on each? That seems to be what the
plan indicates:

Compare:

for each table LOOP
  scan table for result rows, append to results
END LOOP
hash / sort + unique results

versus:

for each table LOOP
  scan table for result rows, append to table-results
  hash / sort+unique table-results, append to results
END LOOP
hash / sort + unique results

In the former case, all of the result rows from all tables are
appended together before the de-duplication process can start.

In the latter case, only enough memory for each table's result set is
necessary for de-duplication, and it would only need to be
allocated while processing that table.

Is that not how this works?

--
Jon
