Re: Ordered Append Node - Mailing list pgsql-hackers

From Gregory Stark
Subject Re: Ordered Append Node
Date
Msg-id 87ve7ti1fi.fsf@oxford.xeocode.com
Whole thread Raw
In response to Re: Ordered Append Node  (Heikki Linnakangas <heikki@enterprisedb.com>)
Responses Re: Ordered Append Node  (Csaba Nagy <nagy@ecircle-ag.com>)
Re: Ordered Append Node  (Markus Schiltknecht <markus@bluegap.ch>)
Re: Ordered Append Node  (Heikki Linnakangas <heikki@enterprisedb.com>)
Re: Ordered Append Node  (Jens-Wolfhard Schicke <drahflow@gmx.de>)
Re: Ordered Append Node  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> Markus Schiltknecht wrote:
>> Florian Weimer wrote:
>>>> Given the partitioning case, I'd expect all rows to have an equal
>>>> tuple descriptor. Maybe this is a matter of what to optimize, then?
>>>>
>>>> Could you elaborate on what use case you have in mind?
>>>
>>> You need a priority queue to figure out from which tape (partition)
>>> you need to remove the next tuple.
>>
>> And why do you need lots of heap memory to do that? Anything wrong with the
>> zipper approach I've outlined upthread?
>
> We're talking about a binary heap, with just one node per partition. AFAICT
> it's roughly the same data structure as the zipper tree you envisioned, but not
> implemented with separate executor nodes for each level.

Not quite the same since the Executor-based implementation would have a static
tree structure based on the partitions. Even if the partitions are all empty
except for one or two you would still have to push the result records through
all the nodes for the empty partitions.

A heap only has the next record from each input. If an input has reached eof
then the heap doesn't have an entry for that input. So if one of the inputs is
empty (often the parent of the inheritance tree) it doesn't require a test
anywhere to propagate the record up past it.

I also did an optimization similar to the bounded-sort case where we check if
the next tuple from the same input which last contributed the result record
comes before the top element of the heap. That avoids having to do an insert
and siftup only to pull out the same record you just inserted. In theory this
is not an optimization but in practice I think partitioned tables will often
contain contiguous blocks of key values and queries will often be joining
against that key and therefore often want to order by it.

Ideally we would also be able to do this in the planner. If the planner could
determine from the constraints that all the key values from each partition are
non-overlapping and order them properly then it could generate a regular
append node with a path order without the overhead of the run-time
comparisons.

But that requires a) dealing with the problem of the parent table which has no
constraints and b) an efficient way to prove that constraints don't overlap
and order them properly. The latter looks like an O(n^2) problem to me, though
it's a planning problem which might be worth making slow in exchange for even
a small speedup at run-time.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication
support!


pgsql-hackers by date:

Previous
From: Zdenek Kotala
Date:
Subject: Re: 8.3beta3: Compile Warnings
Next
From: Csaba Nagy
Date:
Subject: Re: Ordered Append Node