Thread: Ordered Append Node

Ordered Append Node

From
Gregory Stark
Date:
I've been hacking on the idea of an Append node which maintains the ordering
of its subtables, merging their records in order. This is important for
partitioned tables, since otherwise a lot of plans, such as merge joins, are
not available.

The logic I've followed is as follows:

1) Go through all subrels asking for any interesting pathkey lists. Gather up
   the union of all of these.

2) Go back through all the subrels and for each accumulated pathkey list ask
   for the best path for that subrel for that pathkey list.

3) Generate two append paths for each of these pathkey lists, one using the
   best startup_cost subpath and one with the best total_cost subpath
   available for the pathkey list for each child rel. If there is none
   available take the best unordered path and put a sort node above it.

4) Additionally advertise the traditional unordered append node which our
   parent could choose to put a sort node above, same as ever.

5) Append plan nodes look a lot like Sort plan nodes glued onto an Append plan
   node, with sort function oids and so on.

6) Append exec state nodes look a lot like (a few fields from) a
   tuplesortstate glued onto an Append node. They have the ScanKey array and
   the fmgr sort functions. They also have an array of TupleTableSlot and a
   heap of indexes into that array.

7) ExecAppend does something similar to tuplesort's bounded sort (I fear I'm
   going to get typecasted) or more to the point, similar to the final merge
   of a merge sort. It directly returns the slot the child rel last returned
   to it. (See the sketch below.)
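
To make steps 5 through 7 concrete, here is a toy, self-contained C sketch
of that final-merge loop. It is not the patch's code: plain ints stand in
for TupleTableSlots, and names like MergeState and merge_next are invented
for illustration. The point is the shape of the loop: a binary heap of
input indexes, keyed on each input's current row, from which exhausted
inputs simply drop out.

    /* Toy sketch of the ordered-append merge: a binary heap of input
     * indexes, ordered by each input's current row. All names are
     * hypothetical; ints stand in for tuples. */
    #include <stdio.h>

    typedef struct { const int *rows; int nrows; int pos; } Input;
    typedef struct { Input *inputs; int *heap; int heap_size; } MergeState;

    /* Current (not yet returned) row of input i. */
    static int cur(MergeState *ms, int i) { return ms->inputs[i].rows[ms->inputs[i].pos]; }
    static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    static void sift_down(MergeState *ms, int i)
    {
        for (;;) {
            int l = 2 * i + 1, r = 2 * i + 2, best = i;
            if (l < ms->heap_size && cur(ms, ms->heap[l]) < cur(ms, ms->heap[best])) best = l;
            if (r < ms->heap_size && cur(ms, ms->heap[r]) < cur(ms, ms->heap[best])) best = r;
            if (best == i) return;
            swap(&ms->heap[i], &ms->heap[best]);
            i = best;
        }
    }

    static void heap_insert(MergeState *ms, int input)   /* sift up */
    {
        int i = ms->heap_size++;
        ms->heap[i] = input;
        while (i > 0 && cur(ms, ms->heap[i]) < cur(ms, ms->heap[(i - 1) / 2])) {
            swap(&ms->heap[i], &ms->heap[(i - 1) / 2]);
            i = (i - 1) / 2;
        }
    }

    /* Next row in global order, or -1 once every input is drained. */
    static int merge_next(MergeState *ms)
    {
        if (ms->heap_size == 0) return -1;
        int top = ms->heap[0];                     /* input with smallest row */
        ms->heap[0] = ms->heap[--ms->heap_size];
        sift_down(ms, 0);
        Input *in = &ms->inputs[top];
        int row = in->rows[in->pos++];
        if (in->pos < in->nrows)
            heap_insert(ms, top);                  /* reinsert with its next row */
        /* exhausted inputs drop out: no per-call test for empty children */
        return row;
    }

    int main(void)
    {
        const int a[] = {1, 4, 9}, b[] = {2, 3, 10}, c[] = {5};
        Input inputs[] = {{a, 3, 0}, {b, 3, 0}, {c, 1, 0}};
        int heapbuf[3];
        MergeState ms = {inputs, heapbuf, 0};
        for (int i = 0; i < 3; i++)
            heap_insert(&ms, i);                   /* prime with each first row */
        for (int row; (row = merge_next(&ms)) != -1; )
            printf("%d ", row);                    /* prints: 1 2 3 4 5 9 10 */
        printf("\n");
        return 0;
    }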
 


Open questions:

1) I still haven't completely figured out what to do with equivalence classes.
   My hack of just stuffing all the append subrel vars into there seems to
   work fine. I need to understand what's going on to see if there's really a
   problem with it or not.

2) I'm not sure this code will work when the append rel is a target (ie UPDATE
   and DELETE stmts).

3) It does seem to work when the columns in the subrels don't line up but I
   didn't do anything special to handle this case.

4) I haven't handled mark/restore or random access. I think they could be
   handled and they'll probably be worth the complexity but I'm not sure.

5) Is it considering too many paths? Are there some which maybe aren't worth
   considering? For example, maybe it only makes sense to take best
   startup_cost paths since if that plan doesn't dominate then the best
   total_cost plan is likely to be the sequential scans + unordered append +
   sort.

6) I haven't looked at setops yet but this is particularly attractive for the
   UNION (as opposed to UNION ALL) case.

7) I copied/adapted a bunch of bits from tuplesort to maintain the heap and do
   the scankey comparisons. I could refactor that code back into tuplesort but
   it would mean twisting around tuplesort quite a bit. Essentially it would
   mean introducing a new type of tuplesort which would start off in
   FINAL_MERGE state, only it would have to behave differently since we don't
   want it prefetching lots of records like FINAL_MERGE does, I don't think.
 

Some example plans (though note that the first was with a lot of enable_*
parameters set to off):

                                   QUERY PLAN
-------------------------------------------------------------------------------
 Merge Join  (cost=76.03..107.63 rows=12 width=12) (actual time=0.435..0.706 rows=11 loops=1)
   Merge Cond: (public.z.i = x.i)
   ->  Append  (cost=22.36..53.66 rows=12 width=8) (actual time=0.365..0.440 rows=12 loops=1)
         ->  Index Scan using zi on z z  (cost=0.00..8.27 rows=1 width=8) (actual time=0.089..0.091 rows=1 loops=1)
         ->  Index Scan using z1i on z1 z  (cost=0.00..12.28 rows=2 width=8) (actual time=0.063..0.068 rows=2 loops=1)
         ->  Index Scan using z2i on z2 z  (cost=0.00..12.30 rows=3 width=8) (actual time=0.060..0.066 rows=3 loops=1)
         ->  Index Scan using z3i on z3 z  (cost=0.00..12.33 rows=5 width=8) (actual time=0.059..0.070 rows=5 loops=1)
         ->  Index Scan using zzi on zz z  (cost=0.00..8.27 rows=1 width=8) (actual time=0.076..0.079 rows=1 loops=1)
   ->  Materialize  (cost=53.67..53.79 rows=12 width=8) (actual time=0.051..0.170 rows=12 loops=1)
         ->  Append  (cost=22.36..53.66 rows=12 width=8) (actual time=0.036..0.104 rows=12 loops=1)
               ->  Index Scan using zi on z x  (cost=0.00..8.27 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=1)
               ->  Index Scan using z1i on z1 x  (cost=0.00..12.28 rows=2 width=8) (actual time=0.004..0.009 rows=2 loops=1)
               ->  Index Scan using z2i on z2 x  (cost=0.00..12.30 rows=3 width=8) (actual time=0.005..0.014 rows=3 loops=1)
               ->  Index Scan using z3i on z3 x  (cost=0.00..12.33 rows=5 width=8) (actual time=0.004..0.016 rows=5 loops=1)
               ->  Index Scan using zzi on zz x  (cost=0.00..8.27 rows=1 width=8) (actual time=0.005..0.008 rows=1 loops=1)
 Total runtime: 0.951 ms
 



postgres=# explain analyze select * from t order by i limit 1;
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Limit  (cost=805.41..805.47 rows=1 width=4) (actual time=44.206..44.208 rows=1 loops=1)
   ->  Result  (cost=805.41..18309.89 rows=290003 width=4) (actual time=44.201..44.201 rows=1 loops=1)
         ->  Append  (cost=805.41..18309.89 rows=290003 width=4) (actual time=44.196..44.196 rows=1 loops=1)
               ->  Sort  (cost=1.02..1.02 rows=1 width=4) (actual time=0.034..0.034 rows=1 loops=1)
                     Sort Key: public.t.i
                     Sort Method:  quicksort  Memory: 17kB
                     ->  Seq Scan on t  (cost=0.00..1.01 rows=1 width=4) (actual time=0.010..0.013 rows=1 loops=1)
               ->  Index Scan using ti1 on t1 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.033..0.033 rows=1 loops=1)
               ->  Index Scan using ti2 on t2 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.028..0.028 rows=1 loops=1)
               ->  Index Scan using ti3 on t3 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti4 on t4 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.028..0.028 rows=1 loops=1)
               ->  Index Scan using ti5 on t5 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.028..0.028 rows=1 loops=1)
               ->  Index Scan using ti6 on t6 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.031..0.031 rows=1 loops=1)
               ->  Index Scan using ti7 on t7 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti8 on t8 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti9 on t9 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.028..0.028 rows=1 loops=1)
               ->  Index Scan using ti0 on t0 t  (cost=0.00..289.27 rows=10001 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti11 on t11 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti12 on t12 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti13 on t13 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.028..0.028 rows=1 loops=1)
               ->  Index Scan using ti14 on t14 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti15 on t15 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti16 on t16 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti17 on t17 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.028..0.028 rows=1 loops=1)
               ->  Index Scan using ti18 on t18 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti19 on t19 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti20 on t20 t  (cost=0.00..289.27 rows=10001 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti21 on t21 t  (cost=0.00..739.76 rows=20000 width=4) (actual time=0.028..0.028 rows=1 loops=1)
               ->  Index Scan using ti22 on t22 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti23 on t23 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.036..0.036 rows=1 loops=1)
               ->  Index Scan using ti24 on t24 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti26 on t26 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti27 on t27 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Index Scan using ti28 on t28 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
               ->  Sort  (cost=804.39..829.39 rows=10000 width=4) (actual time=43.336..43.336 rows=1 loops=1)
                     Sort Key: public.t.i
                     Sort Method:  quicksort  Memory: 647kB
                     ->  Seq Scan on t29 t  (cost=0.00..140.00 rows=10000 width=4) (actual time=0.017..21.192 rows=10000 loops=1)
 Total runtime: 44.737 ms
 


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!


Re: Ordered Append Node

From
Markus Schiltknecht
Date:
Hello Gregory,

Gregory Stark wrote:
> I've been hacking on the idea of an Append node which maintains the ordering
> of its subtables, merging their records in order. This is important for
> partitioned tables, since otherwise a lot of plans, such as merge joins, are
> not available.

Cool!

Some time ago I tried something very similar, but didn't get very far. I'd
like to share my thoughts anyway.

First off, I envisioned that append node also being applicable to unordered
reads from multiple partitions, simply interleaving between the partitions.

> The logic I've followed is to do as follows:
> 
> 1) Go through all subrels asking for any interesting pathkey lists. Gather up
>    the union of all of these.

I also tried modifying the Append node first, then figured that it might be
better to base it on the merge join node instead. While this seems farther
away, I had the hope that a binary tree of such 'plain merge' nodes would
require fewer comparisons in total.

Plus, I found it a lot simpler to cope with exactly two input relations
instead of n, as with the append node. :-)

> 2) Go back through all the subrels and for each accumulated pathkey list ask
>    for the best path for that subrel for that pathkey list. 
> 
> 3) Generate two append paths for each of these pathkey lists, one using the
>    best startup_cost subpath and one with the best total_cost subpath
>    available for the pathkey list for each child rel. If there is none
>    available take the best unordered path and put a sort node above it.
> 
> 4) Additionally advertise the traditional unordered append node which our
>    parent could choose to put a sort node above same as ever.
> 
> 5) Append plan nodes look a lot like Sort plan nodes glued onto an Append plan
>    node, with sort function oids and so on.
> 
> 6) Append exec state nodes look a lot like (a few fields from) a
>    tuplesortstate glued onto an Append node. They have the ScanKey array and
>    the fmgr sort functions. They also have an array of TupleTableSlot and a
>    heap of indexes into that array.
> 
> 7) ExecAppend does something similar to tuplesort's bounded sort (I fear I'm
>    going to get typecasted) or more to the point, similar to the final merge
>    of a merge sort. It directly returns the slot the child rel last returned
>    to it.

Uh.. well, yeah. I guess you have a better understanding of the planner
and executor than I do.

> Open questions:
> 
> 1) I still haven't completely figured out what to do with equivalence classes.
>    My hack of just stuffing all the append subrel vars into there seems to
>    work fine. I need to understand what's going on to see if there's really a
>    problem with it or not.
> 
> 2) I'm not sure this code will work when the append rel is a target (ie UPDATE
>    and DELETE stmts).
> 
> 3) It does seem to work when the columns in the subrels don't line up but I
>    didn't do anything special to handle this case.

Uh.. is there any use case for that? WRT partitioning certainly not, is 
there?

I'll have a look at your patch.

Regards

Markus



Re: Ordered Append Node

From
Gregory Stark
Date:
"Markus Schiltknecht" <markus@bluegap.ch> writes:

>> 1) Go through all subrels asking for any interesting pathkey lists. Gather up
>>    the union of all of these.
>
> I also tried modifying the Append node first, then figured that it might be
> better to base it on the merge join node instead. While this seems farther
> away, I had the hope that a binary tree of such 'plain merge' nodes would
> require fewer comparisons in total.

It is kind of like a merge join but not quite. It's interleaving rows rather
than matching them up. It's more like the final merge of a sort which also
uses a heap to efficiently find the next value from the source tapes.

>> 3) It does seem to work when the columns in the subrels don't line up but I
>>    didn't do anything special to handle this case.
>
> Uh.. is there any use case for that? WRT partitioning certainly not, is there?

Not necessarily but it is something Postgres supports and I don't think we
want to break it. Actually it's useful for partitioned tables if you build the
new partition in a separate table and then add it to the partitioned table. In
that case you may have gone through several steps of adding columns and
dropping them to get the structure to line up.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!


Re: Ordered Append Node

From
Markus Schiltknecht
Date:
Hello Gregory,

Gregory Stark wrote:
> It is kind of like a merge join but not quite. It's interleaving rows rather
> than matching them up. It's more like the final merge of a sort which also
> uses a heap to efficiently find the next value from the source tapes.

Well, maybe my point here is: why do you need the heap to sort? The 
inputs you've got are already sorted, you only need to merge them. To me 
this sounds very much like the final stage of a merge sort, which would 
not require much memory.

IMO, such a merge could more easily be implemented by a binary tree of
'zipper' nodes, as I had in mind, leading to a plan like this (well, hey,
this is all made up):

 Zipper  (cost ..., sort by public.t.i)
   ->  Zipper  (cost ..., sort by public.t.i)
         ->  Zipper  (cost ..., sort by public.t.i)
               ->  Index Scan using ti1 on t1
               ->  Index Scan using ti2 on t2
         ->  Index Scan using ti3 on t3
   ->  Zipper  (cost ..., sort by public.t.i)
         ->  Index Scan using ti4 on t4
         ->  Index Scan using ti5 on t5
 

Every zipper node would simply have to keep the two top tuples from its 
inputs in memory, compare them and return the best.
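
For illustration, a toy C sketch of such a zipper (this is not proposed
executor code: all names are invented, ints stand in for tuples, and -1
marks end of input):

    /* Sketch of the "zipper" idea: a binary merge node buffering one
     * tuple per input. Compile with: cc zipper.c */
    #include <stdio.h>

    typedef struct Node Node;
    struct Node { int (*next)(Node *); };

    /* Leaf: walks a pre-sorted array, standing in for an index scan. */
    typedef struct { Node base; const int *rows; int n, pos; } Scan;
    static int scan_next(Node *self)
    {
        Scan *s = (Scan *) self;
        return s->pos < s->n ? s->rows[s->pos++] : -1;
    }

    /* Zipper: keeps the top tuple of each input and returns the smaller. */
    typedef struct { Node base; Node *l, *r; int lval, rval; } Zipper;
    static int zipper_next(Node *self)
    {
        Zipper *z = (Zipper *) self;
        int result;
        if (z->lval < 0 && z->rval < 0)
            return -1;                               /* both inputs done */
        if (z->rval < 0 || (z->lval >= 0 && z->lval <= z->rval)) {
            result = z->lval;
            z->lval = z->l->next(z->l);              /* refill left buffer */
        } else {
            result = z->rval;
            z->rval = z->r->next(z->r);              /* refill right buffer */
        }
        return result;
    }
    static void zipper_init(Zipper *z, Node *l, Node *r)
    {
        z->base.next = zipper_next;
        z->l = l; z->r = r;
        z->lval = l->next(l);                        /* prime both buffers */
        z->rval = r->next(r);
    }

    int main(void)
    {
        const int t1[] = {1, 5}, t2[] = {2, 6}, t3[] = {3, 4};
        Scan s1 = {{scan_next}, t1, 2, 0};
        Scan s2 = {{scan_next}, t2, 2, 0};
        Scan s3 = {{scan_next}, t3, 2, 0};
        Zipper z12, ztop;                            /* (t1 zip t2) zip t3 */
        zipper_init(&z12, &s1.base, &s2.base);
        zipper_init(&ztop, &z12.base, &s3.base);
        for (int v; (v = ztop.base.next(&ztop.base)) != -1; )
            printf("%d ", v);                        /* prints: 1 2 3 4 5 6 */
        printf("\n");
        return 0;
    }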

But maybe that's exactly how Knuth's polyphase merge sort internally merges
its inputs (or runs). And perhaps it makes sense to show the user just one
simple append node instead of throwing a tree of Zipper nodes at her. ;-)

> Not necessarily but it is something Postgres supports and I don't think we
> want to break it. Actually it's useful for partitioned tables if you build the
> new partition in a separate table and then add it to the partitioned table. In
> that case you may have gone through several steps of adding columns and
> dropping them to get the structure to line up.

Agreed, especially because lining up the columns isn't that hard after all.

OTOH I think Postgres is way too flexible in how it allows partitioning 
to be done and thus it often can't optimize it properly. I'd very much 
like to teach it a stricter and simpler to use partitioning scheme than 
what we have with constraint exclusion.

But that's pipe dreaming, and your improvement to the append node is
certainly a good step in the right direction. Keep up the good work!

Regards

Markus



Re: Ordered Append Node

From
Florian Weimer
Date:
* Markus Schiltknecht:

>> uses a heap to efficiently find the next value from the source tapes.
>
> Well, maybe my point here is: why do you need the heap to sort?

I think you need it because there are potentially many input types.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: Ordered Append Node

From
Markus Schiltknecht
Date:
Hi,

Florian Weimer wrote:
> I think you need it because there are potentially many input types.

Given the partitioning case, I'd expect all rows to have an equal tuple 
descriptor. Maybe this is a matter of what to optimize, then?

Could you elaborate on what use case you have in mind?

Regards

Markus


Re: Ordered Append Node

From
Florian Weimer
Date:
* Markus Schiltknecht:

> Florian Weimer wrote:
>> I think you need it because there are potentially many input types.

Eh, "tapes".

> Given the partitioning case, I'd expect all rows to have an equal
> tuple descriptor. Maybe this is a matter of what to optimize, then?
>
> Could you elaborate on what use case you have in mind?

You need a priority queue to figure out from which tape (partition)
you need to remove the next tuple.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: Ordered Append Node

From
Markus Schiltknecht
Date:
Hi,

Florian Weimer wrote:
>> Florian Weimer wrote:
>>> I think you need it because there are potentially many input types.
> 
> Eh, "tapes".

Aha..

>> Given the partitioning case, I'd expect all rows to have an equal
>> tuple descriptor. Maybe this is a matter of what to optimize, then?
>>
>> Could you elaborate on what use case you have in mind?
> 
> You need a priority queue to figure out from which tape (partition)
> you need to remove the next tuple.

And why do you need lots of heap memory to do that? Anything wrong with 
the zipper approach I've outlined upthread?

Am I missing something?

Regards

Markus





Re: Ordered Append Node

From
Heikki Linnakangas
Date:
Markus Schiltknecht wrote:
> Florian Weimer wrote:
>>> Given the partitioning case, I'd expect all rows to have an equal
>>> tuple descriptor. Maybe this is a matter of what to optimize, then?
>>>
>>> Could you elaborate on what use case you have in mind?
>>
>> You need a priority queue to figure out from which tape (partition)
>> you need to remove the next tuple.
> 
> And why do you need lots of heap memory to do that? Anything wrong with 
> the zipper approach I've outlined upthread?

We're talking about a binary heap, with just one node per partition. 
AFAICT it's roughly the same data structure as the zipper tree you 
envisioned, but not implemented with separate executor nodes for each level.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Ordered Append Node

From
Florian Weimer
Date:
* Markus Schiltknecht:

>> You need a priority queue to figure out from which tape (partition)
>> you need to remove the next tuple.
>
> And why do you need lots of heap memory to do that? Anything wrong
> with the zipper approach I've outlined upthread?

"heap" == "priority queue" here, I guess.  Looking at your zipper
again, it's actually an implementation of a heap.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: Ordered Append Node

From
Markus Schiltknecht
Date:
Hi,

Heikki Linnakangas wrote:
> AFAICT it's roughly the same data structure as the zipper tree you 
> envisioned, but not implemented with separate executor nodes for each 
> level.

Aha, that sounds good to me. ;-)

As I've already mentioned, I think it's even better to simply show the 
user a single append node, instead of lots of small nodes from a binary 
tree merger.

Regards

Markus


Re: Ordered Append Node

From
Gregory Stark
Date:
"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

> Markus Schiltknecht wrote:
>> Florian Weimer wrote:
>>>> Given the partitioning case, I'd expect all rows to have an equal
>>>> tuple descriptor. Maybe this is a matter of what to optimize, then?
>>>>
>>>> Could you elaborate on what use case you have in mind?
>>>
>>> You need a priority queue to figure out from which tape (partition)
>>> you need to remove the next tuple.
>>
>> And why do you need lots of heap memory to do that? Anything wrong with the
>> zipper approach I've outlined upthread?
>
> We're talking about a binary heap, with just one node per partition. AFAICT
> it's roughly the same data structure as the zipper tree you envisioned, but not
> implemented with separate executor nodes for each level.

Not quite the same since the Executor-based implementation would have a static
tree structure based on the partitions. Even if the partitions are all empty
except for one or two you would still have to push the result records through
all the nodes for the empty partitions.

A heap only has the next record from each input. If an input has reached eof
then the heap doesn't have an entry for that input. So if one of the inputs is
empty (often the parent of the inheritance tree) it doesn't require a test
anywhere to propagate the record up past it.

I also did an optimization similar to the bounded-sort case where we check if
the next tuple from the same input which last contributed the result record
comes before the top element of the heap. That avoids having to do an insert
and siftup only to pull out the same record you just inserted. In theory this
is not an optimization but in practice I think partitioned tables will often
contain contiguous blocks of key values and queries will often be joining
against that key and therefore often want to order by it.
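
In the toy terms of the merge sketch upthread (equally hypothetical names,
not the patch's code), that check might look like this:

    /* Reuses the hypothetical MergeState/cur/heap_insert/sift_down from the
     * sketch upthread. '*last' is the input that produced the previous row
     * and is currently kept out of the heap; pass -1 on the first call. */
    static int merge_next_shortcut(MergeState *ms, int *last)
    {
        if (*last >= 0) {
            Input *in = &ms->inputs[*last];
            if (in->pos < in->nrows &&
                (ms->heap_size == 0 || cur(ms, *last) <= cur(ms, ms->heap[0])))
                return in->rows[in->pos++];  /* still sorts first: skip the heap */
            if (in->pos < in->nrows)
                heap_insert(ms, *last);      /* lost the race: back into the heap */
            *last = -1;
        }
        if (ms->heap_size == 0)
            return -1;
        *last = ms->heap[0];                 /* new winner stays out of the heap */
        ms->heap[0] = ms->heap[--ms->heap_size];
        sift_down(ms, 0);
        return ms->inputs[*last].rows[ms->inputs[*last].pos++];
    }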

Ideally we would also be able to do this in the planner. If the planner could
determine from the constraints that all the key values from each partition are
non-overlapping and order them properly then it could generate a regular
append node with a path order without the overhead of the run-time
comparisons.

But that requires a) dealing with the problem of the parent table which has no
constraints and b) an efficient way to prove that constraints don't overlap
and order them properly. The latter looks like an O(n^2) problem to me, though
it's a planning problem which might be worth making slow in exchange for even
a small speedup at run-time.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!


Re: Ordered Append Node

From
Csaba Nagy
Date:
On Fri, 2007-11-23 at 12:36 +0000, Gregory Stark wrote:
> I also did an optimization similar to the bounded-sort case where we check if
> the next tuple from the same input which last contributed the result record
> comes before the top element of the heap. That avoids having to do an insert
> and siftup only to pull out the same record you just inserted. In theory this
> is not an optimization but in practice I think partitioned tables will often
> contain contiguous blocks of key values and queries will often be joining
> against that key and therefore often want to order by it.

If it is an option, you could also do this with a new method on the heap
which adds a new entry and removes the resulting new head in one atomic
operation. That would work with one single comparison in the
less-than-current-head situation, and it would not need to repeat that
comparison if that fails. Also it could directly remove the head and
balance the tree in one go.
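
In the same toy terms as the sketches upthread (hypothetical names
throughout), such a fused operation is essentially what Python's heapq
calls heappushpop:

    /* Push 'input' and pop the resulting head in a single pass, reusing the
     * hypothetical MergeState/cur/sift_down from upthread. If the new entry
     * sorts at or before the current head, one comparison suffices and the
     * heap is untouched; otherwise the head is replaced in place and a
     * single sift-down rebalances the tree (no separate insert + siftup). */
    static int heap_pushpop(MergeState *ms, int input)
    {
        if (ms->heap_size == 0 || cur(ms, input) <= cur(ms, ms->heap[0]))
            return input;                    /* new entry wins immediately */
        int head = ms->heap[0];
        ms->heap[0] = input;                 /* replace head in place */
        sift_down(ms, 0);                    /* one sift-down, tree balanced */
        return head;
    }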

Cheers,
Csaba.




Re: Ordered Append Node

From
Markus Schiltknecht
Date:
Gregory Stark wrote:
> Not quite the same since the Executor-based implementation would have a static
> tree structure based on the partitions. Even if the partitions are all empty
> except for one or two you would still have to push the result records through
> all the nodes for the empty partitions.
> 
> A heap only has the next record from each input. If an input has reached eof
> then the heap doesn't have an entry for that input. So if one of the inputs is
> empty (often the parent of the inheritance tree) it doesn't require a test
> anywhere to propagate the record up past it.

Ah, so the binary tree (binary heap?) gets adjusted dynamically. Very
clever! (OTOH probably a micro-optimization, but as the code is already
there, use it, yeah!)

> I also did an optimization similar to the bounded-sort case where we check if
> the next tuple from the same input which last contributed the result record
> comes before the top element of the heap. That avoids having to do an insert
> and siftup only to pull out the same record you just inserted. In theory this
> is not an optimization but in practice I think partitioned tables will often
> contain contiguous blocks of key values and queries will often be joining
> against that key and therefore often want to order by it.

Hm... that assumes range partitioning, no? If you partition among three 
partitions by "id modulo 3", tuples are most probably coming from one 
partition after the other, i.e.:  1 2 3 1 2 3 1 2 3 ...

And with hash partitioning, you're completely unable to predict the 
ordering.

> Ideally we would also be able to do this in the planner. If the planner could
> determine from the constraints that all the key values from each partition are
> non-overlapping and order them properly then it could generate a regular
> append node with a path order without the overhead of the run-time
> comparisons.

Agreed.

> But that requires a) dealing with the problem of the parent table which has no
> constraints and b) an efficient way to prove that constraints don't overlap
> and order them properly. The latter looks like an O(n^2) problem to me, though
> it's a planning problem which might be worth making slow in exchange for even
> a small speedup at run-time.

Well, I think someday Postgres needs better support for horizontal data
partitioning in general. Dealing with constraints and inheritance is way
too flexible and prone to error. I'll shortly start a new thread outlining
my current thoughts on that topic.

Regards

Markus


Re: Ordered Append Node

From
Heikki Linnakangas
Date:
Gregory Stark wrote:
> Ideally we would also be able to do this in the planner. If the planner could
> determine from the constraints that all the key values from each partition are
> non-overlapping and order them properly then it could generate a regular
> append node with a path order without the overhead of the run-time
> comparisons.
> 
> But that requires a) dealing with the problem of the parent table which has no
> constraints and ...

It would help even if you could only prove it for some partitions. You 
could use a regular append node for the partitions you know not to 
overlap, and merge the output of that with the new kind of ordered 
append node. We'd see that the parent table is empty on the first 
invocation, and after that the ordered append node would just pass 
through the tuples.
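
Made up in the spirit of the earlier example plans, that mixed plan might
look something like:

 Ordered Append  (sort by public.t.i)
   ->  Append                           (partitions proven non-overlapping)
         ->  Index Scan using ti1 on t1
         ->  Index Scan using ti2 on t2
         ->  Index Scan using ti3 on t3
   ->  Index Scan using ti9 on t9       (not provable, merged at run time)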

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Ordered Append Node

From
"Zeugswetter Andreas ADI SD"
Date:
> > But that requires a) dealing with the problem of the parent table
> > which has no constraints and ...

Imho we should provide a mechanism that forces the parent to be empty
and lets the planner know, e.g. a false constraint on parent ONLY.

Andreas



Re: Ordered Append Node

From
Jens-Wolfhard Schicke
Date:

Gregory Stark wrote:
> But that requires a) dealing with the problem of the parent table which has no
> constraints and b) an efficient way to prove that constraints don't overlap
> and order them properly. The latter looks like an O(n^2) problem to me, though
> it's a planning problem which might be worth making slow in exchange for even
> a small speedup at run-time.

Is it really worthwhile to optimize away the heap access by thinking about
what the child tables hold? If the tables are read using IO, I think the
complete plan would turn out to be IO-bound, and the heap is of no interest.
If the tables reside in memory, the heap still only slows the process down
by O(log(<number of tables>)), which usually won't be that much imho.

Nonetheless, in the case of range partitioning, a sort on the lower ends of
the ranges and a linear test of neighbouring ranges for overlap, skipping
empty ranges, should work in O(n log(n)).
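
A toy C sketch of that test (invented names; ranges as half-open [lo, hi)
pairs standing in for partition constraints):

    #include <stdbool.h>
    #include <stdlib.h>

    typedef struct { int lo, hi; } Range;      /* half-open: [lo, hi) */

    static int by_lower(const void *a, const void *b)
    {
        const Range *ra = a, *rb = b;
        return (ra->lo > rb->lo) - (ra->lo < rb->lo);
    }

    /* True if the non-empty ranges can be ordered without overlap:
     * sort by lower bound, O(n log n), then one linear pass. */
    static bool ranges_disjoint(Range *r, int n)
    {
        bool have_prev = false;
        int  prev_hi = 0;
        qsort(r, n, sizeof(Range), by_lower);
        for (int i = 0; i < n; i++) {
            if (r[i].lo >= r[i].hi)
                continue;                      /* skip empty range */
            if (have_prev && r[i].lo < prev_hi)
                return false;                  /* neighbouring ranges overlap */
            prev_hi = r[i].hi;
            have_prev = true;
        }
        return true;
    }
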
  Jens-W. Schicke
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHRu2xzhchXT4RR5ARAu//AKCZWZj680RhnbivbU/UqLBvsigBggCgq0Tw
GB+OYl0VOidmzVcK6ckhFBw=
=gbt7
-----END PGP SIGNATURE-----


Re: Ordered Append Node

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Heikki Linnakangas" <heikki@enterprisedb.com> writes:
>> Markus Schiltknecht wrote:
>>> And why do you need lots of heap memory to do that? Anything wrong with the
>>> zipper approach I've outlined upthread?
>> 
>> We're talking about a binary heap, with just one node per partition. AFAICT
>> it's roughly the same data structure as the zipper tree you envisioned, but not
>> implemented with separate executor nodes for each level.

> Not quite the same since the Executor-based implementation would have a static
> tree structure based on the partitions. Even if the partitions are all empty
> except for one or two you would still have to push the result records through
> all the nodes for the empty partitions.

Also, the overhead per executor node visit is not exactly trivial.
I think that "zipper" scheme would be quite slow compared to a standard
heap merge within a single node.
        regards, tom lane


Re: Ordered Append Node

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> I've been hacking on the idea of an Append node which maintains the ordering
> of its subtables, merging their records in order.

I finally got round to looking at this ...

> 1) I still haven't completely figured out what to do with equivalence classes.
>    My hack of just stuffing all the append subrel vars into there seems to
>    work fine. I need to understand what's going on to see if there's really a
>    problem with it or not.

This still makes me itchy, but it might be all right, because we'd never
really consider a plan for a single subnode as representing a useful
plan for the whole query.  What it would need is some additions to the
README files (ahem) describing what's going on.

> 2) I'm not sure this code will work when the append rel is a target (ie UPDATE
>    and DELETE stmts).

It won't, at least not for the case where the appendrel is an
inheritance tree in which the children are unlike the parent.
You have missed updating the estate->es_result_relation_info and
estate->es_junkFilter to match the tuple actually returned.  In a plain
Append it is only necessary to update those when switching from one
subnode to the next, so we handle it in exec_append_initialize_next.
In an ordered Append you'd have to track which subnode each entry in the
heap came from (easy, but you aren't doing it now) and update those
fields for every tuple returned.

If you need an example of what I'm talking about:
    create table p1(f1 int, f2 int);
    create table p2(f3 int, f4 int);
    create table c() inherits(p1, p2);
    ... insert some data ...
    update p2 set ...
 

Of course a plain UPDATE p2 wouldn't have much interest in
ordered-Append plans, but you could force the failure by doing 
a joining update with a mergejoin plan.

> 4) I haven't handled mark/restore or random access. I think they could be
>    handled and they'll probably be worth the complexity but I'm not sure.

I don't think it'd be nearly as easy as you think, eg just how are you
going to "back up" the merge?  A heap merge doesn't normally remember
the past few tuples it emitted.  I'd definitely recommend not trying
that in v1.

BTW, I concur with Heikki's suggestion that this might be better done
as a separate executor node type.  There'd be a certain amount of
duplicated boilerplate, but that would force you to look at each and
every reference to Append nodes and decide what's needed.  There are a
whole bunch of places you've missed touching that know something about
what Append nodes can do, and you'll need to look at them all in any
case.

> 5) Is it considering too many paths?

Hard to say for sure, but I concur that that needs thinking about.
One point is that as soon as you have even one subnode that hasn't
got a cheap-startup-cost path, it probably stops being interesting
to try to build a cheap-startup-cost ordered-Append path.  But
exactly what's the threshold of "cheap enough", I'm not sure.
        regards, tom lane


Re: Ordered Append Node

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> I've been hacking on the idea of an Append node which maintains the ordering
>> of its subtables merging their records in order.
>
> I finally got round to looking at this ...

A lot of things to chew on. Thanks very much.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning