Home > mailing lists
Ordered Append Node - Mailing list pgsql-hackers

From	Gregory Stark
Subject	Ordered Append Node
Date	November 22, 2007 10:40:52
Msg-id	87k5oa7385.fsf@oxford.xeocode.com Whole thread Raw
Responses	Re: Ordered Append Node Re: Ordered Append Node
List	pgsql-hackers
Tree view
I've been hacking on the idea of an Append node which maintains the ordering
of its subtables merging their records in order. This is important for
partitioned tables since otherwise a lot of plans are not available such as
merge joins.

The logic I've followed is to do as follows:

1) Go through all subrels asking for any interesting pathkey lists. Gather up  the union of all of these.

2) Go back through all the subrels and for each accumulated pathkey list ask  for the best path for that subrel for
thatpathkey list. 
 

3) Generate an two append paths for each of these pathkey lists, one using the  best startup_cost subpath and one with
thebest total_cost subpath  available for the pathkey list for each child rel. If there is none  available take the
bestunordered path and put a sort node above it.
 

4) Additionally advertise the traditional unordered append node which our  parent could choose to put a sort node above
sameas ever.
 

5) Append plan nodes look a lot like Sort plan nodes glued onto an Append plan  node, with sort function oids and so
on.

6) Append exec state nodes look a lot like a (a few fields from)  tuplesortstate glued onto an Append node. They have
theScanKey array and  the fmgr sort functions. They also have an array of TupleTableSlot and a  heap of indexes into
thatarray.
 

8) ExecAppend does something similar to tuplesort's bounded sort (I fear I'm  going to get typecasted) or more to the
point,similar to the final merge  of a merge sort. It directly returns the slot the child rel last returned  to it.
 


Open questions:

1) I still haven't completely figured out what to do with equivalence classes.  My hack of just stuffing all the append
subrelvars into there seems to  work fine. I need to understand what's going on to see if there's really a  problem
withit or not.
 

2) I'm not sure this code will work when the append rel is a target (ie UPDATE  and DELETE stmts).

3) It does seem to work when the columns in the subrels don't line up but I  didn't do anything special to handle this
case.

4) I haven't handled mark/restore or random access. I think they could be  handled and they'll probably be worth the
complexitybut I'm not sure.
 

5) Is it considering too many paths? Are there some which maybe aren't worth  considering? For example, maybe it only
makessense to take best start_cost  paths since if that plan doesn't dominate then the best total_cost plan is  likely
tobe the sequential scans + unordered append + sort.
 

6) I haven't looked at setops yet but this is particularly attractive for the  UNION (as opposed to UNION ALL) case.

7) I copied/adapted a bunch of bits from tuplesort to maintain the heap and do  the scankey comparisons. I could
refactorthat code back into tuplesort but  it would mean twisting around tuplesort quite a bit. Essentially it would
meanintroducing a new type of tuplesort which would start off in  FINAL_MERGE state only it would have to behave
differentlysince we don't  want it prefetching lots of records like FINAL_MERGE does, I don't think.
 

Some example plans (though note that the first was with a lot of enable_*
parameters set to off).
                                                        QUERY PLAN
   
 

-----------------------------------------------------------------------------------------------------------------------------Merge
Join (cost=76.03..107.63 rows=12 width=12) (actual time=0.435..0.706 rows=11 loops=1)  Merge Cond: (public.z.i = x.i)
-> Append  (cost=22.36..53.66 rows=12 width=8) (actual time=0.365..0.440 rows=12 loops=1)        ->  Index Scan using
zion z  (cost=0.00..8.27 rows=1 width=8) (actual time=0.089..0.091 rows=1 loops=1)        ->  Index Scan using z1i on
z1z  (cost=0.00..12.28 rows=2 width=8) (actual time=0.063..0.068 rows=2 loops=1)        ->  Index Scan using z2i on z2
z (cost=0.00..12.30 rows=3 width=8) (actual time=0.060..0.066 rows=3 loops=1)        ->  Index Scan using z3i on z3 z
(cost=0.00..12.33rows=5 width=8) (actual time=0.059..0.070 rows=5 loops=1)        ->  Index Scan using zzi on zz z
(cost=0.00..8.27rows=1 width=8) (actual time=0.076..0.079 rows=1 loops=1)  ->  Materialize  (cost=53.67..53.79 rows=12
width=8)(actual time=0.051..0.170 rows=12 loops=1)        ->  Append  (cost=22.36..53.66 rows=12 width=8) (actual
time=0.036..0.104rows=12 loops=1)              ->  Index Scan using zi on z x  (cost=0.00..8.27 rows=1 width=8) (actual
time=0.006..0.006rows=1 loops=1)              ->  Index Scan using z1i on z1 x  (cost=0.00..12.28 rows=2 width=8)
(actualtime=0.004..0.009 rows=2 loops=1)              ->  Index Scan using z2i on z2 x  (cost=0.00..12.30 rows=3
width=8)(actual time=0.005..0.014 rows=3 loops=1)              ->  Index Scan using z3i on z3 x  (cost=0.00..12.33
rows=5width=8) (actual time=0.004..0.016 rows=5 loops=1)              ->  Index Scan using zzi on zz x
(cost=0.00..8.27rows=1 width=8) (actual time=0.005..0.008 rows=1 loops=1)Total runtime: 0.951 ms
 



postgres=# explain analyze select * from t order by i limit 1;
 QUERY PLAN                                                             
 

------------------------------------------------------------------------------------------------------------------------------------Limit
(cost=805.41..805.47 rows=1 width=4) (actual time=44.206..44.208 rows=1 loops=1)  ->  Result  (cost=805.41..18309.89
rows=290003width=4) (actual time=44.201..44.201 rows=1 loops=1)        ->  Append  (cost=805.41..18309.89 rows=290003
width=4)(actual time=44.196..44.196 rows=1 loops=1)              ->  Sort  (cost=1.02..1.02 rows=1 width=4) (actual
time=0.034..0.034rows=1 loops=1)                    Sort Key: public.t.i                    Sort Method:  quicksort
Memory:17kB                    ->  Seq Scan on t  (cost=0.00..1.01 rows=1 width=4) (actual time=0.010..0.013 rows=1
loops=1)             ->  Index Scan using ti1 on t1 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.033..0.033
rows=1loops=1)              ->  Index Scan using ti2 on t2 t  (cost=0.00..289.25 rows=10000 width=4) (actual
time=0.028..0.028rows=1 loops=1)              ->  Index Scan using ti3 on t3 t  (cost=0.00..289.25 rows=10000 width=4)
(actualtime=0.027..0.027 rows=1 loops=1)              ->  Index Scan using ti4 on t4 t  (cost=0.00..289.25 rows=10000
width=4)(actual time=0.028..0.028 rows=1 loops=1)              ->  Index Scan using ti5 on t5 t  (cost=0.00..289.25
rows=10000width=4) (actual time=0.028..0.028 rows=1 loops=1)              ->  Index Scan using ti6 on t6 t
(cost=0.00..289.25rows=10000 width=4) (actual time=0.031..0.031 rows=1 loops=1)              ->  Index Scan using ti7
ont7 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)              ->  Index Scan
usingti8 on t8 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)              ->
IndexScan using ti9 on t9 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.028..0.028 rows=1 loops=1)
  ->  Index Scan using ti0 on t0 t  (cost=0.00..289.27 rows=10001 width=4) (actual time=0.027..0.027 rows=1 loops=1)
         ->  Index Scan using ti11 on t11 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1
loops=1)             ->  Index Scan using ti12 on t12 t  (cost=0.00..289.25 rows=10000 width=4) (actual
time=0.027..0.027rows=1 loops=1)              ->  Index Scan using ti13 on t13 t  (cost=0.00..289.25 rows=10000
width=4)(actual time=0.028..0.028 rows=1 loops=1)              ->  Index Scan using ti14 on t14 t  (cost=0.00..289.25
rows=10000width=4) (actual time=0.027..0.027 rows=1 loops=1)              ->  Index Scan using ti15 on t15 t
(cost=0.00..289.25rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)              ->  Index Scan using ti16
ont16 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)              ->  Index Scan
usingti17 on t17 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.028..0.028 rows=1 loops=1)              ->
IndexScan using ti18 on t18 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
    ->  Index Scan using ti19 on t19 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1
loops=1)             ->  Index Scan using ti20 on t20 t  (cost=0.00..289.27 rows=10001 width=4) (actual
time=0.027..0.027rows=1 loops=1)              ->  Index Scan using ti21 on t21 t  (cost=0.00..739.76 rows=20000
width=4)(actual time=0.028..0.028 rows=1 loops=1)              ->  Index Scan using ti22 on t22 t  (cost=0.00..289.25
rows=10000width=4) (actual time=0.027..0.027 rows=1 loops=1)              ->  Index Scan using ti23 on t23 t
(cost=0.00..289.25rows=10000 width=4) (actual time=0.036..0.036 rows=1 loops=1)              ->  Index Scan using ti24
ont24 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)              ->  Index Scan
usingti26 on t26 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)              ->
IndexScan using ti27 on t27 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1 loops=1)
    ->  Index Scan using ti28 on t28 t  (cost=0.00..289.25 rows=10000 width=4) (actual time=0.027..0.027 rows=1
loops=1)             ->  Sort  (cost=804.39..829.39 rows=10000 width=4) (actual time=43.336..43.336 rows=1 loops=1)
              Sort Key: public.t.i                    Sort Method:  quicksort  Memory: 647kB                    ->  Seq
Scanon t29 t  (cost=0.00..140.00 rows=10000 width=4) (actual time=0.017..21.192 rows=10000 loops=1)Total runtime:
44.737ms
 


--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!
pgsql-hackers by date:
From: Tatsuo Ishii
Date: 22 November 2007, 06:52:06
Subject: Internal document bug?
From: Andrew Dunstan
Date: 22 November 2007, 11:03:53
Subject: Re: run_build.pl ~ By Andrew Dunstan
Ordered Append Node - Mailing list pgsql-hackers

Previous

Next