Thread: WIP: avoiding tuple construction/deconstruction overhead

WIP: avoiding tuple construction/deconstruction overhead

From
Tom Lane
Date:
Attached is the current state of a patch to reduce the overhead of passing
tuple data up through many levels of plan nodes.  It's not tested enough
to apply yet, but I thought I'd put it out for comment.  It seems to get
about a factor of 4 speedup on Miroslav's nested-joins example (above
and beyond what we got from Atsushi Ogawa's patch).

The basic point of the patch is to allow a TupleTableSlot to contain
a "virtual" tuple instead of a regular heap tuple.  The virtual tuple
is just an array of Datums, with any pass-by-reference Datums pointing
at original storage (either a lower-level slot or expression result
storage).  This representation is essentially the raw output of
ExecProject.  This not only avoids the overhead of forming the data into
a tuple (heap_formtuple) but also saves cycles when extracting the
data at the next level up, since we can just grab the Datums directly.
(This behavior builds on and shares code with Ogawa's patch to cache
extracted Datums in TupleTableSlots.  When a slot contains a physical
tuple, the same Datum arrays cache any Datums extracted from it.)

Since a slot may or may not contain a regular tuple, you can't just grab
slot->val anymore; there are new API functions ExecCopySlotTuple()
and ExecFetchSlotTuple() (the former when you want to make your own copy,
the latter when you don't).  These force construction of a real tuple
if the slot is virtual.  I also made an ExecCopySlot() convenience routine
for the common case of copying one slot's contents into another slot.

A related API modification is to change tuple receivers (DestReceivers)
to receive a TupleTableSlot instead of separate tuple and tuple descriptor
parameters.  This makes it possible to avoid an unnecessary tuple
construction/deconstruction at the final output phase as well.

It also turned out to be useful to make a short-circuit path for
ExecProject when the targetlist is entirely simple Vars.  This only
requires copying Datums from lower to upper slots, and we can
implement it that way instead of going through ExecEvalExpr.

Finally, I have made some progress towards making the tuple access
routines consistently use "bool isNull" arrays as null markers, instead
of the char 'n' or ' ' convention that was previously used in some but
not all contexts.  I don't think we can retire heap_formtuple or
heap_modifytuple for a long time, if ever, but we can deprecate them
in favor of the parallel new routines with the bool interface.

Comments?

            regards, tom lane


Attachment

Re: WIP: avoiding tuple construction/deconstruction overhead

From
a_ogawa
Date:
Tom Lane wrote:
> Attached is the current state of a patch to reduce the overhead of
> passing tuple data up through many levels of plan nodes.

It is a good idea.
I think that this patch improves performance of the whole executor.

I have three comments.

(1)We can improve compare_heap() by using TableTupleSlot instead of
HeapTuple. Please see attached patch.

(2)In ExecStoreTuple(), we can omit initialization of slot->tts_nvalid.
If slot->tts_isempty is false, tts_nvalid is initialized by
ExecClearTuple(). If it is true, tts_nvalid is always zero.

(3)There is a description of slot->val in comment of execTuple.c.
This had better revise it.

> Finally, I have made some progress towards making the tuple access
> routines consistently use "bool isNull" arrays as null markers, instead
> of the char 'n' or ' ' convention that was previously used in some but
> not all contexts.

I agree. I think that this progress improves readability.

regards,

--- Atsushi Ogawa

Attachment

Re: WIP: avoiding tuple construction/deconstruction overhead

From
Tom Lane
Date:
a_ogawa <a_ogawa@hi-ho.ne.jp> writes:
> (1)We can improve compare_heap() by using TableTupleSlot instead of
> HeapTuple. Please see attached patch.

Did you measure any performance improvement from that?  I considered it
but thought it would likely be a wash or a loss, because in most cases
only one attribute will be pulled from a tuple during comparetup_heap.
slot_getattr cannot improve on heap_getattr in that case, and is quite
likely to be slower.

> (3)There is a description of slot->val in comment of execTuple.c.
> This had better revise it.

Drat, how'd I miss that?  Thanks.

            regards, tom lane

Re: WIP: avoiding tuple construction/deconstruction overhead

From
a_ogawa
Date:
Tom Lane wrote:
> a_ogawa <a_ogawa@hi-ho.ne.jp> writes:
> > (1)We can improve compare_heap() by using TableTupleSlot instead of
> > HeapTuple. Please see attached patch.
>
> Did you measure any performance improvement from that?  I considered it
> but thought it would likely be a wash or a loss, because in most cases
> only one attribute will be pulled from a tuple during comparetup_heap.
> slot_getattr cannot improve on heap_getattr in that case, and is quite
> likely to be slower.

I measured performance of heap_getattr and slot_getattr in
comparetup_heap.

I made the table which had ten varchar attributes, and registered
data for tests.  (Attached file includes SQL doing this.)
I carried out the following tests.

(case 1)
 test1: select * from sort_test order by v1 limit 100;
 test2: select * from sort_test order by v1, v2 limit 100;
 test3: select * from sort_test order by v1, v2, v3 limit 100;
 test4: select * from sort_test order by v1, v2, v3, v4 limit 100;
 test5: select * from sort_test order by v1, v2, v3, v4, v5 limit 100;

 result:        test1    test2    test3    test4    test5
-----------------------------------------------------------------------
 heap_getattr  2.149s   2.602s   3.204s   3.830s   4.159s
 slot_getattr  2.523s   3.422s   3.977s   4.453s   4.721s

(case 2)
 test1: select * from sort_test order by v10 limit 100;
 test2: select * from sort_test order by v10, v9 limit 100;
 test3: select * from sort_test order by v10, v9, v8 limit 100;
 test4: select * from sort_test order by v10, v9, v8, v7 limit 100;
 test5: select * from sort_test order by v10, v9, v8, v7, v6 limit 100;

 result:        test1    test2    test3    test4    test5
-----------------------------------------------------------------------
 heap_getattr  3.654s   5.549s   6.575s   7.367s   7.870s
 slot_getattr  4.027s   4.930s   5.249s   5.555s   5.756s

(case 3)
 test1: select * from sort_test order by v5 limit 100;
 test2: select * from sort_test order by v5, v6 limit 100;
 test3: select * from sort_test order by v5, v6, v7 limit 100;
 test4: select * from sort_test order by v5, v6, v7, v8 limit 100;
 test5: select * from sort_test order by v5, v6, v7, v8, v9 limit 100;

 result:        test1    test2    test3    test4    test5
-----------------------------------------------------------------------
 heap_getattr  2.657s   4.207s   5.194s   6.179s  6.662s
 slot_getattr  3.126s   4.233s   4.806s   5.271s  5.557s

In most cases, heap_getattr is fast.
When the following conditions occurred, slot_getattr is fast.
 (1)Tuple have varlen attributes.
 (2)Sort key have more than two attributes.
 (3)A position of a sort key is far from the head of tuple.
 (4)As for the data of a sort key, there be many repetition.
Actually it will be rare that these conditions are occurred.

Thinking from a result, I think that we had better continue using
heap_getattr in comparetup_heap.

regards,

--- Atsushi Ogawa

Attachment

Re: WIP: avoiding tuple construction/deconstruction overhead

From
Bruce Momjian
Date:
I am very excited there has been so much reduction in tuple processing
overhead in the past few weeks.  This is always and area I thought
needed improvement, and its great to see it.

We will certainly have some big performance improvements in 8.1 because
we already have several (e.g. SMP) and we have many more months to go.

---------------------------------------------------------------------------

a_ogawa wrote:
>
> Tom Lane wrote:
> > a_ogawa <a_ogawa@hi-ho.ne.jp> writes:
> > > (1)We can improve compare_heap() by using TableTupleSlot instead of
> > > HeapTuple. Please see attached patch.
> >
> > Did you measure any performance improvement from that?  I considered it
> > but thought it would likely be a wash or a loss, because in most cases
> > only one attribute will be pulled from a tuple during comparetup_heap.
> > slot_getattr cannot improve on heap_getattr in that case, and is quite
> > likely to be slower.
>
> I measured performance of heap_getattr and slot_getattr in
> comparetup_heap.
>
> I made the table which had ten varchar attributes, and registered
> data for tests.  (Attached file includes SQL doing this.)
> I carried out the following tests.
>
> (case 1)
>  test1: select * from sort_test order by v1 limit 100;
>  test2: select * from sort_test order by v1, v2 limit 100;
>  test3: select * from sort_test order by v1, v2, v3 limit 100;
>  test4: select * from sort_test order by v1, v2, v3, v4 limit 100;
>  test5: select * from sort_test order by v1, v2, v3, v4, v5 limit 100;
>
>  result:        test1    test2    test3    test4    test5
> -----------------------------------------------------------------------
>  heap_getattr  2.149s   2.602s   3.204s   3.830s   4.159s
>  slot_getattr  2.523s   3.422s   3.977s   4.453s   4.721s
>
> (case 2)
>  test1: select * from sort_test order by v10 limit 100;
>  test2: select * from sort_test order by v10, v9 limit 100;
>  test3: select * from sort_test order by v10, v9, v8 limit 100;
>  test4: select * from sort_test order by v10, v9, v8, v7 limit 100;
>  test5: select * from sort_test order by v10, v9, v8, v7, v6 limit 100;
>
>  result:        test1    test2    test3    test4    test5
> -----------------------------------------------------------------------
>  heap_getattr  3.654s   5.549s   6.575s   7.367s   7.870s
>  slot_getattr  4.027s   4.930s   5.249s   5.555s   5.756s
>
> (case 3)
>  test1: select * from sort_test order by v5 limit 100;
>  test2: select * from sort_test order by v5, v6 limit 100;
>  test3: select * from sort_test order by v5, v6, v7 limit 100;
>  test4: select * from sort_test order by v5, v6, v7, v8 limit 100;
>  test5: select * from sort_test order by v5, v6, v7, v8, v9 limit 100;
>
>  result:        test1    test2    test3    test4    test5
> -----------------------------------------------------------------------
>  heap_getattr  2.657s   4.207s   5.194s   6.179s  6.662s
>  slot_getattr  3.126s   4.233s   4.806s   5.271s  5.557s
>
> In most cases, heap_getattr is fast.
> When the following conditions occurred, slot_getattr is fast.
>  (1)Tuple have varlen attributes.
>  (2)Sort key have more than two attributes.
>  (3)A position of a sort key is far from the head of tuple.
>  (4)As for the data of a sort key, there be many repetition.
> Actually it will be rare that these conditions are occurred.
>
> Thinking from a result, I think that we had better continue using
> heap_getattr in comparetup_heap.
>
> regards,
>
> --- Atsushi Ogawa

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073