Thread: Cache last known per-tuple offsets to speed long tuple access

Cache last known per-tuple offsets to speed long tuple access

From
a_ogawa
Date:
I made a patch for "Cache last known per-tuple offsets to speed
long tuple access" that is on the TODO list.

This problem was discussed on the hackers list as "Terrible performance
on wide selects".
The root of the problem is nocachegetattr(), called from ExecEvalVar().
If a tuple has many columns and contains a variable-length column or
null data, the time spent in nocachegetattr() is O(N^2) in the number
of fields.

I referred to the URL below for the implementation:
 http://archives.postgresql.org/pgsql-performance/2003-01/msg00262.php

The points of this patch are as follows:
(1) heap_deformtuple_incr() is added.
 This function can extract the attributes of a tuple incrementally.

(2) A cache that keeps the result of heap_deformtuple_incr() is
 added inside TupleTableSlot.

(3) In ExecEvalVar(), heap_deformtuple_incr() is used in place of
 nocachegetattr(). This reduces the time from O(N^2) to O(N).

In order to measure the effect, I executed the test below.
-------------------
The table has 15,000 tuples and 200 columns; all columns are of type
text. The table name is aaa; the column names are t001...t200.
The executed SQL is:
select t100, t110, t120, t130, t140, t150,
       t160, t170, t180, t190, t200
from aaa;

The profile result of the original code is as follows.
-------------------
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 70.05      1.31     1.31   163846     0.00     0.00  nocachegetattr
  8.02      1.46     0.15   163840     0.00     0.00  FunctionCall3
  1.87      1.50     0.04   397763     0.00     0.00  AllocSetFreeIndex
  1.60      1.52     0.03   163840     0.00     0.00  ExecEvalVar
  1.34      1.55     0.03   200414     0.00     0.00  AllocSetAlloc

The profile result after applying the patch is as follows.
-------------------
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 39.73      0.29     0.29   180224     0.00     0.00  heap_deformtuple_incr
  9.59      0.36     0.07   163840     0.00     0.00  FunctionCall3
  6.85      0.41     0.05    16384     0.00     0.02  ExecTargetList
  5.48      0.45     0.04    23477     0.00     0.00  hash_any
  4.11      0.48     0.03   200414     0.00     0.00  AllocSetAlloc

Regards,

--- Atsushi Ogawa (a_ogawa@hi-ho.ne.jp)

Attachment

Re: Cache last known per-tuple offsets to speed long tuple access

From
Tom Lane
Date:
a_ogawa <a_ogawa@hi-ho.ne.jp> writes:
> I made a patch for "Cache last known per-tuple offsets to speed
> long tuple access" that is on the TODO list.
> I referred to the URL below for the implementation:
>  http://archives.postgresql.org/pgsql-performance/2003-01/msg00262.php

This wasn't quite what I had in mind.  In particular I think it's a
bad idea to put a significant part of the implementation directly into
ExecEvalVar, because we'll have to duplicate that code anywhere else
that proves to be a hot-spot.  I think the right way is to make a new
function that is just like heap_getattr except it is given a
TupleTableSlot instead of separate tuple and tupdesc arguments, and
encapsulate the knowledge about optimizing access using previously
determined offsets inside that.

Also, heap_deformtuple is a hot spot in its own right now, so slowing
it down so that it can share code with the TupleTableSlot optimization
doesn't seem to me like a win.  There's not that much code in it, and
the data structures involved are very unlikely to change much, so I think
it'd be better to just duplicate the code and then optimize separately.
This would also avoid the confusion you've introduced into
heap_deformtuple about the roles of the various variables (the first
comment is effectively backwards now...)

We did not have heap_deformtuple in its current incarnation at the time
of the discussion you reference, so there may be some additional thought
needed.  It'd be nice if heap_deformtuple could make use of
already-computed info, for instance.  I'm not sure if that can be done
without fairly wide-ranging API changes (most of its callers don't have
Slots available), so it may be something to leave for a future patch.

Lastly, I don't see any point in introducing DeformTupleState and
deformtuple_cache as abstractions in their own right.  I think it'd
be simpler and more readable just to regard all this as fields of a
TupleTableSlot.  If there were any other contexts that we might want
to cache partial tuple disassembly info in, then it'd be worth making
a separation, but I don't foresee any such application.

BTW, why is it that your profile shows *more* calls to
heap_deformtuple_incr after the patch than there were nocachegetattr
calls before?  Seems like there should be fewer, or at least the same
number.  I'm now wondering whether there is any point to the "incremental"
aspect at all, as opposed to just doing a heap_deformtuple call the
first time a column value is demanded and always extracting all the
fields at that time.  (Of course, this will always look like a win in
the context of test cases that access most or all of the fields.  What
we need to worry about is the penalty incurred in cases that don't.)

            regards, tom lane

Re: Cache last known per-tuple offsets to speed long tuple

From
a_ogawa
Date:
Thank you for the advice.
I am going to remake the patch in order to keep it simple.
The plan for the new patch is as follows.

(1) Add fields to TupleTableSlot and TupleTableData.
These fields hold the tuple disassembly information.

(2) Add code that initializes/cleans the new fields.
This code is added to execTuples.c.

(3) Add two functions to execQual.c.
One function is just like heap_deformtuple, except it is given a
TupleTableSlot. It extracts the fields of the tuple incrementally,
using the new fields of the TupleTableSlot.

The meaning of "incrementally" is shown by the following example.
Example: the tuple has 100 columns.
 - The first call, to get col5, will fill the first 5 positions in
   the array.
 - The next call, to get col75, will fill from position 5 up to 75.
 - The next call, to get col60, will only refer to the array, because
   that attribute has already been extracted.

The other function is just like heap_getattr and fast_getattr,
except it is given a TupleTableSlot. This function uses the new
heap_deformtuple-like function instead of nocachegetattr.

(4) ExecEvalVar uses the new heap_getattr-like function instead of
heap_getattr.

With the new patch, only three files are due to change: tuptable.h,
execTuples.c, and execQual.c.

> BTW, why is it that your profile shows *more* calls to
> heap_deformtuple_incr after the patch than there were nocachegetattr
> calls before?

The extra calls are for printtup.
(printtup -> heap_deformtuple -> heap_deformtuple_incr)
Since heap_deformtuple and heap_deformtuple_incr share their code,
heap_deformtuple_incr shows more calls than nocachegetattr did.

If the heap_deformtuple_incr calls made on behalf of printtup are
excluded, heap_deformtuple_incr and nocachegetattr have the same
number of calls.

With my test accessing the columns in ascending order
(select t100, t110, ...), heap_deformtuple_incr and nocachegetattr
have the same number of calls.
If the columns are accessed in descending order (select t200,
t190, ...), the number of calls to heap_deformtuple_incr decreases
sharply, because the result is cached by the first call, which
gets t200.

regards,

--- Atsushi Ogawa
a_ogawa@hi-ho.ne.jp


Re: Cache last known per-tuple offsets to speed long tuple

From
a_ogawa
Date:
I remade the patch for "Cache last known per-tuple offsets to speed
long tuple access" that is on the TODO list.

The points of this patch are as follows:
(1) Add fields to TupleTableSlot and TupleTableData.
These fields hold the tuple disassembly information.

(2) Add code that initializes/cleans the new fields.
This code is added to execTuples.c.

(3) Add two functions to execQual.c.
The first function is named slot_deformtuple and is just like
heap_deformtuple, except that it can be resumed from the previous
execution using the information encapsulated in the TupleTableSlot.

The other function is named slot_getattr. It is just like the
heap_getattr and fast_getattr macros, except it is given a
TupleTableSlot, and it uses slot_deformtuple instead of
nocachegetattr.

(4) ExecEvalVar uses the new function slot_getattr instead of
heap_getattr.

I executed the test below.
-------------------
The table has 16,384 tuples and 200 columns; all columns are of type
text. The table name is aaa; the column names are t001...t200.
The executed SQL is:
select t100, t110, t120, t130, t140, t150,
       t160, t170, t180, t190, t200
from aaa;

The profile result of the original code is as follows.
-------------------
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 74.19      1.38     1.38   163846     0.00     0.00  nocachegetattr
  4.30      1.46     0.08   163840     0.00     0.00  FunctionCall3
  1.61      1.49     0.03   397750     0.00     0.00  AllocSetFreeIndex
  1.61      1.52     0.03    16384     0.00     0.00  ExecTargetList
  1.08      1.54     0.02   344152     0.00     0.00  appendBinaryStringInfo

The profile result after applying the patch is as follows.
-------------------
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 30.38      0.24     0.24   163840     0.00     0.00  slot_deformtuple
 10.13      0.32     0.08   163840     0.00     0.00  FunctionCall3
  5.06      0.36     0.04   163840     0.00     0.00  slot_getattr
  5.06      0.40     0.04    16384     0.00     0.00  heap_deformtuple
  3.80      0.43     0.03    49159     0.00     0.00  ExecClearTuple

regards,

--- Atsushi Ogawa
a_ogawa@hi-ho.ne.jp

Attachment

Re: Cache last known per-tuple offsets to speed long tuple

From
Bruce Momjian
Date:
This has been saved for the 8.1 release:

    http://momjian.postgresql.org/cgi-bin/pgpatches2


--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Cache last known per-tuple offsets to speed long tuple

From
Tom Lane
Date:
a_ogawa <a_ogawa@hi-ho.ne.jp> writes:
> I remade the patch for "Cache last known per-tuple offsets to speed
> long tuple access" that is on the TODO list.

I've applied this patch with some revisions (mostly to put the
functionality in places that seemed more appropriate than execQual.c).
Many thanks!

            regards, tom lane