Thread: Rethinking TupleTableSlot deforming
Hi,

I've previously mentioned (e.g. [1]) that tuple deforming is a serious
bottleneck. I've also experimented successfully [2] with making
slot_deform_tuple() faster. But nonetheless, tuple deforming is still a
*major* bottleneck in many cases, if not *the* major bottleneck.

We could partially address that by JITing the work slot_deform_tuple()
does. Various people have, with good but not raving success, played with
that. Alternatively/additionally we can change the tuple format to make
deforming faster.

But I think the bigger issue than the above is actually that we're just
performing a lot of useless work in a number of common scenarios. We're
always deforming all columns up to the one needed. Very often that's a
lot of useless work. I've experimented with selectively replacing
slot_getattr() calls with heap_getattr(), and for some queries that can
yield massive speedups. And obviously significant slowdowns in others.
That's the case even when preceding columns are varlena and/or contain
nulls. I.e. a good chunk of the problem is storing the results of
deforming, not accessing the data.

ISTM we need to change slots so that they contain information about
which columns are interesting. For the hot paths we'd then only ever
allow access to those columns, and we'd only ever deform them. Combined
with the approach in [2], that allows us to deform tuples a lot more
efficiently.

What I'm basically thinking is that expression evaluation would always
make sure the slots have computed the relevant column set, and deform at
the beginning. There are some cases where we'd likely still need to fall
back to a slower path (e.g. whole-row refs), but that seems fine. That
then also allows us to nearly always avoid the slot_getattr() call, and
instead look at tts_values/nulls directly. The checks slot_getattr()
performs, and the call itself, are quite expensive.

What I'm thinking about is:

a) a new ExecInitExpr()/ExecBuildProjectionInfo() which always computes
a set of interesting columns.

b) replacing all accesses to tts_values/isnull with an inline function.
In optimized builds that function won't do anything but reference the
relevant element, but in assert-enabled builds it'd check whether said
column is actually known to be accessed.

c) Make ExecEvalExpr(), ExecProject(), ExecQual() (and perhaps some
other places) call the new deforming function which ensures the relevant
columns are available.

d) Replace nearly all slot_getattr()/slot_getsomeattrs() calls with the
function introduced in b).

To me it seems this work will be a good bit easier once [2] is actually
implemented instead of prototyped, because treating ExecInitExpr()
non-recursively allows us to build such 'column sets' more easily /
naturally.

Comments? Alternative suggestions?

Greetings,

Andres Freund

[1] http://archives.postgresql.org/20160624232953.beub22r6yqux4gcp@alap3.anarazel.de
[2] http://archives.postgresql.org/message-id/20160714011850.bd5zhu35szle3n3c%40alap3.anarazel.de
On Fri, Jul 22, 2016 at 2:56 AM, Andres Freund <andres@anarazel.de> wrote:
>
> But I think the bigger issue than the above is actually that we're just
> performing a lot of useless work in a number of common scenarios. We're
> always deforming all columns up to the one needed. Very often that's a
> lot of useless work. I've experimented with selectively replacing
> slot_getattr() calls with heap_getattr(), and for some queries that can
> yield massive speedups. And obviously significant slowdowns in others.
> That's the case even when preceding columns are varlena and/or contain
> nulls. I.e. a good chunk of the problem is storing the results of
> deforming, not accessing the data.

As I said when we chatted, I'm a bit puzzled. My understanding was that
the whole point of the "slots" thing was precisely to avoid this kind of
extra work. So I would love to learn where my understanding diverges
from reality. I wonder if there's some simple bug in the code leading to
it not doing what it's intended to do.

My understanding is that the idea of slots is that when we decode the
Nth attribute we decode every attribute up until that attribute and have
somewhere to put it -- in the slot. When we later refer to an earlier
attribute we can just fetch it from the slot directly. This is the
difference between using slot_getattr and heap_getattr directly.

--
greg
Greg Stark <stark@mit.edu> writes:
> On Fri, Jul 22, 2016 at 2:56 AM, Andres Freund <andres@anarazel.de> wrote:
>> But I think the bigger issue than the above is actually that we're just
>> performing a lot of useless work in a number of common scenarios. We're
>> always deforming all columns up to the one needed. Very often that's a
>> lot of useless work.

> As I said when we chatted I'm a bit puzzled.

I'm really suspicious of this line of argument as well. It's possible
that if you only consider all-fixed-width, never-null columns, it might
look like deforming the columns before the one you need is a waste of
effort. But as soon as you relax either of those assumptions, you have
to crawl over the earlier columns anyway, and saving aside the results
is going to be close to free.

I can certainly believe that there's some merit in trying to arrange for
the columns we need to be earlier rather than later. In a sorting or
grouping step, we know we only need access to certain columns --- but
those columns are likely to be at the end, not the beginning. If we're
doing a projection step anyway to construct the input, we could probably
rearrange that. Maybe we could even go further, and require the planner
to always set up the input so that the sort/group columns are exactly
1..N in order, removing the need for the executor to cope with any
variation.

			regards, tom lane
On 2016-07-22 14:33:32 +0100, Greg Stark wrote:
> On Fri, Jul 22, 2016 at 2:56 AM, Andres Freund <andres@anarazel.de> wrote:
> >
> > But I think the bigger issue than the above is actually that we're just
> > performing a lot of useless work in a number of common scenarios. We're
> > always deforming all columns up to the one needed. Very often that's a
> > lot of useless work. I've experimented with selectively replacing
> > slot_getattr() calls with heap_getattr(), and for some queries that can
> > yield massive speedups. And obviously significant slowdowns in others.
> > That's the case even when preceding columns are varlena and/or contain
> > nulls. I.e. a good chunk of the problem is storing the results of
> > deforming, not accessing the data.
>
> As I said when we chatted I'm a bit puzzled.

I think that's a separate slant on slots than what I/we talked about
around pgcon.

> My understanding was that the whole point of the "slots" thing was
> precisely to avoid this kind of extra work. So I would love to learn
> where my understanding diverges from reality. I wonder if there's some
> simple bug in the code leading to it not doing what it's intended to do.
>
> My understanding is that the idea of slots is that when we decode the
> Nth attribute we decode every attribute up until that attribute and have
> somewhere to put it -- in the slot. When we later refer to an earlier
> attribute we can just fetch it from the slot directly. This is the
> difference between using slot_getattr and heap_getattr directly.

Sure, it's intentional. But it's not working that well. The problem is
that very often we actually don't ever access the preceding columns.
Sometimes that knowledge would let us skip directly to the interesting
column(s) (fixed-width, no-nulls columns preceding). But even if not,
just the fact that we don't have to copy the column data for columns
never accessed can make for a *significant* speed bump. For !varlena
datatypes we don't ever have to access the data, just the NULL bitmap.
And for varlena ones, we just need to check the varlena header for the
width, not make a variable-width copy.

So what I'm saying is that we should pre-compute the set of columns
we're interested in, and deform just those, but still in bulk.

Andres
On 2016-07-22 10:09:18 -0400, Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
> > On Fri, Jul 22, 2016 at 2:56 AM, Andres Freund <andres@anarazel.de> wrote:
> >> But I think the bigger issue than the above is actually that we're just
> >> performing a lot of useless work in a number of common scenarios. We're
> >> always deforming all columns up to the one needed. Very often that's a
> >> lot of useless work.
>
> > As I said when we chatted I'm a bit puzzled.
>
> I'm really suspicious of this line of argument as well. It's possible
> that if you only consider all-fixed-width, never-null columns, it might
> look like deforming the columns before the one you need is a waste of
> effort. But as soon as you relax either of those assumptions, you have
> to crawl over the earlier columns anyway, and saving aside the results
> is going to be close to free.

Not according to my measurements. And that doesn't seem that surprising.
For a null check you'll need to access (not cheaply so!) the null
bitmap; to skip a varlena datum one needs to access the varlena header.
Copying the actual datum, *especially* in the varlena case, is more
expensive than that; especially because otherwise you'll often not have
to touch the source cachelines at all.

> I can certainly believe that there's some merit in trying to arrange for
> the columns we need to be earlier rather than later. In a sorting or
> grouping step, we know we only need access to certain columns --- but
> those columns are likely to be at the end not the beginning. If we're
> doing a projection step anyway to construct the input, we could probably
> rearrange that.

Oh, that's a neat idea as well.

> Maybe we could even go further, and require the planner
> to always set up the input so that the sort/group columns are exactly 1..N
> in order, removing the need for the executor to cope with any variation.

I guess that columns both returned to the upper node and involved in
sorting would make it hard to entirely and efficiently rely on that,
without forcing more expensive projections. But it's probably worth
tinkering with.

Greetings,

Andres Freund
On 7/22/16 1:35 PM, Andres Freund wrote:
> Sure, it's intentional. But not working that well. The problem is that
> very often we actually don't ever access the preceding
> columns. Sometimes that knowledge would let us skip directly to the
> interesting column(s) (fixed width, no nulls columns preceding). But
> even if not, just the fact that we don't have to copy the column data
> for columns never accessed can make for a *significant* speed bump. For
> !varlena datatypes we don't ever have to access the data, just the NULL
> bitmap. And for varlena ones, we just need to check the varlena header
> for the width, not make variable width copy.
>
> So what I'm saying is, that we should pre-compute the set of columns
> we're interested in, and deform just those, but still in bulk.

Another option would be to remember the tuple offsets (NOT attcacheoff)
that have been computed, as well as whether a varlena attribute has
actually been deformed. That eliminates the need to pre-declare what
attributes you're allowed to reference.

Somewhat related, it's occurred to me that it would be nice if expanded
objects supported partial expansion. IE: if we converted JSONB to be an
expanded object, you would not want to proactively expand every value in
a JSON object, because each of those values could itself be very large.
Obviously most of the work there would be up to the type that was
handling the details of the expansion, but IIRC there's some additional
support that would be needed in expanded objects themselves.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
On 2016-07-22 13:51:31 -0500, Jim Nasby wrote:
> Another option would be to remember the tuple offsets (NOT attcacheoff)
> that have been computed as well as whether a varlena attribute has
> actually been deformed. That eliminates the need to pre-declare what
> attributes you're allowed to reference.

That'd make access more expensive, so that doesn't seem a very enticing
solution.
Andres Freund <andres@anarazel.de> writes:
> On 2016-07-22 10:09:18 -0400, Tom Lane wrote:
>> I'm really suspicious of this line of argument as well. It's possible
>> that if you only consider all-fixed-width, never-null columns, it might
>> look like deforming the columns before the one you need is a waste of
>> effort. But as soon as you relax either of those assumptions, you have
>> to crawl over the earlier columns anyway, and saving aside the results
>> is going to be close to free.

> Not according to my measurements. And that doesn't seems that
> surprising. For a null check you'll need to access (not cheaply so!) the
> null bitmap, to skip a varlena datum one needs to access the varlena
> header. Copying the actual datum, *especially* in the varlena case, is a
> more expensive than that; especially because otherwise you'll often not
> have to touch the source cachelines at all.

But that's nonsense. We don't copy varlena datums during tuple
deforming, only save aside pointers to them. And we would have to access
the column's null bit and varlena header in any case if our goal is to
find a column located later. Yeah, we could skip physically storing the
pointer into the slot's Datum array, but not much more than that.

>> Maybe we could even go further, and require the planner
>> to always set up the input so that the sort/group columns are exactly 1..N
>> in order, removing the need for the executor to cope with any variation.

> I guess that columns both returned to the upper node, and involved in
> sorting would make it hard to entirely and efficiently rely on that,
> without forcing more expensive projections. But it's probably worth
> tinkering with.

Well, it's a question of whether an extra projection at the scan level
is worth the savings in column access during the sort or group stage. My
gut feeling is that it very likely would be a win for a multicolumn
sort. (For a single sort key column we already amortize fetching of the
column datum, so maybe not so much in that case.)

Whether the column is needed at upper levels doesn't seem relevant to
me. setrefs.c will find it wherever it is.

			regards, tom lane
On 7/22/16 1:58 PM, Andres Freund wrote:
> On 2016-07-22 13:51:31 -0500, Jim Nasby wrote:
>> Another option would be to remember the tuple offsets (NOT attcacheoff)
>> that have been computed as well as whether a varlena attribute has
>> actually been deformed. That eliminates the need to pre-declare what
>> attributes you're allowed to reference.
>
> That'd make access more expensive, so that doesn't seem a very enticing
> solution.

Yeah, it's certainly a tradeoff. The advantage is that this might be
less invasive; it might also work in more cases, like heap_deform_tuple
(which I think is used a lot more than tuple table slots). It's
certainly possible it's not worth it, but I wanted to put the idea out
there.
--
Jim Nasby
On 2016-07-22 15:00:32 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2016-07-22 10:09:18 -0400, Tom Lane wrote:
> >> I'm really suspicious of this line of argument as well. It's possible
> >> that if you only consider all-fixed-width, never-null columns, it might
> >> look like deforming the columns before the one you need is a waste of
> >> effort. But as soon as you relax either of those assumptions, you have
> >> to crawl over the earlier columns anyway, and saving aside the results
> >> is going to be close to free.
>
> > Not according to my measurements. And that doesn't seems that
> > surprising. For a null check you'll need to access (not cheaply so!) the
> > null bitmap, to skip a varlena datum one needs to access the varlena
> > header. Copying the actual datum, *especially* in the varlena case, is a
> > more expensive than that; especially because otherwise you'll often not
> > have to touch the source cachelines at all.
>
> But that's nonsense. We don't copy varlena datums during tuple deforming,
> only save aside pointers to them.

Indeed. I've hacked up a test for this; I'll dig that up and show the
results.

> Well, it's a question of whether an extra projection at the scan level is
> worth the savings in column access during the sort or group stage. My gut
> feeling is that it very likely would be a win for a multicolumn sort.
> (For a single sort key column we already amortize fetching of the column
> datum, so maybe not so much in that case.)
>
> Whether the column is needed at upper levels doesn't seem relevant to me.
> setrefs.c will find it wherever it is.

Well, the projection is what I was thinking of.

Andres
Andres Freund <andres@anarazel.de> writes:
> On 2016-07-22 15:00:32 -0400, Tom Lane wrote:
>> Well, it's a question of whether an extra projection at the scan level is
>> worth the savings in column access during the sort or group stage.

> Well, the projection is what I was thinking of.

One point here is that unless your test case is a blind "SELECT *",
there's almost certainly going to be a projection before a sort anyway,
because we always want to get rid of any unreferenced columns to
minimize the data volume going through the sort.

			regards, tom lane
Andres Freund wrote:
> Hi,
>
> I've previously mentioned (e.g. [1]) that tuple deforming is a serious
> bottleneck. I've also experimented successfully [2] with making
> slot_deform_tuple() faster.

I'm currently messing with making heapam.c be just one possible
implementation of tuple storage, to allow other modules to implement
tuple storage in different ways. Each such module would have a pg_am
row, and among the set of routines that such a module would have to
provide is the equivalent of slot_deform_tuple, which takes a tuple in
whatever format the storage module uses and puts it into
executor-formatted tts_values/tts_isnull arrays. For tuples coming from
heapam.c-formatted tables that would be pretty much slot_deform_tuple,
but of course other modules could have tuples in completely different
formats.

(This is one part of the larger project of columnar storage, where a
storage module would store tuples in some columnar format rather than
row-oriented. Obviously, tuple deforming means a completely different
thing in that world than in heapam.c.)

No patch to show at present; I mentioned this project in Brussels. I
intend to present it for 10.0.
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services