Thread: A few more opportunities to use TupleTableSlotOps fields
Perhaps it's just a matter of taste, but I think the TupleTableSlotOps structure, once initialized, should be used wherever possible. At least for me personally, when I read the code, the particular callback function name is a bit disturbing wherever it's not necessary. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Attachment
On Tue, May 21, 2019 at 8:02 AM Antonin Houska <ah@cybertec.at> wrote: > Perhaps it's just a matter of taste, but I think the TupleTableSlotOps > structure, once initialized, should be used wherever possible. At least for me > personally, when I read the code, the particular callback function name is a > bit disturbing wherever it's not necessary. But it's significantly more efficient. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, May 21, 2019 at 8:02 AM Antonin Houska <ah@cybertec.at> wrote: > > Perhaps it's just a matter of taste, but I think the TupleTableSlotOps > > structure, once initialized, should be used wherever possible. At least for me > > personally, when I read the code, the particular callback function name is a > > bit disturbing wherever it's not necessary. > > But it's significantly more efficient. Do you refer to the fact that for example the address of tts_virtual_clear(dstslot); is immediately available in the text section while in this case dstslot->tts_ops->clear(dstslot); the CPU first needs to fetch the address of tts_ops and also that of the ->clear function? I admit I didn't think about this problem. Nevertheless I imagine that due to constness of the variables like TTSOpsVirtual (and due to several other const declarations) the compiler might be able to compute the address of the tts_ops->clear() expression. Thus the only extra work for the CPU would be to fetch tts_ops from dstslot, but that should not be a big deal because other fields of dstslot are accessed by the surrounding code (so all of them might be available in the CPU cache). I don't pretend to be an expert in this area though, it's possible that I still miss something. -- Antonin Houska Web: https://www.cybertec-postgresql.com
On Tue, May 21, 2019 at 10:48 AM Antonin Houska <ah@cybertec.at> wrote: > Do you refer to the fact that for example the address of > > tts_virtual_clear(dstslot); > > is immediately available in the text section while in this case > > dstslot->tts_ops->clear(dstslot); > > the CPU first needs to fetch the address of tts_ops and also that of the > ->clear function? Yes. And since tts_virtual_clear() is marked static, it seems like it might even possible for it to inline that function at compile time. > I admit I didn't think about this problem. Nevertheless I imagine that due to > constness of the variables like TTSOpsVirtual (and due to several other const > declarations) the compiler might be able to compute the address of the > tts_ops->clear() expression. Thus the only extra work for the CPU would be to > fetch tts_ops from dstslot, but that should not be a big deal because other > fields of dstslot are accessed by the surrounding code (so all of them might > be available in the CPU cache). I think the issue is pipeline stalls. If the compiler can see a direct call coming up, it can start fetching the instructions from the target address. If it sees an indirect call, that's probably harder to do. But really, I'm not an expect on this area. I would write the code as Andres did on the general principle of making things as easy for the compiler and CPU as possible, but I do not know how much it really matters. Andres probably does know... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2019-05-21 12:16:22 -0400, Robert Haas wrote: > On Tue, May 21, 2019 at 10:48 AM Antonin Houska <ah@cybertec.at> wrote: > > Do you refer to the fact that for example the address of > > > > tts_virtual_clear(dstslot); > > > > is immediately available in the text section while in this case > > > > dstslot->tts_ops->clear(dstslot); > > > > the CPU first needs to fetch the address of tts_ops and also that of the > > ->clear function? > > Yes. And since tts_virtual_clear() is marked static, it seems like it > might even possible for it to inline that function at compile time. I sure hope so. And I did verify that at some point. We, for example, don't want changes to slot->tts_flags tts_virtual_clear does to be actually written to memory, if it's called from a callsite that knows it's a virtual slot. I just checked, and for me tts_virtual_copyslot() inlines all of tts_virtual_clear(), and eliminates most of its contents except for the pfree() branch. All the rest of the state changes are overwritten by tts_virtual_copyslot() anyway. > > I admit I didn't think about this problem. Nevertheless I imagine that due to > > constness of the variables like TTSOpsVirtual (and due to several other const > > declarations) the compiler might be able to compute the address of the > > tts_ops->clear() expression. Thus the only extra work for the CPU would be to > > fetch tts_ops from dstslot, but that should not be a big deal because other > > fields of dstslot are accessed by the surrounding code (so all of them might > > be available in the CPU cache). > > I think the issue is pipeline stalls. If the compiler can see a > direct call coming up, it can start fetching the instructions from the > target address. If it sees an indirect call, that's probably harder > to do. Some CPUs can do so, but it'll often be more expensive / have a higher chance of misspeculating (rather than the 0 chance in case of a straight line code. > But really, I'm not an expect on this area. I would write the code as > Andres did on the general principle of making things as easy for the > compiler and CPU as possible, but I do not know how much it really > matters. Andres probably does know... I think the inlining bit is more crucial, but that having as few indirect calls as possible matters here. It wasn't that easy to get the slot virtualization to not regress performance meaningfully. If anything, I really want to go the *opposite* direction, i.e. remove *more* indirect calls from within execTuples.c, and get the compiler to realize at callsites external to execTuples.c that it can cache tts_ops in more places. Greetings, Andres Freund
Hi, On 2019-05-21 16:47:50 +0200, Antonin Houska wrote: > I admit I didn't think about this problem. Nevertheless I imagine that due to > constness of the variables like TTSOpsVirtual (and due to several other const > declarations) the compiler might be able to compute the address of the > tts_ops->clear() expression. It really can't, without actually fetching tts_ops, and reading the callback's location. How would e.g. tts_virtual_copyslot() know that the slot's tts_ops point to TTSOpsVirtual? There's simply no way to express that in C. If this were a class in C++, the compiler would have decent chance at it these days (because if it's a final method it can infer that it has to be, and because whole program optimization allows devirtualization passes to do so), but well, it's not. And then there's the whole inlining issue explained in my other recent mail on the topic. Greetings, Andres Freund