Thread: Does TupleQueueReaderNext() really need to copy its result?
Hi,

A comment in tqueue.c says that the bytes returned by shm_mq_receive() "had better be sufficiently aligned", before assigning the pointer to htup.t_data. Then it copies htup and returns the copy (and it did so in the earlier version that had all the remapping stuff too, but that version sometimes deformed the tuple directly, so it really did need to be suitably aligned in that case, IIUC).

Given that shm_mq.c proudly documents that it avoids copying the data on the receiving side (unless it has to reconstruct a message that was split up), and given that it promises that the pointed-to data remains valid until your next call, it seems that it should be safe to return a pointer to the same HeapTupleData object every time (perhaps a member of the TupleQueueReader struct) and just adjust its t_data and t_len members each time, so that the gather node emits tuples directly from the shared memory queue (and then of course tell the slot not to pfree()). Alternatively, if the value returned by shm_mq_receive() is not really suitably aligned, then the comment is a bit misleading.

-- Thomas Munro https://enterprisedb.com
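The alignment claim here can be sketched with a toy model (the `TOY_` names and the 8-byte alignment constant are assumptions for illustration, not the real shm_mq.c definitions): if the ring buffer base is maxaligned and every read and write advances the queue by a maxaligned amount, then by induction every message starts maxaligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for MAXALIGN with MAXIMUM_ALIGNOF == 8 (typical on 64-bit). */
#define TOY_MAXALIGN(len) (((uintptr_t) (len) + 7) & ~(uintptr_t) 7)

/* Offset of the next message, given the current offset and the raw
 * message length: if each read/write is rounded up to a multiple of
 * the alignment, an aligned offset stays aligned. */
static uintptr_t
toy_next_offset(uintptr_t cur, size_t msg_len)
{
    return cur + TOY_MAXALIGN(msg_len);
}
```

Under those assumptions the receiver can safely point a HeapTupleData's t_data at the message bytes without copying, which is exactly what the proposal above relies on.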
On Thu, Aug 22, 2019 at 12:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Given that shm_mq.c proudly documents that it avoids copying the data
> on the receiving side (unless it has to reconstruct a message that was
> split up), and given that it promises that the pointed-to data remains
> valid until your next call, it seems that it should be safe to return
> a pointer to the same HeapTupleData object every time (perhaps a
> member of the TupleQueueReader struct) and just adjust its t_data and
> t_len members every time, so that the gather node emits tuples
> directly from the shared memory queue (and then of course tell the
> slot not to pfree()). Alternatively, if the value returned by
> shm_mq_receive() is not really suitably aligned, then the comment is a
> bit misleading.

Couldn't resist trying this, and it seems to work. Based on the comment "the buffer size is a multiple of MAXIMUM_ALIGNOF, and each read and write is as well", it should always work (though I wish shm_mq_receive_bytes()'s documentation would discuss message alignment explicitly if that's true). On the other hand, I doubt it makes a difference, so this is more of a question: is this the way it was supposed to work?

(It was ~40% faster at shunting a large SELECT * through the queue with asserts enabled, which made me happy for a moment, but it was only an illusion and not measurable in the noise without asserts.)

-- Thomas Munro https://enterprisedb.com
Attachment
On Fri, Aug 23, 2019 at 10:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Couldn't resist trying this, and it seems to work. Based on the
> comment "the buffer size is a multiple of MAXIMUM_ALIGNOF, and each
> read and write is as well", it should always work (though I wish
> shm_mq_receive_bytes()'s documentation would discuss message alignment
> explicitly if that's true). On the other hand, I doubt it makes a
> difference, so this is more of a question: is this the way it was
> supposed to work?

There's a comment in htup.h which says:

 * * Separately allocated tuple: t_data points to a palloc'd chunk that
 * is not adjacent to the HeapTupleData. (This case is deprecated since
 * it's difficult to tell apart from case #1. It should be used only in
 * limited contexts where the code knows that case #1 will never apply.)

I got scared and ran away.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2019-08-26 14:09:45 -0400, Robert Haas wrote:
> On Fri, Aug 23, 2019 at 10:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Couldn't resist trying this, and it seems to work. Based on the
> > comment "the buffer size is a multiple of MAXIMUM_ALIGNOF, and each
> > read and write is as well", it should always work (though I wish
> > shm_mq_receive_bytes()'s documentation would discuss message alignment
> > explicitly if that's true). On the other hand, I doubt it makes a
> > difference, so this is more of a question: is this the way it was
> > supposed to work?
>
> There's a comment in htup.h which says:
>
> * * Separately allocated tuple: t_data points to a palloc'd chunk that
> * is not adjacent to the HeapTupleData. (This case is deprecated since
> * it's difficult to tell apart from case #1. It should be used only in
> * limited contexts where the code knows that case #1 will never apply.)
>
> I got scared and ran away.

Perhaps this could be sidestepped by funneling through MinimalTuples instead of HeapTuples. Afaict that should always be sufficient, because all system column accesses ought to happen below (including being projected into a separate column, if needed above). With the added benefit of needing less space, of course.

Greetings,

Andres Freund
On Tue, Aug 27, 2019 at 6:35 AM Andres Freund <andres@anarazel.de> wrote:
> On 2019-08-26 14:09:45 -0400, Robert Haas wrote:
> > There's a comment in htup.h which says:
> >
> > * * Separately allocated tuple: t_data points to a palloc'd chunk that
> > * is not adjacent to the HeapTupleData. (This case is deprecated since
> > * it's difficult to tell apart from case #1. It should be used only in
> > * limited contexts where the code knows that case #1 will never apply.)
> >
> > I got scared and ran away.
>
> Perhaps this could be sidestepped by funneling through MinimalTuples
> instead of HeapTuples. Afaict that should always be sufficient, because
> all system column accesses ought to happen below (including being
> projected into a separate column, if needed above). With the added
> benefit of needing less space, of course.

I tried that out (attached). That makes various simple tests like this go 10%+ faster on my development machine:

create table s as select generate_series(1, 50000000)::int i, 'hello world' a, 'this is a message' b, 42 c;
select pg_prewarm('s');
set force_parallel_mode = on;
explain analyze select * from s;

PS It looks like the following load of mq_ring_size might be running a little hot due to false sharing with the atomic counters:

    if (mqh->mqh_consume_pending > mq->mq_ring_size / 4)
Attachment
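The postscript about mq_ring_size describes a classic false-sharing pattern. A toy layout (purely illustrative, not the actual shm_mq struct, and assuming 64-byte cache lines) shows the shape of the problem and one conventional fix:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only.  If a read-mostly field (the ring size) shares a
 * cache line with counters the other process updates constantly, every
 * counter update invalidates the reader's cached copy of the line. */
struct toy_mq_hot
{
    atomic_uint_least64_t bytes_written;   /* updated often by sender */
    atomic_uint_least64_t bytes_read;      /* updated often by receiver */
    uint64_t    ring_size;                 /* read-mostly, same line */
};

/* Padding the read-mostly field onto its own (assumed 64-byte) cache
 * line separates it from the write-hot atomics. */
struct toy_mq_padded
{
    uint64_t    ring_size;
    char        pad[64 - sizeof(uint64_t)];
    atomic_uint_least64_t bytes_written;
    atomic_uint_least64_t bytes_read;
};
```

Whether padding would actually help here would need profiling; the point is only that a read-mostly field sitting beside frequently-updated atomics pays for their cache-line invalidations.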
On Thu, May 14, 2020 at 10:55 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Aug 27, 2019 at 6:35 AM Andres Freund <andres@anarazel.de> wrote:
> > Perhaps this could be sidestepped by funneling through MinimalTuples
> > instead of HeapTuples. Afaict that should always be sufficient, because
> > all system column accesses ought to happen below (including being
> > projected into a separate column, if needed above). With the added
> > benefit of needing less space, of course.

Right, create_gather[_merge]_plan() does create_plan_recurse(..., CP_EXACT_TLIST). Here's a new version that updates the comment there to note that this is not merely a good idea but a requirement, due to the MinimalTuple conveyance. (I think there may be another reason the system columns are projected away even without that, but saying so explicitly and documenting it seems useful either way.)

> I tried that out (attached). That makes various simple tests like
> this go 10%+ faster on my development machine:

I registered this patch as https://commitfest.postgresql.org/28/2560/ in case someone would like to review it.
Attachment
Hi Thomas,

+1 to the idea! I ran some experiments on both of your patches. I could reproduce the speed gain that you saw for a plan with a simple parallel sequential scan. However, I got no gain at all for a parallel hash join and parallel agg query.

-----------------------------------------------------------------------
select pg_prewarm('lineitem');
-- lineitem is 17G. (TPCH scale = 20). shared_buffers = 30G
explain analyze select * from lineitem;

[w/o any patch]               99s
[w/ first patch]              89s
[w/ last minimal tuple patch] 79s
-----------------------------------------------------------------------
select pg_prewarm('lineitem');
-- lineitem is 17G. (TPCH scale = 20). shared_buffers = 30G
explain analyze select count(*) from lineitem;

[w/o any patch]               10s
[w/ first patch]              10s
[w/ last minimal tuple patch] 10s
-----------------------------------------------------------------------
select pg_prewarm('lineitem');
select pg_prewarm('orders');
-- lineitem is 17G, orders is 4G. (TPCH scale = 20). shared_buffers = 30G
explain analyze select count(*) from lineitem join orders on l_orderkey = o_orderkey where o_totalprice > 5.00;

[w/o any patch]               54s
[w/ first patch]              53s
[w/ last minimal tuple patch] 56s
-----------------------------------------------------------------------

Maybe I'm missing something, since there should be improvements with anything that has a gather?

As for gather merge, is it possible to have a situation where the slot input to tqueueReceiveSlot() is a heap slot (as would be the case for a simple select *)? If yes, in those scenarios, we would be incurring an extra call to minimal_tuple_from_heap_tuple() because of the extra call to ExecFetchSlotMinimalTuple() inside tqueueReceiveSlot() in your patch. And since, in a gather merge, we can't avoid the copy on the leader side (heap_copy_minimal_tuple() inside gm_readnext_tuple()), we would be doing extra work in that scenario. I couldn't come up with a plan that creates a scenario like this, however.
Regards, Soumyadeep (VMware)
On Sat, Jul 11, 2020 at 1:37 PM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
> +1 to the idea! I ran some experiments on both of your patches.

Hi Soumyadeep,

Thanks for testing!

> I could reproduce the speed gain that you saw for a plan with a simple
> parallel sequential scan. However, I got no gain at all for a parallel
> hash join and parallel agg query.

Right, it's not going to make a difference when you only send one tuple through the queue, like COUNT(*) does.

> As for gather merge, is it possible to have a situation where the slot
> input to tqueueReceiveSlot() is a heap slot (as would be the case for a
> simple select *)? If yes, in those scenarios, we would be incurring an
> extra call to minimal_tuple_from_heap_tuple() because of the extra call
> to ExecFetchSlotMinimalTuple() inside tqueueReceiveSlot() in your patch.
> And since, in a gather merge, we can't avoid the copy on the leader side
> (heap_copy_minimal_tuple() inside gm_readnext_tuple()), we would be
> doing extra work in that scenario. I couldn't come up with a plan that
> creates a scenario like this however.

Hmm. I wish we had a way to do an "in-place" copy-to-minimal-tuple where the caller supplies the memory, with some fast protocol to get the size right. We could use that for copying tuples into shm queues, hash join tables etc. without an extra palloc()/pfree() and double copy.
Hey Thomas,

On Fri, Jul 10, 2020 at 7:30 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > I could reproduce the speed gain that you saw for a plan with a simple
> > parallel sequential scan. However, I got no gain at all for a parallel
> > hash join and parallel agg query.
>
> Right, it's not going to make a difference when you only send one
> tuple through the queue, like COUNT(*) does.

How silly of me! I should have paid more attention to the rows output from each worker and that there was a select count(*) on the join query. Anyway, these are a new set of results:

-----------------------------------------------------------------------
select pg_prewarm('lineitem');
select pg_prewarm('orders');
-- lineitem is 17G, orders is 4G. (TPCH scale = 20). shared_buffers = 30G
explain analyze select * from lineitem join orders on l_orderkey = o_orderkey where o_totalprice > 5.00;

[w/o any patch]               637s
[w/ first patch]              635s
[w/ last minimal tuple patch] 568s
-----------------------------------------------------------------------

We do indeed get the speedup.

> > As for gather merge, is it possible to have a situation where the slot
> > input to tqueueReceiveSlot() is a heap slot (as would be the case for a
> > simple select *)? If yes, in those scenarios, we would be incurring an
> > extra call to minimal_tuple_from_heap_tuple() because of the extra call
> > to ExecFetchSlotMinimalTuple() inside tqueueReceiveSlot() in your patch.
> > And since, in a gather merge, we can't avoid the copy on the leader side
> > (heap_copy_minimal_tuple() inside gm_readnext_tuple()), we would be
> > doing extra work in that scenario. I couldn't come up with a plan that
> > creates a scenario like this however.
>
> Hmm. I wish we had a way to do an "in-place" copy-to-minimal-tuple
> where the caller supplies the memory, with some fast protocol to get
> the size right.
> We could use that for copying tuples into shm queues,
> hash join tables etc without an extra palloc()/pfree() and double
> copy.

Do you mean that we should have an implementation for get_minimal_tuple() for the heap AM and have it return a pointer to the minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as tqueueReceiveSlot() will ensure that the heap tuple from which it wants to extract the minimal tuple was allocated in the tuple queue in the first place? If we consider that the node directly below a gather is a SeqScan, then we could possibly, in ExecInitSeqScan(), set up the ss_ScanTupleSlot to point to memory in the shared tuple queue? Similarly, other ExecInit*() methods can do the same for other executor nodes that involve parallelism? Of course, things would be slightly different for the other use cases you mentioned (such as hash table population).

All things considered, I think the patch in its current form should go in. Having the in-place copy could be done as a separate patch? Do you agree?

Regards, Soumyadeep (VMware)
On Sun, Jul 12, 2020 at 7:25 AM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote:
> Do you mean that we should have an implementation for
> get_minimal_tuple() for the heap AM and have it return a pointer to the
> minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as
> tqueueReceiveSlot() will ensure that the heap tuple from which it wants
> to extract the minimal tuple was allocated in the tuple queue in the
> first place? If we consider that the node directly below a gather is a
> SeqScan, then we could possibly, in ExecInitSeqScan() set-up the
> ss_ScanTupleSlot to point to memory in the shared tuple queue?
> Similarly, other ExecInit*() methods can do the same for other executor
> nodes that involve parallelism? Of course, things would be slightly
> different for the other use cases you mentioned (such as hash table
> population)

What I mean is that where ExecHashTableInsert() and tqueueReceiveSlot() do ExecFetchSlotMinimalTuple(), you usually get a freshly allocated copy, and then you copy that again, and free it. There may be something similar going on in tuplestore and sort code. Perhaps we could have something like ExecFetchSlotMinimalTupleInPlace(slot, output_buffer, output_buffer_size) that returns a value that indicates either success or hey-that-buffer's-too-small-I-need-N-bytes, or something like that. That silly extra copy is something Andres pointed out to me in some perf results involving TPCH hash joins, a couple of years ago.

> All things considered, I think the patch in its current form should go
> in.

Thanks for the testing and review! Pushed.

> Having the in-place copy, could be done as a separate patch? Do you
> agree?

Yeah.
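The "success or buffer-too-small" protocol sketched above might look like the following toy version (the function name and return convention are hypothetical, and the "tuple" here is just a byte blob, not PostgreSQL's real MinimalTuple format): the callee either writes into the caller's buffer or reports how many bytes it needs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical return convention: 0 on success; otherwise the number
 * of bytes the caller must supply before retrying. */
static size_t
toy_fetch_tuple_in_place(const char *src, size_t src_len,
                         char *out, size_t out_size)
{
    if (src_len > out_size)
        return src_len;         /* "that buffer's too small, I need N bytes" */
    memcpy(out, src, src_len);  /* form/copy directly into caller memory */
    return 0;
}
```

A caller would try with whatever buffer it already has (say, space reserved in a shared memory queue), and only on a nonzero result grow the buffer and retry, avoiding the palloc/copy/pfree of a temporary tuple in the common case.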
Hi,

I am starting a new thread that continues with the following point that was discussed in [1] ....

On Fri, 17 Jul 2020 at 09:05, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sun, Jul 12, 2020 at 7:25 AM Soumyadeep Chakraborty
> <soumyadeep2007@gmail.com> wrote:
> > Do you mean that we should have an implementation for
> > get_minimal_tuple() for the heap AM and have it return a pointer to the
> > minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as
> > tqueueReceiveSlot() will ensure that the heap tuple from which it wants
> > to extract the minimal tuple was allocated in the tuple queue in the
> > first place? If we consider that the node directly below a gather is a
> > SeqScan, then we could possibly, in ExecInitSeqScan() set-up the
> > ss_ScanTupleSlot to point to memory in the shared tuple queue?
> > Similarly, other ExecInit*() methods can do the same for other executor
> > nodes that involve parallelism? Of course, things would be slightly
> > different for the other use cases you mentioned (such as hash table
> > population)
>
> What I mean is that where ExecHashTableInsert() and
> tqueueReceiveSlot() do ExecFetchSlotMinimalTuple(), you usually get a
> freshly allocated copy, and then you copy that again, and free it.
> There may be something similar going on in tuplestore and sort code.
> Perhaps we could have something like
> ExecFetchSlotMinimalTupleInPlace(slot, output_buffer,
> output_buffer_size) that returns a value that indicates either success
> or hey-that-buffer's-too-small-I-need-N-bytes, or something like that.
> That silly extra copy is something Andres pointed out to me in some
> perf results involving TPCH hash joins, a couple of years ago.

I went ahead and tried doing this. I chose an approach where we can return the pointer to the in-place minimal tuple data if it's a heap/buffer/minimal tuple slot. A new function ExecFetchSlotMinimalTupleData() returns in-place minimal tuple data.
If it's neither heap, buffer or minimal tuple, it returns a copy as usual. The receiver should not assume the data is directly taken from MinimalTupleData, so it should set its t_len to the number of bytes returned. Patch attached (0001-Avoid-redundant-tuple-copy-while-sending-tuples-to-G.patch)

Thomas, I guess you had a different approach in mind when you said about "returning either success or hey-that-buffer's-too-small-I-need-N-bytes". But what looks clear to me is that avoiding the copy shows a consistent improvement of 4 to 10% for simple parallel table scans. I tried my patch on both x86_64 and arm64, and observed this speedup on both.

create table tab as select generate_series(1, 20000000) id, 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' v;
select pg_prewarm('tab'::regclass);
explain analyze select * from tab where id % 2 = 0;

Times in milliseconds:
HEAD    : 1833.119 1816.522 1875.648 1872.153 1834.383
Patched : 1763.786 1721.160 1714.665 1719.738 1704.478

This was with the default 2 parallel workers. With 3 or 4 workers, for the above test case I didn't see a noticeable difference. I think, if I double the number of rows, the difference will be noticeable. In any case, the gain would go on reducing with the number of workers, because the tuple copy also gets parallelized. In some scenarios, parallel_leader_participation=off causes the difference to amplify. Haven't had a chance to see if this helps any of the TPC-H queries.

Also attached is a patch, guc_for_testing.patch, that I used for measuring the gain. This patch is only for testing: without it, comparing the performance figures requires a server restart, and the figures anyway shift back and forth by 5-15 percent after each restart, which creates a lot of noise when comparing figures with and without the fix. Use the GUC enable_fix to enable/disable the fix.
[1] https://www.postgresql.org/message-id/CA%2BhUKGLrN2M18-hACEJbNoj2sn_WoUj9rkkBeoPK7SY427pAnA%40mail.gmail.com -- Thanks, -Amit Khandekar Huawei Technologies
Attachment
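The receiver-side fixup described in this patch (the length travels with the shm_mq message, so the receiver stamps it back into the tuple header) amounts to something like the following toy model; the struct is a simplified stand-in for MinimalTupleData, not the real definition.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for MinimalTupleData: a length header followed by the
 * tuple body. */
typedef struct toy_min_tuple
{
    uint32_t    t_len;
    char        body[];
} toy_min_tuple;

/* Receiver side: the byte count reported by the queue is authoritative,
 * so write it into the header before using the tuple.  Note that this
 * scribbles on the message buffer itself, which is the part that drew
 * objections when that buffer lives in shared memory. */
static toy_min_tuple *
toy_receive_fixup(void *data, uint32_t nbytes)
{
    toy_min_tuple *tup = (toy_min_tuple *) data;
    tup->t_len = nbytes;
    return tup;
}
```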
On Wed, Sep 9, 2020 at 5:23 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> I went ahead and tried doing this. I chose an approach where we can
> return the pointer to the in-place minimal tuple data if it's a
> heap/buffer/minimal tuple slot. A new function
> ExecFetchSlotMinimalTupleData() returns in-place minimal tuple data.
> If it's neither heap, buffer or minimal tuple, it returns a copy as
> usual. The receiver should not assume the data is directly taken from
> MinimalTupleData, so it should set its t_len to the number of bytes
> returned. Patch attached
> (0001-Avoid-redundant-tuple-copy-while-sending-tuples-to-G.patch)

+char *
+ExecFetchSlotMinimalTupleData(TupleTableSlot *slot, uint32 *len,
+                              bool *shouldFree)

Interesting approach. It's a bit of a weird interface, returning a pointer to a non-valid MinimalTuple that requires extra tweaking after you copy it to make it a valid one and that you're not allowed to tweak in-place. I'd probably make that return "const void *" just for a bit of extra documentation. I wonder if there is a way we could make "Minimal Tuples but with the length travelling separately (and perhaps chopped off?)" into a first-class concept... It's also a shame to be schlepping a bunch of padding bytes around.

    tuple = (MinimalTuple) data;
-   Assert(tuple->t_len == nbytes);
+   tuple->t_len = nbytes;

Hmm, so you have to scribble on shared memory on the receiving side. I wondered about a couple of different ways to share the length field with the shm_mq envelope, but that all seems a bit too weird...

> Thomas, I guess you had a different approach in mind when you said
> about "returning either success or
> hey-that-buffer's-too-small-I-need-N-bytes". But what looks clear to

Yeah I tried some things like that, but I wasn't satisfied with any of them; basically the extra work involved in negotiating the size was a bit too high.
On the other hand, because my interface was "please write a MinimalTuple here!", it had the option to *form* a MinimalTuple directly in place, whereas your approach can only avoid creating and destroying a temporary tuple when the source is a heap tuple.

> me is that avoiding the copy shows consistent improvement of 4 to 10%
> for simple parallel table scans. I tried my patch on both x86_64 and
> arm64, and observed this speedup on both.

I think that's a great validation of the goal, but I hope we can figure out a way that avoids the temporary tuple for more cases. FWIW I saw hash self joins running a couple of percent faster with one of my abandoned patches.
Hi,

On 2020-09-17 14:20:50 +1200, Thomas Munro wrote:
> I wonder if there is a way we could make "Minimal Tuples but with the
> length travelling separately (and perhaps chopped off?)" into a
> first-class concept... It's also a shame to be schlepping a bunch of
> padding bytes around.

There really is no justification for having MinimalTuples, as we have them today at least, anymore. We used to rely on being able to construct pointers to MinimalTuples that are mostly compatible with HeapTuple. But I think there's none of those left since PG 12.

I think it'd make a bit more sense to take some steps towards a more suitable "minimal" tuple representation, rather than doing these local, pretty ugly hacks. A good way would be to just start removing the padding, unnecessary fields etc. from MinimalTuple.

I also think that it'd be good to look at a few of the other places that are heavily bottlenecked by MinimalTuple overhead before designing new API around this. IIRC it's e.g. very easy to see hash joins spending a lot of time doing MinimalTuple copies & conversions.

> >     tuple = (MinimalTuple) data;
> > -   Assert(tuple->t_len == nbytes);
> > +   tuple->t_len = nbytes;
>
> Hmm, so you have to scribble on shared memory on the receiving side.

Ick, I would really like to avoid this.

> > Thomas, I guess you had a different approach in mind when you said
> > about "returning either success or
> > hey-that-buffer's-too-small-I-need-N-bytes". But what looks clear to
>
> Yeah I tried some things like that, but I wasn't satisfied with any of
> them; basically the extra work involved in negotiating the size was a
> bit too high. On the other hand, because my interface was "please
> write a MinimalTuple here!", it had the option to *form* a
> MinimalTuple directly in place, whereas your approach can only avoid
> creating and destroying a temporary tuple when the source is a heap
> tuple.
There are a lot of cases where the source is a virtual slot (since we'll often project stuff below Gather), so it'd be quite advantageous to avoid building an unnecessary HeapTuple in those cases.

I wonder if it would be sensible to build minimal tuples using tts_values/isnull in some cases. This might even be advantageous in the case of heap / minimal tuples, because IIRC right now the code will materialize the slot unnecessarily. Not sure how common that is.

Greetings,

Andres Freund
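Forming a tuple directly from a slot's values/isnull arrays could look like this toy two-pass scheme (a deliberately simplified serialization with fixed 8-byte values plus a null bitmap, not PostgreSQL's real minimal tuple layout, and all names invented): pass one sizes the tuple, pass two writes it into caller-supplied memory, so no intermediate HeapTuple or palloc'd copy is built.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy format: a null bitmap, then one 8-byte value per non-null
 * attribute.  Real minimal tuples are considerably more involved. */
static size_t
toy_tuple_size(int natts, const uint64_t *values, const char *isnull)
{
    size_t size = (size_t) (natts + 7) / 8;     /* null bitmap bytes */

    for (int i = 0; i < natts; i++)
        if (!isnull[i])
            size += sizeof(uint64_t);
    return size;
}

/* Write the tuple into caller memory; the caller guarantees the buffer
 * holds at least toy_tuple_size() bytes (e.g. space reserved directly
 * in a shared memory queue). */
static void
toy_tuple_form(int natts, const uint64_t *values, const char *isnull,
               char *out)
{
    size_t bitmap_len = (size_t) (natts + 7) / 8;
    char   *cursor = out + bitmap_len;

    memset(out, 0, bitmap_len);
    for (int i = 0; i < natts; i++)
    {
        if (isnull[i])
            out[i / 8] |= (char) (1 << (i % 8));
        else
        {
            memcpy(cursor, &values[i], sizeof(uint64_t));
            cursor += sizeof(uint64_t);
        }
    }
}
```

The same size-then-form pair also matches the "success or I-need-N-bytes" interface discussed earlier in the thread: the sizing pass tells the caller how much queue space to reserve before the forming pass runs.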
On Thu, 17 Sep 2020 at 08:55, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-09-17 14:20:50 +1200, Thomas Munro wrote:
> > I wonder if there is a way we could make "Minimal Tuples but with the
> > length travelling separately (and perhaps chopped off?)" into a
> > first-class concept... It's also a shame to be schlepping a bunch of
> > padding bytes around.

Yeah, I think we could pass the length separately, but since the receiving end already assumes that the received data is a minimal tuple, I thought why not skip passing this redundant component. But anyway, if you and Andres are suggesting that being able to skip the copy is important for virtual tuples as well, then I think the approach you suggested (supplying allocated memory to the tuple API for conversion) would be one of the better options we have, if not the only good one. Maybe I will try looking into the shm_mq workings to see if we can come up with a good solution.

> There really is no justification for having MinimalTuples, as we have
> them today at least, anymore. We used to rely on being able to construct
> pointers to MinimalTuples that are mostly compatible with HeapTuple. But
> I think there's none of those left since PG 12.

Ah, ok.

> I think it'd make a bit more sense to take some steps towards a more
> suitable "minimal" tuple representation, rather than doing these
> local, pretty ugly hacks. A good way would be to just start removing
> the padding, unnecessary fields etc. from MinimalTuple.

So there are two things we wish to do:

1. Prevent an extra tuple-forming step before sending minimal tuple data. Possibly devise an shm_mq API to get memory to write a tuple of a given length into, and devise something like FormMinimalTupleDataInHere(memory_allocated_by_shm_mq) which will write the minimal tuple data there.

2.
Shrink the MinimalTupleData structure, because it no longer needs the current padding etc., and substitute this new MinimalTuple structure for the current one everywhere in the code it is currently used.

If we remove the unnecessary fields from the tuple data being sent to the Gather node, then we need to form a MinimalTuple again at the receiving end, which again adds an extra tuple-forming step. So I understand that's the reason why you are saying we should shrink the MinimalTupleData structure itself, in which case we will continue to use the received new MinimalTupleData as an already-formed tuple, like how we are doing now.

Now, the above two things (1 and 2) look independent to me. Suppose we first do 1, i.e. we come up with a good way to form an in-place MinimalTuple at the sender's end, without any change to the MinimalTupleData. Then when we do 2, i.e. shrink the MinimalTupleData, we won't require any change in the in-place-tuple-forming API we wrote in 1; just the existing underlying function heap_form_minimal_tuple() or something similar might need to be changed. At least that's what I feel right now.

> I also think that it'd be good to look at a few of the other places that
> are heavily bottlenecked by MinimalTuple overhead before designing new
> API around this. IIRC it's e.g. very easy to see hash joins spending a
> lot of time doing MinimalTuple copies & conversions.

Yeah, makes sense. The above FormMinimalTupleDataInHere() should be usable for these other places as well. Will keep that in mind.

> > > Thomas, I guess you had a different approach in mind when you said
> > > about "returning either success or
> > > hey-that-buffer's-too-small-I-need-N-bytes". But what looks clear to
> >
> > Yeah I tried some things like that, but I wasn't satisfied with any of
> > them; basically the extra work involved in negotiating the size was a
> > bit too high.

Hmm, ok. Let me see if there is any way around this.
> > On the other hand, because my interface was "please
> > write a MinimalTuple here!", it had the option to *form* a
> > MinimalTuple directly in place, whereas your approach can only avoid
> > creating and destroying a temporary tuple when the source is a heap
> > tuple.

True.

> There's a lot of cases where the source is a virtual slot (since we'll
> often project stuff below Gather). So it'd be quite advantageous to
> avoid building an unnecessary HeapTuple in those cases.

Yeah right.
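The "get memory from the queue, then form the tuple in it" idea in step 1 above could be sketched as a reserve/commit pair on a ring buffer (a hypothetical API with invented names, single-threaded, ignoring wraparound and the sender/receiver handshake that the real shm_mq must handle):

```c
#include <assert.h>
#include <stddef.h>

/* Toy single-producer ring: reserve space, let the caller form the
 * tuple directly into it, then commit.  Wraparound, alignment, and
 * concurrency control are all omitted for clarity. */
typedef struct toy_queue
{
    char        buf[1024];
    size_t      write_off;
} toy_queue;

/* Hand out a pointer into the queue, or NULL if it won't fit. */
static void *
toy_reserve(toy_queue *q, size_t len)
{
    if (q->write_off + len > sizeof(q->buf))
        return NULL;            /* queue full, in this toy model */
    return q->buf + q->write_off;
}

/* After the caller has formed the tuple in place, publish it. */
static void
toy_commit(toy_queue *q, size_t len)
{
    q->write_off += len;        /* makes the bytes visible to a reader */
}
```

A hypothetical FormMinimalTupleDataInHere() would then be called with the pointer returned by the reserve step, so the tuple is formed exactly once, directly in queue memory.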