Thread: Does TupleQueueReaderNext() really need to copy its result?

Does TupleQueueReaderNext() really need to copy its result?

From
Thomas Munro
Date:
Hi,

A comment in tqueue.c says that the bytes returned by shm_mq_receive()
"had better be sufficiently aligned" before the pointer is assigned to
htup.t_data.  Then it copies htup and returns the copy (the earlier
version that had all the remapping stuff did the same, but it sometimes
deformed the tuple directly, so in that case it really did need to be
suitably aligned, IIUC).

Given that shm_mq.c proudly documents that it avoids copying the data
on the receiving side (unless it has to reconstruct a message that was
split up), and given that it promises that the pointed-to data remains
valid until your next call, it seems that it should be safe to return
a pointer to the same HeapTupleData object every time (perhaps a
member of the TupleQueueReader struct) and just adjust its t_data and
t_len members every time, so that the gather node emits tuples
directly from the shared memory queue (and then of course tell the
slot not to pfree()).  Alternatively, if the value returned by
shm_mq_receive() is not really suitably aligned, then the comment is a
bit misleading.
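
Assuming the alignment claim holds, here is roughly what I have in
mind.  This is only a sketch: the extra struct member and the error
handling details are invented for illustration, not taken from any
patch.

    struct TupleQueueReader
    {
        shm_mq_handle *queue;   /* shm_mq to receive from */
        HeapTupleData tuple;    /* reusable header (proposed new member) */
    };

    HeapTuple
    TupleQueueReaderNext(TupleQueueReader *reader, bool nowait, bool *done)
    {
        shm_mq_result result;
        Size        nbytes;
        void       *data;

        if (done != NULL)
            *done = false;

        result = shm_mq_receive(reader->queue, &nbytes, &data, nowait);
        if (result == SHM_MQ_DETACHED)
        {
            if (done != NULL)
                *done = true;
            return NULL;
        }
        if (result != SHM_MQ_SUCCESS)
            return NULL;        /* SHM_MQ_WOULD_BLOCK */

        /*
         * Point the reusable header at the bytes still sitting in the queue.
         * They remain valid until the next shm_mq_receive() call and should
         * be MAXALIGNed; the caller must not pfree() the result.
         */
        reader->tuple.t_len = nbytes;
        reader->tuple.t_data = (HeapTupleHeader) data;
        ItemPointerSetInvalid(&reader->tuple.t_self);
        reader->tuple.t_tableOid = InvalidOid;

        return &reader->tuple;
    }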

-- 
Thomas Munro
https://enterprisedb.com



Re: Does TupleQueueReaderNext() really need to copy its result?

From
Thomas Munro
Date:
On Thu, Aug 22, 2019 at 12:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Given that shm_mq.c proudly documents that it avoids copying the data
> on the receiving side (unless it has to reconstruct a message that was
> split up), and given that it promises that the pointed-to data remains
> valid until your next call, it seems that it should be safe to return
> a pointer to the same HeapTupleData object every time (perhaps a
> member of the TupleQueueReader struct) and just adjust its t_data and
> t_len members every time, so that the gather node emits tuples
> directly from the shared memory queue (and then of course tell the
> slot not to pfree()).  Alternatively, if the value returned by
> shm_mq_receive() is not really suitably aligned, then the comment is a
> bit misleading.

Couldn't resist trying this, and it seems to work.  Based on the
comment "the buffer size is a multiple of MAXIMUM_ALIGNOF, and each
read and write is as well", it should always work (though I wish
shm_mq_receive_bytes()'s documentation would discuss message alignment
explicitly if that's true).  On the other hand, I doubt it makes a
difference, so this is more of a question: is this the way it was
supposed to work?

(It was ~40% faster at shunting a large SELECT * through the queue
with asserts enabled, which made me happy for a moment, but that was
only an illusion: the difference wasn't measurable above the noise
without asserts).

-- 
Thomas Munro
https://enterprisedb.com

Attachment

Re: Does TupleQueueReaderNext() really need to copy its result?

From
Robert Haas
Date:
On Fri, Aug 23, 2019 at 10:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Couldn't resist trying this, and it seems to work.  Based on the
> comment "the buffer size is a multiple of MAXIMUM_ALIGNOF, and each
> read and write is as well", it should always work (though I wish
> shm_mq_receive_bytes()'s documentation would discuss message alignment
> explicitly if that's true).  On the other hand, I doubt it makes a
> difference, so this is more of a question: is this the way it was
> supposed to work?

There's a comment in htup.h which says:

 * * Separately allocated tuple: t_data points to a palloc'd chunk that
 *       is not adjacent to the HeapTupleData.  (This case is deprecated since
 *       it's difficult to tell apart from case #1.  It should be used only in
 *       limited contexts where the code knows that case #1 will never apply.)

I got scared and ran away.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Does TupleQueueReaderNext() really need to copy its result?

From
Andres Freund
Date:
Hi,

On 2019-08-26 14:09:45 -0400, Robert Haas wrote:
> On Fri, Aug 23, 2019 at 10:22 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Couldn't resist trying this, and it seems to work.  Based on the
> > comment "the buffer size is a multiple of MAXIMUM_ALIGNOF, and each
> > read and write is as well", it should always work (though I wish
> > shm_mq_receive_bytes()'s documentation would discuss message alignment
> > explicitly if that's true).  On the other hand, I doubt it makes a
> > difference, so this is more of a question: is this the way it was
> > supposed to work?
> 
> There's a comment in htup.h which says:
> 
>  * * Separately allocated tuple: t_data points to a palloc'd chunk that
>  *       is not adjacent to the HeapTupleData.  (This case is deprecated since
>  *       it's difficult to tell apart from case #1.  It should be used only in
>  *       limited contexts where the code knows that case #1 will never apply.)
> 
> I got scared and ran away.

Perhaps this could be sidestepped by funneling through MinimalTuples
instead of HeapTuples. Afaict that should always be sufficient, because
all system column accesses ought to happen below (including being
projected into a separate column, if needed above). With the added
benefit of needing less space, of course.
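
On the sending side that would look roughly like this (just a sketch,
simplified from the real tqueueReceiveSlot(); the receiving side would
then hand the queue's bytes to ExecStoreMinimalTuple() instead of
building a HeapTuple):

    static bool
    tqueueReceiveSlot(TupleTableSlot *slot, DestReceiver *self)
    {
        TQueueDestReceiver *tqueue = (TQueueDestReceiver *) self;
        MinimalTuple tuple;
        shm_mq_result result;
        bool        should_free;

        /* Get a MinimalTuple; a fresh copy only if the slot can't lend one. */
        tuple = ExecFetchSlotMinimalTuple(slot, &should_free);

        /* Ship the whole MinimalTuple, length word included. */
        result = shm_mq_send(tqueue->queue, tuple->t_len, tuple, false);

        if (should_free)
            pfree(tuple);

        return result == SHM_MQ_SUCCESS;
    }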

Greetings,

Andres Freund



Re: Does TupleQueueReaderNext() really need to copy its result?

From
Thomas Munro
Date:
On Tue, Aug 27, 2019 at 6:35 AM Andres Freund <andres@anarazel.de> wrote:
> On 2019-08-26 14:09:45 -0400, Robert Haas wrote:
> > There's a comment in htup.h which says:
> >
> >  * * Separately allocated tuple: t_data points to a palloc'd chunk that
> >  *       is not adjacent to the HeapTupleData.  (This case is deprecated since
> >  *       it's difficult to tell apart from case #1.  It should be used only in
> >  *       limited contexts where the code knows that case #1 will never apply.)
> >
> > I got scared and ran away.
>
> Perhaps this could be sidestepped by funneling through MinimalTuples
> instead of HeapTuples. Afaict that should always be sufficient, because
> all system column accesses ought to happen below (including being
> projected into a separate column, if needed above). With the added
> benefit of needing less space, of course.

I tried that out (attached).  That makes various simple tests like
this go 10%+ faster on my development machine:

  create table s as select generate_series(1, 50000000)::int i,
                           'hello world' a,
                           'this is a message' b,
                           42 c;
  select pg_prewarm('s');
  set force_parallel_mode = on;

  explain analyze select * from s;

PS  It looks like the following load of mq_ring_size might be running
a little hot due to false sharing with the atomic counters:

       if (mqh->mqh_consume_pending > mq->mq_ring_size / 4)

Attachment

Re: Does TupleQueueReaderNext() really need to copy its result?

From
Thomas Munro
Date:
On Thu, May 14, 2020 at 10:55 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Aug 27, 2019 at 6:35 AM Andres Freund <andres@anarazel.de> wrote:
> > Perhaps this could be sidestepped by funneling through MinimalTuples
> > instead of HeapTuples. Afaict that should always be sufficient, because
> > all system column accesses ought to happen below (including being
> > projected into a separate column, if needed above). With the added
> > benefit of needing less space, of course.

Right, create_gather[_merge]_plan() does create_plan_recurse(...,
CP_EXACT_TLIST).  Here's a new version that updates the comment there
to note that this is not merely a good idea but a requirement, due to
the MinimalTuple conveyance.  (I think there may be another reason the
system columns are projected away even without that, but saying so
explicitly and documenting it seems useful either way).

> I tried that out (attached).  That makes various simple tests like
> this go 10%+ faster on my development machine:

I registered this patch as https://commitfest.postgresql.org/28/2560/
in case someone would like to review it.

Attachment

Re: Does TupleQueueReaderNext() really need to copy its result?

From
Soumyadeep Chakraborty
Date:
Hi Thomas,

+1 to the idea! I ran some experiments on both of your patches.

I could reproduce the speed gain that you saw for a plan with a simple
parallel sequential scan. However, I got no gain at all for a parallel
hash join and parallel agg query.

-----------------------------------------------------------------------

select pg_prewarm('lineitem');
-- lineitem is 17G. (TPCH scale = 20). shared_buffers = 30G
explain analyze select * from lineitem;

[w/o any patch]               99s
[w/ first patch]              89s
[w/ last minimal tuple patch] 79s

-----------------------------------------------------------------------

select pg_prewarm('lineitem');
-- lineitem is 17G. (TPCH scale = 20). shared_buffers = 30G
explain analyze select count(*) from lineitem;

[w/o any patch]               10s
[w/ first patch]              10s
[w/ last minimal tuple patch] 10s

-----------------------------------------------------------------------

select pg_prewarm('lineitem');
select pg_prewarm('orders');
-- lineitem is 17G, orders is 4G. (TPCH scale = 20). shared_buffers = 30G

explain analyze select count(*)
  from lineitem
  join orders on l_orderkey = o_orderkey
 where o_totalprice > 5.00;

[w/o any patch]               54s
[w/ first patch]              53s
[w/ last minimal tuple patch] 56s

-----------------------------------------------------------------------

Maybe I'm missing something, since I expected improvements with
anything that has a Gather?

As for gather merge, is it possible to have a situation where the slot
input to tqueueReceiveSlot() is a heap slot (as would be the case for a
simple select *)? If yes, in those scenarios, we would be incurring an
extra call to minimal_tuple_from_heap_tuple() because of the extra call
to ExecFetchSlotMinimalTuple() inside tqueueReceiveSlot() in your patch.
And since, in a gather merge, we can't avoid the copy on the leader side
(heap_copy_minimal_tuple() inside gm_readnext_tuple()), we would be
doing extra work in that scenario. I couldn't come up with a plan that
creates a scenario like this however.

Regards,
Soumyadeep (VMware)



Re: Does TupleQueueReaderNext() really need to copy its result?

From
Thomas Munro
Date:
On Sat, Jul 11, 2020 at 1:37 PM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:
> +1 to the idea! I ran some experiments on both of your patches.

Hi Soumyadeep,

Thanks for testing!

> I could reproduce the speed gain that you saw for a plan with a simple
> parallel sequential scan. However, I got no gain at all for a parallel
> hash join and parallel agg query.

Right, it's not going to make a difference when you only send one
tuple through the queue, like COUNT(*) does.

> As for gather merge, is it possible to have a situation where the slot
> input to tqueueReceiveSlot() is a heap slot (as would be the case for a
> simple select *)? If yes, in those scenarios, we would be incurring an
> extra call to minimal_tuple_from_heap_tuple() because of the extra call
> to ExecFetchSlotMinimalTuple() inside tqueueReceiveSlot() in your patch.
> And since, in a gather merge, we can't avoid the copy on the leader side
> (heap_copy_minimal_tuple() inside gm_readnext_tuple()), we would be
> doing extra work in that scenario. I couldn't come up with a plan that
> creates a scenario like this however.

Hmm.  I wish we had a way to do an "in-place" copy-to-minimal-tuple
where the caller supplies the memory, with some fast protocol to get
the size right.  We could use that for copying tuples into shm queues,
hash join tables etc without an extra palloc()/pfree() and double
copy.



Re: Does TupleQueueReaderNext() really need to copy its result?

From
Soumyadeep Chakraborty
Date:
Hey Thomas,

On Fri, Jul 10, 2020 at 7:30 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> > I could reproduce the speed gain that you saw for a plan with a simple
> > parallel sequential scan. However, I got no gain at all for a parallel
> > hash join and parallel agg query.
>
> Right, it's not going to make a difference when you only send one
> tuple through the queue, like COUNT(*) does.

How silly of me! I should have paid more attention to the rows output
by each worker, and to the fact that there was a select count(*) on the
join query. Anyway, here is a new set of results:

-----------------------------------------------------------------------

select pg_prewarm('lineitem');
select pg_prewarm('orders');
-- lineitem is 17G, orders is 4G. (TPCH scale = 20). shared_buffers = 30G

explain analyze select *
  from lineitem
  join orders on l_orderkey = o_orderkey
 where o_totalprice > 5.00;

[w/o any patch]               637s
[w/ first patch]              635s
[w/ last minimal tuple patch] 568s

-----------------------------------------------------------------------

We do indeed get the speedup.

> > As for gather merge, is it possible to have a situation where the slot
> > input to tqueueReceiveSlot() is a heap slot (as would be the case for a
> > simple select *)? If yes, in those scenarios, we would be incurring an
> > extra call to minimal_tuple_from_heap_tuple() because of the extra call
> > to ExecFetchSlotMinimalTuple() inside tqueueReceiveSlot() in your patch.
> > And since, in a gather merge, we can't avoid the copy on the leader side
> > (heap_copy_minimal_tuple() inside gm_readnext_tuple()), we would be
> > doing extra work in that scenario. I couldn't come up with a plan that
> > creates a scenario like this however.
>
> Hmm.  I wish we had a way to do an "in-place" copy-to-minimal-tuple
> where the caller supplies the memory, with some fast protocol to get
> the size right.  We could use that for copying tuples into shm queues,
> hash join tables etc without an extra palloc()/pfree() and double
> copy.

Do you mean that we should have an implementation for
get_minimal_tuple() for the heap AM and have it return a pointer to the
minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as
tqueueReceiveSlot() will ensure that the heap tuple from which it wants
to extract the minimal tuple was allocated in the tuple queue in the
first place? If we consider that the node directly below a gather is a
SeqScan, then we could possibly, in ExecInitSeqScan(), set up the
ss_ScanTupleSlot to point to memory in the shared tuple queue?
Similarly, other ExecInit*() methods can do the same for other executor
nodes that involve parallelism? Of course, things would be slightly
different for the other use cases you mentioned (such as hash table
population).

All things considered, I think the patch in its current form should go
in. Having the in-place copy could be done as a separate patch? Do you
agree?

Regards,
Soumyadeep (VMware)



Re: Does TupleQueueReaderNext() really need to copy its result?

From
Thomas Munro
Date:
On Sun, Jul 12, 2020 at 7:25 AM Soumyadeep Chakraborty
<soumyadeep2007@gmail.com> wrote:
> Do you mean that we should have an implementation for
> get_minimal_tuple() for the heap AM and have it return a pointer to the
> minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as
> tqueueReceiveSlot() will ensure that the heap tuple from which it wants
> to extract the minimal tuple was allocated in the tuple queue in the
> first place? If we consider that the node directly below a gather is a
> SeqScan, then we could possibly, in ExecInitSeqScan() set-up the
> ss_ScanTupleSlot to point to memory in the shared tuple queue?
> Similarly, other ExecInit*() methods can do the same for other executor
> nodes that involve parallelism? Of course, things would be slightly
> different for
> the other use cases you mentioned (such as hash table population)

What I mean is that where ExecHashTableInsert() and
tqueueReceiveSlot() do ExecFetchSlotMinimalTuple(), you usually get a
freshly allocated copy, and then you copy that again, and free it.
There may be something similar going on in tuplestore and sort code.
Perhaps we could have something like
ExecFetchSlotMinimalTupleInPlace(slot, output_buffer,
output_buffer_size) that returns a value that indicates either success
or hey-that-buffer's-too-small-I-need-N-bytes, or something like that.
That silly extra copy is something Andres pointed out to me in some
perf results involving TPCH hash joins, a couple of years ago.
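
To make the interface idea concrete, it could look something like the
following.  The name is from the sentence above, but the exact return
convention is hypothetical:

    /*
     * Form/copy a MinimalTuple directly into caller-supplied memory.
     * Returns 0 on success, or the number of bytes needed if
     * output_buffer_size was too small.
     */
    extern Size ExecFetchSlotMinimalTupleInPlace(TupleTableSlot *slot,
                                                 void *output_buffer,
                                                 Size output_buffer_size);

    /* A caller (shm queue, hash table, tuplestore, ...) would retry once: */
    needed = ExecFetchSlotMinimalTupleInPlace(slot, buf, bufsize);
    if (needed != 0)
    {
        buf = repalloc(buf, needed);
        bufsize = needed;
        (void) ExecFetchSlotMinimalTupleInPlace(slot, buf, bufsize);
    }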

> All things considered, I think the patch in its current form should go
> in.

Thanks for the testing and review!  Pushed.

> Having the in-place copy could be done as a separate patch? Do you
> agree?

Yeah.



Redundant tuple copy in tqueueReceiveSlot()

From
Amit Khandekar
Date:
Hi,

I am starting a new thread that continues with the following point
that was discussed in [1] ....

On Fri, 17 Jul 2020 at 09:05, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sun, Jul 12, 2020 at 7:25 AM Soumyadeep Chakraborty
> <soumyadeep2007@gmail.com> wrote:
> > Do you mean that we should have an implementation for
> > get_minimal_tuple() for the heap AM and have it return a pointer to the
> > minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as
> > tqueueReceiveSlot() will ensure that the heap tuple from which it wants
> > to extract the minimal tuple was allocated in the tuple queue in the
> > first place? If we consider that the node directly below a gather is a
> > SeqScan, then we could possibly, in ExecInitSeqScan() set-up the
> > ss_ScanTupleSlot to point to memory in the shared tuple queue?
> > Similarly, other ExecInit*() methods can do the same for other executor
> > nodes that involve parallelism? Of course, things would be slightly
> > different for
> > the other use cases you mentioned (such as hash table population)
>
> What I mean is that where ExecHashTableInsert() and
> tqueueReceiveSlot() do ExecFetchSlotMinimalTuple(), you usually get a
> freshly allocated copy, and then you copy that again, and free it.
> There may be something similar going on in tuplestore and sort code.
> Perhaps we could have something like
> ExecFetchSlotMinimalTupleInPlace(slot, output_buffer,
> output_buffer_size) that returns a value that indicates either success
> or hey-that-buffer's-too-small-I-need-N-bytes, or something like that.
> That silly extra copy is something Andres pointed out to me in some
> perf results involving TPCH hash joins, a couple of years ago.

I went ahead and tried doing this. I chose an approach where we can
return a pointer to the in-place minimal tuple data if it's a
heap/buffer/minimal tuple slot. A new function
ExecFetchSlotMinimalTupleData() returns in-place minimal tuple data.
If it's neither a heap, buffer, nor minimal tuple slot, it returns a
copy as usual. The receiver should not assume the data is directly
taken from a MinimalTupleData, so it should set its t_len to the number
of bytes returned. Patch attached
(0001-Avoid-redundant-tuple-copy-while-sending-tuples-to-G.patch)
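
Roughly, the sending side then does something like this (a simplified
sketch of the idea, not the exact patch hunk):

    char       *data;
    uint32      len;
    bool        shouldFree;

    /* Returns a pointer to the in-place minimal tuple data where possible. */
    data = ExecFetchSlotMinimalTupleData(slot, &len, &shouldFree);
    result = shm_mq_send(tqueue->queue, len, data, false);
    if (shouldFree)
        pfree(data);

On the receiving side, the bytes are cast to MinimalTuple and t_len is
set to the number of bytes received, since the sender may have pointed
straight into a HeapTuple rather than a proper MinimalTupleData.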

Thomas, I guess you had a different approach in mind when you talked
about "returning either success or
hey-that-buffer's-too-small-I-need-N-bytes".  But what looks clear to
me is that avoiding the copy shows a consistent improvement of 4 to 10%
for simple parallel table scans. I tried my patch on both x86_64 and
arm64, and observed this speedup on both.

create table tab as select generate_series(1, 20000000) id,
'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' v;
select pg_prewarm('tab'::regclass);
explain analyze select * from tab where id %2 = 0;
Times in milliseconds:
HEAD:    1833.119  1816.522  1875.648  1872.153  1834.383
Patched: 1763.786  1721.160  1714.665  1719.738  1704.478
This was with the default of 2 parallel workers. With 3 or 4 workers I
didn't see a noticeable difference for the above test case; I think the
difference would become noticeable if I doubled the number of rows. In
any case, the gain would shrink as the number of workers grows, because
the tuple copying also gets parallelized. In some scenarios,
parallel_leader_participation=off amplifies the difference.

Haven't had a chance to see if this helps any of the TPC-H queries.

Also attached is a patch, guc_for_testing.patch, that I used for
measuring the gain. This patch is only for testing. Without it,
comparing performance figures requires a server restart, and the
figures shift back and forth by 5-15 percent after each restart anyway,
which creates a lot of noise when comparing figures with and without
the fix. Use the GUC enable_fix to enable/disable the fix.

[1] https://www.postgresql.org/message-id/CA%2BhUKGLrN2M18-hACEJbNoj2sn_WoUj9rkkBeoPK7SY427pAnA%40mail.gmail.com

-- 
Thanks,
-Amit Khandekar
Huawei Technologies

Attachment

Re: Redundant tuple copy in tqueueReceiveSlot()

From
Thomas Munro
Date:
On Wed, Sep 9, 2020 at 5:23 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> I went ahead and tried doing this. I chose an approach where we can
> return the pointer to the in-place minimal tuple data if it's a
> heap/buffer/minimal tuple slot. A new function
> ExecFetchSlotMinimalTupleData() returns in-place minimal tuple data.
> If it's neither heap, buffer or minimal tuple, it returns a copy as
> usual. The receiver should not assume the data is directly taken from
> MinimalTupleData, so it should set its t_len to the number of bytes
> returned. Patch attached
> (0001-Avoid-redundant-tuple-copy-while-sending-tuples-to-G.patch)

+char *
+ExecFetchSlotMinimalTupleData(TupleTableSlot *slot, uint32 *len,
+                              bool *shouldFree)

Interesting approach.  It's a bit of a weird interface, returning a
pointer to a non-valid MinimalTuple that requires extra tweaking after
you copy it to make it a valid one and that you're not allowed to
tweak in-place.  I'd probably make that return "const void *" just for
a bit of extra documentation.  I wonder if there is a way we could
make "Minimal Tuples but with the length travelling separately (and
perhaps chopped off?)" into a first-class concept...  It's also a
shame to be schlepping a bunch of padding bytes around.

     tuple = (MinimalTuple) data;
-    Assert(tuple->t_len == nbytes);
+    tuple->t_len = nbytes;

Hmm, so you have to scribble on shared memory on the receiving side.
I wondered about a couple of different ways to share the length field
with the shm_mq envelope, but that all seems a bit too weird...

> Thomas, I guess you had a different approach in mind when you said
> about "returning either success or
> hey-that-buffer's-too-small-I-need-N-bytes".  But what looks clear to

Yeah I tried some things like that, but I wasn't satisfied with any of
them; basically the extra work involved in negotiating the size was a
bit too high.  On the other hand, because my interface was "please
write a MinimalTuple here!", it had the option to *form* a
MinimalTuple directly in place, whereas your approach can only avoid
creating and destroying a temporary tuple when the source is a heap
tuple.

> me is that avoiding the copy shows consistent improvement of 4 to 10%
> for simple parallel table scans. I tried my patch on both x86_64 and
> arm64, and observed this speedup on both.

I think that's a great validation of the goal but I hope we can figure
out a way that avoids the temporary tuple for more cases.  FWIW I saw
hash self joins running a couple of percent faster with one of my
abandoned patches.



Re: Redundant tuple copy in tqueueReceiveSlot()

From
Andres Freund
Date:
Hi,

On 2020-09-17 14:20:50 +1200, Thomas Munro wrote:
> I wonder if there is a way we could make "Minimal Tuples but with the
> length travelling separately (and perhaps chopped off?)" into a
> first-class concept...  It's also a shame to be schlepping a bunch of
> padding bytes around.

There really is no justification anymore for having MinimalTuples, at
least as we have them today. We used to rely on being able to construct
pointers to MinimalTuples that are mostly compatible with HeapTuple,
but I think none of those uses are left since PG 12.

I think it'd make a bit more sense to take some steps towards a more
suitable "minimal" tuple representation, rather than doing these local,
pretty ugly hacks. A good way to start would be to remove the padding,
unnecessary fields, etc. from MinimalTuple.

I also think that it'd be good to look at a few of the other places that
are heavily bottlenecked by MinimalTuple overhead before designing new
API around this. IIRC it's e.g. very easy to see hash joins spending a
lot of time doing MinimalTuple copies & conversions.

> 
>      tuple = (MinimalTuple) data;
> -    Assert(tuple->t_len == nbytes);
> +    tuple->t_len = nbytes;
> 
> Hmm, so you have to scribble on shared memory on the receiving side.

Ick, I would really like to avoid this.


> > Thomas, I guess you had a different approach in mind when you said
> > about "returning either success or
> > hey-that-buffer's-too-small-I-need-N-bytes".  But what looks clear to
> 
> Yeah I tried some things like that, but I wasn't satisfied with any of
> them; basically the extra work involved in negotiating the size was a
> bit too high.  On the other hand, because my interface was "please
> write a MinimalTuple here!", it had the option to *form* a
> MinimalTuple directly in place, whereas your approach can only avoid
> creating and destroying a temporary tuple when the source is a heap
> tuple.

There are a lot of cases where the source is a virtual slot (since
we'll often project stuff below Gather). So it'd be quite advantageous
to avoid building an unnecessary HeapTuple in those cases.

I wonder if it would be sensible to build minimal tuples using
tts_values/isnull in some cases.  This might even be advantageous in
case of heap / minimal tuples, because IIRC right now the code will
materialize the slot unnecessarily. Not sure how common that is.
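
Building directly from the arrays could be as simple as something like
this (sketch only), rather than materializing a HeapTuple first and
converting it:

    slot_getallattrs(slot);
    mtup = heap_form_minimal_tuple(slot->tts_tupleDescriptor,
                                   slot->tts_values,
                                   slot->tts_isnull);

(or, better, a variant of that which writes into caller-supplied
memory, per the earlier discussion).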

Greetings,

Andres Freund



Re: Redundant tuple copy in tqueueReceiveSlot()

From
Amit Khandekar
Date:
On Thu, 17 Sep 2020 at 08:55, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-09-17 14:20:50 +1200, Thomas Munro wrote:
> > I wonder if there is a way we could make "Minimal Tuples but with the
> > length travelling separately (and perhaps chopped off?)" into a
> > first-class concept...  It's also a shame to be schlepping a bunch of
> > padding bytes around.

Yeah, I think we could pass the length separately, but since the
receiving end already assumes that the received data is a minimal
tuple, I thought why not skip passing this redundant component. But
anyway, if you and Andres are suggesting that being able to skip the
copy is important for virtual tuples as well, then I think the approach
you suggested (supplying allocated memory to the tuple API for
conversion) would be one of the better options available to us, if not
the only good option. Maybe I will look into how shm_mq works to see if
we can come up with a good solution.

>
> There really is no justification for having MinimalTuples, as we have
> them today at least, anymore. We used to rely on being able to construct
> pointers to MinimalTuples that are mostly compatible with HeapTuple. But
> I think there's none of those left since PG 12.

Ah ok.

>
> I think it'd make a bit more sense to do some steps towards having a
> more suitable "minimal" tuple representation, rather than doing this
> local, pretty ugly, hacks. A good way would be to just starting to
> remove the padding, unnecessary fields etc from MinimalTuple.

So there are two things we wish to do:
1. Avoid the extra tuple-forming step before sending minimal tuple
data. Possibly devise an shm_mq API that hands out memory to write a
tuple of a given length into, and devise something like
FormMinimalTupleDataInHere(memory_allocated_by_shm_mq) which writes the
minimal tuple data there (rough sketch below).
2. Shrink the MinimalTupleData structure, because it no longer needs
the current padding etc., and substitute this new MinimalTuple
structure for the current one everywhere in the code it is currently
being used.
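
For 1., the flow I have in mind is roughly this; all of the function
names below are made up just to illustrate it, since the real shm_mq
API would need some way to reserve space in the ring and publish it
afterwards:

    Size        len;
    void       *dest;

    len = EstimateMinimalTupleSize(slot);          /* hypothetical */
    dest = shm_mq_reserve(tqueue->queue, len);     /* hypothetical: space in the ring */
    FormMinimalTupleDataInHere(slot, dest, len);   /* form the tuple directly there */
    shm_mq_commit(tqueue->queue, len);             /* hypothetical: make it visible */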

If we just removed the unnecessary fields from the tuple data being
sent to the Gather node, we would need to form a MinimalTuple again at
the receiving end, which adds back an extra tuple-forming step. So I
understand that's why you are saying we should shrink the
MinimalTupleData structure itself, in which case we would continue to
use the received MinimalTupleData as an already-formed tuple, as we do
now.

Now, the above two things (1 and 2) look independent to me. Suppose we
first do 1, i.e. we come up with a good way to form an in-place
MinimalTuple at the sender's end, without any change to
MinimalTupleData. When we later do 2, i.e. shrink MinimalTupleData,
that won't require any change to the in-place-tuple-forming API we
wrote for 1; only the existing underlying function,
heap_form_minimal_tuple() or something similar, might need to be
changed. At least that's what I feel right now.

>
> I also think that it'd be good to look at a few of the other places that
> are heavily bottlenecked by MinimalTuple overhead before designing new
> API around this. IIRC it's e.g. very easy to see hash joins spending a
> lot of time doing MinimalTuple copies & conversions.

Yeah, makes sense. The above FormMinimalTupleDataInHere() should be
usable for those other places as well. Will keep that in mind.

>
> > > Thomas, I guess you had a different approach in mind when you said
> > > about "returning either success or
> > > hey-that-buffer's-too-small-I-need-N-bytes".  But what looks clear to
> >
> > Yeah I tried some things like that, but I wasn't satisfied with any of
> > them; basically the extra work involved in negotiating the size was a
> > bit too high.

Hmm, ok. Let me see if there is any way around this.

> > On the other hand, because my interface was "please
> > write a MinimalTuple here!", it had the option to *form* a
> > MinimalTuple directly in place, whereas your approach can only avoid
> > creating and destroying a temporary tuple when the source is a heap
> > tuple.
True.
>
> There's a lot of cases where the source is a virtual slot (since we'll
> often project stuff below Gather). So it'd be quite advantageous to
> avoid building an unnecessary HeapTuple in those cases.

Yeah right.