Thread: Use virtual tuple slot for Unique node

Use virtual tuple slot for Unique node

From
Денис Смирнов
Date:
Hi,

I have inspected the performance of GROUP BY and DISTINCT queries over sorted data streams and found that the
Group node (produced by GROUP BY) works faster than the Unique node (produced by DISTINCT).  The flame graph shows
the reason - Unique pallocs tuples for the result slot while the Group node doesn't.

I wonder why we use minimal tuples for the Unique node instead of virtual ones. It looks like there is no
actual reason for that, as Unique doesn't do any materialization.

Attachment

Re: Use virtual tuple slot for Unique node

From
David Rowley
Date:
On Thu, 31 Aug 2023 at 05:37, Денис Смирнов <darthunix@gmail.com> wrote:
> I have inspected the performance of GROUP BY and DISTINCT queries over sorted data streams and found that the
> Group node (produced by GROUP BY) works faster than the Unique node (produced by DISTINCT).  The flame graph shows
> the reason - Unique pallocs tuples for the result slot while the Group node doesn't.
>
> I wonder why we use minimal tuples for the Unique node instead of virtual ones. It looks like there is no
> actual reason for that, as Unique doesn't do any materialization.

It would be good to see example queries and a demonstration of the
performance increase. I'm not disputing your claims, but showing some
performance numbers might catch the eye of a reviewer more quickly.

You should also add this to the September commitfest at
https://commitfest.postgresql.org/44/

David



Re: Use virtual tuple slot for Unique node

From
Denis Smirnov
Date:
Here is an example (checked on the current master branch with a release build; I made about 10 runs of each EXPLAIN ANALYZE to get repeatable results).

Before the patch:

adb=# create table t(a int, primary key(a));

adb=# insert into t select random() * 5000000
from generate_series(1, 5000000)
on conflict do nothing;

adb=# explain analyze select a from t group by a;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Group  (cost=0.43..98761.06 rows=3160493 width=4) (actual time=0.085..1225.139 rows=3160493 loops=1)
   Group Key: a
   ->  Index Only Scan using t_pkey on t  (cost=0.43..90859.82 rows=3160493 width=4) (actual time=0.081..641.567 rows=3160493 loops=1)
         Heap Fetches: 0
 Planning Time: 0.188 ms
 Execution Time: 1370.027 ms
(6 rows)

adb=# explain analyze select distinct a from t;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.43..98761.06 rows=3160493 width=4) (actual time=0.135..1525.704 rows=3160493 loops=1)
   ->  Index Only Scan using t_pkey on t  (cost=0.43..90859.82 rows=3160493 width=4) (actual time=0.130..635.742 rows=3160493 loops=1)
         Heap Fetches: 0
 Planning Time: 0.273 ms
 Execution Time: 1660.857 ms
(5 rows)



We can see that ExecCopySlot takes 24% of the CPU inside the ExecUnique function (thanks to the palloc for Unique's minimal tuples). On the other hand, ExecCopySlot is only 6% of the ExecGroup function (the Group node uses virtual tuples).


After the patch, the Unique node works a little faster than the Group node:


adb=# explain analyze select distinct a from t;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.43..98761.06 rows=3160493 width=4) (actual time=0.094..1072.007 rows=3160493 loops=1)
   ->  Index Only Scan using t_pkey on t  (cost=0.43..90859.82 rows=3160493 width=4) (actual time=0.092..592.619 rows=3160493 loops=1)
         Heap Fetches: 0
 Planning Time: 0.203 ms
 Execution Time: 1209.940 ms
(5 rows)

adb=# explain analyze select a from t group by a;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Group  (cost=0.43..98761.06 rows=3160493 width=4) (actual time=0.074..1140.644 rows=3160493 loops=1)
   Group Key: a
   ->  Index Only Scan using t_pkey on t  (cost=0.43..90859.82 rows=3160493 width=4) (actual time=0.070..591.930 rows=3160493 loops=1)
         Heap Fetches: 0
 Planning Time: 0.193 ms
 Execution Time: 1276.026 ms
(6 rows)


I have added the current patch to the commitfest.



On 31 Aug 2023, at 04:59, David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 31 Aug 2023 at 05:37, Денис Смирнов <darthunix@gmail.com> wrote:
I have inspected the performance of GROUP BY and DISTINCT queries over sorted data streams and found that the Group node (produced by GROUP BY) works faster than the Unique node (produced by DISTINCT).  The flame graph shows the reason - Unique pallocs tuples for the result slot while the Group node doesn't.

I wonder why we use minimal tuples for the Unique node instead of virtual ones. It looks like there is no actual reason for that, as Unique doesn't do any materialization.

It would be good to see example queries and a demonstration of the
performance increase. I'm not disputing your claims, but showing some
performance numbers might catch the eye of a reviewer more quickly.

You should also add this to the September commitfest at
https://commitfest.postgresql.org/44/

David

Attachment

Re: Use virtual tuple slot for Unique node

From
Denis Smirnov
Date:
It looks like my patch was not analyzed by the hackers mailing list due to an incorrect MIME type, so I am duplicating it here.


> On 31 Aug 2023, at 10:06, Denis Smirnov <darthunix@gmail.com> wrote:
>
> <v2-use-virtual-slots-for-unique-node.patch.txt>


Attachment

Re: Use virtual tuple slot for Unique node

From
Denis Smirnov
Date:
I have done some research and found that although the patch itself is correct (i.e. we can benefit from replacing TTSOpsMinimalTuple with TTSOpsVirtual for the Unique node), my explanation of WHY was wrong.

1. We always materialize the new unique tuple in the slot, regardless of which type of tuple table slot we use.
2. But virtual tuple materialization (tts_virtual_copyslot) has performance benefits over the minimal tuple one (tts_minimal_copyslot):
    - tts_minimal_copyslot() always allocates zeroed memory with palloc0 (expensive according to the flame graph);
    - tts_virtual_copyslot() doesn't allocate any additional memory if the tuple is built only from pass-by-value columns (for variable-size columns we still need an allocation);
    - when tts_virtual_copyslot() does need an allocation, it doesn't have to zero the memory.

As a result, we benefit noticeably from a virtual TTS for tuples built from fixed-size columns when there is a Unique node in the plan.
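
To illustrate the difference, here is a rough, simplified sketch of the two copy paths (paraphrased from my reading of execTuples.c, not the actual source; the sketch_* names are mine):

#include "postgres.h"
#include "executor/tuptable.h"

/*
 * Sketch of the minimal-tuple copy path: every copy forms a new minimal
 * tuple in freshly allocated (zeroed) memory, even when all columns are
 * pass-by-value.
 */
static void
sketch_minimal_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
{
	MinimalTuple	tuple;

	ExecClearTuple(dstslot);
	tuple = ExecCopySlotMinimalTuple(srcslot);	/* palloc + form tuple */
	ExecStoreMinimalTuple(tuple, dstslot, true);
}

/*
 * Sketch of the virtual copy path: datums and null flags are copied as-is,
 * and extra (non-zeroed) memory is needed only to flatten pass-by-reference
 * columns during materialization.
 */
static void
sketch_virtual_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
{
	int			natts = srcslot->tts_tupleDescriptor->natts;

	ExecClearTuple(dstslot);
	slot_getallattrs(srcslot);	/* make sure the source is fully deformed */
	for (int i = 0; i < natts; i++)
	{
		dstslot->tts_values[i] = srcslot->tts_values[i];
		dstslot->tts_isnull[i] = srcslot->tts_isnull[i];
	}
	ExecStoreVirtualTuple(dstslot);
	ExecMaterializeSlot(dstslot);	/* copies only pass-by-ref datums */
}

For an all-by-value tuple like the single int4 column in the example above, the virtual path ends up doing no allocation at all, while the minimal path still forms a new tuple for every row.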



On 31 Aug 2023, at 10:28, Denis Smirnov <darthunix@gmail.com> wrote:

It looks like my patch was not analyzed by the hackers mailing list due to an incorrect MIME type, so I am duplicating it here.
<v2-use-virtual-slots-for-unique-node.patch.txt>

On 31 Aug 2023, at 10:06, Denis Smirnov <darthunix@gmail.com> wrote:

<v2-use-virtual-slots-for-unique-node.patch.txt>


Attachment

Re: Use virtual tuple slot for Unique node

From
Денис Смирнов
Date:
Again the new patch hasn't been attached to the thread, so I am resending it.
Attachment

Re: Use virtual tuple slot for Unique node

From
Heikki Linnakangas
Date:
I did a little more perf testing with this. I'm seeing the same benefit 
with the query you posted. But can we find a case where it's not 
beneficial? If I understand correctly, when the input slot is a virtual 
slot, it's cheaper to copy it to another virtual slot than to form a 
minimal tuple. Like in your test case. What if the input is a minimal 
tuple?

On master:

postgres=# set enable_hashagg=off;
SET
postgres=# explain analyze select distinct g::text, 'a', 'b', 'c','d', 'e','f','g','h' from generate_series(1, 5000000) g;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=2630852.42..2655852.42 rows=200 width=288) (actual time=4525.212..6576.992 rows=5000000 loops=1)
   ->  Sort  (cost=2630852.42..2643352.42 rows=5000000 width=288) (actual time=4525.211..5960.967 rows=5000000 loops=1)
         Sort Key: ((g)::text)
         Sort Method: external merge  Disk: 165296kB
         ->  Function Scan on generate_series g  (cost=0.00..75000.00 rows=5000000 width=288) (actual time=518.914..1194.702 rows=5000000 loops=1)
 Planning Time: 0.036 ms
 JIT:
   Functions: 5
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.242 ms (Deform 0.035 ms), Inlining 63.457 ms, Optimization 29.764 ms, Emission 20.592 ms, Total 114.056 ms
 Execution Time: 6766.399 ms
(11 rows)


With the patch:

postgres=# set enable_hashagg=off;
SET
postgres=# explain analyze select distinct g::text, 'a', 'b', 'c','d', 'e','f','g','h' from generate_series(1, 5000000) g;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=2630852.42..2655852.42 rows=200 width=288) (actual time=4563.639..7362.467 rows=5000000 loops=1)
   ->  Sort  (cost=2630852.42..2643352.42 rows=5000000 width=288) (actual time=4563.637..6069.000 rows=5000000 loops=1)
         Sort Key: ((g)::text)
         Sort Method: external merge  Disk: 165296kB
         ->  Function Scan on generate_series g  (cost=0.00..75000.00 rows=5000000 width=288) (actual time=528.060..1191.483 rows=5000000 loops=1)
 Planning Time: 0.720 ms
 JIT:
   Functions: 5
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.406 ms (Deform 0.065 ms), Inlining 68.385 ms, Optimization 21.656 ms, Emission 21.033 ms, Total 111.480 ms
 Execution Time: 7585.306 ms
(11 rows)


So not a win in this case. Could you peek at the outer slot type, and 
use the same kind of slot for the Unique's result? Or some more 
complicated logic, like use a virtual slot if all the values are 
pass-by-val? I'd also like to keep this simple, though...

Would this kind of optimization make sense elsewhere?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Use virtual tuple slot for Unique node

From
David Rowley
Date:
On Sat, 23 Sept 2023 at 03:15, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> So not a win in this case. Could you peek at the outer slot type, and
> use the same kind of slot for the Unique's result? Or some more
> complicated logic, like use a virtual slot if all the values are
> pass-by-val? I'd also like to keep this simple, though...
>
> Would this kind of optimization make sense elsewhere?

There are a few usages of ExecGetResultSlotOps(). e.g ExecInitHashJoin().

If I adjust the patch to:

-       ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
+       ExecInitResultTupleSlotTL(&uniquestate->ps,
+                                 ExecGetResultSlotOps(outerPlanState(uniquestate),
+                                                      NULL));

Then I get the following performance on my Zen2 machine.

Test 1

drop table if exists t;
create table t(a int, b int);
insert into t select x,x from generate_series(1,1000000)x;
create index on t (a,b);
vacuum analyze t;

explain (analyze, timing off) select distinct a,b from t;

Master:
Execution Time: 149.669 ms
Execution Time: 149.019 ms
Execution Time: 151.240 ms

Patched:
Execution Time: 96.950 ms
Execution Time: 94.509 ms
Execution Time: 93.498 ms

Test 2

drop table if exists t;
create table t(a text, b text);
insert into t select x::text,x::text from generate_series(1,1000000)x;
create index on t (a,b);
vacuum analyze t;

explain (analyze, timing off) select distinct a,b from t;

Master:
Execution Time: 185.282 ms
Execution Time: 178.948 ms
Execution Time: 179.217 ms

Patched:
Execution Time: 141.031 ms
Execution Time: 141.136 ms
Execution Time: 142.163 ms

Test 3

set enable_hashagg=off;
explain (analyze, timing off) select distinct g::text, 'a', 'b',
'c','d', 'e','f','g','h' from generate_series(1, 50000) g;

Master:
Execution Time: 87.599 ms
Execution Time: 87.721 ms
Execution Time: 87.635 ms

Patched:
Execution Time: 83.449 ms
Execution Time: 84.314 ms
Execution Time: 86.239 ms

David



Re: Use virtual tuple slot for Unique node

From
David Rowley
Date:
On Wed, 27 Sept 2023 at 20:01, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sat, 23 Sept 2023 at 03:15, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > So not a win in this case. Could you peek at the outer slot type, and
> > use the same kind of slot for the Unique's result? Or some more
> > complicated logic, like use a virtual slot if all the values are
> > pass-by-val? I'd also like to keep this simple, though...
> >
> > Would this kind of optimization make sense elsewhere?
>
> There are a few usages of ExecGetResultSlotOps(). e.g ExecInitHashJoin().
>
> If I adjust the patch to:
>
> -       ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
> +       ExecInitResultTupleSlotTL(&uniquestate->ps,
> +                                 ExecGetResultSlotOps(outerPlanState(uniquestate),
> +                                                      NULL));

Just to keep this from going cold, here's that in patch form for
anyone who wants to test.

I spent a bit more time running some more benchmarks and I don't see
any workload where it slows things down.  I'd be happy if someone else
had a go at finding a regression.

David

Attachment

Re: Use virtual tuple slot for Unique node

From
Ashutosh Bapat
Date:
On Tue, Oct 10, 2023 at 2:23 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 27 Sept 2023 at 20:01, David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > On Sat, 23 Sept 2023 at 03:15, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > > So not a win in this case. Could you peek at the outer slot type, and
> > > use the same kind of slot for the Unique's result? Or some more
> > > complicated logic, like use a virtual slot if all the values are
> > > pass-by-val? I'd also like to keep this simple, though...
> > >
> > > Would this kind of optimization make sense elsewhere?
> >
> > There are a few usages of ExecGetResultSlotOps(). e.g ExecInitHashJoin().
> >
> > If I adjust the patch to:
> >
> > -       ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
> > +       ExecInitResultTupleSlotTL(&uniquestate->ps,
> > +                                 ExecGetResultSlotOps(outerPlanState(uniquestate),
> > +                                                      NULL));
>
> Just to keep this from going cold, here's that in patch form for
> anyone who wants to test.

Thanks.

I don't recollect why we chose a minimal tuple slot here - maybe because
we expected the underlying node to always produce a minimal tuple. But
the Unique node copies the tuple returned by the underlying node. This
copy is carried out by the TupleTableSlot-specific copy function,
copyslot. Every implementation of this function first converts the
source slot's tuple into the required form and then copies it. Having
both TupleTableSlots - the output slot of the underlying node and the
output slot of the Unique node - be of the same type avoids the first
step and just copies the slot. It makes sense that it performs better.
The code looks fine to me.
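
For reference, the generic entry point is roughly this (paraphrased from src/include/executor/tuptable.h), so all of the conversion work happens inside the destination slot type's copyslot callback:

/* paraphrased from src/include/executor/tuptable.h */
static inline TupleTableSlot *
ExecCopySlot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
{
	Assert(!TTS_EMPTY(srcslot));
	Assert(srcslot != dstslot);

	/* dispatch to the destination slot type's copyslot implementation */
	dstslot->tts_ops->copyslot(dstslot, srcslot);

	return dstslot;
}

When the source and destination ops match, that callback can copy the tuple in its native format instead of first re-forming it from the source's datums.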

>
> I spent a bit more time running some more benchmarks and I don't see
> any workload where it slows things down.  I'd be happy if someone else
> had a go at finding a regression.

I built on your experiments and I might have found a minor regression.

Setup
=====
drop table if exists t_int;
create table t_int(a int, b int);
insert into t_int select x, x from generate_series(1,1000000)x;
create index on t_int (a,b);
vacuum analyze t_int;

drop table if exists t_text;
create table t_text(a text, b text);
insert into t_text select lpad(x::text, 1000, '0'), x::text from
generate_series(1,1000000)x;
create index on t_text (a,b);
vacuum analyze t_text;

drop table if exists t_mixed; -- this one is new but it doesn't matter much
create table t_mixed(a text, b int);
insert into t_mixed select lpad(x::text, 1000, '0'), x from
generate_series(1,1000000)x;
create index on t_mixed (a,b);
vacuum analyze t_mixed;

Queries and measurements (average execution time from 3 runs - on my
Thinkpad T490)
======================
Q1 select distinct a,b from t_int;
HEAD: 544.45 ms
patched: 381.55 ms

Q2 select distinct a,b from t_text
HEAD: 609.90 ms
patched: 513.42 ms

Q3 select distinct a,b from t_mixed
HEAD: 626.80 ms
patched: 468.22 ms

The more pass-by-ref data there is, the more memory is allocated, which
seems to reduce the gain from this patch.
The nodes above use a buffer or heap TupleTableSlot.
Next, try some different nodes which output a minimal or virtual TTS.

set enable_hashagg to off;
Q4 select distinct a,b from (select sum(a) over (order by a rows 2
preceding) a, b from t_int) q
HEAD: 2529.58 ms
patched: 2332.23 ms

Q5 select distinct a,b from (select sum(a) over (order by a rows 2
preceding) a, b from t_int order by a, b) q
HEAD: 2633.69 ms
patched: 2255.99 ms


Q6 select distinct a,b from (select string_agg(a, ', ') over (order by
a rows 2 preceding) a, b from t_text) q
HEAD: 108589.85 ms
patched: 107226.82 ms

Q7 select distinct a,b from (select string_agg(left(a, 100), ', ')
over (order by a rows 2 preceding) a, b from t_text) q
HEAD: 16070.62 ms
patched: 16182.16 ms

This one is surprising though. Maybe the advantage of using the same
tuple table slot type is so narrow when large data needs to be copied
that the execution times almost match. Either way, the patched and
unpatched execution times differ within the margin of error.

--
Best Wishes,
Ashutosh Bapat



Re: Use virtual tuple slot for Unique node

From
David Rowley
Date:
On Thu, 12 Oct 2023 at 23:06, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> Q7 select distinct a,b from (select string_agg(left(a, 100), ', ')
> over (order by a rows 2 preceding) a, b from t_text) q
> HEAD: 16070.62 ms
> patched: 16182.16 ms

Did you time the SELECT or EXPLAIN ANALYZE?

With SELECT, I'm unable to recreate this slowdown. Using your setup:

$ cat bench.sql
set enable_hashagg=0;
set work_mem='10GB';
select distinct a,b from (select string_agg(left(a, 100), ', ') over
(order by a rows 2 preceding) a, b from t_text) q;

Master @ 13d00729d
$ pgbench -n -f bench.sql -T 300 postgres | grep latency
latency average = 7739.250 ms

Master + use_subnode_slot_type_for_nodeunique.patch
$ pgbench -n -f bench.sql -T 300 postgres | grep latency
latency average = 7718.007 ms

It's hard to imagine why there would be a slowdown, as this query uses
a TTSOpsMinimalTuple slot type in both the patched and unpatched versions.

David



Re: Use virtual tuple slot for Unique node

From
David Rowley
Date:
On Thu, 19 Oct 2023 at 22:29, David Rowley <dgrowleyml@gmail.com> wrote:
> It's hard to imagine why there would be a slowdown as this query uses
> a TTSOpsMinimalTuple slot type in the patch and the unpatched version.

I shrunk down your table sizes to 10k rows instead of 1 million rows
to reduce the CPU cache pressure on the queries.

I ran pgbench for 1 minute on each query and did pg_prewarm on each
table. Here are the times I got in milliseconds:

Query     master    Master + 0001    compare
Q1         2.576        1.979        130.17%
Q2         9.546        9.941         96.03%
Q3         9.069        9.536         95.10%
Q4         7.285        7.208        101.07%
Q5         7.585        6.904        109.86%
Q6       162.253      161.434        100.51%
Q7        62.507       58.922        106.08%

I also noted down the slot type that nodeUnique.c is using in each of
the queries:

Q1 TTSOpsVirtual
Q2 TTSOpsVirtual
Q3 TTSOpsVirtual
Q4 TTSOpsMinimalTuple
Q5 TTSOpsVirtual
Q6 TTSOpsMinimalTuple
Q7 TTSOpsMinimalTuple

So, I'm not really expecting Q4, Q6 or Q7 to change much. However, Q7
does seem to be above noise level faster and I'm not sure why.

We can see that Q2 and Q3 become a bit slower.  This makes sense as
tts_virtual_materialize() is quite a bit more complex than
heap_copy_minimal_tuple() which is a simple palloc/memcpy.
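
For anyone following along, the two functions are roughly shaped like this (heavily abbreviated and paraphrased from heaptuple.c and execTuples.c, not the exact source; expanded-datum handling and flag bookkeeping are left out):

/* heap_copy_minimal_tuple(): a single palloc plus a single memcpy */
MinimalTuple
heap_copy_minimal_tuple(MinimalTuple mtup)
{
	MinimalTuple	result;

	result = (MinimalTuple) palloc(mtup->t_len);
	memcpy(result, mtup, mtup->t_len);
	return result;
}

/* tts_virtual_materialize(): two passes over the attributes */
static void
tts_virtual_materialize(TupleTableSlot *slot)
{
	VirtualTupleTableSlot *vslot = (VirtualTupleTableSlot *) slot;
	TupleDesc	desc = slot->tts_tupleDescriptor;
	Size		sz = 0;

	/* pass 1: work out how much space the pass-by-ref datums need */
	for (int natt = 0; natt < desc->natts; natt++)
	{
		Form_pg_attribute att = TupleDescAttr(desc, natt);

		if (att->attbyval || slot->tts_isnull[natt])
			continue;
		sz = att_align_nominal(sz, att->attalign);
		sz = att_addlength_datum(sz, att->attlen, slot->tts_values[natt]);
	}

	if (sz == 0)
		return;					/* every column is pass-by-value */

	vslot->data = MemoryContextAlloc(slot->tts_mcxt, sz);

	/*
	 * pass 2 (omitted here): copy each pass-by-ref datum into vslot->data
	 * and repoint the corresponding tts_values entry at the copy.
	 */
}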

We'd likely see Q2 and Q3 do better with the patched version if there
were more duplicates as there'd be less tuple deforming going on
because of the virtual slots.

Overall, the patched version is 5.55% faster than master.  However,
it's pretty hard to say if we should do this or not. Q3 has a mix of
varlena and byval types and that came out slower with the patched
version.

I've attached the script I used to get the results and the setup,
which is just your tables shrunk down to 10k rows.

David

Attachment

Re: Use virtual tuple slot for Unique node

From
Ashutosh Bapat
Date:
On Thu, Oct 19, 2023 at 4:26 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 19 Oct 2023 at 22:29, David Rowley <dgrowleyml@gmail.com> wrote:
> > It's hard to imagine why there would be a slowdown as this query uses
> > a TTSOpsMinimalTuple slot type in the patch and the unpatched version.
>
> I shrunk down your table sizes to 10k rows instead of 1 million rows
> to reduce the CPU cache pressure on the queries.
>
> I ran pgbench for 1 minute on each query and did pg_prewarm on each
> table. Here are the times I got in milliseconds:
>
> Query     master    Master + 0001    compare
> Q1         2.576        1.979        130.17%
> Q2         9.546        9.941         96.03%
> Q3         9.069        9.536         95.10%
> Q4         7.285        7.208        101.07%
> Q5         7.585        6.904        109.86%
> Q6       162.253      161.434        100.51%
> Q7        62.507       58.922        106.08%
>
> I also noted down the slot type that nodeUnique.c is using in each of
> the queries:
>
> Q1 TTSOpsVirtual
> Q2 TTSOpsVirtual
> Q3 TTSOpsVirtual
> Q4 TTSOpsMinimalTuple
> Q5 TTSOpsVirtual
> Q6 TTSOpsMinimalTuple
> Q7 TTSOpsMinimalTuple
>
> So, I'm not really expecting Q4, Q6 or Q7 to change much. However, Q7
> does seem to be above noise level faster and I'm not sure why.

I ran my experiments again. It seems on my machine the execution times
do vary a bit. I ran EXPLAIN ANALYZE on the query 5 times and took the
average of the execution times. I did this three times. For each run
the standard deviation was within 2%. Here are the numbers:
master: 13548.33, 13878.88, 14572.52
master + 0001: 13734.58, 14193.83, 14574.73

So for me, I would say, this particular query performs the same with
or without patch.

>
> We can see that Q2 and Q3 become a bit slower.  This makes sense as
> tts_virtual_materialize() is quite a bit more complex than
> heap_copy_minimal_tuple() which is a simple palloc/memcpy.
>

If the source slot is a materialized virtual slot,
tts_virtual_copyslot() could perform a memcpy of the materialized data
itself rather than materialising from datums. That might be more
efficient.

> We'd likely see Q2 and Q3 do better with the patched version if there
> were more duplicates as there'd be less tuple deforming going on
> because of the virtual slots.
>
> Overall, the patched version is 5.55% faster than master.  However,
> it's pretty hard to say if we should do this or not. Q3 has a mix of
> varlena and byval types and that came out slower with the patched
> version.

Theoretically, using the same slot type is supposed to be faster. We
use the same slot types for input and output in other places as well.
Maybe we should fix the inefficiency mentioned above in
tts_virtual_copyslot()?

--
Best Wishes,
Ashutosh Bapat



Re: Use virtual tuple slot for Unique node

From
David Rowley
Date:
On Fri, 20 Oct 2023 at 22:30, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> I ran my experiments again. It seems on my machine the execution times
> do vary a bit. I ran EXPLAIN ANALYZE on the query 5 times and took
> average of execution times. I did this three times. For each run the
> standard deviation was within 2%. Here are the numbers
> master: 13548.33, 13878.88, 14572.52
> master + 0001: 13734.58, 14193.83, 14574.73
>
> So for me, I would say, this particular query performs the same with
> or without patch.

I'm not really sure which of the 7 queries you're referring to here.
The times you're quoting seem to align best to Q7 from your previous
results, so I'll assume you mean Q7.

I'm not really concerned with Q7 as both patched and unpatched use
TTSOpsMinimalTuple.

I also think you need to shrink the size of your benchmark down.  With
1 million tuples, you're more likely to be also measuring the time it
takes to get cache lines from memory into the CPU. A smaller scale
test will make this less likely. Also, you'd be better timing SELECT
rather than the time it takes to EXPLAIN ANALYZE. They're not the same
thing. EXPLAIN ANALYZE has additional timing going on and we may end
up not de-toasting toasted Datums.

> On Thu, Oct 19, 2023 at 4:26 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > We can see that Q2 and Q3 become a bit slower.  This makes sense as
> > tts_virtual_materialize() is quite a bit more complex than
> > heap_copy_minimal_tuple() which is a simple palloc/memcpy.
> >
>
> If the source slot is a materialized virtual slot,
> tts_virtual_copyslot() could perform a memcpy of the materialized data
> itself rather than materialising from datums. That might be more
> efficient.

I think you're talking about just performing a memcpy() of the
VirtualTupleTableSlot->data field.  Unfortunately, you'd not be able
to do just that as you'd also need to repoint the non-byval Datums in
tts_values at the newly memcpy'd memory. If you skipped that part,
those would remain pointing to the original memory. If that memory
goes away, then bad things will happen. I think you'd still need to do
the 2nd loop in tts_virtual_materialize()

> > We'd likely see Q2 and Q3 do better with the patched version if there
> > were more duplicates as there'd be less tuple deforming going on
> > because of the virtual slots.
> >
> > Overall, the patched version is 5.55% faster than master.  However,
> > it's pretty hard to say if we should do this or not. Q3 has a mix of
> > varlena and byval types and that came out slower with the patched
> > version.
>
> Theoretically, using the same slot type is supposed to be faster. We
> use the same slot types for input and output in other places as well.

Which theory?

> Maybe we should fix the inefficiency mentioned above in
> tts_virtual_copyslot()?

I don't have any bright ideas on how to make tts_virtual_materialize()
itself faster.  If there were some way to remember !attbyval
attributes for the 2nd loop, that might be good, but creating
somewhere to store that might result in further overheads.

tts_virtual_copyslot() perhaps could be sped up a little by doing a
memcpy of the values/isnull arrays when the src and dst descriptors
have the same number of attributes. aka, something like:

if (srcdesc->natts == dstslot->tts_tupleDescriptor->natts)
    memcpy(dstslot->tts_values, srcslot->tts_values,
                    MAXALIGN(srcdesc->natts * sizeof(Datum)) +
                    MAXALIGN(srcdesc->natts * sizeof(bool)));
else
{
    for (int natt = 0; natt < srcdesc->natts; natt++)
    {
        dstslot->tts_values[natt] = srcslot->tts_values[natt];
        dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
    }
}

I imagine we'd only start to notice gains by doing that for larger
natts values.  Each of the 7 queries here, I imagine, wouldn't have
enough columns for it to make much of a difference.

That seems valid enough to do based on how MakeTupleTableSlot()
allocates those arrays. ExecSetSlotDescriptor() does not seem on board
with the single allocation method, however. (Those pfree's in
ExecSetSlotDescriptor() do look a bit fragile.
https://coverage.postgresql.org/src/backend/executor/execTuples.c.gcov.html
says they're never called)

David



Re: Use virtual tuple slot for Unique node

From
Ashutosh Bapat
Date:
On Tue, Oct 24, 2023 at 4:30 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 20 Oct 2023 at 22:30, Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> > I ran my experiments again. It seems on my machine the execution times
> > do vary a bit. I ran EXPLAIN ANALYZE on the query 5 times and took
> > average of execution times. I did this three times. For each run the
> > standard deviation was within 2%. Here are the numbers
> > master: 13548.33, 13878.88, 14572.52
> > master + 0001: 13734.58, 14193.83, 14574.73
> >
> > So for me, I would say, this particular query performs the same with
> > or without patch.
>
> I'm not really sure which of the 7 queries you're referring to here.
> The times you're quoting seem to align best to Q7 from your previous
> results, so I'll assume you mean Q7.
>
> I'm not really concerned with Q7 as both patched and unpatched use
> TTSOpsMinimalTuple.

It's Q7, yes. I was responding to your statement: "However, Q7 does
seem to be above noise level faster and I'm not sure why." Anyway, we
can set that aside.

>
> I also think you need to shrink the size of your benchmark down.  With
> 1 million tuples, you're more likely to be also measuring the time it
> takes to get cache lines from memory into the CPU. A smaller scale
> test will make this less likely. Also, you'd be better timing SELECT
> rather than the time it takes to EXPLAIN ANALYZE. They're not the same
> thing. EXPLAIN ANALYZE has additional timing going on and we may end
> up not de-toasting toasted Datums.

I ran experiments with 10K rows and measured timing using \timing in
psql. The measurements are much more flaky than with a larger set of
rows and EXPLAIN ANALYZE. But I think your observations are good enough.

>
> > On Thu, Oct 19, 2023 at 4:26 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > > We can see that Q2 and Q3 become a bit slower.  This makes sense as
> > > tts_virtual_materialize() is quite a bit more complex than
> > > heap_copy_minimal_tuple() which is a simple palloc/memcpy.
> > >
> >
> > If the source slot is a materialized virtual slot,
> > tts_virtual_copyslot() could perform a memcpy of the materialized data
> > itself rather than materialising from datums. That might be more
> > efficient.
>
> I think you're talking about just performing a memcpy() of the
> VirtualTupleTableSlot->data field.  Unfortunately, you'd not be able
> to do just that as you'd also need to repoint the non-byval Datums in
> tts_values at the newly memcpy'd memory. If you skipped that part,
> those would remain pointing to the original memory. If that memory
> goes away, then bad things will happen. I think you'd still need to do
> the 2nd loop in tts_virtual_materialize()

Yes, we will need to repoint the non-byval Datums, of course.

>
> > May be we should fix the above said inefficiency in
> > tt_virtual_copyslot()?
>
> I don't have any bright ideas on how to make tts_virtual_materialize()
> itself faster.  If there were some way to remember !attbyval
> attributes for the 2nd loop, that might be good, but creating
> somewhere to store that might result in further overheads.

We may save the size of data in VirtualTupleTableSlot, thus avoiding
the first loop. I assume that when allocating
VirtualTupleTableSlot->data, we always know what size we are
allocating so it should be just a matter of saving it in
VirtualTupleTableSlot->size. This should avoid the first loop in
tts_virtual_materialize() and give some speed up. We will need a loop
to repoint non-byval Datums. I imagine that the pointers to non-byval
Datums can be computed as dest_slot->tts_values[natts] =
dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data).
This would work as long as all the non-byval datums in the source slot
are all stored flattened in source slot's data. I am assuming that
that would be true in a materialized virtual slot. The byval datums
are copied as is. I think, this will avoid multiple memcpy calls, one
per non-byval attribute and hence some speedup. I may be wrong though.
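
A rough sketch of what I mean, where the "size" field is hypothetical (VirtualTupleTableSlot has no such field today) and expanded datums and slot flags are ignored:

/*
 * Hypothetical fast path for copying an already-materialized virtual slot.
 * Assumes a new VirtualTupleTableSlot->size field recording the length of
 * the flattened data buffer.
 */
static void
sketch_copy_materialized_vslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
{
	VirtualTupleTableSlot *src = (VirtualTupleTableSlot *) srcslot;
	VirtualTupleTableSlot *dst = (VirtualTupleTableSlot *) dstslot;
	TupleDesc	desc = srcslot->tts_tupleDescriptor;

	/* one memcpy for all of the pass-by-ref data */
	dst->data = MemoryContextAlloc(dstslot->tts_mcxt, src->size);
	memcpy(dst->data, src->data, src->size);

	for (int natt = 0; natt < desc->natts; natt++)
	{
		Form_pg_attribute att = TupleDescAttr(desc, natt);

		dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
		if (att->attbyval || srcslot->tts_isnull[natt])
			dstslot->tts_values[natt] = srcslot->tts_values[natt];
		else
			/* keep the same offset, but inside the copied buffer */
			dstslot->tts_values[natt] = PointerGetDatum(dst->data +
				(DatumGetPointer(srcslot->tts_values[natt]) - src->data));
	}
	dstslot->tts_nvalid = desc->natts;
}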

--
Best Wishes,
Ashutosh Bapat



Re: Use virtual tuple slot for Unique node

From
David Rowley
Date:
On Wed, 25 Oct 2023 at 22:48, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> We may save the size of data in VirtualTupleTableSlot, thus avoiding
> the first loop. I assume that when allocating
> VirtualTupleTableSlot->data, we always know what size we are
> allocating so it should be just a matter of saving it in
> VirtualTupleTableSlot->size. This should avoid the first loop in
> tts_virtual_materialize() and give some speed up. We will need a loop
> to repoint non-byval Datums. I imagine that the pointers to non-byval
> Datums can be computed as dest_slot->tts_values[natts] =
> dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data).
> This would work as long as all the non-byval datums in the source slot
> are all stored flattened in source slot's data. I am assuming that
> that would be true in a materialized virtual slot. The byval datums
> are copied as is. I think, this will avoid multiple memcpy calls, one
> per non-byval attribute and hence some speedup. I may be wrong though.

hmm, do you think it's common enough that we copy an already
materialised virtual slot?

I tried adding the following code to tts_virtual_copyslot() and didn't
see the NOTICE message when running each of your test queries. "make
check" also worked without anything failing after adjusting
nodeUnique.c to always use a TTSOpsVirtual slot.

+       if (srcslot->tts_ops == &TTSOpsVirtual && TTS_SHOULDFREE(srcslot))
+               elog(NOTICE, "We copied a materialized virtual slot!");

I did get a failure in postgres_fdw's tests:

   server loopback options (table_name 'tab_batch_sharded_p1_remote');
 insert into tab_batch_sharded select * from tab_batch_local;
+NOTICE:  We copied a materialized virtual slot!
+NOTICE:  We copied a materialized virtual slot!

so I think it's probably not that common that we'd gain from that optimisation.

What might buy us a bit more would be to get rid of the for loop
inside tts_virtual_copyslot() and copy the guts of
tts_virtual_materialize() into tts_virtual_copyslot() and set the
dstslot tts_isnull and tts_values arrays in the same loop that we use
to calculate the size.

I tried that in the attached patch and then tested it alongside the
patch that changes the slot type.

master = 74604a37f
1 = [1]
2 = optimize_tts_virtual_copyslot.patch

Using the script from [2] and the setup from [3] but reduced to 10k
tuples instead of 1 million.

Times are the average query time in milliseconds for a 60 second pgbench run.

query     master    master+1    master+1+2      m/m+1     m/m+1+2
Q1         2.616       2.082         1.903    125.65%     137.47%
Q2         9.479      10.593        10.361     89.48%      91.49%
Q3        10.329      10.357        10.627     99.73%      97.20%
Q4         7.272       7.306         6.941     99.53%     104.77%
Q5         7.597       7.043         6.645    107.87%     114.33%
Q6       162.177     161.037       162.807    100.71%      99.61%
Q7        59.288      59.43         61.294     99.76%      96.73%

                                              103.25%     105.94%

I only expect Q2 and Q3 to gain from this. Q1 shouldn't improve but
did, so the results might not be stable enough to warrant making any
decisions from.

I was uncertain whether the old behaviour when srcslot contains fewer
attributes than dstslot was intended or not.  What happens there is
that we'd leave the additional old dstslot tts_values in place and
only overwrite up to srcslot->natts but then we'd go on and
materialize all the dstslot attributes. I think this might not be
needed as we do dstslot->tts_nvalid = srcdesc->natts. I suspect we may
be ok just to materialize the srcslot attributes and ignore the
previous additional dstslot attributes.  Changing it to that would
make the attached patch more simple.

David

[1] https://www.postgresql.org/message-id/attachment/151110/use_subnode_slot_type_for_nodeunique.patch
[2] https://www.postgresql.org/message-id/attachment/151342/uniquebench.sh.txt
[3] https://www.postgresql.org/message-id/CAExHW5uhTMdkk26oJg9f2ZVufbi5J4Lquj79MdSO%2BipnGJ_muw%40mail.gmail.com

Attachment

Re: Use virtual tuple slot for Unique node

From
Ashutosh Bapat
Date:
On Fri, Oct 27, 2023 at 8:48 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 25 Oct 2023 at 22:48, Ashutosh Bapat
> <ashutosh.bapat.oss@gmail.com> wrote:
> > We may save the size of data in VirtualTupleTableSlot, thus avoiding
> > the first loop. I assume that when allocating
> > VirtualTupleTableSlot->data, we always know what size we are
> > allocating so it should be just a matter of saving it in
> > VirtualTupleTableSlot->size. This should avoid the first loop in
> > tts_virtual_materialize() and give some speed up. We will need a loop
> > to repoint non-byval Datums. I imagine that the pointers to non-byval
> > Datums can be computed as dest_slot->tts_values[natts] =
> > dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data).
> > This would work as long as all the non-byval datums in the source slot
> > are all stored flattened in source slot's data. I am assuming that
> > that would be true in a materialized virtual slot. The byval datums
> > are copied as is. I think, this will avoid multiple memcpy calls, one
> > per non-byval attribute and hence some speedup. I may be wrong though.
>
> hmm, do you think it's common enough that we copy an already
> materialised virtual slot?
>
> I tried adding the following code to tts_virtual_copyslot() and didn't
> see the NOTICE message when running each of your test queries. "make
> check" also worked without anything failing after adjusting
> nodeUnique.c to always use a TTSOpsVirtual slot.
>
> +       if (srcslot->tts_ops == &TTSOpsVirtual && TTS_SHOULDFREE(srcslot))
> +               elog(NOTICE, "We copied a materialized virtual slot!");
>
> I did get a failure in postgres_fdw's tests:
>
>    server loopback options (table_name 'tab_batch_sharded_p1_remote');
>  insert into tab_batch_sharded select * from tab_batch_local;
> +NOTICE:  We copied a materialized virtual slot!
> +NOTICE:  We copied a materialized virtual slot!
>
> so I think it's probably not that common that we'd gain from that optimisation.

Thanks for this analysis. If we aren't copying a materialized virtual
slot often, no point in adding that optimization.

>
> What might buy us a bit more would be to get rid of the for loop
> inside tts_virtual_copyslot() and copy the guts of
> tts_virtual_materialize() into tts_virtual_copyslot() and set the
> dstslot tts_isnull and tts_values arrays in the same loop that we use
> to calculate the size.
>
> I tried that in the attached patch and then tested it alongside the
> patch that changes the slot type.
>
> master = 74604a37f
> 1 = [1]
> 2 = optimize_tts_virtual_copyslot.patch
>
> Using the script from [2] and the setup from [3] but reduced to 10k
> tuples instead of 1 million.
>
> Times are the average query time in milliseconds for a 60 second pgbench run.
>
> query     master    master+1    master+1+2      m/m+1     m/m+1+2
> Q1         2.616       2.082         1.903    125.65%     137.47%
> Q2         9.479      10.593        10.361     89.48%      91.49%
> Q3        10.329      10.357        10.627     99.73%      97.20%
> Q4         7.272       7.306         6.941     99.53%     104.77%
> Q5         7.597       7.043         6.645    107.87%     114.33%
> Q6       162.177     161.037       162.807    100.71%      99.61%
> Q7        59.288      59.43         61.294     99.76%      96.73%
>
>                                               103.25%     105.94%
>
> I only expect Q2 and Q3 to gain from this. Q1 shouldn't improve but
> did, so the results might not be stable enough to warrant making any
> decisions from.

I am actually surprised to see that eliminating the loop shows
improvements. There aren't hundreds of attributes involved in those
queries. So I share your stability concerns. And even with these
patches, Q2 and Q3 are still slower.

Q1 is consistently showing a gain of more than 25% for both of us. But
the other queries aren't showing a whole lot of improvement. So I am
wondering whether it's worth pursuing this change; similar to the
opinion you expressed a few emails earlier.

>
> I was uncertain if the old behaviour of when srcslot contains fewer
> attributes than dstslot was intended or not.  What happens there is
> that we'd leave the additional old dstslot tts_values in place and
> only overwrite up to srcslot->natts but then we'd go on and
> materialize all the dstslot attributes. I think this might not be
> needed as we do dstslot->tts_nvalid = srcdesc->natts. I suspect we may
> be ok just to materialize the srcslot attributes and ignore the
> previous additional dstslot attributes.  Changing it to that would
> make the attached patch more simple.

We seem to use both tts_nvalid and tts_tupleDescriptor->natts. I forget
what the difference is. If we do what you say, we might end up trying
to access unmaterialized values beyond tts_nvalid. Better to
investigate whether such a hazard exists.

--
Best Wishes,
Ashutosh Bapat



Re: Use virtual tuple slot for Unique node

From
David Rowley
Date:
On Fri, 27 Oct 2023 at 22:05, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Fri, Oct 27, 2023 at 8:48 AM David Rowley <dgrowleyml@gmail.com> wrote:
> > I was uncertain if the old behaviour of when srcslot contains fewer
> > attributes than dstslot was intended or not.  What happens there is
> > that we'd leave the additional old dstslot tts_values in place and
> > only overwrite up to srcslot->natts but then we'd go on and
> > materialize all the dstslot attributes. I think this might not be
> > needed as we do dstslot->tts_nvalid = srcdesc->natts. I suspect we may
> > be ok just to materialize the srcslot attributes and ignore the
> > previous additional dstslot attributes.  Changing it to that would
> > make the attached patch more simple.
>
> We seem to use both tts_nvalid and tts_tupleDescriptor->natts. I forget
> what the difference is. If we do what you say, we might end up trying
> to access unmaterialized values beyond tts_nvalid. Better to
> investigate whether such a hazard exists.

The TupleDesc's natts is the number of attributes in the tuple
descriptor.  tts_nvalid is the greatest attribute number that's been
deformed in the tuple slot.  For slot types other than virtual slots,
we'll call slot_getsomeattrs() to deform more attributes from the
tuple.

The reason the code in question looks suspicious to me is that we do
"dstslot->tts_nvalid = srcdesc->natts;" and there's no way to deform
more attributes in a virtual slot. Note that
tts_virtual_getsomeattrs() unconditionally does elog(ERROR,
"getsomeattrs is not required to be called on a virtual tuple table
slot").  We shouldn't ever be accessing tts_values elements above
what tts_nvalid is set to, so either we should be setting
dstslot->tts_nvalid to dstdesc->natts so that we can access
tts_values elements above srcdesc->natts, or we're needlessly
materialising too many attributes in tts_virtual_copyslot().

David






Re: Use virtual tuple slot for Unique node

From
Denis Smirnov
Date:
I have taken a look at this discussion and at the code, and I am confused about how we choose the tuple table slot (TTS) type in PG.
Maybe you can clarify this topic for me.

1. Brief intro. There are four types of TTS. Plan tree "leaves":
- buffer heap (produced by index and table scans; has system columns and keeps shared buffer pins)
- heap (produced by FDWs; has system columns, but doesn't keep any pins)
- minimal (produced by values and materialization nodes like sort, agg, etc.)
Plan "branches":
- virtual (keeps datum references to the columns of the tuples in the child nodes)

A virtual TTS is cheaper to copy around the plan (as we copy datum references), but more expensive to materialize (we have
to construct a tuple from pieces).

Leaves are cheaper to materialize (it is essentially a memcpy under the hood), but very expensive to copy (we copy the value,
not the datum reference).

2. If we take a look at the materializing nodes in the plan, they produce different result TTS types:
- Outer child's TTS type: gather, gather merge, lock rows, limit;
- Minimal: material, sort, incremental sort, memoize, unique, hash, setop (can be heap as well);
- Virtual: group, agg, window agg.

From my point of view, a materializing node should preserve the incoming TTS type. For the sort node (which
materializes incoming tuples as minimal) it is OK to output a minimal result as well. It looks like unique should use the
outer child's TTS type (instead of the hardcoded minimal). But can anyone explain to me why group, agg and window agg return
virtual tuples instead of the same TTS type as their outer child? Do we expect that the parent node exists and requires exactly
virtual tuples (but what if the parent node is a sort and benefits from a minimal TTS)? So it looks like we need to take a
look not only at unique, but also inspect all the materializing nodes.





Re: Use virtual tuple slot for Unique node

From
"Andrey M. Borodin"
Date:

> On 29 Oct 2023, at 21:30, Denis Smirnov <darthunix@gmail.com> wrote:
>
> I have taken a look at this discussion and at the code, and I am confused about how we choose the tuple table slot (TTS) type
> in PG.

After an offline discussion with Denis, we decided to withdraw this patch from the CF for now. If anyone is willing to revive
work on this, please register a new entry in the next commitfest.
Thanks!


Best regards, Andrey Borodin.