Thread: Use virtual tuple slot for Unique node
Hi,

I have inspected the performance of GROUP BY and DISTINCT queries over sorted data streams and found that the Group node (produced by GROUP BY) works faster than the Unique node (produced by DISTINCT). The flame graph shows the reason: Unique pallocs tuples for its result slot, while the Group node doesn't.

I wonder why we use minimal tuples for the Unique node instead of virtual ones? It looks like there is no actual reason for that, as Unique doesn't do any materialization.
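For context, here is roughly where the two nodes pick their result slot types. This is a hand-copied sketch from my reading of ExecInitGroup() and ExecInitUnique() (the Unique line also appears in the patch discussed later in this thread), so treat the exact lines as approximate:

/* nodeGroup.c, ExecInitGroup(): Group uses a virtual result slot */
ExecInitResultTupleSlotTL(&grpstate->ps, &TTSOpsVirtual);

/* nodeUnique.c, ExecInitUnique(): Unique asks for a minimal-tuple result slot,
 * so each ExecCopySlot() into it has to form (and palloc) a minimal tuple */
ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);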
On Thu, 31 Aug 2023 at 05:37, Денис Смирнов <darthunix@gmail.com> wrote:
> I have inspected the performance of GROUP BY and DISTINCT queries over sorted data streams and found that the Group node (produced by GROUP BY) works faster than the Unique node (produced by DISTINCT). The flame graph shows the reason: Unique pallocs tuples for the result slot while the Group node doesn't.
>
> I wonder why we use minimal tuples for the Unique node instead of virtual ones? It looks like there is no actual reason for that, as Unique doesn't do any materialization.

It would be good to see example queries and a demonstration of the performance increase. I'm not disputing your claims, but showing some performance numbers might catch the eye of a reviewer more quickly.

You should also add this to the September commitfest at https://commitfest.postgresql.org/44/

David
adb=# create table t(a int, primary key(a));
adb=# insert into t select random() * 5000000
from generate_series(1, 5000000)
on conflict do nothing;
adb=# explain analyze select a from t group by a;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Group (cost=0.43..98761.06 rows=3160493 width=4) (actual time=0.085..1225.139 rows=3160493 loops=1)
Group Key: a
-> Index Only Scan using t_pkey on t (cost=0.43..90859.82 rows=3160493 width=4) (actual time=0.081..641.567 rows=3160493 loops=1)
Heap Fetches: 0
Planning Time: 0.188 ms
Execution Time: 1370.027 ms
(6 rows)
adb=# explain analyze select distinct a from t;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.43..98761.06 rows=3160493 width=4) (actual time=0.135..1525.704 rows=3160493 loops=1)
-> Index Only Scan using t_pkey on t (cost=0.43..90859.82 rows=3160493 width=4) (actual time=0.130..635.742 rows=3160493 loops=1)
Heap Fetches: 0
Planning Time: 0.273 ms
Execution Time: 1660.857 ms
(5 rows)
We can see that ExecCopySlot occupies 24% of the CPU time inside the ExecUnique function (thanks to the palloc for Unique's minimal tuples). On the other hand, ExecCopySlot is only 6% of the ExecGroup function (we use virtual tuples in the Group node).
After the patch, the Unique node works a little bit faster than the Group node:
adb=# explain analyze select distinct a from t;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.43..98761.06 rows=3160493 width=4) (actual time=0.094..1072.007 rows=3160493 loops=1)
-> Index Only Scan using t_pkey on t (cost=0.43..90859.82 rows=3160493 width=4) (actual time=0.092..592.619 rows=3160493 loops=1)
Heap Fetches: 0
Planning Time: 0.203 ms
Execution Time: 1209.940 ms
(5 rows)
adb=# explain analyze select a from t group by a;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Group (cost=0.43..98761.06 rows=3160493 width=4) (actual time=0.074..1140.644 rows=3160493 loops=1)
Group Key: a
-> Index Only Scan using t_pkey on t (cost=0.43..90859.82 rows=3160493 width=4) (actual time=0.070..591.930 rows=3160493 loops=1)
Heap Fetches: 0
Planning Time: 0.193 ms
Execution Time: 1276.026 ms
(6 rows)
I have added the current patch to the commitfest.
It looks like my patch was not analyzed by the hackers mailing list due to an incorrect mime type, so I duplicate it here.

> On 31 Aug 2023, at 10:06, Denis Smirnov <darthunix@gmail.com> wrote:
>
> <v2-use-virtual-slots-for-unique-node.patch.txt>
I did a little more perf testing with this. I'm seeing the same benefit with the query you posted. But can we find a case where it's not beneficial? If I understand correctly, when the input slot is a virtual slot, it's cheaper to copy it to another virtual slot than to form a minimal tuple. Like in your test case. What if the input is a minimal tuple?

On master:

postgres=# set enable_hashagg=off;
SET
postgres=# explain analyze select distinct g::text, 'a', 'b', 'c','d', 'e','f','g','h' from generate_series(1, 5000000) g;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=2630852.42..2655852.42 rows=200 width=288) (actual time=4525.212..6576.992 rows=5000000 loops=1)
   ->  Sort  (cost=2630852.42..2643352.42 rows=5000000 width=288) (actual time=4525.211..5960.967 rows=5000000 loops=1)
         Sort Key: ((g)::text)
         Sort Method: external merge  Disk: 165296kB
         ->  Function Scan on generate_series g  (cost=0.00..75000.00 rows=5000000 width=288) (actual time=518.914..1194.702 rows=5000000 loops=1)
 Planning Time: 0.036 ms
 JIT:
   Functions: 5
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.242 ms (Deform 0.035 ms), Inlining 63.457 ms, Optimization 29.764 ms, Emission 20.592 ms, Total 114.056 ms
 Execution Time: 6766.399 ms
(11 rows)

With the patch:

postgres=# set enable_hashagg=off;
SET
postgres=# explain analyze select distinct g::text, 'a', 'b', 'c','d', 'e','f','g','h' from generate_series(1, 5000000) g;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=2630852.42..2655852.42 rows=200 width=288) (actual time=4563.639..7362.467 rows=5000000 loops=1)
   ->  Sort  (cost=2630852.42..2643352.42 rows=5000000 width=288) (actual time=4563.637..6069.000 rows=5000000 loops=1)
         Sort Key: ((g)::text)
         Sort Method: external merge  Disk: 165296kB
         ->  Function Scan on generate_series g  (cost=0.00..75000.00 rows=5000000 width=288) (actual time=528.060..1191.483 rows=5000000 loops=1)
 Planning Time: 0.720 ms
 JIT:
   Functions: 5
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.406 ms (Deform 0.065 ms), Inlining 68.385 ms, Optimization 21.656 ms, Emission 21.033 ms, Total 111.480 ms
 Execution Time: 7585.306 ms
(11 rows)

So not a win in this case. Could you peek at the outer slot type, and use the same kind of slot for the Unique's result? Or some more complicated logic, like use a virtual slot if all the values are pass-by-val? I'd also like to keep this simple, though...

Would this kind of optimization make sense elsewhere?

--
Heikki Linnakangas
Neon (https://neon.tech)
On Sat, 23 Sept 2023 at 03:15, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> So not a win in this case. Could you peek at the outer slot type, and
> use the same kind of slot for the Unique's result? Or some more
> complicated logic, like use a virtual slot if all the values are
> pass-by-val? I'd also like to keep this simple, though...
>
> Would this kind of optimization make sense elsewhere?

There are a few usages of ExecGetResultSlotOps(), e.g. ExecInitHashJoin().

If I adjust the patch to:

- ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
+ ExecInitResultTupleSlotTL(&uniquestate->ps,
+                           ExecGetResultSlotOps(outerPlanState(uniquestate),
+                                                NULL));

then I get the following performance on my Zen2 machine.

Test 1
drop table if exists t;
create table t(a int, b int);
insert into t select x,x from generate_series(1,1000000)x;
create index on t (a,b);
vacuum analyze t;
explain (analyze, timing off) select distinct a,b from t;

Master:
Execution Time: 149.669 ms
Execution Time: 149.019 ms
Execution Time: 151.240 ms

Patched:
Execution Time: 96.950 ms
Execution Time: 94.509 ms
Execution Time: 93.498 ms

Test 2
drop table if exists t;
create table t(a text, b text);
insert into t select x::text,x::text from generate_series(1,1000000)x;
create index on t (a,b);
vacuum analyze t;
explain (analyze, timing off) select distinct a,b from t;

Master:
Execution Time: 185.282 ms
Execution Time: 178.948 ms
Execution Time: 179.217 ms

Patched:
Execution Time: 141.031 ms
Execution Time: 141.136 ms
Execution Time: 142.163 ms

Test 3
set enable_hashagg=off;
explain (analyze, timing off) select distinct g::text, 'a', 'b', 'c','d', 'e','f','g','h' from generate_series(1, 50000) g;

Master:
Execution Time: 87.599 ms
Execution Time: 87.721 ms
Execution Time: 87.635 ms

Patched:
Execution Time: 83.449 ms
Execution Time: 84.314 ms
Execution Time: 86.239 ms

David
On Wed, 27 Sept 2023 at 20:01, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sat, 23 Sept 2023 at 03:15, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > So not a win in this case. Could you peek at the outer slot type, and
> > use the same kind of slot for the Unique's result? Or some more
> > complicated logic, like use a virtual slot if all the values are
> > pass-by-val? I'd also like to keep this simple, though...
> >
> > Would this kind of optimization make sense elsewhere?
>
> There are a few usages of ExecGetResultSlotOps(), e.g. ExecInitHashJoin().
>
> If I adjust the patch to:
>
> - ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
> + ExecInitResultTupleSlotTL(&uniquestate->ps,
> +                           ExecGetResultSlotOps(outerPlanState(uniquestate),
> +                                                NULL));

Just to keep this from going cold, here's that in patch form for anyone who wants to test.

I spent a bit more time running some more benchmarks and I don't see any workload where it slows things down. I'd be happy if someone else had a go at finding a regression.

David
On Tue, Oct 10, 2023 at 2:23 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Wed, 27 Sept 2023 at 20:01, David Rowley <dgrowleyml@gmail.com> wrote:
> >
> > There are a few usages of ExecGetResultSlotOps(), e.g. ExecInitHashJoin().
> >
> > If I adjust the patch to:
> >
> > - ExecInitResultTupleSlotTL(&uniquestate->ps, &TTSOpsMinimalTuple);
> > + ExecInitResultTupleSlotTL(&uniquestate->ps,
> > +                           ExecGetResultSlotOps(outerPlanState(uniquestate),
> > +                                                NULL));
>
> Just to keep this from going cold, here's that in patch form for anyone who wants to test.

Thanks. I don't recollect why we chose a minimal tuple slot here - maybe because we expected the underlying node to always produce a minimal tuple. But the Unique node copies the tuple returned by the underlying node. This copy is carried out by the TupleTableSlot-specific copy function, copyslot. Every implementation of this function first converts the source slot's tuple into the required form and then copies it. Having both TupleTableSlots, the output slot of the underlying node and the output slot of the Unique node, be of the same type avoids the first step and just copies the slot. It makes sense that it performs better.

The code looks fine to me.

> I spent a bit more time running some more benchmarks and I don't see
> any workload where it slows things down. I'd be happy if someone else
> had a go at finding a regression.

I built on your experiments and I might have found a minor regression.

Setup
=====
drop table if exists t_int;
create table t_int(a int, b int);
insert into t_int select x, x from generate_series(1,1000000)x;
create index on t_int (a,b);
vacuum analyze t_int;

drop table if exists t_text;
create table t_text(a text, b text);
insert into t_text select lpad(x::text, 1000, '0'), x::text from generate_series(1,1000000)x;
create index on t_text (a,b);
vacuum analyze t_text;

drop table if exists t_mixed; -- this one is new but it doesn't matter much
create table t_mixed(a text, b int);
insert into t_mixed select lpad(x::text, 1000, '0'), x from generate_series(1,1000000)x;
create index on t_mixed (a,b);
vacuum analyze t_mixed;

Queries and measurements (average execution time from 3 runs - on my Thinkpad T490)
======================
Q1 select distinct a,b from t_int
HEAD: 544.45 ms
patched: 381.55 ms

Q2 select distinct a,b from t_text
HEAD: 609.90 ms
patched: 513.42 ms

Q3 select distinct a,b from t_mixed
HEAD: 626.80 ms
patched: 468.22 ms

The more pass-by-ref data there is, the more memory is allocated, which seems to reduce the gain from this patch.

The nodes above use a buffer or heap TupleTableSlot. Let's also try some different nodes which output a minimal or virtual TTS.

set enable_hashagg to off;

Q4 select distinct a,b from (select sum(a) over (order by a rows 2 preceding) a, b from t_int) q
HEAD: 2529.58 ms
patched: 2332.23 ms

Q5 select distinct a,b from (select sum(a) over (order by a rows 2 preceding) a, b from t_int order by a, b) q
HEAD: 2633.69 ms
patched: 2255.99 ms

Q6 select distinct a,b from (select string_agg(a, ', ') over (order by a rows 2 preceding) a, b from t_text) q
HEAD: 108589.85 ms
patched: 107226.82 ms

Q7 select distinct a,b from (select string_agg(left(a, 100), ', ') over (order by a rows 2 preceding) a, b from t_text) q
HEAD: 16070.62 ms
patched: 16182.16 ms

This one is surprising though. Maybe the advantage of using the same tuple table slot is so narrow when large data needs to be copied that the execution times almost match. The patched and unpatched execution times differ by the margin of error either way.

--
Best Wishes,
Ashutosh Bapat
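(A note for readers following the copyslot discussion above: the conversion step Ashutosh describes lives in the per-slot-type copy callbacks. The following is a simplified sketch based on a reading of src/backend/executor/execTuples.c, not a verbatim copy, showing the two destination-slot paths being compared in this thread.)

/* Destination is a minimal-tuple slot: the source tuple must first be
 * converted into a minimal tuple, which pallocs and copies memory. */
static void
tts_minimal_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
{
	MemoryContext oldcontext;
	MinimalTuple mintuple;

	oldcontext = MemoryContextSwitchTo(dstslot->tts_mcxt);
	mintuple = ExecCopySlotMinimalTuple(srcslot);	/* convert + palloc + copy */
	MemoryContextSwitchTo(oldcontext);

	ExecStoreMinimalTuple(mintuple, dstslot, true);
}

/* Destination is a virtual slot: the Datum/isnull arrays are copied directly
 * and only pass-by-reference values are then copied into slot-owned memory. */
static void
tts_virtual_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
{
	TupleDesc	srcdesc = srcslot->tts_tupleDescriptor;

	tts_virtual_clear(dstslot);
	slot_getallattrs(srcslot);

	for (int natt = 0; natt < srcdesc->natts; natt++)
	{
		dstslot->tts_values[natt] = srcslot->tts_values[natt];
		dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
	}

	dstslot->tts_nvalid = srcdesc->natts;
	dstslot->tts_flags &= ~TTS_FLAG_EMPTY;

	/* copy byref Datums into memory owned by dstslot */
	tts_virtual_materialize(dstslot);
}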
On Thu, 12 Oct 2023 at 23:06, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> Q7 select distinct a,b from (select string_agg(left(a, 100), ', ')
> over (order by a rows 2 preceding) a, b from t_text) q
> HEAD: 16070.62 ms
> patched: 16182.16 ms

Did you time the SELECT or EXPLAIN ANALYZE? With SELECT, I'm unable to recreate this slowdown.

Using your setup:

$ cat bench.sql
set enable_hashagg=0;
set work_mem='10GB';
select distinct a,b from (select string_agg(left(a, 100), ', ') over (order by a rows 2 preceding) a, b from t_text) q;

Master @ 13d00729d
$ pgbench -n -f bench.sql -T 300 postgres | grep latency
latency average = 7739.250 ms

Master + use_subnode_slot_type_for_nodeunique.patch
$ pgbench -n -f bench.sql -T 300 postgres | grep latency
latency average = 7718.007 ms

It's hard to imagine why there would be a slowdown as this query uses a TTSOpsMinimalTuple slot type in the patch and the unpatched version.

David
On Thu, 19 Oct 2023 at 22:29, David Rowley <dgrowleyml@gmail.com> wrote:
> It's hard to imagine why there would be a slowdown as this query uses
> a TTSOpsMinimalTuple slot type in the patch and the unpatched version.

I shrunk down your table sizes to 10k rows instead of 1 million rows to reduce the CPU cache pressure on the queries.

I ran pgbench for 1 minute on each query and did pg_prewarm on each table. Here are the times I got in milliseconds:

Query  master   Master + 0001  compare
Q1       2.576          1.979  130.17%
Q2       9.546          9.941   96.03%
Q3       9.069          9.536   95.10%
Q4       7.285          7.208  101.07%
Q5       7.585          6.904  109.86%
Q6     162.253        161.434  100.51%
Q7      62.507         58.922  106.08%

I also noted down the slot type that nodeUnique.c is using in each of the queries:

Q1 TTSOpsVirtual
Q2 TTSOpsVirtual
Q3 TTSOpsVirtual
Q4 TTSOpsMinimalTuple
Q5 TTSOpsVirtual
Q6 TTSOpsMinimalTuple
Q7 TTSOpsMinimalTuple

So, I'm not really expecting Q4, Q6 or Q7 to change much. However, Q7 does seem to be above noise level faster and I'm not sure why.

We can see that Q2 and Q3 become a bit slower. This makes sense as tts_virtual_materialize() is quite a bit more complex than heap_copy_minimal_tuple(), which is a simple palloc/memcpy.

We'd likely see Q2 and Q3 do better with the patched version if there were more duplicates as there'd be less tuple deforming going on because of the virtual slots.

Overall, the patched version is 5.55% faster than master. However, it's pretty hard to say if we should do this or not. Q3 has a mix of varlena and byval types and that came out slower with the patched version.

I've attached the script I used to get the results and the setup, which is just your tables shrunk down to 10k rows.

David
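(For reference, the palloc/memcpy mentioned above is essentially all heap_copy_minimal_tuple() does. This is a from-memory sketch of the function in src/backend/access/common/heaptuple.c, so the exact code may differ slightly between versions:)

MinimalTuple
heap_copy_minimal_tuple(MinimalTuple mtup)
{
	MinimalTuple result;

	/* one palloc plus one memcpy of the already-formed minimal tuple */
	result = (MinimalTuple) palloc(mtup->t_len);
	memcpy(result, mtup, mtup->t_len);
	return result;
}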
On Thu, Oct 19, 2023 at 4:26 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Thu, 19 Oct 2023 at 22:29, David Rowley <dgrowleyml@gmail.com> wrote:
> > It's hard to imagine why there would be a slowdown as this query uses
> > a TTSOpsMinimalTuple slot type in the patch and the unpatched version.
>
> I shrunk down your table sizes to 10k rows instead of 1 million rows
> to reduce the CPU cache pressure on the queries.
>
> I ran pgbench for 1 minute on each query and did pg_prewarm on each
> table. Here are the times I got in milliseconds:
>
> Query  master   Master + 0001  compare
> Q1       2.576          1.979  130.17%
> Q2       9.546          9.941   96.03%
> Q3       9.069          9.536   95.10%
> Q4       7.285          7.208  101.07%
> Q5       7.585          6.904  109.86%
> Q6     162.253        161.434  100.51%
> Q7      62.507         58.922  106.08%
>
> I also noted down the slot type that nodeUnique.c is using in each of
> the queries:
>
> Q1 TTSOpsVirtual
> Q2 TTSOpsVirtual
> Q3 TTSOpsVirtual
> Q4 TTSOpsMinimalTuple
> Q5 TTSOpsVirtual
> Q6 TTSOpsMinimalTuple
> Q7 TTSOpsMinimalTuple
>
> So, I'm not really expecting Q4, Q6 or Q7 to change much. However, Q7
> does seem to be above noise level faster and I'm not sure why.

I ran my experiments again. It seems on my machine the execution times do vary a bit. I ran EXPLAIN ANALYZE on the query 5 times and took the average of the execution times. I did this three times. For each run the standard deviation was within 2%. Here are the numbers:

master: 13548.33, 13878.88, 14572.52
master + 0001: 13734.58, 14193.83, 14574.73

So for me, I would say this particular query performs the same with or without the patch.

> We can see that Q2 and Q3 become a bit slower. This makes sense as
> tts_virtual_materialize() is quite a bit more complex than
> heap_copy_minimal_tuple() which is a simple palloc/memcpy.

If the source slot is a materialized virtual slot, tts_virtual_copyslot() could perform a memcpy of the materialized data itself rather than materializing from datums. That might be more efficient.

> We'd likely see Q2 and Q3 do better with the patched version if there
> were more duplicates as there'd be less tuple deforming going on
> because of the virtual slots.
>
> Overall, the patched version is 5.55% faster than master. However,
> it's pretty hard to say if we should do this or not. Q3 has a mix of
> varlena and byval types and that came out slower with the patched
> version.

Theoretically, using the same slot type is supposed to be faster. We use the same slot types for input and output in other places as well. Maybe we should fix the above-mentioned inefficiency in tts_virtual_copyslot()?

--
Best Wishes,
Ashutosh Bapat
On Fri, 20 Oct 2023 at 22:30, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> I ran my experiments again. It seems on my machine the execution times
> do vary a bit. I ran EXPLAIN ANALYZE on the query 5 times and took
> the average of the execution times. I did this three times. For each run
> the standard deviation was within 2%. Here are the numbers:
> master: 13548.33, 13878.88, 14572.52
> master + 0001: 13734.58, 14193.83, 14574.73
>
> So for me, I would say this particular query performs the same with
> or without the patch.

I'm not really sure which of the 7 queries you're referring to here. The times you're quoting seem to align best to Q7 from your previous results, so I'll assume you mean Q7.

I'm not really concerned with Q7 as both patched and unpatched use TTSOpsMinimalTuple.

I also think you need to shrink the size of your benchmark down. With 1 million tuples, you're more likely to be also measuring the time it takes to get cache lines from memory into the CPU. A smaller scale test will make this less likely. Also, you'd be better timing SELECT rather than the time it takes to EXPLAIN ANALYZE. They're not the same thing. EXPLAIN ANALYZE has additional timing going on and we may end up not de-toasting toasted Datums.

> If the source slot is a materialized virtual slot,
> tts_virtual_copyslot() could perform a memcpy of the materialized data
> itself rather than materializing from datums. That might be more
> efficient.

I think you're talking about just performing a memcpy() of the VirtualTupleTableSlot->data field. Unfortunately, you'd not be able to do just that as you'd also need to repoint the non-byval Datums in tts_values at the newly memcpy'd memory. If you skipped that part, those would remain pointing to the original memory. If that memory goes away, then bad things will happen. I think you'd still need to do the 2nd loop in tts_virtual_materialize().

> Theoretically, using the same slot type is supposed to be faster. We
> use the same slot types for input and output in other places as well.

Which theory?

> Maybe we should fix the above-mentioned inefficiency in
> tts_virtual_copyslot()?

I don't have any bright ideas on how to make tts_virtual_materialize() itself faster. If there were some way to remember !attbyval attributes for the 2nd loop, that might be good, but creating somewhere to store that might result in further overheads.

tts_virtual_copyslot() perhaps could be sped up a little by doing a memcpy of the values/isnull arrays when the src and dst descriptors have the same number of attributes, aka something like:

if (srcdesc->natts == dstslot->tts_tupleDescriptor->natts)
    memcpy(dstslot->tts_values, srcslot->tts_values,
           MAXALIGN(srcdesc->natts * sizeof(Datum)) +
           MAXALIGN(srcdesc->natts * sizeof(bool)));
else
{
    for (int natt = 0; natt < srcdesc->natts; natt++)
    {
        dstslot->tts_values[natt] = srcslot->tts_values[natt];
        dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
    }
}

I imagine we'd only start to notice gains by doing that for larger natts values. Each of the 7 queries here, I imagine, wouldn't have enough columns for it to make much of a difference.

That seems valid enough to do based on how MakeTupleTableSlot() allocates those arrays. ExecSetSlotDescriptor() does not seem on board with the single allocation method, however. (Those pfrees in ExecSetSlotDescriptor() do look a bit fragile. https://coverage.postgresql.org/src/backend/executor/execTuples.c.gcov.html says they're never called.)

David
On Tue, Oct 24, 2023 at 4:30 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> I'm not really sure which of the 7 queries you're referring to here.
> The times you're quoting seem to align best to Q7 from your previous
> results, so I'll assume you mean Q7.
>
> I'm not really concerned with Q7 as both patched and unpatched use
> TTSOpsMinimalTuple.

It's Q7, yes. I was responding to your statement: "However, Q7 does seem to be above noise level faster and I'm not sure why." Anyway, we can set that aside.

> I also think you need to shrink the size of your benchmark down. With
> 1 million tuples, you're more likely to be also measuring the time it
> takes to get cache lines from memory into the CPU. A smaller scale
> test will make this less likely. Also, you'd be better timing SELECT
> rather than the time it takes to EXPLAIN ANALYZE. They're not the same
> thing. EXPLAIN ANALYZE has additional timing going on and we may end
> up not de-toasting toasted Datums.

I ran experiments with 10K rows and measured timing using \timing in psql. The measurements are much more flaky than with a larger set of rows and EXPLAIN ANALYZE. But I think your observations are good enough.

> I think you're talking about just performing a memcpy() of the
> VirtualTupleTableSlot->data field. Unfortunately, you'd not be able
> to do just that as you'd also need to repoint the non-byval Datums in
> tts_values at the newly memcpy'd memory. If you skipped that part,
> those would remain pointing to the original memory. If that memory
> goes away, then bad things will happen. I think you'd still need to do
> the 2nd loop in tts_virtual_materialize().

Yes, we will need to repoint the non-byval Datums, of course.

> I don't have any bright ideas on how to make tts_virtual_materialize()
> itself faster. If there were some way to remember !attbyval
> attributes for the 2nd loop, that might be good, but creating
> somewhere to store that might result in further overheads.

We may save the size of the data in VirtualTupleTableSlot, thus avoiding the first loop. I assume that when allocating VirtualTupleTableSlot->data, we always know what size we are allocating, so it should be just a matter of saving it in VirtualTupleTableSlot->size. This should avoid the first loop in tts_virtual_materialize() and give some speedup. We will need a loop to repoint the non-byval Datums. I imagine that the pointers to non-byval Datums can be computed as dest_slot->tts_values[natts] = dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data). This would work as long as all the non-byval datums in the source slot are stored flattened in the source slot's data. I am assuming that that would be true in a materialized virtual slot. The byval datums are copied as is. I think this will avoid multiple memcpy calls, one per non-byval attribute, and hence give some speedup. I may be wrong though.

--
Best Wishes,
Ashutosh Bapat
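(To make the idea above concrete, here is a rough, hypothetical sketch of the repointing Ashutosh describes. It assumes a new VirtualTupleTableSlot "size" field, which does not exist in current sources, plus a source virtual slot that has already been materialized into its own data buffer; flag handling and freeing of any old buffer are omitted.)

/* Hypothetical sketch only; "size" is an assumed new field. */
static void
tts_virtual_copy_materialized(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
{
	VirtualTupleTableSlot *src_vslot = (VirtualTupleTableSlot *) srcslot;
	VirtualTupleTableSlot *dst_vslot = (VirtualTupleTableSlot *) dstslot;
	TupleDesc	srcdesc = srcslot->tts_tupleDescriptor;

	/* copy the flattened pass-by-ref data with a single memcpy */
	dst_vslot->data = (char *) MemoryContextAlloc(dstslot->tts_mcxt,
												  src_vslot->size);
	memcpy(dst_vslot->data, src_vslot->data, src_vslot->size);

	for (int natt = 0; natt < srcdesc->natts; natt++)
	{
		Form_pg_attribute att = TupleDescAttr(srcdesc, natt);

		dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];

		if (att->attbyval || srcslot->tts_isnull[natt])
			dstslot->tts_values[natt] = srcslot->tts_values[natt];
		else
		{
			/* repoint the Datum at the same offset within the new buffer */
			Size		off = (char *) DatumGetPointer(srcslot->tts_values[natt]) -
				src_vslot->data;

			dstslot->tts_values[natt] = PointerGetDatum(dst_vslot->data + off);
		}
	}

	dstslot->tts_nvalid = srcdesc->natts;
}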
On Wed, 25 Oct 2023 at 22:48, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> We may save the size of the data in VirtualTupleTableSlot, thus avoiding
> the first loop. I assume that when allocating
> VirtualTupleTableSlot->data, we always know what size we are
> allocating, so it should be just a matter of saving it in
> VirtualTupleTableSlot->size. This should avoid the first loop in
> tts_virtual_materialize() and give some speedup. We will need a loop
> to repoint the non-byval Datums. I imagine that the pointers to non-byval
> Datums can be computed as dest_slot->tts_values[natts] =
> dest_vslot->data + (src_slot->tts_values[natts] - src_vslot->data).
> This would work as long as all the non-byval datums in the source slot
> are stored flattened in the source slot's data. I am assuming that
> that would be true in a materialized virtual slot. The byval datums
> are copied as is. I think this will avoid multiple memcpy calls, one
> per non-byval attribute, and hence give some speedup. I may be wrong though.

hmm, do you think it's common enough that we copy an already materialised virtual slot?

I tried adding the following code to tts_virtual_copyslot and didn't see the NOTICE message when running each of your test queries. "make check" also worked without anything failing after adjusting nodeUnique.c to always use a TTSOpsVirtual slot.

+	if (srcslot->tts_ops == &TTSOpsVirtual && TTS_SHOULDFREE(srcslot))
+		elog(NOTICE, "We copied a materialized virtual slot!");

I did get a failure in postgres_fdw's tests:

 server loopback options (table_name 'tab_batch_sharded_p1_remote');
 insert into tab_batch_sharded select * from tab_batch_local;
+NOTICE:  We copied a materialized virtual slot!
+NOTICE:  We copied a materialized virtual slot!

so I think it's probably not that common that we'd gain from that optimisation.

What might buy us a bit more would be to get rid of the for loop inside tts_virtual_copyslot() and copy the guts of tts_virtual_materialize() into tts_virtual_copyslot(), setting the dstslot tts_isnull and tts_values arrays in the same loop that we use to calculate the size.

I tried that in the attached patch and then tested it alongside the patch that changes the slot type.

master = 74604a37f
1 = [1]
2 = optimize_tts_virtual_copyslot.patch

Using the script from [2] and the setup from [3] but reduced to 10k tuples instead of 1 million. Times are the average query time in milliseconds for a 60 second pgbench run.

query  master   master+1  master+1+2    m/m+1   m/m+1+2
Q1      2.616      2.082       1.903   125.65%  137.47%
Q2      9.479     10.593      10.361    89.48%   91.49%
Q3     10.329     10.357      10.627    99.73%   97.20%
Q4      7.272      7.306       6.941    99.53%  104.77%
Q5      7.597      7.043       6.645   107.87%  114.33%
Q6    162.177    161.037     162.807   100.71%   99.61%
Q7     59.288     59.43       61.294    99.76%   96.73%
                                       103.25%  105.94%

I only expect Q2 and Q3 to gain from this. Q1 shouldn't improve but did, so the results might not be stable enough to warrant making any decisions from.

I was uncertain if the old behaviour when srcslot contains fewer attributes than dstslot was intended or not. What happens there is that we'd leave the additional old dstslot tts_values in place and only overwrite up to srcslot->natts, but then we'd go on and materialize all the dstslot attributes. I think this might not be needed as we do dstslot->tts_nvalid = srcdesc->natts. I suspect we may be ok just to materialize the srcslot attributes and ignore the previous additional dstslot attributes. Changing it to that would make the attached patch simpler.

David

[1] https://www.postgresql.org/message-id/attachment/151110/use_subnode_slot_type_for_nodeunique.patch
[2] https://www.postgresql.org/message-id/attachment/151342/uniquebench.sh.txt
[3] https://www.postgresql.org/message-id/CAExHW5uhTMdkk26oJg9f2ZVufbi5J4Lquj79MdSO%2BipnGJ_muw%40mail.gmail.com
On Fri, Oct 27, 2023 at 8:48 AM David Rowley <dgrowleyml@gmail.com> wrote:
>
> hmm, do you think it's common enough that we copy an already
> materialised virtual slot?
>
> I tried adding the following code to tts_virtual_copyslot and didn't
> see the NOTICE message when running each of your test queries. "make
> check" also worked without anything failing after adjusting
> nodeUnique.c to always use a TTSOpsVirtual slot.
>
> +	if (srcslot->tts_ops == &TTSOpsVirtual && TTS_SHOULDFREE(srcslot))
> +		elog(NOTICE, "We copied a materialized virtual slot!");
>
> I did get a failure in postgres_fdw's tests:
>
>  server loopback options (table_name 'tab_batch_sharded_p1_remote');
>  insert into tab_batch_sharded select * from tab_batch_local;
> +NOTICE:  We copied a materialized virtual slot!
> +NOTICE:  We copied a materialized virtual slot!
>
> so I think it's probably not that common that we'd gain from that optimisation.

Thanks for this analysis. If we aren't copying a materialized virtual slot often, there is no point in adding that optimization.

> What might buy us a bit more would be to get rid of the for loop
> inside tts_virtual_copyslot() and copy the guts of
> tts_virtual_materialize() into tts_virtual_copyslot(), setting the
> dstslot tts_isnull and tts_values arrays in the same loop that we use
> to calculate the size.
>
> I tried that in the attached patch and then tested it alongside the
> patch that changes the slot type.
>
> master = 74604a37f
> 1 = [1]
> 2 = optimize_tts_virtual_copyslot.patch
>
> Using the script from [2] and the setup from [3] but reduced to 10k
> tuples instead of 1 million. Times are the average query time in
> milliseconds for a 60 second pgbench run.
>
> query  master   master+1  master+1+2    m/m+1   m/m+1+2
> Q1      2.616      2.082       1.903   125.65%  137.47%
> Q2      9.479     10.593      10.361    89.48%   91.49%
> Q3     10.329     10.357      10.627    99.73%   97.20%
> Q4      7.272      7.306       6.941    99.53%  104.77%
> Q5      7.597      7.043       6.645   107.87%  114.33%
> Q6    162.177    161.037     162.807   100.71%   99.61%
> Q7     59.288     59.43       61.294    99.76%   96.73%
>                                        103.25%  105.94%
>
> I only expect Q2 and Q3 to gain from this. Q1 shouldn't improve but
> did, so the results might not be stable enough to warrant making any
> decisions from.

I am actually surprised to see that eliminating the loop shows improvements; there aren't hundreds of attributes involved in those queries. So I share your stability concerns. And even with these patches, Q2 and Q3 are still slower.

Q1 is consistently showing a gain of more than 25% for both of us, but the other queries aren't showing a whole lot of improvement. So I am wondering whether it's worth pursuing this change, similar to the opinion you expressed a few emails earlier.

> I was uncertain if the old behaviour when srcslot contains fewer
> attributes than dstslot was intended or not. What happens there is
> that we'd leave the additional old dstslot tts_values in place and
> only overwrite up to srcslot->natts, but then we'd go on and
> materialize all the dstslot attributes. I think this might not be
> needed as we do dstslot->tts_nvalid = srcdesc->natts. I suspect we may
> be ok just to materialize the srcslot attributes and ignore the
> previous additional dstslot attributes. Changing it to that would
> make the attached patch simpler.

We seem to use both tts_nvalid and tts_tupleDescriptor->natts; I forget what the difference is. If we do what you say, we might end up trying to access unmaterialized values beyond tts_nvalid. Better to investigate whether such a hazard exists.

--
Best Wishes,
Ashutosh Bapat
On Fri, 27 Oct 2023 at 22:05, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
> We seem to use both tts_nvalid and tts_tupleDescriptor->natts; I forget
> what the difference is. If we do what you say, we might end up trying
> to access unmaterialized values beyond tts_nvalid. Better to
> investigate whether such a hazard exists.

The TupleDesc's natts is the number of attributes in the tuple descriptor. tts_nvalid is the greatest attribute number that's been deformed in the tuple slot. For slot types other than virtual slots, we'll call slot_getsomeattrs() to deform more attributes from the tuple.

The reason the code in question looks suspicious to me is that we do "dstslot->tts_nvalid = srcdesc->natts;" and there's no way to deform more attributes in a virtual slot. Note that tts_virtual_getsomeattrs() unconditionally does elog(ERROR, "getsomeattrs is not required to be called on a virtual tuple table slot");.

We shouldn't ever be accessing tts_values elements above what tts_nvalid is set to, so either we should be setting dstslot->tts_nvalid to dstdesc->natts so that we can access tts_values elements above srcdesc->natts, or we're needlessly materialising too many attributes in tts_virtual_copyslot().

David
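(For reference, the deforming entry point mentioned above only does work when the requested attribute is beyond tts_nvalid. This is a simplified sketch based on src/include/executor/tuptable.h, so the exact shape may differ by version; for virtual slots the underlying getsomeattrs callback simply errors out, as quoted above.)

static inline void
slot_getsomeattrs(TupleTableSlot *slot, int attnum)
{
	/* nothing to do if this many attributes have already been deformed */
	if (slot->tts_nvalid < attnum)
		slot_getsomeattrs_int(slot, attnum);
}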
I have taken a look at this discussion and at the code, and I am confused about how we choose the tuple table slot (TTS) type in PG. Maybe you can clarify this topic for me.

1. Brief intro. There are four types of TTS.

Plan tree «leaves»:
- buffer heap (produced by index and table scans; has system columns and keeps shared buffer pins)
- heap (produced by FDWs; has system columns, but doesn't keep any pins)
- minimal (produced by values and materialization nodes like sort, agg, etc.)

Plan «branches»:
- virtual (keeps datum references to the columns of the tuples in the child nodes)

A virtual TTS is cheaper to copy around the plan (as we copy datum references), but more expensive to materialize (we have to construct a tuple from pieces). Leaves are cheaper to materialize (as we do a memcpy under the hood), but very expensive to copy (we copy the value, not the datum reference).

2. If we take a look at the materializing nodes in the plan, they produce different result TTS types.
- Outer child's TTS type: gather, gather merge, lock rows, limit;
- Minimal: material, sort, incremental sort, memoize, unique, hash, setop (can be heap as well);
- Virtual: group, agg, window agg.

From my point of view, a materialization node should preserve the incoming TTS type. For the sort node (which materializes incoming tuples as minimal tuples) it is OK to output a minimal result as well. It looks like unique should use the outer child's TTS type (instead of the hardcoded minimal one). But can anyone explain to me why group, agg and window agg return virtual slots instead of the same TTS type as the outer child has? Do we expect that the parent node exists and requires exactly virtual tuples (but what if the parent node is a sort and benefits from a minimal TTS)? So, it looks like we need to take a look not only at unique, but also inspect all the materialization nodes.
> On 29 Oct 2023, at 21:30, Denis Smirnov <darthunix@gmail.com> wrote:
>
> I have taken a look at this discussion and at the code, and I am confused about how we choose the tuple table slot (TTS) type in PG.

After an offline discussion with Denis, we decided to withdraw this patch from the CF for now. If anyone is willing to revive working on this, please register a new entry in the next commitfest. Thanks!

Best regards, Andrey Borodin.