Thread: Bug in row_number() optimization
Hello, We've accidentally found a subtle bug introduced by commit 9d9c02ccd1aea8e9131d8f4edb21bf1687e40782 Author: David Rowley Date: Fri Apr 8 10:34:36 2022 +1200 Teach planner and executor about monotonic window funcs On a 32-bit system Valgrind reports use-after-free when running the "window" test: ==35487== Invalid read of size 4 ==35487== at 0x48398A4: memcpy (vg_replace_strmem.c:1035) ==35487== by 0x1A2902: fill_val (heaptuple.c:287) ==35487== by 0x1A2902: heap_fill_tuple (heaptuple.c:336) ==35487== by 0x1A3C29: heap_form_minimal_tuple (heaptuple.c:1412) ==35487== by 0x3D4555: tts_virtual_copy_minimal_tuple (execTuples.c:290) ==35487== by 0x72FC33: ExecCopySlotMinimalTuple (tuptable.h:473) ==35487== by 0x72FC33: tuplesort_puttupleslot (tuplesortvariants.c:610) ==35487== by 0x403463: ExecSort (nodeSort.c:153) ==35487== by 0x3D0C8E: ExecProcNodeFirst (execProcnode.c:464) ==35487== by 0x40AF09: ExecProcNode (executor.h:259) ==35487== by 0x40AF09: begin_partition (nodeWindowAgg.c:1106) ==35487== by 0x40D259: ExecWindowAgg (nodeWindowAgg.c:2125) ==35487== by 0x3D0C8E: ExecProcNodeFirst (execProcnode.c:464) ==35487== by 0x405E17: ExecProcNode (executor.h:259) ==35487== by 0x405E17: SubqueryNext (nodeSubqueryscan.c:53) ==35487== by 0x3D41C7: ExecScanFetch (execScan.c:133) ==35487== by 0x3D41C7: ExecScan (execScan.c:199) ==35487== Address 0xe3e8af0 is 168 bytes inside a block of size 8,192 alloc'd ==35487== at 0x483463B: malloc (vg_replace_malloc.c:299) ==35487== by 0x712B63: AllocSetContextCreateInternal (aset.c:446) ==35487== by 0x3D82BE: CreateExprContextInternal (execUtils.c:253) ==35487== by 0x3D84DC: CreateExprContext (execUtils.c:303) ==35487== by 0x3D8750: ExecAssignExprContext (execUtils.c:482) ==35487== by 0x40BC1A: ExecInitWindowAgg (nodeWindowAgg.c:2382) ==35487== by 0x3D1232: ExecInitNode (execProcnode.c:346) ==35487== by 0x4035E0: ExecInitSort (nodeSort.c:265) ==35487== by 0x3D11AB: ExecInitNode (execProcnode.c:321) ==35487== by 0x40BD36: ExecInitWindowAgg (nodeWindowAgg.c:2432) ==35487== by 0x3D1232: ExecInitNode (execProcnode.c:346) ==35487== by 0x405E99: ExecInitSubqueryScan (nodeSubqueryscan.c:126) It's faster to run just this test under Valgrind: make installcheck-test TESTS='test_setup window' This can also be reproduced on a 64-bit system by forcing int8 to be passed by reference: --- a/src/include/pg_config_manual.h +++ b/src/include/pg_config_manual.h @@ -82,9 +82,7 @@ * * Changing this requires an initdb. */ -#if SIZEOF_VOID_P >= 8 -#define USE_FLOAT8_BYVAL 1 -#endif +#undef USE_FLOAT8_BYVAL /* * When we don't have native spinlocks, we use semaphores to simulate them. Futhermore, zeroing freed memory makes the test fail: --- a/src/include/utils/memdebug.h +++ b/src/include/utils/memdebug.h @@ -39,7 +39,7 @@ static inline void wipe_mem(void *ptr, size_t size) { VALGRIND_MAKE_MEM_UNDEFINED(ptr, size); - memset(ptr, 0x7F, size); + memset(ptr, 0, size); VALGRIND_MAKE_MEM_NOACCESS(ptr, size); } $ cat src/test/regress/regression.diffs diff -U3 /home/sergey/pgwork/devel/src/src/test/regress/expected/window.out /home/sergey/pgwork/devel/src/src/test/regress/results/window.out --- /home/sergey/pgwork/devel/src/src/test/regress/expected/window.out 2022-11-03 18:26:52.203624217 +0300 +++ /home/sergey/pgwork/devel/src/src/test/regress/results/window.out 2022-11-16 01:47:18.494273352 +0300 @@ -3721,7 +3721,8 @@ -----------+-------+--------+-------------+----+----+----+---- personnel | 5 | 3500 | 12-10-2007 | 2 | 1 | 2 | 2 sales | 3 | 4800 | 08-01-2007 | 3 | 1 | 3 | 3 -(2 rows) + sales | 4 | 4800 | 08-08-2007 | 3 | 0 | 3 | 3 +(3 rows) -- Tests to ensure we don't push down the run condition when it's not valid to -- do so. The failing query is: SELECT * FROM (SELECT *, count(salary) OVER (PARTITION BY depname || '') c1, -- w1 row_number() OVER (PARTITION BY depname) rn, -- w2 count(*) OVER (PARTITION BY depname) c2, -- w2 count(*) OVER (PARTITION BY '' || depname) c3 -- w3 FROM empsalary ) e WHERE rn <= 1 AND c1 <= 3; As far as I understand, ExecWindowAgg for the intermediate WindowAgg node switches into pass-through mode, stops evaluating row_number(), and returns the previous value instead. But if int8 is passed by reference, the previous value stored in econtext->ecxt_aggvalues becomes a dangling pointer when the per-output-tuple memory context is reset. Attaching a patch that makes the window test fail on a 64-bit system. Best regards, -- Sergey Shinderuk https://postgrespro.com/
Attachment
On Wed, Nov 16, 2022 at 7:38 AM Sergey Shinderuk <s.shinderuk@postgrespro.ru> wrote:
The failing query is:
SELECT * FROM
(SELECT *,
count(salary) OVER (PARTITION BY depname || '') c1, -- w1
row_number() OVER (PARTITION BY depname) rn, -- w2
count(*) OVER (PARTITION BY depname) c2, -- w2
count(*) OVER (PARTITION BY '' || depname) c3 -- w3
FROM empsalary
) e WHERE rn <= 1 AND c1 <= 3;
As far as I understand, ExecWindowAgg for the intermediate WindowAgg
node switches into pass-through mode, stops evaluating row_number(), and
returns the previous value instead. But if int8 is passed by reference,
the previous value stored in econtext->ecxt_aggvalues becomes a dangling
pointer when the per-output-tuple memory context is reset.
Yeah, you're right. In this example the window function row_number()
goes into pass-through mode after the second evaluation because its
run condition does not hold true any more. The remaining run would just
return the result from the second evaluation, which is stored in
econtext->ecxt_aggvalues[wfuncno].
If int8 is configured as pass-by-ref, the precomputed value from the
second evaluation is actually located in a memory area from context
ecxt_per_tuple_memory, with its pointer stored in ecxt_aggvalues. As
this memory context is reset once per tuple, we would be prone to wrong
results.
I tried with memory context ecxt_per_query_memory when evaluating
window function in the case where int8 is configured as pass-by-ref and
I can see the problem vanishes. I'm using the changes as below
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1027,8 +1027,14 @@ eval_windowfunction(WindowAggState *winstate, WindowStatePerFunc perfuncstate,
{
LOCAL_FCINFO(fcinfo, FUNC_MAX_ARGS);
MemoryContext oldContext;
-
- oldContext = MemoryContextSwitchTo(winstate->ss.ps.ps_ExprContext->ecxt_per_tuple_memory);
+ MemoryContext evalWfuncContext;
+
+#ifdef USE_FLOAT8_BYVAL
+ evalWfuncContext = winstate->ss.ps.ps_ExprContext->ecxt_per_tuple_memory;
+#else
+ evalWfuncContext = winstate->ss.ps.ps_ExprContext->ecxt_per_query_memory;
+#endif
+ oldContext = MemoryContextSwitchTo(evalWfuncContext);
Thanks
Richard
goes into pass-through mode after the second evaluation because its
run condition does not hold true any more. The remaining run would just
return the result from the second evaluation, which is stored in
econtext->ecxt_aggvalues[wfuncno].
If int8 is configured as pass-by-ref, the precomputed value from the
second evaluation is actually located in a memory area from context
ecxt_per_tuple_memory, with its pointer stored in ecxt_aggvalues. As
this memory context is reset once per tuple, we would be prone to wrong
results.
I tried with memory context ecxt_per_query_memory when evaluating
window function in the case where int8 is configured as pass-by-ref and
I can see the problem vanishes. I'm using the changes as below
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1027,8 +1027,14 @@ eval_windowfunction(WindowAggState *winstate, WindowStatePerFunc perfuncstate,
{
LOCAL_FCINFO(fcinfo, FUNC_MAX_ARGS);
MemoryContext oldContext;
-
- oldContext = MemoryContextSwitchTo(winstate->ss.ps.ps_ExprContext->ecxt_per_tuple_memory);
+ MemoryContext evalWfuncContext;
+
+#ifdef USE_FLOAT8_BYVAL
+ evalWfuncContext = winstate->ss.ps.ps_ExprContext->ecxt_per_tuple_memory;
+#else
+ evalWfuncContext = winstate->ss.ps.ps_ExprContext->ecxt_per_query_memory;
+#endif
+ oldContext = MemoryContextSwitchTo(evalWfuncContext);
Thanks
Richard
On Tue, Nov 22, 2022 at 3:44 PM Richard Guo <guofenglinux@gmail.com> wrote:
On Wed, Nov 16, 2022 at 7:38 AM Sergey Shinderuk <s.shinderuk@postgrespro.ru> wrote:The failing query is:
SELECT * FROM
(SELECT *,
count(salary) OVER (PARTITION BY depname || '') c1, -- w1
row_number() OVER (PARTITION BY depname) rn, -- w2
count(*) OVER (PARTITION BY depname) c2, -- w2
count(*) OVER (PARTITION BY '' || depname) c3 -- w3
FROM empsalary
) e WHERE rn <= 1 AND c1 <= 3;
As far as I understand, ExecWindowAgg for the intermediate WindowAgg
node switches into pass-through mode, stops evaluating row_number(), and
returns the previous value instead. But if int8 is passed by reference,
the previous value stored in econtext->ecxt_aggvalues becomes a dangling
pointer when the per-output-tuple memory context is reset.Yeah, you're right. In this example the window function row_number()
goes into pass-through mode after the second evaluation because its
run condition does not hold true any more. The remaining run would just
return the result from the second evaluation, which is stored in
econtext->ecxt_aggvalues[wfuncno].
If int8 is configured as pass-by-ref, the precomputed value from the
second evaluation is actually located in a memory area from context
ecxt_per_tuple_memory, with its pointer stored in ecxt_aggvalues. As
this memory context is reset once per tuple, we would be prone to wrong
results.
Regarding how to fix this problem, firstly I believe we need to evaluate
window functions in the per-tuple memory context, as the HEAD does.
When we decide we need to go into pass-through mode, I'm thinking that
we can just copy out the results of the last evaluation to the per-query
memory context, while still storing their pointers in ecxt_aggvalues.
Does this idea work?
Thanks
Richard
window functions in the per-tuple memory context, as the HEAD does.
When we decide we need to go into pass-through mode, I'm thinking that
we can just copy out the results of the last evaluation to the per-query
memory context, while still storing their pointers in ecxt_aggvalues.
Does this idea work?
Thanks
Richard
On 24.11.2022 06:16, Richard Guo wrote: > Regarding how to fix this problem, firstly I believe we need to evaluate > window functions in the per-tuple memory context, as the HEAD does. > When we decide we need to go into pass-through mode, I'm thinking that > we can just copy out the results of the last evaluation to the per-query > memory context, while still storing their pointers in ecxt_aggvalues. > > Does this idea work? Although I'm not familiar with the code, this makes sense to me. You proposed: +#ifdef USE_FLOAT8_BYVAL + evalWfuncContext = winstate->ss.ps.ps_ExprContext->ecxt_per_tuple_memory; +#else + evalWfuncContext = winstate->ss.ps.ps_ExprContext->ecxt_per_query_memory; +#endif Shouldn't we handle any pass-by-reference type the same? I suppose, a user-defined window function can return some other type, not int8. Best regards, -- Sergey Shinderuk https://postgrespro.com/
On Fri, 25 Nov 2022 at 00:52, Sergey Shinderuk <s.shinderuk@postgrespro.ru> wrote: > Shouldn't we handle any pass-by-reference type the same? I suppose, a > user-defined window function can return some other type, not int8. Thanks for reporting this and to you and Richard for working on a fix. I've just looked at it and it seems that valgrind is complaining because a tuple formed by an upper-level WindowAgg contains a pointer to free'd memory due to the byref type and eval_windowaggregates() not having been executed to fill in ecxt_aggvalues and ecxt_aggnulls on the lower-level WindowAgg. Since upper-level WindowAggs cannot reference values calculated in some lower-level WindowAgg, why can't we just NULLify the pointers instead? See attached. It is possible to have a monotonic window function that does not return int8. Technically something like MAX(text_col) OVER (PARTITION BY somecol ORDER BY text_col) is monotonically increasing, it's just that I didn't add a support function to tell the planner about that. Someone could come along in the future and suggest we do that and show us some convincing use case. So whatever the fix, it cannot assume the window function's return type is int8. David
Attachment
On Fri, Nov 25, 2022 at 7:34 AM David Rowley <dgrowleyml@gmail.com> wrote:
Since upper-level WindowAggs cannot reference values calculated in
some lower-level WindowAgg, why can't we just NULLify the pointers
instead? See attached.
Verified the problem is fixed with this patch. I'm not familiar with
the WindowAgg execution codes. As far as I understand, this patch works
because we set ecxt_aggnulls to true, making it a NULL value. And the
top-level WindowAgg node's "Filter" is strict so that it can filter out
all the tuples that don't match the intermediate WindowAgg node's run
condition. So I find the comments about "WindowAggs above us cannot
reference the result of another WindowAgg" confusing. But maybe I'm
missing something.
Thanks
Richard
the WindowAgg execution codes. As far as I understand, this patch works
because we set ecxt_aggnulls to true, making it a NULL value. And the
top-level WindowAgg node's "Filter" is strict so that it can filter out
all the tuples that don't match the intermediate WindowAgg node's run
condition. So I find the comments about "WindowAggs above us cannot
reference the result of another WindowAgg" confusing. But maybe I'm
missing something.
Thanks
Richard
On Fri, 25 Nov 2022 at 16:00, Richard Guo <guofenglinux@gmail.com> wrote: > Verified the problem is fixed with this patch. I'm not familiar with > the WindowAgg execution codes. As far as I understand, this patch works > because we set ecxt_aggnulls to true, making it a NULL value. And the > top-level WindowAgg node's "Filter" is strict so that it can filter out > all the tuples that don't match the intermediate WindowAgg node's run > condition. So I find the comments about "WindowAggs above us cannot > reference the result of another WindowAgg" confusing. But maybe I'm > missing something. There are two different pass-through modes that the WindowAgg can move into when it detects that the run condition is no longer true: 1) WINDOWAGG_PASSTHROUGH and 2) WINDOWAGG_PASSTHROUGH_STRICT #2 is used when the WindowAgg is the top-level one in this query level. Remember we'll need multiple WindowAgg nodes when there are multiple different windows to evaluate. The reason that we need #1 is that if there are multiple WindowAggs, then the top-level one (or just any WindowAgg above it) might need all the rows from a lower-level WindowAgg. For example: SELECT * FROM (SELECT row_number() over(order by id) rn, sum(qty) over (order by date) qty from t) t where rn <= 10; if the "order by id" window is evaluated first, we can't stop outputting rows when rn <= 10 is no longer true as the "order by date" window might need those. In this case, once rn <= 10 is no longer true, the WindowAgg for that window would go into WINDOWAGG_PASSTHROUGH. This means we can stop window func evaluation on any additional rows. The final query will never see rn==11, so we don't need to generate that. The problem is that once the "order by id" window stops evaluating the window funcs, if the window result is byref, then we leave junk in the aggregate output columns. Since we continue to output rows from that WindowAgg for the top-level "order by date" window, we don't want to form tuples with free'd memory. Since nothing in the query will ever seen rn==11 and beyond, there's no need to put anything in that part of the output tuple. We can just make it an SQL NULL. What I mean by "WindowAggs above us cannot reference the result of another WindowAgg" is that the evaluation of sum(qty) over (order by date) can't see the "rn" column. SQL does not allow it. If it did, that would have to look something like: SELECT * FROM (SELECT SUM(row_number() over (order by id)) over (order by date) qty from t); -- not valid SQL WINDOWAGG_PASSTHROUGH_STRICT not only does not evaluate window funcs, it also does not even bother to store tuples in the tuple store. In this case there's no higher-level WindowAgg that will need these tuples, so we can just read our sub-node until we find the next partition, or stop when there's no PARTITION BY clause. Just thinking of the patch a bit more, what I wrote ends up continually zeroing the values and marking the columns as NULL. Likely we can just do this once when we do: winstate->status = WINDOWAGG_PASSTHROUGH; I'll test that out and make sure it works. David
On Fri, Nov 25, 2022 at 11:26 AM David Rowley <dgrowleyml@gmail.com> wrote:
There are two different pass-through modes that the WindowAgg can move
into when it detects that the run condition is no longer true:
1) WINDOWAGG_PASSTHROUGH and
2) WINDOWAGG_PASSTHROUGH_STRICT
#2 is used when the WindowAgg is the top-level one in this query
level. Remember we'll need multiple WindowAgg nodes when there are
multiple different windows to evaluate. The reason that we need #1 is
that if there are multiple WindowAggs, then the top-level one (or just
any WindowAgg above it) might need all the rows from a lower-level
WindowAgg.
Thanks for the explanation! I think now I understand pass-through modes
much better.
much better.
What I mean by "WindowAggs above us cannot reference the result of
another WindowAgg" is that the evaluation of sum(qty) over (order by
date) can't see the "rn" column. SQL does not allow it.
I think I get your point. Yeah, the 'rn' column is not needed for the
evaluation of WindowAggs above. But ISTM it might be needed to evaluate
the quals of WindowAggs above. Such as in the plan below
explain (costs off) SELECT * FROM
(SELECT
count(salary) OVER (PARTITION BY depname || '') c1, -- w1
row_number() OVER (PARTITION BY depname) rn -- w2
FROM empsalary
) e WHERE rn <= 1;
QUERY PLAN
-------------------------------------------------------------------
Subquery Scan on e
-> WindowAgg
Filter: ((row_number() OVER (?)) <= 1)
-> Sort
Sort Key: (((empsalary.depname)::text || ''::text))
-> WindowAgg
Run Condition: (row_number() OVER (?) <= 1)
-> Sort
Sort Key: empsalary.depname
-> Seq Scan on empsalary
(10 rows)
The 'rn' column is calculated in the lower-level WindowAgg, and it is
needed to evaluate the 'Filter' of the upper-level WindowAgg. In
pass-through mode, the lower-level WindowAgg would not be evaluated any
more, so we need to mark the 'rn' column to something that can false the
'Filter'. Considering the 'Filter' is a strict function, marking it as
NULL would do. I think this is why this patch works.
evaluation of WindowAggs above. But ISTM it might be needed to evaluate
the quals of WindowAggs above. Such as in the plan below
explain (costs off) SELECT * FROM
(SELECT
count(salary) OVER (PARTITION BY depname || '') c1, -- w1
row_number() OVER (PARTITION BY depname) rn -- w2
FROM empsalary
) e WHERE rn <= 1;
QUERY PLAN
-------------------------------------------------------------------
Subquery Scan on e
-> WindowAgg
Filter: ((row_number() OVER (?)) <= 1)
-> Sort
Sort Key: (((empsalary.depname)::text || ''::text))
-> WindowAgg
Run Condition: (row_number() OVER (?) <= 1)
-> Sort
Sort Key: empsalary.depname
-> Seq Scan on empsalary
(10 rows)
The 'rn' column is calculated in the lower-level WindowAgg, and it is
needed to evaluate the 'Filter' of the upper-level WindowAgg. In
pass-through mode, the lower-level WindowAgg would not be evaluated any
more, so we need to mark the 'rn' column to something that can false the
'Filter'. Considering the 'Filter' is a strict function, marking it as
NULL would do. I think this is why this patch works.
Just thinking of the patch a bit more, what I wrote ends up
continually zeroing the values and marking the columns as NULL. Likely
we can just do this once when we do: winstate->status =
WINDOWAGG_PASSTHROUGH;
Yes, I also think we can do this only once when we go into pass-through
mode.
Thanks
Richard
mode.
Thanks
Richard
On 25.11.2022 15:46, Richard Guo wrote: > Considering the 'Filter' is a strict function, marking it as > NULL would do. I think this is why this patch works. What about user-defined operators? I created my own <= operator for int8 which returns true on null input, and put it in a btree operator class. With my operator I get: depname | empno | salary | enroll_date | c1 | rn | c2 | c3 -----------+-------+--------+-------------+----+----+----+---- personnel | 5 | 3500 | 2007-12-10 | 2 | 1 | 2 | 2 sales | 3 | 4800 | 2007-08-01 | 3 | 1 | 3 | 3 sales | 4 | 4800 | 2007-08-08 | 3 | | | 3 (3 rows) Admittedly, it's weird that (null <= 1) evaluates to true. But does it violate the contract of the btree operator class or something? Didn't find a clear answer in the docs. -- Sergey Shinderuk https://postgrespro.com/
Sergey Shinderuk <s.shinderuk@postgrespro.ru> writes: > What about user-defined operators? I created my own <= operator for int8 > which returns true on null input, and put it in a btree operator class. > Admittedly, it's weird that (null <= 1) evaluates to true. But does it > violate the contract of the btree operator class or something? Didn't > find a clear answer in the docs. It's pretty unlikely that this would work during an actual index scan. I'm fairly sure that btree (and other index AMs) have hard-wired assumptions that index operators are strict. regards, tom lane
On Sat, 26 Nov 2022 at 05:19, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Sergey Shinderuk <s.shinderuk@postgrespro.ru> writes: > > What about user-defined operators? I created my own <= operator for int8 > > which returns true on null input, and put it in a btree operator class. > > Admittedly, it's weird that (null <= 1) evaluates to true. But does it > > violate the contract of the btree operator class or something? Didn't > > find a clear answer in the docs. > > It's pretty unlikely that this would work during an actual index scan. > I'm fairly sure that btree (and other index AMs) have hard-wired > assumptions that index operators are strict. If we're worried about that then we could just restrict this optimization to only work with strict quals. The proposal to copy the datums into the query context does not seem to me to be a good idea. If there are a large number of partitions then it sounds like we'll leak lots of memory. We could invent some partition context that we reset after each partition, but that's probably more complexity than it would be worth. I've attached a draft patch to move the code to nullify the aggregate results so that's only done once per partition and adjusted the planner to limit this to strict quals. David
Attachment
On 28.11.2022 03:23, David Rowley wrote: > On Sat, 26 Nov 2022 at 05:19, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> Sergey Shinderuk <s.shinderuk@postgrespro.ru> writes: >>> What about user-defined operators? I created my own <= operator for int8 >>> which returns true on null input, and put it in a btree operator class. >>> Admittedly, it's weird that (null <= 1) evaluates to true. But does it >>> violate the contract of the btree operator class or something? Didn't >>> find a clear answer in the docs. >> >> It's pretty unlikely that this would work during an actual index scan. >> I'm fairly sure that btree (and other index AMs) have hard-wired >> assumptions that index operators are strict. > > If we're worried about that then we could just restrict this > optimization to only work with strict quals. Not sure this is necessary if btree operators must be strict anyway. > The proposal to copy the datums into the query context does not seem > to me to be a good idea. If there are a large number of partitions > then it sounds like we'll leak lots of memory. We could invent some > partition context that we reset after each partition, but that's > probably more complexity than it would be worth. Ah, good point. > I've attached a draft patch to move the code to nullify the aggregate > results so that's only done once per partition and adjusted the > planner to limit this to strict quals. Not quite sure that we don't need to do anything for the WINDOWAGG_PASSTHROUGH_STRICT case. Although, we won't return any more tuples for the current partition, we still call ExecProject with dangling pointers. Is it okay? + if (!func_strict(opexpr->opfuncid)) + return false; Should return true instead? -- Sergey Shinderuk https://postgrespro.com/
On Mon, Nov 28, 2022 at 5:59 PM Sergey Shinderuk <s.shinderuk@postgrespro.ru> wrote:
Not quite sure that we don't need to do anything for the
WINDOWAGG_PASSTHROUGH_STRICT case. Although, we won't return any more
tuples for the current partition, we still call ExecProject with
dangling pointers. Is it okay?
AFAIU once we go into WINDOWAGG_PASSTHROUGH_STRICT we will spool all the
remaining tuples in the current partition without storing them and then
move to the next partition if available and become WINDOWAGG_RUN again
or become WINDOWAGG_DONE if there are no further partitions. It seems
we would not have chance to see the dangling pointers.
remaining tuples in the current partition without storing them and then
move to the next partition if available and become WINDOWAGG_RUN again
or become WINDOWAGG_DONE if there are no further partitions. It seems
we would not have chance to see the dangling pointers.
+ if (!func_strict(opexpr->opfuncid))
+ return false;
Should return true instead?
Yeah, you're right. This should be a thinko.
Thanks
Richard
Thanks
Richard
On 01.12.2022 11:18, Richard Guo wrote: > > On Mon, Nov 28, 2022 at 5:59 PM Sergey Shinderuk > <s.shinderuk@postgrespro.ru <mailto:s.shinderuk@postgrespro.ru>> wrote: > > Not quite sure that we don't need to do anything for the > WINDOWAGG_PASSTHROUGH_STRICT case. Although, we won't return any more > tuples for the current partition, we still call ExecProject with > dangling pointers. Is it okay? > > AFAIU once we go into WINDOWAGG_PASSTHROUGH_STRICT we will spool all the > remaining tuples in the current partition without storing them and then > move to the next partition if available and become WINDOWAGG_RUN again > or become WINDOWAGG_DONE if there are no further partitions. It seems > we would not have chance to see the dangling pointers. Maybe I'm missing something, but the previous call to spool_tuples() might have read extra tuples (if the tuplestore spilled to disk), and after switching to WINDOWAGG_PASSTHROUGH_STRICT mode we nevertheless would loop through these extra tuples and call ExecProject if only to increment winstate->currentpos. -- Sergey Shinderuk https://postgrespro.com/
On Fri, 2 Dec 2022 at 00:21, Sergey Shinderuk <s.shinderuk@postgrespro.ru> wrote: > Maybe I'm missing something, but the previous call to spool_tuples() > might have read extra tuples (if the tuplestore spilled to disk), and > after switching to WINDOWAGG_PASSTHROUGH_STRICT mode we nevertheless > would loop through these extra tuples and call ExecProject if only to > increment winstate->currentpos. The tuples which are spooled in the WindowAgg node are the ones from the WindowAgg's subnode. Since these don't contain the results of the WindowFunc, then I don't think there's any issue with what's stored in any of the spooled tuples. What matters is what we pass along to the node that's reading from the WindowAgg. If we NULL out the memory where we store the WindowFunc (and maybe an Aggref) results then the ExecProject in ExecWindowAgg() will no longer fill the WindowAgg's output slot with the address of free'd memory (or some stale byval value which has lingered for byval return type WindowFuncs). Since the patch I sent sets the context's ecxt_aggnulls to true, it means that when we do the ExecProject(), the EEOP_WINDOW_FUNC in ExecInterpExpr (or the JIT equivalent) will put an SQL NULL in the *output* slot for the WindowAgg node. The same is true for EEOP_AGGREFs as the WindowAgg node that we are running in WINDOWAGG_PASSTHROUGH mode could also contain normal aggregate functions, not just WindowFuncs. David
On Mon, 28 Nov 2022 at 22:59, Sergey Shinderuk <s.shinderuk@postgrespro.ru> wrote: > > On 28.11.2022 03:23, David Rowley wrote: > > On Sat, 26 Nov 2022 at 05:19, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> It's pretty unlikely that this would work during an actual index scan. > >> I'm fairly sure that btree (and other index AMs) have hard-wired > >> assumptions that index operators are strict. > > > > If we're worried about that then we could just restrict this > > optimization to only work with strict quals. > > Not sure this is necessary if btree operators must be strict anyway. I'd rather see the func_strict() test in there. You've already demonstrated you can get wrong results with a non-strict operator. I'm not disputing that it sounds like a broken operator class or not. I just want to ensure we don't leave any holes open for this optimisation to return incorrect results. David
On Thu, 1 Dec 2022 at 21:18, Richard Guo <guofenglinux@gmail.com> wrote: >> + if (!func_strict(opexpr->opfuncid)) >> + return false; >> >> Should return true instead? > > > Yeah, you're right. This should be a thinko. Yeah, oops. That's wrong. I've adjusted that in the attached. I'm keen to move along and push the fix for this bug. If there are no objections to the method in the attached and also adding the restriction to limit the optimization to only working with strict OpExprs, then I'm going to push this, likely about 24 hours from now. David