Thread: Bug in row_number() optimization

Bug in row_number() optimization

From
Sergey Shinderuk
Date:
Hello,

We've accidentally found a subtle bug introduced by

commit 9d9c02ccd1aea8e9131d8f4edb21bf1687e40782
Author: David Rowley
Date:   Fri Apr 8 10:34:36 2022 +1200

     Teach planner and executor about monotonic window funcs


On a 32-bit system Valgrind reports use-after-free when running the 
"window" test:

==35487== Invalid read of size 4
==35487==    at 0x48398A4: memcpy (vg_replace_strmem.c:1035)
==35487==    by 0x1A2902: fill_val (heaptuple.c:287)
==35487==    by 0x1A2902: heap_fill_tuple (heaptuple.c:336)
==35487==    by 0x1A3C29: heap_form_minimal_tuple (heaptuple.c:1412)
==35487==    by 0x3D4555: tts_virtual_copy_minimal_tuple (execTuples.c:290)
==35487==    by 0x72FC33: ExecCopySlotMinimalTuple (tuptable.h:473)
==35487==    by 0x72FC33: tuplesort_puttupleslot (tuplesortvariants.c:610)
==35487==    by 0x403463: ExecSort (nodeSort.c:153)
==35487==    by 0x3D0C8E: ExecProcNodeFirst (execProcnode.c:464)
==35487==    by 0x40AF09: ExecProcNode (executor.h:259)
==35487==    by 0x40AF09: begin_partition (nodeWindowAgg.c:1106)
==35487==    by 0x40D259: ExecWindowAgg (nodeWindowAgg.c:2125)
==35487==    by 0x3D0C8E: ExecProcNodeFirst (execProcnode.c:464)
==35487==    by 0x405E17: ExecProcNode (executor.h:259)
==35487==    by 0x405E17: SubqueryNext (nodeSubqueryscan.c:53)
==35487==    by 0x3D41C7: ExecScanFetch (execScan.c:133)
==35487==    by 0x3D41C7: ExecScan (execScan.c:199)
==35487==  Address 0xe3e8af0 is 168 bytes inside a block of size 8,192 
alloc'd
==35487==    at 0x483463B: malloc (vg_replace_malloc.c:299)
==35487==    by 0x712B63: AllocSetContextCreateInternal (aset.c:446)
==35487==    by 0x3D82BE: CreateExprContextInternal (execUtils.c:253)
==35487==    by 0x3D84DC: CreateExprContext (execUtils.c:303)
==35487==    by 0x3D8750: ExecAssignExprContext (execUtils.c:482)
==35487==    by 0x40BC1A: ExecInitWindowAgg (nodeWindowAgg.c:2382)
==35487==    by 0x3D1232: ExecInitNode (execProcnode.c:346)
==35487==    by 0x4035E0: ExecInitSort (nodeSort.c:265)
==35487==    by 0x3D11AB: ExecInitNode (execProcnode.c:321)
==35487==    by 0x40BD36: ExecInitWindowAgg (nodeWindowAgg.c:2432)
==35487==    by 0x3D1232: ExecInitNode (execProcnode.c:346)
==35487==    by 0x405E99: ExecInitSubqueryScan (nodeSubqueryscan.c:126)


It's faster to run just this test under Valgrind:

    make installcheck-test TESTS='test_setup window'


This can also be reproduced on a 64-bit system by forcing int8 to be 
passed by reference:

--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -82,9 +82,7 @@
   *
   * Changing this requires an initdb.
   */
-#if SIZEOF_VOID_P >= 8
-#define USE_FLOAT8_BYVAL 1
-#endif
+#undef USE_FLOAT8_BYVAL

  /*
   * When we don't have native spinlocks, we use semaphores to simulate 
them.


Futhermore, zeroing freed memory makes the test fail:

--- a/src/include/utils/memdebug.h
+++ b/src/include/utils/memdebug.h
@@ -39,7 +39,7 @@ static inline void
  wipe_mem(void *ptr, size_t size)
  {
     VALGRIND_MAKE_MEM_UNDEFINED(ptr, size);
-   memset(ptr, 0x7F, size);
+   memset(ptr, 0, size);
     VALGRIND_MAKE_MEM_NOACCESS(ptr, size);
  }

$ cat src/test/regress/regression.diffs
diff -U3 
/home/sergey/pgwork/devel/src/src/test/regress/expected/window.out 
/home/sergey/pgwork/devel/src/src/test/regress/results/window.out
--- /home/sergey/pgwork/devel/src/src/test/regress/expected/window.out 
2022-11-03 18:26:52.203624217 +0300
+++ /home/sergey/pgwork/devel/src/src/test/regress/results/window.out 
2022-11-16 01:47:18.494273352 +0300
@@ -3721,7 +3721,8 @@
  -----------+-------+--------+-------------+----+----+----+----
   personnel |     5 |   3500 | 12-10-2007  |  2 |  1 |  2 |  2
   sales     |     3 |   4800 | 08-01-2007  |  3 |  1 |  3 |  3
-(2 rows)
+ sales     |     4 |   4800 | 08-08-2007  |  3 |  0 |  3 |  3
+(3 rows)

  -- Tests to ensure we don't push down the run condition when it's not 
valid to
  -- do so.


The failing query is:

SELECT * FROM
   (SELECT *,
           count(salary) OVER (PARTITION BY depname || '') c1, -- w1
           row_number() OVER (PARTITION BY depname) rn, -- w2
           count(*) OVER (PARTITION BY depname) c2, -- w2
           count(*) OVER (PARTITION BY '' || depname) c3 -- w3
    FROM empsalary
) e WHERE rn <= 1 AND c1 <= 3;


As far as I understand, ExecWindowAgg for the intermediate WindowAgg 
node switches into pass-through mode, stops evaluating row_number(), and 
returns the previous value instead. But if int8 is passed by reference, 
the previous value stored in econtext->ecxt_aggvalues becomes a dangling 
pointer when the per-output-tuple memory context is reset.

Attaching a patch that makes the window test fail on a 64-bit system.

Best regards,

-- 
Sergey Shinderuk        https://postgrespro.com/
Attachment

Re: Bug in row_number() optimization

From
Richard Guo
Date:

On Wed, Nov 16, 2022 at 7:38 AM Sergey Shinderuk <s.shinderuk@postgrespro.ru> wrote:
The failing query is:
SELECT * FROM
   (SELECT *,
           count(salary) OVER (PARTITION BY depname || '') c1, -- w1
           row_number() OVER (PARTITION BY depname) rn, -- w2
           count(*) OVER (PARTITION BY depname) c2, -- w2
           count(*) OVER (PARTITION BY '' || depname) c3 -- w3
    FROM empsalary
) e WHERE rn <= 1 AND c1 <= 3;
As far as I understand, ExecWindowAgg for the intermediate WindowAgg
node switches into pass-through mode, stops evaluating row_number(), and
returns the previous value instead. But if int8 is passed by reference,
the previous value stored in econtext->ecxt_aggvalues becomes a dangling
pointer when the per-output-tuple memory context is reset.
 
Yeah, you're right.  In this example the window function row_number()
goes into pass-through mode after the second evaluation because its
run condition does not hold true any more.  The remaining run would just
return the result from the second evaluation, which is stored in
econtext->ecxt_aggvalues[wfuncno].

If int8 is configured as pass-by-ref, the precomputed value from the
second evaluation is actually located in a memory area from context
ecxt_per_tuple_memory, with its pointer stored in ecxt_aggvalues.  As
this memory context is reset once per tuple, we would be prone to wrong
results.

I tried with memory context ecxt_per_query_memory when evaluating
window function in the case where int8 is configured as pass-by-ref and
I can see the problem vanishes.  I'm using the changes as below

--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1027,8 +1027,14 @@ eval_windowfunction(WindowAggState *winstate, WindowStatePerFunc perfuncstate,
 {
        LOCAL_FCINFO(fcinfo, FUNC_MAX_ARGS);
        MemoryContext oldContext;
-
-       oldContext = MemoryContextSwitchTo(winstate->ss.ps.ps_ExprContext->ecxt_per_tuple_memory);
+       MemoryContext evalWfuncContext;
+
+#ifdef USE_FLOAT8_BYVAL
+       evalWfuncContext = winstate->ss.ps.ps_ExprContext->ecxt_per_tuple_memory;
+#else
+       evalWfuncContext = winstate->ss.ps.ps_ExprContext->ecxt_per_query_memory;
+#endif
+       oldContext = MemoryContextSwitchTo(evalWfuncContext);

Thanks
Richard
 

Re: Bug in row_number() optimization

From
Richard Guo
Date:

On Tue, Nov 22, 2022 at 3:44 PM Richard Guo <guofenglinux@gmail.com> wrote:
On Wed, Nov 16, 2022 at 7:38 AM Sergey Shinderuk <s.shinderuk@postgrespro.ru> wrote:
The failing query is:
SELECT * FROM
   (SELECT *,
           count(salary) OVER (PARTITION BY depname || '') c1, -- w1
           row_number() OVER (PARTITION BY depname) rn, -- w2
           count(*) OVER (PARTITION BY depname) c2, -- w2
           count(*) OVER (PARTITION BY '' || depname) c3 -- w3
    FROM empsalary
) e WHERE rn <= 1 AND c1 <= 3;
As far as I understand, ExecWindowAgg for the intermediate WindowAgg
node switches into pass-through mode, stops evaluating row_number(), and
returns the previous value instead. But if int8 is passed by reference,
the previous value stored in econtext->ecxt_aggvalues becomes a dangling
pointer when the per-output-tuple memory context is reset.
 
Yeah, you're right.  In this example the window function row_number()
goes into pass-through mode after the second evaluation because its
run condition does not hold true any more.  The remaining run would just
return the result from the second evaluation, which is stored in
econtext->ecxt_aggvalues[wfuncno].

If int8 is configured as pass-by-ref, the precomputed value from the
second evaluation is actually located in a memory area from context
ecxt_per_tuple_memory, with its pointer stored in ecxt_aggvalues.  As
this memory context is reset once per tuple, we would be prone to wrong
results.
 
Regarding how to fix this problem, firstly I believe we need to evaluate
window functions in the per-tuple memory context, as the HEAD does.
When we decide we need to go into pass-through mode, I'm thinking that
we can just copy out the results of the last evaluation to the per-query
memory context, while still storing their pointers in ecxt_aggvalues.

Does this idea work?

Thanks
Richard

Re: Bug in row_number() optimization

From
Sergey Shinderuk
Date:
On 24.11.2022 06:16, Richard Guo wrote:
> Regarding how to fix this problem, firstly I believe we need to evaluate
> window functions in the per-tuple memory context, as the HEAD does.
> When we decide we need to go into pass-through mode, I'm thinking that
> we can just copy out the results of the last evaluation to the per-query
> memory context, while still storing their pointers in ecxt_aggvalues.
> 
> Does this idea work?
Although I'm not familiar with the code, this makes sense to me.

You proposed:

+#ifdef USE_FLOAT8_BYVAL
+       evalWfuncContext = 
winstate->ss.ps.ps_ExprContext->ecxt_per_tuple_memory;
+#else
+       evalWfuncContext = 
winstate->ss.ps.ps_ExprContext->ecxt_per_query_memory;
+#endif

Shouldn't we handle any pass-by-reference type the same? I suppose, a 
user-defined window function can return some other type, not int8.

Best regards,

-- 
Sergey Shinderuk        https://postgrespro.com/




Re: Bug in row_number() optimization

From
David Rowley
Date:
On Fri, 25 Nov 2022 at 00:52, Sergey Shinderuk
<s.shinderuk@postgrespro.ru> wrote:
> Shouldn't we handle any pass-by-reference type the same? I suppose, a
> user-defined window function can return some other type, not int8.

Thanks for reporting this and to you and Richard for working on a fix.

I've just looked at it and it seems that valgrind is complaining
because a tuple formed by an upper-level WindowAgg contains a pointer
to free'd memory due to the byref type and eval_windowaggregates() not
having been executed to fill in ecxt_aggvalues and ecxt_aggnulls on
the lower-level WindowAgg.

Since upper-level WindowAggs cannot reference values calculated in
some lower-level WindowAgg, why can't we just NULLify the pointers
instead? See attached.

It is possible to have a monotonic window function that does not
return int8.  Technically something like MAX(text_col) OVER (PARTITION
BY somecol ORDER BY text_col) is monotonically increasing, it's just
that I didn't add a support function to tell the planner about that.
Someone could come along in the future and suggest we do that and show
us some convincing use case.  So whatever the fix, it cannot assume
the window function's return type is int8.

David

Attachment

Re: Bug in row_number() optimization

From
Richard Guo
Date:

On Fri, Nov 25, 2022 at 7:34 AM David Rowley <dgrowleyml@gmail.com> wrote:
Since upper-level WindowAggs cannot reference values calculated in
some lower-level WindowAgg, why can't we just NULLify the pointers
instead? See attached.
 
Verified the problem is fixed with this patch.  I'm not familiar with
the WindowAgg execution codes.  As far as I understand, this patch works
because we set ecxt_aggnulls to true, making it a NULL value.  And the
top-level WindowAgg node's "Filter" is strict so that it can filter out
all the tuples that don't match the intermediate WindowAgg node's run
condition.  So I find the comments about "WindowAggs above us cannot
reference the result of another WindowAgg" confusing.  But maybe I'm
missing something.

Thanks
Richard

Re: Bug in row_number() optimization

From
David Rowley
Date:
On Fri, 25 Nov 2022 at 16:00, Richard Guo <guofenglinux@gmail.com> wrote:
> Verified the problem is fixed with this patch.  I'm not familiar with
> the WindowAgg execution codes.  As far as I understand, this patch works
> because we set ecxt_aggnulls to true, making it a NULL value.  And the
> top-level WindowAgg node's "Filter" is strict so that it can filter out
> all the tuples that don't match the intermediate WindowAgg node's run
> condition.  So I find the comments about "WindowAggs above us cannot
> reference the result of another WindowAgg" confusing.  But maybe I'm
> missing something.

There are two different pass-through modes that the WindowAgg can move
into when it detects that the run condition is no longer true:

1) WINDOWAGG_PASSTHROUGH and
2) WINDOWAGG_PASSTHROUGH_STRICT

#2 is used when the WindowAgg is the top-level one in this query
level. Remember we'll need multiple WindowAgg nodes when there are
multiple different windows to evaluate.  The reason that we need #1 is
that if there are multiple WindowAggs, then the top-level one (or just
any WindowAgg above it) might need all the rows from a lower-level
WindowAgg.  For example:

SELECT * FROM (SELECT row_number() over(order by id) rn, sum(qty) over
(order by date) qty from t) t where rn <= 10;

if the "order by id" window is evaluated first, we can't stop
outputting rows when rn <= 10 is no longer true as the "order by date"
window might need those.  In this case, once rn <= 10 is no longer
true, the WindowAgg for that window would go into
WINDOWAGG_PASSTHROUGH. This means we can stop window func evaluation
on any additional rows.  The final query will never see rn==11, so we
don't need to generate that.

The problem is that once the "order by id" window stops evaluating the
window funcs, if the window result is byref, then we leave junk in the
aggregate output columns.  Since we continue to output rows from that
WindowAgg for the top-level "order by date" window, we don't want to
form tuples with free'd memory.

Since nothing in the query will ever seen rn==11 and beyond, there's
no need to put anything in that part of the output tuple. We can just
make it an SQL NULL.

What I mean by "WindowAggs above us cannot reference the result of
another WindowAgg" is that the evaluation of sum(qty) over (order by
date) can't see the "rn" column. SQL does not allow it. If it did,
that would have to look something like:

SELECT * FROM (SELECT SUM(row_number() over (order by id)) over (order
by date) qty from t); -- not valid SQL

WINDOWAGG_PASSTHROUGH_STRICT not only does not evaluate window funcs,
it also does not even bother to store tuples in the tuple store. In
this case there's no higher-level WindowAgg that will need these
tuples, so we can just read our sub-node until we find the next
partition, or stop when there's no PARTITION BY clause.

Just thinking of the patch a bit more, what I wrote ends up
continually zeroing the values and marking the columns as NULL. Likely
we can just do this once when we do: winstate->status =
WINDOWAGG_PASSTHROUGH;  I'll test that out and make sure it works.

David



Re: Bug in row_number() optimization

From
Richard Guo
Date:

On Fri, Nov 25, 2022 at 11:26 AM David Rowley <dgrowleyml@gmail.com> wrote:
There are two different pass-through modes that the WindowAgg can move
into when it detects that the run condition is no longer true:

1) WINDOWAGG_PASSTHROUGH and
2) WINDOWAGG_PASSTHROUGH_STRICT

#2 is used when the WindowAgg is the top-level one in this query
level. Remember we'll need multiple WindowAgg nodes when there are
multiple different windows to evaluate.  The reason that we need #1 is
that if there are multiple WindowAggs, then the top-level one (or just
any WindowAgg above it) might need all the rows from a lower-level
WindowAgg. 
 
Thanks for the explanation!  I think now I understand pass-through modes
much better.
 
What I mean by "WindowAggs above us cannot reference the result of
another WindowAgg" is that the evaluation of sum(qty) over (order by
date) can't see the "rn" column. SQL does not allow it. 
 
I think I get your point.  Yeah, the 'rn' column is not needed for the
evaluation of WindowAggs above.  But ISTM it might be needed to evaluate
the quals of WindowAggs above.  Such as in the plan below

explain (costs off) SELECT * FROM
   (SELECT
           count(salary) OVER (PARTITION BY depname || '') c1, -- w1
           row_number() OVER (PARTITION BY depname) rn -- w2
    FROM empsalary
) e WHERE rn <= 1;
                            QUERY PLAN
-------------------------------------------------------------------
 Subquery Scan on e
   ->  WindowAgg
         Filter: ((row_number() OVER (?)) <= 1)
         ->  Sort
               Sort Key: (((empsalary.depname)::text || ''::text))
               ->  WindowAgg
                     Run Condition: (row_number() OVER (?) <= 1)
                     ->  Sort
                           Sort Key: empsalary.depname
                           ->  Seq Scan on empsalary
(10 rows)

The 'rn' column is calculated in the lower-level WindowAgg, and it is
needed to evaluate the 'Filter' of the upper-level WindowAgg.  In
pass-through mode, the lower-level WindowAgg would not be evaluated any
more, so we need to mark the 'rn' column to something that can false the
'Filter'.  Considering the 'Filter' is a strict function, marking it as
NULL would do.  I think this is why this patch works.
 
Just thinking of the patch a bit more, what I wrote ends up
continually zeroing the values and marking the columns as NULL. Likely
we can just do this once when we do: winstate->status =
WINDOWAGG_PASSTHROUGH; 
 
Yes, I also think we can do this only once when we go into pass-through
mode.

Thanks
Richard

Re: Bug in row_number() optimization

From
Sergey Shinderuk
Date:
On 25.11.2022 15:46, Richard Guo wrote:
> Considering the 'Filter' is a strict function, marking it as
> NULL would do.  I think this is why this patch works.

What about user-defined operators? I created my own <= operator for int8 
which returns true on null input, and put it in a btree operator class. 
With my operator I get:

   depname  | empno | salary | enroll_date | c1 | rn | c2 | c3
-----------+-------+--------+-------------+----+----+----+----
  personnel |     5 |   3500 | 2007-12-10  |  2 |  1 |  2 |  2
  sales     |     3 |   4800 | 2007-08-01  |  3 |  1 |  3 |  3
  sales     |     4 |   4800 | 2007-08-08  |  3 |    |    |  3
(3 rows)

Admittedly, it's weird that (null <= 1) evaluates to true. But does it 
violate  the contract of the btree operator class or something? Didn't 
find a clear answer in the docs.

-- 
Sergey Shinderuk        https://postgrespro.com/




Re: Bug in row_number() optimization

From
Tom Lane
Date:
Sergey Shinderuk <s.shinderuk@postgrespro.ru> writes:
> What about user-defined operators? I created my own <= operator for int8 
> which returns true on null input, and put it in a btree operator class. 
> Admittedly, it's weird that (null <= 1) evaluates to true. But does it 
> violate  the contract of the btree operator class or something? Didn't 
> find a clear answer in the docs.

It's pretty unlikely that this would work during an actual index scan.
I'm fairly sure that btree (and other index AMs) have hard-wired
assumptions that index operators are strict.

            regards, tom lane



Re: Bug in row_number() optimization

From
David Rowley
Date:
On Sat, 26 Nov 2022 at 05:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Sergey Shinderuk <s.shinderuk@postgrespro.ru> writes:
> > What about user-defined operators? I created my own <= operator for int8
> > which returns true on null input, and put it in a btree operator class.
> > Admittedly, it's weird that (null <= 1) evaluates to true. But does it
> > violate  the contract of the btree operator class or something? Didn't
> > find a clear answer in the docs.
>
> It's pretty unlikely that this would work during an actual index scan.
> I'm fairly sure that btree (and other index AMs) have hard-wired
> assumptions that index operators are strict.

If we're worried about that then we could just restrict this
optimization to only work with strict quals.

The proposal to copy the datums into the query context does not seem
to me to be a good idea. If there are a large number of partitions
then it sounds like we'll leak lots of memory.  We could invent some
partition context that we reset after each partition, but that's
probably more complexity than it would be worth.

I've attached a draft patch to move the code to nullify the aggregate
results so that's only done once per partition and adjusted the
planner to limit this to strict quals.

David

Attachment

Re: Bug in row_number() optimization

From
Sergey Shinderuk
Date:
On 28.11.2022 03:23, David Rowley wrote:
> On Sat, 26 Nov 2022 at 05:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Sergey Shinderuk <s.shinderuk@postgrespro.ru> writes:
>>> What about user-defined operators? I created my own <= operator for int8
>>> which returns true on null input, and put it in a btree operator class.
>>> Admittedly, it's weird that (null <= 1) evaluates to true. But does it
>>> violate  the contract of the btree operator class or something? Didn't
>>> find a clear answer in the docs.
>>
>> It's pretty unlikely that this would work during an actual index scan.
>> I'm fairly sure that btree (and other index AMs) have hard-wired
>> assumptions that index operators are strict.
> 
> If we're worried about that then we could just restrict this
> optimization to only work with strict quals.

Not sure this is necessary if btree operators must be strict anyway.


> The proposal to copy the datums into the query context does not seem
> to me to be a good idea. If there are a large number of partitions
> then it sounds like we'll leak lots of memory.  We could invent some
> partition context that we reset after each partition, but that's
> probably more complexity than it would be worth.

Ah, good point.


> I've attached a draft patch to move the code to nullify the aggregate
> results so that's only done once per partition and adjusted the
> planner to limit this to strict quals.

Not quite sure that we don't need to do anything for the 
WINDOWAGG_PASSTHROUGH_STRICT case. Although, we won't return any more 
tuples for the current partition, we still call ExecProject with 
dangling pointers. Is it okay?


+   if (!func_strict(opexpr->opfuncid))
+       return false;

Should return true instead?

-- 
Sergey Shinderuk        https://postgrespro.com/




Re: Bug in row_number() optimization

From
Richard Guo
Date:

On Mon, Nov 28, 2022 at 5:59 PM Sergey Shinderuk <s.shinderuk@postgrespro.ru> wrote:
Not quite sure that we don't need to do anything for the
WINDOWAGG_PASSTHROUGH_STRICT case. Although, we won't return any more
tuples for the current partition, we still call ExecProject with
dangling pointers. Is it okay?
 
AFAIU once we go into WINDOWAGG_PASSTHROUGH_STRICT we will spool all the
remaining tuples in the current partition without storing them and then
move to the next partition if available and become WINDOWAGG_RUN again
or become WINDOWAGG_DONE if there are no further partitions.  It seems
we would not have chance to see the dangling pointers.
 
+   if (!func_strict(opexpr->opfuncid))
+       return false;

Should return true instead?
 
Yeah, you're right.  This should be a thinko.

Thanks
Richard

Re: Bug in row_number() optimization

From
Sergey Shinderuk
Date:
On 01.12.2022 11:18, Richard Guo wrote:
> 
> On Mon, Nov 28, 2022 at 5:59 PM Sergey Shinderuk 
> <s.shinderuk@postgrespro.ru <mailto:s.shinderuk@postgrespro.ru>> wrote:
> 
>     Not quite sure that we don't need to do anything for the
>     WINDOWAGG_PASSTHROUGH_STRICT case. Although, we won't return any more
>     tuples for the current partition, we still call ExecProject with
>     dangling pointers. Is it okay?
> 
> AFAIU once we go into WINDOWAGG_PASSTHROUGH_STRICT we will spool all the
> remaining tuples in the current partition without storing them and then
> move to the next partition if available and become WINDOWAGG_RUN again
> or become WINDOWAGG_DONE if there are no further partitions.  It seems
> we would not have chance to see the dangling pointers.

Maybe I'm missing something, but the previous call to spool_tuples() 
might have read extra tuples (if the tuplestore spilled to disk), and 
after switching to WINDOWAGG_PASSTHROUGH_STRICT mode we nevertheless 
would loop through these extra tuples and call ExecProject if only to 
increment winstate->currentpos.

-- 
Sergey Shinderuk        https://postgrespro.com/




Re: Bug in row_number() optimization

From
David Rowley
Date:
On Fri, 2 Dec 2022 at 00:21, Sergey Shinderuk
<s.shinderuk@postgrespro.ru> wrote:
> Maybe I'm missing something, but the previous call to spool_tuples()
> might have read extra tuples (if the tuplestore spilled to disk), and
> after switching to WINDOWAGG_PASSTHROUGH_STRICT mode we nevertheless
> would loop through these extra tuples and call ExecProject if only to
> increment winstate->currentpos.

The tuples which are spooled in the WindowAgg node are the ones from
the WindowAgg's subnode.  Since these don't contain the results of the
WindowFunc, then I don't think there's any issue with what's stored in
any of the spooled tuples.

What matters is what we pass along to the node that's reading from the
WindowAgg. If we NULL out the memory where we store the WindowFunc
(and maybe an Aggref) results then the ExecProject in ExecWindowAgg()
will no longer fill the WindowAgg's output slot with the address of
free'd memory (or some stale byval value which has lingered for byval
return type WindowFuncs).

Since the patch I sent sets the context's ecxt_aggnulls to true, it
means that when we do the ExecProject(), the EEOP_WINDOW_FUNC in
ExecInterpExpr (or the JIT equivalent) will put an SQL NULL in the
*output* slot for the WindowAgg node. The same is true for
EEOP_AGGREFs as the WindowAgg node that we are running in
WINDOWAGG_PASSTHROUGH mode could also contain normal aggregate
functions, not just WindowFuncs.

David



Re: Bug in row_number() optimization

From
David Rowley
Date:
On Mon, 28 Nov 2022 at 22:59, Sergey Shinderuk
<s.shinderuk@postgrespro.ru> wrote:
>
> On 28.11.2022 03:23, David Rowley wrote:
> > On Sat, 26 Nov 2022 at 05:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> It's pretty unlikely that this would work during an actual index scan.
> >> I'm fairly sure that btree (and other index AMs) have hard-wired
> >> assumptions that index operators are strict.
> >
> > If we're worried about that then we could just restrict this
> > optimization to only work with strict quals.
>
> Not sure this is necessary if btree operators must be strict anyway.

I'd rather see the func_strict() test in there. You've already
demonstrated you can get wrong results with a non-strict operator. I'm
not disputing that it sounds like a broken operator class or not. I
just want to ensure we don't leave any holes open for this
optimisation to return incorrect results.

David



Re: Bug in row_number() optimization

From
David Rowley
Date:
On Thu, 1 Dec 2022 at 21:18, Richard Guo <guofenglinux@gmail.com> wrote:
>> +   if (!func_strict(opexpr->opfuncid))
>> +       return false;
>>
>> Should return true instead?
>
>
> Yeah, you're right.  This should be a thinko.

Yeah, oops. That's wrong.

I've adjusted that in the attached.

I'm keen to move along and push the fix for this bug.  If there are no
objections to the method in the attached and also adding the
restriction to limit the optimization to only working with strict
OpExprs, then I'm going to push this, likely about 24 hours from now.

David

Attachment