Thread: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger

From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      17798
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 15.2
Operating system:   Ubuntu 22.04
Description:

When executing the following queries under valgrind:
cat << "EOF" | psql
CREATE TABLE bruttest(cnt int DEFAULT 0, t text);
CREATE FUNCTION noop_tfunc() RETURNS trigger LANGUAGE plpgsql AS $$ BEGIN
RETURN NEW; END$$;
CREATE TRIGGER brupdate_trigger BEFORE UPDATE ON bruttest FOR EACH ROW
EXECUTE FUNCTION noop_tfunc();
INSERT INTO bruttest(t) VALUES (repeat('x', 1000));
EOF

for ((i=1;i<=4;i++)); do
  echo "iteration $i"
  psql -c "UPDATE bruttest SET cnt = cnt + 1" &
  psql -c "UPDATE bruttest SET cnt = cnt + 1" &
  wait
done
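
For context, the server itself was running under Valgrind. A rough sketch of how that can be done (illustrative only; it assumes a build with CPPFLAGS=-DUSE_VALGRIND and the standard suppression file, and the exact options may differ from what I used):

# Start the server under Valgrind, logging to server.log (a sketch).
valgrind --quiet --trace-children=yes --track-origins=yes \
         --error-exitcode=1 --exit-on-first-error=yes \
         --time-stamp=yes --suppressions=src/tools/valgrind.supp \
         postgres -D "$PGDATA" >server.log 2>&1 &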

I get a failure:
...
iteration 4
UPDATE 1
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost

server.log contains:
==00:00:00:08.877 3381345== Invalid read of size 1
==00:00:00:08.877 3381345==    at 0x1E4721: heap_compute_data_size (heaptuple.c:138)
==00:00:00:08.877 3381345==    by 0x1E5674: heap_form_tuple (heaptuple.c:1061)
==00:00:00:08.877 3381345==    by 0x413AF6: tts_buffer_heap_materialize (execTuples.c:748)
==00:00:00:08.877 3381345==    by 0x414CB9: ExecFetchSlotHeapTuple (execTuples.c:1654)
==00:00:00:08.877 3381345==    by 0x3DF10B: ExecBRUpdateTriggers (trigger.c:3074)
==00:00:00:08.877 3381345==    by 0x438E2C: ExecUpdatePrologue (nodeModifyTable.c:1912)
==00:00:00:08.877 3381345==    by 0x43A623: ExecUpdate (nodeModifyTable.c:2269)
==00:00:00:08.877 3381345==    by 0x43D8BD: ExecModifyTable (nodeModifyTable.c:3856)
==00:00:00:08.877 3381345==    by 0x40E7F1: ExecProcNodeFirst (execProcnode.c:464)
==00:00:00:08.877 3381345==    by 0x406E19: ExecProcNode (executor.h:259)
==00:00:00:08.877 3381345==    by 0x406E19: ExecutePlan (execMain.c:1636)
==00:00:00:08.877 3381345==    by 0x406FF9: standard_ExecutorRun (execMain.c:363)
==00:00:00:08.877 3381345==    by 0x4070C5: ExecutorRun (execMain.c:307)
==00:00:00:08.877 3381345==  Address 0x957d114 is in a rw- anonymous segment
==00:00:00:08.877 3381345==
{
   <insert_a_suppression_name_here>
...
==00:00:00:08.877 3381345==
==00:00:00:08.877 3381345== Exit program on first error (--exit-on-first-error=yes)
2023-02-17 11:27:17.663 MSK [3381268] LOG:  server process (PID 3381345) exited with exit code 1
2023-02-17 11:27:17.663 MSK [3381268] DETAIL:  Failed process was running: UPDATE bruttest SET cnt = cnt + 1

I've also seen a relevant failure detected by asan (on master):
==1912964==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7f223dd52c78 at pc 0x55c9e846146a bp 0x7fff0c63e560 sp 0x7fff0c63dd30
WRITE of size 42205122 at 0x7f223dd52c78 thread T0
...
#8  0x000055c9e846953f in __asan::ReportGenericError(unsigned long, unsigned long, unsigned long, unsigned long, bool, unsigned long, unsigned int, bool) ()
#9  0x000055c9e846148c in __asan_memcpy ()
#10 0x000055c9e84d09c5 in fill_val (att=<optimized out>, bit=<optimized out>, bitmask=bitmask@entry=0x7fff0c63e620, dataP=dataP@entry=0x7fff0c63e5e0, infomask=infomask@entry=0x7f223dd09864, datum=<optimized out>, isnull=<optimized out>) at heaptuple.c:270
#11 0x000055c9e84d0080 in heap_fill_tuple (tupleDesc=<optimized out>, values=<optimized out>, isnull=<optimized out>, data=<optimized out>, data_size=<optimized out>, infomask=<optimized out>, bit=<optimized out>) at heaptuple.c:336
#12 0x000055c9e84d530e in heap_form_tuple (tupleDescriptor=0x7f22898f76e8, values=0x1d3084, isnull=0x6250000c85d8) at heaptuple.c:1090
#13 0x000055c9e8bc8d0e in tts_buffer_heap_materialize (slot=0x6250000c8548) at execTuples.c:748
#14 0x000055c9e8bcd02a in ExecFetchSlotHeapTuple (slot=slot@entry=0x6250000c8548, materialize=<optimized out>, shouldFree=shouldFree@entry=0x7fff0c63e810) at execTuples.c:1654
#15 0x000055c9e8afbd9a in ExecBRUpdateTriggers (estate=<optimized out>, epqstate=<optimized out>, relinfo=0x6250000a2e80, tupleid=<optimized out>, fdw_trigtuple=<optimized out>, newslot=<optimized out>, tmfd=<optimized out>) at trigger.c:3022
...

According to git bisect, the first bad commit for this anomaly is 86dc90056.

That commit changed just one thing in trigger.c:
-            epqslot_clean = ExecFilterJunk(relinfo->ri_junkFilter, epqslot_candidate);
+            epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate,
+                                                  oldslot);

I've found the following explanation for the failure:
1) After the ExecGetUpdateNewTuple() call, newslot and oldslot are
 linked together (their slot->tts_values[1] in this case point to the same
 memory address, inside oldslot's buffer).
2) Before that, GetTupleForTrigger() could have fetched the tuple into a new
 buffer, so oldslot is the only user of that buffer at that moment.
 (newslot never becomes an official user of the buffer.)
3) Then trigtuple = ExecFetchSlotHeapTuple(oldslot, ...) invokes
 tts_buffer_heap_materialize(), where oldslot->buffer is released.
4) Finally, newtuple = ExecFetchSlotHeapTuple(newslot, ...) invokes
 tts_buffer_heap_materialize(), which reads from the just-released buffer;
 valgrind detects that as an invalid access.
 If it goes undetected, various consequences are possible (in the asan case
 it was a memcpy with an incorrectly read, extremely large data_len).
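
To make the lifetime problem concrete, here is a standalone C analogy of that pattern (my illustration only, not PostgreSQL code; the struct and names are made up). Under valgrind or asan the final read is reported as invalid:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* "oldslot" owns a buffer; "newslot" only borrows a pointer into it. */
typedef struct
{
    char   *buffer;             /* owned storage (stand-in for a pinned page) */
    char   *values;             /* points into someone's storage */
} FakeSlot;

int
main(void)
{
    FakeSlot    oldslot;
    FakeSlot    newslot;

    oldslot.buffer = malloc(1000);
    memset(oldslot.buffer, 'x', 999);
    oldslot.buffer[999] = '\0';
    oldslot.values = oldslot.buffer;

    /* step 1: like ExecGetUpdateNewTuple(), newslot borrows oldslot's storage */
    newslot.buffer = NULL;
    newslot.values = oldslot.values;

    /* step 3: like materializing oldslot, copy its data and drop the buffer */
    oldslot.values = strdup(oldslot.buffer);
    free(oldslot.buffer);
    oldslot.buffer = NULL;

    /* step 4: like materializing newslot afterwards: reads released memory */
    printf("%zu\n", strlen(newslot.values));

    free(oldslot.values);
    return 0;
}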


17.02.2023 13:00, PG Bug reporting form wrote:
> The following bug has been logged on the website:
> ...
>
> I've found the following explanation for the failure:
> 1) After the ExecGetUpdateNewTuple() call, newslot and oldslot are
>  linked together (their slot->tts_values[1] in this case point to the same
>  memory address, inside oldslot's buffer).
> 2) Before that, GetTupleForTrigger() could have fetched the tuple into a new
>  buffer, so oldslot is the only user of that buffer at that moment.
>  (newslot never becomes an official user of the buffer.)
> 3) Then trigtuple = ExecFetchSlotHeapTuple(oldslot, ...) invokes
>  tts_buffer_heap_materialize(), where oldslot->buffer is released.
> 4) Finally, newtuple = ExecFetchSlotHeapTuple(newslot, ...) invokes
>  tts_buffer_heap_materialize(), which reads from the just-released buffer;
>  valgrind detects that as an invalid access.
>  If it goes undetected, various consequences are possible (in the asan case
>  it was a memcpy with an incorrectly read, extremely large data_len).
I've tried to materialize newslot before the oldslot materialization
(in ExecFetchSlotHeapTuple(), where their shared memory is released),
and it looks like that fixes the issue.
A similar thing was done in 75e03eabe.
But I don't know this code well enough, so maybe I'm missing something.

Best regards,
Alexander
07.03.2023 09:00, Alexander Lakhin wrote:
> I've tried to materialize newslot before the oldslot materialization
> (in ExecFetchSlotHeapTuple(), where their shared memory is released),
> and it looks like that fixes the issue.

I've made a simple isolation test to illustrate the bug, which I'd consider
serious. On master it shows (under valgrind):
# using temp instance on port 61696 with PID 614130
not ok 1     - bru-trigger                              2147 ms
# (test process exited with exit code 1)

src/test/isolation/output_iso/log/postmaster.log contains:
...
==00:00:00:05.840 615284== Invalid read of size 1
==00:00:00:05.840 615284==    at 0x1E376C: heap_compute_data_size (heaptuple.c:147)
==00:00:00:05.840 615284==    by 0x1E4458: heap_form_tuple (heaptuple.c:1061)
==00:00:00:05.840 615284==    by 0x3DB74A: tts_buffer_heap_materialize (execTuples.c:749)
==00:00:00:05.840 615284==    by 0x3DC5EB: ExecFetchSlotHeapTuple (execTuples.c:1655)
==00:00:00:05.840 615284==    by 0x3A6BA7: ExecBRUpdateTriggers (trigger.c:3032)
==00:00:00:05.840 615284==    by 0x3FE207: ExecUpdatePrologue (nodeModifyTable.c:1916)
==00:00:00:05.840 615284==    by 0x3FF838: ExecUpdate (nodeModifyTable.c:2289)
==00:00:00:05.840 615284==    by 0x401BD4: ExecModifyTable (nodeModifyTable.c:3795)
==00:00:00:05.840 615284==    by 0x3D65FF: ExecProcNodeFirst (execProcnode.c:464)
==00:00:00:05.840 615284==    by 0x3CE4F5: ExecProcNode (executor.h:272)
==00:00:00:05.840 615284==    by 0x3CE585: ExecutePlan (execMain.c:1633)
==00:00:00:05.840 615284==    by 0x3CF220: standard_ExecutorRun (execMain.c:364)
...
2023-04-01 14:26:31.543 MSK postmaster[615243] LOG:  server process (PID 615284) exited with exit code 1
2023-04-01 14:26:31.543 MSK postmaster[615243] DETAIL:  Failed process was running: UPDATE bruttest SET cnt = cnt + 1;

Maybe the test could supplement a fix (I'm still unsure how to fix the issue
the right way).
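
For reference, such a spec can look roughly like the following sketch (the step names and details are illustrative; the attached test may differ):

setup
{
    CREATE TABLE bruttest(cnt int DEFAULT 0, t text);
    CREATE FUNCTION noop_tfunc() RETURNS trigger LANGUAGE plpgsql AS
        $$ BEGIN RETURN NEW; END $$;
    CREATE TRIGGER brupdate_trigger BEFORE UPDATE ON bruttest
        FOR EACH ROW EXECUTE FUNCTION noop_tfunc();
    INSERT INTO bruttest(t) VALUES (repeat('x', 1000));
}

teardown
{
    DROP TABLE bruttest;
    DROP FUNCTION noop_tfunc();
}

session s1
step s1b { BEGIN; }
step s1u { UPDATE bruttest SET cnt = cnt + 1; }
step s1c { COMMIT; }

session s2
step s2u { UPDATE bruttest SET cnt = cnt + 1; }

# s1 updates the row and holds it locked; s2's update blocks, and after s1
# commits it takes the EvalPlanQual path in GetTupleForTrigger().
permutation s1b s1u s2u s1c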

Best regards,
Alexander

On Tue, Mar 7, 2023 at 2:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> I've found the following explanation for the failure:
> 1) After the ExecGetUpdateNewTuple() call, newslot and oldslot are
>  linked together (their slot->tts_values[1] in this case point to the same
>  memory address, inside oldslot's buffer).
> 2) Before that, GetTupleForTrigger() could have fetched the tuple into a new
>  buffer, so oldslot is the only user of that buffer at that moment.
>  (newslot never becomes an official user of the buffer.)
> 3) Then trigtuple = ExecFetchSlotHeapTuple(oldslot, ...) invokes
>  tts_buffer_heap_materialize(), where oldslot->buffer is released.
> 4) Finally, newtuple = ExecFetchSlotHeapTuple(newslot, ...) invokes
>  tts_buffer_heap_materialize(), which reads from the just-released buffer;
>  valgrind detects that as an invalid access.
>  If it goes undetected, various consequences are possible (in the asan case
>  it was a memcpy with an incorrectly read, extremely large data_len).
> I've tried to materialize newslot before the oldslot materialization
> (in ExecFetchSlotHeapTuple(), where their shared memory is released),
> and it looks like that fixes the issue.

Reproduced this issue on master with your queries.  I looked into this
issue and I agree with your analysis.  I think this is exactly what
happened.

I also agree that we should materialize the newslot before we fetch
trigtuple from the oldslot, which would materialize the oldslot and
release all buffer pins.  But I'm not too familiar with the surrounding
code, so someone else needs to have a look.

Thanks
Richard

On Mon, Apr 3, 2023 at 2:29 PM Richard Guo <guofenglinux@gmail.com> wrote:
> Reproduced this issue on master with your queries.  I looked into this
> issue and I agree with your analysis.  I think this is exactly what
> happened.
>
> I also agree that we should materialize the newslot before we fetch
> trigtuple from the oldslot, which would materialize the oldslot and
> release all buffer pins.  But I'm not too familiar with the surrounding
> code, so someone else needs to have a look.

I've had a second look at this issue and now I think the fix in the v1 patch
is correct.  I think the comment needs to be updated for this change,
maybe something like

  * (Typically, newslot was also generated by ExecGetUpdateNewTuple, so
  * that epqslot_clean will be that same slot and the copy step below
- * is not needed.)
+ * is not needed.  And we need to materialize newslot in this case,
+ * since its tuple might be dependent on oldslot's storage, which
+ * might not be a local copy and be freed before we fetch newslot's
+ * tuple.)

Thanks
Richard
Hello Richard,

19.04.2023 11:29, Richard Guo wrote:
> I've had a second look at this issue and now I think the fix in the v1 patch
> is correct.  I think the comment needs to be updated for this change,
> maybe something like
>
>   * (Typically, newslot was also generated by ExecGetUpdateNewTuple, so
>   * that epqslot_clean will be that same slot and the copy step below
> - * is not needed.)
> + * is not needed.  And we need to materialize newslot in this case,
> + * since its tuple might be dependent on oldslot's storage, which
> + * might not be a local copy and be freed before we fetch newslot's
> + * tuple.)

Thanks for looking into this!

I've updated the comment based on your suggestion. Please look at the patch
attached.

I hesitate to add the isolation test to the patch as adding a dedicated
test just for this case seems too wasteful to me. Maybe that scenario could
be added to one of the existing trigger-related tests, but I couldn't find a
test that is generic enough for that.

Best regards,
Alexander
Hello Tom,

20.04.2023 13:00, Alexander Lakhin wrote:

> 19.04.2023 11:29, Richard Guo wrote:
>> I've had a second look at this issue and now I think the fix in the v1 patch
>> is correct.  I think the comment needs to be updated for this change,
>> maybe something like
>
> I've updated the comment based on your suggestion. Please look at the patch
> attached.


Could you please confirm the issue, Richard's and my conclusions, and the
correctness of the patch, in light of the upcoming May releases?
Maybe I'm wrong, but I think that this bug can lead to data corruption in
databases where BRU triggers are used.

If the defect is real but the proposed patch is not good enough, I would
register it on the commitfest to continue working on it.

Best regards,
Alexander
Alexander Lakhin <exclusion@gmail.com> writes:
> Could you please confirm the issue, Richard's and my conclusions, and the
> correctness of the patch, in light of the upcoming May releases?
> Maybe I'm wrong, but I think that this bug can lead to data corruption in
> databases where BRU triggers are used.

Clearly it could, though the probability seems low since the just-released
buffer would have to get recycled very quickly.

I am not entirely comfortable with blaming this on 86dc90056 though,
even though "git bisect" seems to finger that.  The previous coding
there with ExecFilterJunk() also produced a virtual tuple, as you
can see by examining ExecFilterJunk, so why didn't the problem
manifest before?  I think the answer may be that before 86dc90056,
we were effectively relying on the underlying SeqScan node to keep
the buffer pinned, but now we're re-fetching the tuple and no pin
is held by lower plan nodes.  I'm not entirely sure about that though;
ISTM a SeqScan ought to hold pin on its current tuple in any case.
However, if the UPDATE plan were more complicated (involving a join)
then it's quite believable that we'd get here with no relevant pin
being held.

The practical impact of that concern is that I'm not convinced that
the proposed patch fixes all variants of the issue.  Maybe we need
to materialize even if newslot != epqslot_clean.  Maybe we need to
do it even if there wasn't a concurrent update.  Maybe we need to
do it in pre-v14 branches, too.  It's certainly not obvious that this
function ought to assume anything about which slots are pointing at
what.

Also, if we do materialize here, does this code a little further down
become redundant?

            if (should_free_trig && newtuple == trigtuple)
                ExecMaterializeSlot(newslot);

A completely different way of looking at it is that we should not
treat this as ExecBRUpdateTriggers's bug at all.  The slot mechanisms
are supposed to protect the data referenced by a slot, so why is that
failing to happen in this example?  The correct fix might involve
newslot acquiring a buffer pin, for example.

Bottom line is that I'm afraid the present patch is band-aiding a
specific problem without solving the general case.

            regards, tom lane



30.04.2023 01:24, Tom Lane wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
>> Could you please confirm the issue, Richard's and my conclusions, and the
>> correctness of the patch, in light of the upcoming May releases?
>> Maybe I'm wrong, but I think that this bug can lead to data corruption in
>> databases where BRU triggers are used.
> Clearly it could, though the probability seems low since the just-released
> buffer would have to get recycled very quickly.

So I've made several mistakes while trying to reach a solution on thin ice.
I initially observed the issue with asan (perhaps under some specific
conditions, such as memory pressure; I can't say now), but when I produced
and simplified the repro for the bug report I focused on valgrind, so the
final repro doesn't demonstrate real memory corruption, and it's no longer
clear which conditions are needed to cause it. So I really can't estimate
the probability of the corruption.

> I am not entirely comfortable with blaming this on 86dc90056 though,
> even though "git bisect" seems to finger that.  The previous coding
> there with ExecFilterJunk() also produced a virtual tuple, as you
> can see by examining ExecFilterJunk, so why didn't the problem
> manifest before?  I think the answer may be that before 86dc90056,
> we were effectively relying on the underlying SeqScan node to keep
> the buffer pinned, but now we're re-fetching the tuple and no pin
> is held by lower plan nodes.  I'm not entirely sure about that though;
> ISTM a SeqScan ought to hold pin on its current tuple in any case.
> However, if the UPDATE plan were more complicated (involving a join)
> then it's quite believable that we'd get here with no relevant pin
> being held.

Yeah, also the environment changed since 86dc90056, so I couldn't just revert
that commit to test the ExecFilterJunk() behavior on master.

All the questions you raised require a more thorough investigation.

> The practical impact of that concern is that I'm not convinced that
> the proposed patch fixes all variants of the issue.  Maybe we need
> to materialize even if newslot != epqslot_clean.  Maybe we need to
> do it even if there wasn't a concurrent update.  Maybe we need to
> do it in pre-v14 branches, too.  It's certainly not obvious that this
> function ought to assume anything about which slots are pointing at
> what.
>
> Also, if we do materialize here, does this code a little further down
> become redundant?
>
>             if (should_free_trig && newtuple == trigtuple)
>                 ExecMaterializeSlot(newslot);
>
> A completely different way of looking at it is that we should not
> treat this as ExecBRUpdateTriggers's bug at all.  The slot mechanisms
> are supposed to protect the data referenced by a slot, so why is that
> failing to happen in this example?  The correct fix might involve
> newslot acquiring a buffer pin, for example.

Yes, it was tough to decide how to fix it (and the correct buffer pinning
seemed more plausible), but 75e03eabe gave me hope that it could be fixed
by a simple one-line change.

> Bottom line is that I'm afraid the present patch is band-aiding a
> specific problem without solving the general case.

I agree with your analysis and conclusions, so I'm going to turn the alarm
mode off and research the issue more deeply.

Thank you for the detailed answer!

Best regards,
Alexander



On Sun, Apr 30, 2023 at 5:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> Yeah, also the environment changed since 86dc90056, so I couldn't just revert
> that commit to test the ExecFilterJunk() behavior on master.
>
> All the questions you raised require a more thorough investigation.

Are you aware of the fact that Valgrind has custom instrumentation
that makes it directly capable of detecting access to no-longer-pinned
buffers? See commits 7b7ed046 and 1e0dfd16.

I think that that may be a factor here. If it is, then it's a little
surprising that you ever found the problem with ASAN, since of course
we don't have custom ASAN instrumentation that tells ASAN things like
"until I say otherwise, it is not okay to read from this range of
memory, which is this backend's memory mapping for an individual
shared buffer that was just unpinned".
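
(For anyone unfamiliar with that mechanism: those commits are built on the client requests from <valgrind/memcheck.h>, which let the program tell Memcheck that a range of still-mapped memory must not be touched. A minimal standalone illustration, not PostgreSQL code; the 8 kB malloc just stands in for a shared buffer:)

#include <stdio.h>
#include <stdlib.h>
#include <valgrind/memcheck.h>

int
main(void)
{
    char   *buf = malloc(8192);     /* stand-in for one shared buffer */

    buf[0] = 'x';

    /* like dropping the last local pin: the memory stays mapped, but any
     * access is now reported by Memcheck as invalid */
    VALGRIND_MAKE_MEM_NOACCESS(buf, 8192);

    printf("%c\n", buf[0]);         /* "Invalid read of size 1" under Valgrind */

    /* like re-pinning the buffer */
    VALGRIND_MAKE_MEM_DEFINED(buf, 8192);

    free(buf);
    return 0;
}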

--
Peter Geoghegan



Hi,

On 2023-04-29 18:24:47 -0400, Tom Lane wrote:
> A completely different way of looking at it is that we should not
> treat this as ExecBRUpdateTriggers's bug at all.  The slot mechanisms
> are supposed to protect the data referenced by a slot, so why is that
> failing to happen in this example?  The correct fix might involve
> newslot acquiring a buffer pin, for example.

I think the slot mechanism today (and historically) doesn't protect against
quite a few such scenarios - and it's not clear how to change that. Most of
the scenarios where we need to materialize are because the underlying memory
is going to be released / reset, i.e. we don't hold a buffer pin in the first
place. I guess we could try to protect against that by registering a memory
context hook, but that'd be a very heavyweight mechanism.
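
(By "memory context hook" I mean something along the lines of MemoryContextRegisterResetCallback(); a purely illustrative sketch of that mechanism, not a proposal, with function names of my own invention:)

#include "postgres.h"
#include "executor/tuptable.h"
#include "utils/memutils.h"

/* Called just before the registered context's memory is reset or freed. */
static void
slot_storage_reset_cb(void *arg)
{
    /* The slot's underlying storage is about to go away; a slot
     * implementation could react here (doing that cheaply enough is the
     * hard part). */
    elog(DEBUG1, "storage of slot %p is going away", arg);
}

static void
watch_slot_storage(TupleTableSlot *slot, MemoryContext cxt)
{
    MemoryContextCallback *cb;

    cb = MemoryContextAlloc(cxt, sizeof(MemoryContextCallback));
    cb->func = slot_storage_reset_cb;
    cb->arg = slot;
    MemoryContextRegisterResetCallback(cxt, cb);
}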

Greetings,

Andres Freund



Hello Peter,

30.04.2023 23:39, Peter Geoghegan wrote:
> On Sun, Apr 30, 2023 at 5:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>> Yeah, also the environment changed since 86dc90056, so I couldn't just revert
>> that commit to test the ExecFilterJunk() behavior on master.
>>
>> All the questions you raised require a more thorough investigation.
> Are you aware of the fact that Valgrind has custom instrumentation
> that makes it directly capable of detecting access to no-longer-pinned
> buffers? See commits 7b7ed046 and 1e0dfd16.

Yes, I am. That's why I can't demonstrate real risks now — my repro
triggers a Valgrind warning only, and no other consequences are visible.

> I think that that may be a factor here. If it is, then it's a little
> surprising that you ever found the problem with ASAN, since of course
> we don't have custom ASAN instrumentation that tells ASAN things like
> "until I say otherwise, it is not okay to read from this range of
> memory, which is this backend's memory mapping for an individual
> shared buffer that was just unpinned".

ASan detected a consequence of the invalid memory read:
==1912964==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7f223dd52c78 at pc 0x55c9e846146a bp 0x7fff0c63e560 sp 0x7fff0c63dd30
WRITE of size 42205122 at 0x7f223dd52c78 thread T0
...
#10 0x000055c9e84d09c5 in fill_val (...) at heaptuple.c:270
...
Note the abnormal size 42205122 (the target table contained much shorter
values at the moment). When exploring the core dump and the source code,
I found that the invalid size was obtained as follows:
ExecBRUpdateTriggers(): newtuple = ExecFetchSlotHeapTuple(newslot, true, &should_free_new);
ExecFetchSlotHeapTuple(): slot->tts_ops->materialize(slot)
tts_buffer_heap_materialize(): bslot->base.tuple = heap_form_tuple(...)
heap_form_tuple(): data_len = heap_compute_data_size(tupleDescriptor, values, isnull);
heap_compute_data_size(): here garbage data were read.

So I'm going to start it over and find a repro that will show something
interesting without Valgrind (and let us estimate the probability of a
problem in the field).

Best regards,
Alexander



Alexander Lakhin <exclusion@gmail.com> writes:
> So I'm going to start it over and find a repro that will show something
> interesting without Valgrind (and let us estimate the probability of a
> problem in the field).

I think a probability estimate is not all that exciting: "it's low
but not zero" is sufficient information to act on.  What I am curious
about is how come 86dc90056's changes triggered a visible problem.
Understanding that might give us a better feel for what is the real
structural issue here --- which I'm worried about because I have
little confidence that there aren't similar bugs lurking in other
code paths.

            regards, tom lane



On Sun, Apr 30, 2023 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> Yes, I am. That's why I can't demonstrate real risks now — my repro
> triggers a Valgrind warning only, and no other consequences are visible.

It would be nice to be able to say something about the severity with
high confidence, but this seems like the kind of thing that rarely
goes that way. Does it really matter? A bug is a bug.

--
Peter Geoghegan



01.05.2023 07:04, Tom Lane wrote:
> Alexander Lakhin <exclusion@gmail.com> writes:
>> So I'm going to start it over and find a repro that will show something
>> interesting without Valgrind (and let us estimate the probability of a
>> problem in the field).
> I think a probability estimate is not all that exciting: "it's low
> but not zero" is sufficient information to act on.  What I am curious
> about is how come 86dc90056's changes triggered a visible problem.
> Understanding that might give us a better feel for what is the real
> structural issue here --- which I'm worried about because I have
> little confidence that there aren't similar bugs lurking in other
> code paths.

I've found one significant difference between old and new code paths.
On 86dc90056 I see the following call chain:
ExecGetUpdateNewTuple(planSlot = epqslot_candidate, oldSlot = oldslot):
econtext->ecxt_scantuple = oldSlot;
-> internalGetUpdateNewTuple() -> ExecProject(projInfo=relinfo->ri_projectNew)
-> ExecEvalExprSwitchContext(state=projInfo->pi_state, econtext=projInfo->pi_exprContext)
-> state->evalfunc(state, econtext, isNull);
-> ExecInterpExpr():
scanslot = econtext->ecxt_scantuple;
...
EEOP_SCAN_FETCHSOME: slot_getsomeattrs(scanslot, op->d.fetch.last_var);

So some attrs are loaded into oldslot, whose buffer is released after
         trigtuple = ExecFetchSlotHeapTuple(oldslot, true, &should_free_trig)

But before 86dc90056 we had:
ExecFilterJunk(junkfilter=relinfo->ri_junkFilter, slot=epqslot_candidate):
-> slot_getallattrs(slot);
-> slot_getsomeattrs(slot, slot->tts_tupleDescriptor->natts);

I.e., the attrs were loaded into the epqslot_candidate slot,
which isn't touched by ExecFetchSlotHeapTuple(oldslot, ...), so
the newslot's contents survived the call.

This is not a final conclusion, but maybe it will be helpful to someone else
investigating this too.

Best regards,
Alexander



Hello Tom,

I've found enough time this week to investigate the issue deeper and now
I can answer your questions, hopefully.

30.04.2023 01:24, Tom Lane wrote:
> I am not entirely comfortable with blaming this on 86dc90056 though,
> even though "git bisect" seems to finger that.  The previous coding
> there with ExecFilterJunk() also produced a virtual tuple, as you
> can see by examining ExecFilterJunk, so why didn't the problem
> manifest before?  I think the answer may be that before 86dc90056,
> we were effectively relying on the underlying SeqScan node to keep
> the buffer pinned, but now we're re-fetching the tuple and no pin
> is held by lower plan nodes.  I'm not entirely sure about that though;
> ISTM a SeqScan ought to hold pin on its current tuple in any case.
> However, if the UPDATE plan were more complicated (involving a join)
> then it's quite believable that we'd get here with no relevant pin
> being held.

I couldn't confirm that the issue is specific to a SeqScan plan; it is
reproducible (and fixed by the proposed patch) with a more complex query:
cat << "EOF" | psql
CREATE TABLE bruttest(id int, cnt int DEFAULT 0, t text);
CREATE FUNCTION noop_tfunc() RETURNS trigger LANGUAGE plpgsql AS $$ BEGIN
RETURN NEW; END$$;
CREATE TRIGGER brupdate_trigger BEFORE UPDATE ON bruttest FOR EACH ROW
EXECUTE FUNCTION noop_tfunc();
INSERT INTO bruttest(id, t) VALUES (1, repeat('x', 1000));

CREATE TABLE t2(id int, delta int);
INSERT INTO t2 VALUES(1, 1);
EOF

for ((i=1;i<=4;i++)); do
   echo "iteration $i"
   psql -c "UPDATE bruttest SET cnt = cnt + delta FROM t2 WHERE bruttest.id = t2.id" &
   psql -c "UPDATE bruttest SET cnt = cnt + delta FROM t2 WHERE bruttest.id = t2.id" &
   wait
done

---
psql -c "EXPLAIN UPDATE bruttest SET cnt = cnt + delta FROM t2 WHERE bruttest.id = t2.id"
                                   QUERY PLAN
-------------------------------------------------------------------------------
  Update on bruttest  (cost=241.88..485.18 rows=0 width=0)
    ->  Merge Join  (cost=241.88..485.18 rows=13560 width=16)
          Merge Cond: (bruttest.id = t2.id)
          ->  Sort  (cost=83.37..86.37 rows=1200 width=14)
                Sort Key: bruttest.id
                ->  Seq Scan on bruttest  (cost=0.00..22.00 rows=1200 width=14)
          ->  Sort  (cost=158.51..164.16 rows=2260 width=14)
                Sort Key: t2.id
                ->  Seq Scan on t2  (cost=0.00..32.60 rows=2260 width=14)
(9 rows)

IIUC, the buffer is pinned by oldslot only, and newslot uses it without
explicit pinning because of the following call chain:
epqslot_clean = ExecGetUpdateNewTuple(relinfo=relinfo, planSlot=epqslot_candidate, oldSlot=oldslot):
     ProjectionInfo *newProj = relinfo->ri_projectNew;
     ...
     econtext->ecxt_scantuple = oldSlot;
     ...
     ExecProject(projInfo=newProj):
         ExprState  *state = &projInfo->pi_state;
         ...
         ExecEvalExprSwitchContext(state=state, econtext=projInfo->pi_exprContext):
             state->evalfunc(state, econtext, isNull) -> ExecInterpExpr(state, econtext, isNull):
                 resultslot = state->resultslot;
                 ...
                 scanslot = econtext->ecxt_scantuple;
                 ...
                 EEOP_SCAN_FETCHSOME:
                     slot_getsomeattrs(scanslot, op->d.fetch.last_var);
                 ...
                 EEOP_ASSIGN_SCAN_VAR:
                     resultslot->tts_values[resultnum] = scanslot->tts_values[attnum];

Note that ExecGetUpdateNewTuple() returns relinfo->ri_projectNew->pi_state.resultslot,
so the only place where the buffer could be correctly pinned for newslot is
EEOP_ASSIGN_SCAN_VAR, but I think that changing something there is not an option.

> The practical impact of that concern is that I'm not convinced that
> the proposed patch fixes all variants of the issue.  Maybe we need
> to materialize even if newslot != epqslot_clean.

I investigated all the code paths and came to the conclusion that the case
newslot != epqslot_clean is impossible.

1) When ExecBRUpdateTriggers() is called via ExecUpdatePrologue(), ...,
ExecModifyTable(), we have:
ExecModifyTable():
     /* Initialize projection info if first time for this table */
     if (unlikely(!resultRelInfo->ri_projectNewInfoValid))
         ExecInitUpdateProjection(node, resultRelInfo);

     slot = ExecGetUpdateNewTuple(resultRelInfo, context.planSlot, oldSlot);
...
     ExecUpdate(..., resultRelInfo=resultRelInfo, ..., slot=slot, ...):
         ExecUpdatePrologue(..., resultRelInfo=resultRelInfo, ... , ..., slot=slot, ...):
             ExecMaterializeSlot(slot);
             ...
             ExecBRUpdateTriggers(..., relinfo=resultRelInfo, ..., newslot=slot, ...)
                 epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate, oldslot);
(so, epqslot_clean and newslot both are equal to
relinfo->ri_projectNew->pi_state.resultslot)

2) When ExecBRUpdateTriggers() is called via ExecMergeMatched(), ...,
ExecMerge(), epqslot_candidate is always NULL, because of
                     /*
                      * Recheck the tuple using EPQ. For MERGE, we leave this
                      * to the caller (it must do additional rechecking, and
                      * might end up executing a different action entirely).
                      */
inside GetTupleForTrigger().

3) When ExecBRUpdateTriggers() is called via ExecSimpleRelationUpdate(), ...,
apply_handle_tuple_routing():
FindReplTupleInLocalRel()
...
ExecSimpleRelationUpdate()
a concurrent update of the target tuple is prevented by table_tuple_lock() in
RelationFindReplTupleSeq()/RelationFindReplTupleByIndex(), called from
FindReplTupleInLocalRel().

Moreover, if epqslot_candidate != NULL were true when ExecBRUpdateTriggers()
is called from ExecSimpleRelationUpdate()/ExecMergeMatched(), then we would
get a crash inside
epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate, oldslot);
because of
ExecGetUpdateNewTuple(relinfo, ...)
         ProjectionInfo *newProj = relinfo->ri_projectNew;
...
         Assert(relinfo->ri_projectNewInfoValid);
...
         econtext = newProj->pi_exprContext;

as ExecInitUpdateProjection(), which initializes ri_projectNew, is called only
from ExecModifyTable().

So I think it's safe to assume that newslot == epqslot_clean when
epqslot_candidate != NULL, and thus the ExecCopySlot() call may be removed.

> Maybe we need to
> do it even if there wasn't a concurrent update.

Without a concurrent update, i. e. when epqslot_candidate == NULL,
newslot is filled earlier, in ExecModifyTable():
     oldSlot = resultRelInfo->ri_oldTupleSlot;
...
     slot = ExecGetUpdateNewTuple(resultRelInfo, context.planSlot, oldSlot);
so newslot uses oldSlot existing in ExecModifyTable(), not oldslot created
in ExecBRUpdateTriggers().

> Maybe we need to
> do it in pre-v14 branches, too.

As I noted in my previous message, before 86dc90056 a different slot was
used when getting "some attrs", so I see no need to fix the pre-v14 branches.

> It's certainly not obvious that this
> function ought to assume anything about which slots are pointing at
> what.

I think the problem is in the sequence of calls:

         epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate, oldslot);
         trigtuple = ExecFetchSlotHeapTuple(oldslot, true, &should_free_trig);

As ExecGetUpdateNewTuple() can fill attributes in oldslot and
ExecFetchSlotHeapTuple() can release the oldslot's underlying storage (as
the comment for that function says), ExecMaterializeSlot() should be called
in between to save the new tuple contents.

> Also, if we do materialize here, does this code a little further down
> become redundant?
>
>             if (should_free_trig && newtuple == trigtuple)
>                 ExecMaterializeSlot(newslot);

That is another interesting case, and I think it's a separate one. I
could not reproduce that issue (commit 75e03eabea from 2019 mentions
zheap), but I suppose it was real at the time of that commit, which was
two years before 86dc90056. So I think that an ExecMaterializeSlot() inside
the condition "if (epqslot_candidate != NULL)" would not cover that case.

As an outcome, I propose an improved patch (roughly sketched below) that:
1) removes the unreachable ExecCopySlot() call;
2) adds one ExecMaterializeSlot() call that covers both cases;
3) removes the separate ExecMaterializeSlot() call, which becomes redundant.

It simplifies ExecBRUpdateTriggers() a little, but if that could affect
performance significantly (I don't see how), maybe leave the previous
ExecMaterializeSlot() call intact and add another one just after
epqslot_clean = ExecGetUpdateNewTuple(...);
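
To make that concrete, here is a rough sketch of what those changes amount to in ExecBRUpdateTriggers() (based on the code fragments quoted in this thread; the exact placement and comments in the attached patch may differ):

    if (epqslot_candidate != NULL)
    {
        /* epqslot_clean is always newslot here (see the analysis above),
         * so the "if (newslot != epqslot_clean) ExecCopySlot(...)" step
         * is removed */
        ExecGetUpdateNewTuple(relinfo, epqslot_candidate, oldslot);
    }

    /*
     * Materialize newslot before fetching trigtuple: newslot's values may
     * point into oldslot's buffer (or into local tuple storage that is
     * freed later), and fetching the tuple below may materialize oldslot
     * and release that storage.
     */
    ExecMaterializeSlot(newslot);

    trigtuple = ExecFetchSlotHeapTuple(oldslot, true, &should_free_trig);

    ...

    /* and the later
     *     if (should_free_trig && newtuple == trigtuple)
     *         ExecMaterializeSlot(newslot);
     * becomes redundant and is removed */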

> A completely different way of looking at it is that we should not
> treat this as ExecBRUpdateTriggers's bug at all.  The slot mechanisms
> are supposed to protect the data referenced by a slot, so why is that
> failing to happen in this example?  The correct fix might involve
> newslot acquiring a buffer pin, for example.

Unfortunately, I could not find a place where such a pin could be acquired
(except for EEOP_ASSIGN_SCAN_VAR at the lower level).

> Bottom line is that I'm afraid the present patch is band-aiding a
> specific problem without solving the general case.

The next version of the patch is more generic, but still limited to
ExecBRUpdateTriggers(), as I still see no problem outside it.

Best regards,
Alexander

Hello Tom,

09.07.2023 18:00, Alexander Lakhin wrote:
>
> 30.04.2023 01:24, Tom Lane wrote:
>
>> Bottom line is that I'm afraid the present patch is band-aiding a
>> specific problem without solving the general case.
>
> The next version of the patch is more generic, but still limited to
> ExecBRUpdateTriggers(), as I still see no problem outside it.
>

If this issue is not approaching the top of your todo list yet and you don't
mind, I'll queue this patch on the commitfest.

Best regards,
Alexander



Alexander Lakhin <exclusion@gmail.com> writes:
> If this issue is not approaching the top of your todo list yet and you don't
> mind, I'll queue this patch on the commitfest.

Please do, along with anything else that you're worried will get lost.

            regards, tom lane



On Sun, Jul 9, 2023 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> I've found enough time this week to investigate the issue deeper and now
> I can answer your questions, hopefully.

This patch lacks a commit message, which I think would be a good thing
to add. Even if somebody doesn't use the message you write, it would
help with understanding the patch itself. Deleting the call to
ExecMaterializeSlot with no explanation in the patch file of why
that's OK to do is not ideal. But the real heart of this small patch
seems to be this:

+               /*
+                * We need to materialize newslot because 1) it might share
+                * oldslot's buffer, which might be released on fetching
+                * trigtuple from oldslot if oldslot is the only owner of buffer,
+                * and 2) if some trigger returns/stores the old trigtuple,
+                * and the heap tuple passed to the trigger was located locally,
+                * newslot might reference that tuple storage, which is freed
+                * at the end of this function.
+                */

Reading this, I found (1) to be very surprising. I would expect that
advancing the scan would potentially release the buffer pin, but
fetching the tuple shouldn't advance the scan. I guess this is
referring to the call, just below, to ExecFetchSlotHeapTuple. I'm
guessing that can end up calling tts_buffer_heap_materialize which,
after materializing copying the tuple, thinks that it's OK to release
the buffer pin. But that makes me wonder ... how exactly do oldslot
and newslot end up sharing the same buffer without holding two pins on
it? tts_heap_copyslot() looks like it basically forces the tuple to be
materialized first and then copies that, so you'd end up with, I
guess, no buffer pins at all, rather than 1 or 2.

I'm obviously missing something important here...

--
Robert Haas
EDB: http://www.enterprisedb.com



Hello Robert,

06.01.2024 00:18, Robert Haas wrote:
> On Sun, Jul 9, 2023 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>> I've found enough time this week to investigate the issue deeper and now
>> I can answer your questions, hopefully.
> This patch lacks a commit message, which I think would be a good thing
> to add. Even if somebody doesn't use the message you write, it would
> help with understanding the patch itself. Deleting the call to
> ExecMaterializeSlot with no explanation in the patch file of why
> that's OK to do is not ideal.

Thanks for looking into this!

I've added a commit message, please look at the new patch version.

> But the real heart of this small patch
> seems to be this:
>
> +               /*
> +                * We need to materialize newslot because 1) it might
> share oldslot's,
> +                * buffer, which might be released on fetching
> trigtuple from oldslot
> +                * if oldslot is the only owner of buffer,
> +                * and 2) if some trigger returns/stores the old trigtuple,
> +                * and the heap tuple passed to the trigger was located locally,
> +                * newslot might reference that tuple storage, which
> is freed at the
> +                * end of this function.
> +                */
>
> Reading this, I found (1) to be very surprising. I would expect that
> advancing the scan would potentially release the buffer pin, but
> fetching the tuple shouldn't advance the scan. I guess this is
> referring to the call, just below, to ExecFetchSlotHeapTuple. I'm
> guessing that can end up calling tts_buffer_heap_materialize which,
> after materializing copying the tuple, thinks that it's OK to release
> the buffer pin.

Yes, this is what's happening there, as I diagnosed in [1].

>   But that makes me wonder ... how exactly do oldslot
> and newslot end up sharing the same buffer without holding two pins on
> it? tts_heap_copyslot() looks like it basically forces the tuple to be
> materialized first and then copies that, so you'd end up with, I
> guess, no buffer pins at all, rather than 1 or 2.
>
> I'm obviously missing something important here...

As my analysis [2] showed, epqslot_clean is always equal to newslot, so
ExecCopySlot() isn't called at all.

[1] https://www.postgresql.org/message-id/17798-0907404928dcf0dd%40postgresql.org
[2] https://www.postgresql.org/message-id/e989408c-1838-61cd-c968-1fcf47c6fbba%40gmail.com

Best regards,
Alexander
On Mon, Jan 8, 2024 at 3:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> >   But that makes me wonder ... how exactly do oldslot
> > and newslot end up sharing the same buffer without holding two pins on
> > it? tts_heap_copyslot() looks like it basically forces the tuple to be
> > materialized first and then copies that, so you'd end up with, I
> > guess, no buffer pins at all, rather than 1 or 2.
> >
> > I'm obviously missing something important here...
>
> As my analysis [2] showed, epqslot_clean is always equal to newslot, so
> ExecCopySlot() isn't called at all.
>
> [1] https://www.postgresql.org/message-id/17798-0907404928dcf0dd%40postgresql.org
> [2] https://www.postgresql.org/message-id/e989408c-1838-61cd-c968-1fcf47c6fbba%40gmail.com

Sorry, I still don't get it. I can't really follow the analysis in
[2]. Your analysis there relies on analyzing how
ExecBRUpdateTriggers() gets called, but it seems to me that what
matters is what happens inside that function, specifically what
GetTupleForTrigger does. And I think that function is going to either
produce an EPQ tuple or not depending on whether table_tuple_lock
returns TM_Ok and whether it sets tmfd.traversed, independently of how
ExecBRUpdateTriggers() is called.

Also, even if you're right that epqslot_clean always ends up equal to
newslot, my question was about how oldslot and newslot end up sharing
the same buffer without holding two pins on it, and I don't quite see
what epqslot_clean has to do with that. Apologies if I'm being dense
here; this code seems dramatically under-commented to me. But FWICT
the design here is that a slot only holds a pin until somebody copies
it or materializes it. So I don't understand what conceptually could
happen that would end up with two slots holding the same pin, unless
it was just that you had two variables holding the same pointer value.
But if that's what is happening, then materializing one slot would
also materialize the other.

--
Robert Haas
EDB: http://www.enterprisedb.com



08.01.2024 22:46, Robert Haas wrote:
> On Mon, Jan 8, 2024 at 3:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>>>    But that makes me wonder ... how exactly do oldslot
>>> and newslot end up sharing the same buffer without holding two pins on
>>> it? tts_heap_copyslot() looks like it basically forces the tuple to be
>>> materialized first and then copies that, so you'd end up with, I
>>> guess, no buffer pins at all, rather than 1 or 2.
>>>
>>> I'm obviously missing something important here...
>> As my analysis [2] showed, epqslot_clean is always equal to newslot, so
>> ExecCopySlot() isn't called at all.
>>
>> [1] https://www.postgresql.org/message-id/17798-0907404928dcf0dd%40postgresql.org
>> [2] https://www.postgresql.org/message-id/e989408c-1838-61cd-c968-1fcf47c6fbba%40gmail.com
> Sorry, I still don't get it. I can't really follow the analysis in
> [2]. Your analysis there relies on analyzing how
> ExecBRUpdateTriggers() gets called, but it seems to me that what
> matters is what happens inside that function, specifically what
> GetTupleForTrigger does. And I think that function is going to either
> produce an EPQ tuple or not depending on whether table_tuple_lock
> returns TM_Ok and whether it sets tmfd.traversed, independently of how
> ExecBRUpdateTriggers() is called.

Yes, GetTupleForTrigger() returns an EPQ tuple in our case (because of
concurrent update).

> Also, even if you're right that epqslot_clean always ends up equal to
> newslot, my question was about how oldslot and newslot end up sharing
> the same buffer without holding two pins on it, and I don't quite see
> what epqslot_clean has to do with that. Apologies if I'm being dense
> here; this code seems dramatically under-commented to me. But FWICT
> the design here is that a slot only holds a pin until somebody copies
> it or materializes it. So I don't understand what conceptually could
> happen that would end up with two slots holding the same pin, unless
> it was just that you had two variables holding the same pointer value.
> But if that's what is happening, then materializing one slot would
> also materialize the other.

As far as I can see, oldslot and newslot hold different pointer values;
e.g., with debug logging added (see the attachment; please forgive the
dirty hacks inside), I get:
LOG:  ExecBRUpdateTriggers after ExecGetUpdateNewTuple()  1; oldslot: 0x758e3c8, slot->tts_tupleDescriptor: 0x7362fc8, slot->buffer: 1810, slot->buffer->refcount: 1, slot->tts_values: 0x758e438, i: 1, slot->tts_values[ii]: 0x9982c74
STATEMENT:  UPDATE bruttest SET cnt = cnt + 1;
LOG:  ExecBRUpdateTriggers after ExecGetUpdateNewTuple()  1; newslot: 0x747fb48, slot->tts_tupleDescriptor: 0x7362fc8, slot->buffer: 0, slot->buffer->refcount: 0, slot->tts_values: 0x747fbb8, i: 1, slot->tts_values[ii]: 0x9982c74
STATEMENT:  UPDATE bruttest SET cnt = cnt + 1;
==00:00:00:05.052 3934530== Invalid read of size 1
==00:00:00:05.052 3934530==    at 0x1EECAC: heap_compute_data_size (heaptuple.c:244)
...
==00:00:00:05.052 3934530==  Address 0x9982c74 is in a rw- anonymous segment
==00:00:00:05.052 3934530==
(when executing `make check` for the isolation test [1] under Valgrind)

But as we can see, both slots use the same buffer (have pointers to the
same attribute values) as a result of the following operations:
                  EEOP_SCAN_FETCHSOME:
                      slot_getsomeattrs(scanslot, op->d.fetch.last_var);
                  ...
                  EEOP_ASSIGN_SCAN_VAR:
                      resultslot->tts_values[resultnum] = scanslot->tts_values[attnum]
(More details upthread in [2])

In the meantime, newslot doesn't hold a pin at all.

[1] https://www.postgresql.org/message-id/950f4f1a-fb71-3e45-bb65-6a57da9f9b9e%40gmail.com
[2] https://www.postgresql.org/message-id/9f23591c-610a-60d4-2b29-26a860f7c3a8%40gmail.com

Best regards,
Alexander
On Wed, Jan 10, 2024 at 1:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> As far as I can see, oldslot and newslot hold different pointer values;
> e. g., with debug logging added (see attached (please forgive me dirty
> hacks inside)), I get:
> LOG:  ExecBRUpdateTriggers after ExecGetUpdateNewTuple()  1; oldslot: 0x758e3c8, slot->tts_tupleDescriptor: 0x7362fc8, slot->buffer: 1810, slot->buffer->refcount: 1, slot->tts_values: 0x758e438, i: 1, slot->tts_values[ii]: 0x9982c74
> STATEMENT:  UPDATE bruttest SET cnt = cnt + 1;
> LOG:  ExecBRUpdateTriggers after ExecGetUpdateNewTuple()  1; newslot: 0x747fb48, slot->tts_tupleDescriptor: 0x7362fc8, slot->buffer: 0, slot->buffer->refcount: 0, slot->tts_values: 0x747fbb8, i: 1, slot->tts_values[ii]: 0x9982c74
> STATEMENT:  UPDATE bruttest SET cnt = cnt + 1;
> ==00:00:00:05.052 3934530== Invalid read of size 1
> ==00:00:00:05.052 3934530==    at 0x1EECAC: heap_compute_data_size (heaptuple.c:244)
> ...
> ==00:00:00:05.052 3934530==  Address 0x9982c74 is in a rw- anonymous segment
> ==00:00:00:05.052 3934530==
> (when executing `make check` for the isolation test [1] under Valgrind)
>
> But as we can see, both slots use the same buffer (have pointers to the
> same attribute values) as a result of the following operations:
>                   EEOP_SCAN_FETCHSOME:
>                       slot_getsomeattrs(scanslot, op->d.fetch.last_var);
>                   ...
>                   EEOP_ASSIGN_SCAN_VAR:
>                       resultslot->tts_values[resultnum] = scanslot->tts_values[attnum]
> (More details upthread in [2])
>
> In the meantime, newslot doesn't hold a pin at all.

OK, thanks for the further explanation. I think I'm starting to
understand what's going on here. Maybe you already understand all of
this, but in my own words:

ExecGetUpdateNewTuple() calls ExecProject() which documents that the
results are short-lived and you should materialize the slot if that's
a problem. ExecModifyTable() does slot = ExecGetUpdateNewTuple(...),
does not materialize it, and then passes that slot as the penultimate
argument to ExecUpdate(), which calls ExecUpdatePrologue(), which
calls ExecBRUpdateTriggers(), and the value returned by
ExecModifyTable()'s call to ExecGetUpdateNewTuple() is now stored in
newslot, but it still hasn't been materialized. So any operation
that's performed on oldslot at this point invalidates newslot.

But ExecBRUpdateTriggers() now calls GetTupleForTrigger() to clobber
oldslot, so now newslot is also junk. That's technically OK, because
we're not going to access the contents of newslot any more. We're
instead going to refresh them:

        if (epqslot_candidate != NULL)
        {
            TupleTableSlot *epqslot_clean;

            epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate,
                                                  oldslot);

            if (newslot != epqslot_clean)
                ExecCopySlot(newslot, epqslot_clean);
        }

Since the slot returned by this call to ExecGetUpdateNewTuple() is a
function of the same projection info as when it was called from
ExecModifyTable(), the slot returned here must be the same slot that
was returned there, namely newslot. So, as you say, newslot ==
epqslot_clean, and we've just clobbered it with updated values.
Therefore we're back in the situation where oldslot is fully valid and
newslot is only valid until somebody does something to oldslot. Thus,
the portion of your patch that simplifies this code doesn't fix any
bug, but rather just simplifies some unnecessarily complex logic. But
the next thing that happens in the existing code is:

        trigtuple = ExecFetchSlotHeapTuple(oldslot, true, &should_free_trig);

That is going to again invalidate newslot, because it can materialize
oldslot. That's a problem if we ever use newslot after this point, and
we do, unless we materialize newslot first, so your patch adds
ExecMaterializeSlot(newslot) right before this, fixing the problem.

What I don't really understand is why failure to do this causes an
invalid memory access in ExecFetchSlotHeapTuple(). According to the
above analysis, the invalid memory access ought to occur the first
time something touches newslot after this call to
ExecFetchSlotHeapTuple(), but if I understand your valgrind trace
correctly, it's showing that ExecFetchSlotHeapTuple() is touching
something it shouldn't. Do you understand what's happening here?

I think the question we should be asking ourselves here is whether
we're really OK with having such a vast distance between the point
where newslot is first populated and where materialization happens. It
seems like playing with fire to populate a slot with ephemeral data in
ExecModifyTable() and then press the materialize button four stack
frames down. There's a lot of opportunity for things to go wrong
there. On the other hand, this is very performance-sensitive code, so
maybe materializing here is the only workable solution. If that's the
case, though, we should probably add comments making the lifetime of
various slots clearer, because neither Tom nor I was able to
understand what the heck was happening here on first reading, which is
probably not a good sign.

--
Robert Haas
EDB: http://www.enterprisedb.com



10.01.2024 23:04, Robert Haas wrote:
> OK, thanks for the further explanation. I think I'm starting to
> understand what's going on here. Maybe you already understand all of
> this, but in my own words:
>
> ExecGetUpdateNewTuple() calls ExecProject() which documents that the
> results are short-lived and you should materialize the slot if that's
> a problem. ExecModifyTable() does slot = ExecGetUpdateNewTuple(...),
> does not materialize it, and then passes that slot as the penultimate
> argument to ExecUpdate(), which calls ExecUpdatePrologue(), which
> calls ExecBRUpdateTriggers(), and the value returned by
> ExecModifyTable()'s call to ExecGetUpdateNewTuple() is now stored in
> newslot, but it still hasn't been materialized. So any operation
> that's performed on oldslot at this point invalidates newslot.

Yes, I see this the same way.

> But ExecBRUpdateTriggers() now calls GetTupleForTrigger() to clobber
> oldslot, so now newslot is also junk. That's technically OK, because
> we're not going to access the contents of newslot any more. We're
> instead going to refresh them:
>
>          if (epqslot_candidate != NULL)
>          {
>              TupleTableSlot *epqslot_clean;
>
>              epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate,
>                                                    oldslot);
>
>              if (newslot != epqslot_clean)
>                  ExecCopySlot(newslot, epqslot_clean);
>          }
>
> Since the slot returned by this call to ExecGetUpdateNewTuple() is a
> function of the same projection info as when it was called from
> ExecModifyTable(), the slot returned here must be the same slot that
> was returned there, namely newslot. So, as you say, newslot ==
> epqslot_clean, and we've just clobbered it with updated values.
> Therefore we're back in the situation where oldslot is fully valid and
> newslot is only valid until somebody does something to oldslot. Thus,
> the portion of your patch that simplifies this code doesn't fix any
> bug, but rather just simplifies some unnecessarily complex logic.

Yes, in that portion I was just trying to address Tom's suspicion: "Maybe
we need to materialize even if newslot != epqslot_clean"...

>   But
> the next thing that happens in the existing code is:
>
>          trigtuple = ExecFetchSlotHeapTuple(oldslot, true, &should_free_trig);
>
> That is going to again invalidate newslot, because it can materialize
> oldslot. That's a problem if we ever use newslot after this point, and
> we do, unless we materialize newslot first, so your patch adds
> ExecMaterializeSlot(newslot) right before this, fixing the problem.

Yes, that's the way I found to fix it.

> What I don't really understand is why failure to do this causes an
> invalid memory access in ExecFetchSlotHeapTuple(). According to the
> above analysis, the invalid memory access ought to occur the first
> time something touches newslot after this call to
> ExecFetchSlotHeapTuple(), but if I understand your valgrind trace
> correctly, it's showing that ExecFetchSlotHeapTuple() is touching
> something it shouldn't. Do you understand what's happening here?

Maybe you are confused by the first call to ExecFetchSlotHeapTuple()?
The full Valgrind stack from yesterday's test run is:
==00:00:00:05.052 3934530== Invalid read of size 1
==00:00:00:05.052 3934530==    at 0x1EECAC: heap_compute_data_size (heaptuple.c:244)
==00:00:00:05.052 3934530==    by 0x1EFA84: heap_form_tuple (heaptuple.c:1158)
==00:00:00:05.052 3934530==    by 0x423B5E: tts_buffer_heap_materialize (execTuples.c:747)
==00:00:00:05.052 3934530==    by 0x424EB1: ExecFetchSlotHeapTuple (execTuples.c:1653)
==00:00:00:05.052 3934530==    by 0x3ECD45: ExecBRUpdateTriggers (trigger.c:3055)
==00:00:00:05.052 3934530==    by 0x4483F0: ExecUpdatePrologue (nodeModifyTable.c:1923)
==00:00:00:05.052 3934530==    by 0x44A146: ExecUpdate (nodeModifyTable.c:2284)
==00:00:00:05.052 3934530==    by 0x44CE4E: ExecModifyTable (nodeModifyTable.c:3844)
==00:00:00:05.052 3934530==    by 0x41E9A1: ExecProcNodeFirst (execProcnode.c:464)
==00:00:00:05.052 3934530==    by 0x416F5D: ExecProcNode (executor.h:273)
==00:00:00:05.052 3934530==    by 0x416F5D: ExecutePlan (execMain.c:1670)
==00:00:00:05.052 3934530==    by 0x417120: standard_ExecutorRun (execMain.c:365)
==00:00:00:05.052 3934530==    by 0x4171FA: ExecutorRun (execMain.c:309)
==00:00:00:05.052 3934530==  Address 0x9982c74 is in a rw- anonymous segment
==00:00:00:05.052 3934530==

here trigger.c:3055 points at the second call in ExecBRUpdateTriggers():
             newtuple = ExecFetchSlotHeapTuple(newslot, true, &should_free_new);

Does that make more sense?

> I think the question we should be asking ourselves here is whether
> we're really OK with having such a vast distance between the point
> where newslot is first populated and where materialization happens. It
> seems like playing with fire to populate a slot with ephemeral data in
> ExecModifyTable() and then press the materialize button four stack
> frames down. There's a lot of opportunity for things to go wrong
> there. On the other hand, this is very performance-sensitive code, so
> maybe materializing here is the only workable solution. If that's the
> case, though, we should probably add comments making the lifetime of
> various slots clearer, because neither Tom nor I was able to
> understand what the heck was happening here on first reading, which is
> probably not a good sign.

I was discouraged by that vast distance and implicit buffer usage too, but
I found no other feasible way to fix it. On the other hand, 75e03eabe and
Andres's words upthread made me believe that it's an acceptable solution.

Thank you very much for diving deep into this subject!

Best regards,
Alexander



On Wed, Jan 10, 2024 at 10:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
> here trigger.c:3055 points at the second call in ExecBRUpdateTriggers():
>              newtuple = ExecFetchSlotHeapTuple(newslot, true, &should_free_new);
>
> Does it make more sense?

Ah, OK, makes tons of sense.

> I was discouraged by that vast distance and implicit buffer usage too, but
> I found no other feasible way to fix it. On the other hand, 75e03eabe and
> Andres's words upthread made me believe that it's an acceptable solution.

I agree that it's potentially acceptable. I just wonder if Tom or
someone else is going to want to propose a bigger change to avoid some
of this messiness. I don't know what that would look like, though.

If not, then I think your patch should be committed and back-patched
pretty much as you have it, except with better comments.

> Thank you very much for diving deep into this subject!

yw.

--
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jan 10, 2024 at 10:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>> I was discouraged by that vast distance and implicit buffer usage too, but
>> I found no other feasible way to fix it. On the other hand, 75e03eabe and
>> Andres's words upthread made me believe that it's an acceptable solution.

> I agree that it's potentially acceptable. I just wonder if Tom or
> someone else is going to want to propose a bigger change to avoid some
> of this messiness. I don't know what that would look like, though.

I'm still feeling itchy about it.  Maybe the problem could be fixed
better if we didn't re-use slots with such abandon in this code,
but I don't have a concrete proposal.  Also, that'd likely be a
nontrivial rewrite bringing its own possibilities of adding bugs.

Some concrete thoughts:

* Maybe better to Assert(newslot == epqslot_clean) instead of
what you did here?  Not sure.

* Why is the ExecMaterializeSlot step needed if we don't have
an epqslot_candidate?

            regards, tom lane



Hello Tom,

12.01.2024 00:43, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Jan 10, 2024 at 10:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>>> I was discouraged by that vast distance and implicit buffer usage too, but
>>> I found no other feasible way to fix it. On the other hand, 75e03eabe and
>>> Andres's words upthread made me believe that it's an acceptable solution.
>> I agree that it's potentially acceptable. I just wonder if Tom or
>> someone else is going to want to propose a bigger change to avoid some
>> of this messiness. I don't know what that would look like, though.
> I'm still feeling itchy about it.  Maybe the problem could be fixed
> better if we didn't re-use slots with such abandon in this code,
> but I don't have a concrete proposal.  Also, that'd likely be a
> nontrivial rewrite bringing its own possibilities of adding bugs.

I'm not too excited about this approach either, but maybe it's the only
solution that can be implemented for now.
I have rechecked the other ExecGetUpdateNewTuple() calls and confirmed that
all of them materialize the returned newslot.
Namely,
1) ExecCrossPartitionUpdate() called by ExecUpdateAct(), where in case of a
  retry we have:
     /* ensure slot is independent, consider e.g. EPQ */
     ExecMaterializeSlot(slot);

2) ExecModifyTable() passes the returned slot to ExecUpdate(), which begins
  with a call to ExecUpdatePrologue(), which materializes the slot.

3) ExecUpdate() itself calls ExecGetUpdateNewTuple() in a loop (redo_act),
  where ExecUpdateAct() is called, which also materializes the slot.

So this place in ExecBRUpdateTriggers() is the only one where we have no
  materialization after ExecGetUpdateNewTuple().

Maybe another option is to always materialize the slot inside
  ExecGetUpdateNewTuple(), but I'm afraid such a change is not suitable for
  back-patching...
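Just to illustrate what that alternative would amount to (the wrapper below
is hypothetical and exists nowhere; it only shows the intended effect), it
would mean forcing materialization right after the projection:

    /*
     * Hypothetical illustration only -- no such helper exists in core.
     * It simply makes the projected tuple independent of oldSlot's
     * buffer pin before any caller gets a chance to drop that pin.
     */
    static TupleTableSlot *
    GetMaterializedUpdateNewTuple(ResultRelInfo *relinfo,
                                  TupleTableSlot *planSlot,
                                  TupleTableSlot *oldSlot)
    {
        TupleTableSlot *newslot;

        newslot = ExecGetUpdateNewTuple(relinfo, planSlot, oldSlot);

        /* Copy out any by-ref datums referencing oldSlot's buffer. */
        ExecMaterializeSlot(newslot);

        return newslot;
    }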

> Some concrete thoughts:
>
> * Maybe better to Assert(newslot == epqslot_clean) instead of
> what you did here?  Not sure.

I'm not sure either. The other ExecGetUpdateNewTuple() calls don't have
such Asserts nearby, but I haven't yet studied whether they could have
them...

> * Why is the ExecMaterializeSlot step needed if we don't have
> an epqslot_candidate?

I decided to move that step out of "if (epqslot_candidate != NULL)"
in v3 to play it safe when removing the ExecMaterializeSlot(newslot) call
brought in by 75e03eabe (more details upthread [1]). I thought that if that
commit established that the slot needs materialization (that was before
86dc90056, which introduced the current issue), then the materialization
had to be preserved. (The same commit can be found in the zheap repository
[2], but I found no other explanation of its purpose...)
Though, looking at the current core code, I'm inclined to perform
ExecMaterializeSlot() only when we have an epqslot_candidate, if we can
afford not to worry about that zheap-specific or similar scenario...

[1] https://www.postgresql.org/message-id/e989408c-1838-61cd-c968-1fcf47c6fbba%40gmail.com
[2] https://github.com/EnterpriseDB/zheap/commit/75e03eabe

Best regards,
Alexander



On Fri, Jan 12, 2024 at 8:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
> I decided to move that step out of "if (epqslot_candidate != NULL)"
> in v3 to play it safe when removing the ExecMaterializeSlot(newslot) call
> brought in by 75e03eabe (more details upthread [1]). I thought that if that
> commit established that the slot needs materialization (that was before
> 86dc90056, which introduced the current issue), then the materialization
> had to be preserved. (The same commit can be found in the zheap repository
> [2], but I found no other explanation of its purpose...)
> Though, looking at the current core code, I'm inclined to perform
> ExecMaterializeSlot() only when we have an epqslot_candidate, if we can
> afford not to worry about that zheap-specific or similar scenario...

I don't understand why it's ever safe to skip ExecMaterializeSlot
here. IIUC, if we have no EPQ slot, newslot is still not materialized
and thus dependent on oldslot ... and we're about to
ExecFetchSlotHeapTuple on oldslot.

--
Robert Haas
EDB: http://www.enterprisedb.com



12.01.2024 17:56, Robert Haas wrote:
> On Fri, Jan 12, 2024 at 8:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>> I decided to move that step out of "if (epqslot_candidate != NULL)"
>> in v3 to play it safe when removing the ExecMaterializeSlot(newslot) call
>> brought in by 75e03eabe (more details upthread [1]). I thought that if that
>> commit established that the slot needs materialization (that was before
>> 86dc90056, which introduced the current issue), then the materialization
>> had to be preserved. (The same commit can be found in the zheap repository
>> [2], but I found no other explanation of its purpose...)
>> Though, looking at the current core code, I'm inclined to perform
>> ExecMaterializeSlot() only when we have an epqslot_candidate, if we can
>> afford not to worry about that zheap-specific or similar scenario...
> I don't understand why it's ever safe to skip ExecMaterializeSlot
> here. IIUC, if we have no EPQ slot, newslot is still not materialized
> and thus dependent on oldslot ... and we're about to
> ExecFetchSlotHeapTuple on oldslot.

To my understanding, when we have no epqslot_candidate, it means that
newslot came from ExecBRUpdateTriggers' caller and can't be dependent
on a buffer pinned only by the internal oldslot.
For example, ExecModifyTable() does the following:
             slot = ExecGetUpdateNewTuple(resultRelInfo, context.planSlot,
                                             oldSlot);
             context.relaction = NULL;

             /* Now apply the update. */
             slot = ExecUpdate(&context, resultRelInfo, tupleid, oldtuple,
                                 slot, node->canSetTag);

Thus, newslot (aka slot here) depends on oldSlot inside ExecModifyTable(),
but not on oldslot inside ExecBRUpdateTriggers(), so
ExecFetchSlotHeapTuple(oldslot,...) can't affect it. Though that still
leaves the zheap question open -- can some external code introduce new
implicit dependencies?

Best regards,
Alexander



Alexander Lakhin <exclusion@gmail.com> writes:
> 12.01.2024 17:56, Robert Haas wrote:
>> I don't understand why it's ever safe to skip ExecMaterializeSlot
>> here. IIUC, if we have no EPQ slot, newslot is still not materialized
>> and thus dependent on oldslot ... and we're about to
>> ExecFetchSlotHeapTuple on oldslot.

> To my understanding, when we have no epqslot_candidate, it means that
> newslot came from ExecBRUpdateTriggers' caller and can't be dependent
> on a buffer pinned only by the internal oldslot.

After sleeping on it, I'm inclined to the opinion that unconditionally
materializing newslot here is a good idea.  Given that we're about to
call one or more trigger functions, the incremental cost ought to be
pretty negligible.  The reduction in the state space that we have to
reason about seems fairly significant, too.  Remember that a trigger
function could do almost anything.  So I like this not only for
providing a straightforward fix for the current problem, but as a
guard against unrecognized or future bugs in this area.  75e03eabe
should be sufficient evidence that there's a lot of hazard here
(and I especially like being able to revert that wart).

            regards, tom lane



On Fri, Jan 12, 2024 at 12:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> After sleeping on it, I'm inclined to the opinion that unconditionally
> materializing newslot here is a good idea.

I'm glad you came to that conclusion, and I agree. I think it's really
bad that we're passing around un-materialized slots across many layers
of the call stack; having the correctness of the inner code depend in
detail on which outer code paths materialize the slot and which don't
seems like a recipe for infinite future bugs. The only concern is
overhead, IMV.

> 75e03eabe
> should be sufficient evidence that there's a lot of hazard here
> (and I especially like being able to revert that wart).

I haven't analyzed whether this allows reverting that, but it would be
nice if it does.

Do you want to take this forward from here?

--
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jan 12, 2024 at 12:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> After sleeping on it, I'm inclined to the opinion that unconditionally
>> materializing newslot here is a good idea.

> I'm glad you came to that conclusion, and I agree.

Cool.

> Do you want to take this forward from here?

Happy to do that, if no one objects to this way forward.

            regards, tom lane



I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Do you want to take this forward from here?

> Happy to do that, if no one objects to this way forward.

OK, I think I've finally sussed the reason why 86dc90056 created
this problem.  Sure, we made a virtual tuple before that (in the
ExecFilterJunk call), but all its by-ref values were referencing the
epqslot_candidate tuple, that is the output of evaluating the EPQ
plan.  It was that plan tree's responsibility to produce a valid
tuple, and it did --- there might be pointers to disk buffers in that
virtual tuple, but the EPQ planstate tree would be holding the pins
required to make it legal.

But now, the UPDATE plan tree only delivers changed columns, and the
EPQ tree is the same.  So we build a virtual tuple whose not-changed
columns are referencing the "oldslot" on-disk tuple just fetched by
GetTupleForTrigger, and there isn't (or might not be) any pin
protecting those values except oldslot's.

There's one more angle that I'd not understood.  We *do* still hold
a pin on the original target tuple --- that's owned by the oldSlot
back in ExecModifyTable, that is resultRelInfo->ri_oldTupleSlot.
That's why there's no problem with the initial newslot, even if
it's virtual.  This bug can only manifest when GetTupleForTrigger
locates an updated tuple version that is in a different page.
Then, oldslot's pin is the only one protecting our access to that
tuple.

Also, I realized that my concerns about a vast state space for
execution of the triggers were unfounded.  That's because each
time we're about to call a trigger, we'll do

        if (!newtuple)
            newtuple = ExecFetchSlotHeapTuple(newslot, true, &should_free_new);

which causes newslot to become materialized if it wasn't already.
And of course we've forced oldslot into a materialized state.
So the triggers will never see anything but two materialized slots.

So I'm now convinced that all we need is an ExecMaterializeSlot
call in the EPQ path, much as in Alexander's first patch.

I don't think that the changes in the v4 patch are improvements.
I really do not like

+            newslot = ExecGetUpdateNewTuple(relinfo, epqslot_candidate,
+                                            oldslot);

and here's why: it is critical that we update the caller's slot,
because that's the only way trigger-driven changes will get back to
ExecModifyTable.  This coding obscures what's happening, and would
break badly if ExecGetUpdateNewTuple ever acted differently than it
does now or if a caller passed a slot that wasn't the same slot (which
I'm not convinced never happens; see ExecSimpleRelationUpdate for what
looks like a counterexample).  And it's not even saving anything
meaningful.  So I want to keep the ExecCopySlot call, although I think
it'd be sensible to do

-            if (newslot != epqslot_clean)
+            if (unlikely(newslot != epqslot_clean))

for whatever tiny benefit we can get there.

Also, on looking closer, this change doesn't entitle us to revert
75e03eabe.  That's guarding against a dangling-pointer situation
that could be created later, not the one we're fixing here.

So, after a bit more work on the comments, I end with the attached.

BTW, I noticed that ExecUpdatePrologue does

    ExecMaterializeSlot(slot);

without any comment as to why it's necessary ... and I rather bet
it isn't.  But I'm not going to experiment with changing that in
a bug-fix patch.

            regards, tom lane

diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 6873c3b4d9..0880ca51fb 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2978,10 +2978,6 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
          * received in newslot.  Neither we nor our callers have any further
          * interest in the passed-in tuple, so it's okay to overwrite newslot
          * with the newer data.
-         *
-         * (Typically, newslot was also generated by ExecGetUpdateNewTuple, so
-         * that epqslot_clean will be that same slot and the copy step below
-         * is not needed.)
          */
         if (epqslot_candidate != NULL)
         {
@@ -2990,14 +2986,36 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
             epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate,
                                                   oldslot);

-            if (newslot != epqslot_clean)
+            /*
+             * Typically, the caller's newslot was also generated by
+             * ExecGetUpdateNewTuple, so that epqslot_clean will be the same
+             * slot and copying is not needed.  But do the right thing if it
+             * isn't.
+             */
+            if (unlikely(newslot != epqslot_clean))
                 ExecCopySlot(newslot, epqslot_clean);
+
+            /*
+             * At this point newslot contains a virtual tuple that may
+             * reference some fields of oldslot's tuple in some disk buffer.
+             * If that tuple is in a different page than the original target
+             * tuple, then our only pin on that buffer is oldslot's, and we're
+             * about to release it.  Hence we'd better materialize newslot to
+             * ensure it doesn't contain references into an unpinned buffer.
+             * (We'd materialize it below anyway, but too late for safety.)
+             */
+            ExecMaterializeSlot(newslot);
         }

+        /*
+         * Here we convert oldslot to a materialized slot holding trigtuple.
+         * Neither slot passed to the triggers will hold any buffer pin.
+         */
         trigtuple = ExecFetchSlotHeapTuple(oldslot, true, &should_free_trig);
     }
     else
     {
+        /* Put the FDW-supplied tuple into oldslot to unify the cases */
         ExecForceStoreHeapTuple(fdw_trigtuple, oldslot, false);
         trigtuple = fdw_trigtuple;
     }

Hello Tom,

14.01.2024 02:29, Tom Lane wrote:
> So, after a bit more work on the comments, I end with the attached.

Thank you very much for the detailed explanation and for the final patch!
I have no questions regarding the fix proposed.

I've retested it with my demo test just to be sure and encountered no
issues.

Best regards,
Alexander



Alexander Lakhin <exclusion@gmail.com> writes:
> Thank you very much for the detailed explanation and for the final patch!
> I have no questions regarding the fix proposed.
> I've retested it with my demo test just to be sure and encountered no
> issues.

Pushed, thanks for testing!

            regards, tom lane



Thanks for taking care of this.

On Sat, Jan 13, 2024 at 6:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So I'm now convinced that all we need is an ExecMaterializeSlot
> call in the EPQ path, much as in Alexander's first patch.
>
> I don't think that the changes in the v4 patch are improvements.
> I really do not like
>
> +            newslot = ExecGetUpdateNewTuple(relinfo, epqslot_candidate,
> +                                            oldslot);
>
> and here's why: it is critical that we update the caller's slot,
> because that's the only way trigger-driven changes will get back to
> ExecModifyTable.  This coding obscures what's happening, and would
> break badly if ExecGetUpdateNewTuple ever acted differently than it
> does now or if a caller passed a slot that wasn't the same slot (which
> I'm not convinced never happens; see ExecSimpleRelationUpdate for what
> looks like a counterexample).  And it's not even saving anything
> meaningful.

I find the way we program with slots to be unintuitive and, indeed,
outright confusing. Your comment here is about making sure that the
slot to which newslot pointed on entry to the function ends up with
the correct contents, which is fair enough. But the naive reader could
be forgiven for thinking that epqslot_clean =
ExecGetUpdateNewTuple(relinfo, epqslot_candidate, oldslot) doesn't
affect the validity of newslot, and it definitely does. It seems like
ExecBRUpdateTriggers critically depends on the fact that on entry,
oldslot is fully valid, whereas newslot may depend on oldslot ... but
that isn't clearly documented. Nor is the fact that the original
newslot must itself get updated for the benefit of the caller. There's
no header comment explaining what's supposed to be true on entry and
on exit -- you just have to know how it's supposed to work. And it
seems to me that we have a ton of slot-related code that's like that.

You might think for example that each executor node type would have
comments explaining what slots it used, what tuple descriptor each one
uses and why, and the anticipated lifetime of each slot. But you often
don't get that, or at least not very clearly. nodeIndexonlyscan.c
actually has some comments in this area that are pretty good ("We need
another slot, in a format that's suitable for the table AM, for when
we need to fetch a tuple from the table for rechecking visibility.")
But nodeMergejoin.c just kinda does a bunch of stuff without really
explaining anything, and unfortunately I think that's more typical.

I'm not blaming you for any of this, nor asking you to fix anything,
unless you feel so moved. I'm just explaining how I feel about it. I
have a feeling that if I had started hacking on PostgreSQL a decade or
two earlier when the code was simpler, my understanding probably would
have grown up with the code and this would all make more sense to me.
But I've "only" been around since 2008!

--
Robert Haas
EDB: http://www.enterprisedb.com



Robert Haas <robertmhaas@gmail.com> writes:
> I find the way we program with slots to be unintuitive and, indeed,
> outright confusing. Your comment here is about making sure that the
> slot to which newslot pointed on entry to the function ends up with
> the correct contents, which is fair enough. But the naive reader could
> be forgiven for thinking that epqslot_clean =
> ExecGetUpdateNewTuple(relinfo, epqslot_candidate, oldslot) doesn't
> affect the validity of newslot, and it definitely does. It seems like
> ExecBRUpdateTriggers critically depends on the fact that on entry,
> oldslot is fully valid, whereas newslot may depend on oldslot ... but
> that isn't clearly documented.

Hm?  oldslot doesn't even exist on entry, it's set up within the
function.  The present bug existed because the result of that
ExecGetUpdateNewTuple call depended on oldslot, which is something
that ought not surprise the reader all that much.  We just neglected
to follow out the consequences of that.

> Nor is the fact that the original
> newslot must itself get updated for the benefit of the caller. There's
> no header comment explaining what's supposed to be true on entry and
> on exit -- you just have to know how it's supposed to work. And it
> seems to me that we have a ton of slot-related code that's like that.

I agree that ExecBRUpdateTriggers is woefully underdocumented, but
I don't find that that's particularly associated with its use of
slots.  I thought about adding a header comment as part of this patch,
but desisted because really three-quarters of trigger.c needs work
that way and it didn't seem like bug-fix material.

ISTM the really problematic thing here is just that the same slot
is being used for more than one purpose, in that ExecGetUpdateNewTuple
returns its result in the same slot no matter where it's called from.
That doesn't look to be trivial to change, but maybe we should look
for a way.  In the meantime, the comments in ExecBRUpdateTriggers do
point out that hazard.
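
(Just to sketch the kind of thing I mean -- the function below is
hypothetical, not something that exists or that I'm proposing -- one shape
would be to let each call site hand in its own destination slot; the
non-trivial part is arranging for every caller to have a suitably
allocated slot to pass:)

    /* Hypothetical sketch only; no such function exists or is proposed. */
    static TupleTableSlot *
    ExecGetUpdateNewTupleInto(ResultRelInfo *relinfo,
                              TupleTableSlot *planSlot,
                              TupleTableSlot *oldSlot,
                              TupleTableSlot *dstSlot)
    {
        TupleTableSlot *projected;

        projected = ExecGetUpdateNewTuple(relinfo, planSlot, oldSlot);

        /* Copy into the caller's slot and detach from any buffer pins. */
        if (projected != dstSlot)
            ExecCopySlot(dstSlot, projected);
        ExecMaterializeSlot(dstSlot);

        return dstSlot;
    }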

> You might think for example that each executor node type would have
> comments explaining what slots it used, what tuple descriptor each one
> uses and why, and the anticipated lifetime of each slot.

There's a lot of stuff that's pretty woefully underdocumented, and
we very frequently accept patches that make things worse not better.
But again, I don't buy that tuple slots are worse than anywhere else.

            regards, tom lane