Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger - Mailing list pgsql-bugs

From Alexander Lakhin
Subject Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger
Date
Msg-id b64c183f-4ae1-0b16-afcf-11101ebc4213@gmail.com
Whole thread Raw
In response to Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: BUG #17798: Incorrect memory access occurs when using BEFORE ROW UPDATE trigger  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-bugs
10.01.2024 23:04, Robert Haas wrote:
> OK, thanks for the further explanation. I think I'm starting to
> understand what's going on here. Maybe you already understand all of
> this, but in my own words:
>
> ExecGetUpdateNewTuple() calls ExecProject() which documents that the
> results are short-lived and you should materialize the slot if that's
> a problem. ExecModifyTable() does slot = ExecGetUpdateNewTuple(...),
> does not materialize it, and then passes that slot as the penultimate
> argument to ExecUpdate(), which calls ExecUpdatePrologue(), which
> calls ExecBRUpdateTriggers(), and the value returned by
> ExecModifyTable()'s call to ExecGetUpdateNewTuple() is now stored in
> newslot, but it still hasn't been materialized. So any operation
> that's performed on oldslot at this point invalidates newslot.

Yes, I see this the same way.

> But ExecBRUpdateTriggers() now calls GetTupleForTrigger() to clobber
> oldslot, so now newslot is also junk. That's technically OK, because
> we're not going to access the contents of newslot any more. We're
> instead going to refresh them:
>
>          if (epqslot_candidate != NULL)
>          {
>              TupleTableSlot *epqslot_clean;
>
>              epqslot_clean = ExecGetUpdateNewTuple(relinfo, epqslot_candidate,
>                                                    oldslot);
>
>              if (newslot != epqslot_clean)
>                  ExecCopySlot(newslot, epqslot_clean);
>          }
>
> Since the slot returned by this call to ExecGetUpdateNewTuple() is a
> function of the same projection info as when it was called from
> ExecModifyTable(), the slot returned here must be the same slot that
> was returned there, namely newslot. So, as you say, newslot ==
> epqslot_clean, and we've just clobbered it with updated values.
> Therefore we're back in the situation where oldslot is fully valid and
> newslot is only valid until somebody does something to oldslot. Thus,
> the portion of your patch that simplifies this code doesn't fix any
> bug, but rather just simplifies some unnecessarily complex logic.

Yes, in that portion I was just trying to address Tom's suspicion: "Maybe
we need to materialize even if newslot != epqslot_clean"...

>   But
> the next thing that happens in the existing code is:
>
>          trigtuple = ExecFetchSlotHeapTuple(oldslot, true, &should_free_trig);
>
> That is going to again invalidate newslot, because it can materialize
> oldslot. That's a problem if we ever use newslot after this point, and
> we do, unless we materialize newslot first, so your patch adds
> ExecMaterializeSlot(newslot) right before this, fixing the problem.

Yes, that's the way I find possible to fix it.

> What I don't really understand is why failure to do this causes an
> invalid memory access in ExecFetchSlotHeapTuple(). According to the
> above analysis, the invalid memory access ought to occur the first
> time something touches newslot after this call to
> ExecFetchSlotHeapTuple(), but if I understand your valgrind trace
> correctly, it's showing that ExecFetchSlotHeapTuple() is touching
> something it shouldn't. Do you understand what's happening here?

Maybe you are confused by the first call to ExecFetchSlotHeapTuple()?
The full Valgrind stack from my yesterday's test run is:
==00:00:00:05.052 3934530== Invalid read of size 1
==00:00:00:05.052 3934530==    at 0x1EECAC: heap_compute_data_size (heaptuple.c:244)
==00:00:00:05.052 3934530==    by 0x1EFA84: heap_form_tuple (heaptuple.c:1158)
==00:00:00:05.052 3934530==    by 0x423B5E: tts_buffer_heap_materialize (execTuples.c:747)
==00:00:00:05.052 3934530==    by 0x424EB1: ExecFetchSlotHeapTuple (execTuples.c:1653)
==00:00:00:05.052 3934530==    by 0x3ECD45: ExecBRUpdateTriggers (trigger.c:3055)
==00:00:00:05.052 3934530==    by 0x4483F0: ExecUpdatePrologue (nodeModifyTable.c:1923)
==00:00:00:05.052 3934530==    by 0x44A146: ExecUpdate (nodeModifyTable.c:2284)
==00:00:00:05.052 3934530==    by 0x44CE4E: ExecModifyTable (nodeModifyTable.c:3844)
==00:00:00:05.052 3934530==    by 0x41E9A1: ExecProcNodeFirst (execProcnode.c:464)
==00:00:00:05.052 3934530==    by 0x416F5D: ExecProcNode (executor.h:273)
==00:00:00:05.052 3934530==    by 0x416F5D: ExecutePlan (execMain.c:1670)
==00:00:00:05.052 3934530==    by 0x417120: standard_ExecutorRun (execMain.c:365)
==00:00:00:05.052 3934530==    by 0x4171FA: ExecutorRun (execMain.c:309)
==00:00:00:05.052 3934530==  Address 0x9982c74 is in a rw- anonymous segment
==00:00:00:05.052 3934530==

here trigger.c:3055 points at the second call in ExecBRUpdateTriggers():
             newtuple = ExecFetchSlotHeapTuple(newslot, true, &should_free_new);

Does it makes more sense?

> I think the question we should be asking ourselves here is whether
> we're really OK with having such a vast distance between the point
> where newslot is first populated and where materialization happens. It
> seems like playing with fire to populate a slot with ephemeral data in
> ExecModifyTable() and then press the materialize button four stack
> frames down. There's a lot of opportunity for things to go wrong
> there. On the other hand, this is very performance-sensitive code, so
> maybe materializing here is the only workable solution. If that's the
> case, though, we should probably add comments making the lifetime of
> various slots clearer, because neither Tom nor I was able to
> understand what the heck was happening here on first reading, which is
> probably not a good sign.

I was discouraged by that vast distance and implicit buffer usage too, but
I found no other feasible way to fix it. On the other hand, 75e03eabe and
Andres's words upthread made me believe that it's an acceptable solution.

Thank you very much for diving deep into this subject!

Best regards,
Alexander



pgsql-bugs by date:

Previous
From: Michael Paquier
Date:
Subject: Re: BUG #18240: Undefined behaviour in cash_mul_flt8() and friends
Next
From: Richard Guo
Date:
Subject: Re: Postgres 16.1 - Bug: cache entry already complete