Thread: BUG #17994: Invalidating relcache corrupts tupDesc inside ExecEvalFieldStoreDeForm()

BUG #17994: Invalidating relcache corrupts tupDesc inside ExecEvalFieldStoreDeForm()

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17994
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 16beta1
Operating system:   Ubuntu 22.04
Description:

The following script:
for ((n=1;n<=10;n++)); do
echo "ITERATION $n"

createdb test
numclients=50
for ((c=1;c<=$numclients;c++)); do
  cat << EOF | psql test -o /dev/null &
CREATE PUBLICATION testpub_$c FOR ALL TABLES;
CREATE TYPE rec_$c AS (f1 text, f2 text);
CREATE TEMP TABLE tbl_$c (f rec_$c);
INSERT INTO tbl_$c VALUES ((repeat('1234567890', 100000), 'a'));

UPDATE tbl_$c SET f.f2 = 'b';
EOF
done
wait
dropdb test

grep 'TRAP:' server.log && break;
done

invokes errors or a server crash:
ITERATION 1
ITERATION 2
ERROR:  unsupported byval length: 32639
ITERATION 3
ITERATION 4
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost
WARNING:  terminating connection because of crash of another server
process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
dropdb: error: database removal failed: server closed the connection
unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
TRAP: failed Assert("(thisatt->attalign) == TYPALIGN_SHORT"), File:
"heaptuple.c", Line: 1312, PID: 1737623

The stack trace of the crash:
Core was generated by `postgres: law test [local] UPDATE
                        '.
Program terminated with signal SIGABRT, Aborted.

warning: Section `.reg-xstate/1737623' in core file too small.
#0  __pthread_kill_implementation (no_tid=0, signo=6,
threadid=140071061268416) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6,
threadid=140071061268416) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140071061268416) at
./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140071061268416, signo=signo@entry=6) at
./nptl/pthread_kill.c:89
#3  0x00007f64d7c27476 in __GI_raise (sig=sig@entry=6) at
../sysdeps/posix/raise.c:26
#4  0x00007f64d7c0d7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x0000555ede8dc3b3 in ExceptionalCondition (
    conditionName=0x555edeb49700 "(thisatt->attalign) == TYPALIGN_SHORT", 
    fileName=0x555edeb49020 "heaptuple.c", lineNumber=1312) at assert.c:66
#6  0x0000555edc7b76d7 in heap_deform_tuple (tuple=0x7ffe49cfe570,
tupleDesc=0x7f64ca4562d8, 
    values=0x62500005ad40, isnull=0x62500005b130) at heaptuple.c:1312
#7  0x0000555edd389d0b in ExecEvalFieldStoreDeForm (state=0x62500005b230,
op=0x62500005b338, 
    econtext=0x62500005ac20) at execExprInterp.c:3210
#8  0x0000555edd3729b4 in ExecInterpExpr (state=0x62500005b230,
econtext=0x62500005ac20, isnull=0x7ffe49cfe900)
    at execExprInterp.c:1446
#9  0x0000555edd3791f2 in ExecInterpExprStillValid (state=0x62500005b230,
econtext=0x62500005ac20, 
    isNull=0x7ffe49cfe900) at execExprInterp.c:1870
#10 0x0000555edd406e84 in ExecEvalExprSwitchContext (state=0x62500005b230,
econtext=0x62500005ac20, 
    isNull=0x7ffe49cfe900) at ../../../src/include/executor/executor.h:355
#11 0x0000555edd4070ab in ExecProject (projInfo=0x62500005b228) at
../../../src/include/executor/executor.h:389
#12 0x0000555edd40883e in ExecScan (node=0x62500005ab10,
accessMtd=0x555edd5843cf <SeqNext>, 
    recheckMtd=0x555edd5847cd <SeqRecheck>) at execScan.c:237
#13 0x0000555edd58484e in ExecSeqScan (pstate=0x62500005ab10) at
nodeSeqscan.c:112
#14 0x0000555edd3f5a71 in ExecProcNodeFirst (node=0x62500005ab10) at
execProcnode.c:464
#15 0x0000555edd5495a5 in ExecProcNode (node=0x62500005ab10) at
../../../src/include/executor/executor.h:273
#16 0x0000555edd5674b3 in ExecModifyTable (pstate=0x62500005a490) at
nodeModifyTable.c:3616
#17 0x0000555edd3f5a71 in ExecProcNodeFirst (node=0x62500005a490) at
execProcnode.c:464
#18 0x0000555edd3af500 in ExecProcNode (node=0x62500005a490) at
../../../src/include/executor/executor.h:273
#19 0x0000555edd3be25a in ExecutePlan (estate=0x62500005a218,
planstate=0x62500005a490, 
    use_parallel_mode=false, operation=CMD_UPDATE, sendTuples=false,
numberTuples=0, 
    direction=ForwardScanDirection, dest=0x625000072770, execute_once=true)
at execMain.c:1670
#20 0x0000555edd3b17f2 in standard_ExecutorRun (queryDesc=0x619000001598,
direction=ForwardScanDirection, 
    count=0, execute_once=true) at execMain.c:365
#21 0x0000555edd3b0e37 in ExecutorRun (queryDesc=0x619000001598,
direction=ForwardScanDirection, count=0, 
    execute_once=true) at execMain.c:309
#22 0x0000555ede0d688a in ProcessQuery (plan=0x625000006c28, 
    sourceText=0x625000005218 "UPDATE tbl_14 SET f.f2 = 'b';", params=0x0,
queryEnv=0x0, dest=0x625000072770, 
    qc=0x7ffe49cff1c0) at pquery.c:160
#23 0x0000555ede0deb02 in PortalRunMulti (portal=0x625000025a18,
isTopLevel=true, setHoldSnapshot=false, 
    dest=0x625000072770, altdest=0x625000072770, qc=0x7ffe49cff1c0) at
pquery.c:1277
#24 0x0000555ede0dba90 in PortalRun (portal=0x625000025a18,
count=9223372036854775807, isTopLevel=true, 
    run_once=true, dest=0x625000072770, altdest=0x625000072770,
qc=0x7ffe49cff1c0) at pquery.c:791
#25 0x0000555ede0c592d in exec_simple_query (query_string=0x625000005218
"UPDATE tbl_14 SET f.f2 = 'b';")
    at postgres.c:1274
#26 0x0000555ede0d347a in PostgresMain (dbname=0x62900001b358 "test",
username=0x6250000020f8 "law")
    at postgres.c:4632
#27 0x0000555eddcf401f in BackendRun (port=0x614000016040) at
postmaster.c:4461
#28 0x0000555eddcf2723 in BackendStartup (port=0x614000016040) at
postmaster.c:4189
#29 0x0000555eddce84dd in ServerLoop () at postmaster.c:1779
#30 0x0000555eddce69fe in PostmasterMain (argc=3, argv=0x6030000006a0) at
postmaster.c:1463
#31 0x0000555edd67850e in main (argc=3, argv=0x6030000006a0) at main.c:198

(gdb) frame 6
#6  0x0000555edc7b76d7 in heap_deform_tuple (tuple=0x7ffe49cfe570,
tupleDesc=0x7f64ca4562d8, 
    values=0x62500005ad40, isnull=0x62500005b130) at heaptuple.c:1312
1312                            off = att_align_nominal(off,
thisatt->attalign);
(gdb) p thisatt->attalign
$1 = 0 '\000'

(gdb) frame 7
#7  0x0000555edd389d0b in ExecEvalFieldStoreDeForm (state=0x62500005b230,
op=0x62500005b338, 
    econtext=0x62500005ac20) at execExprInterp.c:3210
3210                    heap_deform_tuple(&tmptup, tupDesc,
(gdb) p *tupDesc
$2 = {natts = 18406, tdtypeid = 1952409456, tdtypmod = 1953718639,
tdrefcount = 859320671, constr = 0x3338, 
  attrs = 0x7f64ca4562f0}

With some debug logging added I see (note the natts value):
2023-06-24 15:23:12.488 MSK [1741232] LOG:  get_cached_rowtype() after
lookup_type_cache| type_id: 16389, typentry->tupDesc: 0x7ff12f045640,
typentry->tupDesc->natts: 2
2023-06-24 15:23:12.488 MSK [1741232] STATEMENT:  UPDATE tbl_5 SET f.f2 =
'b';
2023-06-24 15:23:12.488 MSK [1741293] LOG:  statement: CREATE TYPE rec_23 AS
(f1 text, f2 text);
2023-06-24 15:23:12.488 MSK [1741232] LOG:  ExecEvalFieldStoreDeForm()
before heap_deform_tuple| tupDesc: 0x7ff12f045640, tupDesc->natts: 16427
2023-06-24 15:23:12.488 MSK [1741232] STATEMENT:  UPDATE tbl_5 SET f.f2 =
'b';
TRAP: failed Assert("(thisatt->attalign) == TYPALIGN_SHORT"), File:
"heaptuple.c", Line: 1312, PID: 1741232

So it looks like *tupDesc memory was overridden due to relcache invalidation
during ExecEvalFieldStoreDeForm() execution after the get_cached_rowtype()
call.

The pg_usleep(1000) call added inside ExecEvalFieldStoreDeForm() after
get_cached_rowtype() greatly increases the probability of the crash.

Reproduced on REL_11_STABLE (since 96e38fa5e) .. master.


PG Bug reporting form <noreply@postgresql.org> writes:
> The following script:
> ...
> invokes errors or a server crash:

Thanks for the report!  The good news is that the case shown here
is easily fixed, as attached.  get_cached_rowtype() explicitly
specifies that its result doesn't hold good across database
access if you don't increment the tupdesc's refcount.  Most of
the callers get this right, but ExecEvalFieldStoreDeForm() is
doing things in the wrong order by fetching the tupdesc before
detoasting the input datum.  We could fix that by temporarily
incrementing the refcount, but I don't see any reason we can't
just reorder the steps to make it safe.

The bad news is that while investigating this I came across
another hazard that seems much messier to fix.  To wit, if
you have a tuple with "missing" pass-by-ref columns, then
some of the columns output by heap_deform_tuple() will
contain pointers into the tupdesc (cf. getmissingattr()).
That's not sufficient lifespan in the scenario we're dealing
with here: if the tupdesc gets trashed anytime before the
eventual ExecEvalFieldStoreForm(), we'll have garbage in the
result tuple.

Worse, I fear there may be many other places with latent bugs of the
same ilk.  Before the attmissingval code was added, one could assume
that the result of heap_deform_tuple would hold good as long as you
had a pin on the source tuple.  But now it depends on metadata as
well, and I don't think we have a coherent story about that.

Any thoughts what to do?  I considered making getmissingattr
apply datumCopy, but that would probably lead to unpleasant
memory leaks.

            regards, tom lane

diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 7a4d7a4eee..851946a927 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -2015,7 +2015,8 @@ CheckOpSlotCompatibility(ExprEvalStep *op, TupleTableSlot *slot)
  * changed: if not NULL, *changed is set to true on any update
  *
  * The returned TupleDesc is not guaranteed pinned; caller must pin it
- * to use it across any operation that might incur cache invalidation.
+ * to use it across any operation that might incur cache invalidation,
+ * including for example detoasting of input tuples.
  * (The TupleDesc is always refcounted, so just use IncrTupleDescRefCount.)
  *
  * NOTE: because composite types can change contents, we must be prepared
@@ -3174,17 +3175,6 @@ ExecEvalFieldSelect(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
 void
 ExecEvalFieldStoreDeForm(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
 {
-    TupleDesc    tupDesc;
-
-    /* Lookup tupdesc if first time through or if type changes */
-    tupDesc = get_cached_rowtype(op->d.fieldstore.fstore->resulttype, -1,
-                                 op->d.fieldstore.rowcache, NULL);
-
-    /* Check that current tupdesc doesn't have more fields than we allocated */
-    if (unlikely(tupDesc->natts > op->d.fieldstore.ncolumns))
-        elog(ERROR, "too many columns in composite type %u",
-             op->d.fieldstore.fstore->resulttype);
-
     if (*op->resnull)
     {
         /* Convert null input tuple into an all-nulls row */
@@ -3200,6 +3190,7 @@ ExecEvalFieldStoreDeForm(ExprState *state, ExprEvalStep *op, ExprContext *econte
         Datum        tupDatum = *op->resvalue;
         HeapTupleHeader tuphdr;
         HeapTupleData tmptup;
+        TupleDesc    tupDesc;

         tuphdr = DatumGetHeapTupleHeader(tupDatum);
         tmptup.t_len = HeapTupleHeaderGetDatumLength(tuphdr);
@@ -3207,6 +3198,20 @@ ExecEvalFieldStoreDeForm(ExprState *state, ExprEvalStep *op, ExprContext *econte
         tmptup.t_tableOid = InvalidOid;
         tmptup.t_data = tuphdr;

+        /*
+         * Lookup tupdesc if first time through or if type changes.  Because
+         * we don't pin the tupdesc, we must not do this lookup until after
+         * doing DatumGetHeapTupleHeader: that could do database access while
+         * detoasting the datum.
+         */
+        tupDesc = get_cached_rowtype(op->d.fieldstore.fstore->resulttype, -1,
+                                     op->d.fieldstore.rowcache, NULL);
+
+        /* Check that current tupdesc doesn't have more fields than allocated */
+        if (unlikely(tupDesc->natts > op->d.fieldstore.ncolumns))
+            elog(ERROR, "too many columns in composite type %u",
+                 op->d.fieldstore.fstore->resulttype);
+
         heap_deform_tuple(&tmptup, tupDesc,
                           op->d.fieldstore.values,
                           op->d.fieldstore.nulls);

Hi,

On 2023-06-28 14:49:34 -0400, Tom Lane wrote:
> The bad news is that while investigating this I came across
> another hazard that seems much messier to fix.  To wit, if
> you have a tuple with "missing" pass-by-ref columns, then
> some of the columns output by heap_deform_tuple() will
> contain pointers into the tupdesc (cf. getmissingattr()).
> That's not sufficient lifespan in the scenario we're dealing
> with here: if the tupdesc gets trashed anytime before the
> eventual ExecEvalFieldStoreForm(), we'll have garbage in the
> result tuple.

What are the scenarios that could lead to the tupledesc being released before
ExecEvalFieldStoreForm()? I guess a function call that does an ALTER TABLE
might do the trick?  But shouldn't such cases be prohibited by
CheckTableNotInUse()?  If other sessions caused the tupledesc to be changed,
we should already hang onto the old definition via the
RememberToFreeTupleDescAtEOX() mechanism?


What about erroring out if the get_cached_rowtype() in
ExecEvalFieldStoreForm() indicates the tupledesc changed?  Hm, I guess that
could cause spurious errors, our mechanism for determining whether tupledescs
changed is pretty coarse.


> Worse, I fear there may be many other places with latent bugs of the
> same ilk.  Before the attmissingval code was added, one could assume
> that the result of heap_deform_tuple would hold good as long as you
> had a pin on the source tuple.  But now it depends on metadata as
> well, and I don't think we have a coherent story about that.
> 
> Any thoughts what to do?  I considered making getmissingattr
> apply datumCopy, but that would probably lead to unpleasant
> memory leaks.

Ugh.  This is pretty nasty.

The only real thing I can think of is to use refcounted tupledescs more
widely. Part of the problem would be dealing with releasing the refcounts,
another what to do about non-refcounted descs.  WRT the former, I guess we
could add a variant of ResourceOwnerRememberTupleDesc() that doesn't warn on
commit if there are unreleased references.

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> On 2023-06-28 14:49:34 -0400, Tom Lane wrote:
>> That's not sufficient lifespan in the scenario we're dealing
>> with here: if the tupdesc gets trashed anytime before the
>> eventual ExecEvalFieldStoreForm(), we'll have garbage in the
>> result tuple.

> What are the scenarios that could lead to the tupledesc being released before
> ExecEvalFieldStoreForm()? I guess a function call that does an ALTER TABLE
> might do the trick?  But shouldn't such cases be prohibited by
> CheckTableNotInUse()?

Any old cache flush could do it; you don't need a local trigger event.
(Alexander's original test case uses CREATE PUBLICATION to trigger
the cache flush, but that's just an easy way to make it more-or-less
deterministic.)

> If other sessions caused the tupledesc to be changed,
> we should already hang onto the old definition via the
> RememberToFreeTupleDescAtEOX() mechanism?

I believe the tupdesc in question is actually in the typcache,
which doesn't have anything like RememberToFreeTupleDescAtEOX
(which is a horrid hack anyway if you ask me).

We could probably make things better for this specific case by
teaching the typcache not to replace a cached tupdesc unless its
contents actually change.  But that just makes it harder to get
to a bug instance; it's not a cure-all.

            regards, tom lane



Hi,

On 2023-06-28 15:52:28 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > If other sessions caused the tupledesc to be changed,
> > we should already hang onto the old definition via the
> > RememberToFreeTupleDescAtEOX() mechanism?
> 
> I believe the tupdesc in question is actually in the typcache,
> which doesn't have anything like RememberToFreeTupleDescAtEOX
> (which is a horrid hack anyway if you ask me).

It's in the typecache, but that just uses the relcache's tupledesc for
non-record composites. But looks like it doesn't suffice, because
TypeCacheRelCallback() releases the refcount the typecache held, regardless of
the tupledesc having changed meaningfully or not.

So even if there can't have been "important" changes to the tupledesc due to
locking, we end up with the newer tupledesc on a second lookup...


I agree that the RememberToFreeTupleDescAtEOX thing is a ugly hack, but I
don't think it's easy to come up with something good...


> We could probably make things better for this specific case by
> teaching the typcache not to replace a cached tupdesc unless its
> contents actually change.  But that just makes it harder to get
> to a bug instance; it's not a cure-all.

Yea :(.

Greetings,

Andres Freund




On 2023-06-28 We 16:54, Andres Freund wrote:
Hi,

On 2023-06-28 15:52:28 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
If other sessions caused the tupledesc to be changed,
we should already hang onto the old definition via the
RememberToFreeTupleDescAtEOX() mechanism?
I believe the tupdesc in question is actually in the typcache,
which doesn't have anything like RememberToFreeTupleDescAtEOX
(which is a horrid hack anyway if you ask me).
It's in the typecache, but that just uses the relcache's tupledesc for
non-record composites. But looks like it doesn't suffice, because
TypeCacheRelCallback() releases the refcount the typecache held, regardless of
the tupledesc having changed meaningfully or not.

So even if there can't have been "important" changes to the tupledesc due to
locking, we end up with the newer tupledesc on a second lookup...


I agree that the RememberToFreeTupleDescAtEOX thing is a ugly hack, but I
don't think it's easy to come up with something good...


We could probably make things better for this specific case by
teaching the typcache not to replace a cached tupdesc unless its
contents actually change.  But that just makes it harder to get
to a bug instance; it's not a cure-all.
Yea :(.


:-(

I thought about  whether the datumCopy() idea might be manageable, but getmissingattr() is inlined by heap_getttr() which has distressingly large number of call sites, including in third party code.

Could we maybe cache missing values elsewhere in a way that's less volatile? That's an extremely vague idea, just thinking out loud.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


On 2023-06-28 We 17:57, Andrew Dunstan wrote:


On 2023-06-28 We 16:54, Andres Freund wrote:
Hi,

On 2023-06-28 15:52:28 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
If other sessions caused the tupledesc to be changed,
we should already hang onto the old definition via the
RememberToFreeTupleDescAtEOX() mechanism?
I believe the tupdesc in question is actually in the typcache,
which doesn't have anything like RememberToFreeTupleDescAtEOX
(which is a horrid hack anyway if you ask me).
It's in the typecache, but that just uses the relcache's tupledesc for
non-record composites. But looks like it doesn't suffice, because
TypeCacheRelCallback() releases the refcount the typecache held, regardless of
the tupledesc having changed meaningfully or not.

So even if there can't have been "important" changes to the tupledesc due to
locking, we end up with the newer tupledesc on a second lookup...


I agree that the RememberToFreeTupleDescAtEOX thing is a ugly hack, but I
don't think it's easy to come up with something good...


We could probably make things better for this specific case by
teaching the typcache not to replace a cached tupdesc unless its
contents actually change.  But that just makes it harder to get
to a bug instance; it's not a cure-all.
Yea :(.


:-(

I thought about  whether the datumCopy() idea might be manageable, but getmissingattr() is inlined by heap_getttr() which has distressingly large number of call sites, including in third party code.

Could we maybe cache missing values elsewhere in a way that's less volatile? That's an extremely vague idea, just thinking out loud.




After (not) sleeping on this overnight, and discussing it with a colleague this morning, here's a suggestion. We have a hash table, keyed by (tdtypeid, attnum) where we store a datumCopy'd version of the value. If it's present just return the value instead of getting it from the tupdesc. The hash table is blown away at the end of the transaction. Assuming that's workable I think it would not be a large patch.

My colleague did ask if we had live reproducible case.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> After (not) sleeping on this overnight, and discussing it with a 
> colleague this morning, here's a suggestion. We have a hash table, keyed 
> by (tdtypeid, attnum) where we store a datumCopy'd version of the value. 
> If it's present just return the value instead of getting it from the 
> tupdesc. The hash table is blown away at the end of the transaction. 
> Assuming that's workable I think it would not be a large patch.

That sounds possibly workable.  I'm a bit concerned about added
overhead, and about whether the hashtable needs invalidation support.
It might be better to key it off (relfilenode, attnum).

> My colleague did ask if we had live reproducible case.

I have not attempted to build one, but it'd probably be fairly
easy to do if you turn on debug_discard_caches.

            regards, tom lane




On 2023-06-29 Th 10:26, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
After (not) sleeping on this overnight, and discussing it with a 
colleague this morning, here's a suggestion. We have a hash table, keyed 
by (tdtypeid, attnum) where we store a datumCopy'd version of the value. 
If it's present just return the value instead of getting it from the 
tupdesc. The hash table is blown away at the end of the transaction. 
Assuming that's workable I think it would not be a large patch.
That sounds possibly workable.  


OK, good, we have a plan.


I'm a bit concerned about added
overhead, and about whether the hashtable needs invalidation support.
It might be better to key it off (relfilenode, attnum).


re overhead: getmissingattr isn't called at all except for "off the end" attributes. Setting up the hash table at most once per txn doesn't seem likely to cost much, and even that won't be done if getmissingattr isn't called.

re relfilenode: we don't have it in getmissingattr, so that would involve looking it up or whacking around the API. Neither of those seem good. Do you have any reason for thinking tdtypeid will not be sufficient?

re invalidation: that seems to suggest that the missing value could change under us. I don't think it can, but if it can then more than just this is broken. If not, how would invalidation affect us?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2023-06-29 Th 10:26, Tom Lane wrote:
>> I'm a bit concerned about added
>> overhead, and about whether the hashtable needs invalidation support.
>> It might be better to key it off (relfilenode, attnum).

> re relfilenode: we don't have it in getmissingattr, so that would 
> involve looking it up or whacking around the API.

Yeah, I was afraid of that.

> re invalidation: that seems to suggest that the missing value could 
> change under us. I don't think it can, but if it can then more than just 
> this is broken. If not, how would invalidation affect us?

The scenario I'm afraid of is that we cache a missingval for table X,
then X gets dropped, then a new table Y gets created that by bad luck
has the same type OID as X did, then we add a column to Y that
requires a missingval, and now we have an entry in the hashtable that
matches Y but contains the wrong value.  Admittedly, it seems very
low probability that this would all happen within the span of one
transaction, so maybe we can get away with ignoring the case.  But
if we used relfilenode, we'd have at least a little more protection
because of the tombstone files that prevent immediate re-use of a
relfilenode OID.  I'm not sure that it'd be bulletproof even with
relfilenode, though.  Maybe we should bite the bullet and provide
invalidation based on a pg_type inval callback.

            regards, tom lane




On 2023-06-29 Th 15:25, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 2023-06-29 Th 10:26, Tom Lane wrote:
I'm a bit concerned about added
overhead, and about whether the hashtable needs invalidation support.
It might be better to key it off (relfilenode, attnum).
re relfilenode: we don't have it in getmissingattr, so that would 
involve looking it up or whacking around the API.
Yeah, I was afraid of that.

re invalidation: that seems to suggest that the missing value could 
change under us. I don't think it can, but if it can then more than just 
this is broken. If not, how would invalidation affect us?
The scenario I'm afraid of is that we cache a missingval for table X,
then X gets dropped, then a new table Y gets created that by bad luck
has the same type OID as X did, then we add a column to Y that
requires a missingval, and now we have an entry in the hashtable that
matches Y but contains the wrong value.  Admittedly, it seems very
low probability that this would all happen within the span of one
transaction, so maybe we can get away with ignoring the case.  But
if we used relfilenode, we'd have at least a little more protection
because of the tombstone files that prevent immediate re-use of a
relfilenode OID.  I'm not sure that it'd be bulletproof even with
relfilenode, though.  Maybe we should bite the bullet and provide
invalidation based on a pg_type inval callback.
			


Yeah, Robert has just convinced me, so I'll do it like that. It doesn't look too hard.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2023-06-29 Th 15:25, Tom Lane wrote:
>> Maybe we should bite the bullet and provide
>> invalidation based on a pg_type inval callback.

> Yeah, Robert has just convinced me, so I'll do it like that. It doesn't 
> look too hard.

Oh, I have a better idea.  We're only going to need all this for
pass-by-ref types, right?  Why not make the hash key be the value
itself?  Wrap it in a bytea perhaps to avoid needing a bespoke
hash function.

            regards, tom lane




On 2023-06-29 Th 18:41, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 2023-06-29 Th 15:25, Tom Lane wrote:
Maybe we should bite the bullet and provide
invalidation based on a pg_type inval callback.
Yeah, Robert has just convinced me, so I'll do it like that. It doesn't 
look too hard.
Oh, I have a better idea.  We're only going to need all this for
pass-by-ref types, right?  


Yes, the value we get back for byval types isn't a pointer that might disappear.


Why not make the hash key be the value
itself?  Wrap it in a bytea perhaps to avoid needing a bespoke
hash function.
			


Not sure I understand.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2023-06-29 Th 18:41, Tom Lane wrote:
>> Why not make the hash key be the value
>> itself?  Wrap it in a bytea perhaps to avoid needing a bespoke
>> hash function.

> Not sure I understand.

Say the missingval for a particular column is text 'abc'.
We don't actually care which column it is, all we need is a
copy of that datum that will stay put for the rest of the
transaction.  So I'm thinking that the lookup key for the
hash table should actually be the contents of the datum,
and we don't need to store anything else at all.  (If we
happen to have two columns with the same missingval, they
can perfectly well share this hash entry.)  Then there's
no question of invalidation, or at least the existing
invalidation mechanisms for tupdescs do all we need.

            regards, tom lane




On 2023-06-30 Fr 08:46, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 2023-06-29 Th 18:41, Tom Lane wrote:
Why not make the hash key be the value
itself?  Wrap it in a bytea perhaps to avoid needing a bespoke
hash function.
Not sure I understand.
Say the missingval for a particular column is text 'abc'.
We don't actually care which column it is, all we need is a
copy of that datum that will stay put for the rest of the
transaction.  So I'm thinking that the lookup key for the
hash table should actually be the contents of the datum,
and we don't need to store anything else at all.  (If we
happen to have two columns with the same missingval, they
can perfectly well share this hash entry.)  Then there's
no question of invalidation, or at least the existing
invalidation mechanisms for tupdescs do all we need.
			


OK, I get it. Do we have a routine to wrap a Datum in a bytea, or do I need to write one? It's slightly amusing that the original patch for fast defaults stored the missing value as a bytea, but someone (Andres IIRC) preferred that we store it as a one element array, so that's what we went with.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> OK, I get it. Do we have a routine to wrap a Datum in a bytea, or do I 
> need to write one?

I don't recall that functionality being exported anywhere,
but it'd only be a couple lines of code in any case.

(Whether it's actually worth doing, I'm not sure about.
It only helps if it lets you use existing hashtable support
routines, but I'm not quite sure that any of those exist
for bytea either.)

            regards, tom lane




On 2023-06-30 Fr 09:43, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
OK, I get it. Do we have a routine to wrap a Datum in a bytea, or do I 
need to write one?
I don't recall that functionality being exported anywhere,
but it'd only be a couple lines of code in any case.

(Whether it's actually worth doing, I'm not sure about.
It only helps if it lets you use existing hashtable support
routines, but I'm not quite sure that any of those exist
for bytea either.)
			


Me either. I think this might call for too much invention so I'm going to revert to plan A. The invalidation code won't be very much, and it should be a fairly rare event, so it doesn't need to be very clever.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> Me either. I think this might call for too much invention so I'm going 
> to revert to plan A. The invalidation code won't be very much, and it 
> should be a fairly rare event, so it doesn't need to be very clever.

The problem with rarely-executed code is that it's also hard to test.

            regards, tom lane




On 2023-06-30 Fr 15:08, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
Me either. I think this might call for too much invention so I'm going 
to revert to plan A. The invalidation code won't be very much, and it 
should be a fairly rare event, so it doesn't need to be very clever.
The problem with rarely-executed code is that it's also hard to test.
			


Your Plan B in the end proved less difficult than I thought, and certainly seems more robust than having to tangle with invalidations. I didn't try to do anything to wrap the values in a bytea, it didn't seem necessary. Here's a patch - it's not terribly long or invasive. I haven't tried backpatching it yet.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachment
Hi,

On 2023-07-01 09:14:39 -0400, Andrew Dunstan wrote:
> On 2023-06-30 Fr 15:08, Tom Lane wrote:
> > Andrew Dunstan<andrew@dunslane.net>  writes:
> > > Me either. I think this might call for too much invention so I'm going
> > > to revert to plan A. The invalidation code won't be very much, and it
> > > should be a fairly rare event, so it doesn't need to be very clever.
> > The problem with rarely-executed code is that it's also hard to test.

> Your Plan B in the end proved less difficult than I thought, and certainly
> seems more robust than having to tangle with invalidations. I didn't try to
> do anything to wrap the values in a bytea, it didn't seem necessary. Here's
> a patch - it's not terribly long or invasive. I haven't tried backpatching
> it yet.

Hm. Is tying this to top-level transactions the right choice? What about
sequences like

SAVEPOINT x; ALTER TABLE ... ADD COLUMN ... DEFAULT; SELECT use_default FROM ...; ROLLBACK TO SAVEPOINT x;
SAVEPOINT y; ALTER TABLE ... ADD COLUMN ... DEFAULT; SELECT use_default FROM ...; ROLLBACK TO SAVEPOINT y;

Isn't that going to break the assumption that the key is unique within a
transaction?  I think at the very least you'd need to make 

Separately, will this work correctly with procedures keeping values alive
across transactions? I don't feel like I have a good grasp at how that whole
area is supposed to work...

Greetings,

Andres Freund



Andres Freund <andres@anarazel.de> writes:
> Isn't that going to break the assumption that the key is unique within a
> transaction?

Huh?  "abc" is "abc", no matter what.  At least if Andrew did what
I suggested (I didn't look at the patch yet).

> Separately, will this work correctly with procedures keeping values alive
> across transactions?

That might be an issue.  But couldn't we make this cache just live for
the life of the process?  It's unlikely to get large.

            regards, tom lane



Hi,

On 2023-07-02 17:57:03 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Isn't that going to break the assumption that the key is unique within a
> > transaction?
> 
> Huh?  "abc" is "abc", no matter what.  At least if Andrew did what
> I suggested (I didn't look at the patch yet).

Yea, I think that was a brainfart after too briefly skimming the code.


> > Separately, will this work correctly with procedures keeping values alive
> > across transactions?
> 
> That might be an issue.  But couldn't we make this cache just live for
> the life of the process?  It's unlikely to get large.

I don't have a good handle about how big it'd end up being in some of the less
common workloads. I can imagine workloads with temp tables or such churning
through a lot of default values - often the "keyed by value" approach will
save the day, but I imagine not always.

.oO(Perhaps we need to add a boehm style GC ... No.)

Perhaps we could defer resetting the cache to when we're not inside a
procedure?

I kinda wonder if this isn't basically the start of a "string interning" style
infrastructure, except for more types than just strings... I've wondered about
having that quite a few times.

Greetings,

Andres Freund




On 2023-07-02 Su 18:13, Andres Freund wrote:
Hi,

On 2023-07-02 17:57:03 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Isn't that going to break the assumption that the key is unique within a
transaction?
Huh?  "abc" is "abc", no matter what.  At least if Andrew did what
I suggested (I didn't look at the patch yet).
Yea, I think that was a brainfart after too briefly skimming the code.


Separately, will this work correctly with procedures keeping values alive
across transactions?
That might be an issue.  But couldn't we make this cache just live for
the life of the process?  It's unlikely to get large.
I don't have a good handle about how big it'd end up being in some of the less
common workloads. I can imagine workloads with temp tables or such churning
through a lot of default values - often the "keyed by value" approach will
save the day, but I imagine not always.

The maximum number of entries in the table is the number of pg_attribute rows with atthasmissing = true and attbyval = false. In practice I suspect that's mostly going to be fairly low.



.oO(Perhaps we need to add a boehm style GC ... No.)

Perhaps we could defer resetting the cache to when we're not inside a
procedure?


I'm kinda leaning towards Tom's suggestion to just make it session-persistent.



I kinda wonder if this isn't basically the start of a "string interning" style
infrastructure, except for more types than just strings... I've wondered about
having that quite a few times.


maybe.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


On 2023-07-02 Su 22:15, Andrew Dunstan wrote:


On 2023-07-02 Su 18:13, Andres Freund wrote:
Hi,

On 2023-07-02 17:57:03 -0400, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Isn't that going to break the assumption that the key is unique within a
transaction?
Huh?  "abc" is "abc", no matter what.  At least if Andrew did what
I suggested (I didn't look at the patch yet).
Yea, I think that was a brainfart after too briefly skimming the code.


Separately, will this work correctly with procedures keeping values alive
across transactions?
That might be an issue.  But couldn't we make this cache just live for
the life of the process?  It's unlikely to get large.
I don't have a good handle about how big it'd end up being in some of the less
common workloads. I can imagine workloads with temp tables or such churning
through a lot of default values - often the "keyed by value" approach will
save the day, but I imagine not always.

The maximum number of entries in the table is the number of pg_attribute rows with atthasmissing = true and attbyval = false. In practice I suspect that's mostly going to be fairly low.


.oO(Perhaps we need to add a boehm style GC ... No.)

Perhaps we could defer resetting the cache to when we're not inside a
procedure?


I'm kinda leaning towards Tom's suggestion to just make it session-persistent.




The thread seems to have died down a bit. Do we have a consensus on Tom's approach?


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hi,

On 2023-07-08 08:48:00 -0400, Andrew Dunstan wrote:
> On 2023-07-02 Su 22:15, Andrew Dunstan wrote:
> > > > > Separately, will this work correctly with procedures keeping values alive
> > > > > across transactions?
> > > > That might be an issue.  But couldn't we make this cache just live for
> > > > the life of the process?  It's unlikely to get large.
> > > I don't have a good handle about how big it'd end up being in some of the less
> > > common workloads. I can imagine workloads with temp tables or such churning
> > > through a lot of default values - often the "keyed by value" approach will
> > > save the day, but I imagine not always.
> > 
> > The maximum number of entries in the table is the number of pg_attribute
> > rows with atthasmissing = true and attbyval = false. In practice I
> > suspect that's mostly going to be fairly low.

It's not really bound by that, because the set of rows can change over
time. Particularly with temp tables.


> The thread seems to have died down a bit. Do we have a consensus on Tom's
> approach?

I guess so. It's far from pretty, but nobody really has come up with something
better.

Greetings,

Andres Freund




On 2023-07-10 Mo 15:51, Andres Freund wrote:
Hi,

On 2023-07-08 08:48:00 -0400, Andrew Dunstan wrote:
On 2023-07-02 Su 22:15, Andrew Dunstan wrote:
Separately, will this work correctly with procedures keeping values alive
across transactions?
That might be an issue.  But couldn't we make this cache just live for
the life of the process?  It's unlikely to get large.
I don't have a good handle about how big it'd end up being in some of the less
common workloads. I can imagine workloads with temp tables or such churning
through a lot of default values - often the "keyed by value" approach will
save the day, but I imagine not always.
The maximum number of entries in the table is the number of pg_attribute
rows with atthasmissing = true and attbyval = false. In practice I
suspect that's mostly going to be fairly low.
It's not really bound by that, because the set of rows can change over
time. Particularly with temp tables.


How many times are people going to add a new column with a non-null default to a temp table? Usually you know the shape you want for a temp table when you create it, I should think. Even in a long-running pgbouncer session I wouldn't expect this to balloon substantially.




The thread seems to have died down a bit. Do we have a consensus on Tom's
approach?
I guess so. It's far from pretty, but nobody really has come up with something
better.


OK, I'll send a revised patch.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


On 2023-07-11 Tu 10:15, Andrew Dunstan wrote:


On 2023-07-10 Mo 15:51, Andres Freund wrote:
Hi,

On 2023-07-08 08:48:00 -0400, Andrew Dunstan wrote:
On 2023-07-02 Su 22:15, Andrew Dunstan wrote:
Separately, will this work correctly with procedures keeping values alive
across transactions?
That might be an issue.  But couldn't we make this cache just live for
the life of the process?  It's unlikely to get large.
I don't have a good handle about how big it'd end up being in some of the less
common workloads. I can imagine workloads with temp tables or such churning
through a lot of default values - often the "keyed by value" approach will
save the day, but I imagine not always.
The maximum number of entries in the table is the number of pg_attribute
rows with atthasmissing = true and attbyval = false. In practice I
suspect that's mostly going to be fairly low.
It's not really bound by that, because the set of rows can change over
time. Particularly with temp tables.


How many times are people going to add a new column with a non-null default to a temp table? Usually you know the shape you want for a temp table when you create it, I should think. Even in a long-running pgbouncer session I wouldn't expect this to balloon substantially.


The thread seems to have died down a bit. Do we have a consensus on Tom's
approach?
I guess so. It's far from pretty, but nobody really has come up with something
better.


OK, I'll send a revised patch.




Here it is. The nice thing here is that the code changes are entirely confined to heaptuple.c


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachment
Hello,

28.06.2023 21:49, Tom Lane wrote:
> The bad news is that while investigating this I came across
> another hazard that seems much messier to fix.  To wit, if
> you have a tuple with "missing" pass-by-ref columns, then
> some of the columns output by heap_deform_tuple() will
> contain pointers into the tupdesc (cf. getmissingattr()).
> That's not sufficient lifespan in the scenario we're dealing
> with here: if the tupdesc gets trashed anytime before the
> eventual ExecEvalFieldStoreForm(), we'll have garbage in the
> result tuple.

Please excuse me, as I'm definitely late to the party, but may I ask some
questions to understand the issue discussed? I still can't see a practical
way to get garbage data in a tuple with missing columns, so maybe I'm
missing something (except for columns).

If we talk about ExecEvalFieldStoreDeForm() -> heap_deform_tuple() ->
getmissingattr(), then I can't find a way to add a non-null default for a
record type to exploit this path. The rowtypes test contains:
-- at the moment this will not work due to ALTER TABLE inadequacy:
alter table fullname add column suffix text default '';
ERROR:  cannot alter table "fullname" because column "people.fn" uses its row type

So it seems to me, that ExecEvalFieldStoreDeForm() can't be called for a
record/tuple having missing non-null attributes. And even if ALTER TABLE was
"adequate", does it mean that ExecEvalFieldStoreDeForm() would return default
values from the underlying table type?

> Worse, I fear there may be many other places with latent bugs of the
> same ilk.  Before the attmissingval code was added, one could assume
> that the result of heap_deform_tuple would hold good as long as you
> had a pin on the source tuple.  But now it depends on metadata as
> well, and I don't think we have a coherent story about that.

But is it true, that having a pin on tupdesc is enough to make sure that
the result of heap_deform_tuple() holds good?

If so, then probably we should find places, where tupdesc gets unpinned
before that result is saved somewhere.
I see the following callers of the heap_deform_tuple():
src/backend/replication/logical/reorderbuffer.c
     ReorderBufferToastReplace(): heap_deform_tuple() followed by heap_form_tuple()
src/backend/executor/execTuples.c
     ExecForceStoreHeapTuple(), ExecForceStoreMinimalTuple(), ExecStoreHeapTupleDatum(): use long-living and hopefully

pinned slot->tts_tupleDescriptor
src/backend/executor/execExprInterp.c
     ExecEvalFieldStoreDeForm(): the above question applies here
src/backend/executor/spi.c
     SPI_modifytuple(): heap_deform_tuple() followed by heap_form_tuple()
src/backend/utils/adt/rowtypes.c
     record_out(), hash_record(), hash_record_extended(): tupdesc released after extracting/processing data
     record_send(): tupdesc released after sending data
     record_cmp(), record_eq(), record_image_cmp(), record_image_eq(): tupdesc1. tupdesc2 released after comparing
data
src/backend/utils/adt/jsonfuncs.c
     populate_record(): heap_deform_tuple() followed by heap_form_tuple()
src/backend/utils/adt/expandedrecord.c
     deconstruct_expanded_record() uses long-living (pinned) erh->er_tupdesc
src/backend/utils/adt/ri_triggers.c
     RI_Initial_Check(), RI_PartitionRemove_Check(): use long-living? SPI_tuptable->tupdesc
src/backend/access/heap/heapam_handler.c
     reform_and_rewrite_tuple(): uses RelationGetDescr(OldHeap) (pinned, I suppose)
src/backend/access/heap/heaptoast.c
     heap_toast_delete(), heap_toast_insert_or_update() use long-living rel->rd_att;
     toast_flatten_tuple(): heap_deform_tuple(..., tupleDesc, ...) followed by heap_form_tuple(..., tupleDesc, ...)
     toast_flatten_tuple_to_datum(): heap_deform_tuple followed by detoasting attrs
src/backend/access/heap/heapam.c
     ExtractReplicaIdentity(): heap_deform_tuple(..., desc, ...) followed by heap_form_tuple(..., desc, ...)
src/backend/access/common/heaptuple.c
     heap_modify_tuple(): heap_deform_tuple(..., tupleDesc, ...) followed by heap_form_tuple(..., tupleDesc, ...)
     heap_modify_tuple_by_cols(): heap_deform_tuple(..., tupleDesc, ...) followed by heap_form_tuple(..., tupleDesc,
...)
src/backend/access/common/tupconvert.c
     execute_attr_map_tuple(): heap_deform_tuple() followed by heap_form_tuple()
src/test/regress/regress.c
     make_tuple_indirect(): heap_deform_tuple() followed by heap_form_tuple()
src/pl/plpgsql/src/pl_exec.c
     exec_move_row() called with tupdesc = NULL, SPI_tuptable->tupdesc, or tupdesc released after the call
contrib/hstore/hstore_io.c
     hstore_from_record(), hstore_populate_record(): tupdesc released after extracting data

And even if I missed some possibly problematic calls, maybe it's worth to
consider fixing exactly that places...

Also, besides heap_deform_tuple(), getmissingattr() is called from
heap_getattr(). And heap_getattr() is used in many places, but most of them
are protected too due to tupdesc pinned.
Two suspicious places that I found are GetAttributeByName() and
GetAttributeByNum(), so when using these functions you can really get the
missing attribute value with the tupdesc released. But what circumstances do
we need to end up with an invalid data?
1) Write a function that will call GetAttributeByName() and hold it's
  result inside for some time.
2) Add a passed-by-ref column with a default value to a table.
3) Pass to that function a tuple without the "missing" column (just
  "SELECT func(table) FROM table" won't do, but func(OLD) in a trigger should).
4) Trigger a relcache invalidation.
5) Perform a database access inside that function to process the invalidation.
6) Access the result of GetAttributeByName() after that.

Maybe we could construct such test case, e.g. add to pg_regress one more
SQL-accessible C function, bu9ccvhhihvcnb·t I wonder, is this the way to go?

On the other hand, probably there are extensions, that use
GetAttributeByName() in the non-safe way (as the function documentation
doesn't warn about such issues), but maybe just perform datumCopy inside
that function (and similar one(s))?
Though, perhaps I just don't notice the elephant in the core...

Best regards,
Alexander



On 2023-07-15 Sa 03:00, Alexander Lakhin wrote:
> Hello,
>
> 28.06.2023 21:49, Tom Lane wrote:
>> The bad news is that while investigating this I came across
>> another hazard that seems much messier to fix.  To wit, if
>> you have a tuple with "missing" pass-by-ref columns, then
>> some of the columns output by heap_deform_tuple() will
>> contain pointers into the tupdesc (cf. getmissingattr()).
>> That's not sufficient lifespan in the scenario we're dealing
>> with here: if the tupdesc gets trashed anytime before the
>> eventual ExecEvalFieldStoreForm(), we'll have garbage in the
>> result tuple.
>
> Please excuse me, as I'm definitely late to the party, but may I ask some
> questions to understand the issue discussed? I still can't see a 
> practical
> way to get garbage data in a tuple with missing columns, so maybe I'm
> missing something (except for columns).
>
> If we talk about ExecEvalFieldStoreDeForm() -> heap_deform_tuple() ->
> getmissingattr(), then I can't find a way to add a non-null default for a
> record type to exploit this path. The rowtypes test contains:
> -- at the moment this will not work due to ALTER TABLE inadequacy:
> alter table fullname add column suffix text default '';
> ERROR:  cannot alter table "fullname" because column "people.fn" uses 
> its row type
>
> So it seems to me, that ExecEvalFieldStoreDeForm() can't be called for a
> record/tuple having missing non-null attributes. And even if ALTER 
> TABLE was
> "adequate", does it mean that ExecEvalFieldStoreDeForm() would return 
> default
> values from the underlying table type?
>
>> Worse, I fear there may be many other places with latent bugs of the
>> same ilk.  Before the attmissingval code was added, one could assume
>> that the result of heap_deform_tuple would hold good as long as you
>> had a pin on the source tuple.  But now it depends on metadata as
>> well, and I don't think we have a coherent story about that.
>
> But is it true, that having a pin on tupdesc is enough to make sure that
> the result of heap_deform_tuple() holds good?
>
> If so, then probably we should find places, where tupdesc gets unpinned
> before that result is saved somewhere.
> I see the following callers of the heap_deform_tuple():
> src/backend/replication/logical/reorderbuffer.c
>     ReorderBufferToastReplace(): heap_deform_tuple() followed by 
> heap_form_tuple()
> src/backend/executor/execTuples.c
>     ExecForceStoreHeapTuple(), ExecForceStoreMinimalTuple(), 
> ExecStoreHeapTupleDatum(): use long-living and hopefully pinned 
> slot->tts_tupleDescriptor
> src/backend/executor/execExprInterp.c
>     ExecEvalFieldStoreDeForm(): the above question applies here
> src/backend/executor/spi.c
>     SPI_modifytuple(): heap_deform_tuple() followed by heap_form_tuple()
> src/backend/utils/adt/rowtypes.c
>     record_out(), hash_record(), hash_record_extended(): tupdesc 
> released after extracting/processing data
>     record_send(): tupdesc released after sending data
>     record_cmp(), record_eq(), record_image_cmp(), record_image_eq(): 
> tupdesc1. tupdesc2 released after comparing data
> src/backend/utils/adt/jsonfuncs.c
>     populate_record(): heap_deform_tuple() followed by heap_form_tuple()
> src/backend/utils/adt/expandedrecord.c
>     deconstruct_expanded_record() uses long-living (pinned) 
> erh->er_tupdesc
> src/backend/utils/adt/ri_triggers.c
>     RI_Initial_Check(), RI_PartitionRemove_Check(): use long-living? 
> SPI_tuptable->tupdesc
> src/backend/access/heap/heapam_handler.c
>     reform_and_rewrite_tuple(): uses RelationGetDescr(OldHeap) 
> (pinned, I suppose)
> src/backend/access/heap/heaptoast.c
>     heap_toast_delete(), heap_toast_insert_or_update() use long-living 
> rel->rd_att;
>     toast_flatten_tuple(): heap_deform_tuple(..., tupleDesc, ...) 
> followed by heap_form_tuple(..., tupleDesc, ...)
>     toast_flatten_tuple_to_datum(): heap_deform_tuple followed by 
> detoasting attrs
> src/backend/access/heap/heapam.c
>     ExtractReplicaIdentity(): heap_deform_tuple(..., desc, ...) 
> followed by heap_form_tuple(..., desc, ...)
> src/backend/access/common/heaptuple.c
>     heap_modify_tuple(): heap_deform_tuple(..., tupleDesc, ...) 
> followed by heap_form_tuple(..., tupleDesc, ...)
>     heap_modify_tuple_by_cols(): heap_deform_tuple(..., tupleDesc, 
> ...) followed by heap_form_tuple(..., tupleDesc, ...)
> src/backend/access/common/tupconvert.c
>     execute_attr_map_tuple(): heap_deform_tuple() followed by 
> heap_form_tuple()
> src/test/regress/regress.c
>     make_tuple_indirect(): heap_deform_tuple() followed by 
> heap_form_tuple()
> src/pl/plpgsql/src/pl_exec.c
>     exec_move_row() called with tupdesc = NULL, SPI_tuptable->tupdesc, 
> or tupdesc released after the call
> contrib/hstore/hstore_io.c
>     hstore_from_record(), hstore_populate_record(): tupdesc released 
> after extracting data
>
> And even if I missed some possibly problematic calls, maybe it's worth to
> consider fixing exactly that places...
>
> Also, besides heap_deform_tuple(), getmissingattr() is called from
> heap_getattr(). And heap_getattr() is used in many places, but most of 
> them
> are protected too due to tupdesc pinned.
> Two suspicious places that I found are GetAttributeByName() and
> GetAttributeByNum(), so when using these functions you can really get the
> missing attribute value with the tupdesc released. But what 
> circumstances do
> we need to end up with an invalid data?
> 1) Write a function that will call GetAttributeByName() and hold it's
>  result inside for some time.
> 2) Add a passed-by-ref column with a default value to a table.
> 3) Pass to that function a tuple without the "missing" column (just
>  "SELECT func(table) FROM table" won't do, but func(OLD) in a trigger 
> should).
> 4) Trigger a relcache invalidation.
> 5) Perform a database access inside that function to process the 
> invalidation.
> 6) Access the result of GetAttributeByName() after that.
>
> Maybe we could construct such test case, e.g. add to pg_regress one more
> SQL-accessible C function, bu9ccvhhihvcnb·t I wonder, is this the way 
> to go?
>
> On the other hand, probably there are extensions, that use
> GetAttributeByName() in the non-safe way (as the function documentation
> doesn't warn about such issues), but maybe just perform datumCopy inside
> that function (and similar one(s))?
> Though, perhaps I just don't notice the elephant in the core...


I was waiting to see if others had a reaction to this, but ...

I think the last point (possible unsafe use by extensions) is reason 
enough to proceed with the proposed mitigation, which is pretty simple 
and non-invasive..


cheers


andrew

-- 

Andrew Dunstan
EDB: https://www.enterprisedb.com




Hello,

On 2023-Jul-12, Andrew Dunstan wrote:

> +            entry = hash_search(missing_cache, &key, HASH_ENTER, &found);
> +
> +            if (!found)
> +            {
> +                /* cache miss, so we need a non-transient copy of the datum */
> +                oldctx = MemoryContextSwitchTo(TopMemoryContext);
> +                entry->value =
> +                    datumCopy(attrmiss->am_value, false, att->attlen);
> +                MemoryContextSwitchTo(oldctx);
> +            }

Hmm ... when exactly do these values get freed if no longer needed?  Is
the theory that leaking them is not relevant?

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Oh, great altar of passive entertainment, bestow upon me thy discordant images
at such speed as to render linear thought impossible" (Calvin a la TV)




On 2023-08-21 Mo 08:33, Alvaro Herrera wrote:
Hello,

On 2023-Jul-12, Andrew Dunstan wrote:

+			entry = hash_search(missing_cache, &key, HASH_ENTER, &found);
+
+			if (!found)
+			{
+				/* cache miss, so we need a non-transient copy of the datum */
+				oldctx = MemoryContextSwitchTo(TopMemoryContext);
+				entry->value =
+					datumCopy(attrmiss->am_value, false, att->attlen);
+				MemoryContextSwitchTo(oldctx);
+			}
Hmm ... when exactly do these values get freed if no longer needed?  Is
the theory that leaking them is not relevant?


Not sure I understand "relevant" here. They don't get freed. There will be at most one entry per row in pg_attribute where atthasmissing is true. I originally suggested cleaning them up at transaction end, but there was an argument that that might not make them sufficiently long-lived, and I don't know of any time more coarse grained when we can conveniently clean them up.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2023-08-21 Mo 08:33, Alvaro Herrera wrote:
>> Hmm ... when exactly do these values get freed if no longer needed?  Is
>> the theory that leaking them is not relevant?

> Not sure I understand "relevant" here. They don't get freed. There will 
> be at most one entry per row in pg_attribute where atthasmissing is 
> true.

Yeah.  My feeling is that as long as we don't make duplicate hashtable
entries, there will never be so many entries that it'd be worth the
cost and intellectual complexity of trying to clean them up.  When
and if that theory is disproven, we can think harder; but for now
I think this approach is good enough.

            regards, tom lane




On 2023-08-21 Mo 12:30, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 2023-08-21 Mo 08:33, Alvaro Herrera wrote:
Hmm ... when exactly do these values get freed if no longer needed?  Is
the theory that leaking them is not relevant?
Not sure I understand "relevant" here. They don't get freed. There will 
be at most one entry per row in pg_attribute where atthasmissing is 
true.
Yeah.  My feeling is that as long as we don't make duplicate hashtable
entries, there will never be so many entries that it'd be worth the
cost and intellectual complexity of trying to clean them up.  When
and if that theory is disproven, we can think harder; but for now
I think this approach is good enough.
			


Fix pushed to all relevant branches.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com