Re: aggregate crash - Mailing list pgsql-hackers

From Andres Freund
Subject Re: aggregate crash
Date
Msg-id 20200116031630.7ip7lid23tpyhna2@alap3.anarazel.de
Whole thread Raw
In response to Re: aggregate crash  (Andres Freund <andres@anarazel.de>)
Responses Re: aggregate crash  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Hi,

On 2020-01-15 12:47:47 -0800, Andres Freund wrote:
> FWIW, I'm working on narrowing it down to something small. I can
> reliably trigger the bug, and I understand the mechanics, I
> think. Interestingly enough the reproducer currently only triggers on
> v12, not on v11 and before.

That's just happenstance due to allocation changes in plpgsql,
though. The attached small reproducer, for me, reliably triggers crashes
on 10 - master.

It's hard to hit intentionally, because plpgsql does a datumCopy() to
its non-null return value, which means that to hit the bug, one needs
different numbers of allocations between setting up the transition value
with transvalueisnull = true, transvalue = 0xsomepointer (because
plpgsql doesn't copy NULLs), and the transition output with
transvalueisnull = false, transvalue = 0xsomepointer. Which is necessary
to trigger the bug, as it's then not reparented into a long lived enough
context. To be then freed/accessed for the next group input value.

I think this is too finnicky to actually keep as a regression test.

The bug, in a way, exists all the way back, but it's a bit harder to
create NULL values where the datum component isn't 0.


To fix I suggest we, in all branches, do the equivalent of adding
something like:
diff --git i/src/backend/executor/execExprInterp.c w/src/backend/executor/execExprInterp.c
index 790380051be..3260a63ac6b 100644
--- i/src/backend/executor/execExprInterp.c
+++ w/src/backend/executor/execExprInterp.c
@@ -4199,6 +4199,12 @@ ExecAggTransReparent(AggState *aggstate, AggStatePerTrans pertrans,
                                  pertrans->transtypeByVal,
                                  pertrans->transtypeLen);
     }
+    else
+    {
+        /* ensure datum component is 0 for NULL transition values */
+        newValue = (Datum) 0;
+    }
+
     if (!oldValueIsNull)
     {
         if (DatumIsReadWriteExpandedObject(oldValue,

and a comment explaining why it's (now) safe to rely on datum
comparisons for
    if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))


I don't think it makes sense to add something like it to the byval case
- there's plenty other ways a function returning != 0 with
fcinfo->isnull == true can cause such values to exist. And that's
longstanding.


A separate question is whether it's worth adding code to
e.g. EEO_CASE(EEOP_FUNCEXPR_STRICT) also resetting *op->resvalue to
(Datum) 0.  I don't personally don't think ensuring the datum is always
0 when isnull true is all that helpful, if we can't guarantee it
everywhere. So I'm a bit loathe to add cycles to places that don't need
it, and are hot.

Regards,

Andres

Attachment

pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: pgindent && weirdness
Next
From: Michael Paquier
Date:
Subject: Re: Setting min/max TLS protocol in clientside libpq