Re: aggregate crash - Mailing list pgsql-hackers

From Andres Freund
Subject Re: aggregate crash
Date
Msg-id 20200114224059.5itu2ejoutayh2re@alap3.anarazel.de
Whole thread Raw
In response to Re: aggregate crash  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: aggregate crash
List pgsql-hackers
Hi,

On 2020-01-14 17:01:01 -0500, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > On 2019-Dec-27, Teodor Sigaev wrote:
> >> Found crash on production instance, assert-enabled build crashes in pfree()
> >> call, with default config. v11, v12 and head are affected, but, seems, you
> >> need to be a bit lucky.
> 
> > Is this bug being considered for the next set of minors?
> 
> I think Andres last touched that code, so I was sort of expecting
> him to have an opinion on this.

Well, I commented a few days ago, also asking for further input...

To me it looks like that code has effectively been the same for quite a
while.  While today the code is:

            newVal = FunctionCallInvoke(fcinfo);

            /*
             * For pass-by-ref datatype, must copy the new value into
             * aggcontext and free the prior transValue.  But if transfn
             * returned a pointer to its first input, we don't need to do
             * anything.  Also, if transfn returned a pointer to a R/W
             * expanded object that is already a child of the aggcontext,
             * assume we can adopt that value without copying it.
             */
            if (DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
                newVal = ExecAggTransReparent(aggstate, pertrans,
                                              newVal, fcinfo->isnull,
                                              pergroup->transValue,
                                              pergroup->transValueIsNull);
...
ExecAggTransReparent(AggState *aggstate, AggStatePerTrans pertrans,
                     Datum newValue, bool newValueIsNull,
                     Datum oldValue, bool oldValueIsNull)
...
    if (!newValueIsNull)
    {
        MemoryContextSwitchTo(aggstate->curaggcontext->ecxt_per_tuple_memory);
        if (DatumIsReadWriteExpandedObject(newValue,
                                           false,
                                           pertrans->transtypeLen) &&
            MemoryContextGetParent(DatumGetEOHP(newValue)->eoh_context) == CurrentMemoryContext)
             /* do nothing */ ;
        else
            newValue = datumCopy(newValue,
                                 pertrans->transtypeByVal,
                                 pertrans->transtypeLen);
    }
    if (!oldValueIsNull)
    {
        if (DatumIsReadWriteExpandedObject(oldValue,
                                           false,
                                           pertrans->transtypeLen))
            DeleteExpandedObject(oldValue);
        else
            pfree(DatumGetPointer(oldValue));
    }

before it was (in v10):

    if (!pertrans->transtypeByVal &&
        DatumGetPointer(newVal) != DatumGetPointer(pergroupstate->transValue))
    {
        if (!fcinfo->isnull)
        {
            MemoryContextSwitchTo(aggstate->curaggcontext->ecxt_per_tuple_memory);
            if (DatumIsReadWriteExpandedObject(newVal,
                                               false,
                                               pertrans->transtypeLen) &&
                MemoryContextGetParent(DatumGetEOHP(newVal)->eoh_context) == CurrentMemoryContext)
                 /* do nothing */ ;
            else
                newVal = datumCopy(newVal,
                                   pertrans->transtypeByVal,
                                   pertrans->transtypeLen);
        }
        if (!pergroupstate->transValueIsNull)
        {
            if (DatumIsReadWriteExpandedObject(pergroupstate->transValue,
                                               false,
                                               pertrans->transtypeLen))
                DeleteExpandedObject(pergroupstate->transValue);
            else
                pfree(DatumGetPointer(pergroupstate->transValue));
        }
    }

there's no need in the current code to check !pertrans->transtypeByVal,
as byval has a separate expression opcode.  So I don't think things have
changed?

As far as I can tell, comparing the values by pointer goes back a *long*
while. We didn't use to handle expanded objects, but otherwise it looked
pretty similar back to 7.4 (oldest version I've checked out):

    newVal = FunctionCallInvoke(&fcinfo);

    /*
     * If pass-by-ref datatype, must copy the new value into aggcontext
     * and pfree the prior transValue.    But if transfn returned a pointer
     * to its first input, we don't need to do anything.
     */
    if (!peraggstate->transtypeByVal &&
    DatumGetPointer(newVal) != DatumGetPointer(pergroupstate->transValue))
    {
        if (!fcinfo.isnull)
        {
            MemoryContextSwitchTo(aggstate->aggcontext);
            newVal = datumCopy(newVal,
                               peraggstate->transtypeByVal,
                               peraggstate->transtypeLen);
        }
        if (!pergroupstate->transValueIsNull)
            pfree(DatumGetPointer(pergroupstate->transValue));
    }


> But I agree that not checking null-ness
> explicitly is kind of unsafe.  We've never before had any expectation
> that the Datum value of a null is anything in particular.

I'm still not sure I actually fully understand the bug. It's obvious how
returning the input value again could lead to memory not being freed (so
that leak seems to go all the way back). And similarly, since the
introduction of expanded objects, it can also lead to the expanded
object not being deleted.

But that's not the problem causing the crash here. What I think must
instead be the problem is that pergroupstate->transValueIsNull, but
pergroupstate->transValue is set to something looking like a
pointer. Which caused us not to datumCopy() a new transition value into
a long lived context. and then a later transition causes us to free the
short-lived value?

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: planner support functions: handle GROUP BY estimates ?
Next
From: Tom Lane
Date:
Subject: Re: Add FOREIGN to ALTER TABLE in pg_dump