Thread: crash with assertions and WAL_DEBUG

crash with assertions and WAL_DEBUG

From
Alvaro Herrera
Date:
I noticed that HEAD crashes at startup with assertions disabled and
WAL_DEBUG turned on:

#2  0x00000000007af987 in ExceptionalCondition (
    conditionName=conditionName@entry=0x974448 "!(CritSectionCount == 0 || (CurrentMemoryContext) == ErrorContext ||
(MyAuxProcType== CheckpointerProcess))",  
    errorType=errorType@entry=0x7e9bec "FailedAssertion",
    fileName=fileName@entry=0x974150 "/pgsql/source/minmax/src/backend/utils/mmgr/mcxt.c",
    lineNumber=lineNumber@entry=670) at /pgsql/source/minmax/src/backend/utils/error/assert.c:54
#3  0x00000000007d2080 in palloc (size=1024) at /pgsql/source/minmax/src/backend/utils/mmgr/mcxt.c:670
#4  0x00000000005f3abe in initStringInfo (str=str@entry=0x7fff11f68a80)
    at /pgsql/source/minmax/src/backend/lib/stringinfo.c:50
#5  0x00000000004f42ca in XLogInsert (rmid=48 '0', rmid@entry=0 '\000', info=<optimized out>,
    info@entry=0 '\000', rdata=rdata@entry=0x7fff11f68d30)
    at /pgsql/source/minmax/src/backend/access/transam/xlog.c:1262
#6  0x00000000004f4f17 in CreateCheckPoint (flags=flags@entry=6)
    at /pgsql/source/minmax/src/backend/access/transam/xlog.c:8197
#7  0x00000000004f8079 in StartupXLOG () at /pgsql/source/minmax/src/backend/access/transam/xlog.c:7097


I'm using an interim fix by excepting the startup process (see attached
patch), like the current code does for checkpointer, but I guess a more
robust solution should be sought.

I find it strange that nobody has seen this before.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: crash with assertions and WAL_DEBUG

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> I noticed that HEAD crashes at startup with assertions disabled and
> WAL_DEBUG turned on:

> I'm using an interim fix by excepting the startup process (see attached
> patch), like the current code does for checkpointer, but I guess a more
> robust solution should be sought.

Spoke too soon.  This also causes trouble when any process inserts an
xlog record and tries to print it, of course, not just startup.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: crash with assertions and WAL_DEBUG

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I noticed that HEAD crashes at startup with assertions disabled and
> WAL_DEBUG turned on:

I'm beginning to think we're going to have to give up on that
no-pallocs-in-critical-sections Assert.  It was useful to catch
unnecessarily-dangerous allocations in mainline cases, but getting rid
of every last corner-case palloc is looking to be, if not impossible,
at least a lot more trouble than it is worth.
        regards, tom lane



Re: crash with assertions and WAL_DEBUG

From
Andres Freund
Date:
On 2014-06-14 16:53:12 -0400, Alvaro Herrera wrote:
> Alvaro Herrera wrote:
> > I noticed that HEAD crashes at startup with assertions disabled and
> > WAL_DEBUG turned on:
> 
> > I'm using an interim fix by excepting the startup process (see attached
> > patch), like the current code does for checkpointer, but I guess a more
> > robust solution should be sought.
> 
> Spoke too soon.  This also causes trouble when any process inserts an
> xlog record and tries to print it, of course, not just startup.

I personally think we should just remove WAL_DEBUG, but I guess I won't
convince others of that... I think for now the least bad solution is to
add a #ifndef WAL_DEBUG around that assertion.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: crash with assertions and WAL_DEBUG

From
Andres Freund
Date:
On 2014-06-14 16:57:33 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > I noticed that HEAD crashes at startup with assertions disabled and
> > WAL_DEBUG turned on:
> 
> I'm beginning to think we're going to have to give up on that
> no-pallocs-in-critical-sections Assert.  It was useful to catch
> unnecessarily-dangerous allocations in mainline cases, but getting rid
> of every last corner-case palloc is looking to be, if not impossible,
> at least a lot more trouble than it is worth.

I think we at least need to remove it from 9.4. We shouldn't release
with an assertion that still regularly triggers in more or less
'harmless' situations.
I think it might be worthwile to keep it in master to help maintain the
rule against allocations in critical sections. And perhaps as a reminder
that e.g. the checkpointer is doing bad things...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: crash with assertions and WAL_DEBUG

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-06-14 16:57:33 -0400, Tom Lane wrote:
>> I'm beginning to think we're going to have to give up on that
>> no-pallocs-in-critical-sections Assert.  It was useful to catch
>> unnecessarily-dangerous allocations in mainline cases, but getting rid
>> of every last corner-case palloc is looking to be, if not impossible,
>> at least a lot more trouble than it is worth.

> I think we at least need to remove it from 9.4. We shouldn't release
> with an assertion that still regularly triggers in more or less
> 'harmless' situations.

Yeah, it's certainly not looking like it's a production-ready check,
even bearing in mind that we recommend against enabling asserts in
production.

> I think it might be worthwile to keep it in master to help maintain the
> rule against allocations in critical sections. And perhaps as a reminder
> that e.g. the checkpointer is doing bad things...

I've looked at the checkpointer issue, and I don't actually think it's
particularly unsafe there, mainly because the checkpointer's memory
usage is pretty well bounded.  I don't like the hacky solution of
excepting *all* pallocs in the checkpointer process, though.

I wonder whether we could set up some sort of marking of specific pallocs
as being considered OK for the purposes of this assert.

Another thought: the checkpointer has a guaranteed maximum on the size
of the palloc in AbsorbFsyncRequests, viz CheckpointerShmem->max_requests
... maybe we should just make it prealloc that much space and be done.
However, I wouldn't bother unless we are committed to keeping the Assert.
        regards, tom lane



Re: crash with assertions and WAL_DEBUG

From
Alvaro Herrera
Date:
Andres Freund wrote:
> On 2014-06-14 16:57:33 -0400, Tom Lane wrote:
> > Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > > I noticed that HEAD crashes at startup with assertions disabled and
> > > WAL_DEBUG turned on:
> > 
> > I'm beginning to think we're going to have to give up on that
> > no-pallocs-in-critical-sections Assert.  It was useful to catch
> > unnecessarily-dangerous allocations in mainline cases, but getting rid
> > of every last corner-case palloc is looking to be, if not impossible,
> > at least a lot more trouble than it is worth.
> 
> I think we at least need to remove it from 9.4. We shouldn't release
> with an assertion that still regularly triggers in more or less
> 'harmless' situations.

Yeah, removing it in 9.4 is likely a good idea -- we have an open item
about it in connection with LWLOCK_DEBUG, and now this.  Who knows what
other debugging features will cause trouble.

> I think it might be worthwile to keep it in master to help maintain the
> rule against allocations in critical sections. And perhaps as a reminder
> that e.g. the checkpointer is doing bad things...

I also agree with keeping it in 9.5.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: crash with assertions and WAL_DEBUG

From
Heikki Linnakangas
Date:
On 06/15/2014 12:26 AM, Alvaro Herrera wrote:
> Andres Freund wrote:
>> On 2014-06-14 16:57:33 -0400, Tom Lane wrote:
>>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>>> I noticed that HEAD crashes at startup with assertions disabled and
>>>> WAL_DEBUG turned on:
>>>
>>> I'm beginning to think we're going to have to give up on that
>>> no-pallocs-in-critical-sections Assert.  It was useful to catch
>>> unnecessarily-dangerous allocations in mainline cases, but getting rid
>>> of every last corner-case palloc is looking to be, if not impossible,
>>> at least a lot more trouble than it is worth.
>>
>> I think we at least need to remove it from 9.4. We shouldn't release
>> with an assertion that still regularly triggers in more or less
>> 'harmless' situations.
>
> Yeah, removing it in 9.4 is likely a good idea -- we have an open item
> about it in connection with LWLOCK_DEBUG, and now this.  Who knows what
> other debugging features will cause trouble.

Agreed, I'll remove it from REL9_4_STABLE.

>> I think it might be worthwile to keep it in master to help maintain the
>> rule against allocations in critical sections. And perhaps as a reminder
>> that e.g. the checkpointer is doing bad things...
>
> I also agree with keeping it in 9.5.

Yeah.

Now, to fix the case at hand, the quickest fix would be to allocate the 
messages in ErrorContext, which is already exempt from assertion. That's 
pretty hacky, though. Tom's suggestion to somehow mark specific pallocs 
as OK seems cleaner.

It's a bit difficult to attach the mark to the palloc calls, as neither 
the WAL_DEBUG or LWLOCK_STATS code is calling palloc directly, but 
marking specific MemoryContexts as sanctioned ought to work. I'll take a 
stab at that.

- Heikki




Re: crash with assertions and WAL_DEBUG

From
Heikki Linnakangas
Date:
On 06/21/2014 01:58 PM, Heikki Linnakangas wrote:
> It's a bit difficult to attach the mark to the palloc calls, as neither
> the WAL_DEBUG or LWLOCK_STATS code is calling palloc directly, but
> marking specific MemoryContexts as sanctioned ought to work. I'll take a
> stab at that.

I came up with the attached patch. It adds a function called
MemoryContextAllowInCriticalSection(), which can be used to exempt
specific memory contexts from the assertion. The following contexts are
exempted:

* ErrorContext
* MdCxt, which is used in checkpointer to absorb fsync requests. (the
checkpointer process as a whole is no longer exempt)
* The temporary StringInfos used in WAL_DEBUG (a new memory "WAL Debug"
context is now created for them)
* LWLock stats hash table (a new "LWLock stats" context is created for it)

Barring objections, I'll commit this to master, and remove the assertion
from REL9_4_STABLE.

- Heikki


Attachment

Re: crash with assertions and WAL_DEBUG

From
Andres Freund
Date:
On 2014-06-23 12:58:19 +0300, Heikki Linnakangas wrote:
> On 06/21/2014 01:58 PM, Heikki Linnakangas wrote:
> >It's a bit difficult to attach the mark to the palloc calls, as neither
> >the WAL_DEBUG or LWLOCK_STATS code is calling palloc directly, but
> >marking specific MemoryContexts as sanctioned ought to work. I'll take a
> >stab at that.
> 
> I came up with the attached patch. It adds a function called
> MemoryContextAllowInCriticalSection(), which can be used to exempt specific
> memory contexts from the assertion. The following contexts are exempted:
> 
> * ErrorContext
> * MdCxt, which is used in checkpointer to absorb fsync requests. (the
> checkpointer process as a whole is no longer exempt)
> * The temporary StringInfos used in WAL_DEBUG (a new memory "WAL Debug"
> context is now created for them)
> * LWLock stats hash table (a new "LWLock stats" context is created for it)
> 
> Barring objections, I'll commit this to master, and remove the assertion
> from REL9_4_STABLE.

Sounds generally sane. Some comments:

> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index abc5682..e141a28 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -60,6 +60,7 @@
>  #include "storage/spin.h"
>  #include "utils/builtins.h"
>  #include "utils/guc.h"
> +#include "utils/memutils.h"
>  #include "utils/ps_status.h"
>  #include "utils/relmapper.h"
>  #include "utils/snapmgr.h"
> @@ -1258,6 +1259,25 @@ begin:;
>      if (XLOG_DEBUG)
>      {
>          StringInfoData buf;
> +        static MemoryContext walDebugCxt = NULL;
> +        MemoryContext oldCxt;
> +
> +        /*
> +         * Allocations within a critical section are normally not allowed,
> +         * because allocation failure would lead to a PANIC. But this is just
> +         * debugging code that no-one is going to enable in production, so we
> +         * don't care. Use a memory context that's exempt from the rule.
> +         */
> +        if (walDebugCxt == NULL)
> +        {
> +            walDebugCxt = AllocSetContextCreate(TopMemoryContext,
> +                                                "WAL Debug",
> +                                                ALLOCSET_DEFAULT_MINSIZE,
> +                                                ALLOCSET_DEFAULT_INITSIZE,
> +                                                ALLOCSET_DEFAULT_MAXSIZE);
> +            MemoryContextAllowInCriticalSection(walDebugCxt, true);
> +        }
> +        oldCxt = MemoryContextSwitchTo(walDebugCxt);

This will only work though if the first XLogInsert() isn't called from a
critical section. I'm not sure it's a good idea to rely on that.

> diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
> index 3c1c81a..4264373 100644
> --- a/src/backend/storage/smgr/md.c
> +++ b/src/backend/storage/smgr/md.c
> @@ -219,6 +219,16 @@ mdinit(void)
>                                        &hash_ctl,
>                                     HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
>          pendingUnlinks = NIL;
> +
> +        /*
> +         * XXX: The checkpointer needs to add entries to the pending ops
> +         * table when absorbing fsync requests. That is done within a critical
> +         * section. It means that there's a theoretical possibility that you
> +         * run out of memory while absorbing fsync requests, which leads to
> +         * a PANIC. Fortunately the hash table is small so that's unlikely to
> +         * happen in practice.
> +         */
> +        MemoryContextAllowInCriticalSection(MdCxt, true);
>      }
>  }

Isn't that allowing a bit too much? We e.g. shouldn't allow
_fdvec_alloc() within a crritical section. Might make sense to create a
child context for it.

Greetings,

Andres Freund



Re: crash with assertions and WAL_DEBUG

From
Rahila Syed
Date:
Hello,

The patch on compilation gives following error,

mcxt.c: In function ‘MemoryContextAllowInCriticalSection’:
mcxt.c:322: error: ‘struct MemoryContextData’ has no member named ‘allowInCriticalSection’

The member in MemoryContextData is defined as 'allowInCritSection' while the MemoryContextAllowInCriticalSection accesses the field as 'context->allowInCriticalSection'.


Thank you,





On Mon, Jun 23, 2014 at 3:28 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 06/21/2014 01:58 PM, Heikki Linnakangas wrote:
It's a bit difficult to attach the mark to the palloc calls, as neither
the WAL_DEBUG or LWLOCK_STATS code is calling palloc directly, but
marking specific MemoryContexts as sanctioned ought to work. I'll take a
stab at that.

I came up with the attached patch. It adds a function called MemoryContextAllowInCriticalSection(), which can be used to exempt specific memory contexts from the assertion. The following contexts are exempted:

* ErrorContext
* MdCxt, which is used in checkpointer to absorb fsync requests. (the checkpointer process as a whole is no longer exempt)
* The temporary StringInfos used in WAL_DEBUG (a new memory "WAL Debug" context is now created for them)
* LWLock stats hash table (a new "LWLock stats" context is created for it)

Barring objections, I'll commit this to master, and remove the assertion from REL9_4_STABLE.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: crash with assertions and WAL_DEBUG

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> On 06/21/2014 01:58 PM, Heikki Linnakangas wrote:
> >It's a bit difficult to attach the mark to the palloc calls, as neither
> >the WAL_DEBUG or LWLOCK_STATS code is calling palloc directly, but
> >marking specific MemoryContexts as sanctioned ought to work. I'll take a
> >stab at that.
> 
> I came up with the attached patch. It adds a function called
> MemoryContextAllowInCriticalSection(), which can be used to exempt
> specific memory contexts from the assertion. The following contexts
> are exempted:

There is a typo in the comment to that function, "This functions can be
used", s/functions/function/

Andres Freund wrote:

> > @@ -1258,6 +1259,25 @@ begin:;
> >      if (XLOG_DEBUG)
> >      {
> >          StringInfoData buf;
> > +        static MemoryContext walDebugCxt = NULL;
> > +        MemoryContext oldCxt;
> > +
> > +        /*
> > +         * Allocations within a critical section are normally not allowed,
> > +         * because allocation failure would lead to a PANIC. But this is just
> > +         * debugging code that no-one is going to enable in production, so we
> > +         * don't care. Use a memory context that's exempt from the rule.
> > +         */
> > +        if (walDebugCxt == NULL)
> > +        {
> > +            walDebugCxt = AllocSetContextCreate(TopMemoryContext,
> > +                                                "WAL Debug",
> > +                                                ALLOCSET_DEFAULT_MINSIZE,
> > +                                                ALLOCSET_DEFAULT_INITSIZE,
> > +                                                ALLOCSET_DEFAULT_MAXSIZE);
> > +            MemoryContextAllowInCriticalSection(walDebugCxt, true);
> > +        }
> > +        oldCxt = MemoryContextSwitchTo(walDebugCxt);
> 
> This will only work though if the first XLogInsert() isn't called from a
> critical section. I'm not sure it's a good idea to rely on that.

Ah, true -- AllocSetContextCreate cannot be called from within a
critical section.

> > diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
> > index 3c1c81a..4264373 100644
> > --- a/src/backend/storage/smgr/md.c
> > +++ b/src/backend/storage/smgr/md.c
> > @@ -219,6 +219,16 @@ mdinit(void)
> >                                        &hash_ctl,
> >                                     HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
> >          pendingUnlinks = NIL;
> > +
> > +        /*
> > +         * XXX: The checkpointer needs to add entries to the pending ops
> > +         * table when absorbing fsync requests. That is done within a critical
> > +         * section. It means that there's a theoretical possibility that you
> > +         * run out of memory while absorbing fsync requests, which leads to
> > +         * a PANIC. Fortunately the hash table is small so that's unlikely to
> > +         * happen in practice.
> > +         */
> > +        MemoryContextAllowInCriticalSection(MdCxt, true);
> >      }
> >  }
> 
> Isn't that allowing a bit too much? We e.g. shouldn't allow
> _fdvec_alloc() within a crritical section. Might make sense to create a
> child context for it.

I agree.

Rahila Syed wrote:

> The patch on compilation gives following error,
> 
> mcxt.c: In function ‘MemoryContextAllowInCriticalSection’:
> mcxt.c:322: error: ‘struct MemoryContextData’ has no member named
> ‘allowInCriticalSection’
> 
> The member in MemoryContextData is defined as 'allowInCritSection' while
> the MemoryContextAllowInCriticalSection accesses the field as
> 'context->allowInCriticalSection'.

It appears Heikki did a search'n replace for "->allowInCritSection"
before submitting, which failed to match the struct declaration.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: crash with assertions and WAL_DEBUG

From
Heikki Linnakangas
Date:
On 06/24/2014 05:47 PM, Alvaro Herrera wrote:
>>> diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
>>> index 3c1c81a..4264373 100644
>>> --- a/src/backend/storage/smgr/md.c
>>> +++ b/src/backend/storage/smgr/md.c
>>> @@ -219,6 +219,16 @@ mdinit(void)
>>>                                         &hash_ctl,
>>>                                      HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
>>>           pendingUnlinks = NIL;
>>> +
>>> +        /*
>>> +         * XXX: The checkpointer needs to add entries to the pending ops
>>> +         * table when absorbing fsync requests. That is done within a critical
>>> +         * section. It means that there's a theoretical possibility that you
>>> +         * run out of memory while absorbing fsync requests, which leads to
>>> +         * a PANIC. Fortunately the hash table is small so that's unlikely to
>>> +         * happen in practice.
>>> +         */
>>> +        MemoryContextAllowInCriticalSection(MdCxt, true);
>>>       }
>>>   }
>>
>> Isn't that allowing a bit too much? We e.g. shouldn't allow
>> _fdvec_alloc() within a crritical section. Might make sense to create a
>> child context for it.
>
> I agree.

Ok. That leaves nothing but _fdvec_alloc() in MdCxt.

> Rahila Syed wrote:
>
>> The patch on compilation gives following error,
>>
>> mcxt.c: In function ‘MemoryContextAllowInCriticalSection’:
>> mcxt.c:322: error: ‘struct MemoryContextData’ has no member named
>> ‘allowInCriticalSection’
>>
>> The member in MemoryContextData is defined as 'allowInCritSection' while
>> the MemoryContextAllowInCriticalSection accesses the field as
>> 'context->allowInCriticalSection'.
>
> It appears Heikki did a search'n replace for "->allowInCritSection"
> before submitting, which failed to match the struct declaration.

Uh, looks like I failed to test this at all with assertions enabled.
Once you fix that error, you get a lot of assertion failures :-(.

Attached is a new attempt:

* walDebugCxt is now created at backend startup (in XLOGShmemInit()).
This fixes the issue that Andres pointed out, that the context creation
would PANIC if the first XLogInsert in a backend happens within a
critical section. LWLOCK_STATS had the same issue, it would fail if the
first LWLock acquisition in a process happens within a critical section.
I fixed that by adding an explicit InitLWLockAccess() function and moved
the initialization there.

* hash_create also creates a new memory context within the given
context. That new context needs to inherit the allowInCritSection flag,
otherwise allocating entries in the hash will still fail. Both
LWLOCK_STATS and the pending ops hash table in the checkpointer had this
issue.

* Checkpointer does one palloc to allocate private space to copy the
requests to, in addition to using the hash table. I moved that palloc to
before entering the critical section.

- Heikki

Attachment