Re: crash with assertions and WAL_DEBUG - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: crash with assertions and WAL_DEBUG
Date
Msg-id 53AC1E69.3030802@vmware.com
Whole thread Raw
In response to Re: crash with assertions and WAL_DEBUG  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On 06/24/2014 05:47 PM, Alvaro Herrera wrote:
>>> diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
>>> index 3c1c81a..4264373 100644
>>> --- a/src/backend/storage/smgr/md.c
>>> +++ b/src/backend/storage/smgr/md.c
>>> @@ -219,6 +219,16 @@ mdinit(void)
>>>                                         &hash_ctl,
>>>                                      HASH_ELEM | HASH_FUNCTION | HASH_CONTEXT);
>>>           pendingUnlinks = NIL;
>>> +
>>> +        /*
>>> +         * XXX: The checkpointer needs to add entries to the pending ops
>>> +         * table when absorbing fsync requests. That is done within a critical
>>> +         * section. It means that there's a theoretical possibility that you
>>> +         * run out of memory while absorbing fsync requests, which leads to
>>> +         * a PANIC. Fortunately the hash table is small so that's unlikely to
>>> +         * happen in practice.
>>> +         */
>>> +        MemoryContextAllowInCriticalSection(MdCxt, true);
>>>       }
>>>   }
>>
>> Isn't that allowing a bit too much? We e.g. shouldn't allow
>> _fdvec_alloc() within a crritical section. Might make sense to create a
>> child context for it.
>
> I agree.

Ok. That leaves nothing but _fdvec_alloc() in MdCxt.

> Rahila Syed wrote:
>
>> The patch on compilation gives following error,
>>
>> mcxt.c: In function ‘MemoryContextAllowInCriticalSection’:
>> mcxt.c:322: error: ‘struct MemoryContextData’ has no member named
>> ‘allowInCriticalSection’
>>
>> The member in MemoryContextData is defined as 'allowInCritSection' while
>> the MemoryContextAllowInCriticalSection accesses the field as
>> 'context->allowInCriticalSection'.
>
> It appears Heikki did a search'n replace for "->allowInCritSection"
> before submitting, which failed to match the struct declaration.

Uh, looks like I failed to test this at all with assertions enabled.
Once you fix that error, you get a lot of assertion failures :-(.

Attached is a new attempt:

* walDebugCxt is now created at backend startup (in XLOGShmemInit()).
This fixes the issue that Andres pointed out, that the context creation
would PANIC if the first XLogInsert in a backend happens within a
critical section. LWLOCK_STATS had the same issue, it would fail if the
first LWLock acquisition in a process happens within a critical section.
I fixed that by adding an explicit InitLWLockAccess() function and moved
the initialization there.

* hash_create also creates a new memory context within the given
context. That new context needs to inherit the allowInCritSection flag,
otherwise allocating entries in the hash will still fail. Both
LWLOCK_STATS and the pending ops hash table in the checkpointer had this
issue.

* Checkpointer does one palloc to allocate private space to copy the
requests to, in addition to using the hash table. I moved that palloc to
before entering the critical section.

- Heikki

Attachment

pgsql-hackers by date:

Previous
From: "MauMau"
Date:
Subject: Re: [Fwd: Re: proposal: new long psql parameter --on-error-stop]
Next
From: Pavel Stehule
Date:
Subject: Re: [Fwd: Re: proposal: new long psql parameter --on-error-stop]