Thread: BUG #13128: Postgres deadlock on startup failure when max_prepared_transactions is not sufficiently high.

The following bug has been logged on the website:

Bug reference:      13128
Logged by:          Grant McAlister
Email address:      grant@amazon.com
PostgreSQL version: 9.3.6
Operating system:   Linux 64bit
Description:

Steps to reproduce

1.       Set max_prepared_transactions to 2 in postgresql.conf.
2.       start Postgres.
3.       Create two uncommitted prepared transactions: BEGIN; PREPARE
TRANSACTION ‘test1’; BEGIN; PREPARE TRANSACTION ‘test2’;
4.       Set max_prepared_transactions to 1 in postgresql.conf.
5.       Restart Postgres again.

At this point the startup will fail with a fatal but the postmaster process
keeps running.

LOG:  database system was interrupted; last known up at 2015-04-16 17:19:56
PDT
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  record with zero length at 0/1826C70
LOG:  redo is not required
LOG:  recovering prepared transaction 1685
LOG:  recovering prepared transaction 1683
FATAL:  maximum number of prepared transactions reached
HINT:  Increase max_prepared_transactions (currently 1).

Looks like their may be a LWLock that is not correctly getting released
which causes the process to hang rather then exit.
On Thu, Apr 23, 2015 at 7:09 AM,  <grant@amazon.com> wrote:
> 1.       Set max_prepared_transactions to 2 in postgresql.conf.
> 2.       start Postgres.
> 3.       Create two uncommitted prepared transactions: BEGIN; PREPARE
> TRANSACTION 'test1'; BEGIN; PREPARE TRANSACTION 'test2';
> 4.       Set max_prepared_transactions to 1 in postgresql.conf.
> 5.       Restart Postgres again.
>
> At this point the startup will fail with a fatal but the postmaster process
> keeps running.
>
> LOG:  database system was interrupted; last known up at 2015-04-16 17:19:56
> PDT
> LOG:  database system was not properly shut down; automatic recovery in
> progress
> LOG:  record with zero length at 0/1826C70
> LOG:  redo is not required
> LOG:  recovering prepared transaction 1685
> LOG:  recovering prepared transaction 1683
> FATAL:  maximum number of prepared transactions reached
> HINT:  Increase max_prepared_transactions (currently 1).
>
> Looks like their may be a LWLock that is not correctly getting released
> which causes the process to hang rather then exit.

Yep, the startup process remains stuck here, so we should release the
lock before issuing ERROR in twophase.c:
  * frame #0: 0x00007fff95004b72 libsystem_kernel.dylib`semop + 10
    frame #1: 0x000000010f59530f
postgres`PGSemaphoreLock(sema=0x00000001187a1190) + 63 at
pg_sema.c:387
    frame #2: 0x000000010f632af9
postgres`LWLockAcquireCommon(lock=0x000000010fc452e0,
mode=LW_EXCLUSIVE, valptr=0x0000000000000000, val=0) + 377 at
lwlock.c:1037
    frame #3: 0x000000010f63296b
postgres`LWLockAcquire(l=0x000000010fc452e0, mode=LW_EXCLUSIVE) + 43
at lwlock.c:900
    frame #4: 0x000000010f2f975d postgres`AtAbort_Twophase + 93 at
twophase.c:284
    frame #5: 0x000000010f2f9f24 postgres`AtProcExit_Twophase(code=1,
arg=0) + 20 at twophase.c:246

The patch attached fixes the problem for me, and I think that this
should be backpatched as well.
(Note to self: check the other calls of LWLockAcquire/Release to see
if there are other code paths in the same situation).
Regards,
--
Michael

Attachment
On 04/23/2015 04:20 AM, Michael Paquier wrote:
> On Thu, Apr 23, 2015 at 7:09 AM,  <grant@amazon.com> wrote:
>> 1.       Set max_prepared_transactions to 2 in postgresql.conf.
>> 2.       start Postgres.
>> 3.       Create two uncommitted prepared transactions: BEGIN; PREPARE
>> TRANSACTION 'test1'; BEGIN; PREPARE TRANSACTION 'test2';
>> 4.       Set max_prepared_transactions to 1 in postgresql.conf.
>> 5.       Restart Postgres again.
>>
>> At this point the startup will fail with a fatal but the postmaster process
>> keeps running.
>>
>> LOG:  database system was interrupted; last known up at 2015-04-16 17:19:56
>> PDT
>> LOG:  database system was not properly shut down; automatic recovery in
>> progress
>> LOG:  record with zero length at 0/1826C70
>> LOG:  redo is not required
>> LOG:  recovering prepared transaction 1685
>> LOG:  recovering prepared transaction 1683
>> FATAL:  maximum number of prepared transactions reached
>> HINT:  Increase max_prepared_transactions (currently 1).
>>
>> Looks like their may be a LWLock that is not correctly getting released
>> which causes the process to hang rather then exit.
>
> Yep, the startup process remains stuck here, so we should release the
> lock before issuing ERROR in twophase.c:

As Tom noted in the other thread, that is not generally required.
LWLockReleaseAll() is call during process exit.

Hmm. What happens here is that when the startup process is about to
exit, because of the ERROR, it calls the shmem-exit hooks.
AtProcExit_Twophase sees that MyLockedGxact != NULL, and it tries to
clear MyLockedGxact->locking_backend to release the entry.

That's bogus. MyLockedGxact != NULL indicates that the backend is
currently operating on the entry. But that's not the case at that point
during RecoverPreparedTransactions(). It has already completed
recovering the previous transaction, and has not yet locked a
GlobalTransaction entry for the next one.

RecoverPreparedTransactions() should clear MyLockedGxact after it has
recovered the transaction, after the StandbyReleaseLockTree() call.
That's analogous to the normal PREPARE TRANSACTION path, in
PrepareTransaction(), where we call PostPrepare_Twophase() after
releasing the locks.

Attached is a patch for that. I introduced this bug in commit
bb38fb0d43c8d7ff54072bfd8bd63154e536b384, which added the proc-exit hook
and the call PostPrepare_Twophase(), so unless you see more bugs in
this, I'll commit and backpatch this to all versions.

- Heikki


Attachment
On 04/23/2015 10:32 AM, Heikki Linnakangas wrote:
> Attached is a patch for that. I introduced this bug in commit
> bb38fb0d43c8d7ff54072bfd8bd63154e536b384, which added the proc-exit hook
> and the call PostPrepare_Twophase(), so unless you see more bugs in
> this, I'll commit and backpatch this to all versions.

Committed.

- Heikki