Hi hackers,
While working on wait event related stuff I observed a failed assertion:
"
TRAP: failed Assert("node->next == 0 && node->prev == 0"), File: "../../../../src/include/storage/proclist.h", Line:
91
"
during pg_regress/database.
To reproduce, add an ereport(LOG,..) or CHECK_FOR_INTERRUPTS() or whatever
would trigger ConditionVariableBroadcast() in pgstat_report_wait_end():
And launch:
postgres=# CREATE TABLESPACE regress_tblspace LOCATION '/home/postgres/bdttbs';
CREATE TABLESPACE
postgres=# create database regression_bdt9;
CREATE DATABASE
postgres=# ALTER DATABASE regression_bdt9 SET TABLESPACE regress_tblspace;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
The issue is that in ConditionVariableTimedSleep() during the WaitLatch() we’ll
also trigger ProcessInterrupts() that will call ConditionVariableCancelSleep()
that will:
- remove the process from the wait list
- and also set cv_sleep_target to NULL.
Then in ConditionVariableTimedSleep(), we’ll go through:
“
/*
* If this process has been taken out of the wait list, then we know
* that it has been signaled by ConditionVariableSignal (or
* ConditionVariableBroadcast), so we should return to the caller. But
* that doesn't guarantee that the exit condition is met, only that we
* ought to check it. So we must put the process back into the wait
* list, to ensure we don't miss any additional wakeup occurring while
* the caller checks its exit condition. We can take ourselves out of
* the wait list only when the caller calls
* ConditionVariableCancelSleep.
*
* If we're still in the wait list, then the latch must have been set
* by something other than ConditionVariableSignal; though we don't
* guarantee not to return spuriously, we'll avoid this obvious case.
*/
SpinLockAcquire(lock: &cv->mutex);
if (!proclist_contains(&cv->wakeup, MyProcNumber, cvWaitLink))
{
done = true;
proclist_push_tail(&cv->wakeup, MyProcNumber, cvWaitLink);
}
SpinLockRelease(&cv->mutex);
“
Means we re-add the process in the wait list. The issue is that cv_sleep_target
is still NULL so that we’ll exit ConditionVariableTimedSleep() due to:
"
if (cv != cv_sleep_target)
done = true;
/* We were signaled, so return */
if (done)
return false;
"
Later we’ll want to add our process to a wait list, calling ConditionVariablePrepareToSleep()
that will not call ConditionVariableCancelSleep() due to:
“
if (cv_sleep_target != NULL)
ConditionVariableCancelSleep();
“
As cv_sleep_target is still NULL.
Finally calling proclist_push_tail() that will trigger the failed assertion.
The proposed fix attached is done in ConditionVariableTimedSleep() as this is the
place that introduces the race condition. It re-assigns cv_sleep_target to cv
and then ensures that cv_sleep_target accurately describes which condition
variable we’re prepared to wait on.
Looking forward to your feedback,
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com