Thread: self-deadlock at FATAL exit of bootstrap process on read error

self-deadlock at FATAL exit of bootstrap process on read error

From
"Qingqing Zhou"
Date:
I encountered a situation where the server can't shut down when a bootstrap
process does ReadBuffer() but gets a read error. I guess the problem is
like this: the bootstrap process fails to read at this line:
   smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);

So it does a FATAL exit and shmem_exit() is called:
    while (--on_shmem_exit_index >= 0)
        (*on_shmem_exit_list[on_shmem_exit_index].function) (code,
            on_shmem_exit_list[on_shmem_exit_index].arg);
Where
    on_shmem_exit_list[0] = DummyProcKill
    on_shmem_exit_list[1] = AtProcExit_Buffers

The callbacks above are called in stack (last-in, first-out) order, so
AtProcExit_Buffers() runs first and calls AbortBufferIO(), which blocks on
the "io_in_progress_lock" that the process itself still holds. This
contradicts the comment there, which says "since LWLockReleaseAll has
already been called, we're not holding the buffer's io_in_progress_lock".
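
The ordering described above can be illustrated with a toy model. This is
only a sketch and not PostgreSQL code: every name in it (register_exit,
toy_proc_kill, toy_buffers_exit, io_lock_held, and so on) is hypothetical,
the lock is a plain flag, and the self-deadlock is recorded in a variable
rather than actually hanging. It reuses only the loop shape quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name below is hypothetical, not PostgreSQL's. */
#define MAX_CALLBACKS 8

typedef void (*exit_fn) (int code, void *arg);

typedef enum { RAN_PROC_KILL, RAN_BUFFERS_EXIT } ran_tag;

static struct { exit_fn function; void *arg; } exit_list[MAX_CALLBACKS];
static int  exit_index = 0;

static ran_tag run_order[MAX_CALLBACKS];    /* which callback ran, in order */
static int  run_count = 0;

static bool io_lock_held = false;           /* toy non-reentrant lock */
static bool self_deadlock_seen = false;

static void
register_exit(exit_fn function, void *arg)
{
    assert(exit_index < MAX_CALLBACKS);
    exit_list[exit_index].function = function;
    exit_list[exit_index].arg = arg;
    exit_index++;
}

static void
toy_proc_kill(int code, void *arg)
{
    (void) code;
    (void) arg;
    run_order[run_count++] = RAN_PROC_KILL;
}

static void
toy_buffers_exit(int code, void *arg)
{
    (void) code;
    (void) arg;
    run_order[run_count++] = RAN_BUFFERS_EXIT;
    /* A blocking acquire of a lock we already hold would never return;
     * the toy records that instead of hanging. */
    if (io_lock_held)
        self_deadlock_seen = true;
}

/* Same loop shape as quoted above: the last registered callback runs first. */
static void
run_exit_callbacks(int code)
{
    while (--exit_index >= 0)
        (*exit_list[exit_index].function) (code, exit_list[exit_index].arg);
    exit_index = 0;
}

/* Simulate: the read fails while the I/O lock is held, then a FATAL exit. */
static bool
simulate_failed_read_exit(void)
{
    exit_index = 0;
    run_count = 0;
    self_deadlock_seen = false;
    register_exit(toy_proc_kill, NULL);     /* slot 0 */
    register_exit(toy_buffers_exit, NULL);  /* slot 1 */
    io_lock_held = true;                    /* taken before the read */
    run_exit_callbacks(1);
    return self_deadlock_seen;
}
```

Calling simulate_failed_read_exit() from a main() reports the would-be
self-deadlock, and run_order shows the buffer cleanup running before the
proc-kill stand-in.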

There may be other similar problems like this for the bootstrap process, so
I am not sure what the best fix is ...

Regards,
Qingqing




Re: self-deadlock at FATAL exit of bootstrap process on read error

From
Tom Lane
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> I encountered a situation where the server can't shut down when a bootstrap
> process does ReadBuffer() but gets a read error.

Hm, AtProcExit_Buffers is assuming that we've done AbortTransaction,
but the WAL-replay process doesn't do that because it's not running a
transaction.  Seems like we need to stack another on-proc-exit function
to do the appropriate subset of AbortTransaction ... LWLockReleaseAll at
least, not sure what else.
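
The idea of stacking one more exit function can be sketched with a toy
model. This is an illustration under stated assumptions, not the patch that
was later applied (the thread does not show it): all names here
(register_exit, toy_buffers_exit, toy_release_all_locks, locks_held) are
hypothetical, and the lock manager is reduced to a counter. Because exit
callbacks run last-registered-first, a release callback registered after
the buffer cleanup runs before it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name below is hypothetical, not PostgreSQL's. */
#define MAX_CALLBACKS 8

typedef void (*exit_fn) (int code, void *arg);

static struct { exit_fn function; void *arg; } exit_list[MAX_CALLBACKS];
static int  exit_index = 0;

static int  locks_held = 0;             /* toy count of held locks */
static bool cleanup_blocked = false;

static void
register_exit(exit_fn function, void *arg)
{
    assert(exit_index < MAX_CALLBACKS);
    exit_list[exit_index].function = function;
    exit_list[exit_index].arg = arg;
    exit_index++;
}

/* Stand-in for buffer cleanup that assumes no locks are still held. */
static void
toy_buffers_exit(int code, void *arg)
{
    (void) code;
    (void) arg;
    if (locks_held > 0)
        cleanup_blocked = true;         /* would hang in a real system */
}

/* Stand-in for the "appropriate subset of AbortTransaction". */
static void
toy_release_all_locks(int code, void *arg)
{
    (void) code;
    (void) arg;
    locks_held = 0;
}

static void
run_exit_callbacks(int code)
{
    while (--exit_index >= 0)
        (*exit_list[exit_index].function) (code, exit_list[exit_index].arg);
    exit_index = 0;
}

/* Returns true if the buffer cleanup found a lock still held at exit. */
static bool
exit_blocks(bool stack_release_callback)
{
    exit_index = 0;
    cleanup_blocked = false;
    register_exit(toy_buffers_exit, NULL);
    if (stack_release_callback)
        register_exit(toy_release_all_locks, NULL); /* last in, first out */
    locks_held = 1;                     /* lock taken before the failure */
    run_exit_callbacks(1);
    return cleanup_blocked;
}
```

In this toy, exit_blocks(false) reports the blocked cleanup, while
exit_blocks(true) does not, since the locks are released before the buffer
cleanup runs.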

Do you have a test case to reproduce this problem?
        regards, tom lane


Re: self-deadlock at FATAL exit of bootstrap process on read error

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
>
> Do you have a test case to reproduce this problem?
>

According to the error message, the problem happens while reading
pg_database. I just tried plugging these lines into mdread():

+        /* pretend there is an error reading pg_database */
+        if (reln->smgr_rnode.relNode == 1262)
+        {
+                fprintf(stderr, "Ooops \n");
+                return false;
+        }
       v = _mdfd_getseg(reln, blocknum, false);

And with that, the problem reproduces.

Regards,
Qingqing





Re: self-deadlock at FATAL exit of bootstrap process on read error

From
Tom Lane
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> wrote
>> Do you have a test case to reproduce this problem?

> According to the error message, the problem happens while reading
> pg_database. I just tried plugging these lines into mdread():

OK, patch applied for this.
        regards, tom lane