On Thu, 2009-06-25 at 17:02 +0300, Heikki Linnakangas wrote:
> I think the real problem is this in mdunlink():
>
> >     /* Register request to unlink first segment later */
> >     if (!isRedo && forkNum == MAIN_FORKNUM)
> >         register_unlink(rnode);
>
> When we replay the unlink of the relation, we don't tell bgwriter about
> it. Normally we do, so bgwriter knows that if the fsync() fails with
> ENOENT, it's ok since the file was deleted.
>
> It's tempting to just remove the "!isRedo" condition, but then we have
> another problem: if bgwriter hasn't been started yet, and the shmem
> queue is full, we get stuck in register_unlink() trying to send the
> message and failing.
>
> In archive recovery, we always start bgwriter at the beginning of WAL
> replay. In crash recovery, we don't start bgwriter until the end of WAL
> replay. So we could change the "!isRedo" condition to
> "!InArchiveRecovery". It's not a very clean solution, but it's simple.

That seems to work for me, though I have some doubts about the way
two-phase commit is coded. 2PC seems to assume that if a file still
exists, we must be in recovery and it's OK to ignore it.

Clean? We've changed the conditions under which the unlink needs to be
registered, and !InArchiveRecovery defines the changed conditions
precisely.

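
To make the cases concrete, here is a minimal standalone sketch of the
decision being discussed. It is not the md.c code and not a patch: the
function names, the reduced ForkNumber enum and the boolean parameters
are all hypothetical, and the predicate is written in terms of "can
bgwriter receive the request", which is an assumption about the intent
described above (bgwriter runs during normal operation and archive
recovery, but not during crash recovery) rather than the exact
expression proposed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the real fork number type. */
typedef enum { MAIN_FORKNUM = 0, OTHER_FORKNUM = 1 } ForkNumber;

/*
 * Assumption taken from the discussion: bgwriter is running during
 * normal operation and during archive recovery, but is not started
 * until the end of WAL replay in crash recovery.
 */
static bool
bgwriter_can_receive(bool isRedo, bool inArchiveRecovery)
{
    return !isRedo || inArchiveRecovery;
}

/*
 * Hypothetical helper: should the unlink of the first segment be
 * registered with bgwriter? Only the main fork is considered, as in
 * the quoted snippet.
 */
static bool
should_register_unlink(bool isRedo, bool inArchiveRecovery, ForkNumber forkNum)
{
    return forkNum == MAIN_FORKNUM &&
           bgwriter_can_receive(isRedo, inArchiveRecovery);
}

/* Walk through the scenarios mentioned in the thread. */
static void
check_cases(void)
{
    /* Normal operation: register, as today. */
    assert(should_register_unlink(false, false, MAIN_FORKNUM));
    /* Crash recovery: bgwriter not running, so do not register. */
    assert(!should_register_unlink(true, false, MAIN_FORKNUM));
    /* Archive recovery: bgwriter is running, so register. */
    assert(should_register_unlink(true, true, MAIN_FORKNUM));
    /* Non-main forks are never registered. */
    assert(!should_register_unlink(false, false, OTHER_FORKNUM));
}
```

Calling check_cases() exercises the three recovery scenarios plus the
non-main-fork case. Whether the real condition should be spelled this
way depends on how isRedo and InArchiveRecovery are actually set at
that point in the code.
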
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support