Thread: FATAL: could not open relation pg_tblspc/491086/467369/491103: No such file or directory

Hello,

we found a bug while testing the latest version of Hot Standby. Then
we could reproduce it on the unpatched HEAD, so we are going to ignore
it in the next few days.

During a Warm Standby session using current HEAD I obtained the
following error on the standby node:

---8<------8<------8<------8<------8<------8<------8<------8<------8<---
2009-01-16 16:24:01 GMT[30678]LOG:  restored log file "0000000100000001000000C2" from archive
2009-01-16 16:24:01 GMT[30678]FATAL:  could not open relation pg_tblspc/491086/467369/491103: No such file or directory
2009-01-16 16:24:01 GMT[30678]CONTEXT:  writing block 1 of relation pg_tblspc/491086/467369/491103
	xlog redo checkpoint: redo 1/C2001AB8; tli 1; xid 0/89982; oid 491520; multi 1; offset 0; online
2009-01-16 16:24:01 GMT[30665]LOG:  startup process (PID 30678) exited with exit code 1
2009-01-16 16:24:01 GMT[30665]LOG:  aborting startup due to startup process failure
2009-01-16 16:24:01 GMT[30677]DEBUG:  logger shutting down
---8<------8<------8<------8<------8<------8<------8<------8<------8<---

After setting up the session, I started an endless loop of "make
installcheck" on the primary node; the error happened after 40 to 50
minutes.
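
For anyone trying to repeat this, a loop of that kind could be driven
by something like the sketch below. It is only an illustration: the
helper name run_repeatedly and its parameters are hypothetical, and it
assumes it is pointed at a configured PostgreSQL source tree with the
primary already running.

```python
import subprocess
import sys


def run_repeatedly(cmd, max_runs=None, cwd=None):
    """Run cmd over and over; stop at the first non-zero exit status.

    Returns the number of successful runs. With max_runs=None the loop
    is endless unless a run fails. (Hypothetical helper, for
    illustration only.)
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        result = subprocess.run(cmd, cwd=cwd)
        if result.returncode != 0:
            break
        runs += 1
        print("run %d ok" % runs, file=sys.stderr)
    return runs


# Example call (the source path is an assumption, adjust as needed):
# run_repeatedly(["make", "installcheck"], cwd="/path/to/postgresql")
```

Note that the loop runs on the primary, while the failure shows up on
the standby, so the standby's log has to be watched separately.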

At present I can't say exactly which test was responsible, but this
information should be obtainable by raising the debug level on the
primary and comparing WAL segment numbers while looking at both
logfiles. In any case, since the error was raised by the bgwriter when
running with the Hot Standby patch applied, it is likely to have
something to do with the guts of checkpointing.
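
To help with comparing WAL segment numbers between the two logfiles,
here is a small sketch that maps a WAL location such as the redo
pointer above to the segment file that contains it. It assumes the
default 16 MB segment size and the usual 24-hex-digit file name
(timeline, log id, segment); the function names are made up for this
example.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments


def parse_wal_filename(name):
    """Split a 24-hex-digit WAL file name into (timeline, log id, segment)."""
    if len(name) != 24:
        raise ValueError("expected 24 hex digits, got %r" % name)
    return (int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16))


def segment_for_location(location, timeline=1):
    """Return the WAL file name containing a location like '1/C2001AB8'."""
    log_id_hex, offset_hex = location.split("/")
    log_id = int(log_id_hex, 16)
    segment = int(offset_hex, 16) // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (timeline, log_id, segment)


# The redo pointer from the CONTEXT line falls in the segment that the
# standby had just restored from the archive:
print(segment_for_location("1/C2001AB8", timeline=1))
# prints 0000000100000001000000C2
```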

Best regards,
Dr. Gianni Ciolli - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
gianni.ciolli@2ndquadrant.it | www.2ndquadrant.it



Gianni Ciolli <gianni.ciolli@2ndquadrant.it> writes:
> we found a bug while testing the latest version of Hot Standby. Then
> we could reproduce it on the unpatched HEAD, so we are going to ignore
> it in the next few days.

You didn't actually say how to repeat it on unpatched HEAD.
        regards, tom lane


On Fri, Jan 16, 2009 at 06:39:11PM +0100, Gianni Ciolli wrote:
(...)
> During a Warm Standby session using current HEAD I obtained the
> following error on the standby node:

On Fri, Jan 16, 2009 at 12:56:59PM -0500, Tom Lane wrote:
> Gianni Ciolli <gianni.ciolli@2ndquadrant.it> writes:
> > we found a bug while testing the latest version of Hot Standby. Then
> > we could reproduce it on the unpatched HEAD, so we are going to ignore
> > it in the next few days.
> 
> You didn't actually say how to repeat it on unpatched HEAD.
> 
>             regards, tom lane

Sorry for the misunderstanding; I used "current HEAD" and "unpatched
HEAD" as synonyms.

The whole procedure that I described in that mail was carried out with
unpatched HEAD; the only mentions of Hot Standby are outside that
procedure.

Best regards,
Dr. Gianni Ciolli - 2ndQuadrant Italia
PostgreSQL Training, Services and Support
gianni.ciolli@2ndquadrant.it | www.2ndquadrant.it



On Fri, 2009-01-16 at 19:12 +0100, Gianni Ciolli wrote:
> On Fri, Jan 16, 2009 at 06:39:11PM +0100, Gianni Ciolli wrote:
> (...)
> > During a Warm Standby session using current HEAD I obtained the
> > following error on the standby node:

I think I understand the cause of these bugs in CVS HEAD now.

In various places in current HEAD we throw a checkpoint when we want to
be certain that all buffers have been flushed.

In recovery, a checkpoint isn't always a restartpoint for two reasons:
timing and rmgr state. This gives both a cause for the error and an
explanation of why it does not occur consistently. ISTM this could
likely affect previous releases as well.

We need to put some marker into WAL to allow the same actions to be
repeated in recovery. We can't just force these "correctness
checkpoints" to be restartpoints because they might be invalid, but we
can force CheckPointGuts() (or something less) without updating the
control file.
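
The hypothesis above can be illustrated with a toy model. This is not
PostgreSQL code: the ToyStandby class, the record kinds and the
flush_on_forced_checkpoint switch are all invented for the example,
and it assumes the primary relied on a checkpoint to flush buffers
before touching files directly.

```python
class ToyStandby:
    """Toy model of a standby's dirty buffers and files (illustration only)."""

    def __init__(self, flush_on_forced_checkpoint):
        self.files = {"rel_a"}
        self.dirty = set()  # relations that have unflushed blocks
        self.flush_on_forced_checkpoint = flush_on_forced_checkpoint

    def flush_all(self):
        # Writing a dirty block needs the underlying file to exist.
        for rel in sorted(self.dirty):
            if rel not in self.files:
                raise FileNotFoundError("could not open relation %s" % rel)
        self.dirty.clear()

    def replay(self, record):
        kind, arg = record
        if kind == "write":
            self.dirty.add(arg)
        elif kind == "checkpoint":
            # arg is True for a checkpoint issued purely to flush buffers.
            # Without the marker, the standby may skip it entirely.
            if arg and self.flush_on_forced_checkpoint:
                self.flush_all()
        elif kind == "remove_file":
            self.files.discard(arg)
        elif kind == "restartpoint":
            self.flush_all()


def replay_all(node, wal):
    for record in wal:
        node.replay(record)
    return node


wal = [
    ("write", "rel_a"),
    ("checkpoint", True),      # primary flushed everything here
    ("remove_file", "rel_a"),  # safe on the primary, buffers were clean
    ("restartpoint", None),    # a later restartpoint on the standby
]
```

Replaying this list with ToyStandby(False) fails at the restartpoint
with a "could not open relation" error, because the dirty block for
rel_a outlived its file; with ToyStandby(True) the flush happens at the
marked checkpoint and replay completes.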

With regard to various changes I have in motion, the CheckPointGuts()
would need to be executed in full before further WAL replay occurs, so
would need to be executed by the Startup process and not by the bgwriter
to ensure we performed the correct sequence of actions. 

CHECKPOINT_FORCE might be the right indicator of when to take special
action in recovery, not sure. Will look at this again later.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support



Simon Riggs wrote:
> In various places in current HEAD we throw a checkpoint when we want to
> be certain that all buffers have been flushed.
> 
> In recovery, a checkpoint isn't always a restartpoint for two reasons:
> timing and rmgr state. This gives both a cause for the error and an
> explanation of why it does not occur consistently. ISTM this could
> likely affect previous releases as well.

Were you able to narrow this down? Do you know exactly what command 
caused it? At least replay of CREATE DATABASE already calls 
FlushDatabaseBuffers(), but are we missing that from some other place?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


On Mon, 2009-01-26 at 09:48 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > In various places in current HEAD we throw a checkpoint when we want to
> > be certain that all buffers have been flushed.
> > 
> > In recovery, a checkpoint isn't always a restartpoint for two reasons:
> > timing and rmgr state. This gives both a cause for the error and an
> > explanation of why it does not occur consistently. ISTM this could
> > likely affect previous releases as well.
> 
> Were you able to narrow this down? Do you know exactly what command 
> caused it? 

We know it wasn't any specific command because it caused the bgwriter to
crash when the HS patch was applied. But no, I'm not looking at it yet,
not until we're done with HS.

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support