Thread: "PANIC: cannot make new WAL entries during recovery" in the wild

"PANIC: cannot make new WAL entries during recovery" in the wild

From
Alvaro Herrera
Date:
Today we got a report in the spanish list about the message in $subject.
The server is 8.4 running on Windows.

Any ideas?  I'm wondering what kind of diagnostics we can run to debug
the problem.  xlogdump perhaps?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Estimada Lista,

Tengo un servidor corriendo la version 8.4 para windows xp, atiende a
9 usuarios concurrentes todo sin problemas hasta hace unos minutos que
hubo una baja de tension electrica y provoco que mi servidor se
apagara bruscamente. El problema es que no puedo hacer que el servicio
levante, me arroja el siguiente log:

==========================
2009-08-05 11:58:19 COTLOG:  el sistema de bases de datos fue
interrumpido durante la recuperación en 2009-08-05 11:12:14 COT
2009-08-05 11:58:19 COTHINT:  Esto probablemente significa que algunos
datos están corruptos y tendrá que usar el respaldo más reciente para
la recuperación.
2009-08-05 11:58:19 COTLOG:  el sistema de bases de datos no fue
apagado apropiadamente; se está efectuando la recuperación automática
2009-08-05 11:58:19 COTLOG:  redo comienza en 0/75334970
2009-08-05 11:58:19 COTLOG:  la dirección de página 0/6D374000 en el
archivo de registro 0, segmento 117, posición 3620864 es inesperada
2009-08-05 11:58:19 COTLOG:  redo listo en 0/75370F68
2009-08-05 11:58:19 COTLOG:  última transacción completada al tiempo
de registro 2009-08-05 09:45:59.796-05
2009-08-05 11:58:19 COTPANIC:  cannot make new WAL entries during recovery


This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

2009-08-05 11:58:19 COTLOG:  proceso de inicio (PID 2636) terminó con
código de salida 3
2009-08-05 11:58:19 COTLOG:  abortando el inicio debido a una falla en
el procesamiento de inicio
==========================

Por favor si alguien puede ayudarme le estare muy agradecido.

Saludos,

Marcos Rios
Arequipa-Peru

--
TIP 2: puedes desuscribirte de todas las listas simultáneamente
    (envía "unregister TuDirecciónDeCorreo" a majordomo@postgresql.org)

Re: "PANIC: cannot make new WAL entries during recovery" in the wild

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> 2009-08-05 11:58:19 COTLOG:  el sistema de bases de datos fue
> interrumpido durante la recuperación en 2009-08-05 11:12:14 COT
> 2009-08-05 11:58:19 COTHINT:  Esto probablemente significa que
> algunos datos están corruptos y tendrá que usar el respaldo más
> reciente para la recuperación.
> 2009-08-05 11:58:19 COTLOG:  el sistema de bases de datos no fue
> apagado apropiadamente; se está efectuando la recuperación
> automática
> 2009-08-05 11:58:19 COTLOG:  redo comienza en 0/75334970
> 2009-08-05 11:58:19 COTLOG:  la dirección de página 0/6D374000 en el
> archivo de registro 0, segmento 117, posición 3620864 es inesperada
> 2009-08-05 11:58:19 COTLOG:  redo listo en 0/75370F68
> 2009-08-05 11:58:19 COTLOG:  última transacción completada al tiempo
> de registro 2009-08-05 09:45:59.796-05
> 2009-08-05 11:58:19 COTPANIC:  cannot make new WAL entries during recovery

After adding %p to the log_line_prefix, it becomes clear that the
process calling XLogInsert here is the startup process.

This is new code in 8.4.  Is no one concerned about this?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: "PANIC: cannot make new WAL entries during recovery" in the wild

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> This is new code in 8.4.  Is no one concerned about this?

[ shrug... ]  It's uninvestigatable with only this amount of detail.
How about a test case, or at least a backtrace?
        regards, tom lane


Re: "PANIC: cannot make new WAL entries during recovery" in the wild

From
Greg Stark
Date:
2009/8/6 Alvaro Herrera <alvherre@commandprompt.com>:
> After adding %p to the log_line_prefix, it becomes clear that the
> process calling XLogInsert here is the startup process.
>
> This is new code in 8.4.  Is no one concerned about this?

Can you get a backtrace?

--
greg
http://mit.edu/~gsstark/resume.pdf


Re: "PANIC: cannot make new WAL entries during recovery" in the wild

From
Alvaro Herrera
Date:
Greg Stark wrote:
> 2009/8/6 Alvaro Herrera <alvherre@commandprompt.com>:
> > After adding %p to the log_line_prefix, it becomes clear that the
> > process calling XLogInsert here is the startup process.
> >
> > This is new code in 8.4.  Is no one concerned about this?
> 
> Can you get a backtrace?

I'll ask the user.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: "PANIC: cannot make new WAL entries during recovery" in the wild

From
Greg Stark
Date:
2009/8/6 Alvaro Herrera <alvherre@commandprompt.com>:
>> 2009-08-05 11:58:19 COTLOG:  la dirección de página 0/6D374000 en el
>> archivo de registro 0, segmento 117, posición 3620864 es inesperada
>

Incidentally, Google's translate gives me the impression that the
above message corresponds to:
if (!XLByteEQ(hdr->xlp_pageaddr, recaddr)){    ereport(emode,            (errmsg("unexpected pageaddr %X/%X in log file
%u,segment %u, offset %u",                    hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
readId,readSeg, readOff)));    return false;} 


I'm not sure what this indicates though.

--
greg
http://mit.edu/~gsstark/resume.pdf


Re: "PANIC: cannot make new WAL entries during recovery" in the wild

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> 2009/8/6 Alvaro Herrera <alvherre@commandprompt.com>:
> 2009-08-05 11:58:19 COTLOG: �la direcci�n de p�gina 0/6D374000 en el
> archivo de registro 0, segmento 117, posici�n 3620864 es inesperada
>> 

> Incidentally, Google's translate gives me the impression that the
> above message corresponds to:

>     if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
>     {
>         ereport(emode,
>                 (errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
>                         hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
>                         readId, readSeg, readOff)));
>         return false;
>     }

> I'm not sure what this indicates though.

It probably means nothing: this is a typical termination condition when
reading from a recycled WAL file.  You come to a page that doesn't have
the expected page address, because what it's got is whatever it had in
the WAL file's previous cycle of life.

The PANIC is evidently happening during post-WAL-scanning cleanup,
perhaps while we're trying to repair incomplete btree page splits
or some such.  But I don't want to speculate too much without a
stack trace.
        regards, tom lane


Re: "PANIC: cannot make new WAL entries during recovery" in the wild

From
Alvaro Herrera
Date:
Greg Stark wrote:
> 2009/8/6 Alvaro Herrera <alvherre@commandprompt.com>:
> >> 2009-08-05 11:58:19 COTLOG:  la dirección de página 0/6D374000 en el
> >> archivo de registro 0, segmento 117, posición 3620864 es inesperada
> >
> 
> Incidentally, Google's translate gives me the impression that the
> above message corresponds to:
> 
>     if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
>     {
>         ereport(emode,
>                 (errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
>                         hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
>                         readId, readSeg, readOff)));
>         return false;
>     }

Yes, that's correct.  (It would be great to have a system to
reverse-translate the messages based on the PO files ...)

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: "PANIC: cannot make new WAL entries during recovery" in the wild

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Today we got a report in the spanish list about the message in $subject.
> The server is 8.4 running on Windows.

I accidentally managed to reproduce this in HEAD just now, by kill -9'ing
a backend that was in the midst of a COPY IN operation (I was trying to
reproduce Neil Best's unrelated issue...)  The server log is

LOG:  server process (PID 23846) was terminated by signal 9
LOG:  terminating any other active server processes
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2009-08-07 11:27:36 EDT
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/1B9D7790
LOG:  unexpected pageaddr 0/1532E000 in log file 0, segment 28, offset 3334144
LOG:  redo done at 0/1C32D200
PANIC:  cannot make new WAL entries during recovery
LOG:  startup process (PID 23883) was terminated by signal 6
LOG:  aborting startup due to startup process failure

and the stack trace of the panic'd startup process looks like

#4  0x4b6e20 in errfinish (dummy=1) at elog.c:503
#5  0x4b86a0 in elog_finish (elevel=1073803952, fmt=0x7b0394b0 "") at elog.c:1142
#6  0x1f722c in XLogInsert (rmid=11 '\013', info=114 'r', rdata=0xc004d07c) at xlog.c:555
#7  0x1df290 in _bt_insertonpg (rel=0x4006cf28, buf=70, stack=0x3, itup=0x4006d150, newitemoff=38,
split_only_page=0)at nbtinsert.c:833
 
#8  0x1e0898 in _bt_insert_parent (rel=0x4006cf28, buf=304, rbuf=854, stack=0x7b03b9d8, is_root=0, is_only=0)   at
nbtinsert.c:1627
#9  0x1ef098 in btree_xlog_cleanup () at nbtxlog.c:927
#10 0x201c44 in StartupXLOG () at xlog.c:5767
#11 0x206134 in StartupProcessMain () at xlog.c:8034
#12 0x228d0c in AuxiliaryProcessMain (argc=2, argv=0x7b03b6d8) at bootstrap.c:433
#13 0x39bb68 in StartChildProcess (type=StartupProcess) at postmaster.c:4243

So that confirms my speculation that btree index cleanup is the source
of the message.  We have two basic approaches to dealing with it:

1. Decide that the check added to XLogInsert is wrong and take it out.

2. Arrange for some sort of explicit state transition between the
WAL-reading and cleanup phases of recovery, and make sure XLogInsert
knows about it.

Thoughts?
        regards, tom lane


Re: "PANIC: cannot make new WAL entries during recovery" in the wild

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> Today we got a report in the spanish list about the message in $subject.
>> The server is 8.4 running on Windows.
> 
> I accidentally managed to reproduce this in HEAD just now, by kill -9'ing
> a backend that was in the midst of a COPY IN operation (I was trying to
> reproduce Neil Best's unrelated issue...)  The server log is

You're lucky. I once tried to trigger the rm_cleanup() code with
repeated "killall -9 postmaster", but failed. IIRC I just put an abort()
at the right place in the end.

> So that confirms my speculation that btree index cleanup is the source
> of the message.  We have two basic approaches to dealing with it:
> 
> 1. Decide that the check added to XLogInsert is wrong and take it out.
> 
> 2. Arrange for some sort of explicit state transition between the
> WAL-reading and cleanup phases of recovery, and make sure XLogInsert
> knows about it.

I'd suggest we temporarily allow XLog insertion by calling
LocalSetXLogInsertAllowed() just before the rm_cleanup() loop, and reset
it with "LocalXLogInsertAllowed = -1" just after the loop. Like we do at
the startup checkpoint. The sanity check still feels very useful to me,
I'd hate to lose it.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com


Re: "PANIC: cannot make new WAL entries during recovery" in the wild

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> So that confirms my speculation that btree index cleanup is the source
>> of the message.  We have two basic approaches to dealing with it:
>> 
>> 1. Decide that the check added to XLogInsert is wrong and take it out.
>> 
>> 2. Arrange for some sort of explicit state transition between the
>> WAL-reading and cleanup phases of recovery, and make sure XLogInsert
>> knows about it.

> I'd suggest we temporarily allow XLog insertion by calling
> LocalSetXLogInsertAllowed() just before the rm_cleanup() loop, and reset
> it with "LocalXLogInsertAllowed = -1" just after the loop. Like we do at
> the startup checkpoint. The sanity check still feels very useful to me,
> I'd hate to lose it.

Yeah, that looks like a sane solution.  The disturbing thing is that
we didn't catch this sooner.
        regards, tom lane