Thread: UTC4115FATAL: the database system is in recovery mode

UTC4115FATAL: the database system is in recovery mode

From

Mathew Samuel

Date:

27 May 2011, 18:32:30

Hi,

I see the following error as found in pg.log:
UTC4115FATAL:  the database system is in recovery mode

Actually that message was logged repeatedly for about 4 hours according to =
the logs (I don't have access to the system itself, just the logs).

Leading up to that error were the following in pg.log:
2011-03-28 10:44:06 UTC3609LOG:  checkpoints are occurring too frequently (=
11 seconds apart)
2011-03-28 10:44:06 UTC3609HINT:  Consider increasing the configuration par=
ameter "checkpoint_segments".
2011-03-28 10:44:18 UTC3609LOG:  checkpoints are occurring too frequently (=
12 seconds apart)
2011-03-28 10:44:18 UTC3609HINT:  Consider increasing the configuration par=
ameter "checkpoint_segments".
2011-03-28 10:44:28 UTC3609LOG:  checkpoints are occurring too frequently (=
10 seconds apart)
2011-03-28 10:44:28 UTC3609HINT:  Consider increasing the configuration par=
ameter "checkpoint_segments".
2011-03-28 10:44:38 UTC3609LOG:  checkpoints are occurring too frequently (=
10 seconds apart)
2011-03-28 10:44:38 UTC3609HINT:  Consider increasing the configuration par=
ameter "checkpoint_segments".
2011-03-28 10:44:42 UTC3932ERROR:  canceling statement due to statement tim=
eout
2011-03-28 10:44:42 UTC3932STATEMENT:  vacuum full analyze _zamboni.sl_log_1
2011-03-28 10:44:42 UTC3932PANIC:  cannot abort transaction 1827110275, it =
was already committed
2011-03-28 10:44:42 UTC3566LOG:  server process (PID 3932) was terminated b=
y signal 6
2011-03-28 10:44:42 UTC3566LOG:  terminating any other active server proces=
ses
2011-03-28 10:44:42 UTC13142WARNING:  terminating connection because of cra=
sh of another server process
2011-03-28 10:44:42 UTC13142DETAIL:  The postmaster has commanded this serv=
er process to roll back the current transaction and exit, because another s=
erver process exited abnormally and possibly corrupted shared memory.
2011-03-28 10:44:42 UTC13142HINT:  In a moment you should be able to reconn=
ect to the database and repeat your command.
2011-03-28 10:44:42 UTC29834WARNING:  terminating connection because of cra=
sh of another server process
2011-03-28 10:44:42 UTC29834DETAIL:  The postmaster has commanded this serv=
er process to roll back the current transaction and exit, because another s=
erver process exited abnormally and possibly corrupted shared memory.
2011-03-28 10:44:42 UTC29834HINT:  In a moment you should be able to reconn=
ect to the database and repeat your command.
2011-03-28 10:44:42 UTC3553WARNING:  terminating connection because of cras=
h of another server process
2011-03-28 10:44:42 UTC3553DETAIL:  The postmaster has commanded this serve=
r process to roll back the current transaction and exit, because another se=
rver process exited abnormally and possibly corrupted shared memory.
In fact those last 3 lines are repeated over and over again repeatedly unti=
l "UTC4115FATAL:  the database system is in recovery mode" is logged for 4 =
hours. At some point, 4 hours later of course, it appears that the system r=
ecovers.

The Checkpoints Settings in postgresql.conf are commented out so I guess th=
e defaults are being used:
#checkpoint_segments =3D 3                # in logfile segments, min 1, 16M=
B each
#checkpoint_timeout =3D 5min              # range 30s-1h
#checkpoint_warning =3D 30s               # 0 is off
That system where this was seen was using pgsql-8.2.6 on RHEL4.

Not sure if this is a known bug (or if it is a bug at all or something I ca=
n address using different configuration) but I thought I would post here fi=
rst if any one might be familiar with this issue and suggest a possible sol=
ution. Any ideas?

Cheers,
Matt

Re: UTC4115FATAL: the database system is in recovery mode

From

Robert Haas

Date:

02 June 2011, 18:29:49

On Fri, May 27, 2011 at 2:26 PM, Mathew Samuel
<Mathew.Samuel@entrust.com> wrote:
> I see the following error as found in pg.log:
> UTC4115FATAL:=A0 the database system is in recovery mode
>
> Actually that message was logged repeatedly for about 4 hours according to
> the logs (I don't have access to the system itself, just the logs).
>
> Leading up to that error were the following in pg.log:
> 2011-03-28 10:44:06 UTC3609LOG:=A0 checkpoints are occurring too frequent=
ly
> (11 seconds apart)
> 2011-03-28 10:44:06 UTC3609HINT:=A0 Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:18 UTC3609LOG:=A0 checkpoints are occurring too frequent=
ly
> (12 seconds apart)
> 2011-03-28 10:44:18 UTC3609HINT:=A0 Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:28 UTC3609LOG:=A0 checkpoints are occurring too frequent=
ly
> (10 seconds apart)
> 2011-03-28 10:44:28 UTC3609HINT:=A0 Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:38 UTC3609LOG:=A0 checkpoints are occurring too frequent=
ly
> (10 seconds apart)
> 2011-03-28 10:44:38 UTC3609HINT:=A0 Consider increasing the configuration
> parameter "checkpoint_segments".
> 2011-03-28 10:44:42 UTC3932ERROR:=A0 canceling statement due to statement
> timeout
> 2011-03-28 10:44:42 UTC3932STATEMENT:=A0 vacuum full analyze _zamboni.sl_=
log_1
> 2011-03-28 10:44:42 UTC3932PANIC:=A0 cannot abort transaction 1827110275,=
 it
> was already committed
> 2011-03-28 10:44:42 UTC3566LOG:=A0 server process (PID 3932) was terminat=
ed by
> signal 6
> 2011-03-28 10:44:42 UTC3566LOG:=A0 terminating any other active server
> processes
> 2011-03-28 10:44:42 UTC13142WARNING:=A0 terminating connection because of
> crash of another server process
> 2011-03-28 10:44:42 UTC13142DETAIL:=A0 The postmaster has commanded this
> server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory.
> 2011-03-28 10:44:42 UTC13142HINT:=A0 In a moment you should be able to
> reconnect to the database and repeat your command.
> 2011-03-28 10:44:42 UTC29834WARNING:=A0 terminating connection because of
> crash of another server process
> 2011-03-28 10:44:42 UTC29834DETAIL:=A0 The postmaster has commanded this
> server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory.
> 2011-03-28 10:44:42 UTC29834HINT:=A0 In a moment you should be able to
> reconnect to the database and repeat your command.
> 2011-03-28 10:44:42 UTC3553WARNING:=A0 terminating connection because of =
crash
> of another server process
> 2011-03-28 10:44:42 UTC3553DETAIL:=A0 The postmaster has commanded this s=
erver
> process to roll back the current transaction and exit, because another
> server process exited abnormally and possibly corrupted shared memory.
> In fact those last 3 lines are repeated over and over again repeatedly un=
til
> "UTC4115FATAL:=A0 the database system is in recovery mode" is logged for 4
> hours. At some point, 4 hours later of course, it appears that the system
> recovers.
>
> The Checkpoints Settings in postgresql.conf are commented out so I guess =
the
> defaults are being used:
> #checkpoint_segments =3D 3=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 #=
 in logfile segments, min 1, 16MB
> each
> #checkpoint_timeout =3D 5min=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 # ran=
ge 30s-1h
> #checkpoint_warning =3D 30s=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 # 0=
 is off
> That system where this was seen was using pgsql-8.2.6 on RHEL4.
>
> Not sure if this is a known bug (or if it is a bug at all or something I =
can
> address using different configuration) but I thought I would post here fi=
rst
> if any one might be familiar with this issue and suggest a possible
> solution. Any ideas?

Not really, although I've seen a few other crashes attributable to
trying to cancel VACUUM FULL.  As of PostgreSQL 9.0, VACUUM FULL has
been completely rewritten, so whatever bug is causing this problem is
almost certainly gone in more recent versions.  It would be nice to
track done and fix whatever is wrong in the older branches, but we'd
probably need a reproducible test case of some sort.  I don't think it
craps out like that EVERY time you interrupt it...

--=20
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: UTC4115FATAL: the database system is in recovery mode

From

Tom Lane

Date:

02 June 2011, 19:45:10

Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, May 27, 2011 at 2:26 PM, Mathew Samuel
> <Mathew.Samuel@entrust.com> wrote:
>> 2011-03-28 10:44:42 UTC3932PANIC:  cannot abort transaction 1827110275, it
>> was already committed
>>
>> Not sure if this is a known bug (or if it is a bug at all or something I can
>> address using different configuration) but I thought I would post here first
>> if any one might be familiar with this issue and suggest a possible
>> solution. Any ideas?

> Not really, although I've seen a few other crashes attributable to
> trying to cancel VACUUM FULL.  As of PostgreSQL 9.0, VACUUM FULL has
> been completely rewritten, so whatever bug is causing this problem is
> almost certainly gone in more recent versions.

That was in fact one of the primary motivations for the rewrite.

> It would be nice to
> track done and fix whatever is wrong in the older branches, but we'd
> probably need a reproducible test case of some sort.  I don't think it
> craps out like that EVERY time you interrupt it...

It does if you interrupt it after the midstream commit :-(.  See around
line 2800 in commands/vacuum.c in 8.4, for example, and note that the
lame HOLD_INTERRUPTS hack isn't in the version the OP is running.

            regards, tom lane

Re: UTC4115FATAL: the database system is in recovery mode

From

Craig Ringer

Date:

03 June 2011, 02:14:12

On 05/28/2011 02:26 AM, Mathew Samuel wrote:

> That system where this was seen was using pgsql-8.2.6 on RHEL4.

Upgrade. A workaround is in newer versions of 8.2 .

--
Craig Ringer